dbTalk Databases Forums  

An ETL Strategy

comp.databases.olap comp.databases.olap


Discuss An ETL Strategy in the comp.databases.olap forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Domenico Discepola
 
Posts: n/a

Default An ETL Strategy - 07-08-2004 , 08:43 AM






I have developed the following ETL strategy for all my data warehouse loads.
I tend to follow, as closely as possible, the methodology developed by Ralph
Kimball. First, I distinguish between the 2 possible load types: full
refresh or incremental. The former truncates all existing data in the
destination table (dimension or fact) whereas the latter compares the
incoming data to the existing data, record by record, field by field.
Needless to say, a full refresh load takes much less time to process.
Terminology: a surrogate key is a meaningless integer that will be used as
the primary key in the dimension/fact table. All incoming production keys
will be replaced by a surrogate key. Type 1 and 2 slowly changing
dimensions are part of Ralph Kimball's methodology. Transformation and
cleansing routines will be discussed later. All comments are welcome.



Full Refresh Pseudocode



For each incoming record

transform the record

cleanse the fields

output to file

get next record

end



load output file to database





Incremental Load Pseudocode



For each incoming record

get lookup table

get maximum surrogate key from database

transform the record

cleanse the fields

determine if incoming record exists (using production key and
lookup table)

if incoming record exists

get existing matching record from database

determine fields that changed

if fields have changed

replace existing record in database with new changes (type 1 slowly changing
dimension) or insert new record in database (type 2 slowly changing
dimension)

end if

else

insert new record in database

end if

get next record

end










Reply With Quote
  #2  
Old   
Joerg Narr
 
Posts: n/a

Default Re: An ETL Strategy - 07-08-2004 , 01:04 PM






Hi Domenico,


Quote:
Needless to say, a full refresh load takes much less time to process.
This is wrong. Usually you decide which data to load by comparing a field in
the production table. This can be a time stamp or a flag (already loaded =
1, not loaded or changed = 0). So the amount being loaded is minimal.
Imagine a Telco company trying to fully import its data where the delta
alone is sometimes several hundred million of rows a day!

Joerg




Reply With Quote
  #3  
Old   
Domenico Discepola
 
Posts: n/a

Default Re: An ETL Strategy - 07-08-2004 , 02:08 PM




"Joerg Narr" <n_o_spa_mjoerg_narr (AT) hotmail (DOT) com> wrote

Quote:
Hi Domenico,


Needless to say, a full refresh load takes much less time to process.

This is wrong. Usually you decide which data to load by comparing a field
in
the production table. This can be a time stamp or a flag (already loaded =
1, not loaded or changed = 0). So the amount being loaded is minimal.
Imagine a Telco company trying to fully import its data where the delta
alone is sometimes several hundred million of rows a day!

Joerg


Unless I misunderstand, what you are talking about is would be handled using
the "incremental" load functionality. The initial load into a data
warehouse (using the initial export of data) is handled much quicker using
this full refresh method specifically because you do not have to compare
fields to see if they have changed.




Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.