An ETL Strategy - Data Cleansing -
07-08-2004
, 09:39 AM
Please refer to my other post entitled An ETL Strategy - Data
Transformation. Once data has been transformed, it is passed on to the
cleansing routine. The following data cleansing methodology seems to work
for me. Points to note are that: 1. cleansing starts off at a very low
level which deals with the more physical aspects of the data, and 2. having
data in a relational format certainly helps. From there, cleansing proceeds
through 4 additional steps which gradually deal with the more logical
aspects of the data.
Step 1: basic record formation includes checking for the correct # of
fields, that each field is of the proper data type (integer, alphabetic,
decimal, etc.)
Step 2: constraint checking (including nulls and default values)
Step 3: data masking (data should "look" proper - e.g. zip code, telephone
numbers, etc.)
Step 4: simple / single field lookups (e.g. product code exists, customer
number exists, etc.)
Step 5: complex / multi field lookups (e.g. if zip code = xxx then state
must = yyy)
Steps 3, 4, and 5 are considered advanced data cleansing. Steps 1 and 2 are
considered basic data cleansing. Although you can code around this, all
cleansing steps should include lower steps. For instance, if you decide to
perform cleansing at level 3, then you should automatically cleanse at
levels 1 and 2. Exceptions produced during each cleansing stage must be
handled in other routines.
Comments are welcomed. |