dbTalk Databases Forums  

Some people use the SAS datastep for ETL. What do you think of that?

comp.databases.olap comp.databases.olap


Discuss Some people use the SAS datastep for ETL. What do you think of that? in the comp.databases.olap forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
irishhacker@gmail.com
 
Posts: n/a

Default Some people use the SAS datastep for ETL. What do you think of that? - 04-27-2007 , 06:43 PM






A few months back, while sparring with the SAS zealots at the
comp.soft-sys.sas , one of the SAS users said that he uses SAS ( the
SAS datastep in particular ) for ETL all the time, instead of using
SQL directly, or a GUI-based ETL tool. Does this make sense?

SAS is a specialized language, not GUI-based, and allows for some
procedural old-fashioned syntax choices that are not allowed with the
SQL SELECT statement.

There are many types of data transformation ( data munging, data
preparation, data cleaning, etc.). ETL is one type. Preparing data for
statistical analysis is another type of data transformation. I don't
have a background in ETL, and perhaps the required data transformation
tasks are not that complex for MOST ETL tasks ( but for ALL, that I
have a hard time believing ). If the data is relatively clean and
simple, a GUI-based tool makes more sense.

I do know from many years of experience that when preparing complex
and messy data for statistical analysis, a GUI-based tool is not the
way to go. Neither is the SQL programming language, because it has too
many built in restrictions - there are too many features that are not
allowed because it would violate the relational mathematical model.

Data transformation products that emphasize a non-SQL specialized
language include:
SAS
SPSS
PSPP ( www.gnu.org/software/pspp )
DAP ( directory.fsf.org/math/stats/Dap.html )
vilno ( code.google.com/p/vilno )

What about Kettle and KETL? These seem to use a GUI, they don't seem
to mention being focused on a programming language. So I am not so
sure about them.

The marketing departments for very expensive ETL products seem to want
people to believe that an ETL product that is focused on a programming
language instead of a GUI is not worth having. Um, that's marketing,
and I suspect that is not always true. (Sometimes it is though).

For some data transformation problems, but not all, the best choice
might be an all-purpose programming language, i.e. Perl, Python, or
Ruby. Everyone has a different name for data transformation, Perl-ers
call it data munging.


Reply With Quote
  #2  
Old   
Troels Arvin
 
Posts: n/a

Default Re: Some people use the SAS datastep for ETL. What do you think ofthat? - 04-29-2007 , 03:01 PM






On Fri, 27 Apr 2007 16:43:47 -0700, irishhacker wrote:
Quote:
A few months back, while sparring with the SAS zealots at the
comp.soft-sys.sas , one of the SAS users said that he uses SAS ( the SAS
datastep in particular ) for ETL all the time, instead of using SQL
directly, or a GUI-based ETL tool. Does this make sense?
At a SAS course I attended, the instructor pointed out that nowadays,
most use if SAS is _not_ really statistics-work, as one might expect.
Rather, SAS is used for ETL-stuff, and reporting.

Personally, I find that SAS may be a good ETL tool, albeit with some
limitations:
- SAS is expensive, and with no free test-only download option. So
with SAS, you need to somehow know SAS already (perhaps through
school) before you can know how it is to work with.
- SAS is the ugliest and most inconsistent programming language I've
ever seen. This may easily lead to "write-only" code, i.e. code
which is very hard to understand for others than the one who
wrote the code.

About GUI/non-GUI:
I found that GUI-based tools are often a pain with regards to
traceability: They often produce "code" which is full of GUI-related
nonsense which means that it's very hard to deduce what changed between
different versions of the application. E.g., I've been recently been
involved in an ETL-like task where Microsoft's SQL Server Integration
Services (SSIS) tool was chosen. I'm never going to touch SSIS again for
a number of reasons -- one being that SSIS's XML contained blobs of
strange binary-like code; and even the smalles change in the SSIS package
(like a changed file path) resulted in numerous strange and very hard-to-
read code differences. In other words, some of the -- seemingly very nice
-- GUI-based ETL tools may turn out to be a version control nightmare
because _important_ changes are hidden in heaps of gibberish "code"-
changes.

In my opinion is that you may get very solid solutions, simply using SQL.
Staging/temporary tables, views with INSTEAD-triggers, functions, and
procedures will get you very far. The last pieces of glue needed (e.g.
some functionality which downloads import-data via FTP) may easily be
performed by scripts, such as perl/python/shell, or SAS (if you have
relevant licensing); the script language should be chosen based on what's
popular in your organization.

--
Regards,
Troels Arvin <troels (AT) arvin (DOT) dk>
http://troels.arvin.dk/


Reply With Quote
  #3  
Old   
Seth Grimes
 
Posts: n/a

Default Re: Some people use the SAS datastep for ETL. What do you think ofthat? - 05-10-2007 , 04:12 AM



I wrote my first SAS program almost 20 years ago.

The statement that "SAS is the ugliest and most inconsistent programming
language I've ever seen" is spot on. As a language, it's glue that ties
together data transformations with calls to procedures. SAS long-ago
created Macro language to overcome 4GL deficiencies and the company built-in
SQL as well quite a few years ago, which added to the ugliness by injecting
obscurity (Macro) and duplicativeness (SQL).

BUT... SAS 4GL is capable of doing a VERY broad range of data
transformations, on input and on output, with a very rich metadata
management environment. I would have no qualms about doing ETL with SAS...
IF I WAS ALREADY WORKING MY CURRENT PROBLEM IN A SAS ENVIRONMENT.

By the way, nowadays I do my ETL with custom Python programs.

Seth


Troels Arvin wrote:
Quote:
On Fri, 27 Apr 2007 16:43:47 -0700, irishhacker wrote:
A few months back, while sparring with the SAS zealots at the
comp.soft-sys.sas , one of the SAS users said that he uses SAS ( the SAS
datastep in particular ) for ETL all the time, instead of using SQL
directly, or a GUI-based ETL tool. Does this make sense?

At a SAS course I attended, the instructor pointed out that nowadays,
most use if SAS is _not_ really statistics-work, as one might expect.
Rather, SAS is used for ETL-stuff, and reporting.

Personally, I find that SAS may be a good ETL tool, albeit with some
limitations:
- SAS is expensive, and with no free test-only download option. So
with SAS, you need to somehow know SAS already (perhaps through
school) before you can know how it is to work with.
- SAS is the ugliest and most inconsistent programming language I've
ever seen. This may easily lead to "write-only" code, i.e. code
which is very hard to understand for others than the one who
wrote the code.

About GUI/non-GUI:
I found that GUI-based tools are often a pain with regards to
traceability: They often produce "code" which is full of GUI-related
nonsense which means that it's very hard to deduce what changed between
different versions of the application. E.g., I've been recently been
involved in an ETL-like task where Microsoft's SQL Server Integration
Services (SSIS) tool was chosen. I'm never going to touch SSIS again for
a number of reasons -- one being that SSIS's XML contained blobs of
strange binary-like code; and even the smalles change in the SSIS package
(like a changed file path) resulted in numerous strange and very hard-to-
read code differences. In other words, some of the -- seemingly very nice
-- GUI-based ETL tools may turn out to be a version control nightmare
because _important_ changes are hidden in heaps of gibberish "code"-
changes.

In my opinion is that you may get very solid solutions, simply using SQL.
Staging/temporary tables, views with INSTEAD-triggers, functions, and
procedures will get you very far. The last pieces of glue needed (e.g.
some functionality which downloads import-data via FTP) may easily be
performed by scripts, such as perl/python/shell, or SAS (if you have
relevant licensing); the script language should be chosen based on what's
popular in your organization.


--
Seth Grimes Alta Plana Corp, analytical computing & data management
Intelligent Enterprise magazine (CMP), Contributing Editor
grimes (AT) altaplana (DOT) com http://altaplana.com 301-270-0795


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.