dbTalk Databases Forums  

Re: Importing ugly data

comp.databases.filemaker comp.databases.filemaker


Discuss Re: Importing ugly data in the comp.databases.filemaker forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Mike
 
Posts: n/a

Default Re: Importing ugly data - 01-02-2007 , 11:43 AM






<snip>

Quote:
I have a somewhat similar situation that I have to deal with routinely.
My approach:

Create a separate FileMaker file specifically to clean up the data,
then import from that into my main database. Call it Cleanup.fp7 for
the sake of discussion.

The first step is to get rid of the first rows of garbage in the input
file, so that Cleanup.fp7 is presented with a consistent set of fields
for the first record.

Then I import the data into Cleanup.fp7.

Cleanup.fp7 has a series of scripts that do several things:

Find all the records that contain garbage, based on suitable criteria,
and delete those.

Parse out names and other data fields into more rational structure,
using scripts with Text functions.

Normalize data to a consistent set of naming conventions, using Find
and Replace (for example, make Street, St. and St into a standard
abbreviation St)

And so on.

Before I could design the Cleanup.fp7 file I had to know the final form
for import of data into my main database, and I also had to know the
structure of the "good" data and the garbage in the input file. Armed
with that, I could define fields and scripts in Cleanup.fp7 to
accomplish the job. There was a lot of trial and error to get it right.
I did not import anything into my main database until i was confident
that the data was right in Cleanup.fp7.

Once I had it right, I could use Cleanup.fp7 over and over to clean up
sets of new data as they came to me.

Hope this helps.
Yes, it does, thx. I think that I may very well have to have either a
separate file or a separate table at least. If FMs text parsing is
robust enough, and I'm smart enough, then it might work.



Reply With Quote
  #2  
Old   
Mike
 
Posts: n/a

Default Re: Importing ugly data - 01-02-2007 , 11:49 AM






On 2007-01-02 08:48:30 -0500, "Bill Marriott" <wjm (AT) wjm (DOT) org> said:

Quote:
I'm surprised no one has mentioned the web viewer functionality yet.
This could eliminate much of the hassle of getting the data from the
web site. You can point the web viewer at the protected site and log in
manually. Or, you could figure out a way to include the credentials
within the URL, if possible.

Once the data you're interested in is showing in the web viewer, you
can click a button to grab the content of that page as the HTML source
text. The button would be something like:

Set Field [YourField, GetLayoutObjectAttribute ( "wv"; "content" )]

Where "wv" is the name you gave the web viewer object in layout mode
(using the Object Inspector palette), and YourField is the place where
you'd like to put the data.

Now with several clicks eliminated, you can direct your calculation
skills toward parsing the data. Without knowing the precise format of
the information it is very difficult for me to recommend a specific
strategy. But it is often easier than you think.

For example, if there is something unique about each non-garbage line,
you could use this characteristic in your formula to quicken the chase.
Perhaps there is always a <TD> in there? The HTML code looks more
complicated to human eyes, but it's often more helpful for parsing out
relevant bits of data.

If you have FileMaker 8.5 Advanced, you can write a spiffy recursive
function to parse out the content. If you have the non-Advanced
version, you can write a script.

And finally you'd use a script to transform the cleaned-up lines into
tidy new records in a table.
Thanks to all...

I had hoped to so something similar to using a perlmod to extract the
data, but I looked at the raw html and unfortunately it's not even a
table - it's just preformatted text with spaces.

Quote:
By the way, you say you're on a Mac, but that you have to log into IE?
Does the web site somehow not work with Safari?

Sorry, it must have been confusing. The system that I have to use to
extract the data is XP, then I will be (somehow) getting the html or
text files to my personal machine (Mac). Our company doesn't allow
connecting computers to their net, or even a hard drive, or installing
any software, etc. I will be emailing the files home.

Also I can't read the stuff from the web from my Mac, as the site is
only available from within the company. hence I'm stuck with reading a
file. More klunk than I really want.

Perhaps the approach to take is to simply test the assumption that the
preformatting is consistent, and use brute force to find the "rows" I
want to parse.



Reply With Quote
  #3  
Old   
Bill Marriott
 
Posts: n/a

Default Re: Importing ugly data - 01-02-2007 , 01:30 PM



The beauty of it is that you wouldn't need to install any software or
"connect" your computer to the network!!

The web viewer is web compatible... if they allow you to browse an outside
site like MSN or FileMaker.com, you can do this.

If you want proof-of-concept, send me an email and I'll give you a web link
to try.



"Mike" <exceptionsTakeMeOut (AT) earthlink (DOT) net> wrote

Quote:
On 2007-01-02 08:48:30 -0500, "Bill Marriott" <wjm (AT) wjm (DOT) org> said:

I'm surprised no one has mentioned the web viewer functionality yet. This
could eliminate much of the hassle of getting the data from the web site.
You can point the web viewer at the protected site and log in manually.
Or, you could figure out a way to include the credentials within the URL,
if possible.

Once the data you're interested in is showing in the web viewer, you can
click a button to grab the content of that page as the HTML source text.
The button would be something like:

Set Field [YourField, GetLayoutObjectAttribute ( "wv"; "content" )]

Where "wv" is the name you gave the web viewer object in layout mode
(using the Object Inspector palette), and YourField is the place where
you'd like to put the data.

Now with several clicks eliminated, you can direct your calculation
skills toward parsing the data. Without knowing the precise format of the
information it is very difficult for me to recommend a specific strategy.
But it is often easier than you think.

For example, if there is something unique about each non-garbage line,
you could use this characteristic in your formula to quicken the chase.
Perhaps there is always a <TD> in there? The HTML code looks more
complicated to human eyes, but it's often more helpful for parsing out
relevant bits of data.

If you have FileMaker 8.5 Advanced, you can write a spiffy recursive
function to parse out the content. If you have the non-Advanced version,
you can write a script.

And finally you'd use a script to transform the cleaned-up lines into
tidy new records in a table.

Thanks to all...

I had hoped to so something similar to using a perlmod to extract the
data, but I looked at the raw html and unfortunately it's not even a
table - it's just preformatted text with spaces.


By the way, you say you're on a Mac, but that you have to log into IE?
Does the web site somehow not work with Safari?


Sorry, it must have been confusing. The system that I have to use to
extract the data is XP, then I will be (somehow) getting the html or text
files to my personal machine (Mac). Our company doesn't allow connecting
computers to their net, or even a hard drive, or installing any software,
etc. I will be emailing the files home.

Also I can't read the stuff from the web from my Mac, as the site is only
available from within the company. hence I'm stuck with reading a file.
More klunk than I really want.

Perhaps the approach to take is to simply test the assumption that the
preformatting is consistent, and use brute force to find the "rows" I want
to parse.




Reply With Quote
  #4  
Old   
Bill Marriott
 
Posts: n/a

Default Re: Importing ugly data - 01-02-2007 , 03:17 PM



Actually, strike that ... forgot that while the Web Viewer is IWP
compatible, the methods for obtaining the content of it don't work reliably.

"Bill Marriott" <wjm (AT) wjm (DOT) org> wrote

Quote:
The beauty of it is that you wouldn't need to install any software or
"connect" your computer to the network!!

The web viewer is web compatible... if they allow you to browse an outside
site like MSN or FileMaker.com, you can do this.

If you want proof-of-concept, send me an email and I'll give you a web
link to try.



"Mike" <exceptionsTakeMeOut (AT) earthlink (DOT) net> wrote in message
news:2007010212500975249-exceptionsTakeMeOut (AT) earthlinknet (DOT) ..
On 2007-01-02 08:48:30 -0500, "Bill Marriott" <wjm (AT) wjm (DOT) org> said:

I'm surprised no one has mentioned the web viewer functionality yet.
This could eliminate much of the hassle of getting the data from the web
site. You can point the web viewer at the protected site and log in
manually. Or, you could figure out a way to include the credentials
within the URL, if possible.

Once the data you're interested in is showing in the web viewer, you can
click a button to grab the content of that page as the HTML source text.
The button would be something like:

Set Field [YourField, GetLayoutObjectAttribute ( "wv"; "content" )]

Where "wv" is the name you gave the web viewer object in layout mode
(using the Object Inspector palette), and YourField is the place where
you'd like to put the data.

Now with several clicks eliminated, you can direct your calculation
skills toward parsing the data. Without knowing the precise format of
the information it is very difficult for me to recommend a specific
strategy. But it is often easier than you think.

For example, if there is something unique about each non-garbage line,
you could use this characteristic in your formula to quicken the chase.
Perhaps there is always a <TD> in there? The HTML code looks more
complicated to human eyes, but it's often more helpful for parsing out
relevant bits of data.

If you have FileMaker 8.5 Advanced, you can write a spiffy recursive
function to parse out the content. If you have the non-Advanced version,
you can write a script.

And finally you'd use a script to transform the cleaned-up lines into
tidy new records in a table.

Thanks to all...

I had hoped to so something similar to using a perlmod to extract the
data, but I looked at the raw html and unfortunately it's not even a
table - it's just preformatted text with spaces.


By the way, you say you're on a Mac, but that you have to log into IE?
Does the web site somehow not work with Safari?


Sorry, it must have been confusing. The system that I have to use to
extract the data is XP, then I will be (somehow) getting the html or text
files to my personal machine (Mac). Our company doesn't allow connecting
computers to their net, or even a hard drive, or installing any software,
etc. I will be emailing the files home.

Also I can't read the stuff from the web from my Mac, as the site is only
available from within the company. hence I'm stuck with reading a file.
More klunk than I really want.

Perhaps the approach to take is to simply test the assumption that the
preformatting is consistent, and use brute force to find the "rows" I
want to parse.






Reply With Quote
  #5  
Old   
Bill Marriott
 
Posts: n/a

Default Re: Importing ugly data - 01-02-2007 , 10:37 PM



Mike,

After thinking about this a little bit, if I were in your situation, I'd
probably build a small standalone with Runtime Revolution
(http://www.runrev.com) and put it on a CD. This doesn't have to be
"installed" -- it would leave the hard disk of the host computer untouched.
I'd run the utility from the CD, grab the contents of the web page, and
email the parsed results to myself.

The reason I'd go with RunRev instead of FileMaker 8.5 is that it doesn't
sound like you have 8.5 Advanced (which would be required to make a
runtime), the CD would necessarily be write-protected (and FileMaker more or
less has to be able to write to disk), and you would have more facilities
for parsing the data from the web page. (Not that FileMaker couldn't do it,
it would just be easier.)

"Mike" <exceptionsTakeMeOut (AT) earthlink (DOT) net> wrote

Quote:
Sorry, it must have been confusing. The system that I have to use to
extract the data is XP, then I will be (somehow) getting the html or text
files to my personal machine (Mac). Our company doesn't allow connecting
computers to their net, or even a hard drive, or installing any software,
etc. I will be emailing the files home.

Also I can't read the stuff from the web from my Mac, as the site is only
available from within the company. hence I'm stuck with reading a file.
More klunk than I really want.

Perhaps the approach to take is to simply test the assumption that the
preformatting is consistent, and use brute force to find the "rows" I want
to parse.




Reply With Quote
  #6  
Old   
tomislav
 
Posts: n/a

Default Re: Importing ugly data - 01-03-2007 , 09:18 AM



(somehow) getting the html or
Quote:
text files to my personal machine (Mac)

I've had a similar case once. I was able to get html file as you have
mentioned. At home I opened Safari, opened desired file in Safari. Table
rows and columns were preserved. I copied content inside of the table
and pasted in Excel. After that import to FMP is easy. It will only work
if you have html file. (if is text file chang extension to .html and
open in Safari)


Reply With Quote
  #7  
Old   
manet
 
Posts: n/a

Default Re: Importing ugly data - 01-03-2007 , 05:08 PM



tomislav <tomislav (AT) htnet (DOT) hr> wrote:

Quote:
I've had a similar case once. I was able to get html file as you have
mentioned. At home I opened Safari, opened desired file in Safari. Table
rows and columns were preserved.
if you can, that seems a good solution.

and if you need to parse, you can do some things with combinations of
Applescript and FMP scripting

slow, but can be simple.

--
Philippe Manet


Reply With Quote
  #8  
Old   
Mike
 
Posts: n/a

Default Re: Importing ugly data - 01-07-2007 , 04:21 AM



On 2007-01-03 18:08:47 -0500, pmanet (AT) invivo (DOT) edu (manet) said:

Quote:
tomislav <tomislav (AT) htnet (DOT) hr> wrote:

I've had a similar case once. I was able to get html file as you have
mentioned. At home I opened Safari, opened desired file in Safari. Table
rows and columns were preserved.

if you can, that seems a good solution.

and if you need to parse, you can do some things with combinations of
Applescript and FMP scripting

slow, but can be simple.
Well I was all set to punt and use a perl script at home - meaning that
I had to review the (unreadable) perl of my youth.

The weirdness of my company meant that I would have to save the data,
email it home, run it thru the script, then import it into FM, then
export the results, and email them back to work! Nice! They're a little
paranoid.

Anyway, much to my chagrin...I sifted thru the first batches of data
(starting on 1/1) in Excel, and found that the task is pointless!
There's no real need to examine millions of financial log entries when
the results for every day balance out within a dollar or so.

Thus my poring over RegEx was a waste, but I did learn a lot of FM from
youse guys. Thanks...



Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.