dbTalk Databases Forums  

Reading of Large (20k+ lines) Flat File

comp.databases.pick


#1
Robert S. Lobel
Reading of Large (20k+ lines) Flat File - 10-01-2011, 02:32 PM

What is the best way to read very long flat files in D3?

Rob

#2
Kevin Powick
Re: Reading of Large (20k+ lines) Flat File - 10-01-2011, 05:43 PM

On 2011-10-01 15:32:59 -0400, "Robert S. Lobel" <RobertLobel (AT) COX (DOT) net> said:

Quote:
What is the best way to read very long flat files in D3?

Which version of D3?
Which O/S?
What is the format of the flat file (CSV)?
Do you intend to manipulate the data read in, or do you just plan to
write out a record per row?

Answers to the above may help to determine the "best" method.

--
Kevin Powick

#3
Scott Ballinger
Re: Reading of Large (20k+ lines) Flat File - 10-01-2011, 06:51 PM

Quote:
What is the best way to read very long flat files in D3?

Rob, I think you have two options:
1. %read - this requires that you parse your records out of the
readblock and manage the records that are split across two blocks.
2. qselect then readnext (presuming that your flat file records are
delimited with a CRLF or LF).

The first is the more efficient, as you only use one read block's worth
of D3 memory, but it requires more complex code. The second is dirt
simple, with the disadvantage that it first reads the entire file into
D3 memory. I like the second: machines are fast enough now that even
reading a multi-GB file this way does not kill the system, and I'm
lazy. But if your flat file is really, really huge, your D3 session
will die on you with an "out of memory" error and you will need to use
%read.

/Scott Ballinger
Pareto Corporation
Edmonds WA USA
206 713 6006
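
The split-record bookkeeping in option 1 can be sketched outside D3. This is a minimal Python illustration, not D3 code and not the %read API itself: the function name `iter_records` and the 4096-byte default block size are assumptions made for the example. It shows the general idea of reading fixed-size blocks and carrying an incomplete record over to the next block.

```python
import io

def iter_records(stream, block_size=4096):
    """Yield complete lines from a binary stream that is read in fixed-size blocks."""
    carry = b""                          # tail of the previous block with no newline yet
    while True:
        block = stream.read(block_size)
        if not block:
            break
        lines = (carry + block).split(b"\n")
        carry = lines.pop()              # last piece may be a record split across two blocks
        for line in lines:
            yield line.rstrip(b"\r")     # tolerate CRLF as well as LF
    if carry:
        yield carry.rstrip(b"\r")        # final record without a trailing newline

# Tiny block size on purpose, so every record straddles a block boundary.
sample = io.BytesIO(b"alpha\tone\r\nbeta\ttwo\ngamma\tthree")
records = list(iter_records(sample, block_size=7))
```

Only one block plus one partial record is held in memory at a time, which is the trade-off described above: less memory, more code.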

#4
Tony Gravagno
Re: Reading of Large (20k+ lines) Flat File - 10-02-2011, 03:04 AM

"Robert S. Lobel" wrote:
Quote:
What is the best way to read very long flat files in D3?

I haven't used this technique but I think it would be interesting to
try considering how often this question comes up...


See this URL:
http://www.fastechws.com/tricks/unix..._mid_files.php
That documents how to use the Linux Head and Tail commands to create a
Mid function which pulls lines from the middle of a host file. The
page says it's pretty performant. YMMV

In short:
tail -n +linenum filename | head -n numlines

Reduce that to:
mid startline numlines filename

So in BASIC:

SUBROUTINE MID(START,LINES,FILE,BLOCK)
* Run the host "mid" script and capture its output in BLOCK
EXECUTE "!mid ":START:" ":LINES:" ":FILE CAPTURING BLOCK
RETURN


And finally in your applications, get the data blocks without exposing
the platform-specific details:

START = 1
LINES = 20
FILE = "/tmp/bigfile.txt"
BLOCK = ""
LOOP
   CALL MID(START,LINES,FILE,BLOCK)
UNTIL BLOCK = "" DO
   CT = DCOUNT(BLOCK,@AM)
   FOR N = 1 TO CT
      * process a line
   NEXT N
   START += LINES
REPEAT

That allows you to change the MID function based on OS, or
DBMS-specific techniques. In other words, port from D3/Linux to
QM/Windows and your app code won't change but the one MID function
will.

If anyone tries this, please post here.

To do this with Windows, Google for "windows head tail".

HTH
T
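
The same "fetch a window of lines, process it, advance" loop can be sketched in Python. This is an illustration only, assuming a hypothetical `mid(start, count, path)` helper built on `itertools.islice` rather than on the tail/head pipeline; the file contents and the block size of 20 are made up for the demo.

```python
import itertools
import os
import tempfile

def mid(start, count, path):
    """Return up to `count` lines beginning at 1-based line `start`, newlines removed."""
    with open(path, "r") as handle:
        window = itertools.islice(handle, start - 1, start - 1 + count)
        return [line.rstrip("\n") for line in window]

def process_in_blocks(path, lines_per_block=20):
    """Call mid() until it returns an empty block; return the number of lines seen."""
    start, seen = 1, 0
    while True:
        block = mid(start, lines_per_block, path)
        if not block:
            break
        for _line in block:
            seen += 1                    # placeholder for real per-line processing
        start += lines_per_block
    return seen

# Demo on a throwaway 45-line file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"row {n}\n" for n in range(1, 46)))
demo_path = tmp.name
first_block = mid(1, 3, demo_path)
last_block = mid(44, 5, demo_path)
total = process_in_blocks(demo_path)
os.remove(demo_path)
```

As with `tail -n +linenum filename | head -n numlines`, each call re-reads the file from the top to reach its starting line, so later blocks cost more than earlier ones.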

#5
Frank Winans
Re: Reading of Large (20k+ lines) Flat File - 10-02-2011, 08:46 AM

"Tony Gravagno" wrote
Quote:
In short:
tail -n +linenum filename | head -n numlines

You could get the line count by parsing the output of wc -l filename

Or, instead of several tail commands, use just one
split -l 100 filename bob
to generate Linux files bobaa, bobab, bobac, ...
{that is, the lowercase of SPLIT -L 100 filename BOB }
of max 100 lines each.
Now SELECT unix:/tmp with a0 = "bob]"
Later you can delete them with rm bob??

Got more than 26 * 26 = 676 chunks (67,600 lines at 100 lines per
file)? You need a longer suffix. For example
split -l 100 -a 3 filename bob
to create bobaaa, bobaab, etc.
rm bob??? to nuke 'em later.
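
The suffix arithmetic can be checked with a short Python sketch. The function name `split_plan` is hypothetical; it assumes lowercase alphabetic suffixes (26 letters) and a starting suffix length of 2, as in the split examples above, and returns how many chunk files to expect and the smallest `-a` value that can name them all.

```python
import math

def split_plan(total_lines, lines_per_file=100, alphabet=26):
    """Return (chunk_count, suffix_length) for a split -l style chunking."""
    chunks = math.ceil(total_lines / lines_per_file)
    suffix_len = 2                       # two-letter suffixes: bobaa, bobab, ...
    while alphabet ** suffix_len < chunks:
        suffix_len += 1                  # each extra letter multiplies capacity by 26
    return chunks, suffix_len

plan_25k = split_plan(25000)             # a 25,000-line file in 100-line chunks
plan_limit = split_plan(67600)           # exactly fills the two-letter name space
plan_over = split_plan(67601)            # one line more needs a third letter
```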

#6
Excalibur21
Re: Reading of Large (20k+ lines) Flat File - 10-02-2011, 07:36 PM

On Oct 2, 6:32 am, "Robert S. Lobel" <RobertLo... (AT) COX (DOT) net> wrote:
Quote:
What is the best way to read very long flat files in D3?

Rob

Hi,
TL support assures me that 20k is a very short file. In fact, I often
use much larger items.
In D3 Windows, the COPY Dos: followed by Count, DIM and Matparse is
exceptionally easy and very quick.

#7
Tony Gravagno
Re: Reading of Large (20k+ lines) Flat File - 10-02-2011, 07:59 PM

Excalibur21 wrote:

Quote:
TL support assure me that 20k is a very short file. In fact I often use much larger items.

You needed TL support to tell you that?

Seriously, it's not just about line count, it's about how wide the
lines are too.

Quote:
In D3 Windows the COPY Dos: followed by Count, DIM and Matparse is exceptionally easy and very quick

There's a number of things I'd take issue with there.

1) Why copy data into hashed space only to open and read it, when you
can just open and read it from OS space? You've virtually doubled the
processing time for no reason, especially if your hashed files are
only 4k (default) and you need to go through the pain of frame
linkage.
2) Why DIM then matparse into a dynamic array when you can just read
an item into a dynamic array?
3) Maybe you meant MatRead, in which case you can skip the count and
related Dim and just do this:
DIM BLOCK()
MATREAD BLOCK FROM ...
The MatRead will automatically dimension the block array for you.


For reference, I've used %read on files that are tens of gigabytes in
size with hundreds of thousands of wide lines, blocking with small
buffers in a manner similar to what I described in my other post on
this topic. (Be sure to convert (CR)LF to @AM.)

The only real way to work out the "best" method of processing files
like this is to try a few methods and see if the performance is
reasonable for your specific application. You might find that a
simple Read with a ForNext through attributes is fine. You will
certainly find that with large strings, FlashBASIC is MUCH better than
non-flashed code, maybe good enough to preclude anything but the most
simple approach.

T
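
One possible reason a For/Next over attributes slows down on large items can be illustrated in Python. This sketch rests on an assumption that is not confirmed anywhere in this thread: that extracting attribute N from a dynamic array scans the string from the start each time. FlashBASIC or a given D3 release may well optimize this. The function names are hypothetical and the "cost" is simply a count of characters examined.

```python
AM = "\xfe"  # attribute mark, CHAR(254)

def extract_by_scan(record, n):
    """Return (attribute n, characters examined), scanning from the start each time."""
    pos = 0
    for _ in range(n - 1):
        nxt = record.find(AM, pos)
        if nxt < 0:
            return "", len(record)       # fewer than n attributes
        pos = nxt + 1
    end = record.find(AM, pos)
    if end < 0:
        end = len(record)
    return record[pos:end], end

def cost_of_indexed_loop(record):
    """Characters examined when every attribute is fetched by index."""
    count = record.count(AM) + 1
    return sum(extract_by_scan(record, n)[1] for n in range(1, count + 1))

def cost_of_single_pass(record):
    """Characters examined when the record is walked once, start to end."""
    return len(record)

record = AM.join(f"line{n}" for n in range(1, 1001))
indexed_cost = cost_of_indexed_loop(record)
single_cost = cost_of_single_pass(record)
```

Under that assumption the indexed loop grows roughly with the square of the item size, which is why approaches that hand over one line at a time tend to scale better.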

#8
Robert S. Lobel
Re: Reading of Large (20k+ lines) Flat File - 10-03-2011, 07:59 AM

We are using D3 7.4 on AIX.
I saved a 25k-row xls file as a tab-delimited txt file and copied it to AIX,
converting the CR/LF to AM.
Even though the rows are less than 50 characters, it took approximately 20
seconds to extract a small amount of data (with the field command) from each
line using a non-dim'd read.
I thought this process could be done quicker.
Thanks, guys, for all of your responses.

Rob

"Kevin Powick" <nospam (AT) spamless (DOT) com> wrote

Quote:
On 2011-10-01 15:32:59 -0400, "Robert S. Lobel" <RobertLobel (AT) COX (DOT) net> said:

What is the best way to read very long flat files in D3?

Which version of D3?
Which O/S?
What is the format of the flat file (CSV)?
Do you intend to manipulate the data read in, or do you just plan to write
out a record per row?

Answers to the above may help to determine the "best" method.

--
Kevin Powick

#9
Scott Ballinger
Re: Reading of Large (20k+ lines) Flat File - 10-03-2011, 10:24 AM

On Oct 3, 5:59 am, "Robert S. Lobel" <RobertLo... (AT) COX (DOT) net> wrote:
Quote:
I saved a 25k row xls file to a tab delimited txt file and copied it to AIX,
converting the CR/LF to AM.
Even though the rows are less than 50 characters, it took approximately 20
seconds to extract a small amount of data (with field command) from each
line using non-dim'd read.
I thought this process could be done quicker.

Rob, yes, that can definitely be done quicker:

1. From Excel save your 25000 row .xls as a tab delim file. Put it
somewhere like /tmp/myfile.txt. You might want to remove any column
headers.

2. From D3 TCL: qselect /tmp/ myfile.txt (note the space between /tmp/
and myfile.txt). You should get something like "[404] 25000 items
selected out of 1 items."

3. Make a simple basic program like this:

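tab = char(9)
cr = char(13)
loop
   readnext line else exit
   convert tab to @am in line   ;* tab-delimited columns become attributes
   convert cr to "" in line     ;* drop the CR left over from CRLF line ends
   data1 = line<1>
   data2 = line<2>
   * ...and so on for the remaining columns
repeat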


This should process hundreds, if not thousands, of lines per second.
The D3 qselect process will convert LFs to AMs as it reads the entire
myfile.txt into memory, and the readnext function will parse on the
AMs and feed each line to you one at a time.

I have used this technique to parse large (4K) flat file records out
of files with many millions of records. It is very fast. At some point
D3 will run out of memory if the file is too large, but in practice
you just don't come across files that large very often.

/Scott Ballinger
Pareto Corporation
Edmonds WA USA
206 713 6006
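
For readers without a D3 system at hand, the data flow of the qselect/readnext approach (whole file in memory, split on LF, strip CR, split each row on tabs) can be mimicked in Python. This is only an analogy, not D3 behavior: the function name `parse_flat_file` and the sample rows are made up, and skipping blank lines is an assumption of the sketch.

```python
def parse_flat_file(text):
    """Split text into rows on LF, drop CRs, and split each row on tabs."""
    rows = []
    for line in text.split("\n"):
        line = line.replace("\r", "")    # same effect as converting CR to ""
        if not line:
            continue                     # assumption: blank lines carry no data
        rows.append(line.split("\t"))    # tab-delimited columns become list items
    return rows

sample = "A100\tWidget\t12.50\r\nA200\tGadget\t7.25\r\n"
rows = parse_flat_file(sample)
first_code = rows[0][0]                  # the counterpart of line<1> for the first row
```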

#10
Scott Ballinger
Re: Reading of Large (20k+ lines) Flat File - 10-03-2011, 10:48 AM

One other thing: if you get the "overflow runaway condition detected!
Continue/Quit (C/Q)?" message, you will want to run "set-runaway-limit
100000" (or some other appropriately large number) before the qselect.
The default runaway limit is 7000 frames, I think, so you are likely to
get that message if myfile.txt is larger than 7000 x 4000 bytes = 28MB.

/Scott
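
The frame arithmetic can be written out as a small Python check. The 4000-byte frame figure and the 7000-frame default are taken from the post above, where the default is itself stated with an "I think"; both are treated here as assumptions, and the function names are hypothetical.

```python
import math

FRAME_BYTES = 4000       # bytes per frame, as used in the post's estimate
DEFAULT_LIMIT = 7000     # default runaway limit in frames, as recalled in the post

def frames_needed(file_bytes, frame_bytes=FRAME_BYTES):
    """Number of frames a file of the given size would occupy."""
    return math.ceil(file_bytes / frame_bytes)

def exceeds_limit(file_bytes, limit=DEFAULT_LIMIT):
    """True if the file would need more frames than the runaway limit allows."""
    return frames_needed(file_bytes) > limit

threshold_frames = frames_needed(28_000_000)     # the 28MB figure from the post
small_file_frames = frames_needed(25_000 * 50)   # 25k rows of about 50 characters
```

On these numbers a 25,000-row file of 50-character lines sits far below the limit, so the message should only matter for much larger imports.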
