dbTalk Databases Forums  

Reading of Large (20k+ lines) Flat File

comp.databases.pick


#11  Gene Buckle
Re: Reading of Large (20k+ lines) Flat File - 10-03-2011, 11:53 AM

To: Robert S. Lobel
Robert wrote:
Quote:
From Newsgroup: comp.databases.pick

We are using AIX/d3 7.4.
I saved a 25k row xls file to a tab delimited txt file and copied it to AIX,
converting the CR/LF to AM.
Even though the rows are less than 50 characters, it took approximately 20
seconds to extract a small amount of data (with field command) from each
line using non-dim'd read.
I thought this process could be done quicker.
Thanks, guys, for all of your responses.

You might want to look into using %fgets() to read the file one line at a time.

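e.g.:

* Reserve the buffer before passing it to a C function
* (assumes the D3 "char" statement is available)
char line[128]
filvar = (char*)%fopen("/path/to/file.csv", "r")
if filvar <> 0 then
   loop
      * %fgets() returns 0 (NULL) at end of file or on error
      retval = (char*)%fgets(line, 128, (char*)filvar)
   until retval = 0 do
      * process line here; it still ends with the newline that fgets() keeps
   repeat
   * close the stream when done (assumes %fclose() exists alongside %fopen())
   retval = %fclose((char*)filvar)
end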

128 is the size of the read buffer; fgets() reads at most one character fewer
than that per call. I'd set it to be a bit bigger than the longest individual
line in the file.

retval will become zero when you hit either an error or the end of the file.
I didn't find any reference to a builtin "%feof()" call, which would be the
'correct' way to properly detect the end of the file.
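
To show why the buffer should be a bit bigger than the longest line, here is a rough sketch in Python (not D3 BASIC). It assumes Python's readline(size) as a stand-in for fgets(): both stop at a newline or when the size limit is hit, whichever comes first. The function name read_lines and the sample data are made up for the illustration.

```python
import io


def read_lines(stream, bufsize):
    """Yield successive reads of at most bufsize - 1 characters, each
    stopping at a newline if one is found first, as fgets() does."""
    while True:
        chunk = stream.readline(bufsize - 1)
        if chunk == "":  # empty string means end of file (fgets() returns NULL)
            break
        yield chunk


# Two tab-delimited rows; the second is longer than a 16-byte buffer allows.
data = "A001\tWidget\t12\nA002\tA much longer description\t7\n"

# Buffer larger than the longest line: one read per line.
whole = list(read_lines(io.StringIO(data), 128))

# Buffer too small: the long line comes back in several pieces.
pieces = list(read_lines(io.StringIO(data), 16))

for line in whole:
    fields = line.rstrip("\n").split("\t")  # strip the newline the read keeps
    print(fields[0], fields[2])
```

With the small buffer nothing is lost, but one read no longer equals one line, so any per-line parsing would have to stitch the pieces back together first.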

g.


--
Proud owner of F-15C 80-0007
http://www.f15sim.com - The only one of its kind.
http://www.diy-cockpits.org/coll - Go Collimated or Go Home.
Some people collect things for a hobby. Geeks collect hobbies.

ScarletDME - The red hot Data Management Environment
A Multi-Value database for the masses, not the classes.
http://www.scarletdme.org - Get it _today_!

Political correctness is a doctrine, fostered by a delusional, illogical
minority, and rabidly promoted by an unscrupulous mainstream media, which
holds forth the proposition that it is entirely possible to pick up a turd
by the clean end.
--- Synchronet 3.15a-Win32 NewsLink 1.91
The Retro Archive - telnet://bbs.retroarchive.org

#12  x
Re: Reading of Large (20k+ lines) Flat File - 10-03-2011, 01:22 PM

I'm not sure if D3 has it but other systems have READSEQ / WRITESEQ
for this very situation.

Lucian

#13  Excalibur21
Re: Reading of Large (20k+ lines) Flat File - 10-04-2011, 04:36 PM

On Oct 3, 11:59 am, Tony Gravagno <tony_grava... (AT) nospam (DOT) invalid>
wrote:
Quote:
Excalibur21 wrote:
TL support assure me that 20k is a very short file. In fact I often use much larger items.

You needed TL support to tell you that?

Seriously, it's not just about line count, it's about how wide the
lines are too.

In D3 Windows the COPY Dos: followed by Count, DIM and Matparse is exceptionally easy and very quick

There's a number of things I'd take issue with there.

1) Why copy data into hashed space only to open and read it, when you
can just open and read it from OS space? You've virtually doubled the
processing time for no reason, especially if your hashed files are
only 4k (default) and you need to go through the pain of frame
linkage.
2) Why DIM then matparse into a dynamic array when you can just read
an item into a dynamic array?
3) Maybe you meant MatRead, in which case you can skip the count and
related Dim and just do this:
  DIM BLOCK()
  MATREAD BLOCK FROM ...
The MatRead will automatically dimension the block array for you.

For reference, I've used %read on files that are tens of gigabytes in
size with hundreds of thousands of wide lines, blocking with small
buffers in a manner similar to what I described in my other post on
this topic. (Be sure to convert (CR)LF to @AM)

The only real way to work out the "best" method of processing files
like this is to try a few methods and see if the performance is
reasonable for your specific application. You might find that a
simple Read with a ForNext through attributes is fine. You will
certainly find that with large strings, FlashBASIC is MUCH better than
non-flashed code, maybe good enough to preclude anything but the most
simple approach.

T

As usual a totally ill conceived rant.
When are you going to get your head around the DOS: operation, which
converts CRLF to AM at lightning speed?
Why use DIM? Because a dimensioned matrix operation is vastly
faster than an extract.
Why bring it into D3? To save it for permanent record and possible
further analysis, of course.
As for %open, that only became reliable in version 9.1. I had to
downgrade it in a hurry when it started bombing a client's user count.
Why specify the DIM? For clarity for future maintenance, plus one needs
the count to process the array.

As for 25 seconds to process the request, that is abysmal and indicates
something else is astray.
Peter McMurray

#14  Tony Gravagno
Re: Reading of Large (20k+ lines) Flat File - 10-05-2011, 02:31 AM

Excalibur21 wrote:

Quote:
As usual a totally ill conceived rant.

Your recommendation:

COPY DOS:/PATH/FILE.EXT
TO: (HASHED.FILE

OPEN "HASHED.FILE" TO FV ELSE STOP
READ ITEM FROM FV,"FILE.EXT" ELSE STOP
DIM ARRAY()
CT = DCOUNT(ITEM,@AM)
DIM ARRAY(CT)
MATPARSE ARRAY FROM ITEM


My recommendation:

OPEN "DOS:/PATH" TO FV ELSE STOP
DIM ARRAY()
MATREAD ARRAY FROM FV,"FILE.EXT" ELSE STOP


Your TCL command plus 6 line program just got reduced to 3 lines.
Now, if that doesn't work, say so, but "totally ill conceived rant"?


Quote:
When are you going to get your head around the
DOS: operation which converts CRLF to AM at
lightning speed.

Get my head around it? I use and recommend using OSFI all the time.
I'm the only one here who has tried for years to get PS/RD/TL to make
better use of it.

You didn't notice that my recommendation to "read it from OS space"
implied using DOS: as shown above.



Quote:
Why use DIM? The fact that a dimensioned matrix operation is vastly
faster than an extract

I didn't say "Why use DIM?". My sentence was longer. Read what I
wrote then look at my code. I'm happy to be corrected but please
correct something I actually said.


Quote:
Why bring it into D3 - to save it for permanent record
and possible further analysis of course.

Of course? You just changed the definition of the task and
invalidated most of the suggestions in this thread.

And pulling a large item into frame space could just create
unnecessary burden on the system, especially for file-saves. IF that
were a part of the task definition, you could consider just leaving it
in the host OS "for permanent record and possible further analysis"
... of course.


Quote:
As for %open that only became reliable in version 9.1. I had to
downgrade it in a hurry when it started bombing a clients user count.

If you've found an environment-specific bug, thanks for reporting it
here. But we're talking about how to use the technology. A release
or platform-specific issue doesn't negate the concept. "Only became
reliable in version 9.1"? Seriously? %open has been a part of the
system for over 15 years.

Oh yeah, you do realize that 9.1 doesn't exist yet, right?


Quote:
Why specify the DIM? for clarity for future maintenance plus one needs
the count to process the array.

Again, you didn't read the rest of my sentence, and no, you don't need
a count, see my code above.


Quote:
As for 25 seconds to process the request that
is abysmal and indicates something else astray.

Well, you're not responding to my post there. I think you're
responding to Rob. It sounds to me like the following factors are
affecting his performance:

1) Not flashed.
2) 20 seconds (not 25 according to Rob) seems to include item read
time, which should be separated from processing time.
3) Sequential indexing through 20 thousand attributes is always going to
be painful. Try the Delete(var,1) trick and always operate on atb1,
or use one of the other approaches discussed here.
4) Rather than an extract and multiple field statements on each line,
consider converting the delimiters (commas?) to system delimiters (@vm);
the pain of parsing might be reduced by referencing dynamic array
elements.

One could also read in the whole block, then maybe for/next through
the block to break it into something like 5 blocks of 4000 lines
(watching not to break lines). The pain of reaching out from atb1 to
4001, 4002, 4003 ... 19999... will be reduced because there will only
be 4000 attributes per block. The overhead of busting up the block is
trivial compared to parsing every single time through the entire
block.
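
To illustrate the blocking idea, here is a rough sketch in Python (not D3 BASIC). It assumes that extracting attribute N means scanning the item from the start and counting attribute marks, which is the behaviour being described. The helper names (extract, cost_indexed, split_blocks) and the sample data are made up for the illustration, and the numbers are a model of scanning cost, not D3 timings.

```python
AM = "\xfe"  # attribute mark, used here as the line delimiter


def extract(item, n):
    """Return (attribute n, characters scanned), scanning from the start of item."""
    pos = 0
    for _ in range(n - 1):
        nxt = item.find(AM, pos)
        if nxt < 0:
            return "", len(item)  # fewer than n attributes
        pos = nxt + 1
    end = item.find(AM, pos)
    if end < 0:
        end = len(item)
    return item[pos:end], end


def cost_indexed(item):
    """Total characters scanned when reading attributes 1..N by index."""
    count = item.count(AM) + 1
    return sum(extract(item, n)[1] for n in range(1, count + 1))


def split_blocks(item, lines_per_block):
    """Break the item into smaller items on attribute boundaries, never mid-line."""
    lines = item.split(AM)
    return [AM.join(lines[i:i + lines_per_block])
            for i in range(0, len(lines), lines_per_block)]


# 1000 short tab-delimited lines held in one item.
item = AM.join(f"row{i}\tvalue{i}" for i in range(1000))

full_cost = cost_indexed(item)
blocked_cost = sum(cost_indexed(b) for b in split_blocks(item, 200))
print(full_cost, blocked_cost)
```

Under this model the scanning work grows with the square of the attribute count, so five blocks of a fifth the size need roughly a fifth of the total scanning, on top of one cheap pass to split the item.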

There are so many ways to skin this cat. But citing any more could be
perceived as a totally ill conceived rant, as usual...

T

#15  Excalibur21
Re: Reading of Large (20k+ lines) Flat File - 10-09-2011, 10:28 PM

Hi
I like a good laugh, thanks Tony.
I make a simple comment re a simple question and you are off taking
issue without thinking. May I remind you that you are the one who said
that COPY DOS: knew nothing about CRLF to AM when I first raised it a
couple of years ago. It seems you have had a rethink.

COPY is an excellent way of having a quick look at a small file; the
chap did say 20k. I do this, and even use dump to double check things,
for one-off jobs.

The files I often deal with have headers and footers that need
balancing, so I need a count.
Some are produced by COBOL programmers and often have odd blank lines
or errors, so the Copy gives a quick and easy check with the option of
correcting it in the editor before charging on. Of course I use Open file
etc. for the regular jobs.

I did of course mean 9.01 was the first version that did not crash the
user count for optimised code in Windows, and yes, the error had been
there all along. Specifically, if a window was closed in Hostaccess
without being logged off, the user count was not reduced, eventually
leaving the machine locked up and needing a restore.

What made you assume that the files I am reading are on the same
server as the D3? In most cases they are not, so yes, I need to store
them. I put in my first credit card system in 1983 and there are a
number of reasons we have separate machines. Mobil decreed that their
system could not be linked into any network, so we used sneaker net to
take floppies from one to another. On one client we had 35 remote
sites, so we dedicated a workstation to the task; eventually the
workstation was not even in the same State as the server.

As for changing commas to delimiters, I find that a very strange
idea. The only safe way to do a comma delimited file is field by
field, as a field may contain a comma between quotes.
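
A small illustration of the quoted-comma problem, sketched in Python (not D3 BASIC) with its standard csv module. The sample line is made up. A blanket split on commas, which is what converting every comma to a delimiter amounts to, breaks the quoted field apart, while a field-by-field parser keeps it whole.

```python
import csv
import io

# One row whose middle field contains a comma inside quotes.
line = 'ACME,"Smith, John",42'

# Blanket conversion: every comma becomes a field boundary.
naive = line.split(",")

# Field-by-field parsing that honours the quotes.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split reports four fields where the row really has three, so every field after the quoted one lands in the wrong position.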

On the technical side, the length of the records has no direct effect
on the size of the dimensioned array. It is always a contiguous block
of n * 20 byte variables. The actual value referred to will be
contained within the 20 bytes if it is less than 11 bytes long;
otherwise it is stored like any other variable in general space, with a
pointer in the array position.
Have a good day
Peter McMurray

#16  Tony Gravagno
Re: Reading of Large (20k+ lines) Flat File - 10-09-2011, 10:44 PM

As usual you're missing what I am saying and refuting things I'm not
saying. OK, you're right. You always are.

>Have a good day
