dbTalk Databases Forums  

Simple disk-based sorting method

comp.databases comp.databases


Discuss Simple disk-based sorting method in the comp.databases forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Citizen Bob
 
Posts: n/a

Default Simple disk-based sorting method - 04-13-2007 , 07:40 AM






I have a disk file which contains 370,260 records of average length 60
ASCII text characters. I need to sort them on the first 9 characters,
which are set off from the rest by a space.

Once sorted the first record looks like
100100100 08:35-08:39 4 0 0 1 -115 -1 75 -100 dd17.txt
and the last one looks like
500900900 12:41-13:25 44 1 885 0 0 1 925 -200 dd21.txt

There is not enough memory in my Win2K/SP4 system to implement the
traditional memory-based sorting algorithms, so I need to use a method
that is disk-file based. I have plenty of disk space.

I imagine that this method would open the input disk file and begin
reading records one at a time and performing some kind of sort
algorithm placing the result in a temporary output file. Once the
original file has been swept, the process starts over and runs again
and again for the required number of iterations until the entire list
is sorted in ascending order. I will implement this method in standard
DOS-based C (Microsoft Optimizing C Compiler "cl") using the Win2K
DVM.

Suggestions and references please.



--

"Nothing in the world can take the place of perseverance. Talent
will not; nothing is more common than unsuccessful men with talent.
Genius will not; unrewarded genius is almost a proverb. Education
will not; the world is full of educated derelicts. Persistence and
determination alone are omnipotent."
--Calvin Coolidge

Reply With Quote
  #2  
Old   
Ed Prochak
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-13-2007 , 09:43 AM






On Apr 13, 8:40 am, s... (AT) uce (DOT) gov (Citizen Bob) wrote:
Quote:
I have a disk file which contains 370,260 records of average length 60
ASCII text characters. I need to sort them on the first 9 characters,
which are set off from the rest by a space.

Once sorted the first record looks like
100100100 08:35-08:39 4 0 0 1 -115 -1 75 -100 dd17.txt
and the last one looks like
500900900 12:41-13:25 44 1 885 0 0 1 925 -200 dd21.txt

There is not enough memory in my Win2K/SP4 system to implement the
traditional memory-based sorting algorithms, so I need to use a method
that is disk-file based. I have plenty of disk space.

I imagine that this method would open the input disk file and begin
reading records one at a time and performing some kind of sort
algorithm placing the result in a temporary output file. Once the
original file has been swept, the process starts over and runs again
and again for the required number of iterations until the entire list
is sorted in ascending order. I will implement this method in standard
DOS-based C (Microsoft Optimizing C Compiler "cl") using the Win2K
DVM.

Suggestions and references please.

--

"Nothing in the world can take the place of perseverance. Talent
will not; nothing is more common than unsuccessful men with talent.
Genius will not; unrewarded genius is almost a proverb. Education
will not; the world is full of educated derelicts. Persistence and
determination alone are omnipotent."
--Calvin Coolidge

I think a variation of Radix sort would work nicely. Any algorithms
textbook should discuss it. And the definitive reference is Knuth's
volume Sorting and Searching. This would be the comp.programming
answer, given your description of programming in C. (newsgroups are
arranged by topic for a reason.)

OTOH, you posted this in comp.databases so I'll now assume you want a
DB oriented answer. And that answer would be to pick a DBMS that runs
on your platform (WIN2K is pretty old), load the data into a table
with primary key (pkx) being those first 9 characters, then just let
the DBMS do the sorting when you issue a command like

SELECT * FROM tablex ORDER BY pkx ;

HTH,
ed




Reply With Quote
  #3  
Old   
Citizen Bob
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-13-2007 , 10:34 AM



On 13 Apr 2007 07:43:19 -0700, "Ed Prochak" <edprochak (AT) gmail (DOT) com>
wrote:

Quote:
There is not enough memory in my Win2K/SP4 system to implement the
traditional memory-based sorting algorithms, so I need to use a method
that is disk-file based. I have plenty of disk space.

I imagine that this method would open the input disk file and begin
reading records one at a time and performing some kind of sort
algorithm placing the result in a temporary output file. Once the
original file has been swept, the process starts over and runs again
and again for the required number of iterations until the entire list
is sorted in ascending order. I will implement this method in standard
DOS-based C (Microsoft Optimizing C Compiler "cl") using the Win2K
DVM.

I think a variation of Radix sort would work nicely.
I will take a look. There is an article in Wikipedia that gives
several methods for doing the Radix sort. Can you recommend a
particular method for my case. The order within a group of identical
keys is not important. The only thing that is important is that the
end result is sorted on the key in ascending order.

Without having studied it, my first guess is that I can use a MSD sort
which appears to require ten disk files, one for each possible digit
[0-9]. I place each record into one of the 10 files according to its
rank in the order [0-9]. Then I concatenate the files in order [0-9]
to get a file sorted on the MSD. Then I do the same thing again on the
next significant digit and work thru all digits of my key, which
amounts to 9 sweeps.

Does that sound right?

Quote:
your description of programming in C. (newsgroups are
arranged by topic for a reason.)
The use of C was incidental and not really relevant to my query. I
chose database newsgroups because database experts know about sorting
whereas C programmers do not necessarily know anything about sorting.
Once I understand the procedure, I can write the code myself.

Quote:
OTOH, you posted this in comp.databases so I'll now assume you want a
DB oriented answer.
I want to know how to implement the sort procedure itself so I can
write a program to do it using disk files instead of memory arrays.

Quote:
And that answer would be to pick a DBMS that runs
on your platform
That is what I am trying to avoid. I want to implement the procedure
myself.

Quote:
(WIN2K is pretty old),
"old" == "dependable", "reliable", "functional".

It works, which is more than can be said about the newer versions of
Windows.


--

"Nothing in the world can take the place of perseverance. Talent
will not; nothing is more common than unsuccessful men with talent.
Genius will not; unrewarded genius is almost a proverb. Education
will not; the world is full of educated derelicts. Persistence and
determination alone are omnipotent."
--Calvin Coolidge


Reply With Quote
  #4  
Old   
Gene Wirchenko
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-13-2007 , 11:52 PM



spam (AT) uce (DOT) gov (Citizen Bob) wrote:

Quote:
On 13 Apr 2007 07:43:19 -0700, "Ed Prochak" <edprochak (AT) gmail (DOT) com
wrote:
[snip]

Quote:
(WIN2K is pretty old),

"old" == "dependable", "reliable", "functional".

It works, which is more than can be said about the newer versions of
Windows.
from My Sig Collection:

legacy (adj) - A pejorative term used in the computer industry meaning
"it works."

Sincerely,

Gene Wirchenko

Computerese Irregular Verb Conjugation:
I have preferences.
You have biases.
He/She has prejudices.


Reply With Quote
  #5  
Old   
--CELKO--
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-14-2007 , 08:02 AM



Quote:
legacy (adj) - A pejorative term used in the computer industry meaning "it works."
... but nobody remembers how it works. I always thought we ought to
call them "family curse" instead



Reply With Quote
  #6  
Old   
Citizen Bob
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-14-2007 , 11:47 AM



On 13 Apr 2007 07:43:19 -0700, "Ed Prochak" <edprochak (AT) gmail (DOT) com>
wrote:

Quote:
I think a variation of Radix sort would work nicely.
I wrote the C program and it works beautifully. I ended up using the
LSD instead of the MSD because I wanted an ascending sort order.

I sort a total of 370,260 records with the key 9 digits long. The
disk-file-based program runs to completion in 30 seconds.

Now I want to sort another data set but this time the key consists of
a 4 digit number that can be either plus or minus and I want a
descending sort order.

Any suggestions on how to use the Radix Sort on that data set.


--

"Nothing in the world can take the place of perseverance. Talent
will not; nothing is more common than unsuccessful men with talent.
Genius will not; unrewarded genius is almost a proverb. Education
will not; the world is full of educated derelicts. Persistence and
determination alone are omnipotent."
--Calvin Coolidge


Reply With Quote
  #7  
Old   
Last Boy Scout
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-14-2007 , 03:44 PM



Try using a bubble sort. It is not fast but it works. There are many
variations of sorting methods. Also look for sorting utilities. On IBM
mainframes we used a utility called DFSORT, which is an add-on sort
utility, which was very fast. Some sort utilities can take a bubble
sort and put it into 2 threads and sort alternetively from the bottom
and the top at the same time, using multi-threading. You might also
look at Java at the www.sun.com website and see if it has a sorting
utility that might be useful. One thing I know is that the IBM method
used a work file to do the sort and was extremely fast.

Another technique was to use a method that sorted first by quartiles by
placing the original records into a new file based on ranges using
something like a-i j-l m-r s-z or something like that. Then doing a
bubble sort. This makes the sorting faster by splitting up the work.

Look on the internet for Bubblesort and you should find links to other
sorting techniques.

Citizen Bob wrote:
Quote:
I have a disk file which contains 370,260 records of average length 60
ASCII text characters. I need to sort them on the first 9 characters,
which are set off from the rest by a space.

Once sorted the first record looks like
100100100 08:35-08:39 4 0 0 1 -115 -1 75 -100 dd17.txt
and the last one looks like
500900900 12:41-13:25 44 1 885 0 0 1 925 -200 dd21.txt

There is not enough memory in my Win2K/SP4 system to implement the
traditional memory-based sorting algorithms, so I need to use a method
that is disk-file based. I have plenty of disk space.

I imagine that this method would open the input disk file and begin
reading records one at a time and performing some kind of sort
algorithm placing the result in a temporary output file. Once the
original file has been swept, the process starts over and runs again
and again for the required number of iterations until the entire list
is sorted in ascending order. I will implement this method in standard
DOS-based C (Microsoft Optimizing C Compiler "cl") using the Win2K
DVM.

Suggestions and references please.



--

"Nothing in the world can take the place of perseverance. Talent
will not; nothing is more common than unsuccessful men with talent.
Genius will not; unrewarded genius is almost a proverb. Education
will not; the world is full of educated derelicts. Persistence and
determination alone are omnipotent."
--Calvin Coolidge

Reply With Quote
  #8  
Old   
Bob Stearns
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-14-2007 , 04:04 PM



Citizen Bob wrote:
Quote:
On 13 Apr 2007 07:43:19 -0700, "Ed Prochak" <edprochak (AT) gmail (DOT) com
wrote:


I think a variation of Radix sort would work nicely.


I wrote the C program and it works beautifully. I ended up using the
LSD instead of the MSD because I wanted an ascending sort order.

I sort a total of 370,260 records with the key 9 digits long. The
disk-file-based program runs to completion in 30 seconds.

Now I want to sort another data set but this time the key consists of
a 4 digit number that can be either plus or minus and I want a
descending sort order.

Any suggestions on how to use the Radix Sort on that data set.


--

"Nothing in the world can take the place of perseverance. Talent
will not; nothing is more common than unsuccessful men with talent.
Genius will not; unrewarded genius is almost a proverb. Education
will not; the world is full of educated derelicts. Persistence and
determination alone are omnipotent."
--Calvin Coolidge
How about just using the whole number + 10000 to select the random
location in the disk dataset then reading the disk dataset sequentially,
ignoring any empty locations? If that won't work, then you don't have
unique keys.


Reply With Quote
  #9  
Old   
mAsterdam
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-14-2007 , 07:11 PM



Quote:
Try using a bubble sort.
Ehrm no, don't. I sure the OP knows not to use it, but there are
others reading this. Bubble sort is a nice (because of easy
visualization) sort algorithm for beginners. Nothing more.


Reply With Quote
  #10  
Old   
David Kelly
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-14-2007 , 10:08 PM



Citizen Bob wrote:
Quote:
I imagine that this method would open the input disk file and begin
reading records one at a time and performing some kind of sort
algorithm placing the result in a temporary output file. Once the
original file has been swept, the process starts over and runs again
and again for the required number of iterations until the entire list
is sorted in ascending order. I will implement this method in standard
DOS-based C (Microsoft Optimizing C Compiler "cl") using the Win2K
DVM.
This is a classic Computer Science problem that has been solved many times.

An easy way to solve your problem is to just open the source file and
bite off as much as you can sort in memory.

Sort it. And write an output file.

Take another bite, sort, and write to a different output file.

Repeat until the entire original file has been bitten off and written to
smaller (but internally sorted) temporary files.

Now open all the temporary output files. Read the first record from
each. Write to your final output file the one record which sorts first.
Replenish it with the next record from the file it came from. Repeat
until all records in the temporary files have been copied to the output
file.

If you end up with 1,000 output files and can not open all at once, bite
off as many as you can, say 20, and merge them into a single file, and
another 20. And then merge the merged files, generation after
generation, until you are down to a single file.


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.