dbTalk Databases Forums  

Simple disk-based sorting method

comp.database comp.database


Discuss Simple disk-based sorting method in the comp.database forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Citizen Bob
 
Posts: n/a

Default Simple disk-based sorting method - 04-13-2007 , 08:40 AM






I have a disk file which contains 370,260 records of average length 60
ASCII text characters. I need to sort them on the first 9 characters,
which are set off from the rest by a space.

Once sorted the first record looks like
100100100 08:35-08:39 4 0 0 1 -115 -1 75 -100 dd17.txt
and the last one looks like
500900900 12:41-13:25 44 1 885 0 0 1 925 -200 dd21.txt

There is not enough memory in my Win2K/SP4 system to implement the
traditional memory-based sorting algorithms, so I need to use a method
that is disk-file based. I have plenty of disk space.

I imagine that this method would open the input disk file and begin
reading records one at a time and performing some kind of sort
algorithm placing the result in a temporary output file. Once the
original file has been swept, the process starts over and runs again
and again for the required number of iterations until the entire list
is sorted in ascending order. I will implement this method in standard
DOS-based C (Microsoft Optimizing C Compiler "cl") using the Win2K
DVM.

Suggestions and references please.



--

"Nothing in the world can take the place of perseverance. Talent
will not; nothing is more common than unsuccessful men with talent.
Genius will not; unrewarded genius is almost a proverb. Education
will not; the world is full of educated derelicts. Persistence and
determination alone are omnipotent."
--Calvin Coolidge

Reply With Quote
  #2  
Old   
Last Boy Scout
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-14-2007 , 04:44 PM






Try using a bubble sort. It is not fast but it works. There are many
variations of sorting methods. Also look for sorting utilities. On IBM
mainframes we used a utility called DFSORT, which is an add-on sort
utility, which was very fast. Some sort utilities can take a bubble
sort and put it into 2 threads and sort alternetively from the bottom
and the top at the same time, using multi-threading. You might also
look at Java at the www.sun.com website and see if it has a sorting
utility that might be useful. One thing I know is that the IBM method
used a work file to do the sort and was extremely fast.

Another technique was to use a method that sorted first by quartiles by
placing the original records into a new file based on ranges using
something like a-i j-l m-r s-z or something like that. Then doing a
bubble sort. This makes the sorting faster by splitting up the work.

Look on the internet for Bubblesort and you should find links to other
sorting techniques.

Citizen Bob wrote:
Quote:
I have a disk file which contains 370,260 records of average length 60
ASCII text characters. I need to sort them on the first 9 characters,
which are set off from the rest by a space.

Once sorted the first record looks like
100100100 08:35-08:39 4 0 0 1 -115 -1 75 -100 dd17.txt
and the last one looks like
500900900 12:41-13:25 44 1 885 0 0 1 925 -200 dd21.txt

There is not enough memory in my Win2K/SP4 system to implement the
traditional memory-based sorting algorithms, so I need to use a method
that is disk-file based. I have plenty of disk space.

I imagine that this method would open the input disk file and begin
reading records one at a time and performing some kind of sort
algorithm placing the result in a temporary output file. Once the
original file has been swept, the process starts over and runs again
and again for the required number of iterations until the entire list
is sorted in ascending order. I will implement this method in standard
DOS-based C (Microsoft Optimizing C Compiler "cl") using the Win2K
DVM.

Suggestions and references please.



--

"Nothing in the world can take the place of perseverance. Talent
will not; nothing is more common than unsuccessful men with talent.
Genius will not; unrewarded genius is almost a proverb. Education
will not; the world is full of educated derelicts. Persistence and
determination alone are omnipotent."
--Calvin Coolidge

Reply With Quote
  #3  
Old   
mAsterdam
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-14-2007 , 08:11 PM



Quote:
Try using a bubble sort.
Ehrm no, don't. I sure the OP knows not to use it, but there are
others reading this. Bubble sort is a nice (because of easy
visualization) sort algorithm for beginners. Nothing more.


Reply With Quote
  #4  
Old   
David Kelly
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-14-2007 , 11:08 PM



Citizen Bob wrote:
Quote:
I imagine that this method would open the input disk file and begin
reading records one at a time and performing some kind of sort
algorithm placing the result in a temporary output file. Once the
original file has been swept, the process starts over and runs again
and again for the required number of iterations until the entire list
is sorted in ascending order. I will implement this method in standard
DOS-based C (Microsoft Optimizing C Compiler "cl") using the Win2K
DVM.
This is a classic Computer Science problem that has been solved many times.

An easy way to solve your problem is to just open the source file and
bite off as much as you can sort in memory.

Sort it. And write an output file.

Take another bite, sort, and write to a different output file.

Repeat until the entire original file has been bitten off and written to
smaller (but internally sorted) temporary files.

Now open all the temporary output files. Read the first record from
each. Write to your final output file the one record which sorts first.
Replenish it with the next record from the file it came from. Repeat
until all records in the temporary files have been copied to the output
file.

If you end up with 1,000 output files and can not open all at once, bite
off as many as you can, say 20, and merge them into a single file, and
another 20. And then merge the merged files, generation after
generation, until you are down to a single file.


Reply With Quote
  #5  
Old   
Citizen Bob
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-15-2007 , 02:41 AM



On Sun, 15 Apr 2007 02:11:03 +0200, mAsterdam <mAsterdam (AT) vrijdag (DOT) org>
wrote:

Quote:
Try using a bubble sort.

Ehrm no, don't. I sure the OP knows not to use it, but there are
others reading this. Bubble sort is a nice (because of easy
visualization) sort algorithm for beginners. Nothing more.
With O(N^2) complexity, the bubble sort is a poor candidate for what I
need to do, with my 370K records to sort.


--

"Nothing in the world can take the place of perseverance. Talent
will not; nothing is more common than unsuccessful men with talent.
Genius will not; unrewarded genius is almost a proverb. Education
will not; the world is full of educated derelicts. Persistence and
determination alone are omnipotent."
--Calvin Coolidge


Reply With Quote
  #6  
Old   
Citizen Bob
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-15-2007 , 02:47 AM



On Sat, 14 Apr 2007 15:44:01 -0500, Last Boy Scout
<BadBill (AT) whitehouse (DOT) gov> wrote:

Quote:
Try using a bubble sort. It is not fast but it works.
I have written C code for the bubble sort so I am familar with it. But
I still have the problem that I have limited memory so I would have to
implement it on disk files which is not all that difficult.

And there is the added problem that it grows in complexity as O(N^2),
whereas the Radix Sort grows as O(N). Actually the Radix Sort grows as
O(kN) where k is the key length. However it is still linear in the
number of records N. I was impressed with the speed of the disk-file
based Radix Sort.

Also I want to use the Radix Sort since I already have the code
written which means I can use it as a C function with variable
parameters.


--

"Nothing in the world can take the place of perseverance. Talent
will not; nothing is more common than unsuccessful men with talent.
Genius will not; unrewarded genius is almost a proverb. Education
will not; the world is full of educated derelicts. Persistence and
determination alone are omnipotent."
--Calvin Coolidge


Reply With Quote
  #7  
Old   
Citizen Bob
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-15-2007 , 02:47 AM



On Sat, 14 Apr 2007 22:08:10 -0500, David Kelly <n4hhe (AT) yahoo (DOT) com>
wrote:

Quote:
I imagine that this method would open the input disk file and begin
reading records one at a time and performing some kind of sort
algorithm placing the result in a temporary output file. Once the
original file has been swept, the process starts over and runs again
and again for the required number of iterations until the entire list
is sorted in ascending order. I will implement this method in standard
DOS-based C (Microsoft Optimizing C Compiler "cl") using the Win2K
DVM.

This is a classic Computer Science problem that has been solved many times.

An easy way to solve your problem is to just open the source file and
bite off as much as you can sort in memory.

Sort it. And write an output file.

Take another bite, sort, and write to a different output file.

Repeat until the entire original file has been bitten off and written to
smaller (but internally sorted) temporary files.

Now open all the temporary output files. Read the first record from
each. Write to your final output file the one record which sorts first.
Replenish it with the next record from the file it came from. Repeat
until all records in the temporary files have been copied to the output
file.

If you end up with 1,000 output files and can not open all at once, bite
off as many as you can, say 20, and merge them into a single file, and
another 20. And then merge the merged files, generation after
generation, until you are down to a single file.
I chose the Radix Sort and it works beautifully. I only have to open
11 files at a time - the input or the output file and 10 temporary
files. I discover the digit value in the key and use it as an index
into an array of FILE pointers.

FILE *fp_temp[10];
fprintf(fp_temp[key],"%s",record);

--

"Nothing in the world can take the place of perseverance. Talent
will not; nothing is more common than unsuccessful men with talent.
Genius will not; unrewarded genius is almost a proverb. Education
will not; the world is full of educated derelicts. Persistence and
determination alone are omnipotent."
--Calvin Coolidge


Reply With Quote
  #8  
Old   
David Kelly
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-15-2007 , 05:44 PM



Citizen Bob wrote:
Quote:
An easy way to solve your problem is to just open the source file and
bite off as much as you can sort in memory.

Sort it. And write an output file.

Take another bite, sort, and write to a different output file.

Repeat until the entire original file has been bitten off and written to
smaller (but internally sorted) temporary files.

Now open all the temporary output files. Read the first record from
each. Write to your final output file the one record which sorts first.
Replenish it with the next record from the file it came from. Repeat
until all records in the temporary files have been copied to the output
file.

If you end up with 1,000 output files and can not open all at once, bite
off as many as you can, say 20, and merge them into a single file, and
another 20. And then merge the merged files, generation after
generation, until you are down to a single file.

I chose the Radix Sort and it works beautifully. I only have to open
11 files at a time - the input or the output file and 10 temporary
files. I discover the digit value in the key and use it as an index
into an array of FILE pointers.

FILE *fp_temp[10];
fprintf(fp_temp[key],"%s",record);
Whatever floats your boat. But I think what you describe is clumsy and
only barely works in the specific application you describe. Grow the
application and it will break as your solution is delicate.

You are counting on an even distribution of the data which is already
too large to handle in memory (doesn't matter what you use to sort in
memory). You are counting on being able to break it into only 10
segments based on a pattern, and that none of those segments will be too
large to handle in memory.

Your first pass deals the data into 10 different files. Your 2nd pass
loads each file in memory to perform the actual sort. Then 3rd pass
concatenates the files into one.

What I suggested was to take as much as you can and sort it, then write
a temporary file. Repeat until all data is now in multiple temporary
files, the contents of each is sorted. Then open all of these files and
your first data record will be the first data record from one of those
files. Each temporary file is now ordered so your final sort only has to
compare the current record from each.

If you have more temporary files than you can open simultaneously then
the procedure can recurse, merging temporary files until a single file
remains.

If you like having names for things, splitting into multiple sorted
files and putting them back together is called a merge sort.


Reply With Quote
  #9  
Old   
Last Boy Scout
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-15-2007 , 10:03 PM



http://c2.com/cgi/wiki/quickDiff?CountingSort

David Kelly wrote:
Quote:
Citizen Bob wrote:

An easy way to solve your problem is to just open the source file and
bite off as much as you can sort in memory.

Sort it. And write an output file.

Take another bite, sort, and write to a different output file.

Repeat until the entire original file has been bitten off and written
to smaller (but internally sorted) temporary files.

Now open all the temporary output files. Read the first record from
each. Write to your final output file the one record which sorts
first. Replenish it with the next record from the file it came from.
Repeat until all records in the temporary files have been copied to
the output file.

If you end up with 1,000 output files and can not open all at once,
bite off as many as you can, say 20, and merge them into a single
file, and another 20. And then merge the merged files, generation
after generation, until you are down to a single file.

I chose the Radix Sort and it works beautifully. I only have to open
11 files at a time - the input or the output file and 10 temporary
files. I discover the digit value in the key and use it as an index
into an array of FILE pointers.

FILE *fp_temp[10];
fprintf(fp_temp[key],"%s",record);

Whatever floats your boat. But I think what you describe is clumsy and
only barely works in the specific application you describe. Grow the
application and it will break as your solution is delicate.

You are counting on an even distribution of the data which is already
too large to handle in memory (doesn't matter what you use to sort in
memory). You are counting on being able to break it into only 10
segments based on a pattern, and that none of those segments will be too
large to handle in memory.

Your first pass deals the data into 10 different files. Your 2nd pass
loads each file in memory to perform the actual sort. Then 3rd pass
concatenates the files into one.

What I suggested was to take as much as you can and sort it, then write
a temporary file. Repeat until all data is now in multiple temporary
files, the contents of each is sorted. Then open all of these files and
your first data record will be the first data record from one of those
files. Each temporary file is now ordered so your final sort only has to
compare the current record from each.

If you have more temporary files than you can open simultaneously then
the procedure can recurse, merging temporary files until a single file
remains.

If you like having names for things, splitting into multiple sorted
files and putting them back together is called a merge sort.

Reply With Quote
  #10  
Old   
Last Boy Scout
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-15-2007 , 10:06 PM



David Kelly wrote:
Quote:
Citizen Bob wrote:

An easy way to solve your problem is to just open the source file and
bite off as much as you can sort in memory.

Sort it. And write an output file.

Take another bite, sort, and write to a different output file.

Repeat until the entire original file has been bitten off and written
to smaller (but internally sorted) temporary files.

Now open all the temporary output files. Read the first record from
each. Write to your final output file the one record which sorts
first. Replenish it with the next record from the file it came from.
Repeat until all records in the temporary files have been copied to
the output file.

If you end up with 1,000 output files and can not open all at once,
bite off as many as you can, say 20, and merge them into a single
file, and another 20. And then merge the merged files, generation
after generation, until you are down to a single file.

I chose the Radix Sort and it works beautifully. I only have to open
11 files at a time - the input or the output file and 10 temporary
files. I discover the digit value in the key and use it as an index
into an array of FILE pointers.

FILE *fp_temp[10];
fprintf(fp_temp[key],"%s",record);

Whatever floats your boat. But I think what you describe is clumsy and
only barely works in the specific application you describe. Grow the
application and it will break as your solution is delicate.

You are counting on an even distribution of the data which is already
too large to handle in memory (doesn't matter what you use to sort in
memory). You are counting on being able to break it into only 10
segments based on a pattern, and that none of those segments will be too
large to handle in memory.

Your first pass deals the data into 10 different files. Your 2nd pass
loads each file in memory to perform the actual sort. Then 3rd pass
concatenates the files into one.

What I suggested was to take as much as you can and sort it, then write
a temporary file. Repeat until all data is now in multiple temporary
files, the contents of each is sorted. Then open all of these files and
your first data record will be the first data record from one of those
files. Each temporary file is now ordered so your final sort only has to
compare the current record from each.

If you have more temporary files than you can open simultaneously then
the procedure can recurse, merging temporary files until a single file
remains.

If you like having names for things, splitting into multiple sorted
files and putting them back together is called a merge sort.
The thing about sorting numbers is you can sort from highest to lowest
and just write the file and then append to the end of the file.


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.