dbTalk Databases Forums  

Simple disk-based sorting method

comp.database comp.database


Discuss Simple disk-based sorting method in the comp.database forum.



Reply
 
Thread Tools Search this Thread Display Modes
  #11  
Old   
Citizen Bob
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-16-2007 , 06:53 AM






On Sun, 15 Apr 2007 16:44:29 -0500, David Kelly <n4hhe (AT) yahoo (DOT) com>
wrote:

Quote:
I think what you describe is clumsy
You don't like the Radix Sort.

Quote:
and only barely works in the specific application you describe. Grow the
application and it will break as your solution is delicate.
I did grow the application. The flat file I was testing earlier has
only 20 days of data, whereas my collection thus far is 37 days and
grows each work day.

Therefore the full flat file has 684,981 records and a size of
38,828,711 bytes. The key has 9 decimal digits. The procedure has to
do 6,164,829 sorting operations total.

I am able to sort it in 63 seconds on a 2.4 GHz Celeron D with 512 MB
RAM running Win2K/SP4 DVM and a WD Caviar IDE drive. I was prepared to
accept as much as an hour, and it only takes 1 minute. That is not my
idea of "clumsy" or "only barely works".

I was able to get around negative numerical keys by offsetting them
with the minimum value, sorting and restoring them. If the keys were
alphanumeric, I would convert each character to its ACSII hex value
and expand the sort bins to 16 to accommodate hex numbers.

As I mentioned the thing that makes the Radix Sort so valueable,
besides being disk-file-based, is that its complexity is O(N). Other
sort methods in general are either O(N^2) or O(N log N).

Another thing I like is the simplicity of the Radix Sort. I am not a
database person, so for me this is an excursion into an area I have
not mastered. The fact that the Radix Sort is so easy to implement in
C is a blessing.

It is overkill for small sorts, for which I use the C Library function
"qsort". Now I have both ends of the spectrum covered.



--

"Prediction is very difficult, especially about the future."
--Niels Bohr


Reply With Quote
  #12  
Old   
David Kelly
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-16-2007 , 10:47 PM






Citizen Bob wrote:
Quote:
As I mentioned the thing that makes the Radix Sort so valueable,
besides being disk-file-based, is that its complexity is O(N). Other
sort methods in general are either O(N^2) or O(N log N).
I see nothing about the Radix sort which makes it "disk based."

Your solution is clumsy because it depends on your source data having
some sort of order and uniform distribution.

Quote:
Another thing I like is the simplicity of the Radix Sort. I am not a
database person, so for me this is an excursion into an area I have
not mastered. The fact that the Radix Sort is so easy to implement in
C is a blessing.
Should be several sort routines in your existing C libraries.

What I described wasn't so much as a sort algorithm but a means of
overflowing in temporary files. I said to sort in memory, but didn't
make any recommendations as to how one should sort in memory.

Quote:
It is overkill for small sorts, for which I use the C Library function
"qsort". Now I have both ends of the spectrum covered.
Most any in-memory sort will be faster than any disk-based sort.


Reply With Quote
  #13  
Old   
Citizen Bob
 
Posts: n/a

Default Re: Simple disk-based sorting method - 04-17-2007 , 08:09 AM



On Mon, 16 Apr 2007 22:47:08 -0500, David Kelly <n4hhe (AT) yahoo (DOT) com>
wrote:

Quote:
I see nothing about the Radix sort which makes it "disk based."
It was easy to adapt it to a disk based procedure. And the fact that
it is O9N) makes it ideal for disk based operations.

Quote:
Your solution is clumsy because it depends on your source data having
some sort of order and uniform distribution.
I specified the sort of order, namely that the key was purely numeric.

I do not understand your comment about "uniform distribution".

Quote:
Should be several sort routines in your existing C libraries.
I am using the standard library that comes with the Microsoft
Optimizing C Compiler. The only sort routine that exists in that
library is qsort.

Quote:
What I described wasn't so much as a sort algorithm but a means of
overflowing in temporary files. I said to sort in memory, but didn't
make any recommendations as to how one should sort in memory.
The Radix Sort can be implemented solely with disk files. There is no
need to build memory-based arrays.

Win2K builds a very large disk cache which means that a lot of the
disk-based operations are implemented in memory. By using the disk
file system as storage for the sort procedure, I am taking advantage
of this very large, efficient system.

Quote:
It is overkill for small sorts, for which I use the C Library function
"qsort". Now I have both ends of the spectrum covered.

Most any in-memory sort will be faster than any disk-based sort.
Does that include the overhead of creating a more complex procedure to
manage the memory-to-disk operations, which you have described as " a
means of overflowing in temporary files".

Sorting 38MB in 63 seconds it fast enough for my purposes. Going to
all the trouble of writing the procedures for "means of
overflowing in temporary files" is not worth any potential savings in
time. And I am not convinced it would actually reduce the time,
because the overhead could be excessive.




--

"Prediction is very difficult, especially about the future."
--Niels Bohr


Reply With Quote
  #14  
Old   
Paul Linehan
 
Posts: n/a

Default Re: Simple disk-based sorting method - 07-05-2007 , 11:24 AM






Citizen Bob wrote:

Quote:
I have a disk file which contains 370,260 records of average length 60
ASCII text characters. I need to sort them on the first 9 characters,
which are set off from the rest by a space.

Once sorted the first record looks like
100100100 08:35-08:39 4 0 0 1 -115 -1 75 -100 dd17.txt
and the last one looks like
500900900 12:41-13:25 44 1 885 0 0 1 925 -200 dd21.txt

What's wrong with importing it into Excel?



Paul...



Reply With Quote
  #15  
Old   
Lennart
 
Posts: n/a

Default Re: Simple disk-based sorting method - 07-05-2007 , 12:29 PM



Paul Linehan wrote:
Quote:
Citizen Bob wrote:

I have a disk file which contains 370,260 records of average length 60
ASCII text characters. I need to sort them on the first 9 characters,
which are set off from the rest by a space.

Once sorted the first record looks like
100100100 08:35-08:39 4 0 0 1 -115 -1 75 -100 dd17.txt
and the last one looks like
500900900 12:41-13:25 44 1 885 0 0 1 925 -200 dd21.txt


What's wrong with importing it into Excel?

What is the maximum number of rows in an XL sheet? Last time I checked
it was way less than 370,260.


/Lennart



Reply With Quote
  #16  
Old   
David Segall
 
Posts: n/a

Default Re: Simple disk-based sorting method - 07-06-2007 , 12:26 AM



spam (AT) uce (DOT) gov (Citizen Bob) wrote:

Quote:
I have a disk file which contains 370,260 records of average length 60
ASCII text characters. I need to sort them on the first 9 characters,
which are set off from the rest by a space.

Once sorted the first record looks like
100100100 08:35-08:39 4 0 0 1 -115 -1 75 -100 dd17.txt
and the last one looks like
500900900 12:41-13:25 44 1 885 0 0 1 925 -200 dd21.txt

There is not enough memory in my Win2K/SP4 system to implement the
traditional memory-based sorting algorithms, so I need to use a method
that is disk-file based. I have plenty of disk space.

I imagine that this method would open the input disk file and begin
reading records one at a time and performing some kind of sort
algorithm placing the result in a temporary output file. Once the
original file has been swept, the process starts over and runs again
and again for the required number of iterations until the entire list
is sorted in ascending order. I will implement this method in standard
DOS-based C (Microsoft Optimizing C Compiler "cl") using the Win2K
DVM.
I suggest you use the Microsoft supplied sort. You can obtain
instructions by typing HELP SORT at the command prompt.


Reply With Quote
  #17  
Old   
Paul Linehan
 
Posts: n/a

Default Re: Simple disk-based sorting method - 07-06-2007 , 05:05 AM






David Segall wrote:

Quote:
I suggest you use the Microsoft supplied sort. You can obtain
instructions by typing HELP SORT at the command prompt.


Or try downloading the bash shell at www.cygwin.com and
using the unix sort on windows.



Paul...



Reply With Quote
  #18  
Old   
Paul Linehan
 
Posts: n/a

Default Re: Simple disk-based sorting method - 07-06-2007 , 05:10 AM





Lennart wrote:

Quote:
What's wrong with importing it into Excel?

What is the maximum number of rows in an XL sheet? Last time I
checked it was way less than 370,260.

Oops - silly me! It's 65536.


www.cygwin.com could be a way of doing it. Or use perl
with cygwin. Or the Microsoftoft sort as suggested by another poster.



Paul...



Quote:
/Lennart


Reply With Quote
  #19  
Old   
David Segall
 
Posts: n/a

Default Re: Simple disk-based sorting method - 07-06-2007 , 01:21 PM



"Paul Linehan" <plinehan__A (AT) T__yahoo__D (DOT) OT__COM> wrote:

Quote:


David Segall wrote:

I suggest you use the Microsoft supplied sort. You can obtain
instructions by typing HELP SORT at the command prompt.



Or try downloading the bash shell at www.cygwin.com and
using the unix sort on windows.
Shame on you! That is not supplied by Microsoft. Everything you need
for shell level Unix is available free, and generally open source,
from Microsoft
<http://technet.microsoft.com/en-us/interopmigration/bb380242.aspx>.
It can be downloaded from
<http://www.microsoft.com/downloads/details.aspx?familyid=896c9688-601b-44f1-81a4-02878ff11778&displaylang=en>.


Reply With Quote
Reply




Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.