dbTalk Databases Forums  

Hashing for DISTINCT or GROUP BY in SQL

comp.databases.theory comp.databases.theory


Discuss Hashing for DISTINCT or GROUP BY in SQL in the comp.databases.theory forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
-CELKO-
 
Posts: n/a

Default Hashing for DISTINCT or GROUP BY in SQL - 10-12-2010 , 11:35 AM






In the old days, a DISTINCT or GROUP BY in an SQL engine were done by
a sort. Back then we built early SQLs on top of existing file systems
and we had pretty good sort procedures in the library. How many kids
today have ever seen a polyphase merge sort on tape drives?

But today with parallel hardware and good hashing algorithms, would it
be faster to use hashing to cluster equivalent classes of data
together?

Obviously, if two data values are equal, they will have the same hash
for all hashing functions. But two different values can have the same
hash for any one hashing function.

Does there exist a set of hashing functions, H1(), H2(), .., Hn()
which will produce at least one different result for any pair of data
values?

Reply With Quote
  #2  
Old   
Erwin
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-12-2010 , 12:44 PM






On 12 okt, 18:35, -CELKO- <jcelko... (AT) earthlink (DOT) net> wrote:

Quote:
Does there exist a set of hashing functions, H1(), H2(), .., Hn()
which will produce at least one different result for any pair of data
values?
If you can explain to me how a _SET_ (of hashing functions) can
"PRODUCE" a result, then I might consider trying to think about this.

Reply With Quote
  #3  
Old   
Razvan Socol
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-12-2010 , 02:25 PM



I assume that the hashing functions are applied to the same input
(which is the entire row) and the hashes are somehow concatenated. As
long as the size of the combined hashes is less then the size of the
data row, there will always be the possibility to have two different
rows which have the same hash. The pair may be difficult to find, one
or both rows may be non-sensical, but they theoretically exist, so the
answer to the last question is no (at least no practical hashing
functions, that reduce the size of the input).

Regarding the other question, I think hashing would be faster than
sorting (on today's parallel hardware), but it needs to be accompanied
by a normal comparison afterwards, to produce the correct results.

Razvan

Reply With Quote
  #4  
Old   
Tegiri Nenashi
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-12-2010 , 03:08 PM



On Oct 12, 8:35*am, -CELKO- <jcelko... (AT) earthlink (DOT) net> wrote:
Quote:
In the old days, a DISTINCT or GROUP BY in an SQL engine were done by
a sort. Back then we built early SQLs on top of existing file systems
and we had pretty good sort procedures in the library. How many kids
today have ever seen a polyphase merge sort on tape drives?

But today with parallel hardware and good hashing algorithms, would it
be faster to use hashing to cluster equivalent classes of data
together?

Obviously, if two data values are equal, they will have the same hash
for all hashing functions. But two different values can have the same
hash for any one hashing function.

Does there exist a set of hashing functions, H1(), H2(), .., Hn()
which will produce at least one different result for any pair of data
values?
This days generics programming is ubiquitous. In C++ there is Standard
Templates Library, while in Java there is Collections Library. Either
way one usually declares a Set object, and not SortedSet, or HashSet.
This way one can easily switch from one implementation to the other,
and delegate hard concurrency and performance issues to masters like
Doug Lee. Speaking SQL, some of the "big 3" vendors still code in C,
but that is another story...

Reply With Quote
  #5  
Old   
Roy Hann
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-12-2010 , 03:08 PM



-CELKO- wrote:

Quote:
How many kids
today have ever seen a polyphase merge sort on tape drives?
I turned 50 last birthday and I've never seen one, and I live in
England.

--
Roy

Reply With Quote
  #6  
Old   
Roy Hann
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-12-2010 , 03:13 PM



Tegiri Nenashi wrote:

Quote:
Speaking SQL, some of the "big 3" vendors still code in C,
but that is another story...
C is less awful than a lot of the alternatives. At least is doesn't get
all knotted up over "objects", whatever they are.

--
Roy

Reply With Quote
  #7  
Old   
Cimode
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-12-2010 , 03:28 PM



On 12 oct, 19:44, Erwin <e.sm... (AT) myonline (DOT) be> wrote:
Quote:
On 12 okt, 18:35, -CELKO- <jcelko... (AT) earthlink (DOT) net> wrote:

Does there exist a set of hashing functions, H1(), H2(), .., Hn()
which will produce at least one different result for any pair of data
values?

If you can explain to me how a _SET_ (of hashing functions) can
"PRODUCE" a result, then I might consider trying to think about this.
Since the above SQL-hack attempt fundamentally ignores the necessity
for the sub system onto providing effectiveness in constraint
specialization and header definitions runtime implementations, it
remains nothing else than a dead end revisitation of OBJECTID.

The theory of *key-reduction* (for a lack of a better word) as whole
is simply a waste of time.

Reply With Quote
  #8  
Old   
Clifford Heath
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-13-2010 , 03:07 AM



-CELKO- wrote:
Quote:
But two different values can have the same
hash for any one hashing function.
So use the hash to reduce the number of key values you have to compare normally.

Quote:
Does there exist a set of hashing functions, H1(), H2(), .., Hn()
which will produce at least one different result for any pair of data
values?
A set of such hash functions, if you concatenate their values,
would be just one hash function. Unless the result contains all
the entropy that was in the parameters, you'll get collisions.
If it does contain all the entropy, there's no advantage in hashing.

I'm sure I've seen query plans back as far as MS SQL Server 2000 that
use a hash for grouping/distinct, but my memory may have failed me.

Clifford Heath.

Reply With Quote
  #9  
Old   
-CELKO-
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-13-2010 , 09:15 PM



Let me see if I can be clearer.

There is a proof that for KNOWN set, we can always find a perfect
hashing function; a stronger proof for a minimal perfect hashing
function also exists. The gimmick is that you have to know the set in
advance, which is fine for things like keywords in a programming
language. This makes a very fast symbol table. But in general they
are complicated and ugly; no more simple MOD() and shift stuff

But if I have a multi-set S of tuples {s1, s2, ..sn} that is not known
in advance, can I find a family of hash functions that I can apply
something like this, in parallel

1) Do Hash_1(S) => bucket[1], bucket[2] .., bucket[n]
2) keep all bucket[i] where MIN(s) = MAX(s); they have one value
3) Do Hash_2 (S) on buckets where MIN(s) <> MAX(s)
4) repeat until all buckets are single valued on the hash key s

The idea is to re-hash until all buckets have one value for the hash
key columns.

Reply With Quote
  #10  
Old   
paul c
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-14-2010 , 10:24 AM



On Oct 12, 1:08*pm, Roy Hann <specia... (AT) processed (DOT) almost.meat> wrote:
Quote:
-CELKO- wrote:
How many kids
today have ever seen a polyphase merge sort on tape drives?

I turned 50 last birthday and I've never seen one, and I live in
England.

--
Roy
After the late 1960's at least in the IBM part of the western dp world
(excluding, for example, India), most installations had disks to use
for 'sortwk...' (aka sort work files). Tape sorts were pretty
hilarious to watch, usually two or three tape drives were used as
'work area'. In some installations, close to 50% of those runs had to
be re-started because of tape head dirt or other malfunctions. It was
important to use fairly new tape and to clean the heads first. From
moment to moment, the drives would switch between 'read' and 'read
backwards' which played hob with the tape motors and drive belts and
the thin 1/2" tape. Movie crews liked them for the appearance of
meaningful action. I'd say they were an order-of-magnitude slower
than disk-based sorts. Programming in those days had a high ratio of
manual labour.
81b39258-7323ab72-68a05e2552

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.