dbTalk Databases Forums  

Hashing for DISTINCT or GROUP BY in SQL

comp.databases.theory comp.databases.theory


Discuss Hashing for DISTINCT or GROUP BY in SQL in the comp.databases.theory forum.



Reply
 
Thread Tools Display Modes
  #31  
Old   
Cimode
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-18-2010 , 07:59 AM






Quote:
In fact, I wonder
what the hell a 'theory of implementation' would look like. *I'm
tempted to suggest it would look like a 'theory of brick masoning', or
a 'theory of mixing flour and water for bread baking'. *Sounds odd, I
must say.
Personally, I wander why would relational theorist consider that they
could get away with not integrating one as a extension of the logical
relational model. Nowadays, attempts at building a relationally sound
system remains bound to programmer's preferences.

But I like your analogy.

Reply With Quote
  #32  
Old   
paul c
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-18-2010 , 08:59 AM






On 18/10/2010 2:31 AM, Cimode wrote:
Quote:
What is the object of optimization in a relational perspective ? What
does it aim at, physically wise ? What are the modalities to defined
to achieve such aim ? Don't you have the impression that the
definition of*optimization* in such broad perspective is the
relational theorists's way sweep the dust under the rug ?

I just mean that the optimizations typically used are adhoc. In
algebra, we could write a 'reference' constraint, a more general form of
'foreign key' as B{F} = B{F} <AND> A{K}. That could be implemented
directly.

It must be a very desireable optimization to minimize physical work even
if that work is only the movement of electrons in the face of the
equally desireable ability to 'update', say when a relvar is assigned.
When faced with B' := B <OR> I, where the apostrophe refers to a relvar
B after an 'insert', it's fairly obvious that the reference constraint
can be tested against the 'before' value and only the tuples 'I' need to
be tested (as opposed to all the tuples in B'). This strikes me as both
a logical and physical optimization.

Being somewhat inept at relational calculus I don't have a ready example
for that. I know that Codd expected his calculus to be more easily
optimizable but I think that would still result in a similar situation.
This may have been written about and I just haven't seen it, but there
seems to be some missing implementation theory, to do with axioms, if
you will, for analyzing such expressions and recognizing when the set
doesn't need to be tested, ie., when only one term in an algebraic
expression needs to be tested against a constraint.

(The reference constraint becomes quite convoluted when a strict foreign
key constraint is desired and there are accordingly more ways to write
it, but an optimization 'theory' would need to handle that too. A
strict algebra won't have any keywords such as 'KEY'. A comprehensive
optimization theory (eg. axioms?) for algebra expressions doesn't exist
as far as I know. Even though Date points out a number of useful
optimizations in his Intro book, as I recall he doesn't get down to
brass tacks with the perhaps most elementary one, candidate keys.)

Reply With Quote
  #33  
Old   
paul c
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-18-2010 , 09:14 AM



On 18/10/2010 6:59 AM, paul c wrote:
Quote:
That could be implemented directly.
Meant to write: That could be implemented directly at the expense of a
lot of unneeded physical work.

Reply With Quote
  #34  
Old   
paul c
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-18-2010 , 09:33 AM



On 18/10/2010 2:31 AM, Cimode wrote:
Quote:
As years pass by and database science goes further and further into
collective IT industry amnesia, I somehow believe*old timers* (note
that I do consider myself an old timer) have simply failed to explore
the opportunities openned by Codd, and remained, implementation wise,
too bound to the IBM historical context of how relational
implementation is to be conceived. Such lack of perpective limited RM
expressive power and constrained it(unhappy pun!) to empirism rather
than applying it to how we conceive information encoding as a whole.
I remember a lot of amnesia! I posed a question I was curious about
even though it is probably pointless today, to the folklore usenet
group, about the origin of the term 'field' as applied to file systems.
It might have been an unfair question since that group is mostly
interested in the cultural history of hardware and operating systems as
opposed to all software, but I thought it was more apt for that group
than this one. There are several mathematical definitions for the term
'field' but the term also dates back to the 1940's use of Hollerith
punch cards and possibly much earlier for all I know (I remember seeing
punched cards that had pre-printed 'nested fields', eg., telephone
number might have sub-fields of area code and number.

Anyway, it may be because I wrote the question badly, but so the
reaction of that group seems to be that there was no mathematical
inspiration behind the term. I had guessed that possibly the
originators of Fortran might have thought it had special meaning being
that the early Fortran files often consisted solely of numbers of
different kinds and there is a definition of a 'closed field' of numbers
that some mathematicians use. But logicians at one time also used the
term to describe the union of what they called 'converse domains'.
Maybe neither influenced the computer use of the term but so far nobody
seems certain of that.

Reply With Quote
  #35  
Old   
Cimode
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-18-2010 , 10:45 AM



[Snipped]
Quote:
When a constraint applies to a relation (or maybe I should say when a
relation satisfies a constraint), how and when can we apply the
constraint solely to a restriction or projection of the relation and
know that the constraint is still satisfied by the relation? *
Precisely. And whatelse, appart from a *mathematical* computing model
or relational implementation, can tell us that such method is actually
a preferable way to encode facts ? I have not seen a single line of
coherent work involving relation cardinalities in equations to
quantify logical IO's in a meaningful way to be represented on a
linear adress scheme.

Quote:
There are
lots of extant adhoc re-writings used by various 'optimizers', some of
which are have more formal underpinnings than others, such as the
re-writing of a disjunctive query into a conjunctive form.
Traditionally, both human and machine 'optimizers' have used metrics
which I would classify as fairly crude, akin to what engineers call
'first approximations', the most common one being the counting of
physical I/O's.
Precisely.*
A symptomatic example of immaturity of database science is the
inability of actually defining *under which physical conditions*, the
logical implementation is both possible and by what criteria it could
be considered sound. We should not count physical IO's but rather the
opposite. Logical IO's should be counted then reduced as a part of a
mathematical computing model.

Quote:
A typical 'recipe' first looks for a few 'ingredients'
such as keys followed by physical indexes and some will compare
alternate indexes before 'deciding'. *Not many will eliminate redundancy
in a query or constraint. *Optimization progress seems to involve
whittling away at special cases. *I suspect the re-writing usually
consists of re-writing queries, not the relations being queried. *Just
musing out loud.- Masquer le texte des messages précédents -
Recipes. That's is exactly where we are at, regarding implementation:
cooking.

Reply With Quote
  #36  
Old   
Cimode
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-18-2010 , 11:31 AM



On 18 oct, 15:59, paul c <anonym... (AT) not-for-mail (DOT) invalid> wrote:
Quote:
On 18/10/2010 2:31 AM, Cimode wrote:

What is the object of optimization in a relational perspective ? What
does it aim at, physically wise ? *What are the modalities to defined
to achieve such aim ? Don't you have the impression that the
definition of*optimization* *in such broad perspective is the
relational theorists's way sweep the dust under the rug ?

I just mean that the optimizations typically used are adhoc. *In
algebra, we could write a 'reference' constraint, a more general form of
'foreign key' as B{F} = B{F} <AND> A{K}. *That could be implemented
directly.
Sure but what tells you that doing that such implementation is
preferable than some other ? You are describing a possibility but how
do you formalize this solution as being a part of a model? How do you
evaluate the limits of such approach ?

Quote:
It must be a very desireable optimization to minimize physical work even
if that work is only the movement of electrons in the face of the
equally desireable ability to 'update', say when a relvar is assigned.
When faced with B' := B <OR> I, where the apostrophe refers to a relvar
B after an 'insert', it's fairly obvious that the reference constraint
can be tested against the 'before' value and only the tuples 'I' need to
be tested (as opposed to all the tuples in B'). *This strikes me as both
a logical and physical optimization.
Actually, I believe it is the way around. We should not start from
attempting to minimize physical IO's but rather see how the logical
IO's can be reduced starting from the logical relational operation at
hand. The physical IO's can only be reduced *if and only if* the
logical IO's can properly be quantified and optimized.

Quote:
Being somewhat inept at relational calculus I don't have a ready example
for that. *I know that Codd expected his calculus to be more easily
optimizable but I think that would still result in a similar situation.
Exactly, but Codd somehow suggested that the chances for a physical IO
to be optimized when the logical IO's is not , are NULL
To my understanding, he considered that logical IO should always take
precedence over the physical IO in the evaluation of a computing
scheme.

Quote:
This may have been written about and I just haven't seen it, but there
seems to be some missing implementation theory, to do with axioms, if
you will, for analyzing such expressions and recognizing when the set
doesn't need to be tested, ie., when only one term in an algebraic
expression needs to be tested against a constraint.

(The reference constraint becomes quite convoluted when a strict foreign
key constraint is desired and there are accordingly more ways to write
it, but an optimization 'theory' would need to handle that too. *A
strict algebra won't have any keywords such as 'KEY'. *A comprehensive
optimization theory (eg. axioms?) for algebra expressions doesn't exist
as far as I know. *Even though Date points out a number of useful
optimizations in his Intro book, as I recall he doesn't get down to
brass tacks with the perhaps most elementary one, candidate keys.)
Foreign KEY is a already a complex case. My point is that we lack the
model to represent individual relations in a computing perspective in
the first place. In such perspective, quantifying logical IO's for
operations seems premature.

Reply With Quote
  #37  
Old   
Gene Wirchenko
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-18-2010 , 01:59 PM



On Mon, 18 Oct 2010 14:33:47 +0000 (UTC), paul c
<anonymous (AT) not-for-mail (DOT) invalid> wrote:

Quote:
On 18/10/2010 2:31 AM, Cimode wrote:
As years pass by and database science goes further and further into
collective IT industry amnesia, I somehow believe*old timers* (note
that I do consider myself an old timer) have simply failed to explore
the opportunities openned by Codd, and remained, implementation wise,
too bound to the IBM historical context of how relational
implementation is to be conceived. Such lack of perpective limited RM
expressive power and constrained it(unhappy pun!) to empirism rather
than applying it to how we conceive information encoding as a whole.

I remember a lot of amnesia! I posed a question I was curious about
even though it is probably pointless today, to the folklore usenet
group, about the origin of the term 'field' as applied to file systems.
alt.folklore.computers.

Quote:
It might have been an unfair question since that group is mostly
interested in the cultural history of hardware and operating systems as
opposed to all software, but I thought it was more apt for that group
than this one. There are several mathematical definitions for the term
Well, no. We deal with all aspects of it. Maybe, nobody knew. I
had never considered the issue myself.

Quote:
'field' but the term also dates back to the 1940's use of Hollerith
punch cards and possibly much earlier for all I know (I remember seeing
punched cards that had pre-printed 'nested fields', eg., telephone
number might have sub-fields of area code and number.

Anyway, it may be because I wrote the question badly, but so the
reaction of that group seems to be that there was no mathematical
inspiration behind the term. I had guessed that possibly the
originators of Fortran might have thought it had special meaning being
that the early Fortran files often consisted solely of numbers of
different kinds and there is a definition of a 'closed field' of numbers
that some mathematicians use. But logicians at one time also used the
term to describe the union of what they called 'converse domains'.
Maybe neither influenced the computer use of the term but so far nobody
seems certain of that.
It might just be by extension to another type of delimited area.

Sincerely,

Gene Wirchenko

Reply With Quote
  #38  
Old   
Sampo Syreeni
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-19-2010 , 07:18 PM



On Oct 12, 7:35*pm, -CELKO- <jcelko... (AT) earthlink (DOT) net> wrote:

Quote:
In the old days, a DISTINCT or GROUP BY in an SQL engine were done by
a sort. Back then we built early SQLs on top of existing file systems
and we had pretty good sort procedures in the library. How many kids
today have ever seen a polyphase merge sort on tape drives?
Simulated that on disk, been there. You have to be rather keen on the
"tape" labels unless you were fancy enough to well-errorcheck in code.

Quote:
But today with parallel hardware and good hashing algorithms, would it
be faster to use hashing to cluster equivalent classes of data
together?
In case of flat distributions and good enough caching, often yes.
Which is why e.g. under Oracle we rarely see sort-merge joins, and
often see hash joins. (I hope the Grace, zigzag or one of the other
more pipelined and well-cached ones.) When there is skew, adaptive
subdivision algorithms work better, as in dynamic hashing or 0-
complete trees of pages. For flat distributions with limited total
cardinality, binning (and by extension dynamic radix sort like
algorithms) work better. In any case, buffer trees are near-optimal in
aggregate, while splay-tree and other near-cache-oblivious designs can
both lead to near-linear access and cache locality.

Quote:
Does there exist a set of hashing functions, H1(), H2(), .., Hn()
which will produce at least one different result for any pair of data
values?
In general, obviously not unless the number of different inputs
exceeds the number of possible outputs. That is the pigeon-hole
principle right there.

At the very limit when the possible numbers of inputs and outputs are
precisely equal, it's a definite yes. Then you'd be aiming at a pseudo-
random permutation in an arbitrary field. That can always be done by
choosing some suitable elementary distribution of cycle lengths which
sum up to the total width and then mixing up the elementary ones,
randomly. It's messy and expensive, but yes, it can be done.

If the number of different inputs is below the number of possible
outputs, obviously yes. You could just use a potentially *huge* lookup
table. But what you're likely looking for here and in the previous
case is a computationally efficient means of distributing the values.
In all cases you'd want to have some extra, global metric of "good
redistribution" in excess as well.

This sort of thing is then highly nontrivial. The physicists have
formalized this sort of intuition pretty well within ergodic theory,
using continuous variables. There it's about globally measure-
preserving mappings. But the discrete case remains nasty, in small-
number-cases proven-to-be-insoluble, in very small cases proven-to-be-
inapproximable as well, and even in larger cases still computationally
intractable when phrased as an optimization problem; cryptographers
battle with this stuff everyday, because what you're probably asking
for is also the ideal mixing function within a symmetric cipher, given
a constant key.

Finally, let's not forget that this framework perhaps isn't the best
one for us. Most of our data isn't random, and most certainly it isn't
processed in a random order. That is why a simple linear congruence
relation works so well as a hash function: two separate inputs will
not lead to the same outcome unless they are precisely the number of
possible outcomes apart. Increasing input values lead to increasing
output values, except when we wrap around, so that we can utilize
linear instead of random scans further down the stack. We can always
choose the wraparound to be mutually prime to most of the known cycles
our data might have, like day-month-year cycles and estimated peaks in
name statistics if they should statistically affect the input. And as
long as you hash, you just can't avoid ones, zeroes, nulls and the
like being rather singular in the input; you have to do something
different with them in any case.

Reply With Quote
  #39  
Old   
paul c
 
Posts: n/a

Default Re: Hashing for DISTINCT or GROUP BY in SQL - 10-19-2010 , 10:07 PM



On 19/10/2010 5:18 PM, Sampo Syreeni wrote:
Quote:
Simulated that on disk, been there. You have to be rather keen on the
"tape" labels unless you were fancy enough to well-errorcheck in code.

'simulate' is a very slippery word. disks don't have capstans.

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.