dbTalk Databases Forums  

Re: Stochastic Queries

comp.databases.theory comp.databases.theory


Discuss Re: Stochastic Queries in the comp.databases.theory forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Jason Lepack
 
Posts: n/a

Default Re: Stochastic Queries - 09-10-2007 , 07:26 AM






It depends... I don't know of any "good" methods using ANSI SQL, but
there are methods that can be used depending on what type of SQL you
are using.

Try google... I found all of these solutions by typing into Google:
"language-name" random records

In T-SQL there's the newid() method:
http://www.sqlteam.com/article/using...y-sort-records

Oracle has sample():
http://www.adp-gmbh.ch/ora/sql/examp...ct_sample.html

MySQL has rand():
http://webxadmin.free.fr/article/mys...ecords-170.php

Cheers,
Jason Lepack

On Sep 10, 2:21 am, Joe Thurbon <use... (AT) thurbon (DOT) com> wrote:
Quote:
As a part of some experimental work I've been doing, I need to write
what I think is best described as a stochastic quota query.

I would like to write a query like

SELECT RANDOM 5 COLUMN_FOO FROM TABLE_BAR

where RANDOM is like TOP, except that for successive queries, I get a
(possibly) different set of 5 COLUMN_FOO values from table TABLE_BAR.

Has anyone seen such a beast?

Thanks in advance,
Joe



Reply With Quote
  #2  
Old   
-CELKO-
 
Posts: n/a

Default Re: Stochastic Queries - 09-10-2007 , 11:45 AM






Quote:
As a part of some experimental work I've been doing, I need to write what I think is best described as a stochastic quota query.
You can write some proprietary kludges, but frankly they have problems
with skewed distributions. Better to stick to a stats package to draw
your samples (with or without replacements? Tiered? etc.?) There was
a good book years ago with a title somethign like a SAMPLER OF
SAMPLING that went into the details.

One trick I have used is to build a table:

CREATE TABLE Samples
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
random_nbr INTEGER NOT NULL); -- or whatever data type you need

then load it with the particular random numbers I wanrted from a stat
package. More work, but MUCH better results.



Reply With Quote
  #3  
Old   
DBMS_Plumber
 
Posts: n/a

Default Re: Stochastic Queries - 09-19-2007 , 03:21 PM



On Sep 10, 9:45 am, -CELKO- <jcelko... (AT) earthlink (DOT) net> wrote:
Quote:
You can write some proprietary kludges, but frankly they have problems
with skewed distributions.
....

Quote:
then load it with the particular random numbers I wanrted from a stat
package. More work, but MUCH better results.
In the first place, this does not answer the question the OP asked.

My advice to the OP would be to write a Java app that generated these
queries. It's been done before. I suggest the OP read "Massive
Stochastic Testing of SQL" by Don Slutz of Microsoft. To my certain
knowledge at least two of the major DBMS vendors have a similar
testing facility. They're not a panacea, but they're useful. For extra
points you might want to include joins, unions and what-not in your
range of query features.

But the the question Joe answers, I am VERY interested in
understanding why these 'proprietery kludges' (either to seelct random
samples, or to generate random numbers) 'have problems' with skewed
distributions. (Disclosure - I worked on one.)

These 'proprietary kludges' sweat a lot of details. Joe's solution
here is wrong. Repeated use of the same set of random numbers will
generate an identical sample each time it's used. From a practical
point of view, you would be required to generate a new set of random
values for each query. I am also curious to know how you use this
table to restrict the rows being returned in the general case to any
kind of uniform random sample.




Reply With Quote
  #4  
Old   
DBMS_Plumber
 
Posts: n/a

Default Re: Stochastic Queries - 09-19-2007 , 03:24 PM



On Sep 19, 1:21 pm, DBMS_Plumber <paul_geoffrey_br... (AT) yahoo (DOT) com>
wrote:
Quote:
In the first place, this does not answer the question the OP asked.
Upon further review, I am incorrect. Joe's post does attempt to
answer the OP's question. I believed the OP wanted to select a
different set of columns for each query.

My criticism of Joe's response stands, however.



Reply With Quote
  #5  
Old   
-CELKO-
 
Posts: n/a

Default Re: Stochastic Queries - 09-20-2007 , 06:01 PM



Quote:
I suggest the OP read "Massive Stochastic Testing of SQL" by Don Slutz of Microsoft.
I need to get a copy, too.

You might want to look at a current article by Ben Gan on the use of
RAND() and NEWID() and differences in SQL Server 2000 and 2005. He
covers the problems with how a CASE expression uses RAND(),
deterministic functions, duplicate values, etc.

Quote:
I am VERY interested in understanding why these 'proprietary kludges' (either to select random samples, or to generate random numbers) 'have problems' with skewed distributions. (Disclosure - I worked on one.)
The SQL products I have worked with use traditional Linear
Congruential algorithms, which they inherited from C and UNIX. As you
get larger and larger samples, you get skewing and duplicate values.
Knuth Vol #2, Chapter 3 has a good history and some remarks about the
history.

The best (worst?) horror story in the 1970's was the discovery that an
IBM FORTRAN routine was not valid. It trashed quite a few PhD
projects. That was the most popular tool in those days for
research.

Quote:
Repeated use of the same set of random numbers will generate an identical sample each time it's used.
Not if the population changes each time. But I understand your point.

However, consider that one of the advantages of RNG is that you can
repeat an experiment. I worked with some lab equipment on an old DEC
PDP-11 more decades ago than I really like to remember with a special
circuit card. This thing had a speck of radioactive material and a
simple Geiger counter tube to create **quantum level** random digits.
Of course people did not seeing that black & yellow radiation symbol
on equipment in those days (Cold War Era) so I am not sure if it is
still available. You can probably use background radiation or radio
noise today with sensitive equipment.

But if I want a fixed table, I think most statisticians would agree
that the RAND corporation "Table of One Million Random Digits" has
been tested in every possible way for mathematical correctness. That
is a fixed table available on diskette!

Quote:
From a practical point of view, you would be required to generate a new set of random values for each query. I am also curious to know how you use this table to restrict the rows being returned in the general case to any kind of uniform random sample.
I start at the front of the RAND Corporation table and pull digits
until I get to the end. I can go to some university sites and get
bigger validated tables, too.






Reply With Quote
  #6  
Old   
DBMS_Plumber
 
Posts: n/a

Default Re: Stochastic Queries - 09-20-2007 , 06:49 PM



On Sep 20, 4:01 pm, -CELKO- <jcelko... (AT) earthlink (DOT) net> wrote:
Quote:
I suggest the OP read "Massive Stochastic Testing of SQL" by Don Slutz of Microsoft.

I need to get a copy, too.
It doesn't talk about this subject. Only about generating lots of
independent SQL queries to test the robustness of the SQL processor.
It's fascinating - but not apropos.

Quote:
You might want to look at a current article by Ben Gan on the use of
RAND() and NEWID() and differences in SQL Server 2000 and 2005. He
covers the problems with how a CASE expression uses RAND(),
deterministic functions, duplicate values, etc.
Familiar with it.

As an aside, Postgres had random number generation functions in 1993.
In fact, that system developed an entire class of optimizer
modifications to cope with the general class of 'non deterministic'
functions, which includes 'rand()', but just as easily might apply to
'has_the_missile_been_fired_yet()'. It was pleasing to see SQL Server
arrive at the same conclusions 15 years later. <snark off>

Quote:
I am VERY interested in understanding why these 'proprietary kludges' (either to select random samples, or to generate random numbers) 'have problems' with skewed distributions. (Disclosure - I worked on one.)

The SQL products I have worked with use traditional Linear
Congruential algorithms, which they inherited from C and UNIX. As you
get larger and larger samples, you get skewing and duplicate values.
Knuth Vol #2, Chapter 3 has a good history and some remarks about the
history.

The best (worst?) horror story in the 1970's was the discovery that an
IBM FORTRAN routine was not valid. It trashed quite a few PhD
projects. That was the most popular tool in those days for
research.
I'm intimately familiar with this literature. There's also an
excellent discussion in the more recent _Numerical_Recipes_ books.

But how do you know what RNG these products use? Did you see their
source code?

Writing a very good RNG is not hard. It's 10% scholarship, 10% code
effort, and 80% testing.

[ snip ]

Quote:
However, consider that one of the advantages of RNG is that you can
repeat an experiment.
There are a large number of RNG algorithms which, when handed the
same seed, generate the same sequence of values, making it possible to
use of the same list of random values as often as one would like.

[ snip ]

Quote:
But if I want a fixed table, I think most statisticians would agree
that the RAND corporation "Table of One Million Random Digits" has
been tested in every possible way for mathematical correctness. That
is a fixed table available on diskette!
And is the same set of numbers used repeatedly. Which means
subsequent samples derived using the same table of values -- or
samples derived from other sources using the same table of values --
are no longer independent.

[ snip ]

A raft of other practical problems present themselves with the use
of a static table. I'll describe one of them.

To use that table of random numbers, you need to plug it into a
query somehow. This means either a) modifying the table to add a
column containing a random value, or b) writing a join. Now the
problem with either approach is that a query processor--not being
aware of the random nature of what is going on--is going to move the
restriction or the join operation to wherever the physical planner
sees fit to place it.

Confronted with these problems -- customers wanting bernoulli
samples, for instance, but getting instead samples whose size
distribution was all over the shop -- DBMS vendors added random number
generators and sampling algorithms to their products in very deep
ways. It's not just the quality of the RNG that counts. The whole
query processor needs to change.

And even then, users wonder why the sampling query is no faster than
the original. So they added yet another layer of complexity to make it
go fast, as well as generate "statistically rigorous" samples. (Note:
They're not. But they're usually good enough for government work.)

In a nutshell, DBMS vendors (well - the one or two I am most
familiar with) added random() functions and sampling with enormous
care. They deal with a small number of very, very sophisticated
customers who subject these products to tremendous amounts of
validation. If these customers are disappointed, they get upset. (If
they're happy they never say so. Such is life.)



Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.