dbTalk Databases Forums  

Picking a row from each group

comp.databases comp.databases


Discuss Picking a row from each group in the comp.databases forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Tom Anderson
 
Posts: n/a

Default Picking a row from each group - 05-03-2010 , 08:14 AM






Hi,

I'm a programmer with only rudimentary knowledge of SQL, and i'm wondering
about how to implement a sort of versioned, time-based data store.

For the sake of argument, let's say i have a table like:

CREATE TABLE weather_reports (
station_number INTEGER PRIMARY KEY,
report_date DATE NOT NULL,
temperature INTEGER
);

And some data like:

station_number report_date temperature
-------------- ----------- -----------
1 2010-05-02 10
1 2010-05-03 11
2 2010-05-01 27
2 2010-05-02 30

Firstly, how can i ask for:

the latest temperature from each station

Given that the reports involved may be from different dates?

Secondly, how can i ask for:

the numbers of all the stations where the latest temperature was over 20

?

If i was doing this in imperative code, the former would look like:

for each station:
sort all the reports by date
take the latest

But i don't know how to transpose this to SQL. I'm guessing that involves
some combination of GROUP BY station_number, to give the aspect of working
over all the stations, ORDER BY report_date, to give the aspect of looking
at the data in terms of time, and some operator which takes the first row
in each sorted group. I don't know what that last operator is, and i don't
know how to put them all together.

I suspect the second query could be written as a simple select over the
results of the first:

SELECT station_number
FROM (the first query, projected over at least station_number and temperature)
WHERE temperature > 20

Sorry to ask such a basic question; if there's any fine manual i could
read online that would let me work this out myself, please don't hesitate
to tell me to read it.

Thanks,
tom

--
Science is the outcome of being prepared to live without certainty and
therefore a mark of maturity. -- AC Grayling

Reply With Quote
  #2  
Old   
Robert Klemme
 
Posts: n/a

Default Re: Picking a row from each group - 05-03-2010 , 11:18 AM






On 03.05.2010 15:14, Tom Anderson wrote:
Quote:
Hi,

I'm a programmer with only rudimentary knowledge of SQL, and i'm
wondering about how to implement a sort of versioned, time-based data
store.

For the sake of argument, let's say i have a table like:

CREATE TABLE weather_reports (
station_number INTEGER PRIMARY KEY,
report_date DATE NOT NULL,
temperature INTEGER
);

And some data like:

station_number report_date temperature
-------------- ----------- -----------
1 2010-05-02 10
1 2010-05-03 11
2 2010-05-01 27
2 2010-05-02 30

Firstly, how can i ask for:

the latest temperature from each station

Given that the reports involved may be from different dates?
Here's one way with a Join to an inline view that calculates the max
date for each station:

select wr.station_number, wr.temperature
from weather_reports wr join (
select station_number, max(report_date) max_date
from weather_reports
group by station_number
) mr on mr.station_number = wr.station_number
and mr.max_date = wr.report_date

Quote:
Secondly, how can i ask for:

the numbers of all the stations where the latest temperature was over 20?

If i was doing this in imperative code, the former would look like:

for each station:
sort all the reports by date
take the latest

But i don't know how to transpose this to SQL. I'm guessing that
involves some combination of GROUP BY station_number, to give the aspect
of working over all the stations, ORDER BY report_date, to give the
aspect of looking at the data in terms of time, and some operator which
takes the first row in each sorted group. I don't know what that last
operator is, and i don't know how to put them all together.
You need GROUP BY but you do not need ORDER BY - rather you need MAX.

Quote:
I suspect the second query could be written as a simple select over the
results of the first:

SELECT station_number
FROM (the first query, projected over at least station_number and
temperature)
WHERE temperature > 20
Absolutely correct. Now you only need the first query. :-)

You can also combine them in one. Taking the first one as basis:

select wr.station_number, wr.temperature
from weather_reports wr join (
select station_number, max(report_date) max_date
from weather_reports
group by station_number
) mr on mr.station_number = wr.station_number
and mr.max_date = wr.report_date
where wr.temperature > 20

Quote:
Sorry to ask such a basic question; if there's any fine manual i could
read online that would let me work this out myself, please don't
hesitate to tell me to read it.
There are tons. It may be a good idea to read the intro for the RDBMS
you are using.

Kind regards

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Reply With Quote
  #3  
Old   
Thomas Kellerer
 
Posts: n/a

Default Re: Picking a row from each group - 05-03-2010 , 11:20 AM



Tom Anderson wrote on 03.05.2010 15:14:
Quote:
station_number report_date temperature
-------------- ----------- -----------
1 2010-05-02 10
1 2010-05-03 11
2 2010-05-01 27
2 2010-05-02 30

Firstly, how can i ask for:

the latest temperature from each station
SELECT station_number, temperature
FROM weather_reports wr
WHERE wr.report_date = (SELECT max(report_date)
FROM weather_reports w2
WHERE w2.station_number = wr.station_number)


Quote:
Given that the reports involved may be from different dates?

Secondly, how can i ask for:

the numbers of all the stations where the latest temperature was over 20

Just add a WHERE temperature > 20 to above statement.


Quote:
and some operator which takes the first row in each sorted group.
I don't know what that last operator is, and i don't know how to put them all together.
This is called windowing functions, and the corresponding function would be first_value()


Quote:
Sorry to ask such a basic question; if there's any fine manual i could
read online that would let me work this out myself, please don't
hesitate to tell me to read it.
http://www.postgresql.org/docs/curre...al-window.html

Windowing functions are defined in the SQL standard and supported by all major DBMS
The above tutorial will also work for other DBMS that comply with the standard

Thomas

Reply With Quote
  #4  
Old   
Ben Finney
 
Posts: n/a

Default Re: Picking a row from each group - 05-03-2010 , 09:54 PM



Tom Anderson <twic (AT) urchin (DOT) earth.li> writes:

Quote:
I'm a programmer with only rudimentary knowledge of SQL, and i'm
wondering about how to implement a sort of versioned, time-based data
store.
It's a good exercise, thanks for raising it here.

Quote:
For the sake of argument, let's say i have a table like:

CREATE TABLE weather_reports (
station_number INTEGER PRIMARY KEY,
report_date DATE NOT NULL,
temperature INTEGER
);

And some data like:

station_number report_date temperature
-------------- ----------- -----------
1 2010-05-02 10
1 2010-05-03 11
2 2010-05-01 27
2 2010-05-02 30
That data doesn't match the table definition, and will be rejected by a
conformant RDBMS:

=====
=> INSERT INTO reading VALUES
(1, '2010-05-02 04:55', 10),
(1, '2010-05-03 04:52', 11),
(2, '2010-05-01 07:15', 27),
(2, '2010-05-02 07:28', 30);
ERROR: duplicate key value violates unique constraint "reading_pkey"
=====

So the primary key for the reading will need to be on both the station
identifier *and* when the reading was made.

Also, it could do with a little better design:

* Things that are assigned numbers for identifiers are frequently,
later, assigned more complex identifiers. I prefer to refer to these
values as ‘code’ from the beginning to be clear that we're not
counting anything, merely using an arbitrary identifier.

* A reading will be made at a specific time, so the data type should be
a timestamp.

* The temperature will be to some level of accuracy, probably
sub-degree, so the data type should be a real.

* There's likely no useful meaning from a temperature record without a
temperature value, so that attribute should also be NOT NULL. In
general, despite SQL's defaults, NULL is best avoided unless you have
a specific reason to introduce its complexities and potentials for
silent errors.

Here's my attempt (using PostgreSQL 8.4):

=====
CREATE TABLE reading (
station_code integer NOT NULL,
read_time timestamp NOT NULL,
temperature real NOT NULL,
PRIMARY KEY (station_code, read_time)
);
COMMENT ON TABLE reading IS
'An instantaneous temperature reading at a station.';

INSERT INTO reading (station_code, read_time, temperature) VALUES
(1, '2010-05-02 04:55', 10),
(1, '2010-05-03 04:52', 11),
(2, '2010-05-01 07:15', 27),
(2, '2010-05-02 07:28', 30);
=====

Quote:
Firstly, how can i ask for:

the latest temperature from each station

Given that the reports involved may be from different dates?
You would order the reports by timestamp, and ask for the latest value:

=====
=> SELECT
station_code,
MAX(read_time) AS read_time
FROM reading
GROUP BY station_code
;

station_code | read_time
--------------+---------------------
1 | 2010-05-03 04:52:00
2 | 2010-05-02 07:28:00
(2 rows)
=====

You can then use that relation to join with the reading table and get
the resulting temperature:

=====
=> SELECT
station_code,
temperature
FROM
reading
NATURAL JOIN (
SELECT
station_code,
MAX(read_time) AS read_time
FROM reading
GROUP BY station_code
) AS station_latest
;

station_code | temperature
--------------+-------------
1 | 11
2 | 30
(2 rows)

Quote:
Secondly, how can i ask for:

the numbers of all the stations where the latest temperature was
over 20
Refine the above query to get only the attributes you're interested in
(the ‘SELECT’ clause) and only the tuples you're interested in (the
‘WHERE’ clause):

=====
=> SELECT
station_code
FROM
reading
NATURAL JOIN (
SELECT station_code, MAX(read_time) AS read_time
FROM reading
GROUP BY station_code
) AS station_latest
WHERE
temperature > 20
;

station_code
--------------
2
(1 row)
=====


Quote:
If i was doing this in imperative code
[…]
But i don't know how to transpose this to SQL.
Thinking in imperative terms won't help you much. You don't have
iteration operations available.

With a relational database, you're encouraged to think in terms of
operations on whole relations: i.e. set operations like restrict, join,
project, difference, and so on.

Rather than making an imperative algorithm that then won't really
translate very well to SQL, instead ask: Given this set of data (the
entire database), how can I use set operations to extract only the data
I want?

Quote:
Sorry to ask such a basic question; if there's any fine manual i could
read online that would let me work this out myself, please don't
hesitate to tell me to read it.
You would do well to read C.J. Date's “Introduction to Database Systems”
<URL:http://en.wikipedia.org/wiki/Special:BookSources/0321197844>.

--
\ “There is no reason anyone would want a computer in their |
`\ home.” —Ken Olson, president, chairman and founder of Digital |
_o__) Equipment Corp., 1977 |
Ben Finney

Reply With Quote
  #5  
Old   
Jasen Betts
 
Posts: n/a

Default Re: Picking a row from each group - 05-04-2010 , 03:12 AM



On 2010-05-03, Tom Anderson <twic (AT) urchin (DOT) earth.li> wrote:
Quote:
Hi,

I'm a programmer with only rudimentary knowledge of SQL, and i'm wondering
about how to implement a sort of versioned, time-based data store.
which database? they're not all the same.

Quote:
For the sake of argument, let's say i have a table like:

CREATE TABLE weather_reports (
station_number INTEGER PRIMARY KEY,
report_date DATE NOT NULL,
temperature INTEGER
);

And some data like:

station_number report_date temperature
-------------- ----------- -----------
1 2010-05-02 10
1 2010-05-03 11
2 2010-05-01 27
2 2010-05-02 30

Firstly, how can i ask for:

the latest temperature from each station
Given that the reports involved may be from different dates?
select distinct on station_number * from weather_reports
order by station_number,report_date desc;

Quote:
Secondly, how can i ask for:

the numbers of all the stations where the latest temperature was over 20
select station_number from ( select distinct on station_number * from weather_reports
order by station_number, report_date desc ) as latest where temperature > 20;

The above is all AFAIK standard SQL99, which is a standard that
different databases suppoert ot different extents. it works for me,
but may not work for you.


--- news://freenews.netfront.net/ - complaints: news (AT) netfront (DOT) net ---

Reply With Quote
  #6  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: Picking a row from each group - 05-04-2010 , 08:55 AM



On 2010-05-04 10:12, Jasen Betts wrote:
[...]
Quote:
select station_number from ( select distinct on station_number * from weather_reports
order by station_number, report_date desc ) as latest where temperature> 20;

The above is all AFAIK standard SQL99, which is a standard that
different databases suppoert ot different extents. it works for me,
but may not work for you.
I've never seen that syntax before, what does:

select distinct on station_number * from weather_reports

mean?

/Lennart

Reply With Quote
  #7  
Old   
Ben Finney
 
Posts: n/a

Default Re: Picking a row from each group - 05-04-2010 , 06:08 PM



Lennart Jonsson <erik.lennart.jonsson (AT) gmail (DOT) com> writes:

Quote:
On 2010-05-04 10:12, Jasen Betts wrote:
The above is all AFAIK standard SQL99, which is a standard that
different databases suppoert ot different extents. it works for me,
but may not work for you.

I've never seen that syntax before, what does:

select distinct on station_number * from weather_reports

mean?
The expression on which to distinguish should be enclosed in
parentheses:

SELECT DISTINCT
ON (station_number)
*
FROM weather_reports;

According to the PostgreSQL docs, this is not standard SQL:

The clause DISTINCT ON is not defined in the SQL standard.

<URL:http://www.postgresql.org/docs/8.4/static/sql-select.html#SQL-DISTINCT>

--
\ “… a Microsoft Certified System Engineer is to information |
`\ technology as a McDonalds Certified Food Specialist is to the |
_o__) culinary arts.” —Michael Bacarella |
Ben Finney

Reply With Quote
  #8  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: Picking a row from each group - 05-04-2010 , 09:54 PM



On 2010-05-05 01:08, Ben Finney wrote:
[...]
Quote:
The expression on which to distinguish should be enclosed in
parentheses:

SELECT DISTINCT
ON (station_number)
*
FROM weather_reports;

According to the PostgreSQL docs, this is not standard SQL:

The clause DISTINCT ON is not defined in the SQL standard.

URL:http://www.postgresql.org/docs/8.4/s...l#SQL-DISTINCT

Thanks for the link. That explains the need for the order by clause in
the inner query. Interesting to see how vendors struggle to invent new
ways to shoot one self in the foot ... ;-)

/Lennart

Reply With Quote
  #9  
Old   
Tom Anderson
 
Posts: n/a

Default Re: Picking a row from each group - 05-06-2010 , 02:32 PM



On Tue, 4 May 2010, Jasen Betts wrote:

Quote:
On 2010-05-03, Tom Anderson <twic (AT) urchin (DOT) earth.li> wrote:

I'm a programmer with only rudimentary knowledge of SQL, and i'm wondering
about how to implement a sort of versioned, time-based data store.

which database? they're not all the same.
As yet, no database. My hope is to write something generic that would work
with many different databases, so i'm particularly interested in
standards-based solutions, optimistic as that may be. Although ...

Quote:
the latest temperature from each station
Given that the reports involved may be from different dates?

select distinct on station_number * from weather_reports
order by station_number,report_date desc;
That's a very nice way of expressing it, even if it is
PostgreSQL-specific!

tom

--
My goal wasn't to make a ton of money. It was to build good computers. --
Woz

Reply With Quote
  #10  
Old   
Tom Anderson
 
Posts: n/a

Default Re: Picking a row from each group - 05-07-2010 , 12:28 AM



On Tue, 4 May 2010, Ben Finney wrote:

Quote:
Tom Anderson <twic (AT) urchin (DOT) earth.li> writes:

CREATE TABLE weather_reports (
station_number INTEGER PRIMARY KEY,
report_date DATE NOT NULL,
temperature INTEGER
);

And some data like:

station_number report_date temperature
-------------- ----------- -----------
1 2010-05-02 10
1 2010-05-03 11
2 2010-05-01 27
2 2010-05-02 30

That data doesn't match the table definition, and will be rejected by a
conformant RDBMS:
Doh. Really sorry about that. delete the PRIMARY KEY and insert before
the closing paren:

CONSTRAINT weather_report_pk PRIMARY KEY(station_number, report_date)

As you inferred, of course.

The irony is that i wrote the table definition in DDL because i wanted to
be precise. Sadly, all i did was show precisely how careless i was.

Quote:
Firstly, how can i ask for:

the latest temperature from each station

Given that the reports involved may be from different dates?

You would order the reports by timestamp, and ask for the latest value:

=> SELECT
station_code,
MAX(read_time) AS read_time
FROM reading
GROUP BY station_code
;

You can then use that relation to join with the reading table and get
the resulting temperature:

=> SELECT
station_code,
temperature
FROM
reading
NATURAL JOIN (
SELECT
station_code,
MAX(read_time) AS read_time
FROM reading
GROUP BY station_code
) AS station_latest
;
Great - very clearly explained, thank you.

Quote:
If i was doing this in imperative code [?] But i don't know how to
transpose this to SQL.

Thinking in imperative terms won't help you much. You don't have
iteration operations available.

With a relational database, you're encouraged to think in terms of
operations on whole relations: i.e. set operations like restrict, join,
project, difference, and so on.
Right. I'm slowly learning to do this. Specifically, the idea of using a =
max(a) to select the row with the greatest value of some column was new,
and not at all intuitive, to me.

tom

--
My goal wasn't to make a ton of money. It was to build good computers. --
Woz

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.