Relational query with path expressions -
04-02-2009
, 10:58 PM
I get the impression that a lot of the time, a query (or a large
portion of a query) can be thought of as a navigation from attribute
to attribute using a sequence of joins. This made me think there may
be some merit in thinking of attribute names as global in nature and
the user only has to specify the start and end attributes and the DBMS
is able to deduce the joins that are required. The potential upshot
would be that queries are simplified and the user is freed from the
burden of understanding the way the data has been decomposed into
separate relations. Just recently I read about the so called
Universal Relation which takes this approach. However there are a
number of criticisms - such as those described in a paper by Kent.
I think the biggest problem is that quite often there are multiple
useful paths. For example, consider Chris Date's supplier example
from his book Introduction to Database Systems. The relations are
S(S#, SNAME, STATUS, CITY)
Supplier number S# has name SNAME, status STATUS
and is located in CITY
P(P#, PNAME, COLOR, WEIGHT, CITY)
Part number P# has name PNAME, colour COLOR,
weight WEIGHT in pounds and is located in CITY
SP(S#, P#, QTY)
S# shipped QTY of P#
There are two obvious ways to navigate from S# to P#. One can either
go via the shipments relation SP (where S# and P# appear in the same
relation), or one can join relations S and P on S.CITY = P.CITY.
Presumably to define the Universal Relation one would be forced to
pick one of these paths as the default, and rename attributes
accordingly. E.g. we could arbitrarily assume the default
relationship between suppliers and parts is through the shipments, and
rename CITY as SCITY in S, and CITY as PCITY in P, so they have
distinct roles in the Universal Relation.
It seems that explicit joins are a better way to give the user precise
control over the access paths. However I find it interesting that a 3-
way natural join on S,P and SP imposes both of the above navigation
paths at the same time, and that's rarely what's required.
After thinking about that problem for a while, it occurred to me that
access paths should emphasise navigation from relation to relation,
rather than from attribute to attribute (as suggested by the hyper-
graphs in the literature on the Universal Relation).
The following describes some ideas for a query language on a
relational database based on path expressions that allow relation
names to be specified in the path. It is more closely related to the
relational algebra than the calculus. It's assumed there are no NULLs
and queries work on sets not bags.
Relvars
-------
The name of a base relvar evaluates to its current relation value.
For example, the path expression
P
evaluates to the current value of the relvar named P.
Restriction
-----------
A restriction can be taken using a postfix operator which is a boolean
expression enclosed in square brackets and parameterised on the
attributes of the relation. For example, the path expression
P[COLOR = 'red']
evaluates to the restriction on P where parts are red.
For convenience it's assumed it's not necessary to specify a selector
as in COLOR('red').
Projection
----------
The following path expression subsequently takes the projection onto
attributes P# and WEIGHT:
P[COLOR = 'red'].(P#,WEIGHT)
Note that since path expressions evaluate to sets, we know that
duplicates are removed from the projection.
When projecting onto a single attribute the brackets aren't required.
For example,
S.S#
is the projection of relation S onto attribute S#.
Join with projection
--------------------
Let R1 and R2 be relations, then the path expression R1.R2 denotes the
natural join of R1 and R2 followed by a projection onto the attributes
of R2. For example,
P[COLOR = 'red'].SP
retrieves the tuples in SP associated with the shipment of a red part.
The idea that R1.R2 can drop attributes from R1 may seem strange. A
benefit arises for example in path expressions like R1.R2.R3 because
it can avoid joining on attributes which happen to have the same name
in R1 and R3. This can help to control the joins along a path. For
example,
P[COLOR = 'red'].SP.S.SNAME
retrieves the names of the suppliers who ship at least one red part,
whereas
P[COLOR = 'red'].S.SNAME
returns the names of suppliers that are located in a city where there
are red parts.
This is exactly the distinction discussed in the introduction.
A brief comparison to other query languages
-------------------------------------------
From Date's book, the following SQL retrieves supplier names for
suppliers who supply at least one red part:
SELECT DISTINCT S.SNAME
FROM S
WHERE S.S# IN
( SELECT SP.S#
FROM SP
WHERE SP.P# IN
( SELECT P.P#
FROM P
WHERE P.COLOR = 'red'));
The following expression in the relational algebra is somewhat more
concise:
( ( ( P WHERE COLOR = 'red' )
JOIN SP ) { S# } JOIN S ) { SNAME }
Note that the above needs to project the result of the join of P and
SP onto { S# } before joining with S in order to avoid the join on
P.CITY = S.CITY.
For those interested in the comparison, Date also provides the same
query in the relational calculus and the domain calculus.
Evidently the path expression
P[COLOR = 'red'].SP.S.SNAME
is simpler than the above alternatives.
Join without projection
-----------------------
Given relations R1,...,Rn, (R1,...,Rn) denotes the natural join of
R1,...,Rn. For example,
(S,P)
gets all combinations of supplier and part information such that they
are located in the same city.
Aggregate operators
-------------------
Let R be a relation and every attribute of R is defined on a domain
that supports summation. Then the path expression R.@sum denotes a
relation with the same attribute names as R that contains a single
tuple which maps each attribute name to the sum of that attribute over
tuples from R. If R is empty then R.@sum contains a single tuple full
of zeros. For example,
SP[P# = 'P2'].QTY.@sum
retrieves the total quantity of shipments for part P2.
Similarly we define operators @min, @max, @avg, except there is no
tuple in the output if the input relation is empty (i.e. has no
tuples).
R.@count evaluates to a relation with a single tuple with a single
attribute which gives the number of tuples in R. For example,
S.@count
retrieves the total number of suppliers.
R.@exists evaluates to a relation with a single tuple with a single
boolean attribute that is true iff R is non-empty.
Generalisation of projection
----------------------------
It was stated earlier that R.(x1,...,xn) denotes a projection of R
onto attributes named x1,...,xn. This can be regarded as a special
case of something much more general - where we allow path expressions
instead of just attribute names for each xi.
Formally the result is defined as a union of joins where the union is
taken over the tuples t of R, and the joins are taken over the
relations {t}.xi for each i where {t} denotes the relation consisting
of the single tuple t. Typically the {t}.xi have no attributes in
common so the joins are cartesian products.
For example,
P.(COLOR, SP)
evaluates to the relation which is the natural join of P and SP,
projected onto the attributes COLOR,S#,P#,QTY.
To help understand this it may be helpful to pretend that the dot
operator distributes over the comma separated list. i.e. we think of
P.(COLOR, SP) as (P.COLOR, P.SP). However we don't simply take the
cartesian product P.COLOR x P.SP. Instead we account for the shared
prefix P in the paths which "correlate" these two path expressions.
In other words, for each tuple from P that we use to navigate to its
COLOR, we note that the same tuple is involved in the navigation from
P to SP by joining on P.P# = SP.P#.
It is convenient to support a syntax that adds extra attributes to an
existing relation. The path expression R.(*,x1,...,xn) is equivalent
to R.(y1,...,ym,x1,...,xn) where y1,...,ym are the attributes of R.
The aggregate operators can be used to aggregating over groups. For
example,
P.(P#, SP.QTY.@sum)
returns a relation where for each part, it returns the part number and
the total shipment quantity (which could be zero if there are no
shipments of that part).
Note that
P.(P#, SP.QTY.@min)
retrieves the minimum quantity of shipments for each part number, but
only for parts where there exists at least one shipment by some
supplier.
We can group over more than one attribute. For example,
P.(PNAME,CITY).(*,P.WEIGHT.@sum)
retrieves the sum of the weights, grouped over PNAME and CITY.
As a more complicated example,
P.(P#, SP.(QTY,S.STATUS).(@min,@max,@avg))
retrieves for each part, the part number and the minimum, maximum and
average shipment QTY, and the minimum, maximum and average supplier
status. However for this to make sense we would need some convention
to assign unique attribute names in the output.
Generalisation of restriction
-----------------------------
So far it has have assumed that for the restriction R[b(x1,...,xn)],
the boolean expression b can only be parameterised by attribute names
x1,...,xn from R. This is generalised to allow the xi to be path
expressions.
R[b(x1,...,xn)] is formalised as a restriction on R where for each
tuple t in R, b is evaluated as a boolean function parameterised on
the {t}.xi, which most generally are relations. If however, a {t}.xi
is a relation containing a single tuple defined on a single attribute,
then it can be treated as a scalar value within the boolean
expression.
For example,
P[SP.S.@count > 1].P#
retrieves the part numbers of the parts supplied by more than one
supplier.
P[SP.S# = {'S2'}].P#
retrieves part numbers for parts that are shipped only by supplier S2.
S[not SP.P[P#='P2'].@exists].SNAME
or equivalently,
S[not 'P2' in SP.P.P#].SNAME
retrieves names of suppliers that don't supply part P2.
Renaming
--------
Renaming of attributes is supported. For example
S.(CITY as SCITY, SP.P.CITY as PCITY)
retrieves all pairs of city names (SCITY, PCITY) such that a supplier
in city SCITY supplies a part to city PCITY
Calculated attributes
---------------------
calculated attributes are supported. For example,
P.(P#, WEIGHT*454 as W)[W > 10000]
retrieves part number P# and weight W in grams for each part with
weight > 10000 grams
Path expressions with global scope
----------------------------------
A path expression that begins with '..' is assumed to have a global
scope. For example:
S[STATUS < ..S.STATUS.@max].S#
retrieves supplier numbers for suppliers with status less than the
current maximum status.
As another example
S[SP.P# = ..P.P#].SNAME
retrieves supplier names for suppliers who ship all parts |