![]() | |
![]() |
| | Thread Tools | Display Modes |
#11
| |||
| |||
|
|
Granted, if you have a choice between join and generalized projection you can infer the type of binary operator from the type of operand. In the expression x.y if both are relations, you interpret it as join, and if one is a set of attributes, than its a projection. That is some economy of notation. |
#12
| |||
| |||
|
|
just a suggestion, start a phony family tree at ancestry.com, try to put in a source for a marriage between two people and then check if ancestry's pathetic interpretation of the hierarchical gedcom so-called standard assigns the source to both spouses. |
#13
| |||
| |||
|
|
I get the impression that a lot of the time, a query (or a large portion of a query) can be thought of as a navigation from attribute to attribute using a sequence of joins. This made me think there may be some merit in thinking of attribute names as global in nature and the user only has to specify the start and end attributes and the DBMS is able to deduce the joins that are required. The potential upshot would be that queries are simplified and the user is freed from the burden of understanding the way the data has been decomposed into separate relations. Just recently I read about the so called Universal Relation which takes this approach. However there are a number of criticisms - such as those described in a paper by Kent. I think the biggest problem is that quite often there are multiple useful paths. |
|
For example, consider Chris Date's supplier example from his book Introduction to Database Systems. The relations are S(S#, SNAME, STATUS, CITY) Supplier number S# has name SNAME, status STATUS and is located in CITY P(P#, PNAME, COLOR, WEIGHT, CITY) Part number P# has name PNAME, colour COLOR, weight WEIGHT in pounds and is located in CITY SP(S#, P#, QTY) S# shipped QTY of P# There are two obvious ways to navigate from S# to P#. One can either go via the shipments relation SP (where S# and P# appear in the same relation), or one can join relations S and P on S.CITY = P.CITY. Presumably to define the Universal Relation one would be forced to pick one of these paths as the default, and rename attributes accordingly. E.g. we could arbitrarily assume the default relationship between suppliers and parts is through the shipments, and rename CITY as SCITY in S, and CITY as PCITY in P, so they have distinct roles in the Universal Relation. It seems that explicit joins are a better way to give the user precise control over the access paths. However I find it interesting that a 3- way natural join on S,P and SP imposes both of the above navigation paths at the same time, and that's rarely what's required. After thinking about that problem for a while, it occurred to me that access paths should emphasise navigation from relation to relation, rather than from attribute to attribute (as suggested by the hyper- graphs in the literature on the Universal Relation). The following describes some ideas for a query language on a relational database based on path expressions that allow relation names to be specified in the path. It is more closely related to the relational algebra than the calculus. It's assumed there are no NULLs and queries work on sets not bags. Relvars ------- The name of a base relvar evaluates to its current relation value. For example, the path expression P evaluates to the current value of the relvar named P. Restriction ----------- A restriction can be taken using a postfix operator which is a boolean expression enclosed in square brackets and parameterised on the attributes of the relation. For example, the path expression P[COLOR = 'red'] evaluates to the restriction on P where parts are red. For convenience it's assumed it's not necessary to specify a selector as in COLOR('red'). Projection ---------- The following path expression subsequently takes the projection onto attributes P# and WEIGHT: P[COLOR = 'red'].(P#,WEIGHT) Note that since path expressions evaluate to sets, we know that duplicates are removed from the projection. When projecting onto a single attribute the brackets aren't required. For example, S.S# is the projection of relation S onto attribute S#. Join with projection -------------------- Let R1 and R2 be relations, then the path expression R1.R2 denotes the natural join of R1 and R2 followed by a projection onto the attributes of R2. For example, P[COLOR = 'red'].SP retrieves the tuples in SP associated with the shipment of a red part. The idea that R1.R2 can drop attributes from R1 may seem strange. A benefit arises for example in path expressions like R1.R2.R3 because it can avoid joining on attributes which happen to have the same name in R1 and R3. This can help to control the joins along a path. For example, P[COLOR = 'red'].SP.S.SNAME retrieves the names of the suppliers who ship at least one red part, whereas P[COLOR = 'red'].S.SNAME returns the names of suppliers that are located in a city where there are red parts. This is exactly the distinction discussed in the introduction. A brief comparison to other query languages ------------------------------------------- From Date's book, the following SQL retrieves supplier names for suppliers who supply at least one red part: SELECT DISTINCT S.SNAME FROM S WHERE S.S# IN ( SELECT SP.S# FROM SP WHERE SP.P# IN ( SELECT P.P# FROM P WHERE P.COLOR = 'red')); The following expression in the relational algebra is somewhat more concise: ( ( ( P WHERE COLOR = 'red' ) JOIN SP ) { S# } JOIN S ) { SNAME } Note that the above needs to project the result of the join of P and SP onto { S# } before joining with S in order to avoid the join on P.CITY = S.CITY. For those interested in the comparison, Date also provides the same query in the relational calculus and the domain calculus. Evidently the path expression P[COLOR = 'red'].SP.S.SNAME is simpler than the above alternatives. Join without projection ----------------------- Given relations R1,...,Rn, (R1,...,Rn) denotes the natural join of R1,...,Rn. For example, (S,P) gets all combinations of supplier and part information such that they are located in the same city. Aggregate operators ------------------- Let R be a relation and every attribute of R is defined on a domain that supports summation. Then the path expression R.@sum denotes a relation with the same attribute names as R that contains a single tuple which maps each attribute name to the sum of that attribute over tuples from R. If R is empty then R.@sum contains a single tuple full of zeros. For example, SP[P# = 'P2'].QTY.@sum retrieves the total quantity of shipments for part P2. Similarly we define operators @min, @max, @avg, except there is no tuple in the output if the input relation is empty (i.e. has no tuples). R.@count evaluates to a relation with a single tuple with a single attribute which gives the number of tuples in R. For example, S.@count retrieves the total number of suppliers. R.@exists evaluates to a relation with a single tuple with a single boolean attribute that is true iff R is non-empty. Generalisation of projection ---------------------------- It was stated earlier that R.(x1,...,xn) denotes a projection of R onto attributes named x1,...,xn. This can be regarded as a special case of something much more general - where we allow path expressions instead of just attribute names for each xi. Formally the result is defined as a union of joins where the union is taken over the tuples t of R, and the joins are taken over the relations {t}.xi for each i where {t} denotes the relation consisting of the single tuple t. Typically the {t}.xi have no attributes in common so the joins are cartesian products. For example, P.(COLOR, SP) evaluates to the relation which is the natural join of P and SP, projected onto the attributes COLOR,S#,P#,QTY. To help understand this it may be helpful to pretend that the dot operator distributes over the comma separated list. i.e. we think of P.(COLOR, SP) as (P.COLOR, P.SP). However we don't simply take the cartesian product P.COLOR x P.SP. Instead we account for the shared prefix P in the paths which "correlate" these two path expressions. In other words, for each tuple from P that we use to navigate to its COLOR, we note that the same tuple is involved in the navigation from P to SP by joining on P.P# = SP.P#. It is convenient to support a syntax that adds extra attributes to an existing relation. The path expression R.(*,x1,...,xn) is equivalent to R.(y1,...,ym,x1,...,xn) where y1,...,ym are the attributes of R. The aggregate operators can be used to aggregating over groups. For example, P.(P#, SP.QTY.@sum) returns a relation where for each part, it returns the part number and the total shipment quantity (which could be zero if there are no shipments of that part). Note that P.(P#, SP.QTY.@min) retrieves the minimum quantity of shipments for each part number, but only for parts where there exists at least one shipment by some supplier. We can group over more than one attribute. For example, P.(PNAME,CITY).(*,P.WEIGHT.@sum) retrieves the sum of the weights, grouped over PNAME and CITY. As a more complicated example, P.(P#, SP.(QTY,S.STATUS).(@min,@max,@avg)) retrieves for each part, the part number and the minimum, maximum and average shipment QTY, and the minimum, maximum and average supplier status. However for this to make sense we would need some convention to assign unique attribute names in the output. Generalisation of restriction ----------------------------- So far it has have assumed that for the restriction R[b(x1,...,xn)], the boolean expression b can only be parameterised by attribute names x1,...,xn from R. This is generalised to allow the xi to be path expressions. R[b(x1,...,xn)] is formalised as a restriction on R where for each tuple t in R, b is evaluated as a boolean function parameterised on the {t}.xi, which most generally are relations. If however, a {t}.xi is a relation containing a single tuple defined on a single attribute, then it can be treated as a scalar value within the boolean expression. For example, P[SP.S.@count > 1].P# retrieves the part numbers of the parts supplied by more than one supplier. P[SP.S# = {'S2'}].P# retrieves part numbers for parts that are shipped only by supplier S2. S[not SP.P[P#='P2'].@exists].SNAME or equivalently, S[not 'P2' in SP.P.P#].SNAME retrieves names of suppliers that don't supply part P2. Renaming -------- Renaming of attributes is supported. For example S.(CITY as SCITY, SP.P.CITY as PCITY) retrieves all pairs of city names (SCITY, PCITY) such that a supplier in city SCITY supplies a part to city PCITY Calculated attributes --------------------- calculated attributes are supported. For example, P.(P#, WEIGHT*454 as W)[W > 10000] retrieves part number P# and weight W in grams for each part with weight > 10000 grams Path expressions with global scope ---------------------------------- A path expression that begins with '..' is assumed to have a global scope. For example: S[STATUS < ..S.STATUS.@max].S# retrieves supplier numbers for suppliers with status less than the current maximum status. As another example S[SP.P# = ..P.P#].SNAME retrieves supplier names for suppliers who ship all parts |
#14
| |||
| |||
|
|
For example, given binary relation R(x,y), one could define a unary function f that maps a set of x values to a set of y values as follows * * for all X, f(X) = {y | exists x in X, R(x,y)} It can be shown that f distributes over union and this "additive" property on sets is preserved over function composition. *In fact it is easy to show that composition of these unary functions is equivalent to a join + projection on the associated binary relations. |
#15
| |||
| |||
|
|
"David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:edd73d6f-678e-4a1c-bd1f-ac59e06f9bb1 (AT) s12g2000prc (DOT) googlegroups.com... I get the impression that a lot of the time, a query (or a large portion of a query) can be thought of as a navigation from attribute to attribute using a sequence of joins. This made me think there may be some merit in thinking of attribute names as global in nature and the user only has to specify the start and end attributes and the DBMS is able to deduce the joins that are required. The potential upshot would be that queries are simplified and the user is freed from the burden of understanding the way the data has been decomposed into separate relations. Just recently I read about the so called Universal Relation which takes this approach. However there are a number of criticisms - such as those described in a paper by Kent. I think the biggest problem is that quite often there are multiple useful paths. I disagree with your line of reasoning because relation names have significance. For example, suppose that amongst the myriad relations in the manufacturing system for a synthetic rubber plant, two stand out: one that specifies the recipies that /can be used/ to produce batches of rubber, and another with exactly the same attributes that specifies the recipies that /are being used/ to produce batches of rubber. Here it is not just the juxtaposition of attributes that conveys information but also the names of the relations. |
#16
| |||
| |||
|
|
On Apr 3, 4:31 am, David BL <davi... (AT) iinet (DOT) net.au> wrote: For example, given binary relation R(x,y), one could define a unary function f that maps a set of x values to a set of y values as follows for all X, f(X) = {y | exists x in X, R(x,y)} It can be shown that f distributes over union and this "additive" property on sets is preserved over function composition. In fact it is easy to show that composition of these unary functions is equivalent to a join + projection on the associated binary relations. My interpretation of this is the following. You took binary named relation (Codd model operates named relations), and abstracted away named perspective. Then, you suggest that the join between binary relations is defined the same way as in algebra of binary relations. This allows you to express graph queries. |
#17
| ||||||
| ||||||
|
|
I'm afraid I didn't follow all of that. |
|
I'll explain what I was stating in more detail... Let R1(x,y) and R2(y,z) be two named binary relations on named attributes. |
|
Define unary function fxy that maps sets of x's to sets of y's using R1 as follows: * * for all X, fxy(X) = {y | exists x in X, R1(x,y)} Similarly define fyz that maps sets of y's to sets of z's using R2 * * for all Y, fyz(Y) = {z | exists y in Y, R2(y,z)} Let R3(x,z) be the join of R1,R2 on y, projected onto {x,z}. *Let fxz be derived from R3 as follows: * * for all X, fxz(X) = {z | exists x in X, R3(x,z)} Then it turns out that fxz = fyz o fxy |
|
This justifies the use of these unary functions to express certain queries, and a functional notation can be simple and intuitive. However there are serious limitations with the approach - due to the fact that 1) there is often more than one useful binary relation between a given pair of attributes; and 2) relations aren't generally decomposable into binary relations in the sense of a JD. Given R(x,y), we can define 4 unary functions: * * for all X, f1(X) = {y | exists x in X, R(x,y)} * * for all Y, f2(Y) = {x | exists y in Y, R(x,y)} * * for all X, f3(X) = {y | for all x in X, R(x,y)} * * for all Y, f4(Y) = {x | for all y in Y, R(x,y)} f1,f2 distribute over union (giving them an additive characteristic) whereas f3,f4 distribute over intersection (giving them a subtractive characteristic). |
|
Each one of these preserves all the information in R: * * R(x,y) = { (x,y) | *y in f1({x}) } * * * * * *= { (x,y) | *x in f2({y}) } * * * * * *= { (x,y) | *y in f3({x}) } * * * * * *= { (x,y) | *x in f4({y}) } It seems that most of the time, queries are concerned with the existential form. However that is not always the case - such as Date's example of finding a supplier that supplies /all/ the parts. These unary functions could be used to answer queries like: *Find any cousin of every friend of every spouse of any ... I find it curious that the simple "duality" between existential and universally quantified forms for the unary functions don't seem to map simply to two alternative join operators in the RA. Let R1(x,y), R2(y,z) be two relations where y stands for all the attributes in common. *In the query language I defined, * * R1.R2 = { (y,z) in R2 | exists (x,y) in R1 } This can be seen as navigating from R1 into those tuples of R2 that are related to /any/ of the tuples of R1 according to the shared attributes. Consider that we define a new operator * where R1*R2 navigates into tuples of R2 that are related to /all/ tuples from R1 according to the shared attributes. * * R1*R2 = { (y,z) in R2 | (x,y') in R1 => (y',z) in R2 } |
|
That would allow for the path expression * * P*SP.S.S# which finds supplier numbers of suppliers that ships /all/ the parts. The following is interesting: * * P[color='red']*SP.P# This finds part numbers for any parts shipped by any supplier that ships all the red parts. *This may include parts that are not red. |
#18
| |||||
| |||||
|
|
I get the impression that a lot of the time, a query (or a large portion of a query) can be thought of as a navigation from attribute to attribute using a sequence of joins. This made me think there may be some merit in thinking of attribute names as global in nature and the user only has to specify the start and end attributes and the DBMS is able to deduce the joins that are required. |
|
I think the biggest problem is that quite often there are multiple useful paths. |
|
After thinking about that problem for a while, it occurred to me that access paths should emphasise navigation from relation to relation, rather than from attribute to attribute (as suggested by the hyper- graphs in the literature on the Universal Relation). |
|
The idea that R1.R2 can drop attributes from R1 may seem strange. A benefit arises for example in path expressions like R1.R2.R3 because it can avoid joining on attributes which happen to have the same name in R1 and R3. This can help to control the joins along a path. |
|
Renaming -------- Renaming of attributes is supported. For example S.(CITY as SCITY, SP.P.CITY as PCITY) retrieves all pairs of city names (SCITY, PCITY) such that a supplier in city SCITY supplies a part to city PCITY |
#19
| |||
| |||
|
|
David BL wrote: I get the impression that a lot of the time, a query (or a large portion of a query) can be thought of as a navigation from attribute to attribute using a sequence of joins. This made me think there may be some merit in thinking of attribute names as global in nature and the user only has to specify the start and end attributes and the DBMS is able to deduce the joins that are required. I like this idea, but it doesn't generalize to joins in general. I think the biggest problem is that quite often there are multiple useful paths. [...] Yes, and global renaming isn't always possible: PERSON(P#, NAME, DAYOFBIRTH, FATHER#, MOTHER#) Here, every FATHER# and every MOTHER# is also a P#. Pretty much all of our queries involving FATHER# and MOTHER# are going to join them with P#. We can't rename to express those joins. Splitting it out doesn't help: PERSON(P#, NAME, DAYOFBIRTH, FATHER#, MOTHER#) FATHER(FATHER#) MOTHER(MOTHER#) We can name FATHER's attribute FATHER# or P#, but not both. (BTW replacing the # attributes with natural keys doesn't help either.) After thinking about that problem for a while, it occurred to me that access paths should emphasise navigation from relation to relation, rather than from attribute to attribute (as suggested by the hyper- graphs in the literature on the Universal Relation). Yes, but sometimes you're going to have to specify which attributes to join on. [...] The idea that R1.R2 can drop attributes from R1 may seem strange. A benefit arises for example in path expressions like R1.R2.R3 because it can avoid joining on attributes which happen to have the same name in R1 and R3. This can help to control the joins along a path. Nice idea, but not completely general: it cannot express joins on FATHER#/MOTHER# above. [...] Renaming -------- Renaming of attributes is supported. For example S.(CITY as SCITY, SP.P.CITY as PCITY) retrieves all pairs of city names (SCITY, PCITY) such that a supplier in city SCITY supplies a part to city PCITY Now I'm happy :-) I initially assumed that SQL worked like this and was startled to discover that it doesn't. |
#20
| |||
| |||
|
|
G(x,y) = { (x,y) | exists y in {x}.f } |
![]() |
| Thread Tools | |
| Display Modes | |
| |