dbTalk Databases Forums  

Why use a composite PK ever?

comp.databases comp.databases


Discuss Why use a composite PK ever? in the comp.databases forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
AT
 
Posts: n/a

Default Why use a composite PK ever? - 06-08-2006 , 11:49 AM






In a fabulous book -- "Practical Issues in Database Management" -- the
fabulous author, Pascal Fabian, lists three reasons for choosing one
candidate PK over another: familiarity, stability, and simplicity. Just
started skimming through the book.

He notes further that those influenced by OO db design tend to use
simple, surrogate keys for PKs in all tables; that this is not
*precluded* by relational theory, but that it's somehow illicit.

I personally think it's a good rule of thumb to create surrogate keys
for almost all tables. Stability seems to be the single-most important
factor to consider. If the database can't uniquely identify a row,
what's the point? Choosing a surrogate key guarantees stability.

Why chance it on a composite key when you can be assured of uniqueness
with a sequence-generated surrogate key?

"Familiarity" seems like a spurious concern, and a poor tradeoff
against both stability(guaranteeing you are uniquely identifying rows)
and simplicity (with queries, and others intuiting your design).

Am I missing something? Why use a composite key *ever* (and by ever, I
mean for most purposes) aside from "familiarity?"

Could someone give a real-world example where "familiarity" is a
compelling reason to choose a composite PK, trumping both stability and
simplicity?

Finally, if there *are* any compelling reasons to prefer composite keys
in some situations, after how many attributes required to create a
composite key should one give up and create a surrogate key? I've seen
composite keys of up to 5 attributes and often wondered why that
particular decision was made. Seems inelegant, but then perhaps I'm
missing some other fundamental here.

It's a great book by the way. Should be required reading for most
database designers.

Dana


Reply With Quote
  #2  
Old   
AT
 
Posts: n/a

Default Re: Why use a composite PK ever? - 06-08-2006 , 12:18 PM






Sorry, I had a dyslexic moment - I meant author Fabian Pascal.

D.


Reply With Quote
  #3  
Old   
Gene Wirchenko
 
Posts: n/a

Default Re: Why use a composite PK ever? - 06-08-2006 , 02:30 PM



On 8 Jun 2006 09:49:02 -0700, dananrg (AT) yahoo (DOT) com wrote:

Quote:
In a fabulous book -- "Practical Issues in Database Management" -- the
fabulous author, Pascal Fabian, lists three reasons for choosing one
candidate PK over another: familiarity, stability, and simplicity. Just
started skimming through the book.

He notes further that those influenced by OO db design tend to use
simple, surrogate keys for PKs in all tables; that this is not
*precluded* by relational theory, but that it's somehow illicit.

I personally think it's a good rule of thumb to create surrogate keys
for almost all tables. Stability seems to be the single-most important
factor to consider. If the database can't uniquely identify a row,
what's the point? Choosing a surrogate key guarantees stability.
There is nothing stopping you from creating the same data under
two surrogate keys.

Quote:
Why chance it on a composite key when you can be assured of uniqueness
with a sequence-generated surrogate key?
You may have more than one surrogate for the same thing. Since
the surrogate is not intrinsic to the data (or you would have used it
for the PK), you can not stop this. Yes, uniqueness of what is used
for the PK can be tricky.

Quote:
"Familiarity" seems like a spurious concern, and a poor tradeoff
against both stability(guaranteeing you are uniquely identifying rows)
and simplicity (with queries, and others intuiting your design).
If I have a client code and an invoice number, it sure is easier
to go looking the invoice line table if its key starts with client
code - invoice number.

[snip]

I think it is somewhat a YMMV issue.

Sincerely,

Gene Wirchenko



Reply With Quote
  #4  
Old   
Michael Zedeler
 
Posts: n/a

Default Re: Why use a composite PK ever? - 06-09-2006 , 04:07 AM



dananrg (AT) yahoo (DOT) com wrote:
Quote:
[...]
Why chance it on a composite key when you can be assured of uniqueness
with a sequence-generated surrogate key?

"Familiarity" seems like a spurious concern, and a poor tradeoff
against both stability(guaranteeing you are uniquely identifying rows)
and simplicity (with queries, and others intuiting your design).

Am I missing something? Why use a composite key *ever* (and by ever, I
mean for most purposes) aside from "familiarity?"
I only use composite keys for tables that store many to many
relationships. Otherwise I agree with you - having composite keys that
have some kind of external interpretation is begging for trouble.

There are a few exceptions, such as here in Denmark where each citizen
has a personal id number that _never_ changes and is guaranteed to be
unique. It consist of the birth date and a serial number (subject to
certain checksum constraints), which can easilly be implemented as a
two-part key. In certain applications I would use that number as primary
key.

Regards,

Michael.
--
Which is more dangerous? TV guided missiles or TV guided families?
I am less likely to answer usenet postings by anonymous authors.
Visit my home page at http://michael.zedeler.dk/


Reply With Quote
  #5  
Old   
David Portas
 
Posts: n/a

Default Re: Why use a composite PK ever? - 06-09-2006 , 04:44 PM



dananrg (AT) yahoo (DOT) com wrote:
Quote:
In a fabulous book -- "Practical Issues in Database Management" -- the
fabulous author, Pascal Fabian, lists three reasons for choosing one
candidate PK over another: familiarity, stability, and simplicity. Just
started skimming through the book.

He notes further that those influenced by OO db design tend to use
simple, surrogate keys for PKs in all tables; that this is not
*precluded* by relational theory, but that it's somehow illicit.

I personally think it's a good rule of thumb to create surrogate keys
for almost all tables. Stability seems to be the single-most important
factor to consider. If the database can't uniquely identify a row,
what's the point? Choosing a surrogate key guarantees stability.

Why chance it on a composite key when you can be assured of uniqueness
with a sequence-generated surrogate key?

"Familiarity" seems like a spurious concern, and a poor tradeoff
against both stability(guaranteeing you are uniquely identifying rows)
and simplicity (with queries, and others intuiting your design).

Am I missing something? Why use a composite key *ever* (and by ever, I
mean for most purposes) aside from "familiarity?"

Could someone give a real-world example where "familiarity" is a
compelling reason to choose a composite PK, trumping both stability and
simplicity?

Finally, if there *are* any compelling reasons to prefer composite keys
in some situations, after how many attributes required to create a
composite key should one give up and create a surrogate key? I've seen
composite keys of up to 5 attributes and often wondered why that
particular decision was made. Seems inelegant, but then perhaps I'm
missing some other fundamental here.

It's a great book by the way. Should be required reading for most
database designers.

Dana
Many tables have more than one candidate key. The choice of one
particular key to be a primary key is unimportant in terms of data
integrity. (I'm referring to Codd's notion of a primary key. The SQL
syntax "PRIMARY KEY" is something else. It need not be the same thing.)

Familiarity and simplicity are as good a reasons as any to choose a
primary key.

Non-surrogate keys are crucial however. Without natural key constraints
how do you enforce business rules about uniqueness? 3NF and above
implies that every table with at least one non-key attribute must have
at least one natural (non-surrogate) key - otherwise there are no
determinants.

--
David Portas



Reply With Quote
  #6  
Old   
Bruce Lewis
 
Posts: n/a

Default Re: Why use a composite PK ever? - 06-09-2006 , 09:33 PM



dananrg (AT) yahoo (DOT) com writes:

Quote:
Finally, if there *are* any compelling reasons to prefer composite keys
in some situations, after how many attributes required to create a
composite key should one give up and create a surrogate key? I've seen
composite keys of up to 5 attributes and often wondered why that
particular decision was made. Seems inelegant, but then perhaps I'm
missing some other fundamental here.
If you're using a database like PostgreSQL or Oracle that supports tuple
comparison, give up after 6 or 8 columns.

If you're using a database like Microsoft SQL Server that does not
support tuple comparison, your queries will be more readable if you give
up after 2 or 3 columns.

For my opinion on the compelling reasons for choosing natural keys, take
the part of Gene's reply about uniqueness and put it in all caps with
eight exclamation points.

--

http://ourdoings.com/ Easily organize and disseminate news and
photos for your family or group.


Reply With Quote
  #7  
Old   
--CELKO--
 
Posts: n/a

Default Re: Why use a composite PK ever? - 06-10-2006 , 10:41 AM



Quote:
[simple, surrogate keys for PKs in all tables] I personally think it's a good rule of thumb to create surrogate keys for almost all tables.
No, absolutley no. First of all, they do not create surrogate keys;
they create exposed physical locators, like IDENTITY or other
auto-numbering schemes. Let me maker an appeal to authority, with a
quote from Dr. Codd: "..Database users may cause the system to generate
or delete a surrogate, but they have no control over its value, nor is
its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410)
and Codd, E. (1979), Extending the database relational model to capture
more meaning. ACM Transactions on Database Systems, 4(4). pp.
397-434.

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means never
used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result
that some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] -
is to introduce entity domains which contain system-assigned
surrogates. Database users may cause the system to generate or delete
a surrogate, but they have no control over its value, nor is its value
ever displayed to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture
more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

A real surrogate key maintained by the RDBMS itself, like an index and
the user never sees it, much less uses it in his code. Ever use
hashing in Teradata?

Quote:
Stability seems to be the single-most important factor to consider. If the database can't uniquely identify a row, what's the point? Choosing a surrogate key guarantees stability.
No, it makes the data unstable. You cannot validate or verify an
auto-numbering value. Think about it. I can look at an ISBN,
validate it with the check digit, then verify it with a search to
Amazon.com, B&N, Bowker, etc. in *any* database, on *any* SQL (or
non-SQL) engine.

By definition, a relational key is a subset of attributes of the
entity. You or your machinery do not invent it so much as discover it.


Quote:
Why chance it on a composite key when you can be assured of uniqueness with a sequence-generated surrogate key?
Why is something you can verify more risk than something you cannot
verify? You use relational keys for data integrity, data portability,
code portability and the fact that if the composite key exists in the
real world, then you **must** model it in your schema. What is the
universal, verifiable single integer for a given (longitude, latitude)
pair? If you are a Douglas Adams fan the answer is 42! Always 42?

All you have done is waste storage and create weak redundancy.

Quote:
"Familiarity" seems like a spurious concern, and a poor tradeoff against both stability (guaranteeing you are uniquely identifying rows) and simplicity (with queries, and others intuiting your design).
Wrong again. Familiarity, in the sense of codes that are universally
underestood by people in a given industry, make data exchange possible.
It also means that users can do "reasonableness" checks on data by
eye, with regular expressions, check digits and other programming
tools.

Quote:
Could someone give a real-world example where "familiarity" is a compelling reason to choose a composite PK, trumping both stability and simplicity?
(longitude, latitude). Geographical hierarchies. Any co-ordinate
system ((x,y), (x,y,z), log-spiral, circular, etc.)

Quote:
Finally, if there *are* any compelling reasons to prefer composite keys in some situations, after how many attributes required to create a composite key should one give up and create a surrogate key?
How big should a vehicle be? The real point is that if you have an
n-part composite key in the data model, you cannot make it go away with
a magic number (ever read the Khabal stuff that is a fad among movie
stars right now? Thinking there is a magic 17-digit Hebrew number that
God puts on everything is not science or logic).

Quote:
've seen composite keys of up to 5 attributes and often wondered why that particular decision was made. Seems inelegant, but then perhaps I'm missing some other fundamental here.
Data integrity and a correct data model. But people do not handle more
than five thigns very well, so that has been considered an upper limit
on human data processing ever since a guy named Brown wrote a paper on
it in the 1950's.

Let's look at the logical problems. First try to create a table with
two columns and try to make them both IDENTITY. If you cannot declare
more than one column to be of a certain data type, then that thing is
not a data type at all, by definition. It is a property which belongs
to the PHYSICAL table, not the LOGICAL data in the table. It also has
to be NULL-able to be a data type in SQL.

Next, create a table with one column and make it an IDENTITY. Now try
to insert, update and delete different numbers from it. If you cannot
insert, update and delete rows from a table, then it is not a table by
definition.

Finally create a simple table with one IDENTITY and a few other
columns. Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

To put a few rows into the table and notice that the IDENTITY
sequentially numbered them in the order they were presented. If you
delete a row, the gap in the sequence is not filled in and the sequence
continues from the highest number that has ever been used in that
column in that particular table. This is how we did record numbers in
pre-allocated sequential files in the 1950's, by the way. A utility
program would then "pack" or "compress" the records that were flagged
as deleted or unused to move the empty space to the physical end of the
physical file. IDENTITY leaves the gaps.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;

Since a query result is a table, and a table is a set which has no
ordering, what should the IDENTITY numbers be? The entire, whole,
completed set is presented to Foobar all at once, not a row at a time.
There are (n!) ways to number (n) rows, so which one do you pick? The
answer has been to use whatever the *physical* order of the result set
happened to be. That non-relational phrase "physical order" again!

But it is actually worse than that. If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order. Indexes and statistics are not part of the
logical model.

Can you explain from a logical model viewpoint why the same rows in the
second query get different IDENTITY numbers? In the relational model,
they should be treated the same if all the values of all the attributes
are identical.

Using IDENTITY as a primary key is a sign that there is no data model,
only an imitation of a sequential file system. Since this "magic,
all-purpose, one-size-fits-all" pseudo-identifier exists only as a
result of the physical state of a particular piece of hardware at a
particular time as read by the current release of a particular database
product, how do you verify that an entity has such a number in the
reality you are modeling?

Another cute way to destroy data integrity:

BEGIN ATOMIC
DELETE FROM Foobar
WHERE id = <<some row>>;
INSERT INTO Foobar
VALUES ( <<recreate deleted row>>)
END;

Logically this should do nothing, but since IDENTITY has gaps, it
trashes the data.

Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering. Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the IDENTITY and another uses the real key, and like a man
with two watches, you are never sure what time it is.

Daniel Wetzler found out the hard way that IDENTITY changed behavior in
SQL 2000 and SQL 2005. If you perform the statement below you get only
one dataset which has the described properties.

SELECT DISTINCT IDENTITY (INTEGER) AS fake_id, title1, ..
FROM Foobar
WHERE title1 IS NOT NULL
AND ..

The IDENTITY function makes each row unique so DISTINCT doesn't
eliminate the duplicates in this case. Interestingly, this behavior
seems to have changed in SQL Server 2005. If Iupoi run this as a SELECT
INTO on 2005, the execution plan computes the IDENTITY value after
DISTINCT.

For 2000 the kludge is a bit hard to see. The following should insert
just one row into the target table.

CREATE TABLE Foobar (title1 VARCHAR(10), ..);

INSERT INTO Foobar VALUES ('1', ..);
INSERT INTO Foobar VALUES ('1', ..);

SELECT IDENTITY (INTEGER) AS fake_id, title1, ..
INTO Foobar2
FROM (SELECT DISTINCT title1, ..
FROM Foobar
WHERE ..);

Since we are dealling with a proprietary feature, this is subject to
change without noti ce again.



Reply With Quote
  #8  
Old   
Tony Rogerson
 
Posts: n/a

Default Re: Why use a composite PK ever? - 06-11-2006 , 08:31 AM



Quote:
This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means never
used in queries, DRI or anything else that a user does.
That's your personal interpretation and not a lot of others and you've
actually missed out what Codd really said, I've summarised this in my blog:
http://sqlblogcasts.com/blogs/tonyro...05/11/427.aspx.

Your ideas around surrogate key usage is completely wrong.

You can still use surrogate keys in an application and they do not need to
be exposed to the application, surrogates are a good way of getting round
the problem when the natural key (primary key) changes - in that situation
you need to stop all user access to your database do the primary key update
and allow users back in, but first - expiring any cache; why do this? Well,
if you have used the primary key as the access through to the data from the
app then there might be old primary key floating around that when a user
clicks update may in fact update the wrong data!

Quote:
Let's look at the logical problems. First try to create a table with
two columns and try to make them both IDENTITY. If you cannot declare
more than one column to be of a certain data type, then that thing is
not a data type at all, by definition. It is a property which belongs
to the PHYSICAL table, not the LOGICAL data in the table. It also has
to be NULL-able to be a data type in SQL.
First, IDENTITY is a property of a column in the same way a PRIMARY KEY is
the property one or more columns - you can only have 1 primary key, you can
only have 1 identity property!

Again, IDENTITY is not a data type, its a property! Its a property I say,
sunk in yet? Its a property I say!!! and, it can exist on a nullable column.

Quote:
Next, create a table with one column and make it an IDENTITY. Now try
to insert, update and delete different numbers from it. If you cannot
insert, update and delete rows from a table, then it is not a table by
definition.
You cannot update the value of IDENTITY, you can delete rows or insert rows
using SET IDENTITY_INSERT <tbl> ON.

What is the property of a surrogate key again? Not changeable!!!

So, the IDENTITY property suddenly becomes valid in both COdd and Dates
definitions of a surrogate key.

Quote:
Finally create a simple table with one IDENTITY and a few other
columns. Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

To put a few rows into the table and notice that the IDENTITY
sequentially numbered them in the order they were presented. If you
delete a row, the gap in the sequence is not filled in and the sequence
continues from the highest number that has ever been used in that
column in that particular table. This is how we did record numbers in
pre-allocated sequential files in the 1950's, by the way. A utility
program would then "pack" or "compress" the records that were flagged
as deleted or unused to move the empty space to the physical end of the
physical file. IDENTITY leaves the gaps.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;

Since a query result is a table, and a table is a set which has no
ordering, what should the IDENTITY numbers be? The entire, whole,
completed set is presented to Foobar all at once, not a row at a time.
There are (n!) ways to number (n) rows, so which one do you pick? The
answer has been to use whatever the *physical* order of the result set
happened to be. That non-relational phrase "physical order" again!

But it is actually worse than that. If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order. Indexes and statistics are not part of the
logical model.

Can you explain from a logical model viewpoint why the same rows in the
second query get different IDENTITY numbers? In the relational model,
they should be treated the same if all the values of all the attributes
are identical.
Just how does that prevent usage as a surrogate key?

Quote:
Using IDENTITY as a primary key is a sign that there is no data model,
only an imitation of a sequential file system. Since this "magic,
all-purpose, one-size-fits-all" pseudo-identifier exists only as a
result of the physical state of a particular piece of hardware at a
particular time as read by the current release of a particular database
product, how do you verify that an entity has such a number in the
reality you are modeling?
What do you do when you have no usable natural key, for instance a member
table of a user group? You can't ask people for their national insurance
number - you need to create your own membership id! I'd be interested to
know what primary key you'd create when you are only capturing FullName and
Address?

Remember - there are data protection rules for capturing data that is not
necessary for the purpose you are using it, well here in the UK anyway.,

Quote:
Another cute way to destroy data integrity:

BEGIN ATOMIC
DELETE FROM Foobar
WHERE id = <<some row>>;
INSERT INTO Foobar
VALUES ( <<recreate deleted row>>)
END;

Logically this should do nothing, but since IDENTITY has gaps, it
trashes the data.
SET IDENTITY_INSERT Foobar ON
INSERT INTO Foobar ( always specify your columns you are inserting into!!! )
VALUES ( <<recreate deleted row>>)

SET IDENTITY_INSERT Foobar OFF

Its REALLY BAD PRACTICE not specifying the columns you are inserting into on
the INSERT - you should know that and should set a good example.

Quote:
Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering. Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the IDENTITY and another uses the real key, and like a man
with two watches, you are never sure what time it is.

Daniel Wetzler found out the hard way that IDENTITY changed behavior in
SQL 2000 and SQL 2005. If you perform the statement below you get only
one dataset which has the described properties.

SELECT DISTINCT IDENTITY (INTEGER) AS fake_id, title1, ..
FROM Foobar
WHERE title1 IS NOT NULL
AND ..

The IDENTITY function makes each row unique so DISTINCT doesn't
eliminate the duplicates in this case. Interestingly, this behavior
seems to have changed in SQL Server 2005. If Iupoi run this as a SELECT
INTO on 2005, the execution plan computes the IDENTITY value after
DISTINCT.

For 2000 the kludge is a bit hard to see. The following should insert
just one row into the target table.

CREATE TABLE Foobar (title1 VARCHAR(10), ..);

INSERT INTO Foobar VALUES ('1', ..);
INSERT INTO Foobar VALUES ('1', ..);

SELECT IDENTITY (INTEGER) AS fake_id, title1, ..
INTO Foobar2
FROM (SELECT DISTINCT title1, ..
FROM Foobar
WHERE ..);

Since we are dealling with a proprietary feature, this is subject to
change without noti ce again.

Thats because people have relied not on documented behaviour as specified in
books online but believing the product behaves that way, which, given
another example, people have come unstuck with the ORDER BY in a view!

You really need to get a handle on your bias against IDENTITY, you still do
not understand (after 5 years now, your post is an old canned answer by the
way) the implementation and make fundemental mistakes which is why your
answer is just out and out wrong.

--
Tony Rogerson
SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson - technical commentary from a SQL
Server Consultant
http://sqlserverfaq.com - free video tutorials


"--CELKO--" <jcelko212 (AT) earthlink (DOT) net> wrote

Quote:
[simple, surrogate keys for PKs in all tables] I personally think it's
a good rule of thumb to create surrogate keys for almost all tables.

No, absolutley no. First of all, they do not create surrogate keys;
they create exposed physical locators, like IDENTITY or other
auto-numbering schemes. Let me maker an appeal to authority, with a
quote from Dr. Codd: "..Database users may cause the system to generate
or delete a surrogate, but they have no control over its value, nor is
its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410)
and Codd, E. (1979), Extending the database relational model to capture
more meaning. ACM Transactions on Database Systems, 4(4). pp.
397-434.

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means never
used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result
that some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] -
is to introduce entity domains which contain system-assigned
surrogates. Database users may cause the system to generate or delete
a surrogate, but they have no control over its value, nor is its value
ever displayed to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture
more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

A real surrogate key maintained by the RDBMS itself, like an index and
the user never sees it, much less uses it in his code. Ever use
hashing in Teradata?

Stability seems to be the single-most important factor to consider. If
the database can't uniquely identify a row, what's the point? Choosing a
surrogate key guarantees stability.

No, it makes the data unstable. You cannot validate or verify an
auto-numbering value. Think about it. I can look at an ISBN,
validate it with the check digit, then verify it with a search to
Amazon.com, B&N, Bowker, etc. in *any* database, on *any* SQL (or
non-SQL) engine.

By definition, a relational key is a subset of attributes of the
entity. You or your machinery do not invent it so much as discover it.


Why chance it on a composite key when you can be assured of uniqueness
with a sequence-generated surrogate key?

Why is something you can verify more risk than something you cannot
verify? You use relational keys for data integrity, data portability,
code portability and the fact that if the composite key exists in the
real world, then you **must** model it in your schema. What is the
universal, verifiable single integer for a given (longitude, latitude)
pair? If you are a Douglas Adams fan the answer is 42! Always 42?

All you have done is waste storage and create weak redundancy.

"Familiarity" seems like a spurious concern, and a poor tradeoff against
both stability (guaranteeing you are uniquely identifying rows) and
simplicity (with queries, and others intuiting your design).

Wrong again. Familiarity, in the sense of codes that are universally
underestood by people in a given industry, make data exchange possible.
It also means that users can do "reasonableness" checks on data by
eye, with regular expressions, check digits and other programming
tools.

Could someone give a real-world example where "familiarity" is a
compelling reason to choose a composite PK, trumping both stability and
simplicity?

(longitude, latitude). Geographical hierarchies. Any co-ordinate
system ((x,y), (x,y,z), log-spiral, circular, etc.)

Finally, if there *are* any compelling reasons to prefer composite keys
in some situations, after how many attributes required to create a
composite key should one give up and create a surrogate key?

How big should a vehicle be? The real point is that if you have an
n-part composite key in the data model, you cannot make it go away with
a magic number (ever read the Khabal stuff that is a fad among movie
stars right now? Thinking there is a magic 17-digit Hebrew number that
God puts on everything is not science or logic).

've seen composite keys of up to 5 attributes and often wondered why
that particular decision was made. Seems inelegant, but then perhaps I'm
missing some other fundamental here.

Data integrity and a correct data model. But people do not handle more
than five thigns very well, so that has been considered an upper limit
on human data processing ever since a guy named Brown wrote a paper on
it in the 1950's.

Let's look at the logical problems. First try to create a table with
two columns and try to make them both IDENTITY. If you cannot declare
more than one column to be of a certain data type, then that thing is
not a data type at all, by definition. It is a property which belongs
to the PHYSICAL table, not the LOGICAL data in the table. It also has
to be NULL-able to be a data type in SQL.

Next, create a table with one column and make it an IDENTITY. Now try
to insert, update and delete different numbers from it. If you cannot
insert, update and delete rows from a table, then it is not a table by
definition.

Finally create a simple table with one IDENTITY and a few other
columns. Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

To put a few rows into the table and notice that the IDENTITY
sequentially numbered them in the order they were presented. If you
delete a row, the gap in the sequence is not filled in and the sequence
continues from the highest number that has ever been used in that
column in that particular table. This is how we did record numbers in
pre-allocated sequential files in the 1950's, by the way. A utility
program would then "pack" or "compress" the records that were flagged
as deleted or unused to move the empty space to the physical end of the
physical file. IDENTITY leaves the gaps.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;

Since a query result is a table, and a table is a set which has no
ordering, what should the IDENTITY numbers be? The entire, whole,
completed set is presented to Foobar all at once, not a row at a time.
There are (n!) ways to number (n) rows, so which one do you pick? The
answer has been to use whatever the *physical* order of the result set
happened to be. That non-relational phrase "physical order" again!

But it is actually worse than that. If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order. Indexes and statistics are not part of the
logical model.

Can you explain from a logical model viewpoint why the same rows in the
second query get different IDENTITY numbers? In the relational model,
they should be treated the same if all the values of all the attributes
are identical.

Using IDENTITY as a primary key is a sign that there is no data model,
only an imitation of a sequential file system. Since this "magic,
all-purpose, one-size-fits-all" pseudo-identifier exists only as a
result of the physical state of a particular piece of hardware at a
particular time as read by the current release of a particular database
product, how do you verify that an entity has such a number in the
reality you are modeling?

Another cute way to destroy data integrity:

BEGIN ATOMIC
DELETE FROM Foobar
WHERE id = <<some row>>;
INSERT INTO Foobar
VALUES ( <<recreate deleted row>>)
END;

Logically this should do nothing, but since IDENTITY has gaps, it
trashes the data.

Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering. Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the IDENTITY and another uses the real key, and like a man
with two watches, you are never sure what time it is.

Daniel Wetzler found out the hard way that IDENTITY changed behavior in
SQL 2000 and SQL 2005. If you perform the statement below you get only
one dataset which has the described properties.

SELECT DISTINCT IDENTITY (INTEGER) AS fake_id, title1, ..
FROM Foobar
WHERE title1 IS NOT NULL
AND ..

The IDENTITY function makes each row unique so DISTINCT doesn't
eliminate the duplicates in this case. Interestingly, this behavior
seems to have changed in SQL Server 2005. If Iupoi run this as a SELECT
INTO on 2005, the execution plan computes the IDENTITY value after
DISTINCT.

For 2000 the kludge is a bit hard to see. The following should insert
just one row into the target table.

CREATE TABLE Foobar (title1 VARCHAR(10), ..);

INSERT INTO Foobar VALUES ('1', ..);
INSERT INTO Foobar VALUES ('1', ..);

SELECT IDENTITY (INTEGER) AS fake_id, title1, ..
INTO Foobar2
FROM (SELECT DISTINCT title1, ..
FROM Foobar
WHERE ..);

Since we are dealling with a proprietary feature, this is subject to
change without noti ce again.




Reply With Quote
  #9  
Old   
AT
 
Posts: n/a

Default Re: Why use a composite PK ever? - 06-13-2006 , 03:52 AM



Michael Zedeler wrote:
Quote:
dananrg (AT) yahoo (DOT) com wrote:
[...]

There are a few exceptions, such as here in Denmark where each citizen
has a personal id number that _never_ changes and is guaranteed to be
unique. It consist of the birth date and a serial number (subject to
certain checksum constraints), which can easilly be implemented as a
two-part key. In certain applications I would use that number as primary
key.
Khe, khe are you really working with this database or just heard of it,
that personal id is never changing? Funnily we in Latvia had our db
created with help of Denmark consultants and as a result I assume
recieved more or less the same data model. So in real life it turned
out to be changing in about 0.5% cases. So I'd be interested how you in
Denmark bypass at least following problems we had:
1) person finds out that his birthday has changed or was incorrectly
written in his identification documents
2) operator simply enters incorrect birthdate
3) two persons get the same personal id - this probably was our problem
only because initially there wasn't all data enetered online in central
db
4) a person get personal id two times - this is probably a bigger
problem and mostly for foreigners.

So now we have a new system and personal id isn't primary key there, it
is just the same attribute as name, sex and address for example.
So after such cases I'm very sceptical about natural keys as primary
keys and tend to use surrogate keys everywhere and use unique
constraints for natural keys.
I'm ready to accept that natural keys for the first 80 elements of
periodic table probably are immutable )

Gints
http://www.gplivna.eu/



Reply With Quote
  #10  
Old   
Leif B. Kristensen
 
Posts: n/a

Default Re: Why use a composite PK ever? - 06-13-2006 , 08:22 AM



gints.plivna (AT) gmail (DOT) com skrev:

Quote:
Khe, khe are you really working with this database or just heard of
it, that personal id is never changing?
In the databases of the Norwegian national health insurance, where I'm
working, we're doing it the same way. And our databases cover the
entire population of Norway.

Quote:
Funnily we in Latvia had our
db created with help of Denmark consultants and as a result I assume
recieved more or less the same data model. So in real life it turned
out to be changing in about 0.5% cases. So I'd be interested how you
in Denmark bypass at least following problems we had:
1) person finds out that his birthday has changed or was incorrectly
written in his identification documents
2) operator simply enters incorrect birthdate
3) two persons get the same personal id - this probably was our
problem only because initially there wasn't all data enetered online
in central db
4) a person get personal id two times - this is probably a bigger
problem and mostly for foreigners.

So now we have a new system and personal id isn't primary key there,
it is just the same attribute as name, sex and address for example.
So after such cases I'm very sceptical about natural keys as primary
keys and tend to use surrogate keys everywhere and use unique
constraints for natural keys.
I'm ready to accept that natural keys for the first 80 elements of
periodic table probably are immutable )
The problems you mention are real, but in a mature system there are of
course established routines for handling them. No two persons can ever
get the same person ID, because every ID is issued from the same
central authority which is a department under the national tax
register.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.