![]() | |
#1
| |||
| |||
|
#2
| |||
| |||
|
#3
| |||
| |||
|
|
In a fabulous book -- "Practical Issues in Database Management" -- the fabulous author, Pascal Fabian, lists three reasons for choosing one candidate PK over another: familiarity, stability, and simplicity. Just started skimming through the book. He notes further that those influenced by OO db design tend to use simple, surrogate keys for PKs in all tables; that this is not *precluded* by relational theory, but that it's somehow illicit. I personally think it's a good rule of thumb to create surrogate keys for almost all tables. Stability seems to be the single-most important factor to consider. If the database can't uniquely identify a row, what's the point? Choosing a surrogate key guarantees stability. |
|
Why chance it on a composite key when you can be assured of uniqueness with a sequence-generated surrogate key? |
|
"Familiarity" seems like a spurious concern, and a poor tradeoff against both stability(guaranteeing you are uniquely identifying rows) and simplicity (with queries, and others intuiting your design). |
#4
| |||
| |||
|
|
[...] Why chance it on a composite key when you can be assured of uniqueness with a sequence-generated surrogate key? "Familiarity" seems like a spurious concern, and a poor tradeoff against both stability(guaranteeing you are uniquely identifying rows) and simplicity (with queries, and others intuiting your design). Am I missing something? Why use a composite key *ever* (and by ever, I mean for most purposes) aside from "familiarity?" |
#5
| |||
| |||
|
|
In a fabulous book -- "Practical Issues in Database Management" -- the fabulous author, Pascal Fabian, lists three reasons for choosing one candidate PK over another: familiarity, stability, and simplicity. Just started skimming through the book. He notes further that those influenced by OO db design tend to use simple, surrogate keys for PKs in all tables; that this is not *precluded* by relational theory, but that it's somehow illicit. I personally think it's a good rule of thumb to create surrogate keys for almost all tables. Stability seems to be the single-most important factor to consider. If the database can't uniquely identify a row, what's the point? Choosing a surrogate key guarantees stability. Why chance it on a composite key when you can be assured of uniqueness with a sequence-generated surrogate key? "Familiarity" seems like a spurious concern, and a poor tradeoff against both stability(guaranteeing you are uniquely identifying rows) and simplicity (with queries, and others intuiting your design). Am I missing something? Why use a composite key *ever* (and by ever, I mean for most purposes) aside from "familiarity?" Could someone give a real-world example where "familiarity" is a compelling reason to choose a composite PK, trumping both stability and simplicity? Finally, if there *are* any compelling reasons to prefer composite keys in some situations, after how many attributes required to create a composite key should one give up and create a surrogate key? I've seen composite keys of up to 5 attributes and often wondered why that particular decision was made. Seems inelegant, but then perhaps I'm missing some other fundamental here. It's a great book by the way. Should be required reading for most database designers. Dana |
#6
| |||
| |||
|
|
Finally, if there *are* any compelling reasons to prefer composite keys in some situations, after how many attributes required to create a composite key should one give up and create a surrogate key? I've seen composite keys of up to 5 attributes and often wondered why that particular decision was made. Seems inelegant, but then perhaps I'm missing some other fundamental here. |
#7
| |||||||
| |||||||
|
|
[simple, surrogate keys for PKs in all tables] I personally think it's a good rule of thumb to create surrogate keys for almost all tables. |
|
Stability seems to be the single-most important factor to consider. If the database can't uniquely identify a row, what's the point? Choosing a surrogate key guarantees stability. |
|
Why chance it on a composite key when you can be assured of uniqueness with a sequence-generated surrogate key? |
|
"Familiarity" seems like a spurious concern, and a poor tradeoff against both stability (guaranteeing you are uniquely identifying rows) and simplicity (with queries, and others intuiting your design). |
|
Could someone give a real-world example where "familiarity" is a compelling reason to choose a composite PK, trumping both stability and simplicity? |
|
Finally, if there *are* any compelling reasons to prefer composite keys in some situations, after how many attributes required to create a composite key should one give up and create a surrogate key? |
|
've seen composite keys of up to 5 attributes and often wondered why that particular decision was made. Seems inelegant, but then perhaps I'm missing some other fundamental here. |
#8
| ||||||||
| ||||||||
|
|
This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does. |
|
Let's look at the logical problems. First try to create a table with two columns and try to make them both IDENTITY. If you cannot declare more than one column to be of a certain data type, then that thing is not a data type at all, by definition. It is a property which belongs to the PHYSICAL table, not the LOGICAL data in the table. It also has to be NULL-able to be a data type in SQL. |
|
Next, create a table with one column and make it an IDENTITY. Now try to insert, update and delete different numbers from it. If you cannot insert, update and delete rows from a table, then it is not a table by definition. |
|
Finally create a simple table with one IDENTITY and a few other columns. Use a few statements like INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1'); INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2'); INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3'); To put a few rows into the table and notice that the IDENTITY sequentially numbered them in the order they were presented. If you delete a row, the gap in the sequence is not filled in and the sequence continues from the highest number that has ever been used in that column in that particular table. This is how we did record numbers in pre-allocated sequential files in the 1950's, by the way. A utility program would then "pack" or "compress" the records that were flagged as deleted or unused to move the empty space to the physical end of the physical file. IDENTITY leaves the gaps. But now use a statement with a query expression in it, like this: INSERT INTO Foobar (a, b, c) SELECT x, y, z FROM Floob; Since a query result is a table, and a table is a set which has no ordering, what should the IDENTITY numbers be? The entire, whole, completed set is presented to Foobar all at once, not a row at a time. There are (n!) ways to number (n) rows, so which one do you pick? The answer has been to use whatever the *physical* order of the result set happened to be. That non-relational phrase "physical order" again! But it is actually worse than that. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result set back in a different physical order. Indexes and statistics are not part of the logical model. Can you explain from a logical model viewpoint why the same rows in the second query get different IDENTITY numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical. |
|
Using IDENTITY as a primary key is a sign that there is no data model, only an imitation of a sequential file system. Since this "magic, all-purpose, one-size-fits-all" pseudo-identifier exists only as a result of the physical state of a particular piece of hardware at a particular time as read by the current release of a particular database product, how do you verify that an entity has such a number in the reality you are modeling? |
|
Another cute way to destroy data integrity: BEGIN ATOMIC DELETE FROM Foobar WHERE id = <<some row>>; INSERT INTO Foobar VALUES ( <<recreate deleted row>>) END; Logically this should do nothing, but since IDENTITY has gaps, it trashes the data. |
|
Now you are REALLY thinking about relations and keys instead of 1950's sequential record numbering. Adding an IDENTITY column to either of these tables as a candidate key would be dangerously redundant; one query uses the IDENTITY and another uses the real key, and like a man with two watches, you are never sure what time it is. Daniel Wetzler found out the hard way that IDENTITY changed behavior in SQL 2000 and SQL 2005. If you perform the statement below you get only one dataset which has the described properties. SELECT DISTINCT IDENTITY (INTEGER) AS fake_id, title1, .. FROM Foobar WHERE title1 IS NOT NULL AND .. The IDENTITY function makes each row unique so DISTINCT doesn't eliminate the duplicates in this case. Interestingly, this behavior seems to have changed in SQL Server 2005. If Iupoi run this as a SELECT INTO on 2005, the execution plan computes the IDENTITY value after DISTINCT. For 2000 the kludge is a bit hard to see. The following should insert just one row into the target table. CREATE TABLE Foobar (title1 VARCHAR(10), ..); INSERT INTO Foobar VALUES ('1', ..); INSERT INTO Foobar VALUES ('1', ..); SELECT IDENTITY (INTEGER) AS fake_id, title1, .. INTO Foobar2 FROM (SELECT DISTINCT title1, .. FROM Foobar WHERE ..); Since we are dealling with a proprietary feature, this is subject to change without noti ce again. |
|
[simple, surrogate keys for PKs in all tables] I personally think it's a good rule of thumb to create surrogate keys for almost all tables. No, absolutley no. First of all, they do not create surrogate keys; they create exposed physical locators, like IDENTITY or other auto-numbering schemes. Let me maker an appeal to authority, with a quote from Dr. Codd: "..Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434. This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does. Codd also wrote the following: "There are three difficulties in employing user-controlled keys as permanent surrogates for entities. (1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.). (2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same. (3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree). These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410). References Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434 A real surrogate key maintained by the RDBMS itself, like an index and the user never sees it, much less uses it in his code. Ever use hashing in Teradata? Stability seems to be the single-most important factor to consider. If the database can't uniquely identify a row, what's the point? Choosing a surrogate key guarantees stability. No, it makes the data unstable. You cannot validate or verify an auto-numbering value. Think about it. I can look at an ISBN, validate it with the check digit, then verify it with a search to Amazon.com, B&N, Bowker, etc. in *any* database, on *any* SQL (or non-SQL) engine. By definition, a relational key is a subset of attributes of the entity. You or your machinery do not invent it so much as discover it. Why chance it on a composite key when you can be assured of uniqueness with a sequence-generated surrogate key? Why is something you can verify more risk than something you cannot verify? You use relational keys for data integrity, data portability, code portability and the fact that if the composite key exists in the real world, then you **must** model it in your schema. What is the universal, verifiable single integer for a given (longitude, latitude) pair? If you are a Douglas Adams fan the answer is 42! Always 42? All you have done is waste storage and create weak redundancy. "Familiarity" seems like a spurious concern, and a poor tradeoff against both stability (guaranteeing you are uniquely identifying rows) and simplicity (with queries, and others intuiting your design). Wrong again. Familiarity, in the sense of codes that are universally underestood by people in a given industry, make data exchange possible. It also means that users can do "reasonableness" checks on data by eye, with regular expressions, check digits and other programming tools. Could someone give a real-world example where "familiarity" is a compelling reason to choose a composite PK, trumping both stability and simplicity? (longitude, latitude). Geographical hierarchies. Any co-ordinate system ((x,y), (x,y,z), log-spiral, circular, etc.) Finally, if there *are* any compelling reasons to prefer composite keys in some situations, after how many attributes required to create a composite key should one give up and create a surrogate key? How big should a vehicle be? The real point is that if you have an n-part composite key in the data model, you cannot make it go away with a magic number (ever read the Khabal stuff that is a fad among movie stars right now? Thinking there is a magic 17-digit Hebrew number that God puts on everything is not science or logic). 've seen composite keys of up to 5 attributes and often wondered why that particular decision was made. Seems inelegant, but then perhaps I'm missing some other fundamental here. Data integrity and a correct data model. But people do not handle more than five thigns very well, so that has been considered an upper limit on human data processing ever since a guy named Brown wrote a paper on it in the 1950's. Let's look at the logical problems. First try to create a table with two columns and try to make them both IDENTITY. If you cannot declare more than one column to be of a certain data type, then that thing is not a data type at all, by definition. It is a property which belongs to the PHYSICAL table, not the LOGICAL data in the table. It also has to be NULL-able to be a data type in SQL. Next, create a table with one column and make it an IDENTITY. Now try to insert, update and delete different numbers from it. If you cannot insert, update and delete rows from a table, then it is not a table by definition. Finally create a simple table with one IDENTITY and a few other columns. Use a few statements like INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1'); INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2'); INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3'); To put a few rows into the table and notice that the IDENTITY sequentially numbered them in the order they were presented. If you delete a row, the gap in the sequence is not filled in and the sequence continues from the highest number that has ever been used in that column in that particular table. This is how we did record numbers in pre-allocated sequential files in the 1950's, by the way. A utility program would then "pack" or "compress" the records that were flagged as deleted or unused to move the empty space to the physical end of the physical file. IDENTITY leaves the gaps. But now use a statement with a query expression in it, like this: INSERT INTO Foobar (a, b, c) SELECT x, y, z FROM Floob; Since a query result is a table, and a table is a set which has no ordering, what should the IDENTITY numbers be? The entire, whole, completed set is presented to Foobar all at once, not a row at a time. There are (n!) ways to number (n) rows, so which one do you pick? The answer has been to use whatever the *physical* order of the result set happened to be. That non-relational phrase "physical order" again! But it is actually worse than that. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result set back in a different physical order. Indexes and statistics are not part of the logical model. Can you explain from a logical model viewpoint why the same rows in the second query get different IDENTITY numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical. Using IDENTITY as a primary key is a sign that there is no data model, only an imitation of a sequential file system. Since this "magic, all-purpose, one-size-fits-all" pseudo-identifier exists only as a result of the physical state of a particular piece of hardware at a particular time as read by the current release of a particular database product, how do you verify that an entity has such a number in the reality you are modeling? Another cute way to destroy data integrity: BEGIN ATOMIC DELETE FROM Foobar WHERE id = <<some row>>; INSERT INTO Foobar VALUES ( <<recreate deleted row>>) END; Logically this should do nothing, but since IDENTITY has gaps, it trashes the data. Now you are REALLY thinking about relations and keys instead of 1950's sequential record numbering. Adding an IDENTITY column to either of these tables as a candidate key would be dangerously redundant; one query uses the IDENTITY and another uses the real key, and like a man with two watches, you are never sure what time it is. Daniel Wetzler found out the hard way that IDENTITY changed behavior in SQL 2000 and SQL 2005. If you perform the statement below you get only one dataset which has the described properties. SELECT DISTINCT IDENTITY (INTEGER) AS fake_id, title1, .. FROM Foobar WHERE title1 IS NOT NULL AND .. The IDENTITY function makes each row unique so DISTINCT doesn't eliminate the duplicates in this case. Interestingly, this behavior seems to have changed in SQL Server 2005. If Iupoi run this as a SELECT INTO on 2005, the execution plan computes the IDENTITY value after DISTINCT. For 2000 the kludge is a bit hard to see. The following should insert just one row into the target table. CREATE TABLE Foobar (title1 VARCHAR(10), ..); INSERT INTO Foobar VALUES ('1', ..); INSERT INTO Foobar VALUES ('1', ..); SELECT IDENTITY (INTEGER) AS fake_id, title1, .. INTO Foobar2 FROM (SELECT DISTINCT title1, .. FROM Foobar WHERE ..); Since we are dealling with a proprietary feature, this is subject to change without noti ce again. |
#9
| |||
| |||
|
|
dananrg (AT) yahoo (DOT) com wrote: [...] There are a few exceptions, such as here in Denmark where each citizen has a personal id number that _never_ changes and is guaranteed to be unique. It consist of the birth date and a serial number (subject to certain checksum constraints), which can easilly be implemented as a two-part key. In certain applications I would use that number as primary key. Khe, khe are you really working with this database or just heard of it, |
)
#10
| |||
| |||
|
|
Khe, khe are you really working with this database or just heard of it, that personal id is never changing? |
|
Funnily we in Latvia had our db created with help of Denmark consultants and as a result I assume recieved more or less the same data model. So in real life it turned out to be changing in about 0.5% cases. So I'd be interested how you in Denmark bypass at least following problems we had: 1) person finds out that his birthday has changed or was incorrectly written in his identification documents 2) operator simply enters incorrect birthdate 3) two persons get the same personal id - this probably was our problem only because initially there wasn't all data enetered online in central db 4) a person get personal id two times - this is probably a bigger problem and mostly for foreigners. So now we have a new system and personal id isn't primary key there, it is just the same attribute as name, sex and address for example. So after such cases I'm very sceptical about natural keys as primary keys and tend to use surrogate keys everywhere and use unique constraints for natural keys. I'm ready to accept that natural keys for the first 80 elements of periodic table probably are immutable ) |
![]() |
| Thread Tools | |
| Display Modes | |
| |