![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
I'm confused about some recent trends in database design. When I learned databases, about a quarter of a century ago, the preferred practice was to use natural keys to refer to individual table rows, unless there's some real good reason to go with an artificial key. I've run into a few cases where I chose to use an artificial key, but most of the time I've used a natural key. (I should clarify: if a university's registration office assigns each student a StudentID, before entering that student into the database, I'm treating that key as "natural" in the context of the database, even though one might argue that it's "artificial" in some other context.) This generally surfaced in the choice of one or more columns to be declared as a PRIMARY KEY. Every time the issue of natural keys comes up in Stackoverflow.com, the prevailing view seems to be that the best primary keys are artificial and opaque. And responses that take this view get the votes. They emphasize efficiency (mainly efficiency in the index that you usually get automatically when you declare a PRIMARY KEY). But data integrity seems to be forgotten, here. The same is not generally tru when the conversation turns to referential integrity. Slowly but surely the programming community seems to heve been dragged, kicking and screaming, into turning over referential integrity enforcement to the DBMS. What I don't get is whether I've understood something that today's trendsetters never learned or whether they are taking into consideration matters that I never considered important, back in the day. I've read their arguments carefully, and it seems to me that failure to declare UNIQUE and NOT NULL constraints on the most widely used candidate keys is simply an invitation to degradation of data integrity. PRIMARY KEY is a convenient way to get those two constraints with one declaration. Is data integrity less important than it used to be? Are most people building databases that get embedded in some appplication to be sold to the non technical buyer? Is something else going on that I'm unaware of? Can some body explain to me what these people are thinking? |
#3
| |||
| |||
|
|
Every time the issue of natural keys comes up in Stackoverflow.com, the prevailing view seems to be that the best primary keys are artificial and opaque. |
|
Is data integrity less important than it used to be? |
#4
| |||
| |||
|
|
I'm confused about some recent trends in database design. |
|
What I don't get is whether I've understood something that today's trendsetters never learned or whether they are taking into consideration matters that I never considered important, back in the day. I've read their arguments carefully, and it seems to me that failure to declare UNIQUE and NOT NULL constraints on the most widely used candidate keys is simply an invitation to degradation of data integrity. PRIMARY KEY is a convenient way to get those two constraints with one declaration. |
|
Can some body explain to me what these people are thinking? |
#5
| |||
| |||
|
|
I'm confused about some recent trends in database design. When I learned databases, about a quarter of a century ago, the preferred practice was to use natural keys to refer to individual table rows, unless there's some real good reason to go with an artificial key. I've run into a few cases where I chose to use an artificial key, but most of the time I've used a natural key. (I should clarify: if a university's registration office assigns each student a StudentID, before entering that student into the database, I'm treating that key as "natural" in the context of the database, even though one might argue that it's "artificial" in some other context.) This generally surfaced in the choice of one or more columns to be declared as a PRIMARY KEY. Every time the issue of natural keys comes up in Stackoverflow.com, the prevailing view seems to be that the best primary keys are artificial and opaque. And responses that take this view get the votes. They emphasize efficiency (mainly efficiency in the index that you usually get automatically when you declare a PRIMARY KEY). But data integrity seems to be forgotten, here. The same is not generally tru when the conversation turns to referential integrity. Slowly but surely the programming community seems to heve been dragged, kicking and screaming, into turning over referential integrity enforcement to the DBMS. What I don't get is whether I've understood something that today's trendsetters never learned or whether they are taking into consideration matters that I never considered important, back in the day. I've read their arguments carefully, and it seems to me that failure to declare UNIQUE and NOT NULL constraints on the most widely used candidate keys is simply an invitation to degradation of data integrity. PRIMARY KEY is a convenient way to get those two constraints with one declaration. Is data integrity less important than it used to be? Are most people building databases that get embedded in some appplication to be sold to the non technical buyer? Is something else going on that I'm unaware of? Can some body explain to me what these people are thinking? |
#6
| |||
| |||
|
|
Walter Mitty wrote: Every time the issue of natural keys comes up in Stackoverflow.com, the prevailing view seems to be that the best primary keys are artificial and opaque. I've seen the same tendency at Wikipedia. I've tried to engage in discussions about it there, but I've given up (sometimes, you just have to give up on Wikipedia). In a discussion, I even provided benchmarks to the most eager surrogate-proponent, showing worse performance for a type of query when surrogate keys were used -- but this was ignored. And I believe that the tendency is due to - Misguidedness. When I started in the programming field, I thought that every table should have a generated integer "id" column; it somehow felt "nice". But after a dose of education and experience, the fixed idea went away. I think many people find it somehow comforting to have an "id" handle, instead of appreciating the idea of letting the data itself be the driving force. The navigational versus the declarative approach. - Deficiencies in object relational mappers which either don't support multi-column keys, or don't make it easy to handle multi-column keys. - Some DBMSes which don't support CASCADE in definition of foreign keys. Some of the performance arguments may be valid: There are cases where they are more efficient than natural keys, although the extent of it is probably seldom measured, making it a premature optimization. On the other hand, there are cases where the use of surrogate keys results in _worse_ performance, due to the need for extra joins, for example. And all else being equal, the surrogate columns add extra bytes to the database; if all tables have unneeded surrogate keys, it can actually amount to a sizeable amount of "dead" bytes in the database. But the situation is also due to the SQL standard and the DBMS implementors: - Many people stay away from NATURAL JOIN because it's seen as being dangerous, having the potential for being source of bad, subtle problems if a column is renamed. I wish the SQL standard had defined NATURAL JOIN to be based on referential integrity definitions (foreign keys) instead of naming conventions. Maybe the standard could be extended with a new "NATURAL REFERENTIAL JOIN" or simply "REFERENTIAL JOIN", so that joins could be expressed in a terse and safe way, even where multi-column keys are involved. - I know of no DBMS which implement the obvious optimization where a "costly" natural key based referential integrity constraint (such as multi-column keys involving textual columns) is handled behind the scenes through hidden pointer-like (integer?) row IDs. Is data integrity less important than it used to be? It seems so :-( And not just integrity which is at risk: The less you tell the DBMS about your data, the less it can use the knowledge for semantic query optimization. Also, when you don't express known constraints about your data, the application can't make use of this metadata knowledge for GUI hints, etc. -- Troels |
#7
| |||
| |||
|
|
"Walter Mitty" <wamitty (AT) verizon (DOT) net> wrote in message news:7ahPl.1492$5F2.238 (AT) nwrddc01 (DOT) gnilink.net... I'm confused about some recent trends in database design. When I learned databases, about a quarter of a century ago, the preferred practice was to use natural keys to refer to individual table rows, unless there's some real good reason to go with an artificial key. I've run into a few cases where I chose to use an artificial key, but most of the time I've used a natural key. (I should clarify: if a university's registration office assigns each student a StudentID, before entering that student into the database, I'm treating that key as "natural" in the context of the database, even though one might argue that it's "artificial" in some other context.) This generally surfaced in the choice of one or more columns to be declared as a PRIMARY KEY. Every time the issue of natural keys comes up in Stackoverflow.com, the prevailing view seems to be that the best primary keys are artificial and opaque. And responses that take this view get the votes. They emphasize efficiency (mainly efficiency in the index that you usually get automatically when you declare a PRIMARY KEY). But data integrity seems to be forgotten, here. The same is not generally tru when the conversation turns to referential integrity. Slowly but surely the programming community seems to heve been dragged, kicking and screaming, into turning over referential integrity enforcement to the DBMS. What I don't get is whether I've understood something that today's trendsetters never learned or whether they are taking into consideration matters that I never considered important, back in the day. I've read their arguments carefully, and it seems to me that failure to declare UNIQUE and NOT NULL constraints on the most widely used candidate keys is simply an invitation to degradation of data integrity. PRIMARY KEY is a convenient way to get those two constraints with one declaration. Is data integrity less important than it used to be? Are most people building databases that get embedded in some appplication to be sold to the non technical buyer? Is something else going on that I'm unaware of? Can some body explain to me what these people are thinking? I'm going to use a term that provokes derision from the all-natural crowd: rigidity. If a natural key rigidly designates or describes something--that is, if it permanently identifies somthing in the universe, then there is no /logical/ reason to add an artificial key. But not every instance of a key is a permanent identifier: consider a table where the entire heading is the key. Do you permit updates to such a table? If updates are allowed, then what does it "mean" when one occurs? Did what was represented in the database merely change in appearance, or was it replaced by something else? And if it were replaced, then wouldn't it have been better to issue a delete followed by an insert instead? Wouldn't that more closely represent what actually occurred? The delay between the time that a row is read and the time that an update occurs, and the possibility that a change initiated by another user can occur during that interval justifies the introduction of an artificial or surrogate key, but only in the event that instances of the natural key do not rigidly designate or describe something in the universe. Thanks for your response. I do think your ideas are of interest. But |
#8
| |||
| |||
|
|
Walter Mitty wrote: I'm confused about some recent trends in database design. No you're not. [snip] What I don't get is whether I've understood something that today's trendsetters never learned or whether they are taking into consideration matters that I never considered important, back in the day. I've read their arguments carefully, and it seems to me that failure to declare UNIQUE and NOT NULL constraints on the most widely used candidate keys is simply an invitation to degradation of data integrity. PRIMARY KEY is a convenient way to get those two constraints with one declaration. That misses the point. PRIMARY KEY and UNIQUE are supposed to prevent potentially contradictory facts. They have nothing to do with addressability. If RDBMSs or SQL DBMSs needed the concept of an address they would have it and it would be unique and you wouldn't have to everywhere repeatedly assert the most obvious property of an address is uniqueness. I think the urge to want to address rows is not just a hankering for network databases though. Programmers worry about what to do if a user mis-keys a natural key value which then gets propagated to a number of tables. If they have kept the natural key in one place it needs to be fixed in only one place. I've had quite good luck weening colleagues off spurious synthetic keys by encouraging them to declare their foreign keys with ON UPDATE CASCADE. (One or two demand benchmarks before they believe it is practically free, but generally they get it.) I suspect that it became established "best practice" to use synthetic keys everywhere back when few SQL DBMSs supported ON UPDATE CASCADE and it has just stayed with us. Can some body explain to me what these people are thinking? They aren't. -- Roy |
#9
| |||
| |||
|
|
Based on the discussion in stackoverflow, I can neither confirm nor deny your experience with CASCADE bringing people around. If you say that's been your experience, then I'll believe you. |
#10
| |||
| |||
|
![]() |
| Thread Tools | |
| Display Modes | |
| |