dbTalk Databases Forums  

Natural keys vs Aritficial Keys

comp.databases.theory comp.databases.theory


Discuss Natural keys vs Aritficial Keys in the comp.databases.theory forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Walter Mitty
 
Posts: n/a

Default Natural keys vs Aritficial Keys - 05-15-2009 , 11:57 AM






I'm confused about some recent trends in database design.

When I learned databases, about a quarter of a century ago, the preferred
practice was to use natural keys to refer to individual table rows, unless
there's some real good reason to go with an artificial key. I've run into a
few cases where I chose to use an artificial key, but most of the time I've
used a natural key. (I should clarify: if a university's registration
office assigns each student a StudentID, before entering that student into
the database, I'm treating that key as "natural" in the context of the
database, even though one might argue that it's "artificial" in some other
context.) This generally surfaced in the choice of one or more columns to
be declared as a PRIMARY KEY.

Every time the issue of natural keys comes up in Stackoverflow.com, the
prevailing view seems to be that the best primary keys are artificial and
opaque. And responses that take this view get the votes. They emphasize
efficiency (mainly efficiency in the index that you usually get
automatically when you declare a PRIMARY KEY). But data integrity seems to
be forgotten, here. The same is not generally tru when the conversation
turns to referential integrity. Slowly but surely the programming community
seems to heve been dragged, kicking and screaming, into turning over
referential integrity enforcement to the DBMS.


What I don't get is whether I've understood something that today's
trendsetters never learned or whether they are taking into consideration
matters that I never considered important, back in the day. I've read their
arguments carefully, and it seems to me that failure to declare UNIQUE and
NOT NULL constraints on the most widely used candidate keys is simply an
invitation to degradation of data integrity. PRIMARY KEY is a convenient
way to get those two constraints with one declaration.

Is data integrity less important than it used to be? Are most people
building databases that get embedded in some appplication to be sold to the
non technical buyer? Is something else going on that I'm unaware of? Can
some body explain to me what these people are thinking?






Reply With Quote
  #2  
Old   
paul c
 
Posts: n/a

Default Re: Natural keys vs Aritficial Keys - 05-15-2009 , 02:26 PM






Walter Mitty wrote:
Quote:
I'm confused about some recent trends in database design.

When I learned databases, about a quarter of a century ago, the preferred
practice was to use natural keys to refer to individual table rows, unless
there's some real good reason to go with an artificial key. I've run into a
few cases where I chose to use an artificial key, but most of the time I've
used a natural key. (I should clarify: if a university's registration
office assigns each student a StudentID, before entering that student into
the database, I'm treating that key as "natural" in the context of the
database, even though one might argue that it's "artificial" in some other
context.) This generally surfaced in the choice of one or more columns to
be declared as a PRIMARY KEY.

Every time the issue of natural keys comes up in Stackoverflow.com, the
prevailing view seems to be that the best primary keys are artificial and
opaque. And responses that take this view get the votes. They emphasize
efficiency (mainly efficiency in the index that you usually get
automatically when you declare a PRIMARY KEY). But data integrity seems to
be forgotten, here. The same is not generally tru when the conversation
turns to referential integrity. Slowly but surely the programming community
seems to heve been dragged, kicking and screaming, into turning over
referential integrity enforcement to the DBMS.


What I don't get is whether I've understood something that today's
trendsetters never learned or whether they are taking into consideration
matters that I never considered important, back in the day. I've read their
arguments carefully, and it seems to me that failure to declare UNIQUE and
NOT NULL constraints on the most widely used candidate keys is simply an
invitation to degradation of data integrity. PRIMARY KEY is a convenient
way to get those two constraints with one declaration.

Is data integrity less important than it used to be? Are most people
building databases that get embedded in some appplication to be sold to the
non technical buyer? Is something else going on that I'm unaware of? Can
some body explain to me what these people are thinking?

I don't think it's you who's confused.


Reply With Quote
  #3  
Old   
Troels Arvin
 
Posts: n/a

Default Re: Natural keys vs Aritficial Keys - 05-15-2009 , 03:48 PM



Walter Mitty wrote:
Quote:
Every time the issue of natural keys comes up in Stackoverflow.com, the
prevailing view seems to be that the best primary keys are artificial
and opaque.
I've seen the same tendency at Wikipedia. I've tried to engage in
discussions about it there, but I've given up (sometimes, you just have
to give up on Wikipedia). In a discussion, I even provided benchmarks to
the most eager surrogate-proponent, showing worse performance for a type
of query when surrogate keys were used -- but this was ignored.

And I believe that the tendency is due to

- Misguidedness. When I started in the programming field,
I thought that every table should have a generated integer
"id" column; it somehow felt "nice". But after a dose of
education and experience, the fixed idea went away.
I think many people find it somehow comforting to have
an "id" handle, instead of appreciating the idea of
letting the data itself be the driving force. The
navigational versus the declarative approach.

- Deficiencies in object relational mappers which
either don't support multi-column keys, or don't make
it easy to handle multi-column keys.

- Some DBMSes which don't support CASCADE in definition
of foreign keys.

Some of the performance arguments may be valid: There are cases where
they are more efficient than natural keys, although the extent of it is
probably seldom measured, making it a premature optimization.
On the other hand, there are cases where the use of surrogate keys
results in _worse_ performance, due to the need for extra joins, for
example. And all else being equal, the surrogate columns add extra bytes
to the database; if all tables have unneeded surrogate keys, it can
actually amount to a sizeable amount of "dead" bytes in the database.

But the situation is also due to the SQL standard and the DBMS
implementors:

- Many people stay away from NATURAL JOIN because it's
seen as being dangerous, having the potential for being
source of bad, subtle problems if a column is renamed.
I wish the SQL standard had defined NATURAL JOIN to
be based on referential integrity definitions (foreign
keys) instead of naming conventions. Maybe the standard
could be extended with a new "NATURAL REFERENTIAL JOIN"
or simply "REFERENTIAL JOIN", so that joins could be
expressed in a terse and safe way, even where multi-column
keys are involved.

- I know of no DBMS which implement the obvious optimization
where a "costly" natural key based referential integrity
constraint (such as multi-column keys involving textual
columns) is handled behind the scenes through hidden
pointer-like (integer?) row IDs.

Quote:
Is data integrity less important than it used to be?
It seems so :-(

And not just integrity which is at risk: The less you tell the DBMS about
your data, the less it can use the knowledge for semantic query
optimization. Also, when you don't express known constraints about your
data, the application can't make use of this metadata knowledge for GUI
hints, etc.

--
Troels


Reply With Quote
  #4  
Old   
Roy Hann
 
Posts: n/a

Default Re: Natural keys vs Aritficial Keys - 05-16-2009 , 03:58 AM



Walter Mitty wrote:

Quote:
I'm confused about some recent trends in database design.
No you're not.

[snip]

Quote:
What I don't get is whether I've understood something that today's
trendsetters never learned or whether they are taking into consideration
matters that I never considered important, back in the day. I've read their
arguments carefully, and it seems to me that failure to declare UNIQUE and
NOT NULL constraints on the most widely used candidate keys is simply an
invitation to degradation of data integrity. PRIMARY KEY is a convenient
way to get those two constraints with one declaration.
That misses the point. PRIMARY KEY and UNIQUE are supposed to prevent
potentially contradictory facts. They have nothing to do with
addressability. If RDBMSs or SQL DBMSs needed the concept of an
address they would have it and it would be unique and you wouldn't
have to everywhere repeatedly assert the most obvious property of an
address is uniqueness.

I think the urge to want to address rows is not just a hankering for
network databases though. Programmers worry about what to do if a user
mis-keys a natural key value which then gets propagated to a number of
tables. If they have kept the natural key in one place it needs to be
fixed in only one place. I've had quite good luck weening colleagues
off spurious synthetic keys by encouraging them to declare their
foreign keys with ON UPDATE CASCADE. (One or two demand benchmarks
before they believe it is practically free, but generally they get it.)
I suspect that it became established "best practice" to use
synthetic keys everywhere back when few SQL DBMSs supported ON UPDATE
CASCADE and it has just stayed with us.

Quote:
Can some body explain to me what these people are thinking?
They aren't.

--
Roy



Reply With Quote
  #5  
Old   
Brian Selzer
 
Posts: n/a

Default Re: Natural keys vs Aritficial Keys - 05-16-2009 , 08:36 AM




"Walter Mitty" <wamitty (AT) verizon (DOT) net> wrote

Quote:
I'm confused about some recent trends in database design.

When I learned databases, about a quarter of a century ago, the preferred
practice was to use natural keys to refer to individual table rows, unless
there's some real good reason to go with an artificial key. I've run into
a
few cases where I chose to use an artificial key, but most of the time
I've
used a natural key. (I should clarify: if a university's registration
office assigns each student a StudentID, before entering that student
into
the database, I'm treating that key as "natural" in the context of the
database, even though one might argue that it's "artificial" in some other
context.) This generally surfaced in the choice of one or more columns to
be declared as a PRIMARY KEY.

Every time the issue of natural keys comes up in Stackoverflow.com, the
prevailing view seems to be that the best primary keys are artificial and
opaque. And responses that take this view get the votes. They emphasize
efficiency (mainly efficiency in the index that you usually get
automatically when you declare a PRIMARY KEY). But data integrity seems
to
be forgotten, here. The same is not generally tru when the conversation
turns to referential integrity. Slowly but surely the programming
community
seems to heve been dragged, kicking and screaming, into turning over
referential integrity enforcement to the DBMS.


What I don't get is whether I've understood something that today's
trendsetters never learned or whether they are taking into consideration
matters that I never considered important, back in the day. I've read
their
arguments carefully, and it seems to me that failure to declare UNIQUE and
NOT NULL constraints on the most widely used candidate keys is simply an
invitation to degradation of data integrity. PRIMARY KEY is a convenient
way to get those two constraints with one declaration.

Is data integrity less important than it used to be? Are most people
building databases that get embedded in some appplication to be sold to
the
non technical buyer? Is something else going on that I'm unaware of? Can
some body explain to me what these people are thinking?

I'm going to use a term that provokes derision from the all-natural crowd:
rigidity. If a natural key rigidly designates or describes something--that
is, if it permanently identifies somthing in the universe, then there is no
/logical/ reason to add an artificial key. But not every instance of a key
is a permanent identifier: consider a table where the entire heading is the
key. Do you permit updates to such a table? If updates are allowed, then
what does it "mean" when one occurs? Did what was represented in the
database merely change in appearance, or was it replaced by something else?
And if it were replaced, then wouldn't it have been better to issue a delete
followed by an insert instead? Wouldn't that more closely represent what
actually occurred?

The delay between the time that a row is read and the time that an update
occurs, and the possibility that a change initiated by another user can
occur during that interval justifies the introduction of an artificial or
surrogate key, but only in the event that instances of the natural key do
not rigidly designate or describe something in the universe.




Reply With Quote
  #6  
Old   
Walter Mitty
 
Posts: n/a

Default Re: Natural keys vs Aritficial Keys - 05-16-2009 , 12:58 PM




"Troels Arvin" <troels (AT) arvin (DOT) dk> wrote

Quote:
Walter Mitty wrote:
Every time the issue of natural keys comes up in Stackoverflow.com, the
prevailing view seems to be that the best primary keys are artificial
and opaque.

I've seen the same tendency at Wikipedia. I've tried to engage in
discussions about it there, but I've given up (sometimes, you just have
to give up on Wikipedia). In a discussion, I even provided benchmarks to
the most eager surrogate-proponent, showing worse performance for a type
of query when surrogate keys were used -- but this was ignored.

And I believe that the tendency is due to

- Misguidedness. When I started in the programming field,
I thought that every table should have a generated integer
"id" column; it somehow felt "nice". But after a dose of
education and experience, the fixed idea went away.
I think many people find it somehow comforting to have
an "id" handle, instead of appreciating the idea of
letting the data itself be the driving force. The
navigational versus the declarative approach.

- Deficiencies in object relational mappers which
either don't support multi-column keys, or don't make
it easy to handle multi-column keys.

- Some DBMSes which don't support CASCADE in definition
of foreign keys.

Some of the performance arguments may be valid: There are cases where
they are more efficient than natural keys, although the extent of it is
probably seldom measured, making it a premature optimization.
On the other hand, there are cases where the use of surrogate keys
results in _worse_ performance, due to the need for extra joins, for
example. And all else being equal, the surrogate columns add extra bytes
to the database; if all tables have unneeded surrogate keys, it can
actually amount to a sizeable amount of "dead" bytes in the database.

But the situation is also due to the SQL standard and the DBMS
implementors:

- Many people stay away from NATURAL JOIN because it's
seen as being dangerous, having the potential for being
source of bad, subtle problems if a column is renamed.
I wish the SQL standard had defined NATURAL JOIN to
be based on referential integrity definitions (foreign
keys) instead of naming conventions. Maybe the standard
could be extended with a new "NATURAL REFERENTIAL JOIN"
or simply "REFERENTIAL JOIN", so that joins could be
expressed in a terse and safe way, even where multi-column
keys are involved.

- I know of no DBMS which implement the obvious optimization
where a "costly" natural key based referential integrity
constraint (such as multi-column keys involving textual
columns) is handled behind the scenes through hidden
pointer-like (integer?) row IDs.

Is data integrity less important than it used to be?

It seems so :-(

And not just integrity which is at risk: The less you tell the DBMS about
your data, the less it can use the knowledge for semantic query
optimization. Also, when you don't express known constraints about your
data, the application can't make use of this metadata knowledge for GUI
hints, etc.

--
Troels
Thanks for your response. I think that both stackoverflow and Wikipedia
make it possible to confuse popularity with excellence.

Perhaps one just has to accept that some responses don't get the recognition
they deserve. By way of an analogy, in a democracy, sometimes the
electorate goofs.

But I think there's something more substantive going on here. I think that
the design center of database design has shifted from the support of an
enterprise need to manage and share data to an application's need for a
reliable persistent store. Most of the people voting are not DBA's or
database designers for an enterprise integrating database. They are
developers who are getting an application ready for sale. The measure of
"goodness" changes accordingly.

As far as efficiency goes, I used to make a dramatic difference between
inefficiencies that tacked an extra 10% onto a retrieval, and
inefficiencies that turned minutes of delay into hours of delay. I found
that keeping the design simple and sound, and keeping things flexible enough
to make fine tuning possible were much more important that squeezing every
last drop of performance out of every single design decision. Maybe today's
developers are like a horse owner trying to win the Kentucky derby, and
considering second place a "loss".

Speaking of efficiency, composite primary keys have almost no effect, if you
get the indexes right. A primary key on (StudentID, CourseID) is equivalent
to a key on (CourseID, StudentID). However an index on (StudentID,
CourseID) could have a very different performance impact than an index on
(CourseID, StudentID). Unfortunately, some people judge the performance
impact of a compound key based on the impact of the freebie index the DBMS
makes for you.
(I know, I know, "freebie" is the wrong word!).

I also remember a time when unique indexes used to perform much, much better
than indexes that permitted duplicates. This was a product specific
deficiency.

Thanks again.






Reply With Quote
  #7  
Old   
Walter Mitty
 
Posts: n/a

Default Re: Natural keys vs Aritficial Keys - 05-16-2009 , 01:01 PM




"Brian Selzer" <brian (AT) selzer-software (DOT) com> wrote

Quote:
"Walter Mitty" <wamitty (AT) verizon (DOT) net> wrote in message
news:7ahPl.1492$5F2.238 (AT) nwrddc01 (DOT) gnilink.net...
I'm confused about some recent trends in database design.

When I learned databases, about a quarter of a century ago, the
preferred
practice was to use natural keys to refer to individual table rows,
unless
there's some real good reason to go with an artificial key. I've run
into a
few cases where I chose to use an artificial key, but most of the time
I've
used a natural key. (I should clarify: if a university's registration
office assigns each student a StudentID, before entering that student
into
the database, I'm treating that key as "natural" in the context of the
database, even though one might argue that it's "artificial" in some
other
context.) This generally surfaced in the choice of one or more columns
to
be declared as a PRIMARY KEY.

Every time the issue of natural keys comes up in Stackoverflow.com, the
prevailing view seems to be that the best primary keys are artificial and
opaque. And responses that take this view get the votes. They emphasize
efficiency (mainly efficiency in the index that you usually get
automatically when you declare a PRIMARY KEY). But data integrity seems
to
be forgotten, here. The same is not generally tru when the conversation
turns to referential integrity. Slowly but surely the programming
community
seems to heve been dragged, kicking and screaming, into turning over
referential integrity enforcement to the DBMS.


What I don't get is whether I've understood something that today's
trendsetters never learned or whether they are taking into consideration
matters that I never considered important, back in the day. I've read
their
arguments carefully, and it seems to me that failure to declare UNIQUE
and
NOT NULL constraints on the most widely used candidate keys is simply an
invitation to degradation of data integrity. PRIMARY KEY is a convenient
way to get those two constraints with one declaration.

Is data integrity less important than it used to be? Are most people
building databases that get embedded in some appplication to be sold to
the
non technical buyer? Is something else going on that I'm unaware of?
Can
some body explain to me what these people are thinking?


I'm going to use a term that provokes derision from the all-natural crowd:
rigidity. If a natural key rigidly designates or describes
something--that is, if it permanently identifies somthing in the universe,
then there is no /logical/ reason to add an artificial key. But not every
instance of a key is a permanent identifier: consider a table where the
entire heading is the key. Do you permit updates to such a table? If
updates are allowed, then what does it "mean" when one occurs? Did what
was represented in the database merely change in appearance, or was it
replaced by something else? And if it were replaced, then wouldn't it have
been better to issue a delete followed by an insert instead? Wouldn't
that more closely represent what actually occurred?

The delay between the time that a row is read and the time that an update
occurs, and the possibility that a change initiated by another user can
occur during that interval justifies the introduction of an artificial or
surrogate key, but only in the event that instances of the natural key do
not rigidly designate or describe something in the universe.


Thanks for your response. I do think your ideas are of interest. But
previous exchanges between you and me have arrived at such an impasse that
I'm not inclined to make the effort to reconcile your world view with mine.
I can see this response leading down the exact same trail that we've been
down so many times before.

Perhaps your reponse will provoke some useful discussion with other
participants. Thanks again.





Reply With Quote
  #8  
Old   
Walter Mitty
 
Posts: n/a

Default Re: Natural keys vs Aritficial Keys - 05-16-2009 , 01:35 PM




"Roy Hann" <specially (AT) processed (DOT) almost.meat> wrote

Quote:
Walter Mitty wrote:

I'm confused about some recent trends in database design.

No you're not.

[snip]

What I don't get is whether I've understood something that today's
trendsetters never learned or whether they are taking into consideration
matters that I never considered important, back in the day. I've read
their
arguments carefully, and it seems to me that failure to declare UNIQUE
and
NOT NULL constraints on the most widely used candidate keys is simply an
invitation to degradation of data integrity. PRIMARY KEY is a convenient
way to get those two constraints with one declaration.

That misses the point. PRIMARY KEY and UNIQUE are supposed to prevent
potentially contradictory facts. They have nothing to do with
addressability. If RDBMSs or SQL DBMSs needed the concept of an
address they would have it and it would be unique and you wouldn't
have to everywhere repeatedly assert the most obvious property of an
address is uniqueness.

I think the urge to want to address rows is not just a hankering for
network databases though. Programmers worry about what to do if a user
mis-keys a natural key value which then gets propagated to a number of
tables. If they have kept the natural key in one place it needs to be
fixed in only one place. I've had quite good luck weening colleagues
off spurious synthetic keys by encouraging them to declare their
foreign keys with ON UPDATE CASCADE. (One or two demand benchmarks
before they believe it is practically free, but generally they get it.)
I suspect that it became established "best practice" to use
synthetic keys everywhere back when few SQL DBMSs supported ON UPDATE
CASCADE and it has just stayed with us.

Can some body explain to me what these people are thinking?

They aren't.

--
Roy

Thanks for your response. You are another regular whose responses often
provide me with valuable insight.

Based on the discussion in stackoverflow, I can neither confirm nor deny
your experience with CASCADE bringing people around. If you say that's been
your experience, then I'll believe you.

What I will say in response kind of generalizes your comment. Some design
decisions involve a trade off between ease of development now and ease of
data management later on. In that trade off, ease of development seems to
win every time. I can't prove this, but I think a lot of those developers
imagine a DBA to just a database baysitter, someone who makes sure the
database isbacked up, prevents hackers from storming the citadel, and does
little else. From my expeience in the 80s and 90s, that was only a small
part of the ways DBAs added value to data.

Thanks again.




Reply With Quote
  #9  
Old   
Roy Hann
 
Posts: n/a

Default Re: Natural keys vs Aritficial Keys - 05-16-2009 , 01:51 PM



Walter Mitty wrote:


Quote:
Based on the discussion in stackoverflow, I can neither confirm nor deny
your experience with CASCADE bringing people around. If you say that's been
your experience, then I'll believe you.
Persuading a colleague who knows you and works with you, and persuading
some random Internet denizen who's goofing off to stir up trouble on
a forum, are two different challenges with different expected outcomes.
:-)

--
Roy





Reply With Quote
  #10  
Old   
cimode@hotmail.com
 
Posts: n/a

Default Re: Natural keys vs Aritficial Keys - 05-16-2009 , 03:24 PM



<Some design decisions involve a trade off between ease of development
now and ease of
data management later on>
Only poor design decisions impose such tradeoffs. Respecting
fundamentals of database management can only make development easier
to create and administer on a continuous basis. Even on non
relational systems such as SQL DBMS's.


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.