dbTalk Databases Forums  

Is an URL a good primary key? (or what else?)

comp.databases.theory comp.databases.theory


Discuss Is an URL a good primary key? (or what else?) in the comp.databases.theory forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Lorenzo Lazzeri
 
Posts: n/a

Default Is an URL a good primary key? (or what else?) - 10-06-2010 , 12:43 PM






I'm designing a database for an application which has to do with web
URLs.
I'll have tables to store sites attributes as well as specific pages
and RSS/Atom feeds. I'll have millions of them.
Everything has an URL which is by definition unique. I'm in the
"natural key" bandwagon, but I have doubts to use the URLs themselves
as primary key, for the reason that they can be very (very) long. The
HTTP standard doesn't set a limit, and the smaller max size limit
imposed by a browser is 2048 characters. So I need to have URLs field
of at least CHAR(2048).
If I use URL as primary key, will that size create problems of
performance in joins?

So I tried with a surrogate key, the sha1() hash of the URL. It is
very useful, still long 40 chars, but it has all the problems of
surrogate key: it says nothing about the data it refers to if taken by
itself, and it needs some triggers and on update cascade costraints to
keep the whole schema coherent if something (the url...) changes.
Moreover, I need a further key on the url to speed up searches.

I want to avoid auto incremental id for the well known reasons:
moreover it has to do with the data even less than the hash (i.e.
nothing).

So I'm looking for and advice on it. Should I use URLs as primary key
or what else? Is that performance problem on join really an issue or
can I ignore it? May be because the index is already treated as an
hash internally by the db?

In the meantime I'm exploring another solution. Since I'll have at
least seven tables with URL as discriminant (and couting as I add
features), maybe I should remove the URL from that tables and store
them in another table as schema, domain, port, path, query_string,
auth_user, auth_password. Then I could use the url hash on the other
tables as foreign key as well as primary key. Am I totally wrong on
this way?

For the sake of completeness I'm designing for MySQL 5.1, even if I'd
like to avoid non-standard constructs.

Thank you in advance for any answers.

Lorenzo Lazzeri

P.S. Please forgive my english syntax and don't hesitate to ask me
clarifications where needed.

Reply With Quote
  #2  
Old   
Bob Badour
 
Posts: n/a

Default Re: Is an URL a good primary key? (or what else?) - 10-06-2010 , 01:14 PM






Lorenzo Lazzeri wrote:

Quote:
I'm designing a database for an application which has to do with web
URLs.
I'll have tables to store sites attributes as well as specific pages
and RSS/Atom feeds. I'll have millions of them.
Everything has an URL which is by definition unique. I'm in the
"natural key" bandwagon, but I have doubts to use the URLs themselves
as primary key, for the reason that they can be very (very) long. The
HTTP standard doesn't set a limit, and the smaller max size limit
imposed by a browser is 2048 characters. So I need to have URLs field
of at least CHAR(2048).
If I use URL as primary key, will that size create problems of
performance in joins?

So I tried with a surrogate key, the sha1() hash of the URL. It is
very useful, still long 40 chars, but it has all the problems of
surrogate key: it says nothing about the data it refers to if taken by
itself, and it needs some triggers and on update cascade costraints to
keep the whole schema coherent if something (the url...) changes.
Moreover, I need a further key on the url to speed up searches.

I want to avoid auto incremental id for the well known reasons:
moreover it has to do with the data even less than the hash (i.e.
nothing).

So I'm looking for and advice on it. Should I use URLs as primary key
or what else? Is that performance problem on join really an issue or
can I ignore it? May be because the index is already treated as an
hash internally by the db?

In the meantime I'm exploring another solution. Since I'll have at
least seven tables with URL as discriminant (and couting as I add
features), maybe I should remove the URL from that tables and store
them in another table as schema, domain, port, path, query_string,
auth_user, auth_password. Then I could use the url hash on the other
tables as foreign key as well as primary key. Am I totally wrong on
this way?

For the sake of completeness I'm designing for MySQL 5.1, even if I'd
like to avoid non-standard constructs.

Thank you in advance for any answers.

Lorenzo Lazzeri

P.S. Please forgive my english syntax and don't hesitate to ask me
clarifications where needed.
Lorenzo, the design criteria for primary keys are: uniqueness,
irreducibility, stability, simplicity and familiarity.

Your URLs are unique (so you say), irreducible and familiar. Are they
simple and stable?

While a sha1 hash is almost as unique as the original URL, it is not
quite as unique, and you have already noted such a hash is not familiar.
The hash is neither more stable nor less reducible than the original
URL. The question then becomes not only "Is the hash simpler than the
URL", but "Is the hash sufficiently simpler to compensate for the loss
of familiarity and uniqueness?"

I am unaware of any well known reasons to avoid auto increment numbers.
What are they?

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.