dbTalk Databases Forums  

Does a normalized design lead to complex queries?

comp.databases comp.databases


Discuss Does a normalized design lead to complex queries? in the comp.databases forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
phlype.johnson@gmail.com
 
Posts: n/a

Default Does a normalized design lead to complex queries? - 10-16-2006 , 07:15 PM






Suppose we have to design a database for a recruitment agency. There
will be a table "candidates" with fields "candidateid","last
name","first name" ; the languages mastered by a candidate as
well as the skills are separated into different tables. Eg. The skills
are stored in the table "skills" with fields "skillid",
"skillname","candidateid" where skillid is the primary
autoincrement key. The language table is defined in a similar way. A
candidate having two skills will lead to two records in the table
"skills"; similarly a candidate mastering 3 languages will lead to
3 records in the table "languages".
Retrieving candidates with skills with skillname="php" and
skillname="asp" as well as languagename="EN" and
languagename="FR" can be done as follows:
SELECT DISTINCT c.name FROM candidates c, skills s, languages l WHERE
l.candidateid=c.candidateid AND s.candidateid=c.candidateid AND
(s.skillname='php' OR s.skillname='asp') AND (l.languagename='FR' OR
l.languagename='EN')
GROUP BY c.name HAVING (count(DISTINCT s.skillid)=2 AND count(DISTINCT
l.languageid)=2)
This already rather complex query (for a non-expert as myself) is
probably the disadvantage of a normalized design.
The problem that bothers me most is the following: what happens if I
want to built a search tool to find candidates using a flexible search
interface; in this interface the user could type "meta queries"
such as: find me the candidates that (master php AND (asp OR net)) AND
(that master French OR (Dutch AND German))
This query looks quite simple but how to translate this to one query
for the normalized database?
Do I first have to do the following conversion:
find me the candidates that (master (php AND asp) OR (php AND net)) AND
(that master French OR (Dutch AND German))
becoming
find me the candidates that (master (php AND asp)) AND (that master
French)
OR
find me the candidates that (master (php AND asp)) AND (that master
Dutch AND German)
OR
find me the candidates that (master (php AND net)) AND (that master
French)
OR
find me the candidates that (master (php AND net)) AND (that master
Dutch AND German)
Next I assume the three ORs have to be combined via UNION statements?
This seems quite complicated so I was wondering if there is no easier
way to create an SQL query starting from the "meta query" above?


Reply With Quote
  #2  
Old   
Ed Prochak
 
Posts: n/a

Default Re: Does a normalized design lead to complex queries? - 10-17-2006 , 04:48 PM






phlype.johnson (AT) gmail (DOT) com wrote:
Quote:
Suppose we have to design a database for a recruitment agency. There
will be a table "candidates" with fields "candidateid","last
name","first name" ; the languages mastered by a candidate as
well as the skills are separated into different tables. Eg. The skills
are stored in the table "skills" with fields "skillid",
"skillname","candidateid" where skillid is the primary
autoincrement key. The language table is defined in a similar way. A
candidate having two skills will lead to two records in the table
"skills";
Bzzzt wrong. Well maybe not necessarily. at least you are talking about
tables too soon and cetainly the description of a skills table is not
Normalized. Think of entities and attributes at this point (yes it is
just a straight substitution of words but the thought process is often
driven by how we phrase the question.)

So candidates has attributes: first name, last name, id, DOB, height,
skills, languages spoken, etc
As part of normalizing you realize a candidate can have more than one
skill. Also you realize skills have some of their own attributes. so
you remove skills from the candidate entity, creating a skills entity

Skills include attributes like name, type(musical, industrial,...),
tools required(yes/no), etc

now candidates can have one or more so this leads directly to a
relation I'll label candidate_skill. it has three attributes: candidate
first name, last name, and skills name.

Implementing this design would be easiest in three tables
CANDIDATES, SKILLS, and CANDIDATE_SKILLS

one candidate with two different skills does require two rows in the
CANDIDATE_SKILLS table. Another candidate with 3skills (2 of them the
same as the first candidate requires three rows in the CANDIDATE_SKILLS
table. And in the skills table is just three skills rows, one for each
unique skill.

There would be foreign key constraints from the CANDIDATE_SKILLS table:
from the skill attribute to the skill PK in the SKILLS table. and
from the name attributes to the name attributes forming the PK in the
CANDIDATE table.


Quote:
... similarly a candidate mastering 3 languages will lead to
3 records in the table "languages".
similarly in a CANDIDATES_LANGUAGES table

Quote:
Retrieving candidates with skills with skillname="php" and
skillname="asp" as well as languagename="EN" and
languagename="FR" can be done as follows:
SELECT DISTINCT c.name FROM candidates c, skills s, languages l WHERE
l.candidateid=c.candidateid AND s.candidateid=c.candidateid AND
(s.skillname='php' OR s.skillname='asp') AND (l.languagename='FR' OR
l.languagename='EN')
GROUP BY c.name HAVING (count(DISTINCT s.skillid)=2 AND count(DISTINCT
l.languageid)=2)
Ignoring normalization for the moment, why do you think you need
DISTINCT?
the GROUP BY will not return multiple JOHN rows in the result set, so
the DISTINCT is totally redundant in this query.


Quote:
This already rather complex query (for a non-expert as myself) is
probably the disadvantage of a normalized design.
Well I have ranted about pseudoKeys like your candidate ID and this
shows one of the disadvantages. Using the tables I described above, you
don't even need to visit the candidates table:

SELECT s.first_name,s.last_name
FROM skills s, languages L
WHERE s.first_name=L.first_name AND s.last_name=L.last_name
AND s.skillname IN ('php' , 'asp') AND L.languagename IN ('FR', 'EN')
GROUP BY s.first_name,s.last_name
HAVING (count(DISTINCT s.skillid)=2 AND count(DISTINCT L.languageid)=2)

But there are many more benefits to a normalized schema that far
outweigh the perceived complexity of this query. Consider that to make
it easier for general users to form queries, rather than denormalizing
the tables, consider creating a view for some of the most common data
requests. For example (mushing first and last name down to cname, but
ou get the idea)

create view CANDIDATE_3SKILLS as
SELECT C.*,s1.skillname skill1, s2.skillname skill2, s3.skillname
skill3
from CANDIDATE C, CANDIDATE_SKILLS s1, CANDIDATE_SKILLS s2,
CANDIDATE_SKILLS s3
where CANDIDATE.cname = s1.cname and CANDIDATE.cname = s2.cname and
CANDIDATE.cname = s3.cname and
s1.skillname > s2.skillname and s2.skillname > s3.skillname

of course this only works for candidates with exactly 3 skills.
Candidates with less do not appear and candidates with more appear more
than once. (part of the trouble of denormalizing)

Quote:
The problem that bothers me most is the following: what happens if I
want to built a search tool to find candidates using a flexible search
interface; in this interface the user could type "meta queries"
such as: find me the candidates that (master php AND (asp OR net)) AND
(that master French OR (Dutch AND German))
Ah, so you are less concerned about the readablilty of the query. good
If you are going for a more natural language query for the user why not
something like:
* find me the candidates that master php and (asp or net) and
that master French or (Dutch and German)

you may want to provide the users some training about grouping via
parentheses. (do you really want someone that speaks ONLY French?)

also it might be much easier if you allow more specific terms such as:
* find me the candidates that program in php and (asp or net) and
that speak French or (Dutch and German)

Quote:
This query looks quite simple but how to translate this to one query
for the normalized database?
a LOT easier than for a denormalized DB!

It is fairly easy for a language parser program. A good software
engineer can dealt with this. (send me an email and I could help or
recommend someone else).


Quote:
Do I first have to do the following conversion:
find me the candidates that (master (php AND asp) OR (php AND net)) AND
(that master French OR (Dutch AND German))
becoming
find me the candidates that (master (php AND asp)) AND (that master
French)
OR
find me the candidates that (master (php AND asp)) AND (that master
Dutch AND German)
OR
find me the candidates that (master (php AND net)) AND (that master
French)
OR
find me the candidates that (master (php AND net)) AND (that master
Dutch AND German)
no.

Quote:
Next I assume the three ORs have to be combined via UNION statements?
This seems quite complicated so I was wondering if there is no easier
way to create an SQL query starting from the "meta query" above?
There are easier ways. You need an interpreter to translate from the
natural meta language query to SQL. It is an interesting kind of
programming. It helps to have a well defined syntax for the "meta
language", but for your case you might have a language that allows some
fluff (such as the opening "find me the " phrase) which makes the user
more comfortable but adds little to the meaning of the entire query.

HTH,
Ed



Reply With Quote
  #3  
Old   
phlype.johnson@gmail.com
 
Posts: n/a

Default Re: Does a normalized design lead to complex queries? - 10-19-2006 , 07:22 AM



Thanks Ed for your nice feedback. I agree that the example was not
completely normalized
however on the redundance of the DISTINCT I disagree. I was planning to
repost my question with some more explanation and as you will see in my
verbose description below: if you leave out the DISTINCT you will
change the result of the query (have a try with the provided mysql
example); the reason is that if you have a person with skills A and B
that also speaks language alpha, the join result of the query looking
for a person having those properties will contain two rows: one for
skill A and one for skill B each time with language alpha, meaning that
alpha appears twice unless you do the COUNT (DISTINCT ...

Here comes my longer post:
To illustrate better my question on whether normalized designs lead to
more complex queries yes or no, I have prepared an example. The example
is a database with the following tables:
*table person with fields:
-persid: autoincrement id
-name: name of the person
*table material with fields:
-materialid: autoincrement id
-material: name of the material eg "wood"
*table color with fields:
-colorid: autoincrement id
-color: name of the color eg "green"
*table persmaterial with fields:
-persmatid: autoincrement id
-persid: link to table person
-materialid: link to table material
*table perscolor with fields:
-perscolorid: autoincrement id
-persid: link to table person
-colorid: link to table color
Using these tables it is straightforward to store the preference of a
certain person for colors and materials.
Now use the statements at the end of the post to create the tables and
populate them with intial values. Next we want to find all persons who
like the colors red or blue and also like the color green. Furthermore
the person should have a preference for iron as material. I understand
there are several ways to obtain the requested result.
Option 1:
The initial query can be written in pseudocode as
Find all persons that (like as color (red OR blue) AND green) AND that
(like as material iron). I do not see right away how to write this as a
query with joins so I rewrite the pseudo query as follows:
Find all persons that ((like as color red AND green) AND that (like as
material iron)) OR ((like as color blue AND green) AND that (like as
material iron))
this leads to the following query
(SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
p.persid=pc.persid AND (pc.colorid=1 OR pc.colorid=2) AND
p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING
(count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1))
UNION
(SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
p.persid=pc.persid AND (pc.colorid=2 OR pc.colorid=3) AND
p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING
(count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1))
You can say that for someone familiar to sql this is not overly
complicated but the problem I see is the following. I want to use this
database in a webapplication where a user will be able to find persons
with certain preferences. The user will be allowed to play around with
AND and ORs in his request. To state it differently the user will have
the possibility to type in a pseudo query like the example I used
above. As you have seen above, to get to the final sql query I need to
do a transformation of the pseudo query to the sql query. Secondly, if
we have many criteria you can see easily that the number of unions can
grow exponentially if the user starts playing around with ANDs and ORs
for a certain property.

Option 2:
The pseudo query
Find all persons that (like as color (red OR blue) AND green) AND that
(like as material iron)
can straightforwardly be coded in a query using subqueries:
SELECT persid FROM person p WHERE
(EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=1 AND
p.persid=pc.persid)
OR
EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=3 AND
p.persid=pc.persid))
AND
EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=2 AND
p.persid=pc.persid)
AND
EXISTS(SELECT * FROM persmaterial pm WHERE pm.materialid=2 AND
p.persid=pm.persid)

Both options are quite different. I am no expert but this is my
understanding of the pros and contras of the 2 options:
*option 1:
-difficult to automatically generate the query from the pseudo query
*option 2:
+easy to generate query from pseudo query
-are subqueries not inefficient? If for each subquery you have to go
over the parameters outside the subquery, it means pretty much that for
each subquery you go over the entire person table; I might be wrong and
subqueries might be more optimized but I have no idea on this

Overall, I would like to know
*if there are other options to translate the above pseudo query into an
sql query?
*what of the otpions (proposed + new ones) are best from a performance
point of view?

-- phpMyAdmin SQL Dump
-- version 2.6.1
-- http://www.phpmyadmin.net
--
-- Host: localhost
-- Generation Time: Oct 19, 2006 at 01:13 PM
-- Server version: 4.1.9
-- PHP Version: 4.3.10
--
-- Database: `aston`
--

-- --------------------------------------------------------

--
-- Table structure for table `color`
--

CREATE TABLE `color` (
`colorid` int(11) NOT NULL auto_increment,
`color` varchar(30) NOT NULL default '',
PRIMARY KEY (`colorid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=5 ;

--
-- Dumping data for table `color`
--

INSERT INTO `color` VALUES (1, 'red');
INSERT INTO `color` VALUES (2, 'green');
INSERT INTO `color` VALUES (3, 'blue');
INSERT INTO `color` VALUES (4, 'yellow');

-- --------------------------------------------------------

--
-- Table structure for table `material`
--

CREATE TABLE `material` (
`materialid` int(11) NOT NULL auto_increment,
`material` varchar(30) NOT NULL default '',
PRIMARY KEY (`materialid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=3 ;

--
-- Dumping data for table `material`
--

INSERT INTO `material` VALUES (1, 'wood');
INSERT INTO `material` VALUES (2, 'iron');

-- --------------------------------------------------------

--
-- Table structure for table `perscolor`
--

CREATE TABLE `perscolor` (
`perscolorid` int(11) NOT NULL auto_increment,
`persid` int(11) NOT NULL default '0',
`colorid` int(11) NOT NULL default '0',
PRIMARY KEY (`perscolorid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=7 ;

--
-- Dumping data for table `perscolor`
--

INSERT INTO `perscolor` VALUES (1, 1, 1);
INSERT INTO `perscolor` VALUES (2, 1, 2);
INSERT INTO `perscolor` VALUES (3, 2, 1);
INSERT INTO `perscolor` VALUES (5, 3, 3);
INSERT INTO `perscolor` VALUES (6, 3, 2);

-- --------------------------------------------------------

--
-- Table structure for table `persmaterial`
--

CREATE TABLE `persmaterial` (
`persmatid` int(11) NOT NULL auto_increment,
`persid` int(11) NOT NULL default '0',
`materialid` int(11) NOT NULL default '0',
PRIMARY KEY (`persmatid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=6 ;

--
-- Dumping data for table `persmaterial`
--

INSERT INTO `persmaterial` VALUES (1, 1, 1);
INSERT INTO `persmaterial` VALUES (2, 1, 2);
INSERT INTO `persmaterial` VALUES (3, 2, 1);
INSERT INTO `persmaterial` VALUES (5, 3, 2);

-- --------------------------------------------------------

--
-- Table structure for table `person`
--

CREATE TABLE `person` (
`persid` int(11) NOT NULL auto_increment,
`name` varchar(30) NOT NULL default '',
PRIMARY KEY (`persid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=4 ;

--
-- Dumping data for table `person`
--

INSERT INTO `person` VALUES (1, 'john');
INSERT INTO `person` VALUES (2, 'emily');
INSERT INTO `person` VALUES (3, 'liz');

Ed Prochak wrote:
Quote:
phlype.johnson (AT) gmail (DOT) com wrote:
Suppose we have to design a database for a recruitment agency. There
will be a table "candidates" with fields "candidateid","last
name","first name" ; the languages mastered by a candidate as
well as the skills are separated into different tables. Eg. The skills
are stored in the table "skills" with fields "skillid",
"skillname","candidateid" where skillid is the primary
autoincrement key. The language table is defined in a similar way. A
candidate having two skills will lead to two records in the table
"skills";

Bzzzt wrong. Well maybe not necessarily. at least you are talking about
tables too soon and cetainly the description of a skills table is not
Normalized. Think of entities and attributes at this point (yes it is
just a straight substitution of words but the thought process is often
driven by how we phrase the question.)

So candidates has attributes: first name, last name, id, DOB, height,
skills, languages spoken, etc
As part of normalizing you realize a candidate can have more than one
skill. Also you realize skills have some of their own attributes. so
you remove skills from the candidate entity, creating a skills entity

Skills include attributes like name, type(musical, industrial,...),
tools required(yes/no), etc

now candidates can have one or more so this leads directly to a
relation I'll label candidate_skill. it has three attributes: candidate
first name, last name, and skills name.

Implementing this design would be easiest in three tables
CANDIDATES, SKILLS, and CANDIDATE_SKILLS

one candidate with two different skills does require two rows in the
CANDIDATE_SKILLS table. Another candidate with 3skills (2 of them the
same as the first candidate requires three rows in the CANDIDATE_SKILLS
table. And in the skills table is just three skills rows, one for each
unique skill.

There would be foreign key constraints from the CANDIDATE_SKILLS table:
from the skill attribute to the skill PK in the SKILLS table. and
from the name attributes to the name attributes forming the PK in the
CANDIDATE table.


... similarly a candidate mastering 3 languages will lead to
3 records in the table "languages".

similarly in a CANDIDATES_LANGUAGES table

Retrieving candidates with skills with skillname="php" and
skillname="asp" as well as languagename="EN" and
languagename="FR" can be done as follows:
SELECT DISTINCT c.name FROM candidates c, skills s, languages l WHERE
l.candidateid=c.candidateid AND s.candidateid=c.candidateid AND
(s.skillname='php' OR s.skillname='asp') AND (l.languagename='FR' OR
l.languagename='EN')
GROUP BY c.name HAVING (count(DISTINCT s.skillid)=2 AND count(DISTINCT
l.languageid)=2)

Ignoring normalization for the moment, why do you think you need
DISTINCT?
the GROUP BY will not return multiple JOHN rows in the result set, so
the DISTINCT is totally redundant in this query.


This already rather complex query (for a non-expert as myself) is
probably the disadvantage of a normalized design.

Well I have ranted about pseudoKeys like your candidate ID and this
shows one of the disadvantages. Using the tables I described above, you
don't even need to visit the candidates table:

SELECT s.first_name,s.last_name
FROM skills s, languages L
WHERE s.first_name=L.first_name AND s.last_name=L.last_name
AND s.skillname IN ('php' , 'asp') AND L.languagename IN ('FR', 'EN')
GROUP BY s.first_name,s.last_name
HAVING (count(DISTINCT s.skillid)=2 AND count(DISTINCT L.languageid)=2)

But there are many more benefits to a normalized schema that far
outweigh the perceived complexity of this query. Consider that to make
it easier for general users to form queries, rather than denormalizing
the tables, consider creating a view for some of the most common data
requests. For example (mushing first and last name down to cname, but
ou get the idea)

create view CANDIDATE_3SKILLS as
SELECT C.*,s1.skillname skill1, s2.skillname skill2, s3.skillname
skill3
from CANDIDATE C, CANDIDATE_SKILLS s1, CANDIDATE_SKILLS s2,
CANDIDATE_SKILLS s3
where CANDIDATE.cname = s1.cname and CANDIDATE.cname = s2.cname and
CANDIDATE.cname = s3.cname and
s1.skillname > s2.skillname and s2.skillname > s3.skillname

of course this only works for candidates with exactly 3 skills.
Candidates with less do not appear and candidates with more appear more
than once. (part of the trouble of denormalizing)

The problem that bothers me most is the following: what happens if I
want to built a search tool to find candidates using a flexible search
interface; in this interface the user could type "meta queries"
such as: find me the candidates that (master php AND (asp OR net)) AND
(that master French OR (Dutch AND German))

Ah, so you are less concerned about the readablilty of the query. good
If you are going for a more natural language query for the user why not
something like:
* find me the candidates that master php and (asp or net) and
that master French or (Dutch and German)

you may want to provide the users some training about grouping via
parentheses. (do you really want someone that speaks ONLY French?)

also it might be much easier if you allow more specific terms such as:
* find me the candidates that program in php and (asp or net) and
that speak French or (Dutch and German)

This query looks quite simple but how to translate this to one query
for the normalized database?

a LOT easier than for a denormalized DB!

It is fairly easy for a language parser program. A good software
engineer can dealt with this. (send me an email and I could help or
recommend someone else).


Do I first have to do the following conversion:
find me the candidates that (master (php AND asp) OR (php AND net)) AND
(that master French OR (Dutch AND German))
becoming
find me the candidates that (master (php AND asp)) AND (that master
French)
OR
find me the candidates that (master (php AND asp)) AND (that master
Dutch AND German)
OR
find me the candidates that (master (php AND net)) AND (that master
French)
OR
find me the candidates that (master (php AND net)) AND (that master
Dutch AND German)

no.

Next I assume the three ORs have to be combined via UNION statements?
This seems quite complicated so I was wondering if there is no easier
way to create an SQL query starting from the "meta query" above?

There are easier ways. You need an interpreter to translate from the
natural meta language query to SQL. It is an interesting kind of
programming. It helps to have a well defined syntax for the "meta
language", but for your case you might have a language that allows some
fluff (such as the opening "find me the " phrase) which makes the user
more comfortable but adds little to the meaning of the entire query.

HTH,
Ed


Reply With Quote
  #4  
Old   
Ed Prochak
 
Posts: n/a

Default Re: Does a normalized design lead to complex queries? - 10-19-2006 , 11:43 AM




phlype.johnson (AT) gmail (DOT) com wrote:
Quote:
Thanks Ed for your nice feedback. I agree that the example was not
completely normalized
however on the redundance of the DISTINCT I disagree. I was planning to
repost my question with some more explanation and as you will see in my
verbose description below: if you leave out the DISTINCT you will
change the result of the query (have a try with the provided mysql
example); the reason is that if you have a person with skills A and B
that also speaks language alpha, the join result of the query looking
for a person having those properties will contain two rows: one for
skill A and one for skill B each time with language alpha, meaning that
alpha appears twice unless you do the COUNT (DISTINCT ...
Just addressing this issue quickly. I read the rest of the post later.
The DISTINCT in the HAVING clause is fine and I agree it should stay,
it was the DISTINCT in the SELECT clause which is redundant. GROUP BY
cannot return repeated rows in the result set.

Ed
(BTW, please avoid top posting in your replies)



Reply With Quote
  #5  
Old   
phlype.johnson@gmail.com
 
Posts: n/a

Default Re: Does a normalized design lead to complex queries? - 10-19-2006 , 12:58 PM




Ed Prochak wrote:
Quote:
phlype.johnson (AT) gmail (DOT) com wrote:
Thanks Ed for your nice feedback. I agree that the example was not
completely normalized
however on the redundance of the DISTINCT I disagree. I was planning to
repost my question with some more explanation and as you will see in my
verbose description below: if you leave out the DISTINCT you will
change the result of the query (have a try with the provided mysql
example); the reason is that if you have a person with skills A and B
that also speaks language alpha, the join result of the query looking
for a person having those properties will contain two rows: one for
skill A and one for skill B each time with language alpha, meaning that
alpha appears twice unless you do the COUNT (DISTINCT ...

Just addressing this issue quickly. I read the rest of the post later.
The DISTINCT in the HAVING clause is fine and I agree it should stay,
it was the DISTINCT in the SELECT clause which is redundant. GROUP BY
cannot return repeated rows in the result set.
OK I agree.

I think I might have found an option 3 that is a good intermediate
solution. Again starting from the pseudo query
Find all persons that (like as color (red OR blue) AND green) AND that
(like as material iron)
it can be straightforwardly translated into
SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
p.persid=pc.persid
AND
(pc.colorid=1 OR pc.colorid=2 OR pc.colorid=3)
AND p.persid=pm.persid
AND pm.materialid=2
GROUP BY p.persid HAVING
sum(case when pc.colorid in ('1','3') then 1 else 0 end) >= 1
AND
sum(case when pc.colorid='2' then 1 else 0 end)>=1
AND
sum(case when pm.materialid='2' then 1 else 0 end)>=1
Actually the first part of the query (part before "having") only
constrains the result set and the final selection happens in the
"having" part.
Actually also this would work
SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
p.persid=pc.persid
AND p.persid=pm.persid
GROUP BY p.persid HAVING
sum(case when pc.colorid in ('1','3') then 1 else 0 end) >= 1
AND
sum(case when pc.colorid='2' then 1 else 0 end)>=1
AND
sum(case when pm.materialid='2' then 1 else 0 end)>=1
but as you can see the intermediate table on which the "group by" is
performed will be many times bigger.

Any ideas/improvements on this option?

Phlype



Reply With Quote
  #6  
Old   
Ed Prochak
 
Posts: n/a

Default Re: Does a normalized design lead to complex queries? - 10-21-2006 , 02:12 AM




phlype.johnson (AT) gmail (DOT) com wrote:
[DISTINCT issue resolved]
Quote:
Here comes my longer post:
which contains two main issues: the database design and the user
language parsing issue. In this post I'll address just the database
design issue.

Quote:
To illustrate better my question on whether normalized designs lead to
more complex queries yes or no, I have prepared an example. The example
is a database with the following tables:
*table person with fields:
-persid: autoincrement id
-name: name of the person
First off I won't rant too much at you about ID column because you are
obviously not a database designer. Let me just say, there are places
where they are appropriate and places where they are not. A Person
table is one place where generally they do end up being used (primarily
due to the difficulty of defining all the needed info to uniquely
identify a person. Short of finger prints or a full DNA sequence, we
are always missing something. A consequence is the possibility of
adding the same person twice for example)

Quote:
*table material with fields:
-materialid: autoincrement id
-material: name of the material eg "wood"
*table color with fields:
-colorid: autoincrement id
-color: name of the color eg "green"
*table persmaterial with fields:
-persmatid: autoincrement id
-persid: link to table person
-materialid: link to table material
here is an example of a wrong application of an ID column. these
relationship tables are uniquely identified by the composite Primary
key. In this table that key is:
persid, materialid
You do not need or want persmatid or the perscolid in the perscolor
table.

Quote:
*table perscolor with fields:
-perscolorid: autoincrement id
-persid: link to table person
-colorid: link to table color

Using these tables it is straightforward to store the preference of a
certain person for colors and materials.
yes.

Quote:
Now use the statements at the end of the post to create the tables and
populate them with intial values. Next we want to find all persons who
like the colors red or blue and also like the color green. Furthermore
the person should have a preference for iron as material. I understand
there are several ways to obtain the requested result.
Option 1:
The initial query can be written in pseudocode as
Find all persons that (like as color (red OR blue) AND green) AND that
(like as material iron). I do not see right away how to write this as a
query with joins so I rewrite the pseudo query as follows:
Find all persons that ((like as color red AND green) AND that (like as
material iron)) OR ((like as color blue AND green) AND that (like as
material iron))
this leads to the following query
(SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
p.persid=pc.persid AND (pc.colorid=1 OR pc.colorid=2) AND
p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING
(count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1))
UNION
(SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
p.persid=pc.persid AND (pc.colorid=2 OR pc.colorid=3) AND
p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING
(count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1))
notice that this query (and the option 2 version) never uses the
primary key for the perscolor and persmaterial tables That should
suggest to you that the extra ID fields are not really the PK.

[]
Quote:
-- phpMyAdmin SQL Dump
-- version 2.6.1
-- http://www.phpmyadmin.net
--
-- Host: localhost
-- Generation Time: Oct 19, 2006 at 01:13 PM
-- Server version: 4.1.9
-- PHP Version: 4.3.10
--
-- Database: `aston`
--

-- --------------------------------------------------------

--
-- Table structure for table `color`
--

CREATE TABLE `color` (
`colorid` int(11) NOT NULL auto_increment,
`color` varchar(30) NOT NULL default '',
PRIMARY KEY (`colorid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=5 ;
just be aware that with the ID as PK, you could end up adding the same
color into the DB twice. Also you cannot enter synonyms for colors (is
pale red the same as pink?)
Just ideas to consider.

Quote:
--
-- Dumping data for table `color`
--

INSERT INTO `color` VALUES (1, 'red');
INSERT INTO `color` VALUES (2, 'green');
INSERT INTO `color` VALUES (3, 'blue');
INSERT INTO `color` VALUES (4, 'yellow');

-- --------------------------------------------------------

--
-- Table structure for table `material`
--

CREATE TABLE `material` (
`materialid` int(11) NOT NULL auto_increment,
`material` varchar(30) NOT NULL default '',
PRIMARY KEY (`materialid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=3 ;

--
-- Dumping data for table `material`
--

INSERT INTO `material` VALUES (1, 'wood');
INSERT INTO `material` VALUES (2, 'iron');

-- --------------------------------------------------------

--
-- Table structure for table `perscolor`
--

CREATE TABLE `perscolor` (
`perscolorid` int(11) NOT NULL auto_increment,
`persid` int(11) NOT NULL default '0',
`colorid` int(11) NOT NULL default '0',
PRIMARY KEY (`perscolorid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=7 ;

MySQL does support compond keys doesn't it? so my suggestion was to
change this to:
CREATE TABLE `perscolor` (
`persid` int(11) NOT NULL,
`colorid` int(11) NOT NULL,
PRIMARY KEY (`persid', 'colorid`)
) ;

I think the default option is also a bad idea. Why would wour
application allow reference to a color 0 which doesn't exist? That is a
serious Referential Integrity error. And what would it mean to have and
entry here for John without a color? (1,0) So that is why I deleted the
default option. This clearly applies to persmaterial as well.


Quote:
--
-- Dumping data for table `perscolor`
--

INSERT INTO `perscolor` VALUES (1, 1, 1);
INSERT INTO `perscolor` VALUES (2, 1, 2);
INSERT INTO `perscolor` VALUES (3, 2, 1);
INSERT INTO `perscolor` VALUES (5, 3, 3);
INSERT INTO `perscolor` VALUES (6, 3, 2);

-- --------------------------------------------------------

--
-- Table structure for table `persmaterial`
--

CREATE TABLE `persmaterial` (
`persmatid` int(11) NOT NULL auto_increment,
`persid` int(11) NOT NULL default '0',
`materialid` int(11) NOT NULL default '0',
PRIMARY KEY (`persmatid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=6 ;

--
-- Dumping data for table `persmaterial`
--

INSERT INTO `persmaterial` VALUES (1, 1, 1);
INSERT INTO `persmaterial` VALUES (2, 1, 2);
INSERT INTO `persmaterial` VALUES (3, 2, 1);
INSERT INTO `persmaterial` VALUES (5, 3, 2);

-- --------------------------------------------------------

--
-- Table structure for table `person`
--

CREATE TABLE `person` (
`persid` int(11) NOT NULL auto_increment,
`name` varchar(30) NOT NULL default '',
PRIMARY KEY (`persid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=4 ;

--
-- Dumping data for table `person`
--

INSERT INTO `person` VALUES (1, 'john');
INSERT INTO `person` VALUES (2, 'emily');
INSERT INTO `person` VALUES (3, 'liz');

Part of the process of normalization is eliminating unneeded attributes
and identifying the primary key. A compound primary key is valid and
necessary in many cases.

Hope this helps.
Ed



Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.