dbTalk Databases Forums  

[SQL] Table Design for Hierarchical Data

mailing.database.pgsql-sql mailing.database.pgsql-sql


Discuss [SQL] Table Design for Hierarchical Data in the mailing.database.pgsql-sql forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Lee Hachadoorian
 
Posts: n/a

Default [SQL] Table Design for Hierarchical Data - 04-06-2010 , 12:33 PM






Please point me to another listserv or forum if this question is more
appropriately addressed elsewhere.

I am trying to come up with a structure to store employment data by NAICS
(North American Industrial Classification System). The data uses a
hierarchical encoding scheme ranging between 2 and 5 digits. That is, each
2-digit code includes all industries beginning with the same two digits. 61
includes 611 which includes 6111, 6112, 6113, etc. A portion of the
hierarchy is shown after the sig.

A standard way to store hierarchical data is the adjacency list model, where
each node's parent appears as an attribute (table column). So 6111 would
list 611 as its parent. Since NAICS uses a hierarchical encoding scheme, the
node's name is the same as the node's id, and the parent can always be
derived from the node's id. Storing the parent id separately would seem to
violate a normal form (because of the redundancy).

One way to store this data would be to store at the most granular level
(5-digit NAICS) and then aggregate up if I wanted employment at the 4-, 3-,
or 2-digit level. The problem is that because of nondisclosure rules, the
data is sometimes censored at the more specific level. I might, for example,
have data for 6114, but not 61141, 61142, 61143. For a different branch of
the tree, I might have data at the 5-digit level while for yet another
branch I might have data only to the 3-digit level (not 4 or 5). I think
that means I have to store all data at multiple levels, even if some of the
higher-level data could be reconstructed from other, lower-level data.

Specifically I'd like to know if this should be a single table or should
there be a separate table for each level of the hierarchy (four in all)? If
one table, should the digits be broken into separate columns? Should parent
ids be stored in each node?

More generally, what questions should I be asking to help decide what
structure makes the most sense? Are there any websites, forums, or books
that cover this kind of problem?

Regards,
--Lee

--
Lee Hachadoorian
PhD Student, Geography
Program in Earth & Environmental Sciences
CUNY Graduate Center

A Portion of the NAICS scheme

61 Educational Services
611 Educational Services
6111 Elementary and Secondary Schools
61111 Elementary and Secondary Schools
6112 Junior Colleges
61121 Junior Colleges
6113 Colleges, Universities, and Professional Schools
61131 Colleges, Universities, and Professional Schools
6114 Business Schools and Computer and Management Training
61141 Business and Secretarial Schools
61142 Computer Training
61143 Professional and Management Development Training
etc

Reply With Quote
  #2  
Old   
Michael Glaesemann
 
Posts: n/a

Default Re: [SQL] Table Design for Hierarchical Data - 04-06-2010 , 05:04 PM






On Apr 6, 2010, at 13:33 , Lee Hachadoorian wrote:

Quote:
A standard way to store hierarchical data is the adjacency list model, where
each node's parent appears as an attribute (table column).
Another is nested sets which performs quite nicely for loads which are moreread than write (which I suspect is the case here).

Quote:
So 6111 would
list 611 as its parent. Since NAICS uses a hierarchical encoding scheme, the
node's name is the same as the node's id, and the parent can always be
derived from the node's id. Storing the parent id separately would seem to
violate a normal form (because of the redundancy).
I'd consider the code a representation of the node structure rather than the implementation of the node structure.

Quote:
The problem is that because of nondisclosure rules, the
data is sometimes censored at the more specific level.
I don't know if this is per-user or per-category or what, but it may be something you store separately from the main table.

Quote:
Specifically I'd like to know if this should be a single table or should
there be a separate table for each level of the hierarchy (four in all)?
I'd say one table for hierarchy and possibly another for the permissions data.

Quote:
If one table, should the digits be broken into separate columns?
Probably not.


Quote:
Should parent
ids be stored in each node?
Only if you use an encoding scheme (such as adjacency list) which requires it.

Michael Glaesemann
grzm seespotcode net




--
Sent via pgsql-sql mailing list (pgsql-sql (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply With Quote
  #3  
Old   
Richard Broersma
 
Posts: n/a

Default Re: [SQL] Table Design for Hierarchical Data - 04-06-2010 , 05:48 PM



On Tue, Apr 6, 2010 at 3:04 PM, Michael Glaesemann <grzm (AT) seespotcode (DOT) net> wrote:

Quote:
Another is nested sets which performs quite nicely for loads which are more read than write (which I suspect is the case here).
Pg 9.0 has two new features are nice for both Nest set trees.
one is deferrable unique constraints.

While 8.4 has CTE's which are good for querying adjacency list tree,
we need to wait for write-able CTE's (maybe 9.1?) to preform all of
the possible tree modifications.




--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug

--
Sent via pgsql-sql mailing list (pgsql-sql (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply With Quote
  #4  
Old   
Steve Crawford
 
Posts: n/a

Default Re: [SQL] Table Design for Hierarchical Data - 04-06-2010 , 07:23 PM



Lee Hachadoorian wrote:
Quote:
I am trying to come up with a structure to store employment data by
NAICS (North American Industrial Classification System). The data uses
a hierarchical encoding scheme ranging between 2 and 5 digits. That
is, each 2-digit code includes all industries beginning with the same
two digits. 61 includes 611 which includes 6111, 6112, 6113, etc. A
portion of the hierarchy is shown after the sig.
From the http://www.census.gov/eos/www/naics/ website:

"NAICS is a two- through six-digit hierarchical classification system,
offering five levels of detail. Each digit in the code is part of a
series of progressively narrower categories, and the more digits in the
code signify greater classification detail. The first two digits
designate the economic sector, the third digit designates the subsector,
the fourth digit designates the industry group, the fifth digit
designates the NAICS industry, and the sixth digit designates the
national industry. The five-digit NAICS code is the level at which there
is comparability in code and definitions for most of the NAICS sectors
across the three countries participating in NAICS (the United States,
Canada, and Mexico). The six-digit level allows for the United States,
Canada, and Mexico each to have country-specific detail. A complete and
valid NAICS code contains six digits."

I think I'd be inclined to store it as defined above with tables for
sector, subsector, industry-group and NAICS-industry. So the NAICS table
might have a primary key of industry_code (11131, Orange Groves) and a
industry_group column with a foreign-key constraint to the
industry-group table (1113, Fruit and Tree Nut Farming). You might add
a constraint to ensure that the industry-group is the appropriate
substring of the naics code and so on up the heirarchy. If you are
dealing with importing a large amount of static source data for
analysis, these tables will also be tailor-made places to do
pre-aggregation.

Adjacency lists work well in certain cases where the depths of the trees
are variable or indeterminate. For example, think of an employee->boss
org-chart for a large company. The maintenance supervisor for an area
might be a dozen levels below the CEO and be a several levels above the
branch night janitor while the CEO's personal assistant is just one
level down but with no direct reports. The CTE/recursive-query features
in 8.4 are great for this. But in the case you have described, the
number of levels is well defined as is the type of information
associated with each level.

But this all depends on the nature of your source data, how often it is
updated, how big it is and the questions you want answered. It might be
perfectly acceptable to just have the 5-digit code on all your
individual data records and do something like select ... group by
substr(full_naics_code,1,3) where substr(full_naics_code,1,2)='61'). In
this case you will still want to keep the NAICS definition table
separate and link to it.

One question that might impact this is the coding of your source data.
Is it all full 5-digit coding or are some records coded at a high level
of detail and others only to the top-level?

Quote:
One way to store this data would be to store at the most granular
level (5-digit NAICS) and then aggregate up if I wanted employment at
the 4-, 3-, or 2-digit level. The problem is that because of
nondisclosure rules, the data is sometimes censored at the more
specific level. I might, for example, have data for 6114, but not
61141, 61142, 61143. For a different branch of the tree, I might have
data at the 5-digit level while for yet another branch I might have
data only to the 3-digit level (not 4 or 5). I think that means I have
to store all data at multiple levels, even if some of the higher-level
data could be reconstructed from other, lower-level data.

What do you mean by censored? Is the data supplied to you pre-aggregated
to some level and censored to preserve confidentiality or are do you
have the record-level source data and the responsibility to suppress
data in your reports? Is the data suppression ad-hoc (i.e. someone comes
to you and says don't display these five aggregates), based on simple
rules (don't display any aggregate with fewer than 15 records) or on
more complex rules (don't display any data that would allow calculation
of a group of fewer than 15)? My guess is that the multi-table scenario
will be better suited to flagging aggregates for suppression.

Cheers,
Steve

--
Sent via pgsql-sql mailing list (pgsql-sql (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply With Quote
  #5  
Old   
silly sad
 
Posts: n/a

Default Re: [SQL] Table Design for Hierarchical Data - 04-07-2010 , 12:41 AM



single table.
nested tree + ordinal parent reference.

nests are calculated in a trigger on insert.

--
Sent via pgsql-sql mailing list (pgsql-sql (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply With Quote
  #6  
Old   
silly sad
 
Posts: n/a

Default Re: [SQL] Table Design for Hierarchical Data - 04-07-2010 , 12:43 AM



P.S.
almost foget, do not try any oracle-like "tree-jouns" or "special types"
or such a crap.

your problem as plain as to store a pair of integers
(or numerics (i prefer))

--
Sent via pgsql-sql mailing list (pgsql-sql (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply With Quote
  #7  
Old   
Scott Marlowe
 
Posts: n/a

Default Re: [SQL] Table Design for Hierarchical Data - 04-07-2010 , 01:17 AM



On Tue, Apr 6, 2010 at 11:43 PM, silly sad <sad (AT) bankir (DOT) ru> wrote:
Quote:
P.S.
almost foget, do not try any oracle-like "tree-jouns" or "special types" or
such a crap.

your problem as plain as to store a pair of integers
(or numerics (i prefer))
Since it's an identifier and not really a numeric per se, I'd store it
as text. I mean it could as easily be a 5 character alpha code as 5
character number code.

With tet you can create indexes on substring(idfield,1,1),
substring(idfield,1,2), substring(idfield,1,3),
substring(idfield,1,4), and substring(idfield,1,5) for fast lookups
and matching.

--
Sent via pgsql-sql mailing list (pgsql-sql (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply With Quote
  #8  
Old   
Achilleas Mantzios
 
Posts: n/a

Default Re: [SQL] Table Design for Hierarchical Data - 04-07-2010 , 02:00 AM



You could also consider the genealogical approach, e.g.


postgres@dynacom=# \d paintgentypes
Table "public.paintgentypes"
Column | Type | Modifiers
---------+-----------+---------------------------------------------------------------------------
id | integer | not null default nextval(('public.paintgentypes_id_seq'::text)::reg class)
name | text | not null
parents | integer[] |
Indexes:
"paintgentypes_pkey" PRIMARY KEY, btree (id)
"paintgentypes_name2" UNIQUE, btree (name) WHERE parents IS NULL
"paintgentypes_uk" UNIQUE, btree (name, parents)
"paintgentypes_first" btree (first(parents))
"paintgentypes_last" btree (last(parents))
"paintgentypes_level" btree (level(parents))
"paintgentypes_name" btree (name)
"paintgentypes_parents" gin (parents gin__int_ops)

The indexes are based on the contrib/intarray package.
It is very fast to do any operation on this tree.
Also it is very fast to search for the parent of any node, or the
children of any node, or the whole subtree of any node, or the depth of anynode in the tree.

The parents of any node to the root, i.e. the path of any node to the root are depicted as
parents[0] : immediate parent
parents[1] : immediate parent of the above parent
......
parents[n] : root of the tree

Στις Tuesday 06 April 2010 20:33:18 ο/η Lee Hachadoorian *γραψε:
Quote:
Please point me to another listserv or forum if this question is more
appropriately addressed elsewhere.

I am trying to come up with a structure to store employment data by NAICS
(North American Industrial Classification System). The data uses a
hierarchical encoding scheme ranging between 2 and 5 digits. That is, each
2-digit code includes all industries beginning with the same two digits. 61
includes 611 which includes 6111, 6112, 6113, etc. A portion of the
hierarchy is shown after the sig.

A standard way to store hierarchical data is the adjacency list model, where
each node's parent appears as an attribute (table column). So 6111 would
list 611 as its parent. Since NAICS uses a hierarchical encoding scheme, the
node's name is the same as the node's id, and the parent can always be
derived from the node's id. Storing the parent id separately would seem to
violate a normal form (because of the redundancy).

One way to store this data would be to store at the most granular level
(5-digit NAICS) and then aggregate up if I wanted employment at the 4-, 3-,
or 2-digit level. The problem is that because of nondisclosure rules, the
data is sometimes censored at the more specific level. I might, for example,
have data for 6114, but not 61141, 61142, 61143. For a different branch of
the tree, I might have data at the 5-digit level while for yet another
branch I might have data only to the 3-digit level (not 4 or 5). I think
that means I have to store all data at multiple levels, even if some of the
higher-level data could be reconstructed from other, lower-level data.

Specifically I'd like to know if this should be a single table or should
there be a separate table for each level of the hierarchy (four in all)? If
one table, should the digits be broken into separate columns? Should parent
ids be stored in each node?

More generally, what questions should I be asking to help decide what
structure makes the most sense? Are there any websites, forums, or books
that cover this kind of problem?

Regards,
--Lee



--
Achilleas Mantzios

--
Sent via pgsql-sql mailing list (pgsql-sql (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply With Quote
  #9  
Old   
silly sad
 
Posts: n/a

Default Re: [SQL] Table Design for Hierarchical Data - 04-07-2010 , 02:53 AM



On 04/07/10 11:00, Achilleas Mantzios wrote:

Quote:
Column | Type | Modifiers
---------+-----------+---------------------------------------------------------------------------
id | integer | not null default nextval(('public.paintgentypes_id_seq'::text)::reg class)
name | text | not null
parents | integer[] |

The parents of any node to the root, i.e. the path of any node to the root are depicted as
parents[0] : immediate parent
parents[1] : immediate parent of the above parent
.....
parents[n] : root of the tree
what this schema gives?

(1) the parent branch in one select.
what else?
nothing.

compare it to a nested-tree

id | integer | NOT NULL
name | text | not null
parent | integer |
l | numeric
r | numeric

(1) parent branch in one select
(2) child subtree in one select
(it makes a sence!)



--
Sent via pgsql-sql mailing list (pgsql-sql (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply With Quote
  #10  
Old   
Yeb Havinga
 
Posts: n/a

Default Re: [SQL] Table Design for Hierarchical Data - 04-07-2010 , 03:06 AM



Achilleas Mantzios wrote:
Quote:
You could also consider the genealogical approach, e.g.


The parents of any node to the root, i.e. the path of any node to the root are depicted as
parents[0] : immediate parent
parents[1] : immediate parent of the above parent

What I have more than one parent?

regards,
Yeb Havinga


--
Sent via pgsql-sql mailing list (pgsql-sql (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.