![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| ||||||
| ||||||
|
|
A standard way to store hierarchical data is the adjacency list model, where each node's parent appears as an attribute (table column). |
|
So 6111 would list 611 as its parent. Since NAICS uses a hierarchical encoding scheme, the node's name is the same as the node's id, and the parent can always be derived from the node's id. Storing the parent id separately would seem to violate a normal form (because of the redundancy). |
|
The problem is that because of nondisclosure rules, the data is sometimes censored at the more specific level. |
|
Specifically I'd like to know if this should be a single table or should there be a separate table for each level of the hierarchy (four in all)? |
|
If one table, should the digits be broken into separate columns? |
|
Should parent ids be stored in each node? |
#3
| |||
| |||
|
|
Another is nested sets which performs quite nicely for loads which are more read than write (which I suspect is the case here). |
#4
| |||
| |||
|
|
I am trying to come up with a structure to store employment data by NAICS (North American Industrial Classification System). The data uses a hierarchical encoding scheme ranging between 2 and 5 digits. That is, each 2-digit code includes all industries beginning with the same two digits. 61 includes 611 which includes 6111, 6112, 6113, etc. A portion of the hierarchy is shown after the sig. From the http://www.census.gov/eos/www/naics/ website: |
|
One way to store this data would be to store at the most granular level (5-digit NAICS) and then aggregate up if I wanted employment at the 4-, 3-, or 2-digit level. The problem is that because of nondisclosure rules, the data is sometimes censored at the more specific level. I might, for example, have data for 6114, but not 61141, 61142, 61143. For a different branch of the tree, I might have data at the 5-digit level while for yet another branch I might have data only to the 3-digit level (not 4 or 5). I think that means I have to store all data at multiple levels, even if some of the higher-level data could be reconstructed from other, lower-level data. What do you mean by censored? Is the data supplied to you pre-aggregated |
#5
| |||
| |||
|
#6
| |||
| |||
|
#7
| |||
| |||
|
|
P.S. almost foget, do not try any oracle-like "tree-jouns" or "special types" or such a crap. your problem as plain as to store a pair of integers (or numerics (i prefer)) |
#8
| |||
| |||
|
|
Please point me to another listserv or forum if this question is more appropriately addressed elsewhere. I am trying to come up with a structure to store employment data by NAICS (North American Industrial Classification System). The data uses a hierarchical encoding scheme ranging between 2 and 5 digits. That is, each 2-digit code includes all industries beginning with the same two digits. 61 includes 611 which includes 6111, 6112, 6113, etc. A portion of the hierarchy is shown after the sig. A standard way to store hierarchical data is the adjacency list model, where each node's parent appears as an attribute (table column). So 6111 would list 611 as its parent. Since NAICS uses a hierarchical encoding scheme, the node's name is the same as the node's id, and the parent can always be derived from the node's id. Storing the parent id separately would seem to violate a normal form (because of the redundancy). One way to store this data would be to store at the most granular level (5-digit NAICS) and then aggregate up if I wanted employment at the 4-, 3-, or 2-digit level. The problem is that because of nondisclosure rules, the data is sometimes censored at the more specific level. I might, for example, have data for 6114, but not 61141, 61142, 61143. For a different branch of the tree, I might have data at the 5-digit level while for yet another branch I might have data only to the 3-digit level (not 4 or 5). I think that means I have to store all data at multiple levels, even if some of the higher-level data could be reconstructed from other, lower-level data. Specifically I'd like to know if this should be a single table or should there be a separate table for each level of the hierarchy (four in all)? If one table, should the digits be broken into separate columns? Should parent ids be stored in each node? More generally, what questions should I be asking to help decide what structure makes the most sense? Are there any websites, forums, or books that cover this kind of problem? Regards, --Lee |
#9
| |||
| |||
|
|
Column | Type | Modifiers ---------+-----------+--------------------------------------------------------------------------- id | integer | not null default nextval(('public.paintgentypes_id_seq'::text)::reg class) name | text | not null parents | integer[] | The parents of any node to the root, i.e. the path of any node to the root are depicted as parents[0] : immediate parent parents[1] : immediate parent of the above parent ..... parents[n] : root of the tree |
#10
| |||
| |||
|
|
You could also consider the genealogical approach, e.g. The parents of any node to the root, i.e. the path of any node to the root are depicted as parents[0] : immediate parent parents[1] : immediate parent of the above parent What I have more than one parent? |
![]() |
| Thread Tools | |
| Display Modes | |
| |