dbTalk Databases Forums  

Index structure using CLUSTER, column order and REORG.

comp.databases.ibm-db2 comp.databases.ibm-db2


Discuss Index structure using CLUSTER, column order and REORG. in the comp.databases.ibm-db2 forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Fin
 
Posts: n/a

Default Index structure using CLUSTER, column order and REORG. - 06-27-2011 , 04:09 PM






Say I have a table with say 5 columns being
(txn_date date,
id integer,
status integer,
value dec(8,2)
symbol integer)

Table contains some 30m records.

I then have 2 indexes being :
create unique index a (txn_date, status, id) include (value, symbol)
create index b (id, status, txn_date)

The date will have far more id's per date, than dates per id's and I
mean FAR more. Perhaps a max of 50 dates per id, whereas many
thousands of ids per date. I am not inserting new id's but I am
inserting via single daily stored procedure more dates per id.

How do I determine which is better to apply a CLUSTER to for REORG
purposes, and how do I determine the optimum order of the fields ?
Also how do I determine PCTFREE and MINPCTUSED values for each ?

Reply With Quote
  #2  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: Index structure using CLUSTER, column order and REORG. - 07-04-2011 , 01:24 AM






On 2011-06-27 23:09, Fin wrote:
Quote:
Say I have a table with say 5 columns being
(txn_date date,
id integer,
status integer,
value dec(8,2)
symbol integer)

Table contains some 30m records.

I then have 2 indexes being :
create unique index a (txn_date, status, id) include (value, symbol)
create index b (id, status, txn_date)

Why is b not a unique index?

Quote:
The date will have far more id's per date, than dates per id's and I
mean FAR more. Perhaps a max of 50 dates per id, whereas many
thousands of ids per date. I am not inserting new id's but I am
inserting via single daily stored procedure more dates per id.

How do I determine which is better to apply a CLUSTER to for REORG
purposes, and how do I determine the optimum order of the fields ?
Also how do I determine PCTFREE and MINPCTUSED values for each ?
Not sure I understand your issue with clustering. To determine the
cluster index I would look at the queries executed. Correct clustering
avoids sorting, but without knowing what the queries are it's difficult
to determine a clustering order.


/Lennart

Reply With Quote
  #3  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: Index structure using CLUSTER, column order and REORG. - 07-04-2011 , 01:28 AM



On 2011-07-04 08:24, Lennart Jonsson wrote:
[...]
Quote:
Not sure I understand your issue with clustering. To determine the
cluster index I would look at the queries executed. Correct clustering
avoids sorting, but without knowing what the queries are it's difficult
to determine a clustering order.


Also, you can get some advise on clustering from db2advis, use -mIC.


/Lennart

Reply With Quote
  #4  
Old   
Fin
 
Posts: n/a

Default Re: Index structure using CLUSTER, column order and REORG. - 07-05-2011 , 12:13 AM



Problem is I have literally hundreds of queries to the table in question. I've been looking at snapshots and questioning the number of rows read, versus the number of rows selected, where there is a massive difference and I mean HUGE.

Now I read recently some articles on index structure and the 3 tiers of pages within being Root, Intermediate and Leaf pages. From what I now understand, When constructing an index, it is better to create multi-column indexes.. Which is perfectly understandable and I was always doing so. My problem is which order to create the index columns to avoid leaf scans.

The issue is I have queries that use the ID and date columns independently so assigning one or the other as the first column will always result in leaf scans unless I create more than 1 multi-column index with differing orders.

As in (ID, DATE ....) and another one with (DATE, ID ...). Here I now have to decide where to place the CLUSTER in which of the two indexes. If I onlyhave one index (ID, DATE ...) then any query that has a predicate of WHEREDATE = etc results in leafs scans, so I have to create more than one to avoid such.

Most of REALLY important queries use the DATE to join to other tables and then use a predicate of WHERE ID = x . Now given the nature of the data inthe table with thousands of IDs per date, and say only 50 dates per ID, I'm trying to understand where it is better to place a CLUSTER clause and usethe other index clauses such as PCTFREE and MINPCTUSED, PAGE SPLIT HIGH (as most sorts are ASC).
We are inserting say 50k records daily for new DATEs to existing IDs.

I have another table that I use for PERIOD IDs. 1 record per business day, with 5 dates in sequence and an ID. ID, D1_DATE, D2_DATE, D3_DATE, D4_DATE,D5_DATE.
I use this table to apply sequence to existing records by joining the DATE column of a table to say D2_DATE, I can query other date tables by joining on D1_DATE to find records ocurring in sequence.

This Period table however gets massive rows read values for very few records selected, and I am not sure I understand why. This table only has some 3200 records in it but rows read in the hundreds of millions when joining to other date tables. Here again I have the issue or how to create the index for if I start with (d1_date, d2_date ... ) but have to join on the d3_date we again get the leaf scans. So am I better off creating multi-column indexes with all combinations and where do I add a cluster clause ?

Very confusing for a non-technical self taught pleb like myself. Sorry for the essay but you can see my dilemma.

Reply With Quote
  #5  
Old   
whatever
 
Posts: n/a

Default Re: Index structure using CLUSTER, column order and REORG. - 07-05-2011 , 11:12 AM



On Jul 5, 1:13*am, Fin <tdavi... (AT) hotmail (DOT) com> wrote:
Quote:
Problem is I have literally hundreds of queries to the table in question.I've been looking at snapshots and questioning the number of rows read, versus the number of rows selected, where there is a massive difference and Imean HUGE.

Now I read recently some articles on index structure and the 3 tiers of pages within being Root, Intermediate and Leaf pages. From what I now understand, When constructing an index, it is better to create multi-column indexes. Which is perfectly understandable and I was always doing so. My problemis which order to create the index columns to avoid leaf scans.

The issue is I have queries that use the ID and date columns independently so assigning one or the other as the first column will always result in leaf scans unless I create more than 1 multi-column index with differing orders.

As in (ID, DATE ....) and another one with (DATE, ID ...). Here I now have to decide where to place the CLUSTER in which of the two indexes. If I only have one index (ID, DATE ...) then any query that has a predicate of WHERE DATE = etc results in leafs scans, so I have to create more than one to avoid such.

Most of REALLY important queries use the DATE to join to other tables andthen use a predicate of WHERE ID = x . Now given the nature of the data in the table with thousands of IDs per date, and say only 50 dates per ID, I'm trying to understand where it is better to place a CLUSTER clause and use the other index clauses such as PCTFREE and MINPCTUSED, PAGE SPLIT HIGH (as most sorts are ASC).
We are inserting say 50k records daily for new DATEs to existing IDs.

I have another table that I use for PERIOD IDs. 1 record per business day, with 5 dates in sequence and an ID. ID, D1_DATE, D2_DATE, D3_DATE, D4_DATE, D5_DATE.
I use this table to apply sequence to existing records by joining the DATE column of a table to say D2_DATE, I can query other date tables by joining on D1_DATE to find records ocurring in sequence.

This Period table however gets massive rows read values for very few records selected, and I am not sure I understand why. This table only has some 3200 records in it but rows read in the hundreds of millions when joining to other date tables. Here again I have the issue or how to create the indexfor if I start with (d1_date, d2_date ... ) but have to join on the d3_date we again get the leaf scans. So am I better off creating multi-column indexes with all combinations and where do I add a cluster clause ?

Very confusing for a non-technical self taught pleb like myself. Sorry for the essay but you can see my dilemma.

I take this Question as how do we determine the ORDER of the COLUMNS
in an INDEX and The parameters you mentioned ?

•When creating a multicolumn index, the first columns of the index
should be the ones that are used most often by the predicates in your
query.
•To improve join performance with a multiple-column relational index,
if you have more than one choice for the first key column, use the
column most often specified with the "=" (equijoin) predicate or the
column with the greatest number of distinct values as the first key.

Please check the below link

http://publib.boulder.ibm.com/infoce.../c0005054.html

Cheers
Shashi Mannepalli

Reply With Quote
  #6  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: Index structure using CLUSTER, column order and REORG. - 07-05-2011 , 04:09 PM



On 2011-07-05 07:13, Fin wrote:
Quote:
Problem is I have literally hundreds of queries to the table in question.
Are these "different" questions or same questions with different
parameters? If you are on 9.7 you can turn on the stmt concentrator if
needed/wanted

I've been looking at snapshots and questioning the number of rows read,
versus the number of rows selected, where there is a massive difference
and I mean HUGE.
Quote:
This usually means there is room for index improvements, so you are on
the right track


Quote:
Now I read recently some articles on index structure and the 3 tiers of pages within being Root, Intermediate and Leaf pages. From what I now understand, When constructing an index, it is better to create multi-column indexes. Which is perfectly understandable and I was always doing so. My problem is which order to create the index columns to avoid leaf scans.

The issue is I have queries that use the ID and date columns independently so assigning one or the other as the first column will always result in leaf scans unless I create more than 1 multi-column index with differing orders.

If you are refering to my comment on why b is not unique, I don't
question the need for extra indexes. It's just that a is unique and b
contains the same columns, but is not unique. You might want to make b
unique as well. The unique property can be used by the optimizer to make
better decisions. There might however be a slight performance penalty
during insert/update


Quote:
As in (ID, DATE ....) and another one with (DATE, ID ...). Here I now have to decide where to place the CLUSTER in which of the two indexes. If I only have one index (ID, DATE ...) then any query that has a predicate of WHERE DATE = etc results in leafs scans, so I have to create more than one to avoid such.

Most of REALLY important queries use the DATE to join to other tables and then use a predicate of WHERE ID = x . Now given the nature of the data in the table with thousands of IDs per date, and say only 50 dates per ID, I'm trying to understand where it is better to place a CLUSTER clause and use the other index clauses such as PCTFREE and MINPCTUSED, PAGE SPLIT HIGH (as most sorts are ASC).
We are inserting say 50k records daily for new DATEs to existing IDs.

I have another table that I use for PERIOD IDs. 1 record per business day, with 5 dates in sequence and an ID. ID, D1_DATE, D2_DATE, D3_DATE, D4_DATE, D5_DATE.
I use this table to apply sequence to existing records by joining the DATE column of a table to say D2_DATE, I can query other date tables by joining on D1_DATE to find records ocurring in sequence.

This Period table however gets massive rows read values for very few records selected, and I am not sure I understand why. This table only has some 3200 records in it but rows read in the hundreds of millions when joining to other date tables. Here again I have the issue or how to create the index for if I start with (d1_date, d2_date ... ) but have to join on the d3_date we again get the leaf scans. So am I better off creating multi-column indexes with all combinations and where do I add a cluster clause ?

Very confusing for a non-technical self taught pleb like myself. Sorry for the essay but you can see my dilemma.
You seem to be doing fine by asking a lot of justified questions. Here's
my two cents, Collect the queries that runs against the table and there
frequency using either snapshots or the sysibmadm.snapdyn_sql view. If
there are hundreds of different questions, start with say the 10 most
read rows questions or 10 most logical read questions. In my experience
they are often the same.

Prepare a file with these questions and there frequency and feed them to
db2advis with option -mIC. The outcome is often a very good start, but
you can't take it's word for granted. Often redundant indexes are
suggested which can be ignored.

To me clustering and order by is closely related, I usually look for
order by and group by queries when determinig clustering index on a table.

Sorry for not giving any specific advise about clustering - but atleast
for me - it is diffiicult without knowing the workload.


/Lennart

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.