dbTalk Databases Forums  

DB2 optimizer ignoring multi-column stats in multi-column joins?

comp.databases.ibm-db2 comp.databases.ibm-db2


Discuss DB2 optimizer ignoring multi-column stats in multi-column joins? in the comp.databases.ibm-db2 forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Oliver Schoett
 
Posts: n/a

Default DB2 optimizer ignoring multi-column stats in multi-column joins? - 11-22-2011 , 09:28 AM






In DB2 9.5.8 on AIX, I have come across a bad query plan that is to
some extent caused by the optimizer wrongly estimating the initial
NLJOIN (nested-loop join) of the plan (with a more accurate estimate,
it would not have chosen that plan).

Here is the gist of the situation: we have a join of two tables over
two columns, where the column statistics and the combined column
statistics in both tables are known to the optimizer.

The query is as follows:

select * from M, P
where M.R1 = P.F1
and M.R2 = P.F2

The table M has an index (R2, R1, ...), table P has an index (F1, F2,
....). Both indexes have FIRST2KEYCARD values in syscat.indexes. The
cardinalities are as follows

M.R1 P.F1
126* 100
FIRST2KEYCARD FIRST2KEYCARD
182981 4300
M.R2 P.F2
3107 3128*

From the cardinalities of the base tables and the optimizer's estimate
of the result cardinality, it can be seen that the optimizer
calculates the selectivity of the join predicates to be

1 / 394128 = 1 / (126 * 3128)

In other words, the optimizer seems to have chosen the starred values
as the basis of its selectivity estimate and completely ignored the
combined column statistics - presumably because the starred values
occur in different tables and hence have no combined column
statistics.

The underlying problem seems to be that the optimizer tries to achieve
the *strongest* possible selectivity estimate by selecting the highest
column cardinalities first. Instead, it should use the *weakest*
selectivity estimate, which here would be 1 / 4300 for the combined
column statistics on table P.

The observed pattern of mixing columns from both tables for the
selectivity estimate seems to prevent the application of joint column
statistics for a large number of multi-column join conditions.

Regards,

Oliver Schoett

Reply With Quote
  #2  
Old   
Oliver Schoett
 
Posts: n/a

Default Solved: DB2 optimizer ignoring multi-column stats in multi-columnjoins? - 01-12-2012 , 11:31 AM






Oliver Schoett wrote:
Quote:
In DB2 9.5.8 on AIX, I have come across a bad query plan that is to
some extent caused by the optimizer wrongly estimating the initial
NLJOIN (nested-loop join) of the plan (with a more accurate estimate,
it would not have chosen that plan).

Here is the gist of the situation: we have a join of two tables over
two columns, where the column statistics and the combined column
statistics in both tables are known to the optimizer.
The problem observed is that the optimizer ignores the joint column
statistics and instead multiplies the selectivities of the fields
involved. When the fields are correlated, their combined selectivity is
much weaker than this product.

An explanation for this phenomenon can be found in the following article
by an IBM engineer:

Understand column group statistics in DB2

http://www.ibm.com/developerworks/da...oor/index.html

The article explains that a joint column statistic is used only if the
following conditions are true:

(1) All the fields must occur in equations between one table A and one
table B,

(2) One of the two tables must be the "parent table" of the join in the
sense that

(3) for each join condition, the column statistics of the fields
involved (COLCARD, LOW2KEY, HIGH2KEY) must be consistent with the child
column values being a subset of the parent column values.

In this case, the joint column statistics on the parent table are used.

The problem in our case was that the COLCARD values were greater in one
table for one field and in the other table for the other field, so that
no parent-child relation betwen the tables existed.

Our query performed 650 times faster after we established a parent-child
relationship by modifying the column statistics in the SYSSTAT schema.

However, this has the side effect that the modified statistics are no
longer accurate. Our customer is skeptical about this approach and
wonders whether there is a cleaner way of making the optimizer consider
the joint column statistics?

Regards,

Oliver Schoett

Reply With Quote
  #3  
Old   
Frederik Engelen
 
Posts: n/a

Default Re: Solved: DB2 optimizer ignoring multi-column stats in multi-column joins? - 01-14-2012 , 03:07 AM



On 12 jan, 18:31, Oliver Schoett <oliver.scho... (AT) capgemini (DOT) com> wrote:
Quote:
Oliver Schoett wrote:
In DB2 9.5.8 on AIX, I have come across a bad query plan that is to
some extent caused by the optimizer wrongly estimating the initial
NLJOIN (nested-loop join) of the plan (with a more accurate estimate,
it would not have chosen that plan).

Here is the gist of the situation: we have a join of two tables over
two columns, where the column statistics and the combined column
statistics in both tables are known to the optimizer.

The problem observed is that the optimizer ignores the joint column
statistics and instead multiplies the selectivities of the fields
involved. *When the fields are correlated, their combined selectivity is
much weaker than this product.

An explanation for this phenomenon can be found in the following article
by an IBM engineer:

* * Understand column group statistics in DB2

http://www.ibm.com/developerworks/da...cle/dm-0612kap...

The article explains that a joint column statistic is used only if the
following conditions are true:

(1) All the fields must occur in equations between one table A and one
table B,

(2) One of the two tables must be the "parent table" of the join in the
sense that

(3) for each join condition, the column statistics of the fields
involved (COLCARD, LOW2KEY, HIGH2KEY) must be consistent with the child
column values being a subset of the parent column values.

In this case, the joint column statistics on the parent table are used.

The problem in our case was that the COLCARD values were greater in one
table for one field and in the other table for the other field, so that
no parent-child relation betwen the tables existed.

Our query performed 650 times faster after we established a parent-child
relationship by modifying the column statistics in the SYSSTAT schema.

However, this has the side effect that the modified statistics are no
longer accurate. Our customer is skeptical about this approach and
wonders whether there is a cleaner way of making the optimizer consider
the joint column statistics?

Regards,

Oliver Schoett
Thanks for sharing your solution and the link. I put it on my to-read
list, thanks.

I think what you are looking for instead of manually updating the
statistics is using a statistical view:
http://www.ibm.com/developerworks/da...e/dm-0612chen/

--
Frederik Engelen

Reply With Quote
  #4  
Old   
Oliver Schoett
 
Posts: n/a

Default Re: Solved: DB2 optimizer ignoring multi-column stats in multi-columnjoins? - 01-24-2012 , 10:28 AM



Frederik Engelen wrote:

Quote:
I think what you are looking for instead of manually updating the
statistics is using a statistical view:
http://www.ibm.com/developerworks/da...e/dm-0612chen/

Thank you, that approach looks like it should work. In my case, the bad
query plan starts with a join between two tables that produces millions
of rows and consumes almost the entire running time of the bad plan. To
avoid this, I tried to define this join as a statistical view and run
statistics on it.

Unfortunately, the RUNSTATS was still not finished after 24 hours and
processing 8 billion rows. In other words, the plan chosen by the
optimizer is so bad that I cannot run the RUNSTATS for it.

(The RUNSTATS takes longer than the actual query, because the query
contains some additional selection citeria entered by the user. Since
these data can vary from execution to execution, the statistical view
cannot contain these additional criteria, but must process the tables in
their entirety.)

I think that IBM is needlessly restrictive about applying the joint
column statistics that the optimizer knows about. It should apply these
statistics whenever they are applicable, that is, whenever the join
condition contains some fields known to be correlated.

For example, the join conditions could even relate to different tables
or be arbitrary expressions: if a correlation is known between the
fields T.X and T.Y, then the cardinality estimates for

select ... from ... T, U, V, ... where T.X = U.X and T.Y = V.Y
select ... from ... T, ... where T.X = expr1 and T.Y = expr2

should take this correlation into account.

Regards,

Oliver Schoett

Reply With Quote
  #5  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: DB2 optimizer ignoring multi-column stats in multi-column joins? - 01-25-2012 , 12:47 AM



On 2011-11-22 16:28, Oliver Schoett wrote:
[...]

Quote:
Here is the gist of the situation: we have a join of two tables over
two columns, where the column statistics and the combined column
statistics in both tables are known to the optimizer.

The query is as follows:

select * from M, P
where M.R1 = P.F1
and M.R2 = P.F2

The table M has an index (R2, R1, ...), table P has an index (F1, F2,
...). Both indexes have FIRST2KEYCARD values in syscat.indexes. The
cardinalities are as follows

M.R1 P.F1
126* 100
FIRST2KEYCARD FIRST2KEYCARD
182981 4300
M.R2 P.F2
3107 3128*

From the cardinalities of the base tables and the optimizer's estimate
of the result cardinality, it can be seen that the optimizer
calculates the selectivity of the join predicates to be

1 / 394128 = 1 / (126 * 3128)

From my understanding (which is a bit limited), the selectivity estimate
for a join is 1 / max(cc1, cc2), AND corresponds to * which would give us:

1 / max(126, 100) * 1 / max(3107, 3128)

I don't have the time right now, but I'll try to investigate whether I
can get db2 to use FIRST2KEYCARD as an estimate for join selectivity and
get back on this.

/Lennart

Reply With Quote
  #6  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: DB2 optimizer ignoring multi-column stats in multi-column joins? - 01-25-2012 , 05:28 AM



On 2012-01-25 07:47, Lennart Jonsson wrote:
Quote:
On 2011-11-22 16:28, Oliver Schoett wrote:
[...]

Here is the gist of the situation: we have a join of two tables over
two columns, where the column statistics and the combined column
statistics in both tables are known to the optimizer.

The query is as follows:

select * from M, P
where M.R1 = P.F1
and M.R2 = P.F2

The table M has an index (R2, R1, ...), table P has an index (F1, F2,
...). Both indexes have FIRST2KEYCARD values in syscat.indexes. The
cardinalities are as follows

M.R1 P.F1
126* 100
FIRST2KEYCARD FIRST2KEYCARD
182981 4300
M.R2 P.F2
3107 3128*

From the cardinalities of the base tables and the optimizer's estimate
of the result cardinality, it can be seen that the optimizer
calculates the selectivity of the join predicates to be

1 / 394128 = 1 / (126 * 3128)


From my understanding (which is a bit limited), the selectivity estimate
for a join is 1 / max(cc1, cc2), AND corresponds to * which would give us:

1 / max(126, 100) * 1 / max(3107, 3128)

I don't have the time right now, but I'll try to investigate whether I
can get db2 to use FIRST2KEYCARD as an estimate for join selectivity and
get back on this.

Oliver, can you post ddl for M and P and also mimic stats for them?

db2look -d <db> -z schema> -t <table> -m


/Lennart

Reply With Quote
  #7  
Old   
Oliver Schoett
 
Posts: n/a

Default Re: DB2 optimizer ignoring multi-column stats in multi-column joins? - 01-27-2012 , 06:54 AM



Lennart Jonsson wrote:

Quote:
From my understanding (which is a bit limited), the selectivity estimate
for a join is 1 / max(cc1, cc2), AND corresponds to * which would give us:

1 / max(126, 100) * 1 / max(3107, 3128)
Yes, this is exactly what the optimizer seems to be doing, ignoring the
FIRST2KEYCARDS of 182981 and 4300 and thus arriving at a cardinality
estimate that is far too low. In my mind, it should use the highest
possible estimate, which is

1 / min(FIRST2KEYCARD(M.R1, M.R2), FIRST2KEYCARD(P.F1, P.F2)
= 1 / min(182981, 4300) = 1 / 4300

That comes much closer to the actual value.

I now have access to version 9.7.5 of DB2 and can present the actual
cardinalities of the join, obtained with Serge Rielau's package
(https://www.ibm.com/developerworks/m...ts23?lang=en):

24,6238
6,33755e+006
NLJOIN
( 63)
3347
NA
/----------+----------\
36,1058 0,68199
1578 4016,19
FETCH FETCH
( 64) ( 68)
2231,52 30,2802
NA NA
/---+----\ /---+----\
4880,92 2,37706e+007 9,93351 3,83622e+006
12161 NA 22544,3 NA
RIDSCN TABLE: SCHEMA RIDSCN TABLE: SCHEMA
( 65) TABLEMMMMMM ( 69) TABLEPPPPPP
168,626 Q32 22,7089 Q28
NA NA
Quote:
|
4880,92 9,93351
10583 18528,1
SORT SORT
( 66) ( 70)
168,625 22,7086
NA NA
Quote:
|
4880,92 9,93351
10583 18528,1
IXSCAN IXSCAN
( 67) ( 71)
166,817 22,7075
NA NA
Quote:
|
2,37706e+007 3,83622e+006
NA NA
INDEX: SCHEMA INDEX: SCHEMA
IDXMMMMMMN02 IDXPPPPPPQ01
Q32 Q28

The problem can be seen in the IXSCAN (71) on the right, where the
optimizer scans for the two join predicates. The estimated result
cardinality is ~10 (first line), and the actual cardinality is ~18528
(second line).

Sorry, but I cannot post the dblook info for this database, as it is
client confidential. I think, however, that the given numbers together
with the developerworks article I quoted
(http://www.ibm.com/developerworks/da...e/dm-0612chen/) are
sufficient to see how the optimizer arrives at its erroneous estimates.

Regards,

Oliver Schoett

Reply With Quote
  #8  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: DB2 optimizer ignoring multi-column stats in multi-column joins? - 01-28-2012 , 04:17 AM



On 2012-01-27 13:54, Oliver Schoett wrote:
Quote:
Lennart Jonsson wrote:

From my understanding (which is a bit limited), the selectivity estimate
for a join is 1 / max(cc1, cc2), AND corresponds to * which would give us:

1 / max(126, 100) * 1 / max(3107, 3128)

Yes, this is exactly what the optimizer seems to be doing, ignoring the
FIRST2KEYCARDS of 182981 and 4300 and thus arriving at a cardinality
estimate that is far too low. In my mind, it should use the highest
possible estimate, which is

1 / min(FIRST2KEYCARD(M.R1, M.R2), FIRST2KEYCARD(P.F1, P.F2)
= 1 / min(182981, 4300) = 1 / 4300

That comes much closer to the actual value.

It would be interesting to see the error in estimation for the two
formulas under various circomstances (cardinalities and distribution). I
don't have the time now, but if I come up with something of interest
I'll post.

Quote:
I now have access to version 9.7.5 of DB2 and can present the actual
cardinalities of the join, obtained with Serge Rielau's package
(https://www.ibm.com/developerworks/m...ts23?lang=en):

Thanks Oliver, I only had a quick glance at the profiler before (still
on 9.5), but this looks very nifty indeed. I will surely revisit this
article and try out some of the stuff there.

[...]


/Lennart

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.