Hi there,
This post may be clumsy, it's not so easy to be concise and precise at
the same time, but I really need an advice here.
I crosspost with comp.databases.theory since I'm not sure the problem is
related to MySQL only.
*Summary*
I have a remark regarding the "ref" join type and the "in" operator. In
my opinion it appears to have some important limitations. But since a
similar behavior occurs with other RDMS such as Oracle, I guess there is
nothing that could be done to improve that scenario ?
*Presentation*
Server version: 5.1.32-community-log MySQL Community Server (GPL)
Engine: InnoDB
I have two tables, entity and positionedelement:
Field Type Null Key Default Extra
------------ ------------ ------- ------ ---------- --------
uidEntity int(11) NO PRI (null)
gid bigint(20) NO UNI (null)
entityName varchar(255) NO (null)
Quote:
show /*formatted for 80 col*/ create index from entity; |
Non_unique Key_name Seq_in_index Column_name Cardinality
----------- ------------ ------------- ------------- ------------
0 PRIMARY 1 uidEntity 3199989
1 fullIdx 1 gid 3199989
1 fullIdx 2 uidEntity 3199989
Field Type Null Key Default Extra
------------------- ---------- ------- ------ ---------- --------
idPositionedElement int(11) NO PRI (null)
startPosition int(11) NO (null)
endPosition int(11) NO (null)
uidEntity int(11) NO MUL (null)
idKnowledgeSet int(11) NO MUL (null)
Quote:
show /*formatted for 80 col*/ create index from positionedelement; |
Non_unique Key_name Seq_in_index Column_name Cardinality
---------- ---------- ------------- ------------------- -----------
0 PRIMARY 1 idPositionedElement 19014004
1 pEntityIdx 1 uidEntity 6338001
1 pEntityIdx 2 idKnowledgeSet 19014004
1 ksIdx 1 idKnowledgeSet 1358143
The idea is that positionedelement references an entity (using uidEntity).
Some hints that may help understand further:
1) The "join repartition" is very disparate, ie some entities are not
referenced by any positionedelement, while some other may have 200.000
references.
2) The mean expected matching value for a random value of uidEntity is
around 2 or 3:
mysql> select count(*) from entity\G
*************************** 1. row ***************************
count(*): 3264185
3) For some strange reason, there is no actual FK constraint between
positionedelement.uidEntity and entity.uidEntity but I'm not sure it
have any influence here.
Ok now some queries:
Quote:
explain select count(0) from entity where uidEntity = 62\G |
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: entity
type: const
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 1
Extra: Using index
Quote:
explain select count(0) from entity where uidEntity in (62,64)\G
*************************** 1. row ***************************
|
id: 1
select_type: SIMPLE
table: entity
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 2
Extra: Using where; Using index
Ok, uidEntity is the PK, no problem here, there will be 2 rows returned.
Quote:
explain select count(0) from positionedelement where uidEntity=(62)\G
*************************** 1. row ***************************
|
id: 1
select_type: SIMPLE
table: positionedelement
type: ref
possible_keys: pEntityIdx
key: pEntityIdx
key_len: 4
ref: const
rows: 24840
Extra: Using index
Quote:
select count(0) from positionedelement where uidEntity = (62)\G
*************************** 1. row ***************************
|
count(0): 98431
Quote:
explain select count(0) from positionedelement where uidEntity=(63)\G
*************************** 1. row ***************************
|
id: 1
select_type: SIMPLE
table: positionedelement
type: ref
possible_keys: pEntityIdx
key: pEntityIdx
key_len: 4
ref: const
rows: 1
Extra: Using index
Quote:
select count(0) from positionedelement where uidEntity = (63)\G
*************************** 1. row ***************************
|
count(0): 0
Quote:
explain select count(0) from positionedelement where uidEntity=(64)\G
*************************** 1. row ***************************
|
id: 1
select_type: SIMPLE
table: positionedelement
type: ref
possible_keys: pEntityIdx
key: pEntityIdx
key_len: 4
ref: const
rows: 13364
Extra: Using index
Quote:
select count(0) from positionedelement where uidEntity = (64)\G
*************************** 1. row ***************************
|
count(0): 14909
Quote:
explain select count(0) from positionedeleent where uidEntity in (62,64)\G
*************************** 1. row ***************************
|
id: 1
select_type: SIMPLE
table: positionedelement
type: range
possible_keys: pEntityIdx
key: pEntityIdx
key_len: 4
ref: NULL
rows: 38204
Extra: Using where; Using index
So everything is correct up to now. The optimizer is able to get a good
evaluation for different value of uidEntity, and is even able to do it
with an in operator, which is really great !
*The problem*
The previous query is a real use case for me, but I also need to get
some extra informations stored in the entity table (like the entityName
column). So the idea is just to make a straight inner join and let it be:
Quote:
explain select count(0) from positionedelement pe inner join entity e
on e.uidEntity = pe.uidEntity where pe.uidEntity in (62,64)\G
|
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: e
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 2
Extra: Using where; Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: pe
type: ref
possible_keys: pEntityIdx
key: pEntityIdx
key_len: 4
ref: basf.e.uidEntity
rows: 3
Extra: Using index
Quote:
select count(0) from positionedelement pe inner join entity e on
e.uidEntity = pe.uidEntity where pe.uidEntity in (62,64)\G
|
*************************** 1. row ***************************
count(0): 113340
Geez !
The execution plan stay the same *BUT* the cost evaluation is totally
different, and in some extend, completely wrong. The optimizer takes the
average matching rows (3 here), instead of the more precise evaluation
is able to do when there is no join as seen previously (38204)
And this cause lot's of havoc in my application where this query is a
sub part of a more complex one, that will be poorly executed because of
the somehow wrong evaluation.
Note that it works perfectly with the "=" operator like here:
Quote:
explain select count(0) from positionedelement pe inner join entity e
on e.uidEntity = pe.uidEntity where pe.uidEntity = (62)\G
|
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: e
type: const
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 1
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: pe
type: ref
possible_keys: pEntityIdx
key: pEntityIdx
key_len: 4
ref: const
rows: 24840
Extra: Using index
2 rows in set (0.05 sec)
I may understand that using the "in" operator is somehow more
challenging for the optimizer than a simple "=". But since it's able to
get it right without the join, it really is disappointing that it can't
handle a simple ref join, don't you think ?
Is it a structural consequence of the way the engine is built that can't
be overcome, or do you rather consider that as a temporary limitation
that may disappear in future release of MySQL ?
Am I missing something that may help me around this problem ? For the
moment I force the execution plan using an hint, which is risky if the
actual data repartition completely change and invalidate this forced plan.
The only workaround I could think of, is to iterate on the values in the
in, and for each of them using the "=" operator. This is not very
satisfying, since I may sometimes have 40 to 200 values...
Thanks for any help in understanding this problem.
(fu2)
--
Hugo