dbTalk Databases Forums  

Re: [BUGS] Notes about behaviour of SIMILAR TO operator

mailing.database.pgsql-bugs mailing.database.pgsql-bugs


Discuss Re: [BUGS] Notes about behaviour of SIMILAR TO operator in the mailing.database.pgsql-bugs forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Tom Lane
 
Posts: n/a

Default Re: [BUGS] Notes about behaviour of SIMILAR TO operator - 11-20-2003 , 05:01 PM






Adam Buraczewski <adamb (AT) nor (DOT) pl> writes:
Quote:
... for example the pattern 'a|z' (which should match single 'a' or 'z'
characters only, according to SQL spec) is converted into POSIX
regular expression in the form of '^a|z$' which matches all strings
beginning with 'a' ('abcdef' for example) and all strings ending with
'z' ('xyz' for example). So the meaning of the pattern is changed,
which is not good.
Hm, that's a mistake, it should probably translate to ^(a|z)$ instead.

Quote:
The behaviour above is also caused by similar_escape(), which converts
'[_]' to '^[.]$' and '[%]' to '^[.*]$', not noticing the simple fact
that these characters are inside brackets.
As near as I can tell, the SQL spec requires special characters to be
escaped when they are inside a bracket construct. So indeed the above
are invalid SQL regexes.

Quote:
Talking about square brackets, it should be noticed that there is a
slight difference between SIMILAR TO and POSIX way of describing named
character classes.
Mmm, yeah, that looks like a mess.

Quote:
This at least could be avoided simply by prepending regular expression
returned by similar_escape() with a magic sequence '***:' which
switches regexp engine into ARE mode.
Good point. Actually, do we want to force ARE mode, or something simpler?
Perhaps ERE or even BRE would be a better match to the SQL spec.

Quote:
I think I am able to write such a patch in my spare time,
Go to it ...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo (AT) postgresql (DOT) org)


Reply With Quote
  #2  
Old   
Adam Buraczewski
 
Posts: n/a

Default Re: [BUGS] Notes about behaviour of SIMILAR TO operator - 11-21-2003 , 09:48 AM






Hallo Tom,

I decided to improve similar_escape() function during the weekend.
Thank you very much for the excerpt from SQL standard (I think this is
much more complete than the text I found in a Working Draft from
August 1994). However, there are some more issues I'd like to make
clear before function redesigning.

First of all, I found that SQL spec says that when one of the items of
SIMILAR TO construct (<character match value>, <similar pattern> or
<escape character>) is NULL, then result of the whole construct should
be unknown. However, PostgreSQL treats SIMILAR TO ... ESCAPE NULL the
same way as when no ESCAPE clause is present, which is wrong:

('a' similar to 'a') is true
('a' similar to 'a' escape null) is true (should be unknown!)

The behaviour above is caused by the fact that escape character is
passed to similar_escape() without checking for null value, and the
same null value is passed to it when there is no ESCAPE clause at all.
I think that either PostgreSQL should check for nulls in SIMILAR TO
construct before calling similar_escape(), or there should be two
versions of similar_escape() function: one getting only one argument
(for SIMILAR TO without ESCAPE) and second, getting two arguments
(a pattern and an escape char). Which solution is better?

Quote:
As near as I can tell, the SQL spec requires special characters to be
escaped when they are inside a bracket construct. So indeed the above
are invalid SQL regexes.
How the function should behave when such an invalid pattern is passed
as its argument? Should it throw an error (this is what SQL spec
says) or tolerate as much mistakes as possible, generating some
warnings only?

Quote:
Good point. Actually, do we want to force ARE mode, or something simpler?
Perhaps ERE or even BRE would be a better match to the SQL spec.
I think that there is no difference which regexp dialect is choosen,
only the speed matters. Function translating SIMILAR TO patterns into
POSIX regular expressions will be more or less the same. What should
I choose then?

BTW, should I write some regression tests for SIMILAR TO? Is there
any guide how to do it (except PostgreSQL sources)? Should the
changes be written for CVS HEAD only or 7.4/7.3 branches either?

Regards,

--
Adam Buraczewski <adamb (at) nor (dot) pl> * Linux user #165585
GCS/TW d- s-:+>+:- a C+++(++++) UL++++$ P++ L++++ E++ W+ N++ o? K w--
O M- V- PS+ !PE Y PGP+ t+ 5 X+ R tv- b+ DI D G++ e+++>++++ h r+>++ y?


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org


Reply With Quote
  #3  
Old   
Tom Lane
 
Posts: n/a

Default Re: [BUGS] Notes about behaviour of SIMILAR TO operator - 11-21-2003 , 10:08 AM



Adam Buraczewski <adamb (AT) nor (DOT) pl> writes:
Quote:
('a' similar to 'a' escape null) is true (should be unknown!)
Yeah, you are right; this is because we are overloading a "null" second
parameter to mean "the ESCAPE part wasn't present", which in hindsight
wasn't such a hot idea.

Quote:
I think that either PostgreSQL should check for nulls in SIMILAR TO
construct before calling similar_escape(), or there should be two
versions of similar_escape() function: one getting only one argument
(for SIMILAR TO without ESCAPE) and second, getting two arguments
(a pattern and an escape char). Which solution is better?
I think the latter is the only reasonable solution, but it will be
something that we cannot implement in the 7.4.* series, because adding
another function implies initdb. I'd suggest submitting one patch that
fixes everything but the NULL problem, which we could back-patch into
7.4, and then a second patch that splits the function into two for 7.5.

Quote:
As near as I can tell, the SQL spec requires special characters to be
escaped when they are inside a bracket construct. So indeed the above
are invalid SQL regexes.

How the function should behave when such an invalid pattern is passed
as its argument? Should it throw an error (this is what SQL spec
says) or tolerate as much mistakes as possible, generating some
warnings only?
I don't have a strong opinion --- could go with either behavior. You
might want to take it up on the pgsql-sql list.

Quote:
Good point. Actually, do we want to force ARE mode, or something simpler?
Perhaps ERE or even BRE would be a better match to the SQL spec.

I think that there is no difference which regexp dialect is choosen,
only the speed matters. Function translating SIMILAR TO patterns into
POSIX regular expressions will be more or less the same. What should
I choose then?
I doubt there would be any speed difference. The advantage of a dumber
RE flavor is that it would have fewer "extra" features that might be
unintentionally triggered by a translated pattern, leading to just the
sort of non-SQL-compliant behavior you are complaining of ...

Quote:
BTW, should I write some regression tests for SIMILAR TO?
Sure. Look at some of the existing regression tests for examples.

Quote:
Should the changes be written for CVS HEAD only or 7.4/7.3 branches
either?
I don't see that we'd bother applying it to 7.3, but 7.4 branch yes,
if you avoid any changes in the function's API for the 7.4 version.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.