dbTalk Databases Forums  

optimzing query with substr

comp.databases.postgresql comp.databases.postgresql


Discuss optimzing query with substr in the comp.databases.postgresql forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
reppisch
 
Posts: n/a

Default optimzing query with substr - 08-01-2008 , 04:51 AM






Hi Ng,

maybe this question is not closely relatd to postgres but maybe someone
can proide a better way than my approach.

I'm trying to implement a blacklist lookup.
The blacklist table contains strings (or beginning-parts of strings)
which are considered "black". Lenght of this test patterns may be up to
25 char. And there may be *a lot* of them.
A pattern 'ABC' in the table allways means 'ABC.*' as a regexp.

Now i need a fast way to test a single string varchar(25) if it matches
against one (ore more) candidates in the table.

---------
A naive way may be having a table
blacklist
(
syskey integer NOT NULL
, haystack varchar(25) NOT NULL
);

containing
1, AB
2, EFG

Testing the string needle = 'ABC' would then lead to the dynamic
creation (because of the variable number of characters in 'needle') of a
select like

select count(*)
from blacklist
where
(substr(haystack,1,1) = '0' or substr(haystack,1,1) ='')
and (substr(haystack,2,1) = '9' or substr(haystack,2,1) ='')
and (substr(haystack,3,1) = '6' or substr(haystack,3,1) ='')

This is very expensive.

---------
Maybe with a table layout like
blacklist
(
syskey integer NOT NULL
, haystack_1 char NOT NULL
...
, haystack_25 char NOT NULL
);

or the postgres-specific array type i could save the substr and use
preapred statements.

The table-access is 99% read-only, so creating extensive indexes would
be ok.
---------

I also could rearrange the table to have haystack-entries
1, AB%
2, EFG%
and use the LIKE operator like

select count(*)
from blacklist
where 'ABC' like haystack

EXPLAIN think's this is quicker.


I think we can do better.

Any ideas?

Regards,
Michael

Reply With Quote
  #2  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: optimzing query with substr - 08-01-2008 , 11:29 AM






reppisch <spam (AT) reppisch (DOT) de> wrote:
Quote:
maybe this question is not closely relatd to postgres but maybe someone
can proide a better way than my approach.

I'm trying to implement a blacklist lookup.
The blacklist table contains strings (or beginning-parts of strings)
which are considered "black". Lenght of this test patterns may be up to
25 char. And there may be *a lot* of them.
A pattern 'ABC' in the table allways means 'ABC.*' as a regexp.

Now i need a fast way to test a single string varchar(25) if it matches
against one (ore more) candidates in the table.

---------
A naive way may be having a table
blacklist
(
syskey integer NOT NULL
, haystack varchar(25) NOT NULL
);

containing
1, AB
2, EFG

Testing the string needle = 'ABC' would then lead to the dynamic
creation (because of the variable number of characters in 'needle') of a
select like

select count(*)
from blacklist
where
(substr(haystack,1,1) = '0' or substr(haystack,1,1) ='')
and (substr(haystack,2,1) = '9' or substr(haystack,2,1) ='')
and (substr(haystack,3,1) = '6' or substr(haystack,3,1) ='')

This is very expensive.

---------
Maybe with a table layout like
blacklist
(
syskey integer NOT NULL
, haystack_1 char NOT NULL
...
, haystack_25 char NOT NULL
);

or the postgres-specific array type i could save the substr and use
preapred statements.

The table-access is 99% read-only, so creating extensive indexes would
be ok.
---------

I also could rearrange the table to have haystack-entries
1, AB%
2, EFG%
and use the LIKE operator like

select count(*)
from blacklist
where 'ABC' like haystack

EXPLAIN think's this is quicker.


I think we can do better.
Maybe I get something wrong here, but

SELECT count(*) FROM blacklist WHERE 'ABC' || '%' LIKE haystack;

should work nicely with your original query and use an index on the
column "haystack".

If it does not use the index, try again with more rows in the table
or issue "SET enable_seqscan=off;" for test purposes.

Yours,
Laurenz Albe


Reply With Quote
  #3  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: optimzing query with substr - 08-01-2008 , 11:29 AM



reppisch <spam (AT) reppisch (DOT) de> wrote:
Quote:
maybe this question is not closely relatd to postgres but maybe someone
can proide a better way than my approach.

I'm trying to implement a blacklist lookup.
The blacklist table contains strings (or beginning-parts of strings)
which are considered "black". Lenght of this test patterns may be up to
25 char. And there may be *a lot* of them.
A pattern 'ABC' in the table allways means 'ABC.*' as a regexp.

Now i need a fast way to test a single string varchar(25) if it matches
against one (ore more) candidates in the table.

---------
A naive way may be having a table
blacklist
(
syskey integer NOT NULL
, haystack varchar(25) NOT NULL
);

containing
1, AB
2, EFG

Testing the string needle = 'ABC' would then lead to the dynamic
creation (because of the variable number of characters in 'needle') of a
select like

select count(*)
from blacklist
where
(substr(haystack,1,1) = '0' or substr(haystack,1,1) ='')
and (substr(haystack,2,1) = '9' or substr(haystack,2,1) ='')
and (substr(haystack,3,1) = '6' or substr(haystack,3,1) ='')

This is very expensive.

---------
Maybe with a table layout like
blacklist
(
syskey integer NOT NULL
, haystack_1 char NOT NULL
...
, haystack_25 char NOT NULL
);

or the postgres-specific array type i could save the substr and use
preapred statements.

The table-access is 99% read-only, so creating extensive indexes would
be ok.
---------

I also could rearrange the table to have haystack-entries
1, AB%
2, EFG%
and use the LIKE operator like

select count(*)
from blacklist
where 'ABC' like haystack

EXPLAIN think's this is quicker.


I think we can do better.
Maybe I get something wrong here, but

SELECT count(*) FROM blacklist WHERE 'ABC' || '%' LIKE haystack;

should work nicely with your original query and use an index on the
column "haystack".

If it does not use the index, try again with more rows in the table
or issue "SET enable_seqscan=off;" for test purposes.

Yours,
Laurenz Albe


Reply With Quote
  #4  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: optimzing query with substr - 08-01-2008 , 11:29 AM



reppisch <spam (AT) reppisch (DOT) de> wrote:
Quote:
maybe this question is not closely relatd to postgres but maybe someone
can proide a better way than my approach.

I'm trying to implement a blacklist lookup.
The blacklist table contains strings (or beginning-parts of strings)
which are considered "black". Lenght of this test patterns may be up to
25 char. And there may be *a lot* of them.
A pattern 'ABC' in the table allways means 'ABC.*' as a regexp.

Now i need a fast way to test a single string varchar(25) if it matches
against one (ore more) candidates in the table.

---------
A naive way may be having a table
blacklist
(
syskey integer NOT NULL
, haystack varchar(25) NOT NULL
);

containing
1, AB
2, EFG

Testing the string needle = 'ABC' would then lead to the dynamic
creation (because of the variable number of characters in 'needle') of a
select like

select count(*)
from blacklist
where
(substr(haystack,1,1) = '0' or substr(haystack,1,1) ='')
and (substr(haystack,2,1) = '9' or substr(haystack,2,1) ='')
and (substr(haystack,3,1) = '6' or substr(haystack,3,1) ='')

This is very expensive.

---------
Maybe with a table layout like
blacklist
(
syskey integer NOT NULL
, haystack_1 char NOT NULL
...
, haystack_25 char NOT NULL
);

or the postgres-specific array type i could save the substr and use
preapred statements.

The table-access is 99% read-only, so creating extensive indexes would
be ok.
---------

I also could rearrange the table to have haystack-entries
1, AB%
2, EFG%
and use the LIKE operator like

select count(*)
from blacklist
where 'ABC' like haystack

EXPLAIN think's this is quicker.


I think we can do better.
Maybe I get something wrong here, but

SELECT count(*) FROM blacklist WHERE 'ABC' || '%' LIKE haystack;

should work nicely with your original query and use an index on the
column "haystack".

If it does not use the index, try again with more rows in the table
or issue "SET enable_seqscan=off;" for test purposes.

Yours,
Laurenz Albe


Reply With Quote
  #5  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: optimzing query with substr - 08-01-2008 , 11:29 AM



reppisch <spam (AT) reppisch (DOT) de> wrote:
Quote:
maybe this question is not closely relatd to postgres but maybe someone
can proide a better way than my approach.

I'm trying to implement a blacklist lookup.
The blacklist table contains strings (or beginning-parts of strings)
which are considered "black". Lenght of this test patterns may be up to
25 char. And there may be *a lot* of them.
A pattern 'ABC' in the table allways means 'ABC.*' as a regexp.

Now i need a fast way to test a single string varchar(25) if it matches
against one (ore more) candidates in the table.

---------
A naive way may be having a table
blacklist
(
syskey integer NOT NULL
, haystack varchar(25) NOT NULL
);

containing
1, AB
2, EFG

Testing the string needle = 'ABC' would then lead to the dynamic
creation (because of the variable number of characters in 'needle') of a
select like

select count(*)
from blacklist
where
(substr(haystack,1,1) = '0' or substr(haystack,1,1) ='')
and (substr(haystack,2,1) = '9' or substr(haystack,2,1) ='')
and (substr(haystack,3,1) = '6' or substr(haystack,3,1) ='')

This is very expensive.

---------
Maybe with a table layout like
blacklist
(
syskey integer NOT NULL
, haystack_1 char NOT NULL
...
, haystack_25 char NOT NULL
);

or the postgres-specific array type i could save the substr and use
preapred statements.

The table-access is 99% read-only, so creating extensive indexes would
be ok.
---------

I also could rearrange the table to have haystack-entries
1, AB%
2, EFG%
and use the LIKE operator like

select count(*)
from blacklist
where 'ABC' like haystack

EXPLAIN think's this is quicker.


I think we can do better.
Maybe I get something wrong here, but

SELECT count(*) FROM blacklist WHERE 'ABC' || '%' LIKE haystack;

should work nicely with your original query and use an index on the
column "haystack".

If it does not use the index, try again with more rows in the table
or issue "SET enable_seqscan=off;" for test purposes.

Yours,
Laurenz Albe


Reply With Quote
  #6  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: optimzing query with substr - 08-01-2008 , 11:29 AM



reppisch <spam (AT) reppisch (DOT) de> wrote:
Quote:
maybe this question is not closely relatd to postgres but maybe someone
can proide a better way than my approach.

I'm trying to implement a blacklist lookup.
The blacklist table contains strings (or beginning-parts of strings)
which are considered "black". Lenght of this test patterns may be up to
25 char. And there may be *a lot* of them.
A pattern 'ABC' in the table allways means 'ABC.*' as a regexp.

Now i need a fast way to test a single string varchar(25) if it matches
against one (ore more) candidates in the table.

---------
A naive way may be having a table
blacklist
(
syskey integer NOT NULL
, haystack varchar(25) NOT NULL
);

containing
1, AB
2, EFG

Testing the string needle = 'ABC' would then lead to the dynamic
creation (because of the variable number of characters in 'needle') of a
select like

select count(*)
from blacklist
where
(substr(haystack,1,1) = '0' or substr(haystack,1,1) ='')
and (substr(haystack,2,1) = '9' or substr(haystack,2,1) ='')
and (substr(haystack,3,1) = '6' or substr(haystack,3,1) ='')

This is very expensive.

---------
Maybe with a table layout like
blacklist
(
syskey integer NOT NULL
, haystack_1 char NOT NULL
...
, haystack_25 char NOT NULL
);

or the postgres-specific array type i could save the substr and use
preapred statements.

The table-access is 99% read-only, so creating extensive indexes would
be ok.
---------

I also could rearrange the table to have haystack-entries
1, AB%
2, EFG%
and use the LIKE operator like

select count(*)
from blacklist
where 'ABC' like haystack

EXPLAIN think's this is quicker.


I think we can do better.
Maybe I get something wrong here, but

SELECT count(*) FROM blacklist WHERE 'ABC' || '%' LIKE haystack;

should work nicely with your original query and use an index on the
column "haystack".

If it does not use the index, try again with more rows in the table
or issue "SET enable_seqscan=off;" for test purposes.

Yours,
Laurenz Albe


Reply With Quote
  #7  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: optimzing query with substr - 08-01-2008 , 11:29 AM



reppisch <spam (AT) reppisch (DOT) de> wrote:
Quote:
maybe this question is not closely relatd to postgres but maybe someone
can proide a better way than my approach.

I'm trying to implement a blacklist lookup.
The blacklist table contains strings (or beginning-parts of strings)
which are considered "black". Lenght of this test patterns may be up to
25 char. And there may be *a lot* of them.
A pattern 'ABC' in the table allways means 'ABC.*' as a regexp.

Now i need a fast way to test a single string varchar(25) if it matches
against one (ore more) candidates in the table.

---------
A naive way may be having a table
blacklist
(
syskey integer NOT NULL
, haystack varchar(25) NOT NULL
);

containing
1, AB
2, EFG

Testing the string needle = 'ABC' would then lead to the dynamic
creation (because of the variable number of characters in 'needle') of a
select like

select count(*)
from blacklist
where
(substr(haystack,1,1) = '0' or substr(haystack,1,1) ='')
and (substr(haystack,2,1) = '9' or substr(haystack,2,1) ='')
and (substr(haystack,3,1) = '6' or substr(haystack,3,1) ='')

This is very expensive.

---------
Maybe with a table layout like
blacklist
(
syskey integer NOT NULL
, haystack_1 char NOT NULL
...
, haystack_25 char NOT NULL
);

or the postgres-specific array type i could save the substr and use
preapred statements.

The table-access is 99% read-only, so creating extensive indexes would
be ok.
---------

I also could rearrange the table to have haystack-entries
1, AB%
2, EFG%
and use the LIKE operator like

select count(*)
from blacklist
where 'ABC' like haystack

EXPLAIN think's this is quicker.


I think we can do better.
Maybe I get something wrong here, but

SELECT count(*) FROM blacklist WHERE 'ABC' || '%' LIKE haystack;

should work nicely with your original query and use an index on the
column "haystack".

If it does not use the index, try again with more rows in the table
or issue "SET enable_seqscan=off;" for test purposes.

Yours,
Laurenz Albe


Reply With Quote
  #8  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: optimzing query with substr - 08-01-2008 , 11:29 AM



reppisch <spam (AT) reppisch (DOT) de> wrote:
Quote:
maybe this question is not closely relatd to postgres but maybe someone
can proide a better way than my approach.

I'm trying to implement a blacklist lookup.
The blacklist table contains strings (or beginning-parts of strings)
which are considered "black". Lenght of this test patterns may be up to
25 char. And there may be *a lot* of them.
A pattern 'ABC' in the table allways means 'ABC.*' as a regexp.

Now i need a fast way to test a single string varchar(25) if it matches
against one (ore more) candidates in the table.

---------
A naive way may be having a table
blacklist
(
syskey integer NOT NULL
, haystack varchar(25) NOT NULL
);

containing
1, AB
2, EFG

Testing the string needle = 'ABC' would then lead to the dynamic
creation (because of the variable number of characters in 'needle') of a
select like

select count(*)
from blacklist
where
(substr(haystack,1,1) = '0' or substr(haystack,1,1) ='')
and (substr(haystack,2,1) = '9' or substr(haystack,2,1) ='')
and (substr(haystack,3,1) = '6' or substr(haystack,3,1) ='')

This is very expensive.

---------
Maybe with a table layout like
blacklist
(
syskey integer NOT NULL
, haystack_1 char NOT NULL
...
, haystack_25 char NOT NULL
);

or the postgres-specific array type i could save the substr and use
preapred statements.

The table-access is 99% read-only, so creating extensive indexes would
be ok.
---------

I also could rearrange the table to have haystack-entries
1, AB%
2, EFG%
and use the LIKE operator like

select count(*)
from blacklist
where 'ABC' like haystack

EXPLAIN think's this is quicker.


I think we can do better.
Maybe I get something wrong here, but

SELECT count(*) FROM blacklist WHERE 'ABC' || '%' LIKE haystack;

should work nicely with your original query and use an index on the
column "haystack".

If it does not use the index, try again with more rows in the table
or issue "SET enable_seqscan=off;" for test purposes.

Yours,
Laurenz Albe


Reply With Quote
  #9  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: optimzing query with substr - 08-01-2008 , 11:29 AM



reppisch <spam (AT) reppisch (DOT) de> wrote:
Quote:
maybe this question is not closely relatd to postgres but maybe someone
can proide a better way than my approach.

I'm trying to implement a blacklist lookup.
The blacklist table contains strings (or beginning-parts of strings)
which are considered "black". Lenght of this test patterns may be up to
25 char. And there may be *a lot* of them.
A pattern 'ABC' in the table allways means 'ABC.*' as a regexp.

Now i need a fast way to test a single string varchar(25) if it matches
against one (ore more) candidates in the table.

---------
A naive way may be having a table
blacklist
(
syskey integer NOT NULL
, haystack varchar(25) NOT NULL
);

containing
1, AB
2, EFG

Testing the string needle = 'ABC' would then lead to the dynamic
creation (because of the variable number of characters in 'needle') of a
select like

select count(*)
from blacklist
where
(substr(haystack,1,1) = '0' or substr(haystack,1,1) ='')
and (substr(haystack,2,1) = '9' or substr(haystack,2,1) ='')
and (substr(haystack,3,1) = '6' or substr(haystack,3,1) ='')

This is very expensive.

---------
Maybe with a table layout like
blacklist
(
syskey integer NOT NULL
, haystack_1 char NOT NULL
...
, haystack_25 char NOT NULL
);

or the postgres-specific array type i could save the substr and use
preapred statements.

The table-access is 99% read-only, so creating extensive indexes would
be ok.
---------

I also could rearrange the table to have haystack-entries
1, AB%
2, EFG%
and use the LIKE operator like

select count(*)
from blacklist
where 'ABC' like haystack

EXPLAIN think's this is quicker.


I think we can do better.
Maybe I get something wrong here, but

SELECT count(*) FROM blacklist WHERE 'ABC' || '%' LIKE haystack;

should work nicely with your original query and use an index on the
column "haystack".

If it does not use the index, try again with more rows in the table
or issue "SET enable_seqscan=off;" for test purposes.

Yours,
Laurenz Albe


Reply With Quote
  #10  
Old   
reppisch
 
Posts: n/a

Default Re: optimzing query with partial string matching - 08-04-2008 , 02:17 AM



Hi,

thanks for responding
Quote:

Maybe I get something wrong here, but

SELECT count(*) FROM blacklist WHERE 'ABC' || '%' LIKE haystack;

The concept is:
the blacklist table contains a set of 'rules'
like
'AB%'
'EFG%'
....

Now i like to test some strings
a) 'ABC' or
b) 'QRZY' against all entries of the blacklist 'rules' and check if any
'rule' matches the sample.
So ABC is a varchar(25) typed argument to the query.

In sample a) 'ABC' matches rule 1 ('AB%') causing count(*) to be >0
In sample b) 'QRZY' is matched by no rule matches and count(*) will be 0

Nomally we do the inverse thing with 'LIKE' Eg. looking for names in a
table as in
'John'
'Jonny'
'Joe'
'Will'
and look for all Datasets matching 'Jo%'

The question was, if we know something faster then a inverse LIKE search.


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.