dbTalk Databases Forums  

indexing text

comp.databases.postgresql comp.databases.postgresql


Discuss indexing text in the comp.databases.postgresql forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
M. Strobel
 
Posts: n/a

Default indexing text - 03-03-2011 , 02:29 AM






Hi,

suppose I want to store log messages in a table, or I want to
have translations of messages in a table. To avoid duplicates,
many programmers would compute a hash on the text and use it for
indexing.

I wonder if the postgres indexing facilities would handle longer
text columns well by itself, so there is not much to gain by hashing.

I am talking about modest size tables, in web applications.

/Str.

Reply With Quote
  #2  
Old   
Mladen Gogala
 
Posts: n/a

Default Re: indexing text - 03-03-2011 , 07:54 AM






On Thu, 03 Mar 2011 10:29:53 +0200, M. Strobel wrote:

Quote:
Hi,

suppose I want to store log messages in a table, or I want to have
translations of messages in a table. To avoid duplicates, many
programmers would compute a hash on the text and use it for indexing.

I wonder if the postgres indexing facilities would handle longer text
columns well by itself, so there is not much to gain by hashing.

I am talking about modest size tables, in web applications.

/Str.
I don't understand the question. Do you mean normal B-Tree index or text
index? Postgres does have text indexing engine, it's called Tsearch2, and
if you only need to search for words, it's an excellent choice. If you
need more capabilities than Tsearch2 provides (phrases, near operator),
you can try with Sphinx, Lucene or Xapian.
All of those engines are geared toward large text documents and can
handle multi-MB documents with ease.



--
http://mgogala.byethost5.com

Reply With Quote
  #3  
Old   
Anselmo Canfora
 
Posts: n/a

Default Re: indexing text - 03-03-2011 , 08:41 AM



From:
http://www.postgresql.org/docs/8.3/s...xes-types.html

"Hash indexes can only handle simple equality comparisons. The query
planner will consider using a hash index whenever an indexed column is
involved in a comparison using the = operator. (But hash indexes do not
support IS NULL searches.) The following command is used to create a
hash index:

CREATE INDEX name ON table USING hash (column);

Note: Testing has shown PostgreSQL's hash indexes to perform no
better than B-tree indexes, and the index size and build time for hash
indexes is much worse. Furthermore, hash index operations are not
presently WAL-logged, so hash indexes might need to be rebuilt with
REINDEX after a database crash. For these reasons, hash index use is
presently discouraged."



Il 03/03/2011 9.29, M. Strobel ha scritto:
Quote:
Hi,

suppose I want to store log messages in a table, or I want to
have translations of messages in a table. To avoid duplicates,
many programmers would compute a hash on the text and use it for
indexing.

I wonder if the postgres indexing facilities would handle longer
text columns well by itself, so there is not much to gain by hashing.

I am talking about modest size tables, in web applications.

/Str.

Reply With Quote
  #4  
Old   
M. Strobel
 
Posts: n/a

Default Re: indexing text - 03-03-2011 , 10:00 AM



Am 03.03.2011 16:41, schrieb Anselmo Canfora:
Quote:
From:
http://www.postgresql.org/docs/8.3/s...xes-types.html

"Hash indexes can only handle simple equality comparisons. The
query planner will consider using a hash index whenever an
indexed column is involved in a comparison using the = operator.
(But hash indexes do not support IS NULL searches.) The following
command is used to create a hash index:

CREATE INDEX name ON table USING hash (column);

Note: Testing has shown PostgreSQL's hash indexes to perform
no better than B-tree indexes, and the index size and build time
for hash indexes is much worse. Furthermore, hash index
operations are not presently WAL-logged, so hash indexes might
need to be rebuilt with REINDEX after a database crash. For these
reasons, hash index use is presently discouraged."



Il 03/03/2011 9.29, M. Strobel ha scritto:
Hi,

suppose I want to store log messages in a table, or I want to
have translations of messages in a table. To avoid duplicates,
many programmers would compute a hash on the text and use it for
indexing.

I wonder if the postgres indexing facilities would handle longer
text columns well by itself, so there is not much to gain by
hashing.

I am talking about modest size tables, in web applications.

/Str.
Thanks, I did not know about hash indexes, I was only aware of
the B-tree. I want to look up the whole text. I just noticed that
much more data would be sent to the database when looking up the
text, instead of a self made hash.

It still would be interesting to know how well b-tree indexes
perform on longer text columns.

/Str.

Reply With Quote
  #5  
Old   
M. Strobel
 
Posts: n/a

Default Re: indexing text - 03-03-2011 , 10:08 AM



Am 03.03.2011 15:54, schrieb Mladen Gogala:
Quote:
On Thu, 03 Mar 2011 10:29:53 +0200, M. Strobel wrote:

Hi,

suppose I want to store log messages in a table, or I want to have
translations of messages in a table. To avoid duplicates, many
programmers would compute a hash on the text and use it for indexing.

I wonder if the postgres indexing facilities would handle longer text
columns well by itself, so there is not much to gain by hashing.

I am talking about modest size tables, in web applications.

/Str.

I don't understand the question. Do you mean normal B-Tree index or text
index? Postgres does have text indexing engine, it's called Tsearch2, and
if you only need to search for words, it's an excellent choice. If you
need more capabilities than Tsearch2 provides (phrases, near operator),
you can try with Sphinx, Lucene or Xapian.
All of those engines are geared toward large text documents and can
handle multi-MB documents with ease.



I meant b-tree, and equal-operation, no search on words so far
(or not more than 'like' or 'regex' would do).

/Str.

Reply With Quote
  #6  
Old   
Anselmo Canfora
 
Posts: n/a

Default Re: indexing text - 03-03-2011 , 12:27 PM



[CUT]
Quote:
Thanks, I did not know about hash indexes, I was only aware of
the B-tree. I want to look up the whole text. I just noticed that
much more data would be sent to the database when looking up the
text, instead of a self made hash.

It still would be interesting to know how well b-tree indexes
perform on longer text columns.

/Str.
It seems that starting from 8.4 hash indexes issue has been resolved:
http://developer.postgresql.org/pgdo...lease-8-4.html

"E.13.3.5.1. Indexes

* Dramatically improve the speed of building and accessing hash
indexes (Tom Raney, Shreya Bhargava)
This allows hash indexes to be sometimes faster than btree indexes.
However, hash indexes are still not crash-safe.

* Make hash indexes store only the hash code, not the full value of
the indexed column (Xiao Meng)
This greatly reduces the size of hash indexes for long indexed values,
improving performance."

If you use PG >=8.4 you need to do some testing in order to choose the
right index, for PG<8.4 btree is the only choice. Pay attention, if you
know you'll need to perform queries with 'like' operator, PG won't use
the index: you will need a trick in that case, otherwise PG will do a
full table scan for each 'like' query.
If uniqueness of text is the only requirement you need, I think you can
do something like this:

create table my_log_table (id char(32) primary key, logline text,
whateveryouwant text);

then create a 'before insert' trigger to set id=md5(logline).

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.