dbTalk Databases Forums  

Search efficiency - RDMS search versus free text search

comp.databases.theory comp.databases.theory


Discuss Search efficiency - RDMS search versus free text search in the comp.databases.theory forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Taras_96
 
Posts: n/a

Default Search efficiency - RDMS search versus free text search - 01-13-2009 , 05:10 AM






Hi everyone,

I just had an interesting discussion with a colleague regarding how we
are implementing search. We allow users to browse resources under a
list of topics, so for example a list of all resources relating to
'cars'. One way of doing this is with a SQL query against the database
backend:

SELECT * FROM resources WHERE topic_area = 'cars'

The actual query is quite a bit more complicated than this (involves
table joins and the like).

However, he said that due to the table joins (coming from the fact
that the database is normalised), that this technique is quite
resource expensive. Instead, we are using the Google Search Appliance
to index our resources (delivered via the web), and creating the topic
pages by using Google to search on meta-data, which apparently results
in better performance.

This came to me somewhat as a suprise, as I was under the impression
that returning such result sets were one of the key uses of a RDMS -
the efficient storage and retrieval of records. I can understand the
performance issues if the search was a free text search, but the above
search is really searching against an indexed column value.

Is this often the case with enterprise level database searches - that
the database itself isn't used for searching because of performance
issues?

Cheers

Taras

Reply With Quote
  #2  
Old   
Cimode
 
Posts: n/a

Default Re: Search efficiency - RDMS search versus free text search - 01-13-2009 , 05:34 AM






On 13 jan, 12:10, Taras_96 <taras... (AT) gmail (DOT) com> wrote:
Quote:
Hi everyone,

I just had an interesting discussion with a colleague regarding how we
are implementing search. We allow users to browse resources under a
list of topics, so for example a list of all resources relating to
'cars'. One way of doing this is with a SQL query against the database
backend:

SELECT * FROM resources WHERE topic_area = 'cars'

The actual query is quite a bit more complicated than this (involves
table joins and the like).

However, he said that due to the table joins (coming from the fact
that the database is normalised), that this technique is quite
resource expensive. Instead, we are using the Google Search Appliance
to index our resources (delivered via the web), and creating the topic
pages by using Google to search on meta-data, which apparently results
in better performance.

This came to me somewhat as a suprise, as I was under the impression
that returning such result sets were one of the key uses of a RDMS -
the efficient storage and retrieval of records. I can understand the
performance issues if the search was a free text search, but the above
search is really searching against an indexed column value.

Is this often the case with enterprise level database searches - that
the database itself isn't used for searching because of performance
issues?

Cheers

Taras
Performance depends greatly on the storage and retrieval mechanism
used by the db engine to implement search. Most directimage systems
have poor performance due to fundamental limitations onto how they
represent data on disk and how the retrieve that information.


Reply With Quote
  #3  
Old   
vadimtro@gmail.com
 
Posts: n/a

Default Re: Search efficiency - RDMS search versus free text search - 01-13-2009 , 12:49 PM



On Jan 13, 3:10*am, Taras_96 <taras... (AT) gmail (DOT) com> wrote:
Quote:
Hi everyone,

I just had an interesting discussion with a colleague regarding how we
are implementing search. We allow users to browse resources under a
list of topics, so for example a list of all resources relating to
'cars'. One way of doing this is with a SQL query against the database
backend:

SELECT * FROM resources WHERE topic_area = 'cars'

The actual query is quite a bit more complicated than this (involves
table joins and the like).
There are two options, both being not particularly great solutions:
join index (which is essentially bitmap), and materialized view with
index on top.

Performance issue is not a great surprise: databases are solutions for
wide range of problems not being particularly great at any of them.
They however are infitely more flexible than alternatives.


Reply With Quote
  #4  
Old   
Taras_96
 
Posts: n/a

Default Re: Search efficiency - RDMS search versus free text search - 01-14-2009 , 05:46 AM



My colleague even postulated that big sites such as e-bay would be
using dedicated search technologies to accomplish categorical listings
instead of querying the back end database. For example, if you
searched for all products in the 'car' category, that instead of
querying the database for all products in the car category (presumably
a field value), that it would query the search technology for the
products. Does this seem probably to anyone?

Taras

Reply With Quote
  #5  
Old   
paul c
 
Posts: n/a

Default Re: Search efficiency - RDMS search versus free text search - 01-14-2009 , 08:56 AM



Taras_96 wrote:
Quote:
Hi everyone,

I just had an interesting discussion with a colleague regarding how we
are implementing search. We allow users to browse resources under a
list of topics, so for example a list of all resources relating to
'cars'. One way of doing this is with a SQL query against the database
backend:

SELECT * FROM resources WHERE topic_area = 'cars'

The actual query is quite a bit more complicated than this (involves
table joins and the like).

However, he said that due to the table joins (coming from the fact
that the database is normalised), that this technique is quite
resource expensive. Instead, we are using the Google Search Appliance
to index our resources (delivered via the web), and creating the topic
pages by using Google to search on meta-data, which apparently results
in better performance.

This came to me somewhat as a suprise, as I was under the impression
that returning such result sets were one of the key uses of a RDMS -
the efficient storage and retrieval of records. I can understand the
performance issues if the search was a free text search, but the above
search is really searching against an indexed column value.

Is this often the case with enterprise level database searches - that
the database itself isn't used for searching because of performance
issues?

Cheers

Taras
I think this is a significant question even if it is perhaps naive in the c.d.t. context. The question I would ask is: does the 'appliance' give the same answers as the dbms? Possibly the answer to that question would indicate that the requirements for this particular database are just as fuzzy as the answers.


Presumably, this is a read-only db at least as far as some users are concerned. If there is a requirement for accurate updating by other users, let's hope they use some version or special copy of the db itself. On the other hand, I remember read-only advertising research apps from the early 1980's that did fuzzy searching. Their main requirement was to give possible answers, not all answers.


(Some time ago I was interested in finding the children of a woman I thought to be dead for twenty or more years. No criticism of Google but its algorithms got me nowhere whereas Yahoo gave me some clues as to countries she might have lived in. Knowing her husband's surname and the given name of her male child, I looked him up in a foreign telephone directory and by accident reached another son I didn't know of, and then to my surprise, the lady herself, still alive. At one time, Yahoo's machine-generated indexes were edited to some extent manually. No idea if this is still the case, but this example shows that the ranking involved in these 'search engines' may omit useful answers and has a big effect on subsequent searches.)


Reply With Quote
  #6  
Old   
Walter Mitty
 
Posts: n/a

Default Re: Search efficiency - RDMS search versus free text search - 01-14-2009 , 09:34 AM




"paul c" <toledobythesea (AT) oohay (DOT) ac> wrote

Quote:
Taras_96 wrote:
Hi everyone,

I just had an interesting discussion with a colleague regarding how we
are implementing search. We allow users to browse resources under a
list of topics, so for example a list of all resources relating to
'cars'. One way of doing this is with a SQL query against the database
backend:

SELECT * FROM resources WHERE topic_area = 'cars'

The actual query is quite a bit more complicated than this (involves
table joins and the like).

However, he said that due to the table joins (coming from the fact
that the database is normalised), that this technique is quite
resource expensive. Instead, we are using the Google Search Appliance
to index our resources (delivered via the web), and creating the topic
pages by using Google to search on meta-data, which apparently results
in better performance.

This came to me somewhat as a suprise, as I was under the impression
that returning such result sets were one of the key uses of a RDMS -
the efficient storage and retrieval of records. I can understand the
performance issues if the search was a free text search, but the above
search is really searching against an indexed column value.

Is this often the case with enterprise level database searches - that
the database itself isn't used for searching because of performance
issues?

Cheers

Taras

I think this is a significant question even if it is perhaps naive in the
c.d.t. context. The question I would ask is: does the 'appliance' give
the same answers as the dbms? Possibly the answer to that question would
indicate that the requirements for this particular database are just as
fuzzy as the answers.


Presumably, this is a read-only db at least as far as some users are
concerned. If there is a requirement for accurate updating by other
users, let's hope they use some version or special copy of the db itself.
On the other hand, I remember read-only advertising research apps from the
early 1980's that did fuzzy searching. Their main requirement was to give
possible answers, not all answers.


(Some time ago I was interested in finding the children of a woman I
thought to be dead for twenty or more years. No criticism of Google but
its algorithms got me nowhere whereas Yahoo gave me some clues as to
countries she might have lived in. Knowing her husband's surname and the
given name of her male child, I looked him up in a foreign telephone
directory and by accident reached another son I didn't know of, and then
to my surprise, the lady herself, still alive. At one time, Yahoo's
machine-generated indexes were edited to some extent manually. No idea if
this is still the case, but this example shows that the ranking involved
in these 'search engines' may omit useful answers and has a big effect on
subsequent searches.)
From reading your response, I have some real doubt as to whether it was the
Google "algorithms" or the data those algorithms had available to them that
caused the results you saw.

Speculating about the internals of an engine like Google provides lots of
opportunities for mistakes. Google gets inputs from all over the place.
Google applies algorithms to those inputs prior to storing. Google uses
queries to get a presumably relevant subset back when responding to a user
request. For this last purpose, the database can indeed be treated as read
only, although there is probably a second schema that collects data about
user search activities, and that schema is probably written to.

As far as fuzzy logic goes, the data can be fuzzed prior to storage, or
fuzzed in the course of a query, or fuzzed by an algorithm applied to query
results. Without know how the data is stored, we can't know whether it's
already fuzzy in storage.

As far as the OP is concerned, I'd like some responder that's more well
versed in theory than I am to refine the OP's understanding of what
relational databases are for. "The efficient storage and retrieval or
records" seems like the very tip of the iceberg to me.





Reply With Quote
  #7  
Old   
patrick61z@yahoo.com
 
Posts: n/a

Default Re: Search efficiency - RDMS search versus free text search - 01-23-2009 , 03:06 PM



On Jan 13, 6:10 am, Taras_96 <taras... (AT) gmail (DOT) com> wrote:
Quote:
Hi everyone,

I just had an interesting discussion with a colleague regarding how we
are implementing search. We allow users to browse resources under a
list of topics, so for example a list of all resources relating to
'cars'. One way of doing this is with a SQL query against the database
backend:

SELECT * FROM resources WHERE topic_area = 'cars'

The actual query is quite a bit more complicated than this (involves
table joins and the like).

However, he said that due to the table joins (coming from the fact
that the database is normalised), that this technique is quite
resource expensive. Instead, we are using the Google Search Appliance
to index our resources (delivered via the web), and creating the topic
pages by using Google to search on meta-data, which apparently results
in better performance.

This came to me somewhat as a suprise, as I was under the impression
that returning such result sets were one of the key uses of a RDMS -
the efficient storage and retrieval of records. I can understand the
performance issues if the search was a free text search, but the above
search is really searching against an indexed column value.

Is this often the case with enterprise level database searches - that
the database itself isn't used for searching because of performance
issues?

Cheers

Taras
Search like this is probably one of the areas that RDBMS's struggle
with, simply because the problem of keyword or text search is not what
relational database tech is all about.

Yes it can do it, but full text search is best designed outside of
relational algebra and like repeating groups in some of todays
database engines, the engine itself will do full text indexing for
you. And you can bet that underneath the hood, theres real programmers
implementing indexes that maximize performance.

An example I found is http://www.developer.com/db/article.php/3446891
that discusses how SQL Server 2008 bails on this task to an instance
of "Microsoft Search Service" and to me this makes perfect sense.

It wouldn't hurt to forget relational algebra long enough to read a
few articles about document indexing and the design decisions that
come into play. Along with that, be familiar with the types of indexes
in use on todays databases (that while are parts of relational
databases themselves, are definitely built using non relational
algorithms). I've found a basic understanding of indexing techniques
to be pretty helpful.



Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.