dbTalk Databases Forums  

SQL server FTS vs Lucene.NET

microsoft.public.sqlserver.fulltext microsoft.public.sqlserver.fulltext


Discuss SQL server FTS vs Lucene.NET in the microsoft.public.sqlserver.fulltext forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Remi Blanchette
 
Posts: n/a

Default SQL server FTS vs Lucene.NET - 01-26-2010 , 01:43 PM






People,
we are currently investigating, prototyping and comparing the architecture
of a full text search capability in a project. Our project has many data
repository, one of which is SQL server 2008. My personal vision for FTS
support was to use whatever was available at the repository level. As such,
for our SQL server database, we would have used SQL server FTS without even
looking into it.

It is argued internally however that this selection is a poor candidate even
for the SQL server repository. Which I definitely sound weird...

Here is my developer arguments:
"This product (sql server FTS) is lacking in term of the features we need to
properly rank search results. Even if we only had only one repository, the
limitations of Microsoft SQL Server Full-Text Search would make it a poor
candidate to have proper ranking.
....
It cannot be used to rank results for different kind of repository
(databases), it cannot even be used across tables in the same database,
since the number is normalized according to the number of occurrences of the
search terms in the table.
....
In addition to the problems with the aggregation of ranks in SQL Server
search queries, the other important limitation is the lack of flexibility to
influence the relevance. It is possible to use a weight operator to give
more or less importance to keywords, but it is not possible to do so for
fields. It means it is not possible natively to give more weight to a match
in the title of a document than a match in its content. It is the same
problem for the weighting from geographical position. In both case, it would
be possible to try to tweak the results with some kind of multiplication
factor to the rank results, but it would be rather fragile, skewing the
results in non-coherent ways and probably not offer good performance.
....
Many products, both open source and paid products are available the need of
a centralized, external index. The Apache Software Foundation offers one
that seems to stand out, Solr, based on the Lucene search engine. It is a
standalone server, that can be interacted with through HTTP queries. There
is also a C# client, with NHibernate integration through NHibernate Search
available."

I offer the general public, and especially those at microsoft, the chance to
mitigate that argumentation or to add to it. Either way I don't care
personally, but I don't want us choosing the solution for the wrong reasons.
We are both a little bit biased...

Reply With Quote
  #2  
Old   
Hilary Cotter
 
Posts: n/a

Default Re: SQL server FTS vs Lucene.NET - 01-26-2010 , 08:50 PM






answers inline.

"Remi Blanchette" <remi_blanchette (AT) hotmail (DOT) com> wrote

Quote:
People,
we are currently investigating, prototyping and comparing the architecture
of a full text search capability in a project. Our project has many data
repository, one of which is SQL server 2008. My personal vision for FTS
support was to use whatever was available at the repository level. As
such, for our SQL server database, we would have used SQL server FTS
without even looking into it.

It is argued internally however that this selection is a poor candidate
even for the SQL server repository. Which I definitely sound weird...

Here is my developer arguments:
"This product (sql server FTS) is lacking in term of the features we need
to properly rank search results. Even if we only had only one repository,
the limitations of Microsoft SQL Server Full-Text Search would make it a
poor candidate to have proper ranking.
I have gone back and forth between recommending lucene over SQL FTS. You
really need to make your decision based on your goals and test to see what
works best for you.

Basically for tight integration with SQL Server where you need close to real
time indexing and to join relational queries against full-text queries you
can't beat SQL FTS.

Now if you use lucene you have more features and a richer querying language.
You have hit high-lighting, but you have to roll your own indexing mechanism
and build some sort of transactional features to the index. Consider your
index getting corrupt while indexing 100 docs per second. How to do roll
back to a previous version of the index. How to you index 1 billion rows
quickly should you need to reinidex. These are difficult problems to get
around in lucene, but part of the engine in SQL FTS.

Quote:
...
It cannot be used to rank results for different kind of repository
(databases), it cannot even be used across tables in the same database,
since the number is normalized according to the number of occurrences of
the search terms in the table.
ok, I don't really understand the point here. If this is important to you
you can consolidate all the data in a single table or index an index view.
With a distributed repostory you have to worry about a synchronization
problem. For example if you index 20 document stores you have to worry that
all 20 document stores are complete otherwise you will have dead links.

Quote:
...
In addition to the problems with the aggregation of ranks in SQL Server
search queries, the other important limitation is the lack of flexibility
to influence the relevance. It is possible to use a weight operator to
give more or less importance to keywords, but it is not possible to do so
for fields. It means it is not possible natively to give more weight to a
match in the title of a document than a match in its content. It is the
same problem for the weighting from geographical position. In both case,
it would be possible to try to tweak the results with some kind of
multiplication factor to the rank results, but it would be rather fragile,
skewing the results in non-coherent ways and probably not offer good
performance.
...
As I said previously the querying language of lucene is more feature rich
than SQL FTS. If this is important to you this makes the decision an easy
one.

However there are problems doing date sorting and searching in lucene and
you need to code in Java - I am not sure if this is important to you. The
date issues may have been solved by now - I haven't worked in a large
lucene search deployment in over a year now.

Quote:
Many products, both open source and paid products are available the need
of a centralized, external index.
This adds another point of failure you have to work around. This may or may
not be a problem for you.

Quote:
The Apache Software Foundation offers one that seems to stand out, Solr,
based on the Lucene search engine. It is a standalone server, that can be
interacted with through HTTP queries. There is also a C# client, with
NHibernate integration through NHibernate Search available."
Solr is a wrapper around lucene. Basically you have to look at what SQL FTS
delivers and then what Lucene delivers and then build your own search engine
around this, or use what ships in Solr.
Quote:
I offer the general public, and especially those at microsoft, the chance
to mitigate that argumentation or to add to it. Either way I don't care
personally, but I don't want us choosing the solution for the wrong
reasons. We are both a little bit biased...
You have to define your goals carefully and test to see which solution
offers the best solution. Both solutions will involve considerable coding
effort.

I should mention that the search algorithim used by SQL FTS cleans up at
trec. Search is commodotized today, most search vendors are trying to gain
market share by adding features. When you make your selection it should not
be driven by search itself, but rather the code effort to support your
solution and the features provided.

Quote:



Reply With Quote
  #3  
Old   
pauliom
 
Posts: n/a

Default Re: answers inline. - 08-23-2011 , 07:26 AM



This is something I'm looking at too. I thought I'd add my thoughts into the pot.

First off, there is lucene.net. So it is possible to avoid Java, but is it a mature enough port? It's something that worries me.

As for FTS (Full Text Search) not running across databases then I'm looking at accepting that searching uses stale data. Once I accept that I can look at constructing more of a publish model. If I have x databases then I can create a (conceptually) single database that consolodates data into one or more views that I wish to search on. Although this creates comms problems it does provide adv. when it comes to CDN (Content Delivery Network).

I'm no nearer selecting one, as both have pro's/con's, but hopefully this can further the discussion.

Quote:
On Tuesday, January 26, 2010 2:43 PM Remi Blanchette wrote:

People,
we are currently investigating, prototyping and comparing the architecture
of a full text search capability in a project. Our project has many data
repository, one of which is SQL server 2008. My personal vision for FTS
support was to use whatever was available at the repository level. As such,
for our SQL server database, we would have used SQL server FTS without even
looking into it.

It is argued internally however that this selection is a poor candidate even
for the SQL server repository. Which I definitely sound weird...

Here is my developer arguments:
"This product (sql server FTS) is lacking in term of the features we need to
properly rank search results. Even if we only had only one repository, the
limitations of Microsoft SQL Server Full-Text Search would make it a poor
candidate to have proper ranking.
...
It cannot be used to rank results for different kind of repository
(databases), it cannot even be used across tables in the same database,
since the number is normalized according to the number of occurrences of the
search terms in the table.
...
In addition to the problems with the aggregation of ranks in SQL Server
search queries, the other important limitation is the lack of flexibility to
influence the relevance. It is possible to use a weight operator to give
more or less importance to keywords, but it is not possible to do so for
fields. It means it is not possible natively to give more weight to a match
in the title of a document than a match in its content. It is the same
problem for the weighting from geographical position. In both case, it would
be possible to try to tweak the results with some kind of multiplication
factor to the rank results, but it would be rather fragile, skewing the
results in non-coherent ways and probably not offer good performance.
...
Many products, both open source and paid products are available the need of
a centralized, external index. The Apache Software Foundation offers one
that seems to stand out, Solr, based on the Lucene search engine. It is a
standalone server, that can be interacted with through HTTP queries. There
is also a C# client, with NHibernate integration through NHibernate Search
available."

I offer the general public, and especially those at microsoft, the chance to
mitigate that argumentation or to add to it. Either way I do not care
personally, but I do not want us choosing the solution for the wrong reasons.
We are both a little bit biased...

Quote:
On Tuesday, January 26, 2010 9:50 PM Hilary Cotter wrote:

answers inline.


I have gone back and forth between recommending lucene over SQL FTS. You
really need to make your decision based on your goals and test to see what
works best for you.

Basically for tight integration with SQL Server where you need close to real
time indexing and to join relational queries against full-text queries you
cannot beat SQL FTS.

Now if you use lucene you have more features and a richer querying language.
You have hit high-lighting, but you have to roll your own indexing mechanism
and build some sort of transactional features to the index. Consider your
index getting corrupt while indexing 100 docs per second. How to do roll
back to a previous version of the index. How to you index 1 billion rows
quickly should you need to reinidex. These are difficult problems to get
around in lucene, but part of the engine in SQL FTS.


ok, I do not really understand the point here. If this is important to you
you can consolidate all the data in a single table or index an index view.
With a distributed repostory you have to worry about a synchronization
problem. For example if you index 20 document stores you have to worry that
all 20 document stores are complete otherwise you will have dead links.


As I said previously the querying language of lucene is more feature rich
than SQL FTS. If this is important to you this makes the decision an easy
one.

However there are problems doing date sorting and searching in lucene and
you need to code in Java - I am not sure if this is important to you. The
date issues may have been solved by now - I have not worked in a large
lucene search deployment in over a year now.


This adds another point of failure you have to work around. This may or may
not be a problem for you.


Solr is a wrapper around lucene. Basically you have to look at what SQL FTS
delivers and then what Lucene delivers and then build your own search engine
around this, or use what ships in Solr.

You have to define your goals carefully and test to see which solution
offers the best solution. Both solutions will involve considerable coding
effort.

I should mention that the search algorithim used by SQL FTS cleans up at
trec. Search is commodotized today, most search vendors are trying to gain
market share by adding features. When you make your selection it should not
be driven by search itself, but rather the code effort to support your

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.