SQL server FTS vs Lucene.NET - 01-26-2010 , 01:43 PM
we are currently investigating, prototyping and comparing the architecture
of a full text search capability in a project. Our project has many data
repository, one of which is SQL server 2008. My personal vision for FTS
support was to use whatever was available at the repository level. As such,
for our SQL server database, we would have used SQL server FTS without even
looking into it.
It is argued internally however that this selection is a poor candidate even
for the SQL server repository. Which I definitely sound weird...
Here is my developer arguments:
"This product (sql server FTS) is lacking in term of the features we need to
properly rank search results. Even if we only had only one repository, the
limitations of Microsoft SQL Server Full-Text Search would make it a poor
candidate to have proper ranking.
It cannot be used to rank results for different kind of repository
(databases), it cannot even be used across tables in the same database,
since the number is normalized according to the number of occurrences of the
search terms in the table.
In addition to the problems with the aggregation of ranks in SQL Server
search queries, the other important limitation is the lack of flexibility to
influence the relevance. It is possible to use a weight operator to give
more or less importance to keywords, but it is not possible to do so for
fields. It means it is not possible natively to give more weight to a match
in the title of a document than a match in its content. It is the same
problem for the weighting from geographical position. In both case, it would
be possible to try to tweak the results with some kind of multiplication
factor to the rank results, but it would be rather fragile, skewing the
results in non-coherent ways and probably not offer good performance.
Many products, both open source and paid products are available the need of
a centralized, external index. The Apache Software Foundation offers one
that seems to stand out, Solr, based on the Lucene search engine. It is a
standalone server, that can be interacted with through HTTP queries. There
is also a C# client, with NHibernate integration through NHibernate Search
I offer the general public, and especially those at microsoft, the chance to
mitigate that argumentation or to add to it. Either way I don't care
personally, but I don't want us choosing the solution for the wrong reasons.
We are both a little bit biased...
Re: SQL server FTS vs Lucene.NET - 01-26-2010 , 08:50 PM
"Remi Blanchette" <remi_blanchette (AT) hotmail (DOT) com> wrote
really need to make your decision based on your goals and test to see what
works best for you.
Basically for tight integration with SQL Server where you need close to real
time indexing and to join relational queries against full-text queries you
can't beat SQL FTS.
Now if you use lucene you have more features and a richer querying language.
You have hit high-lighting, but you have to roll your own indexing mechanism
and build some sort of transactional features to the index. Consider your
index getting corrupt while indexing 100 docs per second. How to do roll
back to a previous version of the index. How to you index 1 billion rows
quickly should you need to reinidex. These are difficult problems to get
around in lucene, but part of the engine in SQL FTS.
you can consolidate all the data in a single table or index an index view.
With a distributed repostory you have to worry about a synchronization
problem. For example if you index 20 document stores you have to worry that
all 20 document stores are complete otherwise you will have dead links.
than SQL FTS. If this is important to you this makes the decision an easy
However there are problems doing date sorting and searching in lucene and
you need to code in Java - I am not sure if this is important to you. The
date issues may have been solved by now - I haven't worked in a large
lucene search deployment in over a year now.
not be a problem for you.
delivers and then what Lucene delivers and then build your own search engine
around this, or use what ships in Solr.
offers the best solution. Both solutions will involve considerable coding
I should mention that the search algorithim used by SQL FTS cleans up at
trec. Search is commodotized today, most search vendors are trying to gain
market share by adding features. When you make your selection it should not
be driven by search itself, but rather the code effort to support your
solution and the features provided.
Re: answers inline. - 08-23-2011 , 07:26 AM
This is something I'm looking at too. I thought I'd add my thoughts into the pot.
First off, there is lucene.net. So it is possible to avoid Java, but is it a mature enough port? It's something that worries me.
As for FTS (Full Text Search) not running across databases then I'm looking at accepting that searching uses stale data. Once I accept that I can look at constructing more of a publish model. If I have x databases then I can create a (conceptually) single database that consolodates data into one or more views that I wish to search on. Although this creates comms problems it does provide adv. when it comes to CDN (Content Delivery Network).
I'm no nearer selecting one, as both have pro's/con's, but hopefully this can further the discussion.