![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||||||
| |||||||
|
|
People, we are currently investigating, prototyping and comparing the architecture of a full text search capability in a project. Our project has many data repository, one of which is SQL server 2008. My personal vision for FTS support was to use whatever was available at the repository level. As such, for our SQL server database, we would have used SQL server FTS without even looking into it. It is argued internally however that this selection is a poor candidate even for the SQL server repository. Which I definitely sound weird... Here is my developer arguments: "This product (sql server FTS) is lacking in term of the features we need to properly rank search results. Even if we only had only one repository, the limitations of Microsoft SQL Server Full-Text Search would make it a poor candidate to have proper ranking. |
|
... It cannot be used to rank results for different kind of repository (databases), it cannot even be used across tables in the same database, since the number is normalized according to the number of occurrences of the search terms in the table. |
|
... In addition to the problems with the aggregation of ranks in SQL Server search queries, the other important limitation is the lack of flexibility to influence the relevance. It is possible to use a weight operator to give more or less importance to keywords, but it is not possible to do so for fields. It means it is not possible natively to give more weight to a match in the title of a document than a match in its content. It is the same problem for the weighting from geographical position. In both case, it would be possible to try to tweak the results with some kind of multiplication factor to the rank results, but it would be rather fragile, skewing the results in non-coherent ways and probably not offer good performance. ... |
|
Many products, both open source and paid products are available the need of a centralized, external index. |
|
The Apache Software Foundation offers one that seems to stand out, Solr, based on the Lucene search engine. It is a standalone server, that can be interacted with through HTTP queries. There is also a C# client, with NHibernate integration through NHibernate Search available." |
|
I offer the general public, and especially those at microsoft, the chance to mitigate that argumentation or to add to it. Either way I don't care personally, but I don't want us choosing the solution for the wrong reasons. We are both a little bit biased... |
| |
#3
| |||
| |||
|
|
On Tuesday, January 26, 2010 2:43 PM Remi Blanchette wrote: People, we are currently investigating, prototyping and comparing the architecture of a full text search capability in a project. Our project has many data repository, one of which is SQL server 2008. My personal vision for FTS support was to use whatever was available at the repository level. As such, for our SQL server database, we would have used SQL server FTS without even looking into it. It is argued internally however that this selection is a poor candidate even for the SQL server repository. Which I definitely sound weird... Here is my developer arguments: "This product (sql server FTS) is lacking in term of the features we need to properly rank search results. Even if we only had only one repository, the limitations of Microsoft SQL Server Full-Text Search would make it a poor candidate to have proper ranking. ... It cannot be used to rank results for different kind of repository (databases), it cannot even be used across tables in the same database, since the number is normalized according to the number of occurrences of the search terms in the table. ... In addition to the problems with the aggregation of ranks in SQL Server search queries, the other important limitation is the lack of flexibility to influence the relevance. It is possible to use a weight operator to give more or less importance to keywords, but it is not possible to do so for fields. It means it is not possible natively to give more weight to a match in the title of a document than a match in its content. It is the same problem for the weighting from geographical position. In both case, it would be possible to try to tweak the results with some kind of multiplication factor to the rank results, but it would be rather fragile, skewing the results in non-coherent ways and probably not offer good performance. ... Many products, both open source and paid products are available the need of a centralized, external index. The Apache Software Foundation offers one that seems to stand out, Solr, based on the Lucene search engine. It is a standalone server, that can be interacted with through HTTP queries. There is also a C# client, with NHibernate integration through NHibernate Search available." I offer the general public, and especially those at microsoft, the chance to mitigate that argumentation or to add to it. Either way I do not care personally, but I do not want us choosing the solution for the wrong reasons. We are both a little bit biased... |
|
On Tuesday, January 26, 2010 9:50 PM Hilary Cotter wrote: answers inline. I have gone back and forth between recommending lucene over SQL FTS. You really need to make your decision based on your goals and test to see what works best for you. Basically for tight integration with SQL Server where you need close to real time indexing and to join relational queries against full-text queries you cannot beat SQL FTS. Now if you use lucene you have more features and a richer querying language. You have hit high-lighting, but you have to roll your own indexing mechanism and build some sort of transactional features to the index. Consider your index getting corrupt while indexing 100 docs per second. How to do roll back to a previous version of the index. How to you index 1 billion rows quickly should you need to reinidex. These are difficult problems to get around in lucene, but part of the engine in SQL FTS. ok, I do not really understand the point here. If this is important to you you can consolidate all the data in a single table or index an index view. With a distributed repostory you have to worry about a synchronization problem. For example if you index 20 document stores you have to worry that all 20 document stores are complete otherwise you will have dead links. As I said previously the querying language of lucene is more feature rich than SQL FTS. If this is important to you this makes the decision an easy one. However there are problems doing date sorting and searching in lucene and you need to code in Java - I am not sure if this is important to you. The date issues may have been solved by now - I have not worked in a large lucene search deployment in over a year now. This adds another point of failure you have to work around. This may or may not be a problem for you. Solr is a wrapper around lucene. Basically you have to look at what SQL FTS delivers and then what Lucene delivers and then build your own search engine around this, or use what ships in Solr. You have to define your goals carefully and test to see which solution offers the best solution. Both solutions will involve considerable coding effort. I should mention that the search algorithim used by SQL FTS cleans up at trec. Search is commodotized today, most search vendors are trying to gain market share by adding features. When you make your selection it should not be driven by search itself, but rather the code effort to support your |
![]() |
| Thread Tools | |
| Display Modes | |
| |