#1
#2
Hi,

We are currently thinking of replacing our existing database system, which is no longer supported. The data corpus comprises 4 million records, each with about 30 fields. Half a million records have a full text in PDF format, which we also store as a text field (extracted by a self-made PDF extraction script). Our DB usage is moderate, about 1000 searches per day. We load balance by means of Pound, which distributes queries to 6 different virtual machines (each holding a copy of the database). The web interface is decoupled from the DB; we use an MVC framework that talks to the DB via an API and only retrieves data. Currently we use a Windows system, which can easily be replaced with Linux if need be.

Our existing solution has a built-in thesaurus (the controlled vocabulary is static); in addition, every term holds the number of records currently tagged with it.

The new solution should of course perform well. Thesaurus functionality would be nice, as would relevance ranking and proximity searching, but these are not a must. A cost-free solution would be desirable: we want to open our database to the internet, so we might need to add new instances of the DB (virtual machines), and paying for additional licences would be too expensive for us.

What would you recommend?

In addition:

1. Our data repository is a large XML file from which we update our database on a weekly basis by means of a self-made update script. Would an XML database be an alternative, especially with regard to performance?
2. I was also asked to investigate the possibility of implementing a federated search across 2 to 3 other sources (with different data structures). I assume this would be a different beast and not a feature of the new DB I am looking for. This is not a requirement yet, but what options would I have in that regard?

Many thanks for your input,
JR

PS: If this group is not the right place, please point me to a proper one.
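
On point 1: whichever database is chosen, a large XML file can be loaded in a streaming fashion, so that the record contents do not all have to be held in memory at once. The following is only a sketch under assumptions: the element names (`record`, `title`, `body`), the `id` attribute, and the `load_records` helper are hypothetical, and SQLite stands in for the target database purely so the example runs on its own.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample; the real repository file and its element names will differ.
SAMPLE = b"""<records>
  <record id="1"><title>First</title><body>alpha beta</body></record>
  <record id="2"><title>Second</title><body>gamma</body></record>
</records>"""


def load_records(xml_source, conn):
    """Stream <record> elements from xml_source into the records table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    loaded = 0
    # iterparse yields each element once its closing tag has been read.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "record":
            continue
        # INSERT OR REPLACE makes a repeated weekly run overwrite existing rows.
        conn.execute(
            "INSERT OR REPLACE INTO records (id, title, body) VALUES (?, ?, ?)",
            (int(elem.get("id")), elem.findtext("title"), elem.findtext("body")),
        )
        loaded += 1
        elem.clear()  # drop the record's children and text once it is stored
    conn.commit()
    return loaded


conn = sqlite3.connect(":memory:")
count = load_records(io.BytesIO(SAMPLE), conn)
```

In a real update script, `xml_source` would be the path to the weekly XML file and `conn` a connection to the production database.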
#3
I can't speak for all databases, but I'm quite familiar with MySQL and it meets all of your needs. It's quite fast and can easily handle datasets of that size (given proper indexing). It also has full-text search, which you can leverage nicely.

http://dev.mysql.com/doc/refman/5.0/...xt-search.html
#4
> I can't speak for all databases, but I'm quite familiar with MySQL and it meets all of your needs. It's quite fast and can easily handle datasets of that size (given proper indexing). It also has full-text search which you can leverage nicely. http://dev.mysql.com/doc/refman/5.0/...xt-search.html

Thanks for the hint! A question: is there any support for a "real" thesaurus (broader terms, narrower terms), perhaps also to display or use the thesaurus to browse the data corpus, in either MySQL or PostgreSQL?
#5
I assume there is no thesaurus functionality in MySQL. But maybe this page helps as a starting point; some thesaurus libs are referenced there:

http://search.cpan.org/~joseibert/Th...b/Thesaurus/DB...

And this looks promising as well:

http://www.sequencepublishing.com/thesage.html

Kind regards,
robert
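
If the database itself offers no thesaurus, the broader/narrower term logic could also live in the application layer. Below is a minimal sketch of that idea, not taken from any of the libraries linked above: the `Thesaurus` class, its method names, and the sample terms are all hypothetical. It keeps a static vocabulary, expands a search term to all of its narrower terms, and tracks how many records are tagged with each term.

```python
from collections import defaultdict


class Thesaurus:
    """Static controlled vocabulary with broader/narrower term links."""

    def __init__(self):
        self.narrower = defaultdict(set)  # term -> direct narrower terms
        self.broader = defaultdict(set)   # term -> direct broader terms
        self.tagged = defaultdict(set)    # term -> ids of records tagged with it

    def add_relation(self, broader_term, narrower_term):
        self.narrower[broader_term].add(narrower_term)
        self.broader[narrower_term].add(broader_term)

    def tag(self, record_id, term):
        self.tagged[term].add(record_id)

    def expand(self, term):
        """Return the term plus all transitively narrower terms."""
        seen, stack = set(), [term]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.narrower.get(current, ()))
        return seen

    def count(self, term):
        """Number of records tagged directly with this term."""
        return len(self.tagged.get(term, ()))

    def search(self, term):
        """Ids of records tagged with the term or any narrower term."""
        hits = set()
        for candidate in self.expand(term):
            hits |= self.tagged.get(candidate, set())
        return hits


thesaurus = Thesaurus()
thesaurus.add_relation("databases", "relational databases")
thesaurus.add_relation("relational databases", "MySQL")
thesaurus.add_relation("relational databases", "PostgreSQL")
thesaurus.tag(1, "MySQL")
thesaurus.tag(2, "PostgreSQL")
thesaurus.tag(3, "databases")

narrower_hits = thesaurus.search("relational databases")
```

The expanded term set from `expand` could then be fed into whatever full-text or tag query the chosen database supports, and the `broader`/`narrower` maps are enough to render a browsable term tree.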
#6
> Hi, We are currently thinking of replacing our existing database system which is no longer supported. The data corpus encompasses 4 million records, each of which has about 30 fields. Half a million records would have a full text in PDF format which we also put as a text field (by a self-made PDF extraction script). Our DB usage is moderate, about 1000 searches per day.

> We load balance by means of Pound, dealing queries to 6 different virtual machines (each holding a copy of the database).

> 2. I was also asked to investigate a possibility to implement a federated search on 2 to 3 other sources (different data structure). I assume this then would be a different beast and not a feature for the above mentioned new DB I am looking for. Indeed this is not a requirement yet; what options would I have in that concern?
#7
Regards
Thomas
#8
> [..] Regards Thomas

Thanks, Thomas, for your great comment! I am still looking for a working thesaurus implementation in Postgres.
#9
You will have more luck posting questions related to Postgres on the PG mailing [..]

http://www.postgresql.org/community/lists/

Regards
Thomas

#10
> You will have more luck posting questions related to Postgres on the PG mailing

Thanks, Thomas! I just sent an email.
