#1
#2
(Also posted in the SQL group but got no replies; apologies if that's bad etiquette.)

Hi,

Google released a corpus of n-grams collected from the Web:
http://googleresearch.blogspot.com/2...am-are-belong-...

It contains all 1- to 5-grams that occur more than 40 times in their web crawl. It comes as 5 folders, each folder containing around 120 files. Each file contains 10,000,000 (10^7) lines. A line looks like "this is a four gram 65", where the last number is the frequency of that exact phrase. The total unzipped size of the 3-grams alone is 19GB, with each individual file around 200MB. All the unzipped data is around 100GB.

I would like to be able to search through all this and return all lines that contain a particular word or phrase. I have no idea where to start with this, but I was wondering whether an SQL database would be feasible. For the 5-grams I would need a billion rows of 6 columns. What sort of hard disk space would I need, and what kind of time would I be looking at per search on an ordinary machine? I would like to be able to find every line where a particular word occurs, no matter which position it occurs in, and ideally I would like to be able to find particular bigrams as well.

Thanks.

I believe that your approach is probably inappropriate for this data. If
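For a rough feel of the disk-space question, here is a back-of-envelope sketch in Python. It is not a measurement: the column widths (a 4-byte integer ID per word, an 8-byte frequency) are assumptions chosen for illustration, and the function name `estimate_raw_bytes` is hypothetical. It counts only the raw row payload; any real database adds per-row overhead, and the indexes needed to search by word at any position would add considerably more.

```python
def estimate_raw_bytes(rows: int, word_columns: int,
                       bytes_per_word_id: int = 4,
                       bytes_per_count: int = 8) -> int:
    """Raw payload size of a table with `word_columns` word-ID columns
    plus one frequency column. Widths are illustrative assumptions."""
    row_bytes = word_columns * bytes_per_word_id + bytes_per_count
    return rows * row_bytes


# The 5-gram table from the post: about a billion rows, 5 words + 1 count.
total = estimate_raw_bytes(rows=10**9, word_columns=5)
print(f"{total / 1e9:.1f} GB of raw payload, before overhead and indexes")
```

Storing the words as text instead of integer IDs would make each row several times larger, which is one reason the word-to-ID mapping matters.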
#3
In actuality, this is much better handled by a custom search engine designed along the same lines but with a lot of compression. If you are interested in the latter, I will be willing to explain further.
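The post does not spell out the design, so the following is only one possible reading of "custom search engine with compression": a small in-memory inverted index in Python that replaces each word with an integer ID and keeps, per word, the list of rows it appears in. The class name `NGramIndex`, its methods, and the sample lines are all hypothetical. A real engine for 100GB would keep the rows and posting lists on disk in a compressed form.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class NGramIndex:
    """Toy inverted index: words become integer IDs, rows store ID tuples."""

    def __init__(self) -> None:
        self.word_ids: Dict[str, int] = {}      # word -> integer ID
        self.vocab: List[str] = []              # integer ID -> word
        self.rows: List[Tuple[Tuple[int, ...], int]] = []
        self.postings: Dict[int, List[int]] = defaultdict(list)

    def _id(self, word: str) -> int:
        if word not in self.word_ids:
            self.word_ids[word] = len(self.vocab)
            self.vocab.append(word)
        return self.word_ids[word]

    def add(self, line: str) -> None:
        """Index one line of the form 'w1 w2 ... wn count'."""
        *words, count = line.split()
        ids = tuple(self._id(w) for w in words)
        row = len(self.rows)
        self.rows.append((ids, int(count)))
        for wid in set(ids):                    # one posting per word per row
            self.postings[wid].append(row)

    def search(self, phrase: str) -> List[Tuple[str, int]]:
        """Return (ngram, count) for rows containing `phrase` contiguously."""
        tokens = phrase.split()
        if not tokens or any(t not in self.word_ids for t in tokens):
            return []
        ids = tuple(self.word_ids[t] for t in tokens)
        # Only visit rows that contain the rarest word of the phrase.
        rarest = min(ids, key=lambda i: len(self.postings[i]))
        n = len(ids)
        hits = []
        for row in self.postings[rarest]:
            row_ids, count = self.rows[row]
            if any(row_ids[i:i + n] == ids
                   for i in range(len(row_ids) - n + 1)):
                hits.append((" ".join(self.vocab[i] for i in row_ids), count))
        return hits


index = NGramIndex()
for sample in ["this is a four gram 65",       # made-up sample lines
               "a four leaf clover 48",
               "is this a test 41"]:
    index.add(sample)
print(index.search("a four"))
```

The search visits only the rows listed for the rarest query word, so a single-word or bigram lookup never touches the rest of the data.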
#4
In actuality, this is much better handled by a custom search engine designed along the same lines but with a lot of compression. If you are interested in the latter, I will be willing to explain further.

Thanks, that makes sense. What kind of compression do you mean?

Well, there are at least two different directions of compression: words
#5
#6
Compressing the data would take a lot of time, time taken away from the actual experiment. What would be the fastest way to use the dataset, under the same conditions of searching for occurrences?

The payoff on the compression & reindexing is less than 100 straight
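For comparison, a straight search with no preprocessing can be sketched as a streaming scan over every file in a folder. This is a minimal Python sketch, assuming the files are unzipped UTF-8 text and that the frequency is separated from the n-gram by a single space, as in the example line quoted in the thread. The function name `scan_folder` and the demo data are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def scan_folder(folder: Path, phrase: str) -> Iterator[str]:
    """Yield every line whose n-gram contains `phrase` as whole words."""
    needle = f" {phrase} "
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as fh:
            for raw in fh:                       # streams; never loads a whole file
                line = raw.rstrip("\n")
                # Drop the trailing frequency so it cannot match as a word.
                ngram, _, _count = line.rpartition(" ")
                # Pad with spaces so "four" does not match "fourteen".
                if needle in f" {ngram} ":
                    yield line


# Demo on a throwaway folder with made-up lines.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample-0000").write_text(
        "this is a four gram 65\nfourteen of them were 52\n",
        encoding="utf-8",
    )
    demo_hits = list(scan_folder(Path(tmp), "four"))
print(demo_hits)
```

Each query reads the entire dataset once, so the cost per search is roughly the time needed to read the files from disk. That is the trade-off against building an index up front.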