dbTalk Databases Forums  

Searching Google n-gram corpus

comp.databases.theory comp.databases.theory


Discuss Searching Google n-gram corpus in the comp.databases.theory forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
bobterwillinger@gmail.com
 
Posts: n/a

Default Searching Google n-gram corpus - 09-08-2007 , 09:36 AM






(also posted in sql group but got no replies, apolgies if that's bad
etiquette)

Hi,

Google released a corpus of n-grams collected from the Web.

http://googleresearch.blogspot.com/2...am-are-belong-...

It contains all 1..5grams that occur more than 40 times in their web
crawl. It comes as 5 folders, each folder containing around 120 files.
Each file contains 10,000,000 (10^7) lines. A line looks like:

"this is a four gram 65"

where the last number is the frequency of that exact phrase.
The total unzipped size of the 3 grams alone is 19GB, each individual
file around 200MB.
All the unzipped data is around 100GB.

I would like to be able to search through all this and return all
lines that contain a particular word or phrase.
I have no idea where to start with this, but I was wondering would an
SQL database be feasible. For the 5-grams i would need a billion rows
and of 6 columns. What sort of hard disk space would I need, and what
kind of time would i be looking at per search on on ordinary mahcine?,

I would like to be able to find every line where a particular word
occurs, no matter which position it occurs in, and ideally I would
like to be able to find particular bigrams as well.

thanks.


Reply With Quote
  #2  
Old   
Bob Stearns
 
Posts: n/a

Default Re: Searching Google n-gram corpus - 09-08-2007 , 01:59 PM






bobterwillinger (AT) gmail (DOT) com wrote:
Quote:
(also posted in sql group but got no replies, apolgies if that's bad
etiquette)

Hi,

Google released a corpus of n-grams collected from the Web.

http://googleresearch.blogspot.com/2...am-are-belong-...

It contains all 1..5grams that occur more than 40 times in their web
crawl. It comes as 5 folders, each folder containing around 120 files.
Each file contains 10,000,000 (10^7) lines. A line looks like:

"this is a four gram 65"

where the last number is the frequency of that exact phrase.
The total unzipped size of the 3 grams alone is 19GB, each individual
file around 200MB.
All the unzipped data is around 100GB.

I would like to be able to search through all this and return all
lines that contain a particular word or phrase.
I have no idea where to start with this, but I was wondering would an
SQL database be feasible. For the 5-grams i would need a billion rows
and of 6 columns. What sort of hard disk space would I need, and what
kind of time would i be looking at per search on on ordinary mahcine?,

I would like to be able to find every line where a particular word
occurs, no matter which position it occurs in, and ideally I would
like to be able to find particular bigrams as well.

thanks.

I believe that your approach is probably inappropriate for this data. If
you want to remain with SQL consider a design like:

create table ngrams( ngramid longint generated always, ngram
varchar(300), primary key ngramid)
create table words ( word varchar(25), pos shortint, ngramid longint,
primary key word,pos,ngramid)

insert all the ngrams in the first table, then for each word in each
ngram insert a record into words.

Search words to find the relevant ngrams containing any word and a self
join of words for the bigrams on t1.word = 'firstword' and t2.word =
'secondword' and t1.pos = t2.pos-1 and t1.ngramid=t2.ngramid.

In actuality, this much better handled by a custom search engine
designed along the same lines but with a lot of compression. If you are
interested in the latter, I will be willing to explain further.


Reply With Quote
  #3  
Old   
bobterwillinger@gmail.com
 
Posts: n/a

Default Re: Searching Google n-gram corpus - 09-11-2007 , 07:48 AM




Quote:
In actuality, this much better handled by a custom search engine
designed along the same lines but with a lot of compression. If you are
interested in the latter, I will be willing to explain further.
thanks, that makes sense.. what kind of compression do you mean?




Reply With Quote
  #4  
Old   
Bob Stearns
 
Posts: n/a

Default Re: Searching Google n-gram corpus - 09-12-2007 , 01:37 AM



bobterwillinger (AT) gmail (DOT) com wrote:
Quote:
In actuality, this much better handled by a custom search engine
designed along the same lines but with a lot of compression. If you are
interested in the latter, I will be willing to explain further.


thanks, that makes sense.. what kind of compression do you mean?


Well there are at least two different directions of compression: words
in the n-grams and the word lists. since all of the words in the 2-grams
through 5-grams are guaranteed to be in the 1-grams, consider sorting
the 1-grams in frequency order and assigning each word a compression id;
the first 127 get 1 byte ids x'01'-x'7f', the next 16328 get 2 byte ids
x'8000'-x'bfff', and the remainder 3 byte ids from x'c00000'-x'dfffff'
(that is 2 million words, much more than in english -- if foreign
languages are included the 4 byte ids go from x'e0000000'-x'efffffff',
an additional 256 million words, which should be enough). Now record the
2-5 grams with the corresponding 1-gram ids which should shorten them to
an average of around two bytes per word. If that is not enough we could
look at prefix trees to reduce the storage even further: every n-gram's
leading (n-1)-gram is guaranteed to exist in the corpus, so the
representation of an n-gram is (n-1)-gram id and the word id of the last
word.

The word lists can be stored one record per word:

wordid as described above, position in ngram, number of ngrams, list of
ngrams

The list of ngrams starts with a marker indicating whether the list of
ngramids or a bit list with each bit position representing an ngram. If
it a bit list, several types of RLE can be used to shorten it.

I have used these techniques on everything from a TRS-80 to a 16-way
3090, currently called (I think) X machines in IBM parlance, to great
effect to speed up (and just make possible) searches over corpuses that
would otherwise not even be possible (like a 4 million volume library,
searchable on every title, author, publisher, etc.).


Reply With Quote
  #5  
Old   
Shield
 
Posts: n/a

Default Re: Searching Google n-gram corpus - 09-15-2007 , 08:53 PM




compressing the data would take allot of time. time taken away from
the actual experiment.

what would be the fastest way to using the dataset, using the same
conditions of searching for occurances?




Reply With Quote
  #6  
Old   
Bob Stearns
 
Posts: n/a

Default Re: Searching Google n-gram corpus - 09-16-2007 , 12:45 AM



Shield wrote:

Quote:
compressing the data would take allot of time. time taken away from
the actual experiment.

what would be the fastest way to using the dataset, using the same
conditions of searching for occurances?



The payoff on the compression & reindexing is less than 100 straight
searches on the original corpus. If your experiment is that small (or
even 1000 searches) just do the in the brute force way.


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.