Google Ngram Viewer

A first look at the Google Books Ngram Viewer, a really interesting feature from Google.

Launched in December 2010, the Google Ngram Viewer was created by Jon Orwant (Engineering Manager) and co-creator Will Brockman to track the usage of phrases across time, so it was expected to be of interest mainly to professional linguists and historians. However, it also became very popular with casual users, and since then it has been used about 50 times every minute to explore how phrases have been used in books spanning the centuries.

–       The Google Ngram Viewer is a graphical tool that charts the usage of words and phrases, or ‘n-grams’, based on their yearly count within 5.2 million books, spanning 1500 to 2008 and containing approximately 500 billion words. It also covers a variety of languages, such as American English, British English, French, German, Spanish, Russian and Chinese. Over 45 million graphs have been created so far; I would say the best way to describe the Ngram Viewer is as a history of the written word.

–       As of 2012, the Ngram Viewer has been updated to version 2.0, which draws its data from more than eight million of the 20 million books in the Google Books archive, approximately 6% of all books ever published. Version 2.0 also includes improvements by the engineers at Google in correcting OCR deficiencies and in hammering out inconsistencies between library and publisher metadata.

Advanced usage of Google Ngram

 

–       Part-of-speech tagging. A part-of-speech tag can be appended to a word so that it is searched for in context, which separates uses of the same word that have different meanings. Tags also make it possible to see how words have developed or changed over time, for example by comparing the verbs telephone_VERB and phone_VERB.

–       Mathematical operators. A set of operators lets you add, subtract, multiply and divide the counts of n-grams.
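To make the two features above a little more concrete, here is a minimal Python sketch of how a tagged query could be written and what the operators do to the yearly counts. It is only an illustration: the yearly counts are made-up numbers, and the helper names `tag` and `combine` are hypothetical, not part of any Google tool.

```python
from operator import add, truediv

# Hypothetical yearly counts, invented purely for illustration.
telephone_verb = {1950: 120, 1980: 90, 2008: 40}
phone_verb = {1950: 30, 1980: 110, 2008: 200}


def tag(word, pos):
    """Attach a part-of-speech tag in the word_TAG form, e.g. telephone_VERB."""
    return f"{word}_{pos}"


def combine(a, b, op):
    """Apply op year by year, using only the years present in both series."""
    return {year: op(a[year], b[year]) for year in sorted(a.keys() & b.keys())}


# A query string that adds the counts of two tagged verbs.
query = f"({tag('telephone', 'VERB')} + {tag('phone', 'VERB')})"

# Addition: combined usage of both verbs in each year.
total = combine(telephone_verb, phone_verb, add)

# Division: the share of that combined usage taken by "phone".
# (This would raise ZeroDivisionError for a year whose total is zero.)
share = combine(phone_verb, total, truediv)

print(query)
print(total)
print(share)
```

Dividing one series by a sum like this is a handy way to see one word gradually replacing another, rather than just comparing two separate curves.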

Limitations

–       Due to limits on the size of the Ngram database, only n-grams that occur in at least 40 books are indexed; otherwise the database could not store every possible combination of words.

–       Typically, search terms cannot end with punctuation, although a separate full stop (period) can be searched for. Also, a trailing question mark (as in “Why?”) will cause a second search for the question mark on its own.

–       Once relevant books are found, the whole book is often not available for reading, whether because of copyright law or for some other reason. However, this is more a problem of copyright law as it applies to Google Books than a limitation of the Ngram Viewer itself.

– Only 6% of books ever published? I am not sure this can be seen as a limitation, as in reality it is an immense collection.
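The 40-book cutoff in the first limitation is easier to picture with a small example. The Python sketch below counts how many distinct books each n-gram appears in and keeps only those that reach a threshold. It is a toy model under my own assumptions, not Google's actual pipeline: the function names, the whitespace tokenisation and the inclusive comparison are all hypothetical choices.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def index_ngrams(books, n, min_books=40):
    """Map each n-gram to the number of books containing it.

    books maps a book id to its list of tokens. N-grams found in fewer
    than min_books distinct books are dropped. Treating the cutoff as
    inclusive is an assumption made for this sketch.
    """
    seen_in = defaultdict(set)
    for book_id, tokens in books.items():
        for gram in ngrams(tokens, n):
            seen_in[gram].add(book_id)
    return {gram: len(ids) for gram, ids in seen_in.items() if len(ids) >= min_books}


# Tiny made-up corpus: three identical books and one that differs.
books = {i: "the cat sat".split() for i in range(3)}
books[3] = "the dog sat".split()

# With a threshold of 3, the phrases unique to book 3 disappear.
print(index_ngrams(books, n=2, min_books=3))
```

Rare phrases vanish entirely under a rule like this, which is why a search for something obscure can come back with no results even if it did appear in a handful of books.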
