Google N-Gram Viewer Critique


Jon Orwant and co-creator Will Brockman, along with a team of engineers, launched the Google N-Gram Viewer, an extension of Google Books, in December 2010. It is essentially a graphing tool that displays the yearly count of selected n-grams (words or phrases) [1]. It draws its data from more than eight million of the 20 million books in the Google Books archive, an estimated 6% of all books ever published, spanning 1500 to 2008. Containing approximately 500 billion words [2] in British English, American English, French, German, Russian, Spanish, Hebrew and Chinese, the database is extensive, to say the least.
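To make the idea of a yearly n-gram count concrete, here is a minimal Python sketch. It is an illustration only and not Google's actual pipeline: the `books` list, the function names and the simple whitespace tokenisation are all assumptions made for the example. Dividing each count by the year's total is one way to keep years with many books comparable with years with few.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_counts(books, n):
    """Count n-grams per publication year.

    `books` is a hypothetical list of (year, text) pairs standing in
    for a scanned corpus.
    """
    counts = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()  # naive tokenisation, an assumption
        counts[year].update(ngrams(tokens, n))
    return counts


def relative_frequency(counts, phrase, year):
    """Share of that year's n-grams that match `phrase`."""
    total = sum(counts[year].values())
    return counts[year][phrase] / total if total else 0.0


# Invented toy corpus, for illustration only.
books = [
    (1900, "the best of times the worst of times"),
    (1900, "the best laid plans"),
    (2000, "the internet is the best"),
]
counts = yearly_counts(books, 1)
```

Calling `relative_frequency(counts, "best", 1900)` then gives the share of 1900's words that are "best", which is the kind of value a trend line could be drawn from.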

Because the n-grams are gathered from Google Books, the main technology behind the data is OCR (optical character recognition). This is the first possible limitation of the N-Gram Viewer, since OCR errors can and do occur: the word ‘internet’, for example, appears before the 1950s. Google addresses this in its frequently asked questions section, saying that it does a good job of filtering out books with low OCR quality scores but that some errors do slip through [3]. Letters can also be misread by the OCR technology. Especially interesting are examples of the ‘long medial s’ [4], which in fairness looks almost identical to the modern ‘f’; a common example is the heading of the United States Bill of Rights, shown below.

[Image: heading of the Bill of Rights, showing the long medial s]

This raises concerns over the accuracy of some of the graphs produced by the N-Gram Viewer, even though Google is combating the issue. However, an example given by Google shows the evident improvement in its OCR technology: the graph below compares occurrences of the word ‘beft’ (an OCR misreading of ‘best’) and shows a significant improvement in 2012.

[Graph: occurrences of ‘beft’, 2009-2012]
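The long s confusion can be illustrated with a short Python sketch. This is a hypothetical clean-up step, not the method Google uses: the tiny `vocabulary`, the `raw_counts` figures and both function names are invented for the example. The idea is that a token such as ‘beft’ is not a known word, but becomes one when an ‘f’ is turned back into an ‘s’, so its count can be moved onto ‘best’.

```python
from itertools import product


def long_s_candidates(token, vocabulary):
    """Known words reachable by turning some 'f' letters back into 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    variants = {"".join(chars) for chars in product(*options)}
    variants.discard(token)  # the unchanged token is not a repair
    return sorted(variants & vocabulary)


def fold_long_s(counts, vocabulary):
    """Move counts of unknown tokens onto their single unambiguous repair."""
    merged = {}
    for token, count in counts.items():
        target = token
        if token not in vocabulary:
            candidates = long_s_candidates(token, vocabulary)
            if len(candidates) == 1:  # skip ambiguous or hopeless cases
                target = candidates[0]
        merged[target] = merged.get(target, 0) + count
    return merged


# Invented word list and counts, for illustration only.
vocabulary = {"best", "congress", "first", "fast"}
raw_counts = {"beft": 7, "best": 3, "congrefs": 2, "first": 5}
cleaned = fold_long_s(raw_counts, vocabulary)
```

Only tokens that are not already real words are touched, so a genuine word containing an ‘f’ (such as ‘first’) keeps its own count. A real correction would need a far larger dictionary and some care with ambiguous cases.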

The N-Gram Viewer is essentially a ‘text mining tool’ that uses the data supplied by Google Books to let users identify trends over time. Using it, you begin to realise the immense scale of the Google Books archive, and that the viewer puts an extensive amount of both primary and secondary source material in front of the user clearly and instantly. It is therefore an extremely useful and addictive visual tool which, in its most basic form, allows users to track the use of words through time. The metadata supplied through Google Books now also allows users to browse the instances in which the searched n-gram was used, grouped by time period, with hyperlinks to the original source material in the Google Books archive, as shown below. Google also clearly explains the various part-of-speech tags that can be used in searches, which greatly improves the usability of the site (also shown below).

[Image: hyperlinks to books in the Google Books archive]

[Image: the part-of-speech tags available in searches]

Because the data is extracted straight from Google Books, it provides reliable, useful and relevant sources. However, due to copyright law, many of the books are at present available only as previews rather than in full; this is a legislative problem rather than a direct fault of the N-Gram Viewer. It also seems that the ‘best’ sets of data, in terms of correlating n-grams, come from the 1800s onwards, although that is not to say that searches from 1500 don’t produce data that may be both useful and interesting to potential users.

The N-Gram Viewer is extremely user friendly and accessible, and the sheer size of the database that Google has at its disposal allows for endless possibilities in how people can, and actually do, use it. This is reflected in the volume of searches, at 50 a minute. The popularity of the N-Gram Viewer speaks volumes, and the individuality and innovation of a project such as this from Google represents the future of digital history online, most importantly by keeping people engaged in history in a new and creative way through a resource that would otherwise not have been available. I therefore highly recommend the Google N-Gram Viewer, not only as a useful tool for providing historical perspective over time, but also as an addictive invention that anyone can use to explore its endless potential.

References

[1] – http://en.wikipedia.org/wiki/Google_Ngram_Viewer

[2] – http://libweb.lib.buffalo.edu/pdp/index.asp?ID=497

[3] – http://searchengineland.com/when-ocr-goes-bad-googles-ngram-viewer-the-f-word-59181

[4] – http://en.wikipedia.org/wiki/File:Bill_of_Rights_Pg1of1_AC.jpg

[5] – http://www.culturomics.org/Resources/A-users-guide-to-culturomics

[6] – http://googleresearch.blogspot.co.uk/2012/10/ngram-viewer-20.html