1/2/11

Google's Ngram Viewer: Misanalyses & Methodology

Since I noticed that you can tweet your Google Ngram Viewer results, I have been searching Twitter for people's n-grams.  And I've been amazed/amused at the sweeping conclusions implied by their comparisons of the usage of a few words or phrases over time.

Google's Ngram Viewer is an amazing new tool that lets you search several different corpora for the frequency with which words and phrases appear in them.  Together the corpora contain around 500 billion words, drawn from roughly 4% of all books ever published.

As some have pointed out, there are technical issues inherent in scanning millions of books by computer, including OCR errors (the computer misreads the scanned text) and metadata errors (information about the book, most importantly the date of publication).  The English One Million is a corpus that has been cleaned up for OCR quality and balanced by sampling about 6,000 books per year, but it is not separated into British and American English.

A recently published paper on the construction of this corpus and its possible applications defines a new academic discipline, culturomics, which uses quantitative analysis of language use to draw inferences about culture.  Although the authors are enthusiastic about the potential uses of the Ngram Viewer, they emphasize how cautiously the results should be interpreted.  But I'm guessing that not many people doing n-grams have read that.

The technical issues undermine the validity of Ngram Viewer analyses, but there are also problems with how people interpret the results.  The Binder Blog has a good post questioning the relationship between word frequency and culture that culturomics assumes, as well as the way word meanings evolve over time.
  
Dr. Mark Davies at BYU has constructed some corpora that you can use for the same purpose.  In his comparison of his Corpus of Historical American English (COHA) with the Google Books corpus, Davies contrasts the limitations of the latter with the greater utility of COHA for so-called culturomic research.  But he does praise the effort behind the corpora and the Ngram Viewer's ease of use.

Back to the Ngram Viewer.  When you type in a 1-gram (a single word), it shows the frequency for that word without distinguishing the part of speech.  Searches are also case-sensitive, so for nouns it's a good idea to enter both capitalized and uncapitalized forms, and plural forms too.  Compare them all in the same graph to get a sense of whether a word has generally been used as a proper noun, or maybe even more commonly as a verb.
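Those rules of thumb can be sketched as a tiny helper that spells out the variants worth pasting into one graph.  This is my own illustration, not anything the Ngram Viewer provides, and it uses naive "add an s" pluralization rather than real English morphology:

```python
def ngram_variants(word):
    """Generate the case and plural variants of a noun worth comparing
    in the same Ngram Viewer graph.  A rough sketch: naive 's'
    pluralization only, so it won't handle irregular nouns."""
    base = word.lower()
    forms = [base, base.capitalize(), base + "s", base.capitalize() + "s"]
    # The Viewer accepts several comma-separated n-grams in one query.
    return ",".join(forms)

print(ngram_variants("air"))  # air,Air,airs,Airs
```

Plotting all four lines together quickly shows whether the capitalized (proper-noun) or plural forms dominate.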

The frequency of a word's use in books at some point in history does not tell you what was actually happening then, but it does offer insight into the preoccupations of those whose thoughts, expressed in words, were preserved by one artifact of culture: the book.  And it is interesting.  The vast number of words in the Google Books corpora is what gives the Ngram Viewer its authoritative air.

But what would it make of 'air'?  So many meanings for such a simple word.

The best way I have found to use the Ngram Viewer is comparing singular (as in unique) concepts or phrases.  Pay attention to how the phrases and terms are constructed, and play around with the capitalization of the words in them.  Here's a good example from Twitter of the analysis of a unique term and a poorly constructed one in the same n-gram:
Creationism vs. evolution theory in Google Books Ngram Viewer. WTF happened in 1980? http://t.co/LELxh3l  (Change the %2C to commas.)
If you change "evolution theory" to "theory of evolution", it dwarfs "creationism" in the n-gram, and capitalizing "Darwinism" changes the graph a lot as well, which demonstrates the usefulness of my rules of thumb.  Feel free to use them.
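That %2C in the tweeted link is just the comma between the two queries, percent-encoded.  As a sketch of how such a shareable link is put together, here is a small helper using Python's standard URL encoding; the base URL is the Viewer's current address (the original post predates it), and the function name and defaults are my own:

```python
from urllib.parse import urlencode

def ngram_url(*phrases, year_start=1800, year_end=2000):
    """Build a shareable Ngram Viewer query URL for several phrases.
    A sketch only: the base URL is the Viewer's present-day address,
    and the parameter defaults are arbitrary choices of mine."""
    params = {
        "content": ",".join(phrases),   # comma-separated queries
        "year_start": year_start,
        "year_end": year_end,
    }
    # urlencode percent-encodes the commas (as %2C) and spaces for us.
    return "https://books.google.com/ngrams/graph?" + urlencode(params)

print(ngram_url("theory of evolution", "creationism"))
```

Paste the result into a browser and you get the same kind of link as the tweet, commas encoded and all.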

It's a great resource that's open for everyone to use.  There are limitations, but this is just version 1.0 of the Ngram Viewer.  For now, keep those limitations in mind, but try it out.

2 comments:

  1. The Binder Blog entry you're looking for is "The problem with Google's thin description" http://thebinderblog.com/2010/12/18/google-ngrams-thin-description/

  2. Natalie, I did link to your post on 'thin description' (and your post about metadata, as well). I was thinking of these issues before I found your posts on them so I tried to cover different ground. Or maybe the same ground, somewhat differently.
