Saturday, December 18, 2010

N-gram games

Google have announced a new toy to play with - actually, it's not really a toy, but the biggest searchable English language corpus in the world. 360 billion words from around 5 million books are at your disposal. That's over a thousand times larger than any existing corpus.

It's not as well structured as some university-based corpora - for example, they've side-stepped the issue of copyright with newer books by not showing you the context of the word you're searching for. There's also no ability to fine-tune for genre, or for anything other than "British" or "American" English. But it's still a lot of fun. There's a nice discussion about it here on good ol' Language Log, but if you're more interested in just playing you can head here.

I thought I'd use it on some old lexical friends of ours to see how it goes. First of all, after our recent discussion about "luck out" I plugged in that phrase. These are the frequencies I got:

Given that there wasn't much in the newspaper corpora from earlier than the mid-20th century, this chart adds to the theory that "luck out" is a relatively recent phrase, especially when we look at some of the examples from the 19th century and find things like:

we must not leave " good luck" out of the statement, as we feel assured that ' good luck' is a great point towards a fortune (Pierce Egan's book of sports, and mirror of life, 1832, p. 71)


Have yure eyes about you, and luck out for sparks
(Thomas Hood, Hood's own, or, laughter from year to year, 1839, p. 352)

The first is a different structure, picked up only because the OCR missed the punctuation, while the second appears to be an outdated way to spell 'look.'
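To see how lost punctuation produces a false hit like the first one, here's a rough sketch in Python. It is not how Google builds its n-gram counts; the tokenising rules and the function names (`bigrams`, `count_phrase`) are my own assumptions for illustration, and the Egan quote is shortened. It just counts adjacent word pairs with and without punctuation kept as tokens.

```python
import re

def bigrams(text, keep_punctuation):
    """Split text into lowercase tokens and return the adjacent pairs."""
    if keep_punctuation:
        # Words and punctuation marks each become their own token.
        tokens = re.findall(r"[a-z]+|[^\sa-z]", text.lower())
    else:
        # Assumption: mimics OCR output in which punctuation was lost.
        tokens = re.findall(r"[a-z]+", text.lower())
    return list(zip(tokens, tokens[1:]))

def count_phrase(text, phrase, keep_punctuation):
    """Count how often a two-word phrase occurs as a bigram in text."""
    target = tuple(phrase.lower().split())
    return bigrams(text, keep_punctuation).count(target)

# Shortened versions of the two 19th-century examples.
egan = 'we must not leave "good luck" out of the statement'
hood = "Have yure eyes about you, and luck out for sparks"

print(count_phrase(egan, "luck out", keep_punctuation=False))  # 1 (false hit)
print(count_phrase(egan, "luck out", keep_punctuation=True))   # 0
print(count_phrase(hood, "luck out", keep_punctuation=True))   # 1
```

Once the quotation mark is gone, "luck" and "out" sit next to each other and the Egan sentence counts as a hit. The Hood example matches either way, because there the problem is the spelling rather than the punctuation.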

The same types of misreading occur up until the first reference I've found for the phrase as we know it, in George A. Meyer's 1975 book "The two-word verb: a dictionary of the verb-preposition phrases in American", where we find this entry for "Luck":

Used as a verb only in the expression "luck out", luck out I (9) Slang. John lucked out when his motorcycle crashed into the big truck. (He was not seriously injured.)

This adds weight to the earlier analysis that the 'positive' definition of "luck out" is more of a USA usage than a UK one - and since the earliest usage I could find was a quote from a baseball player in 1971, the time frame for this 1975 book entry is also about right. After that 1975 entry more examples pop up, and it appears that the sharp upward curve in the frequency count is somewhat attributable to an increase in this phrase.

Another old friend of ours here is the pejorative term "douche bag." Here we see a spike in usage in the 1920s:

The 1920s peak in usage is from the heyday of the douche bag as a piece of medical equipment, and the search function provides you with a baffling and occasionally scary array of books and journals on the topic. These kinds of references occur right into the 2000s, and are still more common than the derogatory usage, which begins to creep in during the 1970s and 1980s. Still, these references are not nearly common enough to single-handedly explain the rise in usage of "douche bag" since the 1970s.

Strangely enough, our old buddy "awkward turtle" has not made it into a publication in the Google corpus. This is possibly because the corpus stops in 2000 and "awkward turtle" is newer, but it's also likely because a corpus built only from published books captures a limited range of language use. A reminder that while book and published material corpora are interesting and useful, they're not always the final word on language use!

You too can play at home! The Lousy Linguist has a great little post today about how not to interpret n-grams, but have a play for yourself and if you find anything amusing let us know below!


  1. I love N-gram! I'm just glad my internet access might be a little restricted this coming week so I'll get some work done.

  2. A couple comments on "douche-x":

    1. Urban Dictionary has an entry for douche canoe here with several hundred voters responding.

    2. We should expect to see a wild rise in this post-2000 thanks largely to Jon Stewart. But, as an online poker player, I can assure you that, if we ever get access to the Poker Stars chat logs, the rate of douche-x/doosh-x constructions will increase exponentially.