I downloaded a new keyboard onto my phone yesterday and was pleasantly surprised to see it suggesting the word “to” after I had typed “going”. It “knew” that the phrase “going to” was a very common pair of words and managed to use this in order to speed up the message I was sending. The next word it suggested was “be”, and then “late”. While I was less pleased to see my phone pre-diagnosing my punctuality problem, it had indeed correctly predicted the message I planned to send.
This is a practical application of something called “corpus linguistics”. And now that the good people of Google have developed a tool called the n-gram viewer, some of its insights are open to all. Anyone can now become an armchair linguist.
Corpus linguistics is essentially a big data approach to studying language that predates the big data age. It can be traced back to the earliest days of linguistic study, when people counted the frequency of words and other patterns in the Bible. But it has really taken off in the past 20 years thanks to a massive increase in the availability of sources. Sometimes these collections, known as corpora, are massive online libraries; sometimes they are smaller and more specialised. But none is larger than Google Books, which provides us with an unparalleled treasure trove of sources to scour in our quest to understand how language has changed.
Studying these sources has greatly enhanced our understanding of how words work together, and also how they may carry certain other qualities. We know, for example, that the verb “cause” is often used in a negative sense, as in “cause a problem” or “cause delays”, whereas the term “lead to” is generally more neutral.
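Both the keyboard's suggestions and observations like the one about “cause” rest on counting which words tend to sit next to each other. What follows is a minimal sketch of that idea in Python, not a description of how any real keyboard works. The toy corpus and the function names `build_bigram_counts` and `suggest_next` are invented for illustration.

```python
from collections import Counter, defaultdict

def build_bigram_counts(text):
    """For every word, count which words immediately follow it."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def suggest_next(following, word, k=1):
    """Return up to k words most often seen straight after `word`."""
    seen = following.get(word.lower(), Counter())
    return [w for w, _ in seen.most_common(k)]

# Toy corpus, made up for this example (an assumption, not real data).
sample = (
    "i am going to be late . "
    "we are going to be fine . "
    "she is going to be late . "
    "they are going home ."
)

counts = build_bigram_counts(sample)
print(suggest_next(counts, "going"))  # ['to']
print(suggest_next(counts, "be"))     # ['late']
```

With only four sentences the “prediction” is crude, but the same counting applied to millions of words is what lets patterns such as “cause” plus a negative noun stand out.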
Many big sources of language contain millions of words but Google’s database, sometimes called the World Brain, contains millions of books. It’s a collection that has emerged out of Google’s attempt to digitise every book ever written. A huge team of people and robots are working to scan hard copies of books to add to the database and while it is nowhere near complete, its size is already extraordinary.
The n-gram viewer lets users find out how often a word has appeared across the whole library of books and the results are illuminating.
If you type the words “email”, “telegram” and “telephone” into Google Books, for example, you can see how these various forms of communication technology have come in and out of fashion over the years.
You can see that use of the word “telegram” peaked in the 1920s and 1930s before beginning a steady decline. You can chart the rapid rise of email in our lexicon since its introduction into the mainstream in the 1990s and see how we appear to fall in and out of love with the telephone.
And if you put in the word “wireless”, n-gram shows how the word was popular, then got lost in the wilderness for some time before coming back with a different meaning.
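Charts like these come down to a simple calculation: for each year, how large a share of all the words published is taken up by the word you are interested in. The Python sketch below shows that calculation on a tiny invented “library”. It assumes counts are divided by the total number of words for the year, handles single words only, and makes no claim to mirror Google's own implementation. The name `yearly_frequency` and the sample texts are hypothetical.

```python
from collections import Counter

def yearly_frequency(books, term):
    """Return {year: share of that year's words that equal `term`}."""
    hits = Counter()
    totals = Counter()
    for year, text in books:
        words = text.lower().split()
        totals[year] += len(words)
        hits[year] += words.count(term.lower())
    # Skip years with no words at all to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}

# Invented (year, text) pairs standing in for scanned books.
toy_library = [
    (1925, "send a telegram today"),
    (1925, "the telegram arrived late"),
    (1995, "send an email today"),
    (1995, "the telephone rang and the email arrived"),
]

print(yearly_frequency(toy_library, "telegram"))  # {1925: 0.25, 1995: 0.0}
```

Dividing by the yearly total matters because far more books are published now than a century ago, so raw counts alone would make almost every word look as if it were on the rise.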
There are other shifts that can be seen beyond the use of technology. A comparison of the phrases “air stewardess” and “cabin crew” is one example of the rise of gender-neutral language since the 1970s. It also shows the persistence of the gender-specific variant.
Sometimes the data can be deceptive though. The graphs show that “male nurse” is a much more common phrase than “female nurse” but that its use has been in decline since the 1970s. But that doesn’t necessarily mean there are more male nurses or that fewer men have become nurses in the past 40 years; it may simply have seemed necessary in the past to explicitly state that a nurse was male when the profession was more readily associated with women.
Peace and love
Some of the charts can be fairly depressing, while others can be uplifting. Anyone will tell you that the path of true love never did run smooth, and nor has our use of the word over the years.
Google tells us that peace seems to be making a comeback after a tough few decades, as do faith and hope. But charity is far less talked about.
Politics and language
This is a blunt tool but it can provide a jumping off point for investigating some complex issues. Use of the term “neo-liberal agenda” has overtaken the term “socialist agenda” in recent years. It’s not immediately clear why but trying to find that out could take you in all kinds of directions.
And some graphs can make us instantly question a popular belief. The term “broken Britain” has become something of a catchphrase for the current UK government, but it seems it actually has a surprisingly long history. The fact that its use spiked around 1950 and again in the mid-sixties might put to bed the idea that the youth of today are worse than ever.
There are, of course, many things that don’t turn up in n-grams, and many of the graphs pose more questions than they answer, but they give a fascinating insight into the way language has changed to reflect our culture. Or perhaps, as some graphs might indicate, the culture is changing to reflect the language.