What is Zipf’s Law?

By Reece Goodall

Jul. 25, 2020
Posted in Books

Whether you know it or not, half of all the conversations you’ve had today will have been made up of a very small selection of common words. This is the result of something called Zipf’s Law, a probability rule that reveals a lot of interesting things about our language.

About 6% of everything we read and say is made up of the word ‘the’, the most common word in the English language. This conclusion is the result of an analysis of all public domain English texts, in which words were ranked in order of frequency. Through this, we know that the 20 most frequent words in the English language are as follows: the, of, and, to, a, in, is, I, that, it, for, you, was, with, on, as, have, but, be, they. The analysis also revealed something really interesting about the relationship between words. We use ‘of’, word number two, half as often as we do ‘the’. We use ‘and’, word number three, a third as often as we do ‘the’.

Whichever word you choose, if it has a rank of n, we use it roughly 1/n times as often as we do the most used word. This is what we call Zipf’s Law and, interestingly, it doesn’t only apply to English – it appears to hold for nearly every language that has been studied, even ancient ones that haven’t been fully translated. This phenomenon was popularised by George Zipf, a linguist at Harvard University, and is closely related to the Pareto Principle, which states that roughly 20% of causes are responsible for 80% of outcomes. Something similar is true of language – the most frequently used 18% of words make up 80% of communication.
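
To make the 1/n rule concrete, here is a minimal Python sketch of the idealised law. It assumes that ‘the’ accounts for 6% of all words, as quoted above, and the names `TOP_WORDS`, `TOP_SHARE` and `zipf_share` are made up for this illustration. Real word counts only follow the rule approximately.

```python
# Idealised Zipf's Law: share(rank n) = share(rank 1) / n.
# Assumption: 'the' makes up about 6% of all words, as quoted in the article.
TOP_WORDS = ["the", "of", "and", "to", "a"]  # first five words of the ranked list
TOP_SHARE = 0.06


def zipf_share(rank, top_share=TOP_SHARE):
    """Predicted share of all words taken by the word at this rank (1 = most common)."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_share / rank


# Print the predicted share for each of the five most common words.
for rank, word in enumerate(TOP_WORDS, start=1):
    print(f"{rank}. {word}: {zipf_share(rank):.1%}")
```

Under these assumptions, the sketch gives ‘of’ a share of 3% and ‘and’ a share of 2%, matching the half and third described above.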

Why is this the case? The frustrating answer is that we don’t actually know. Despite the best efforts of linguists and scientists to figure it out, there’s no clear answer. Zipf himself believed that speakers would naturally gravitate to as few words as possible to get their point across, but listeners wanted as large a vocabulary as possible in order to understand what was being said. This balance of efficient communication, he argued, led to the modern state of language. Some researchers argue that the common words help space out the less common ones, evening out the rate of information for listeners so they can better follow a sentence.

One person who has attempted to solve the mystery is Sander Lestrade, a linguist at Radboud University in the Netherlands. He charted a relationship between the structure of sentences (or syntax) and the meaning of words (semantics), using computer simulations to find that the two need each other for Zipf’s Law to work.

He said: “In the English language, but also in Dutch, there are only three articles, and tens of thousands of nouns. Since you use an article before almost every noun, articles occur way more often than nouns. Within the nouns, you also find big differences. The word ‘thing’, for example, is much more common than ‘submarine’, and thus can be used more frequently. But in order to actually occur frequently, a word should not be too general either. If you multiply the differences in meaning within word classes, with the need for every word class, you find a magnificent Zipfian distribution. And the distribution only differs a little from the Zipfian ideal, just like natural language does.”

There are a lot of factors that may explain Zipf’s Law. One is the ‘preferential attachment process’, where a word is more likely to be used once it has already been used. It could also be linked to something known as critical points. Conversations and writing stick to a certain topic until they reach a critical point, and then the subject changes – with this, the vocabulary also shifts. The general nature of words like ‘the’ and ‘and’ means they can transfer with far more ease to a new subject, whereas more specific words will not. Add this on to Zipf’s theory of language, and you’ve got a boiling pot of explanations.
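
The preferential attachment idea can be tried out in a few lines of Python. The sketch below is a toy ‘rich get richer’ model in the spirit of one proposed by Herbert Simon, not a reconstruction of any study mentioned here. The function name `simulate_text` and the settings `n_words`, `p_new` and `seed` are assumptions chosen for illustration. Each new word is either brand new or a repeat of a word picked at random from the text so far, so words that have already been used a lot are more likely to be used again.

```python
import random
from collections import Counter


def simulate_text(n_words=20000, p_new=0.1, seed=0):
    """Toy preferential attachment model; all parameter values are illustrative."""
    rng = random.Random(seed)
    text = [0]      # the text starts with a single word, labelled 0
    next_word = 1   # label for the next brand-new word
    while len(text) < n_words:
        if rng.random() < p_new:
            text.append(next_word)  # coin a brand-new word
            next_word += 1
        else:
            # Repeat a word from a random earlier position in the text:
            # words used more often so far are more likely to be picked.
            text.append(rng.choice(text))
    return Counter(text)


counts = simulate_text()
ranked = counts.most_common()  # (word, count) pairs, most frequent first
for rank in (1, 2, 4, 8, 16):
    word, freq = ranked[rank - 1]
    print(f"rank {rank}: used {freq} times")
```

Runs of this kind tend to produce a handful of very common words and a long tail of rare ones, which is the general shape Zipf’s Law describes, although the exact numbers depend on the settings chosen.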

How Zipf’s Law manifests is hugely interesting. On average, nearly half of any book, conversation or article will be nothing but the same 50-100 words, and roughly half of the different words used will appear only once in that particular selection. That is perhaps unsurprising when, if you do the sums, the top 25 most used words make up about a third of everything we say. This is supported by examples – in Alice’s Adventures in Wonderland, 44% of the unique words used appear only once in the book. For The Adventures of Tom Sawyer, that figure is 49.8%.
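
Figures like these can be checked on any text you have to hand. The Python sketch below is one rough way to do it, under simple assumptions: a word is any run of letters or apostrophes, and capital letters are ignored. The function name `word_stats` and the short `sample` sentence are invented for this illustration, so the numbers it prints say nothing about the books mentioned above.

```python
import re
from collections import Counter


def word_stats(text, top_k=25):
    """Share of the text covered by the top_k words, and share of words used only once."""
    # Assumption: a 'word' is a run of letters or apostrophes, ignoring case.
    words = re.findall(r"[a-z'’]+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    top_total = sum(freq for _, freq in counts.most_common(top_k))
    used_once = sum(1 for freq in counts.values() if freq == 1)
    return {
        "total_words": len(words),
        "unique_words": len(counts),
        "top_k_share": top_total / len(words),    # share of all words in the text
        "once_share": used_once / len(counts),    # share of the unique words
    }


# Made-up sample sentence, only to show the function running.
sample = "the cat sat on the mat and the dog sat on the log"
stats = word_stats(sample, top_k=2)
print(stats)
```

To try it on a full book, read the text from a file and pass it to `word_stats` in place of `sample`.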

Zipf’s Law is interesting, because it implies a hidden and predictable structure to something as creative as language. According to language rankings, we like ‘knowing’ more than we do a ‘mystery’, so perhaps one day we’ll learn a lot more about Zipf’s Law.
