For some reason, the number of times a word is used is just proportional to one over its rank. Plotted on a log-log graph, word frequency and rank follow a nice straight line. A power law. This phenomenon is called Zipf's law and it doesn't only apply to English. It also applies to other languages, like, well, all of them. Even ancient languages we haven't been able to translate yet. And here's the thing. We have no idea why. It's surprising that something as complex as reality should be conveyed by something as creative as language in such a predictable way. How predictable?
Well, watch this. According to WordCount.org, which ranks words as found in the British National Corpus, "sauce" is the 5,555th most common English word. Now, here is a list of how many times every word on Wikipedia and in the entire Gutenberg Corpus of tens of thousands of public domain books shows up. The most used word, "the," shows up about 181 million times. Knowing these two things, we can divide 181 million by 5,555 and estimate that the word "sauce" should appear about 30,000 times on Wikipedia and Gutenberg combined. And it pretty much does. What gives? The world is chaotic. Things are distributed in myriad ways, not just power laws. And language is personal, intentional, idiosyncratic. What about the world and ourselves could cause such complex activities and behaviors to follow such a basic rule? We literally don't know.
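The "sauce" estimate above can be sketched in a few lines of Python. This is only an illustration, assuming an ideal Zipf curve with an exponent of exactly 1 and using the two figures quoted above (the rank comes from one corpus and the count from another, so the match is rough at best). The function name `zipf_estimate` and the constant names are made up for this sketch.

```python
# Zipf's law in its simplest form (assumed exponent of 1):
#   count(rank) ≈ count(rank 1) / rank

def zipf_estimate(top_count: int, rank: int, exponent: float = 1.0) -> float:
    """Estimate how often the word at `rank` appears, given the count of the
    most common word. `exponent` defaults to 1, the idealized Zipf case."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_count / rank ** exponent

# Figures quoted in the transcript.
THE_COUNT = 181_000_000   # approximate occurrences of "the" (Wikipedia + Gutenberg)
SAUCE_RANK = 5_555        # rank of "sauce" per WordCount.org

estimate = zipf_estimate(THE_COUNT, SAUCE_RANK)
print(f"Estimated occurrences of 'sauce': {estimate:,.0f}")
# prints: Estimated occurrences of 'sauce': 32,583
```

That lands close to the "about 30,000" figure, which is all the rule of thumb promises.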
More than a century of research has yet to close the case. Moreover, Zipf's law doesn't just mysteriously describe word use. It's also found in city populations, solar flare intensities, protein sequences and immune receptors, the amount of traffic websites get, earthquake magnitudes, the number of times academic papers are cited, last names, the firing patterns of neural networks, ingredients used in cookbooks, the number of phone calls people receive, the diameters of Moon craters, the number of people that die in wars, the popularity of opening chess moves, even the rate at which we forget. There are plenty of theories about why language is "Zipf-y," but no firm conclusions, and this video doesn't contain a definite explanation either. Sorry, I know that's a bummer, since we appear to like knowing more than mystery. But that said, we also ask more than we answer. So let's dive into Zipf's ramifications, some related patterns, some possible explanations and the depth of the mystery itself.
Zipf's law was popularized by George Zipf, a linguist at Harvard University. It is a discrete form of the continuous Pareto distribution, from which we get the Pareto Principle. Because so many real-world processes behave this way, the Pareto Principle tells us that, as a rule of thumb, it's worth assuming that 20% of the causes are responsible for 80% of the outcome, like in language, where the most frequently used 18% of words account for over 80% of word occurrences. In 1896, Vilfredo Pareto showed that approximately 80% of the land in Italy was owned by just 20% of the population. It is said that he later noticed that in his garden, 20% of his pea pods contained 80% of the peas. He and other researchers looked at other datasets and found that this 80-20 imbalance comes up a lot in the world. The richest 20% of humans have 82.7% of the world's income. In the US, 20% of patients use 80% of health care resources. In 2002, Microsoft reported that 80% of the errors and crashes in Windows and Office are caused by 20% of the bugs detected. A common rule of thumb in the business world states that 20% of your customers are responsible for 80% of your profits and 80% of the complaints you receive will come from 20% of your customers. A book titled "The 80/20 Principle" even says that in a home or office, 20% of the carpet receives 80% of the wear. Oh, and as Woody Allen famously said, "80% of success is just showing up." The Pareto Principle is everywhere, which is good. By focusing on just 20% of what's wrong, you can often expect to solve 80% of the problems.
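To see how a Zipf-style ranking produces that kind of lopsided split, here is a rough sketch in Python. It assumes an idealized curve where the word at rank r is used in proportion to 1/r, and it uses a made-up vocabulary of 100,000 words, so it won't reproduce the 18% figure quoted above, which comes from real text. The names `top_share` and `VOCAB_SIZE` are hypothetical.

```python
def top_share(vocab_size: int, top_fraction: float, exponent: float = 1.0) -> float:
    """Share of all word occurrences taken by the top `top_fraction` of ranks,
    assuming the word at rank r has weight 1 / r**exponent (idealized Zipf)."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be 1 or greater")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    # Relative frequency of each rank, from most to least common.
    weights = [1.0 / r ** exponent for r in range(1, vocab_size + 1)]
    # Number of top-ranked words to count (at least one).
    cutoff = max(1, round(vocab_size * top_fraction))
    return sum(weights[:cutoff]) / sum(weights)

VOCAB_SIZE = 100_000  # hypothetical vocabulary size, chosen only for illustration

share = top_share(VOCAB_SIZE, 0.20)
print(f"Top 20% of words cover {share:.1%} of all occurrences")
```

Under these assumptions the top 20% of words cover a bit under 87% of all occurrences. The exact share shifts with the vocabulary size and the exponent, but the imbalance itself stays.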
A variety of different, unrelated factors cause this 80-20 imbalance to show up from case to case, but if we can get to the bottom of what causes some of them, maybe we'll find that one or more of those mechanisms is responsible for Zipf's law in language. George Zipf himself thought language's interesting rank-frequency distribution was a consequence of the Principle of Least Effort: the tendency for life and things to follow the path of least resistance. Zipf believed it drove much of human behavior and hypothesized that as language developed in our species, speakers naturally preferred drawing from as few words as possible to get their thoughts out there. It was easier. But in order to understand what was being said, listeners preferred larger vocabularies that gave more specificity, so that they had to do less work. The compromise between listening and speaking, Zipf felt, led to the current state of language. A few words are used often and many, many, many words are used rarely.