Word rank

by | Jul 23, 2019 | Latest News

 

I recently came across a large database of the use of English-American words. It aims to provide a representative sample of the usage English-American by including the words extracted from a large number of English texts of different types – books, newspaper articles, magazines etc. In total it includes around 560 million words collected over the years 1990-2017.

The word ‘football’ occurs in the database 25,271 times and has rank 1543. In principle, this means that ‘football’ was the 1543rd most frequent word in the database, though the method used for ranking the database elements is a little more complicated than that, since it attempts to combine a measure of both the number of times the word appears and the number of texts it appears in. Let’s leave that subtlety aside though and assume that ‘football’, with a frequency of 25,271, is the 1543rd most common word in the database.

The word ‘baseball’ occurs in the same database 28,851 times. With just this information, what would you predict the rank of the word ‘baseball’ to be? For example, if you think ‘baseball’ is the most common word, it would have rank 1. (It isn’t: ‘the’ is the most common word). If you think ‘baseball’ would be the 1000th most common word, your answer would be 1000.

Give it a little thought, but don’t waste time on it. I really just want to use the problem as an introduction to an issue that I’ll discuss in a future post. I’d be happy to receive your answer though, together with an explanation if you like, by mail. Or if you’d just like to fire an answer anonymously at me, without explanation, you can do so using this survey form.

 

Stuart Coles

Stuart Coles

Author

I joined Smartodds in 2004, having previously been a lecturer of Statistics in universities in the UK and Italy. A famous quote about statistics is that “Statistics is the art of lying by means of figures”. In writing this blog I’m hoping to provide evidence that this is wrong.