
Exercises

8 September, 2015 - 10:43

Exercise 13.9. The “rank” of a word is its position in a list of words sorted by frequency: the most common word has rank 1, the second most common has rank 2, etc.

Zipf’s law describes a relationship between the ranks and frequencies of words in natural languages (http://en.wikipedia.org/wiki/Zipf's_law). Specifically, it predicts that the frequency, f, of the word with rank r is:

f = c r^{-s}

where s and c are parameters that depend on the language and the text. If you take the logarithm of both sides of this equation, you get:

\log f = \log c - s \log r

So if you plot log f versus log r, you should get a straight line with slope -s and intercept log c.
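To see why the slope recovers s, here is a minimal sketch that generates frequencies from the formula above and fits a least-squares line to the (log r, log f) points. The helper name `fit_line` and the values chosen for c and s are assumptions made up for this illustration; they do not come from the exercise or its linked solution.

```python
import math

def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic frequencies generated from f = c * r**(-s).
# c and s are made-up values, chosen only to illustrate the relationship.
c, s = 1000.0, 1.2
ranks = range(1, 51)
log_r = [math.log(r) for r in ranks]
log_f = [math.log(c * r ** -s) for r in ranks]

slope, intercept = fit_line(log_r, log_f)
print("estimated s:", -slope)
print("estimated c:", math.exp(intercept))
```

Because the synthetic data follow the formula exactly, the fitted slope is -s and the intercept is log c, up to floating-point error. Real word counts will only approximate a line, so the estimate of s depends on which ranks you include in the fit.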

Write a program that reads a text from a file, counts word frequencies, and prints one line for each word, in descending order of frequency, with log f and log r. Use the graphing program of your choice to plot the results and check whether they form a straight line. Can you estimate the value of s?
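One possible way to structure such a program is sketched below. This is not the linked solution; the function names (`word_frequencies`, `rank_lines`, `print_ranks`) and the simple rule for stripping punctuation are assumptions made for illustration, and the file you pass to `print_ranks` is whatever text you choose.

```python
import math
import string
from collections import Counter

def word_frequencies(text):
    """Count lowercase words, stripping leading and trailing punctuation."""
    counts = Counter()
    for word in text.split():
        word = word.strip(string.punctuation + string.whitespace).lower()
        if word:
            counts[word] += 1
    return counts

def rank_lines(counts):
    """Yield (word, log f, log r) in descending order of frequency."""
    # most_common() sorts by count, so the first word gets rank 1.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        yield word, math.log(freq), math.log(rank)

def print_ranks(filename):
    """Read a text file and print log f, log r, and the word, one per line."""
    with open(filename, encoding="utf-8") as f:
        counts = word_frequencies(f.read())
    for word, log_f, log_r in rank_lines(counts):
        print(f"{log_f:.4f} {log_r:.4f} {word}")
```

Words with equal counts receive consecutive ranks in the order `most_common()` returns them, which is one reasonable choice among several. The printed columns can be fed to any plotting tool to check for a straight line.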

Solution: http://thinkpython.com/code/zipf.py. To make the plots, you might have to install matplotlib (see http://matplotlib.sourceforge.net/).