Exercise 13.9. The “rank” of a word is its position in a list of words sorted by frequency: the most common word has rank 1, the second most common has rank 2, etc.
Zipf’s law describes a relationship between the ranks and frequencies of words in natural languages (http://en.wikipedia.org/wiki/Zipf's_law). Specifically, it predicts that the frequency, f, of the word with rank r is:
$$f = c \, r^{-s}$$
where s and c are parameters that depend on the language and the text. If you take the logarithm of both sides of this equation, you get:
$$\log f = \log c - s \log r$$
So if you plot log f versus log r, you should get a straight line with slope -s and intercept log c.
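As a quick check of the straight-line claim, here is a minimal sketch that builds frequencies following the power law exactly and fits a line to the log-log points. The helper name fit_line and the values c = 1000 and s = 1.2 are illustrative assumptions, not part of the exercise; the fit is ordinary least squares, and the code assumes Python 3.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data that obeys f = c * r**(-s) exactly (illustrative values).
c, s = 1000.0, 1.2
ranks = range(1, 51)
log_r = [math.log(r) for r in ranks]
log_f = [math.log(c * r ** (-s)) for r in ranks]

slope, intercept = fit_line(log_r, log_f)
print(slope, intercept)  # slope should be close to -s, intercept close to log c
```

With real text the points will only roughly follow a line, so the fitted slope gives an estimate of -s rather than an exact value.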
Write a program that reads a text from a file, counts word frequencies, and prints one line for each word, in descending order of frequency, with log f and log r. Use the graphing program of your choice to plot the results and check whether they form a straight line. Can you estimate the value of s?
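One possible starting point, sketched under a few assumptions: Python 3, words separated by whitespace or hyphens, and punctuation stripped from both ends of each word. The function names count_words and rank_freq are hypothetical, and this is not the linked solution. The demo at the bottom uses a tiny in-memory sample instead of a file so that it runs as is.

```python
import math
import string
from collections import Counter

def count_words(lines):
    """Count words in an iterable of lines (an open file works too)."""
    counts = Counter()
    for line in lines:
        # Assumption: treat hyphens as word separators.
        for word in line.replace('-', ' ').split():
            word = word.strip(string.punctuation + string.whitespace).lower()
            if word:
                counts[word] += 1
    return counts

def rank_freq(counts):
    """Return (word, log f, log r) tuples, most frequent word first."""
    ordered = counts.most_common()  # sorted by descending frequency
    return [(word, math.log(freq), math.log(rank))
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Tiny in-memory sample; for a real text, pass an open file instead.
sample = ["The cat and the dog.", "The end"]
rows = rank_freq(count_words(sample))
for word, log_f, log_r in rows:
    print(word, log_f, log_r)
```

To process a file, the same functions can be called as rank_freq(count_words(open(filename))), where filename is whatever text you choose. Words with equal frequency get consecutive ranks in the order they first appear, which is one of several reasonable ways to break ties.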
Solution: http://thinkpython.com/code/zipf.py. To make the plots, you might have to install matplotlib (see http://matplotlib.sourceforge.net/).