Word frequency analysis

As usual, you should at least attempt the following exercises before you read my solutions.

Exercise 13.1. Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase.

Hint: The string module provides strings named whitespace, which contains space, tab, newline, etc., and punctuation, which contains the punctuation characters. Let's see if we can make Python swear:

>>> import string
>>> print string.punctuation
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Also, you might consider using the string methods strip, replace, and translate.
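
Here is a minimal sketch of one way to approach this exercise (not the book's solution). It follows the Python 2 style of the interpreter example above, and the filename example.txt is just a placeholder for whatever text file you have on hand:

import string

def process_line(line):
    """Split a line into words, strip punctuation and whitespace
    from each word, and return the words in lowercase."""
    words = []
    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        if word:
            words.append(word.lower())
    return words

# example.txt is a placeholder; substitute any text file.
fp = open('example.txt')
for line in fp:
    print process_line(line)
fp.close()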

Exercise 13.2. Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format.

Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before.

Then modify the program to count the total number of words in the book, and the number of times each word is used.

Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?
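
One possible sketch, again in Python 2 and not the book's solution: the '*** START OF' marker used to detect the end of the Project Gutenberg header is an assumption about the file format, and emma.txt is a placeholder filename.

import string

def skip_gutenberg_header(fp):
    """Read and discard lines up to the end of the header.
    The '*** START OF' marker is an assumption about the format."""
    for line in fp:
        if line.startswith('*** START OF'):
            break

def process_file(filename, skip_header=True):
    """Return a dictionary that maps each word in the file to its count."""
    hist = {}
    fp = open(filename)
    if skip_header:
        skip_gutenberg_header(fp)
    for line in fp:
        for word in line.split():
            word = word.strip(string.punctuation).lower()
            if word:
                hist[word] = hist.get(word, 0) + 1
    fp.close()
    return hist

hist = process_file('emma.txt')     # placeholder filename
print 'Total number of words:', sum(hist.values())
print 'Number of different words:', len(hist)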

Exercise 13.3. Modify the program from the previous exercise to print the 20 most frequently used words in the book.
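
A sketch of one way to do this: most_common takes a histogram dictionary like the one built in the previous exercise. The toy dictionary below only makes the example runnable on its own; with a real book, pass the dictionary returned by process_file instead.

def most_common(hist):
    """Return a list of (frequency, word) pairs, most frequent first."""
    t = [(freq, word) for word, freq in hist.items()]
    t.sort(reverse=True)
    return t

# Toy histogram for illustration; replace with the real one.
hist = {'the': 5, 'and': 4, 'emma': 2, 'surprise': 1}
for freq, word in most_common(hist)[:20]:
    print word, freq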

Exercise 13.4. Modify the previous program to read a word list (see Reading word lists) and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?
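
One possible sketch of the dictionary subtraction, not the book's solution: it reads the word list from words.txt (as in Reading word lists) and uses a toy histogram so it runs on its own; with a real book, use the dictionary from Exercise 13.2.

def subtract(d1, d2):
    """Return a dictionary containing the keys of d1 that are not in d2."""
    res = {}
    for key in d1:
        if key not in d2:
            res[key] = None
    return res

# Build a dictionary of known words from the word list.
words = {}
for line in open('words.txt'):
    words[line.strip().lower()] = None

# Toy histogram for illustration; replace with the real one.
hist = {'emma': 2, 'woodhouse': 1, 'teh': 1}
for word in subtract(hist, words):
    print word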