
Advanced text parsing

24 February, 2015 - 09:25

In the above example using the file romeo.txt, we made the file as simple as possible by removing any and all punctuation by hand. The real text has lots of punctuation as shown below:

But, soft! what light through yonder window breaks?
It is the east, and Juliet is the sun.
Arise, fair sun, and kill the envious moon,
Who is already sick and pale with grief,

Since the Python split function looks for spaces and treats words as tokens separated by spaces, we would treat the words “soft!” and “soft” as different words and create a separate dictionary entry for each word.

Also, since the file has capitalization, we would treat “who” and “Who” as different words with different counts.
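A quick check makes both problems concrete. The sketch below splits the first line of the excerpt and then compares the strings directly; the variable names are only illustrative.

```python
line = 'But, soft! what light through yonder window breaks?'

# split only looks at whitespace, so punctuation stays attached to words
words = line.split()
print(words)

# To Python these are simply different strings, so each would get
# its own dictionary key and its own count
print('soft!' == 'soft')
print('Who' == 'who')
```

Both comparisons are false, which is why the counting program needs to normalize each line before splitting it.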

We can solve both these problems by using the string methods lower and translate together with the string.punctuation constant. The translate method is the most subtle of these. Here is the documentation for translate:

string.translate(s, table[, deletechars])

Delete all characters from s that are in deletechars (if present), and then translate the characters using table, which must be a 256-character string giving the translation for each character value, indexed by its ordinal. If table is None, then only the character deletion step is performed.

We will not specify the table, but we will use the deletechars parameter to delete all of the punctuation. We will even let Python tell us the list of characters that it considers “punctuation”:

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
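As a small illustration of how this constant can be used, the sketch below walks through one line of the excerpt and collects the characters that appear in string.punctuation. The names line and found are our own, chosen for the example.

```python
import string

line = 'But, soft! what light through yonder window breaks?'

# Collect every character of the line that Python lists as punctuation
found = []
for ch in line:
    if ch in string.punctuation:
        found.append(ch)

print(found)
```

These are the characters that the deletechars parameter will strip from this line.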

We make the following modifications to our program:

import string                                           # New Code

fname = raw_input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print 'File cannot be opened:', fname
    exit()

counts = dict()
for line in fhand:
    line = line.translate(None, string.punctuation)     # New Code
    line = line.lower()                                 # New Code
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

print counts

We use translate to remove all punctuation and lower to force the line to lowercase. Otherwise the program is unchanged. Note that in Python 2.5 and earlier, translate does not accept None as the first parameter, so use this code for the translate call instead:

line = line.translate(string.maketrans(' ', ' '), string.punctuation)
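To see the effect of the two new lines without needing a file, here is a minimal sketch that applies the same cleanup and counting to the four lines quoted earlier. The helpers strip_punctuation and count_words are hypothetical names introduced only for this example. The fallback branch is an assumption on our part: it lets the sketch also run on Python 3, where translate takes a single table argument, and it is not part of the program above.

```python
import string

def strip_punctuation(text):
    """Return text with every character in string.punctuation removed."""
    try:
        # Form used in this chapter: no table, only deletechars
        return text.translate(None, string.punctuation)
    except TypeError:
        # Assumption: on Python 3, build a table that deletes punctuation
        return text.translate(str.maketrans('', '', string.punctuation))

def count_words(lines):
    """Count words after removing punctuation and lowercasing each line."""
    counts = dict()
    for line in lines:
        line = strip_punctuation(line)
        line = line.lower()
        for word in line.split():
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1
    return counts

sample = [
    'But, soft! what light through yonder window breaks?',
    'It is the east, and Juliet is the sun.',
    'Arise, fair sun, and kill the envious moon,',
    'Who is already sick and pale with grief,',
]

counts = count_words(sample)
print(counts['sun'])    # 'sun.' and 'sun,' now share one entry
print(counts['who'])    # 'Who' is stored under the lowercase key
```

Without the cleanup, “sun.” and “sun,” would each have been counted once under separate keys.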

Part of learning the “Art of Python” or “Thinking Pythonically” is realizing that Python often has built-in capabilities for many common data-analysis problems. Over time, you will see enough example code and read enough of the documentation to know where to look to see if someone has already written something that makes your job much easier.

The following is an abbreviated version of the output:

Enter the file name: romeo-full.txt
{'swearst': 1, 'all': 6, 'afeard': 1, 'leave': 2, 'these': 2,
'kinsmen': 2, 'what': 11, 'thinkst': 1, 'love': 24, 'cloak': 1,
'a': 24, 'orchard': 2, 'light': 5, 'lovers': 2, 'romeo': 40,
'maiden': 1, 'whiteupturned': 1, 'juliet': 32, 'gentleman': 1,
'it': 22, 'leans': 1, 'canst': 1, 'having': 1, ...}

Looking through this output is still unwieldy. We can use Python to give us exactly what we are looking for, but to do so, we need to learn about Python tuples. We will pick up this example once we learn about tuples.