If we want to extract data from a string in Python we can use the findall() method to extract all of the substrings which match a regular expression. Let’s use the example of wanting to extract anything that looks like an e-mail address from any line regardless of format. For example, we want to pull the e-mail addresses from each of the following lines:
    From email@example.com Sat Jan 5 09:14:16 2008
    Return-Path: <firstname.lastname@example.org>
    for <email@example.com>;
    Received: (from apache@localhost)
    Author: firstname.lastname@example.org
We don’t want to write code for each of the types of lines, splitting and slicing differently for each line. The following program uses findall() to find the e-mail addresses in a string and extract one or more addresses from it.
    import re
    s = 'Hello from email@example.com to firstname.lastname@example.org about the meeting @2PM'
    lst = re.findall(r'\S+@\S+', s)
    print(lst)
The findall() method searches the string in the second argument and returns a list of all of the strings that look like e-mail addresses. We are using a two-character sequence that matches a non-whitespace character (\S).
The output of the program would be:

    ['email@example.com', 'firstname.lastname@example.org']
Translating the regular expression, we are looking for substrings that have at least one non-whitespace character, followed by an at-sign, followed by at least one more non-whitespace character. Also, the “\S+” matches as many non-whitespace characters as possible (this is called “greedy” matching in regular expressions).
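To see what greedy matching means in practice, here is a minimal sketch. The sample line is made up for illustration; the point is that “\S+” keeps consuming non-whitespace characters for as long as it can, so anything touching the address is swept into the match.

```python
import re

# Hypothetical sample line: the address is wrapped in angle brackets.
line = 'Return-Path: <firstname.lastname@example.org>'

# \S+ is greedy, so it does not stop at the edge of the address;
# it runs until it hits whitespace or the end of the string.
greedy = re.findall(r'\S+@\S+', line)
print(greedy)
```

The single match here includes the surrounding “<” and “>” because they are non-whitespace characters too.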
The regular expression would match twice (email@example.com and firstname.lastname@example.org) but it would not match the string “@2PM” because there are no non-blank characters before the at-sign. We can use this regular expression in a program to read all the lines in a file and print out anything that looks like an e-mail address as follows:
    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        x = re.findall(r'\S+@\S+', line)
        if len(x) > 0:
            print(x)
We read each line and then extract all the substrings that match our regular expression. Since findall() returns a list, we simply check if the number of elements in our returned list is more than zero to print only lines where we found at least one substring that looks like an e-mail address.
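If the mail file is not at hand, the same loop can be tried on a few lines held in a list. This is only a sketch: the sample lines and the name found are assumptions made for the example, not part of the original program.

```python
import re

# Stand-in for the lines of a mail file; these sample lines are illustrative.
lines = [
    'From email@example.com Sat Jan 5 09:14:16 2008',
    'Received: (from apache@localhost)',
    'Subject: no address on this line',
]

found = []  # collects the non-empty results so they can be inspected later
for line in lines:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line)
    # findall() returns an empty list when nothing matches,
    # so only lines with at least one match are kept and printed.
    if len(x) > 0:
        found.append(x)
        print(x)
```

The third line produces an empty list and is silently skipped, which is exactly what the length check is for.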
If we run the program on mbox-short.txt we get the following output:
Some of our e-mail addresses have incorrect characters like “<” or “;” at the beginning or end. Let’s declare that we are only interested in the portion of the string that starts and ends with a letter or a number.
To do this, we use another feature of regular expressions. Square brackets are used to indicate a set of multiple acceptable characters we are willing to consider matching. In a sense, the “\S” is asking to match the set of “non-whitespace characters”. Now we will be a little more explicit in terms of the characters we will match.
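As a small illustration of the difference, the sketch below applies a bracketed set and “\S” to the same made-up string. The string and the variable names are assumptions chosen for the example.

```python
import re

# Hypothetical sample text containing letters, digits, and punctuation.
text = '<a1>; B_2'

# [a-zA-Z0-9] matches exactly one character from the listed ranges.
alnum = re.findall(r'[a-zA-Z0-9]', text)
print(alnum)   # ['a', '1', 'B', '2']

# \S matches any single non-whitespace character, punctuation included.
non_ws = re.findall(r'\S', text)
print(non_ws)  # ['<', 'a', '1', '>', ';', 'B', '_', '2']
```

The bracketed set skips “<”, “>”, “;”, and “_”, while “\S” accepts everything except the space.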
Here is our new regular expression:

    [a-zA-Z0-9]\S*@\S*[a-zA-Z]
This is getting a little complicated, and you can begin to see why regular expressions are their own little language. Translating this regular expression, we are looking for substrings that start with a single lowercase letter, uppercase letter, or number “[a-zA-Z0-9]”, followed by zero or more non-blank characters “\S*”, followed by an at-sign, followed by zero or more non-blank characters “\S*”, followed by an uppercase or lowercase letter “[a-zA-Z]”. Note that we switched from “+” to “*” to indicate zero or more non-blank characters, since “[a-zA-Z0-9]” is already one non-blank character. Remember that the “*” or “+” applies to the single character immediately to the left of the plus or asterisk.
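The reason for switching from “+” to “*” can be checked with a short sketch. The variant written with “+” and the two sample strings are hypothetical, introduced only to show the difference; the first pattern is the expression “[a-zA-Z0-9]\S*@\S*[a-zA-Z]” translated above.

```python
import re

star = r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]'  # the expression translated above
plus = r'[a-zA-Z0-9]\S+@\S+[a-zA-Z]'  # hypothetical variant using + instead of *

# Made-up shortest address-like string: one character on each side of the @.
short = 'a@b'
star_short = re.findall(star, short)
plus_short = re.findall(plus, short)
print(star_short)  # ['a@b']
print(plus_short)  # []

# Made-up line with punctuation around the address.
wrapped = 'for <email@example.com>;'
star_wrapped = re.findall(star, wrapped)
print(star_wrapped)  # ['email@example.com']
```

With “+”, each side of the at-sign would need at least two characters, so the shortest address-like string is missed; with “*”, the bracketed sets alone are enough.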
If we use this expression in our program, our data is much cleaner:
    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
        if len(x) > 0:
            print(x)

    ['email@example.com']
    ['firstname.lastname@example.org']
    ['email@example.com']
    ['200801032122.m03LMFo4005148@nakamura.uits.iupui.edu']
    ['firstname.lastname@example.org']
    ['email@example.com']
    ['firstname.lastname@example.org']
    ['apache@localhost']
Notice that on the “email@example.com” lines, our regular expression eliminated two characters at the end of the string (“>;”). This is because when we append “[a-zA-Z]” to the end of our regular expression, we are demanding that whatever string the regular expression parser finds must end with a letter. So when it sees the “>” after “sakaiproject.org>;” it simply stops at the last “matching” letter it found (i.e., the “g” was the last good match).
Also note that the output of the program is a Python list that has a string as the single element in the list.
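Because findall() always hands back a list, the strings themselves can be pulled out by looping over it. The sketch below assumes a made-up line containing two addresses, so the list has two elements rather than one; the name addresses is introduced only for the example.

```python
import re

# Hypothetical line with two addresses in it.
line = 'From email@example.com to firstname.lastname@example.org'
x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)

# x is a list; each element is a plain string.
addresses = []
for address in x:
    addresses.append(address)
    print(address)
```

Printing each element rather than the whole list drops the brackets and quotes from the output.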