Character matching in regular expressions

24 February, 2015 - 10:13

There are a number of other special characters that let us build even more powerful regular expressions. The most commonly used special character is the period, which matches any single character.

In the following example, the regular expression “F..m:” would match any of the strings “From:”, “Fxxm:”, “F12m:”, or “F!@m:” since the period characters in the regular expression match any character.

import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # "^" anchors the match at the start of the line; each "." matches any one character
    if re.search('^F..m:', line):
        print(line)
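The same pattern can be tried without the mbox-short.txt file. The following is a minimal sketch that checks the four strings from the text. The fifth string, “Frm:”, is a made-up counterexample added here, and the variable names are only illustrative.

```python
import re

# The four sample strings from the text, plus "Frm:", a made-up
# counterexample with only one character between "F" and "m".
candidates = ['From:', 'Fxxm:', 'F12m:', 'F!@m:', 'Frm:']

pattern = '^F..m:'
matched = [s for s in candidates if re.search(pattern, s)]
print(matched)  # ['From:', 'Fxxm:', 'F12m:', 'F!@m:']
```

“Frm:” is rejected because each period must consume exactly one character, so the pattern needs two characters between the “F” and the “m”.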

This is particularly powerful when combined with the ability to indicate that a character can be repeated any number of times using the “*” or “+” characters in your regular expression. These special characters apply to the character that precedes them: instead of matching it exactly once in the search string, they match zero or more repetitions in the case of the asterisk, or one or more repetitions in the case of the plus sign.
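The difference between the two repetition characters shows up when there is nothing to repeat. The sketch below applies both to the period wildcard. The sample strings, including “Fm:”, are hypothetical and chosen only to contrast “*” with “+”.

```python
import re

# Hypothetical sample strings; "Fm:" has no characters between "F" and "m".
samples = ['Fm:', 'From:', 'Fxxm:']

# ".*" allows zero or more characters between "F" and "m:".
star_matches = [s for s in samples if re.search('^F.*m:', s)]

# ".+" requires at least one character between "F" and "m:".
plus_matches = [s for s in samples if re.search('^F.+m:', s)]

print(star_matches)  # ['Fm:', 'From:', 'Fxxm:']
print(plus_matches)  # ['From:', 'Fxxm:']
```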

We can further narrow down the lines that we match using a repeated wild card character in the following example:

import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # ".+" matches one or more of any character, up to an at-sign
    if re.search('^From:.+@', line):
        print(line)

The search string “^From:.+@” will successfully match lines that start with “From:”, followed by one or more characters (“.+”), followed by an at-sign. So this will match the following line:

From: stephen.marquard@uct.ac.za

You can think of the “.+” wildcard as expanding to match all the characters between the colon character and the at-sign.

From:.+@

It is good to think of the plus and asterisk characters as “pushy”. For example, in the following string the match would extend all the way to the last at-sign, because the “.+” pushes outwards as far as it can:

From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen@iupui.edu
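To see exactly how far the “.+” pushes, the matched text can be pulled out of the match object with its group() method. This is a small sketch using the two sample lines from the text; the variable names are illustrative only.

```python
import re

pattern = '^From:.+@'
short_line = 'From: stephen.marquard@uct.ac.za'
long_line = 'From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen@iupui.edu'

# group() returns the part of the line that the pattern actually matched.
short_match = re.search(pattern, short_line).group()
long_match = re.search(pattern, long_line).group()

print(short_match)  # From: stephen.marquard@
print(long_match)   # From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen@
```

With a single at-sign the match stops there, but with several at-signs the match runs on to the last one.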

It is possible to tell an asterisk or plus sign not to be so “greedy” by adding another character after it; in Python's re module this is a question mark, as in “.+?”. See the detailed documentation for information on turning off the greedy behavior.
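As a minimal sketch of the non-greedy form, assuming Python's re module: a question mark placed directly after “+” or “*” makes the repetition match as few characters as possible. The sample line is the one used above, and the variable names are illustrative.

```python
import re

line = 'From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen@iupui.edu'

# Greedy: ".+" grabs as much as it can, so the match ends at the last "@".
greedy = re.search('^From:.+@', line).group()

# Non-greedy: ".+?" grabs as little as it can, so the match ends at the first "@".
non_greedy = re.search('^From:.+?@', line).group()

print(greedy)      # From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen@
print(non_greedy)  # From: stephen.marquard@
```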