
Parsing HTML using BeautifulSoup

24 February, 2015 - 11:00

There are a number of Python libraries which can help you parse HTML and extract data from the pages. Each of the libraries has its strengths and weaknesses and you can pick one based on your needs.

As an example, we will simply parse some HTML input and extract links using the BeautifulSoup library. You can download and install the BeautifulSoup code from:

www.crummy.com

You can download and “install” BeautifulSoup or you can simply place the BeautifulSoup.py file in the same folder as your application.

Even though HTML looks like XML and some pages are carefully constructed to be XML, most HTML is generally broken in ways that cause an XML parser to reject the entire page of HTML as improperly formed. BeautifulSoup tolerates highly flawed HTML and still lets you easily extract the data you need.
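To make the difference between strict and tolerant parsing concrete, here is a minimal sketch. It is written for Python 3 and uses only the standard library (xml.etree.ElementTree as the strict XML parser and html.parser as a stand-in for a tolerant HTML parser), so it does not use BeautifulSoup itself. The BROKEN_HTML string and the names xml_accepts and TagCollector are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: <p>, <b> and <br> are never closed,
# which is common in real pages but is not well-formed XML.
BROKEN_HTML = ('<html><body><p>Hello <b>world<br>'
               '<a href="page2.htm">Next</a></body></html>')


def xml_accepts(text):
    """Return True if the strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record every start tag seen, without checking that tags balance."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(BROKEN_HTML)
collector.close()

print('XML parser accepts it:', xml_accepts(BROKEN_HTML))
print('Tags seen by the tolerant parser:', collector.tags)
```

The XML parser rejects the whole string because of the unclosed tags, while the tolerant parser still reports every tag, including the anchor tag we care about.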

We will use urllib to read the page and then use BeautifulSoup to extract the href attributes from the anchor (a) tags.

import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print tag.get('href', None)

The program prompts for a web address, then opens the web page, reads the data and passes the data to the BeautifulSoup parser, and then retrieves all of the anchor tags and prints out the href attribute for each tag.

When the program runs it looks as follows:

python urllinks.py
Enter - http://www.dr-chuck.com/page1.htm
http://www.dr-chuck.com/page2.htm

python urllinks.py
Enter - http://www.py4inf.com/book.htm
http://www.greenteapress.com/thinkpython/thinkpython.html
http://allendowney.com/
http://www.si502.com/
http://www.lib.umich.edu/espresso-book-machine
http://www.py4inf.com/code
http://www.pythonlearn.com/
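If you want to try the same idea without installing anything or fetching a page, the following sketch extracts href attributes with the standard library's html.parser module in Python 3. It is not the BeautifulSoup approach shown above, only a rough equivalent under those assumptions. The SAMPLE string and the names LinkExtractor and extract_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor (a) tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; an anchor
            # without an href yields None, like tag.get('href', None)
            self.links.append(dict(attrs).get('href'))


def extract_links(html):
    """Return the href of each anchor tag in the HTML text, in order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, used here instead of reading a URL
SAMPLE = ('<h1>The First Page</h1><p>Go to the '
          '<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.'
          '<a name="top">no link here</a>')

for link in extract_links(SAMPLE):
    print(link)
```

The second anchor has no href attribute, so None is printed for it, just as tag.get('href', None) would return None in the BeautifulSoup version.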

You can use BeautifulSoup to pull out various parts of each tag as follows:

import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print 'TAG:',tag
    print 'URL:',tag.get('href', None)
    print 'Content:',tag.contents[0]
    print 'Attrs:',tag.attrs

This produces the following output:

python urllink2.py
Enter - http://www.dr-chuck.com/page1.htm
TAG: <a href="http://www.dr-chuck.com/page2.htm">Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Content: [u'\nSecond Page']
Attrs: [(u'href', u'http://www.dr-chuck.com/page2.htm')]
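For comparison, the same pieces of an anchor tag (its URL, its text content and its attributes) can be gathered with the standard library alone. This is only a sketch for Python 3 using html.parser, not BeautifulSoup, and it handles simple, non-nested anchors. The SAMPLE string and the names AnchorInspector and inspect_anchors are invented for the example.

```python
from html.parser import HTMLParser


class AnchorInspector(HTMLParser):
    """Record the URL, text content and attributes of each anchor tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []     # one dictionary per anchor tag
        self._current = None  # the anchor being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._current = {
                'url': dict(attrs).get('href'),
                'content': '',
                'attrs': attrs,
            }
            self.anchors.append(self._current)

    def handle_data(self, data):
        # Text only counts as content while we are inside an anchor
        if self._current is not None:
            self._current['content'] += data

    def handle_endtag(self, tag):
        if tag == 'a':
            self._current = None


def inspect_anchors(html):
    """Return a list of dictionaries describing each anchor tag."""
    parser = AnchorInspector()
    parser.feed(html)
    parser.close()
    return parser.anchors


# Hypothetical sample, used here instead of reading a URL
SAMPLE = '<p><a href="http://www.dr-chuck.com/page2.htm">\nSecond Page</a></p>'

for anchor in inspect_anchors(SAMPLE):
    print('URL:', anchor['url'])
    print('Content:', repr(anchor['content']))
    print('Attrs:', anchor['attrs'])
```

Doing this by hand takes noticeably more code than the BeautifulSoup version, which is a large part of why a dedicated library is convenient.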

These examples only begin to show the power of BeautifulSoup when it comes to parsing HTML. See the documentation and samples at www.crummy.com for more detail.