Reading binary files using urllib

24 February, 2015 - 11:02

Sometimes you want to retrieve a non-text (or binary) file such as an image or video file. The data in these files is generally not useful to print out, but you can easily copy the contents of a URL to a local file on your hard disk using urllib.

The pattern is to open the URL and use read to download the entire contents of the document into a string variable (img) and then write that information to a local file as follows:

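import urllib

# Read the whole image into memory as one string
img = urllib.urlopen('http://www.py4inf.com/cover.jpg').read()
# Open the output file in binary mode so the bytes are written unchanged
fhand = open('cover.jpg', 'wb')
fhand.write(img)
fhand.close()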

This program reads all of the data in at once across the network and stores it in the variable img in the main memory of your computer and then opens the file cover.jpg and writes the data out to your disk. This will work if the size of the file is less than the size of the memory of your computer.
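The program above is written for Python 2. As a minimal sketch of the same read-everything-then-write pattern in Python 3, assuming the standard urllib.request module, the steps might look like this. The function name save_url is hypothetical and chosen only for illustration; in Python 3, read() returns a bytes object rather than a string.

```python
import urllib.request


def save_url(url, filename):
    """Download url with a single read() and write the bytes to filename.

    Hypothetical helper for illustration; the whole response is held
    in memory, so this only suits files that fit comfortably in RAM.
    """
    # urlopen returns a file-like object; read() with no argument
    # pulls the entire response into memory as a bytes object.
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # 'wb' writes the bytes unchanged, with no text-mode translation.
    with open(filename, 'wb') as fhand:
        fhand.write(data)
    return len(data)
```

Calling save_url('http://www.py4inf.com/cover.jpg', 'cover.jpg') would correspond to the program above, provided that URL is still reachable.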

However, if this is a large audio or video file, this program may crash or at least run extremely slowly when your computer runs out of memory. To avoid running out of memory, we retrieve the data in blocks (or buffers) and write each block to your disk before retrieving the next block. This way the program can read a file of any size without using up all of the memory in your computer.

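import urllib

img = urllib.urlopen('http://www.py4inf.com/cover.jpg')
# Binary mode so the image data is written unchanged
fhand = open('cover.jpg', 'wb')
size = 0
while True:
    # Read at most 100,000 characters per pass
    info = img.read(100000)
    # An empty result means the end of the data
    if len(info) < 1 : break
    size = size + len(info)
    fhand.write(info)

print size,'characters copied.'
fhand.close()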

In this example, we read only 100,000 characters at a time and then write those characters to the cover.jpg file before retrieving the next 100,000 characters of data from the web.
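The read-a-block, write-a-block loop does not depend on the network; it works between any two file-like objects. The following is a sketch in Python 3 under that assumption, with a hypothetical helper name copy_in_blocks. It can be tried with in-memory buffers before pointing it at a real URL.

```python
def copy_in_blocks(source, dest, block_size=100000):
    """Copy source to dest one block at a time; return the bytes copied.

    source and dest are any objects with read(n) and write(data),
    such as an open URL and a file opened with mode 'wb'.
    """
    total = 0
    while True:
        # Only block_size bytes (at most) are held in memory at once.
        block = source.read(block_size)
        # read() returns an empty result once the data is exhausted.
        if not block:
            break
        dest.write(block)
        total += len(block)
    return total
```

Because only one block is in memory at a time, the memory used stays roughly constant no matter how large the source is.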

This program runs as follows:

python curl2.py
568248 characters copied.

If you have a Unix or Macintosh computer, you probably have a command built into your operating system that performs this operation as follows:

curl -O http://www.py4inf.com/cover.jpg

The command name curl can be read as “copy URL”, and so these two examples are cleverly named curl1.py and curl2.py on www.py4inf.com/code, as they implement similar functionality to the curl command. There is also a curl3.py sample program that does this task a little more effectively, in case you actually want to use this pattern in a program you are writing.
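The contents of curl3.py are not shown here, so the following is not that program. It is only one possible way to tighten the pattern in Python 3, assuming the standard shutil.copyfileobj function for the block loop and with statements so that both the connection and the file are closed even if an error occurs. The function name download is hypothetical.

```python
import shutil
import urllib.request


def download(url, filename, block_size=100000):
    """Stream url to filename in blocks of block_size bytes.

    Hypothetical helper for illustration, not the curl3.py program.
    """
    # Both objects are closed automatically when the block exits,
    # whether it finishes normally or raises an exception.
    with urllib.request.urlopen(url) as response, \
            open(filename, 'wb') as fhand:
        # copyfileobj repeats read(block_size)/write() until the
        # source is exhausted, like the manual while loop.
        shutil.copyfileobj(response, fhand, block_size)
```

A call such as download('http://www.py4inf.com/cover.jpg', 'cover.jpg') would then play the role of the curl -O command above.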