
Exercises


Exercise 16.1 In a large collection of MP3 files there may be more than one copy of the same song, stored in different directories or with different file names. The goal of this exercise is to search for these duplicates.

  1. Write a program that walks a directory and all of its sub-directories looking for files with a given suffix (like .mp3) and lists pairs of files that are the same size. Hint: Use a dictionary where the key is the size of the file from os.path.getsize and the value is the path name concatenated with the file name. As you encounter each file, check to see if you already have a file of the same size. If so, you have a duplicate-size file; print the file size and the two file names (one from the dictionary and the other from the file you are looking at).
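One way to sketch part 1 is shown below; the function name `find_same_size` and the way pairs are collected into a list are my own choices, not prescribed by the exercise:

```python
import os

def find_same_size(topdir, suffix='.mp3'):
    """Walk topdir and report pairs of files with the given suffix
    that have the same size (a hypothetical helper for this exercise)."""
    sizes = {}   # maps file size -> first path seen with that size
    pairs = []
    for dirpath, dirnames, filenames in os.walk(topdir):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size in sizes:
                # Duplicate size: report it along with the earlier file
                pairs.append((size, sizes[size], path))
            else:
                sizes[size] = path
    return pairs
```

Note that files of the same size are only *candidate* duplicates; part 2 tightens the test by comparing checksums of the contents.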
  2. Adapt the previous program to look for files that have duplicate content using a hashing or checksum algorithm. For example, MD5 (Message-Digest algorithm 5) takes an arbitrarily long “message” and returns a 128-bit “checksum.” The probability is very small that two files with different contents will return the same checksum.
You can read about MD5 at wikipedia.org/wiki/Md5. The following code snippet opens a file, reads it and computes its checksum.
import hashlib

fhand = open(thefile, 'rb')   # open in binary mode so md5 receives bytes
data = fhand.read()
fhand.close()
checksum = hashlib.md5(data).hexdigest()

You should create a dictionary where the checksum is the key and the file name is the value. When you compute a checksum that is already in the dictionary as a key, you have two files with duplicate content, so print out the file name stored in the dictionary and the file you just read. Here is some sample output from a run in a folder of image files:

./2004/11/15-11-04_0923001.jpg ./2004/11/15-11-04_1016001.jpg
./2005/06/28-06-05_1500001.jpg ./2005/06/28-06-05_1502001.jpg
./2006/08/11-08-06_205948_01.jpg ./2006/08/12-08-06_155318_02.jpg

Apparently I sometimes sent the same photo more than once or made a copy of a photo from time to time without deleting the original.
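Part 2 can be sketched by combining the directory walk from part 1 with the checksum snippet above; again, the function name `find_duplicate_content` and the returned list of pairs are illustrative choices, not part of the exercise statement:

```python
import hashlib
import os

def find_duplicate_content(topdir, suffix='.jpg'):
    """Walk topdir and report pairs of files with the given suffix
    whose contents have the same MD5 checksum (a hypothetical helper)."""
    seen = {}   # maps MD5 checksum -> first path seen with that checksum
    dups = []
    for dirpath, dirnames, filenames in os.walk(topdir):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as fhand:   # binary mode so md5 gets bytes
                checksum = hashlib.md5(fhand.read()).hexdigest()
            if checksum in seen:
                # Same checksum: almost certainly identical content
                dups.append((seen[checksum], path))
            else:
                seen[checksum] = path
    return dups
```

Reading each file whole is fine for photos; for very large files you could instead feed the hash in chunks with repeated calls to `update` on a `hashlib.md5()` object.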