PA4

Objective: learn how to implement the Huffman encoding algorithm.

Instructions

You will build on top of your heap implementation to create a Huffman code for text character encoding.

Make a new file named huffman.py. Copy in your heap implementation from the previous assignment. Then copy in the starter code provided here.

We will test your code by calling the function named huffman_letter_codes_from_file_contents(file_name). It takes the name of a file as input and it returns a Huffman code for the letters in the file, in the form a dict mapping from letters to a binary code as a string. When you are done, you should be able to use your program like this:

Note that the return from huffman_letter_codes_from_file_contents is a dict from string to string. Since Python does not have a type for single characters, we represent them as strings of length one. See how the Huffman code gave the letter "g" a long code (5 bits), and the letter "l" a short code (3 bits)?

Testing

Testing your Huffman codes is not trivial. Python dicts don't have a guaranteed order of iteration, but you can compare two dicts for equality. However, a Huffman code for a given letter frequency is not unique. There are many Huffman codes for a given letter frequency that are all equally optimal, in terms of the compression they give.

Prefix property

So, to determine if a Huffman code is correct, you should at least verify that it is a prefix code. A prefix code is a variable length encoding where no letter's code is a prefix of any other code. See the example above, the letter "l" was given code "101". No other code begins with "101", so it is not a prefix of any other code. This property is true for all the other codes too.

While you could verify this property by hand, it's more fun to see if files "round trip" through the encoding and decoding processes. We have given you code to do this, and you can see above how to call it. After encoding and decoding an example file, you should be able to diff the original and the "_encoded_decoded", and find no difference.

Optimality of Compression

Huffman codes are optimal prefix codes for per-symbol encoding. Here our symbols are just letters. While there are many possible Huffman codes for given letter frequencies, they will all result in a minimal length of the encoded file.

To help you verify that your Python program is correct, we will provide several example lengths of encoded files (that we got from our reference implementation). If your encodings are that same minimal length, then you can feel more confident that your algorithm for producing a Huffman code is correct.


Sample lengths (right click on the links and choose "save as"):
Submit your huffman.py code through Checkin. The Checkin tab on the course website performs preliminary testing of your code. These tests do not indicate your final grade. They can, however, catch mistakes in your submission. You can re-submit your file as often as you want.