You will build on top of your heap implementation to create a Huffman code for text character encoding.
Make a new file named
huffman.py. Copy in your heap implementation
from the previous assignment. Then copy in the starter code provided
We will test your code by calling the function named
It takes the name of a file as input and it returns a Huffman code for the letters in the file,
in the form a dict mapping from letters to a binary code as a string. When you are
done, you should be able to use your program like this:
Note that the return from
is a dict from string to string. Since Python does not have a type for single characters,
we represent them as strings of length one. See how the Huffman code gave the letter
"g" a long code (5 bits), and the letter "l" a short code (3 bits)?
Testing your Huffman codes is not trivial. Python dicts don't have a guaranteed order of iteration, but you can compare two dicts for equality. However, a Huffman code for a given letter frequency is not unique. There are many Huffman codes for a given letter frequency that are all equally optimal, in terms of the compression they give.
So, to determine if a Huffman code is correct, you should at least verify that it is a prefix code. A prefix code is a variable length encoding where no letter's code is a prefix of any other code. See the example above, the letter "l" was given code "101". No other code begins with "101", so it is not a prefix of any other code. This property is true for all the other codes too.
While you could verify this property by hand, it's more fun to see if files "round trip" through the encoding and decoding processes. We have given you code to do this, and you can see above how to call it. After encoding and decoding an example file, you should be able to diff the original and the "_encoded_decoded", and find no difference.
Huffman codes are optimal prefix codes for per-symbol encoding. Here our symbols are just letters. While there are many possible Huffman codes for given letter frequencies, they will all result in a minimal length of the encoded file.
To help you verify that your Python program is correct, we will provide several example lengths of encoded files (that we got from our reference implementation). If your encodings are that same minimal length, then you can feel more confident that your algorithm for producing a Huffman code is correct.