This document will guide you through the assignment from the high-level concept down, rather than from each class function up as was done in PA 1. You will implement file compression and
decompression using Huffman's algorithm, efficient data structures, and bitwise input/output
operations. Your program must manage space effectively, achieving results comparable to those
of our reference solution. Part 6 of this writeup contains design notes that you must follow.
As always, start early.
Provided Starter Files: HCNode.hpp, HCTree.hpp, Makefile, refcompress, refuncompress,
Checkpoint Submission [NO SLIP DAY]
Due: Tuesday, August 9th, 2016, 11:59PM
Required Submission Files: (at least) compress.cpp, uncompress.cpp, HCNode.hpp,
HCNode.cpp, HCTree.hpp, HCTree.cpp, Checkpoint.pdf
Final Submission [1 Slip Day Allowed]
Due: Saturday, August 13th, 2016, 11:59PM
Required Submission Files: BitInputStream.hpp, BitInputStream.cpp, BitOutputStream.hpp,
BitOutputStream.cpp, compress.cpp, uncompress.cpp, HCNode.hpp, HCNode.cpp, HCTree.hpp, HCTree.cpp
Table of Contents:
Part 1: Getting Started
Part 2: File Compression and Decompression
Part 3: Checkpoint Submission
Part 4: Final Submission
Part 5: Control Flow
Part 6: Design
Part 7: Testing
Part 8: Submitting on Vocareum for Checkpoint and Final
Part 9: Grading Guidelines
Part 1: Getting Started
The starter files are located in /home/linux/ieng6/cs100v/public/pa2
Copy these files into your own folder in your account. The test input files are in a folder inside the
pa2 folder; make sure those get copied over as well.
If you are unable to use C++ 4.8.3 (a.k.a. std=c++11), you will need to use the prep command
Part 2: File Compression and Decompression
The assignment essentially consists of writing two C++ programs: compress and uncompress.
compress is executed as follows:
> ./compress infile outfile
This program will read the contents of the file named by the first command line argument
(infile), construct a Huffman code for the contents of that file, and use that code to construct a
compressed version of the input file, saved under the name given by the second command line
argument (outfile).
The second program, uncompress, is executed by the following:
> ./uncompress infile outfile
This program will read the contents of the file named by the first command line argument
(infile). This file should have been constructed by the compress program as described previously.
Its contents will be used to reconstruct the uncompressed, unchanged original file,
which will be saved to the file named by the second command line argument (outfile).
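As a small illustration of the byte-level file handling both programs require, here is a sketch (countBytes is a hypothetical helper, not part of the starter code) of the frequency counting that compress must perform before it can build a Huffman code:

```cpp
#include <fstream>
#include <vector>

// Hypothetical helper (not in the starter code): count how many times each
// byte value occurs in a file, as compress must do before building the tree.
std::vector<int> countBytes(const char* filename) {
    std::vector<int> counts(256, 0);
    std::ifstream in(filename, std::ios::binary);  // binary mode: works for non-text files too
    int c;
    while ((c = in.get()) != EOF)  // get() returns 0..255, or EOF at end of file
        ++counts[c];
    return counts;
}
```

Opening the stream in binary mode matters for the final submission, where input files may be images or other binary data rather than ASCII text.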
To be clear, for any file, compressing and then uncompressing it must reproduce the original exactly, with
absolutely no differences. A quick way to check whether two files are identical is to use the diff
command. Running diff on the two files will print nothing if they are
identical, but will print out where they differ if there are any differences between the two files.
This can be used to ensure that your ./uncompress is working properly.
Part 3: Checkpoint Submission
The purpose of the checkpoint is to focus on implementing the Huffman algorithm correctly.
The “compressed” file will actually not be smaller than the original file because this portion of
the assignment will have you compress using ASCII I/O rather than bit I/O.
Your checkpoint submission’s compress and uncompress should work as follows:
■ “./compress infile outfile” should “compress” the infile by building the
appropriate Huffman code tree based on the byte frequencies in infile (which is guaranteed
to contain only ASCII text and to be very small, under 1KB). Then, write out the tree and the coded infile as plain (ASCII) text to outfile. The outfile should be a readable text file that contains your representation of the tree, as well as the code that represents the original file. So, the 1's and 0's of your code will be written to the "compressed" file using the ASCII characters for 1 and 0.
■ “./uncompress infile outfile” should "uncompress" the infile by reading in the file header, reconstructing the tree, and then "uncompressing" (i.e. decoding) the rest of the file and writing it back to the outfile as plain (ASCII) text.

Checkpoint.pdf
You need to verify your compression. Run checkpoint1.txt and checkpoint2.txt through your compress and uncompress programs, and save the output as .txt files. Document your verification process in the writeup by doing each of the following steps for each input file (checkpoint1.txt and checkpoint2.txt):
1. Describe how you built the tree.
2. Describe how you find the codeword for each byte.
3. Draw the Huffman tree by hand and include the result in the writeup.
4. Explain why your hand-coded text is different from your compressor's output.
5. Explain how you fixed it.
6. If you had no differences, ignore Steps 4 and 5 and state that you had no differences.
7. Save your writeup in .pdf format. No credit will be given for any other format. There is a maximum of 5 pages (the PDF can be produced with various tools such as LaTeX or Word's PDF export).

You might find the suggestions in HCTree.hpp helpful. Be aware that those suggestions are only good for the checkpoint submission and will not work well for the final submission.
Since you are not required to submit BitInputStream and BitOutputStream for the checkpoint, you will need to solve the compilation issues that result from this. You can either (1) comment out all references to them in the provided files or (2) create "dummy" versions of these .hpp and .cpp files to get the code to compile.
You will also need to edit the Makefile (commenting out the relevant lines would be the safest approach). When you work on the final submission, restore the references to the Bit Stream classes.

Part 4: Final Submission
This is the part where your program uses bitwise input/output to produce genuinely compressed files. The functionality is the same as the checkpoint functionality, with the following exceptions. Your program:
1. Must be able to handle files much larger than 1KB, up to 10MB. This means a particular byte value may occur up to 10 million times in one file.
2. Must be able to handle file formats other than text files, such as binary files, images, videos, and more.
3. Must create compressed files that are at least 1 byte smaller than both the input file AND the file produced by our reference solution.
4. Must use bitwise input/output to read and write the Huffman codes.
5. Should finish in about 15 seconds; the hard requirement is that it must finish in under 180 seconds.

Part 5: Control Flow
Here are the high-level steps your programs need to go through.

Compress:
1. Open the input file for reading.
2. Read the file byte by byte. Count the number of occurrences of each byte value. Close the file.
3. Use the byte counts to construct a Huffman coding tree. Each unique byte with a non-zero count will be a leaf node in the Huffman tree.
4. Open the output file for writing.
5. Write enough information (a "file header") to the output file to enable the coding tree to be reconstructed when the file is read by your uncompress program.
6. Open the input file for reading again.
7. Using the Huffman coding tree, translate each byte from the input file into its respective Huffman code. Append these codes as a sequence of bits to the output file, after the header. This is done in ASCII for the checkpoint and with bitwise I/O for the final submission.
8. Close the input and output files.

Uncompress:
1. Open the input file for reading.
2. Read the file header at the beginning of the input file. Reconstruct the same Huffman coding tree that compress created.
3. Open the output file for writing.
4. Using the reconstructed Huffman tree, decode the bits from the input file into the appropriate sequence of bytes, saving them into the output file.
5. Close the input and output files.

Reference solution: refcompress/refuncompress
The programs refcompress and refuncompress are "reference" implementations of this assignment that you will use to verify your results. The reference programs are likely to work on ieng6 only. Your compression must beat the reference compression by at least one byte for the final submission. Your compress program is not expected to work with the reference uncompress, and your uncompress is not expected to work with the reference compress.

Part 6: Design
(1) Implementing the Tree and its Nodes. One crucial data structure you will need is a binary trie (i.e. code tree, or encoding tree) that represents a Huffman code. The HCTree.hpp header file provides a possible interface for this structure (included in your repo as skeleton code). You can modify this in any way you want. In addition, you will write a companion HCTree.cpp implementation file that implements the interface specified in HCTree.hpp, and then use it in your compress and uncompress programs. Note that you will also need the files HCNode.hpp (provided; you may modify it) and HCNode.cpp in your submission (both at the checkpoint and at the final).

(2) Object-Oriented Design. In implementing Huffman's algorithm, you will find it convenient to use multiple data structures. For example, a priority queue will assist in building the Huffman tree. Feel free to utilize other beneficial data structures. However, you should use good object-oriented design in your solution. For example, since a Huffman code tree will be used by both your compress and uncompress programs, it makes sense to encapsulate its functionality inside a single class accessible by both programs.
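As mentioned above, a priority queue makes the tree-building step straightforward. The following is only a sketch under simplified assumptions (a bare Node struct stands in for HCNode, and the bookkeeping of which byte each leaf represents is omitted), not the required implementation:

```cpp
#include <queue>
#include <vector>

// Sketch only: a bare Node stands in for HCNode, and symbol bookkeeping
// (which byte each leaf represents) is omitted for brevity.
struct Node {
    int count;
    Node *left, *right;
    Node(int c, Node* l = nullptr, Node* r = nullptr)
        : count(c), left(l), right(r) {}
};

// Comparator for a min-heap: the node with the smallest count comes out first.
struct NodeCmp {
    bool operator()(const Node* a, const Node* b) const {
        return a->count > b->count;
    }
};

Node* buildTree(const std::vector<int>& counts) {
    std::priority_queue<Node*, std::vector<Node*>, NodeCmp> pq;
    for (int c : counts)
        if (c > 0) pq.push(new Node(c));  // one leaf per byte with a non-zero count
    if (pq.empty()) return nullptr;       // empty input file: no tree at all
    while (pq.size() > 1) {               // repeatedly merge the two smallest
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node(a->count + b->count, a, b));
    }
    return pq.top();                      // the single remaining node is the root
}
```

Note the corner cases this exposes: an empty file produces no tree at all, and a file containing only one distinct byte produces a root that is itself a leaf, which your encode/decode logic must handle deliberately.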
With a good design, the main methods in the compress and uncompress programs will be quite simple; they will create objects of other classes and call their methods to do the necessary work. Thinking about object-oriented design will help you with the following two PAs.

(3) The header and its use. Under the section on "Control Flow" you saw references to a header that should be stored by the compress program and later retrieved by the uncompress program. Both the compress and uncompress programs need to construct the Huffman tree before they can successfully encode and decode information, respectively (and in the same way). The compress program has access to the original file, so it can build the tree by first determining the symbol counts. However, the uncompress program has access only to the compressed input file, not the original file, so it has to use some other information to build the tree. The information needed by the uncompress program to build the Huffman tree is stored in the header of the compressed file, so the header information must be sufficient to reconstruct the tree.

(4) Designing the header: a straightforward, non-optimized method that you MUST USE for the CHECKPOINT. Probably the easiest approach is to save the frequency counts of the bytes in the original uncompressed file as a sequence of 256 integers at the beginning of the compressed file. For your checkpoint you MUST write this frequency-based header as 256 lines, where each line contains a single int written as plain text. E.g.:
0
0
0
23
145
0
0
...
where 0's represent byte values with no occurrences in the file (e.g. above, the ASCII values 0, 1, and 2 do not occur in the file), and any non-zero number represents the number of times that ASCII value occurs in the file.

(5) Efficient header design (for the FINAL SUBMISSION). The method described above is extremely inefficient.
Even if we improve this approach to store the frequency counts as 4-byte ints, storing the frequency counts in your header is not very efficient in terms of space: it uses 1024 bytes of disk for the header if you write your header as binary ints (not ASCII text), no matter what the statistics of the input file are. This is the approach we use in our reference solution. For your final submission, you must BEAT our reference solution by coming up with a more efficient way to represent this header.

One way to do this is to keep the frequency approach, but to think about how many bytes you really need to store each integer. Answering this question relates to the maximum size of the files that you are required to encode. To help you with your header design, please note the following: your compress program must work for input files up to 10 megabytes in size, so a particular byte value may occur up to 10 million times in the file. This fact should help you determine the minimum number of bytes required to represent each frequency count in your header.

Alternative approaches may use arrays to represent the structure of the tree itself in the header. Other schemes also work. With some cleverness, it is possible to optimize the header size to about 10*M bits, where M is the number of distinct byte values that actually appear in the input file. However, we strongly encourage you to implement the naive (256-count) approach first, and not to attempt to reduce the size of the header until you've gotten your compress and uncompress to work correctly for the provided inputs.

(6) Note that there are some differences between the requirements of this assignment and the description of Huffman encoding in the textbook; for example, your program is required to work on a file containing any binary data, not just ASCII characters.
It is important to keep in mind that the number of bits in the Huffman-compressed version of a file may not be a multiple of 8, but the compressed file will always contain an integral number of bytes. You need a way to make sure that any "stray" bits at the end of the last byte in the compressed file are not interpreted as real code bits. One way to do this is to store, in the compressed file, the length in bytes of the uncompressed file, and use that to terminate decoding. (Note that the array of byte counts implicitly contains that information already, since the sum of all the counts is equal to the length in bytes of the input file.)

(7) Please don't try out your compressor on large files (say, 10MB or larger) until you have it working on the smaller test files (under 1MB). Even with compression, larger files (10MB or more) can take a long time to write to disk unless your I/O is implemented efficiently. The rule of thumb here is that most of your testing should be done on files that take 15 seconds or less to compress, and never more than about 1 minute. If everyone writes large files to disk at the same time, you'll all experience even longer write times. Try large files only when the system is quiet, and after you've worked your way through a series of increasingly large files, so that you are confident the write will complete in about a minute.

(8) Bitwise I/O. If you encode your files using ASCII representations of 0 and 1, you don't get any compression at all, because you are using a full byte to store each '0' or '1'. Once you've got your Huffman tree working, you'll modify your code so that you can write data to a file one bit "at a time". All disk I/O operations (and all memory read and write operations, for that matter) deal with a byte as the smallest unit of storage. But in this assignment (as you saw in the checkpoint), it would be convenient to have an API to the filesystem that permits writing and reading one bit at a time.
Define classes BitInputStream and BitOutputStream (with separate interface header and implementation files) to provide that interface. To implement bitwise file I/O, you'll want to make use of the existing C++ iostream library classes ifstream and ofstream, which already "know how to" read and write files. However, these classes do not support bit-level reading or writing, so use inheritance or composition to add the desired bitwise functionality. Refer to the lecture notes for more information.

(9) Separating Interface from Implementation. One difference from the previous assignment (BST) is the introduction of the .hpp file. Unlike other languages, such as Java, that allow interfaces to be created directly, C++ does not; therefore, the closest option a programmer has for decoupling the interface from the actual implementation is the use of header files. Header files contain just the prototypes for the functions and the class; usually no actual function definitions appear in header files (although there are some in the files given to you). The actual implementation is placed in the .cpp files. In general, this is the convention used by many C++ programmers; C++ templates are sometimes an exception, for reasons that have to do with compiler limitations.

Part 7: Testing
Test components of your solution as you develop them, and test your overall running programs as thoroughly as you can to make sure they meet the specifications (we will certainly test them as thoroughly as we can when grading). Getting the checkpoint submission working is a great middle step along the way to a fully working program. In addition, because the checkpoint writes in plain text, you can actually check the codes produced for small files by hand!

Remember that large programs are hard to debug, so test each function that you write before writing more code. We will only be doing "black box" testing of your program, so you will not receive partial credit for each function that you write.
However, to get correct end-to-end behavior you must unit test your program extensively. For all deadlines (checkpoint and final), be sure to test "corner cases": an empty file, files that contain only one character repeated many times, etc. "Working" means, at minimum, that running your compress on a file, and then running your uncompress on the result, must reproduce the original file. There are some files provided in the input_files/ directory in your repo that may be useful as test cases, but you will want to test on more than these. For some useful testing tools, refer to this doc.

Part 8: Submitting on Vocareum for Checkpoint and Final
Checkpoint Submission [NO SLIP DAY]
Due: Tuesday, August 9th, 2016, 11:59PM
Required Submission Files: (at least) compress.cpp, uncompress.cpp, HCNode.hpp, HCNode.cpp, HCTree.hpp, HCTree.cpp, Checkpoint.pdf
Final Submission [1 Slip Day Allowed]
Due: Saturday, August 13th, 2016, 11:59PM
Required Submission Files: BitInputStream.hpp, BitInputStream.cpp, BitOutputStream.hpp, BitOutputStream.cpp, compress.cpp, uncompress.cpp, HCNode.hpp, HCNode.cpp, HCTree.hpp, HCTree.cpp

Choose the files you want to submit. A script will run. Make sure that your submission at least:
1. Has all the correct file names
2. Compiles
3. Runs without issues with the tester
Fix any errors that arise from Steps 1-3. Only one of the programmers in a partnership needs to submit, but both should verify that the submission works for both partners. You can submit any number of times before the deadline; only your last submission will be graded. If you submit within 24 hours after the deadline, you will be charged a slip day. Slip days cannot stack, so no submissions will be accepted more than 24 hours after the deadline. Try to save your slip days for the remaining two PAs…

Part 9: Grading Guidelines
This PA is worth 30 points.
The checkpoint is worth 5 points:
● 3 points for working code
● 2 points for the writeup
The final submission is worth 25 points:
1. 20 points for compress and uncompress, using Huffman coding to correctly compress and uncompress files of all sizes. Your program will be tested on 10 files. For each file, one point is given if your compressed file is within 25% of the compression obtained by the reference implementation, and one point for full reconstruction with uncompress.
2. 3 points for a correct implementation of compress that creates a smaller compressed file than the reference implementation when run on three files. One of the files is warandpeace.txt. We won't tell you what the other two files are. 🙂
3. 2 points for good object-oriented design, style, and informative, well-written comments.
Memory leaks will result in deductions. Each test case that results in a segfault will receive a 0. Compile and link errors will cause the entire assignment to receive a 0. Valgrind and GDB are highly recommended debugging tools.