Web search engines use a variety of information in determining the most relevant documents to a query. One
important factor (especially in early search engines) is the frequency of occurrences of the query words in a
document. In general, one can try to answer a question how similar or dissimilar two documents are based on the
similarity of their word frequency counts (relative to the document size). A necessary step in answering these types
of questions is to compute the word frequency for all words in a document.
In this assignment, you will write a method wordCount(String input_file, String output_file) that reads a file
(document) and prints out (into another file) all the words encountered in the document along with their number of
occurrences in the document. Please use output format such as “(father 30) (fishing 12) (aspirin 45) …”. For
simplicity, assume any derivative words to be distinct, e.g., “book” and “books”, “eat” and “eating” are all
considered distinct. Assume that words are defined to be simply strings of characters between two delimiting
characters, which include a space and punctuation characters. Assuming that something like “Father’s” is two
words (“Father” and “s”, because they are separated by delimiters) is OK for our purposes. You can use Java class
StringTokenizer (which is sometimes viewed as deprecated but it’s not, it’s considered “legacy” class) or
String.split() to extract words from an input string to save yourself some programming. Do not distinguish words
that only differ in upper or lower case of their characters, e.g., “Father” and “father” is one word. You can use
appropriate methods of the String class handle this easily (e.g., using toLowerCase method).
In implementing wordCount, please implement (yourself, don’t use java’s hash-related classes) a hash table with
separate chaining to keep the current counts for words you have already encountered while you are scanning the
input file. Your general procedure would include the following steps:
1. Scan in the next word
2. Search for this word in the hash table
3. If not found, insert the new entry with this word and the initial count of 1. Otherwise increment the count.
4. If you inserted a new word, check if the hash table needs to be expanded.
After you process the entire file, loop through the entire hash table and print out, sequentially in any order you like,
the list of words and their counts. Also, report the average length of the collision lists in the final state of your hash
table (across all hash slots, so empty slots also contribute).
You also need the main method that accepts the names of the two files above and passes them to the wordCount
method. Please run your program on the same input file you used for Programming Assignment 2 (still truncated to
50Kbyte size). If you skipped that assignment, please refer to it for instructions on how to obtain a realistic input
1. In implementing your hash table, you can use Java’s hashCode function on strings, so that your hash
function will be h = Math.abs(word.hashCode()) % tableSize. But you obviously cannot use built-in hash
tables like HashMap in Java. Note that we take the absolute value of the hash because hashCode returns an
int, which can be negative.
2. Please use separate chaining to resolve collisions in your hash table. Using separate chaining, you do not
need to have tableSize to be prime number. Any number will work as long as it is not a multiple of 31 (see
lecture for the reason why). For example, starting with tableSize as a power of 2 and then doubling if you
need to expand will ensure you do not have a multiple of 31.
1. Source code including comments necessary to understand it;
2. Input file;
3. Output result: word counts and average length of the collision lists.
4. A “toy” test file and output produced on the toy file (see below).
• Implementation of the hash table class: 50 pts, including:
o Correct hashing (with proper comments): 30
o Resizing/rehashing as needed (with proper explanation in comments): 20
• Application program utilizing the hash table, along with a “toy” test file and output produced on a toy file
(Important: DO THIS FIRST, before working with a real file!): 25 pts
• Programming style: 10 pts
• Producing output on a real file: 15 pts.