Description
Goal: This lab will teach you about hash function.
Part I
——————
Write a lab9.cpp file. Inside this file, implement a hash function called hashcode(). The
head of this function should be:
long hashcode(char* s)
This function takes a c-type string as parameter, and calculates the hash code of this
string. The mathematic definition of this hash function follows the hashcode() definition
in JAVA. That is:
where s[i] denotes the ith character of the string, and n is the length of s.
Compression functions
———————
A simple compression function can look like this
h(i) = |i| mod N.
In fact, it just shuffles the buckets to different indices. A better compression function is
h(i) = ((ai + b) mod p) mod N,
where p is a large prime that’s substantially bigger than N. (You can replace the
parentheses with absolute values if you like; it doesn’t matter much.)
For this lab, we use h(s) mod 10007. The bottom line is whether you have too many
collisions or not in Part II. If so, you’ll need to improve your hash code or compression
function or both.
Part II
——————-
Download the provided files on Canvas to your working directory. When you run the
code, provide one file name as the parameter at the command line. The program should
read the contents of the file, calculate the h(s) mod 10007 value, checks whether some
other string generates the same value. If there, increase the collision counter by one.
After processing all the inputs, the program prints out the total number of strings and
the number of collisions. For example, we list the following command line and the
results of this command.
jli$ ./a.out f1
Total Input is 10000
Collision # is 4744
A tutorial on collision probability
———————————–
People are always surprised when they find out how many collisions occur in a working
hash table. You might have the misimpression that there won’t be many collisions at all
until the table is nearly full. Let’s analyze how many collisions you should expect to see
if your hash code and compression function are good. Here, we define a “collision” to be
the event where a newly inserted key has to share its bucket with one or more
previously inserted keys. (We count that as only one collision, regardless of how many
keys are already in the bucket.)
If you have N buckets and a good (pseudorandom) hash function, the probability of any
two keys colliding is 1/N. So when you have i keys in the table and insert key i + 1, the
probability that the new key does NOT collide with any old key is (1 – 1/N)i
. If you insert
n distinct items, the expected number that WON’T collide with any previous item is
(1 − 1
�)&
‘()
&*+
= � − �(1 − 1
�)’
so the expected number of collisions is
� − � + �(1 − 1
�)’
Now, for any n and N you test, you can just plug them into this formula and see if the
number of collisions you’re getting is in the ballpark of what you should expect to get.
For example, if you have N = 100 buckets and n = 100 keys, expect about 36.6 collisions.
Wrap up
————————
You should write the code inside lab9.cpp file. Jar you C++ files and the downloaded data
files into lab9.jar. Submit the completed file to Canvas.