CS 490-B8/680-A8 HOMEWORK #3 Spell checking using Kernighan’s algorithm

Purpose
Implement Kernighan’s spelling algorithm using a real-world dataset.
Use dictionaries, list comprehensions, sorting, and output formatting in a real-world-size project.
Learn useful pieces of Python infrastructure, including the use of a singleton class, the pickle module, and the time module.
Part I: Preparing the data
1. Clean up the data by replacing every character that is not an ASCII letter with a space. Then convert to lowercase and split the text into words on spaces.
That will split contractions into two words, e.g., “doesn” + “t”. This is a common approach in NLP software, e.g., the Stanford parser.
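For instance, a minimal sketch of this step (the filename here is a placeholder, not part of the assignment):

import re

# Replace every non-ASCII-letter with a space, then lowercase and split.
with open("corpus.txt") as f:          # placeholder filename
    text = f.read()
text = re.sub(r"[^A-Za-z]", " ", text)  # "doesn't" -> "doesn t"
words = text.lower().split()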
2. Create dictionaries of character unigrams (single characters) and character bigrams (pairs of characters) with their frequencies, including beginning-of-word and end-of-word as shown in the sample output.
The easiest way to do this is to use a regular expression (check out the ‘\w’ pattern) to surround each token with angle brackets, e.g., ‘&lt;word&gt;’. That produces bigrams like this: [‘&lt;w’, ‘wo’, ‘or’, ‘rd’, ‘d&gt;’], where ‘d&gt;’ means d at end of word.
Print the dicts as shown in the sample output.
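A sketch of the counting, assuming words is the token list from step 1 and angle brackets mark the word boundaries (adapt the markers to match the sample output):

from collections import Counter

unigrams, bigrams = Counter(), Counter()
for word in words:
    marked = "<" + word + ">"          # boundary markers count as characters
    unigrams.update(marked)
    bigrams.update(marked[i:i + 2] for i in range(len(marked) - 1))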
3. Build a dictionary of words in the database with their frequencies.
As a check on the file processing, print the following items: the number of items in the dictionary, the frequency of the words ‘a’ and ‘the’, and the most frequently used word and its frequency.
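A Counter covers all four checks in a few lines; a sketch, again assuming words from step 1:

from collections import Counter

word_freq = Counter(words)                # word -> frequency
print(len(word_freq))                     # number of items in the dictionary
print(word_freq["a"], word_freq["the"])   # frequencies of 'a' and 'the'
top_word, top_count = word_freq.most_common(1)[0]
print(top_word, top_count)                # most frequent word and its count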
4. Since these are time-consuming operations, measure the elapsed time and CPU time to process the file. Use time.perf_counter() to measure elapsed time and time.process_time() to measure CPU time. Call each function before and after reading the file and take the difference.
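The before/after pattern looks like this (a sketch; fill in your own processing):

import time

t0, c0 = time.perf_counter(), time.process_time()
# ... read and process the file here ...
elapsed = time.perf_counter() - t0        # wall-clock seconds
cpu = time.process_time() - c0            # CPU seconds
print(f"elapsed: {elapsed:.3f} s   CPU: {cpu:.3f} s")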
5. Save all the data structures in an object of a singleton class. Pickle the object so that the actual spelling correction program doesn’t need to repeat this preparatory work.
import pickle
with open("pickled_data2.dat", "wb") as my_file:
    pickle.dump(my_data, my_file)
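One common way to get a singleton in Python is to override __new__, and the Part II program reloads the object with the matching pickle.load call. A sketch; the class name and fields are placeholders, not a required layout:

import pickle

class SpellData:
    # One shared instance holding the prepared tables.
    _instance = None
    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

with open("pickled_data2.dat", "rb") as my_file:   # reload in Part II
    my_data = pickle.load(my_file)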

Part II: Using the data for spelling correction
1. Assume the misspelled words are provided as command-line parameters. Quit with a message if none are provided.
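A sketch of the command-line check (the script name is a placeholder):

import sys

if len(sys.argv) < 2:
    sys.exit("usage: python3 spell.py word1 [word2 ...]")
misspelled_words = sys.argv[1:]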
2. For each word, create a list of possible corrections as suggested by Kernighan et al. The possible corrections include:
a) Add any letter at any position (including at the beginning or end of a word).
b) Delete the letter at any position.
c) Substitute the letter at any position with any other letter.
d) Transpose any two consecutive letters.
Consider these corrections independently, i.e., we will only evaluate the effect of one error on a word at a time.
Prune the list by looking up each possible correction in the dictionary. Drop any possible correction that is not in the dictionary, i.e., we will ignore any string which is not a word or is so rare that it does not appear in the corpus at all.
For each of the four categories, print the number of possible corrections and the number that are left after pruning.
A link to the original paper can be found on the course home page:
Kernighan, M., Church, K., and Gale, W. (1990). A spelling correction program based on a noisy channel model. In COLING ’90 (Helsinki), v. 2, pp. 205–210.
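One common way to generate the four candidate sets is the split-based idiom sketched below (illustrative names only; word_freq is the frequency dictionary from Part I). Note that it uses list comprehensions, as the submission rules require:

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edit_candidates(word):
    # All strings one edit away from 'word', grouped by edit type.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    deletes = [a + b[1:] for a, b in splits if b]
    substitutions = [a + c + b[1:] for a, b in splits if b
                     for c in LETTERS if c != b[0]]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    return inserts, deletes, substitutions, transposes

# Pruning keeps only strings that appear in the corpus, e.g.:
# kept = [w for w in inserts if w in word_freq]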
3. For each of the misspelled words, use Kernighan’s approach to find the probability of each possible correction.
Define a correction class containing the fields you will need to print using the following layout:
Candidate    Correct   Error                                            10^9 *
correction   letter    letter   x|w      P(x|word)     P(word)      P(x|w)P(w)
across       o         e        e|o      .000009300    .000299000      2.80000
actress      t         –        c|ct     .000117000    .000023100      2.70000
acres        –         s        es|e     .000032100    .000031800      1.00000
acres        –         s        ss|s     .000034200    .000031800      1.00000
access       c         r        r|c      .000000209    .000091600      0.01900
caress       ca        ac       ac|ca    .000001640    .000001700      0.00280
cress        –         a        a|#      .000001440    .000000544      0.00078

Create an object for each possible correction containing the data in the table. Save the items in a list, then sort them so that the chart prints in descending order of probability.
To do this you will need the confusion matrices from Kernighan’s paper. To save you some trouble, I have scanned these and converted them to Python. They are available on turing in the course directory at /dhw/hw3-spell/tables.py for you to copy into your program. Write a function to convert the format given to a dict of dicts, then call it for each table. You can do this in a few lines of code using a two-level dict comprehension.
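A hypothetical sketch of such a conversion, assuming each scanned table arrives as (row_letter, counts) pairs plus a column ordering; the actual layout in tables.py may differ, so adjust the unpacking:

def table_to_dict(rows, columns):
    # Two-level dict comprehension: outer keys are row letters,
    # inner keys are column letters.
    return {r: {c: n for c, n in zip(columns, counts)}
            for r, counts in rows}

Once each correction object carries its 10^9 * P(x|w)P(w) value, the descending sort is a one-liner, e.g., corrections.sort(key=lambda c: c.score, reverse=True), where score is whatever field name you chose.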
Test data:
The course notes contain some possible corrections for one misspelled word. Kernighan’s paper contains some possible corrections for a number of words. (These lists may not be complete.)
You can obtain others from the spell checker of your choice, although your program may produce different possible corrections.
I have posted my unigram and bigram counts. I will post additional output after two or three people have completed the assignment and obtained the same results.

Submission:
Same submission rules as before. Don’t forget to follow the coding and naming conventions. Remember that your program file(s) must be zipped.
All tables must have aligned columns. Integers must be right-aligned. Real numbers must be decimal-aligned (a short format-spec sketch follows this list).
Every number you print must have a rubric, e.g., as a row or column heading.
Use list comprehensions wherever possible.
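The alignment rules map directly onto Python format specs; an illustration only, with arbitrary widths and sample values:

count, prob = 1534, 0.000117
print(f"{count:>10d}")    # right-aligns the integer in 10 columns
print(f"{prob:12.9f}")    # fixed precision keeps decimal points in one column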
Grading:
10 points – unigram table
10 points – bigram table
10 points – frequency dictionary
10 points – timing information
10 points – creation of pickle file
20 points – creation and pruning of possible corrections
20 points – creation of data for each possible correction
20 points – sorting and printing possible corrections
There will be a penalty for violating the submission rules.