## Description

Problem 1 Defining Metrics (3 points):

A. (1/2 point) Define a feature set for a Facebook account so that you can represent any account

as a vector of a fixed number of features that can be encoded as real numbers. What are the

features? What values can they take? What is this feature set useful for capturing about facebook

account holders?

B. (1 point) Define a metric on the vector space from part A. Prove it is a metric.

C. (1/2 point) Assume each point in your space represents a person encoded as 3-element vector:

Give an expected range of values for each of these elements. The goal is to cluster people into 4

groups “child, teen, middle-aged, elderly”. Does it make sense to put people in a Euclidean 3-

space where we treat values for all three elements equally? Why or why not? How would you

design a metric to cluster people appropriately? Explain your reasoning.

D) (1 point) A strand of DNA can be encoded as a sequence of four bases—adenine, guanine,

cytosine, and thymine. Consider the metric for strings defined in “the String to String Correction

Problem” (a paper available on the course calendar). Explain how this could be used to determine

the distance between two strands of DNA. Does the metric need to be adapted in any way, or

could it be used, essentially as-is? Back up your statement.

Problem 2 Making a Nearest-Neighbor Spell Checker (3 points):

The combination of a distance measure for words and the word list from the dictionary are all you

need to make a simple nearest-neighbor spell-corrector. Define a word as a text string. Let

L={l1,l2,…ln } be the list of words and let w be the word you’re spell-checking. Define d(a,b) as

the Levenshtein distance (Figure 1) between strings a and b. Then, the closest word in the

dictionary to our word is the one with the lowest distance to that word.

If we assume that the closest word in the dictionary is the one to use for correcting, then this is all

we need for a spell checker. Given a sentence like “My doeg haz gleas”, simply find the closest

dictionary word to each of the words in the sentence. If the distance to the closest word is 0, the

word must be spelled right. If the closest dictionary word is not 0 steps away, then substitute the

closest dictionary word in and you’ll get “My dog has fleas”. Or so we hope.

Figure 1 shows pseudocode of the Levenshtein distance for strings. This variant is the one by

Wagner and Fischer. It gives a measure of distance between any two strings. Refer to the paper

‘The String-to-string Correction Problem’ from the assigned reading in the course calendar for

more details on this.

l

* = argmin

l∈L

(d(l,w))

EECS 349 (Machine Learning) Homework 2

int LevenshteinDistance(char s[1..m],

char t[1..n],

deletionCost,

insertionCost,

substitutionCost)

{

// for all i and j, d[i,j] will hold the Levenshtein distance between

// the first i characters of s and the first j characters of t

// NOTE: the standard approach is to set

// deletionCost = insertionCost = substitutionCost = 1

declare int d[0..m, 0..n] // note that d has (m+1)x(n+1) values

for i from 0 to m

d[i, 0] := i*deletionCost // dist of any 1st string to an empty 2nd string

for j from 0 to n

d[0, j] := j*insertionCost // dist of any 2nd string to an empty 1st string

for j from 1 to n

{

for i from 1 to m

{

if s[i] = t[j] then

d[i, j] := d[i-1, j-1] // no operation cost, because they match

else

d[i, j] := minimum(d[i-1, j] + deletionCost,

d[i, j-1] + insertionCost,

d[i-1, j-1] + substitutionCost)

}

}

return d[m,n]

}

Figure 1. Pseudo code for Levenshtein distance: Wagner and Fischer algorithm

Getting a Dictionary

The 12dicts project ( http://wordlist.aspell.net/12dicts/ ) has several variant dictionaries of

English. You should download Version 6 of the data set (also known as 12dicts-6.0.2.zip ).

The Data

The file wikipediatypo.txt is included with this homework. This is a list of common spelling

mistakes in the Wikipedia and is used to correct typographical errors throughout Wikipedia. See

(http://en.wikipedia.org/wiki/Wikipedia:Typo ) for information on spellchecking the Wikipedia.

Each line in wikipediatypo.txt contains a common misspelled word followed by its associated

correction. The two words (or phrases) on each line (error, correction) are separated by a tab.

Note, some corrections may contain blanks (e.g. a line containing “aboutto” and “about to”).

The file wikipediatypoclean.txt is a subset of wikipediatypo.txt that contains only words whose

correct spelling is an entry in the dictionary file /American/3esl.txt from the 12dicts data set. It is

in the same format as wikipediatypo.txt. Note this uses only words starting with a portion of the

alphabet.

The file syntheticdata.txt has the same format as the other two files, but the spelling errors were

created using a synthetic distribution. Note this uses only words starting with a portion of the

alphabet.

EECS 349 (Machine Learning) Homework 2

Programming Hints

Some helpful packages to import to do this assignment:

import csv #this would be for reading our text files

import numpy as np #this would be for computing levenshtein_distance

(multi-dimensional arrays are easier with these)

import matplotlib.pyplot as plt #this would be for making many of the

plots required in this assignment

A. (1 point) Write a function that finds the closest word (string) in a dictionary (a list of strings)

to a given input string. Call it find_closest_word. This function should call the

levenshtein_distance function that you write. Both of these functions should be included in the

same file “spellcheck.py”.

def find_closest_word (string1, dictionary):

#write some code to do this, calling levenshtein_distance, and

return a string (the closest word)

return closest_word

def levenshtein_distance(string1, string2, deletion_cost,

insertion_cost, substitution_cost):

#write some code to compute the levenshtein distance between two

strings, given some costs, return the distance as an integer

return distance

B. (1 point) Write a Python command-line program (“spellcheck.py”) that takes two ASCII text

files as input: the first is the file to be spell-checked. The second is a dictionary word list with one

word (or phrase) per line (we’re thinking of 3esl.txt). It should output a file in the current

directory called corrected.txt. Treat every contiguous sequence of alphanumeric characters in the

input file as a word. Treat all other characters (e.g. blanks, commas, ‘#’, tabs, etc.) as word

delimiters. The file corrected.txt should be the spell-corrected input file, where each word in the

input file has been replaced by the nearest dictionary word. Call this file spellcheck.py. Make sure

it can be called as:

python spellcheck.py

C. (1 point) Write a function, called measure_error that takes 3 lists of strings, called typos,

truewords, and dictionarywords. respectively. The ith typo corresponds to the ith trueword. For

each typo i, the function should find the closest dictionary word with the function

find_closest_word and then compare the closest dictionary word to the ith trueword . If the

closest dictionary word and the trueword are the same, then the error for that example is 0. If they

are different, then this was a misclassification and the error is 1. The output of this function

should be the error rate (number of errors divided by the number of typos).

def measure_error(typos, truewords, dictionarywords):

#find whether the corrected typo using the dictionary matches

the true word. 0 if it doesn’t, 1 if it does. Count them all up and

return a real value between 0 and 1 representing the error_rate

return error_rate

EECS 349 (Machine Learning) Homework 2

Problem 3 Parameter Picking (2 points):

A. (1/2 point) How long does it take to run measure_error on the data in wikipediatypo.txt using

the 3esl.txt dictionary? Hint: use the time function for this. For example:

import time

start = time.time()

# your code here

print (time.time() – start)

Different metrics can change the performance of a nearest-neighbor classifier. One obvious way

to change your metric is to vary the insertion, deletion and substitution costs of your

levenshtein_distance function. One obvious thing to try would be a grid search of the space and

see which settings give you the best results. If you were to vary insertion, deletion and

substitution costs among the values in the set {0, 1, 2, 4}, this would give 64 parameter

combinations. How long would that take, given you use the files specified in A? How about if

you used 10-fold cross validation on your data?

B. (1/2 point) Design an experiment to determine the best values for insertion, deletion and

substitution costs that can be run by your system in under 1 hour. Say what values of each

parameter you’ll try. Say whether you’re doing cross validation. If so, how many folds? Pick a

dictionary (or part of a dictionary) and data that will let you complete the grid search described in

part A of this problem. Explain the reasoning for your choices.

C. (1 point) Perform the experiment you designed in part B. Show a graph (or graphs) that give

the results. What is the best combination of insertion, deletion and substitution costs?

Problem 4 A Better Distance Measure? (2 points):

A. (1 point) We discussed an edit distance measure in class that uses the distance on the

QWERTY keyboard as the substitution cost function. Ignore capitalization (things on the same

key have distance 0). Distance between keys is Manhattan distance, measured by difference in

rows + difference in columns. Let d(x,y) be the distance function. Example values are d(‘Q’,’q’) =

0, d(‘G’,’B’) = 1, and d(‘Q’,’d’) = 3, since ‘d’ is 2 columns to the right and 1 row down from Q.

Consider only alphanumeric characters. Treat all other characters (e.g. blanks, commas, ‘#’, tabs,

etc.) as word delimiters that you don’t have to measure distances on. In other words, you don’t

have to calculate d(‘1’,’!’).

Write an edit distance that uses this substitution cost function in place of a fixed substitution cost.

def qwerty_levenshtein_distance(string1, string2, deletion_cost,

insertion_cost):

#write some code to compute the levenshtein distance between two

strings, given some costs, return the distance as an integer

return distance

B. (1 point) Rerun the experiment from problem 3, this time using qwerty_levenshtein_distance.

This means you’re only optimizing two parameters: deletion_cost and insertion_cost. You may

need to try different ranges of values than you did in question 3. What set of values did you try?

Show a graph (or graphs) with results. Don’t forget to label dimensions. What is the best

combination of insertion, deletion and substitution costs? Is qwerty_levenshtein_distance better

than levenshtein_distance? Why or why not?