Description
What to Submit
When you have completed the assignment, move or copy your Python scripts and outputs into a directory
(e.g., assignment2), and use the following command to electronically submit your files:
% submit 4415 a2 umapper.py ureducer.py unigrams.txt bmapper.py breducer.py bigrams.txt
tmapper.py treducer.py trigrams.txt frequency-computation.txt skipgrammapper.py
skipgramreducer.py skipgrams.txt iimaper.py iireducer.py inverted-index.txt team.txt
The team.txt file includes information about the team members (first name, last name, student ID,
login, yorku email). You can also submit the files individually after you complete each part of the
assignment: simply execute the submit command with the name of the file that you wish to submit. Make
sure you name your files exactly as stated (including lower/upper case letters). Failure to do so will
result in a mark of 0 being assigned. You may check the status of your submission using the command:
% submit -l 4415 a2
A. Distributed Computation of n-grams (35%, 5% each)
In the field of computational linguistics, an n-gram is a contiguous sequence of n items from a given
sample of text. For this part of the assignment you can assume that items are words collected from song
lyrics. An n-gram of size 1 is referred to as a “unigram”; of size 2 is a “bigram”; of size 3 is a “trigram”.
For example, given the text input “I love ice cream”, the following unigrams, bigrams and
trigrams are computed:
unigrams (“I”, “love”, “ice”, “cream”)
bigrams (“I love”, “love ice”, “ice cream”)
trigrams (“I love ice”, “love ice cream”)
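The sliding-window construction in the example above can be sketched in a few lines of Python. The function name ngrams and the whitespace tokenisation are assumptions made for illustration only; the assignment does not prescribe how the lyrics should be tokenised or normalised.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    # A text with k tokens has k - n + 1 n-grams (none if k is smaller than n).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love ice cream".split()  # naive whitespace tokenisation (assumption)
print(ngrams(tokens, 1))  # ['I', 'love', 'ice', 'cream']
print(ngrams(tokens, 2))  # ['I love', 'love ice', 'ice cream']
print(ngrams(tokens, 3))  # ['I love ice', 'love ice cream']
```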
Your task is to design and implement MapReduce algorithms that, given a collection of English songs:
- compute the number of occurrences of each unigram in the song collection (umapper.py,
  ureducer.py) and output the results in a file called unigrams.txt
- compute the number of occurrences of each bigram in the song collection (bmapper.py,
  breducer.py) and output the results in a file called bigrams.txt
- compute the number of occurrences of each trigram in the song collection (tmapper.py,
  treducer.py) and output the results in a file called trigrams.txt
In addition, explain how you would modify these scripts to compute the frequency of each n-gram
(instead of the number of occurrences). Provide a short answer in plain text (up to half a page)
in a file named frequency-computation.txt
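As a rough illustration of the counting tasks above, the following sketch imitates the usual streaming pattern: the mapper emits one tab-separated record per n-gram occurrence, the records are sorted by key, and the reducer sums the counts of consecutive records with the same key. The names map_lines and reduce_lines, the lower-casing and the whitespace tokenisation are assumptions, not requirements of the assignment. The last lines show one possible reading of "frequency" (count divided by the total number of n-grams); that interpretation is also an assumption.

```python
from itertools import groupby


def map_lines(lines, n=1):
    """Mapper: emit one 'ngram TAB 1' record per n-gram occurrence."""
    for line in lines:
        tokens = line.lower().split()  # naive tokenisation (assumption)
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n]) + "\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive records that share a key.

    Relies on the records arriving sorted by key, which is what the
    sort step between the map and reduce phases provides.
    """
    pairs = (line.rstrip("\n").rsplit("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(int(count) for _, count in group)


lyrics = ["I love ice cream", "I love you"]  # toy input, not real song data
counts = dict(reduce_lines(sorted(map_lines(lyrics, n=2))))
print(counts)  # {'i love': 2, 'ice cream': 1, 'love ice': 1, 'love you': 1}

# Relative frequency under the assumed definition: count / total occurrences.
total = sum(counts.values())
frequencies = {ngram: count / total for ngram, count in counts.items()}
print(frequencies["i love"])  # 0.4
```

In an actual MapReduce job the total is not known to a single reducer call, so it would have to be obtained separately, for example with an extra pass or a dedicated key.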
The collection of songs is provided in a file songdata.csv that follows the same format as the
original data set provided by Kaggle. The contents of the file might vary when testing your code.
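The exact layout of songdata.csv is not described here. The sketch below assumes a header row with the columns artist, song, link and text, where text holds the lyrics in a quoted field that may contain commas and line breaks; check the actual file before relying on this. The helper name read_lyrics and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical two-song sample; the column names are an assumption.
sample = (
    'artist,song,link,text\n'
    'Some Artist,Some Song,/s/some+song.html,"I love ice cream\nI love you"\n'
    'Other Artist,Other Song,/o/other+song.html,"Ice cream, you scream"\n'
)


def read_lyrics(fileobj):
    """Yield the lyrics of each song as one string."""
    # csv.DictReader honours quoting, so commas and line breaks inside
    # the lyrics field do not split a song into several records.
    for row in csv.DictReader(fileobj):
        yield row["text"]


songs = list(read_lyrics(io.StringIO(sample, newline="")))
print(len(songs))  # 2
```

If the lyrics field does span several physical lines, splitting on commas or treating every input line as a complete record would cut songs apart, which is worth keeping in mind when designing the mappers.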
Running the script:
The following webpage provides useful information on how to test your scripts first locally and then in
the Hadoop environment:
https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
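A local test usually chains the mapper, a sort step and the reducer before anything is run on the cluster. The helper below is a minimal sketch of that idea in Python; the name run_local is hypothetical, and it assumes that both scripts read from standard input and write tab-separated records to standard output.

```python
import subprocess
import sys


def run_local(mapper_path, reducer_path, text):
    """Run mapper -> sort -> reducer on a string and return the reducer output."""
    mapped = subprocess.run(
        [sys.executable, mapper_path],
        input=text, capture_output=True, text=True, check=True,
    ).stdout
    # Stand-in for the sort phase between map and reduce: order the records
    # so that identical keys end up on adjacent lines.
    shuffled = "\n".join(sorted(mapped.splitlines())) + "\n"
    reduced = subprocess.run(
        [sys.executable, reducer_path],
        input=shuffled, capture_output=True, text=True, check=True,
    ).stdout
    return reduced
```

For example, run_local("umapper.py", "ureducer.py", "I love ice cream\n") would return whatever ureducer.py prints for that single line of input.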