Description
1. Overview of the Assignment
In this assignment, you will implement the SON algorithm using the Apache Spark Framework. You will develop a
program to find frequent itemsets in two datasets: one simulated dataset and one real-world dataset generated
from the Yelp dataset. The goal of this assignment is to apply the algorithms you have learned in class to large
datasets more efficiently in a distributed environment.
2. Requirements
2.1 Programming Requirements
a. You must use Python to implement all tasks. You can only use standard Python libraries (i.e.,
external libraries like numpy or pandas are not allowed). There will be a 10% bonus for each task if you
also submit a Scala implementation and both your Python and Scala implementations are correct.
b. You are required to use only Spark RDD in order to understand Spark operations more deeply. You will not get
any points if you use Spark DataFrame or DataSet.
2.2 Programming Environment
Python 3.6, Scala 2.11 and Spark 2.3.3
We will use these library versions to compile and test your code. There will be a 20% penalty if we cannot run your code
due to library version inconsistency.
2.3 Write your own code
Do not share code with other students!!
For this assignment to be an effective learning experience, you must write your own code! We emphasize this point
because you will be able to find Python implementations of some of the required functions on the web. Please do
not look for or at any such code!
TAs will combine all the code we can find from the web (e.g., GitHub) as well as other students’ code from this and other
(previous) sections for plagiarism detection. We will report all detected plagiarism.
2.4 What you need to turn in
Your submission must be a zip file with the name firstname_lastname_hw2.zip (all lowercase). You need to pack the
following files in the zip file (see Figure 1):
a. two Python scripts, named (all lowercase):
firstname_lastname_task1.py
firstname_lastname_task2.py
b1. [OPTIONAL] two Scala scripts, named (all lowercase):
firstname_lastname_task1.scala
firstname_lastname_task2.scala
b2. [OPTIONAL] one jar package, named: firstname_lastname_hw2.jar (all lowercase)
c. You don’t need to include your results. We will grade your code with our own testing data (the data will be in
the same format).
Figure 1: Submission Structure
3. Datasets
In this assignment, you will use one simulated dataset and one real-world dataset. In task 1, you will build and test
your program with a small simulated CSV file that has been provided to you. In task 2, you will build and
test your program with large real-world data generated from a subset of business.json and
review.json from the Yelp dataset (https://www.yelp.com/dataset), with the same structure as the simulated
data. To save time, this dataset has also been provided to you.
Figure 2: Input Data Format
4. Tasks
In this assignment, you will implement the SON algorithm to solve all tasks (Tasks 1 and 2) on top of the Apache Spark
Framework. You need to find all possible combinations of frequent itemsets in any given input file within
the required time. You can refer to Chapter 6 of the Mining of Massive Datasets
book and concentrate on Section 6.4, Limited-Pass Algorithms. You need to use the A-Priori algorithm to process
each chunk of the data. A minimal sketch of this two-pass structure is given below.
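The following is only an illustrative sketch of that two-pass structure, not a required interface. It assumes sc is the SparkContext, baskets is an RDD of Python sets of item ids, support is the integer support threshold, and the helper name apriori is hypothetical.

from collections import defaultdict
from itertools import combinations

def apriori(partition, support, total):
    # A-Priori on one chunk; the support is scaled down to the chunk size.
    chunk = [set(b) for b in partition]
    threshold = support * len(chunk) / float(total)
    counts = defaultdict(int)
    for b in chunk:
        for item in b:
            counts[(item,)] += 1
    frequent = {k for k, v in counts.items() if v >= threshold}
    result = set(frequent)
    size = 2
    while frequent:
        items = {i for itemset in frequent for i in itemset}
        counts = defaultdict(int)
        for b in chunk:
            for cand in combinations(sorted(b & items), size):
                # Monotonicity: every (size-1)-subset must itself be frequent.
                if all(sub in frequent for sub in combinations(cand, size - 1)):
                    counts[cand] += 1
        frequent = {k for k, v in counts.items() if v >= threshold}
        result |= frequent
        size += 1
    return result

total = baskets.count()

# Pass 1: itemsets frequent in some chunk become the global candidates.
candidates = (baskets
              .mapPartitions(lambda part: apriori(part, support, total))
              .distinct()
              .collect())

# Pass 2: count every candidate over the full dataset and keep only the
# itemsets whose global count reaches the support threshold.
cand_bc = sc.broadcast(candidates)
frequent_itemsets = (baskets
                     .flatMap(lambda b: [(c, 1) for c in cand_bc.value
                                         if set(c).issubset(b)])
                     .reduceByKey(lambda x, y: x + y)
                     .filter(lambda kv: kv[1] >= support)
                     .keys()
                     .collect())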
4.1 Task 1: Simulated data (6 pts)
There is a CSV file (test_data.csv) posted on Blackboard. You can use this test file to debug your code.
In this task, you need to build two kinds of market-basket models.
Case 1 (3 pts):
You will calculate the combinations of frequent businesses (as singletons, pairs, triples, etc.) that qualify as
frequent given a support threshold. You need to create a basket for each user containing the business ids reviewed
by this user. If a business was reviewed more than once by a reviewer, we consider this business was rated only once.
More specifically, the business ids within each basket are unique. The generated baskets are similar to the
following (a sketch of one way to build them appears after the example):
user1: [business11, business12, business13, …]
user2: [business21, business22, business23, …]
user3: [business31, business32, business33, …]
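One possible construction, as a sketch only: it assumes the input CSV has a header row and that each data row is user_id,business_id; the names sc and input_path are illustrative.

# Build one basket of unique business ids per user (Case 1).
lines = sc.textFile(input_path)
header = lines.first()
baskets = (lines
           .filter(lambda row: row != header)
           .map(lambda row: row.split(','))
           .map(lambda cols: (cols[0], cols[1]))   # (user_id, business_id)
           .groupByKey()
           .mapValues(set)                         # repeat reviews count once
           .values())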
Case 2 (3 pts):
You will calculate the combinations of frequent users (as singletons, pairs, triples, etc.) that qualify as frequent
given a support threshold. You need to create a basket for each business containing the user ids that commented
on this business. Similar to Case 1, the user ids within each basket are unique. The generated baskets are similar to:
business1: [user11, user12, user13, …]
business2: [user21, user22, user23, …]
business3:[user31, user32, user33, …]
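Under the same assumptions as the Case 1 sketch, Case 2 only transposes the key: map each row to (business_id, user_id) instead.

# Build one basket of unique user ids per business (Case 2).
baskets = (lines
           .filter(lambda row: row != header)
           .map(lambda row: row.split(','))
           .map(lambda cols: (cols[1], cols[0]))   # (business_id, user_id)
           .groupByKey()
           .mapValues(set)
           .values())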
Input format:
1. Case number: an integer that specifies the case. 1 for Case 1 and 2 for Case 2.
2. Support: an integer that defines the minimum count to qualify as a frequent itemset.
3. Input file path: the path to the input file, including path, file name and extension.
4. Output file path: the path to the output file, including path, file name and extension.
Output format:
1. Runtime: the total execution time from loading the file till finishing writing the output file.
You need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”.
2. Output file:
(1) Intermediate result
You should use “Candidates:” as the tag. For each line, you should output the candidates of frequent itemsets
found after the first pass of the SON algorithm, followed by an empty line after each combination. The printed itemsets
must be sorted in lexicographical order (both user_id and business_id are of type string).
(2) Final result
You should use “Frequent Itemsets:” as the tag. For each line, you should output the final frequent itemsets
found after finishing the SON algorithm. The format is the same as for the intermediate results. The printed
itemsets must be sorted in lexicographical order.
Here is an example of the output file:
Both the intermediate results and the final results should be saved in ONE output result file,
“firstname_lastname_task1.txt”.
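A sketch of the writing and timing logic follows. It assumes candidates and frequent_itemsets are lists of tuples of string ids and output_path is the fourth command-line argument; the exact per-itemset rendering must match the example in the handout, which this sketch only approximates.

import time

def format_itemsets(itemsets):
    # Group itemsets by size; within each size, sort lexicographically and
    # write one line per size, separated by an empty line.
    by_size = {}
    for itemset in itemsets:
        by_size.setdefault(len(itemset), []).append(tuple(sorted(itemset)))
    lines = []
    for size in sorted(by_size):
        lines.append(','.join(str(t) for t in sorted(by_size[size])))
    return '\n\n'.join(lines)

start = time.time()
# ... read the input, build the baskets, run the SON algorithm ...
with open(output_path, 'w') as f:
    f.write('Candidates:\n' + format_itemsets(candidates) + '\n\n')
    f.write('Frequent Itemsets:\n' + format_itemsets(frequent_itemsets) + '\n')
print('Duration: %d' % int(time.time() - start))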
Execution example:
Python:
spark-submit firstname_lastname_task1.py <case number> <support> <input_file_path> <output_file_path>
Scala:
spark-submit --class firstname_lastname_task1 firstname_lastname_hw2.jar <case number> <support> <input_file_path> <output_file_path>
4.2 Task 2: SON algorithm on Yelp data (6.5 pts)
In task 2, you will explore the Yelp dataset to find the frequent business sets (only Case 1).
We generated a sample dataset in CSV format from business.json and review.json for the state of Arizona. You
can download this dataset (AZ_Yelp.csv) from Blackboard.
Apply the SON algorithm
The requirements for task 2 are similar to those of task 1. However, you will test your implementation with the large
dataset (AZ_Yelp.csv), and for this purpose you need to report the total execution time. For this execution time, we
also take into account the time from reading the file till writing the results to the output file. You are asked to find
the frequent business sets (only Case 1) from the file we provide. The following are the steps you need to take:
1. Read the AZ_Yelp CSV file into an RDD and then build the Case 1 market-basket model;
2. Keep only the qualified users who reviewed more than k businesses (k is the filter threshold; a sketch of this step follows the list);
3. Apply the SON algorithm code to the filtered market-basket model.
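A sketch of the filtering step, under the same assumed CSV layout as in Task 1, with k the filter threshold read from the command line:

# Keep only the baskets of users who reviewed more than k distinct businesses.
lines = sc.textFile(input_path)
header = lines.first()
baskets = (lines
           .filter(lambda row: row != header)
           .map(lambda row: row.split(','))
           .map(lambda cols: (cols[0], cols[1]))   # (user_id, business_id)
           .groupByKey()
           .mapValues(set)
           .filter(lambda kv: len(kv[1]) > k)
           .values())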
Input format:
1. Filter threshold: an integer used to select the qualified users.
2. Support: an integer that defines the minimum count to qualify as a frequent itemset.
3. Input file path: the path to the input file, including path, file name and extension.
4. Output file path: the path to the output file, including path, file name and extension.
Output format:
1. Runtime: the total execution time from loading the file till finishing writing the output file.
You need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”.
2. Output file
The output file format is the same as in task 1. Both the intermediate results and the final results should be saved in ONE
output result file, “firstname_lastname_task2.txt”.
Execution example:
Python:
spark-submit firstname_lastname_task2.py <filter threshold> <support> <input_file_path> <output_file_path>
Scala:
spark-submit --class firstname_lastname_task2 firstname_lastname_hw2.jar <filter threshold> <support> <input_file_path> <output_file_path>
5. Evaluation Metric
Task 1:
Input File      Case   Support   Runtime (sec)
test_data.csv   1      4         <=100
test_data.csv   2      7         <=200
Task 2:
Input File    Filter Threshold   Support   Runtime (sec)
AZ_Yelp.csv   70                 50        <=1000
6. Grading Criteria
(a % penalty is a percentage of the points possible for the affected item)
1. You can use your free 5-day extension days separately or together.
2. There will be a 10% bonus if you use both Scala and Python.
3. If you do not apply the SON algorithm, there will be no points for this assignment.
4. If we cannot run your programs with the command we specified, there will be an 80% penalty.
5. If your program cannot run with the required Scala/Python/Spark versions, there will be a 20% penalty.
6. If our grading program cannot find a specified tag, there will be no points for that question.
7. If the outputs of your program are unsorted or partially sorted, there will be a 50% penalty.
8. If the header of the output file is missing, there will be a 10% penalty.
9. You can request a regrade within seven days once the scores are released; no disputes after one week. If our original grading turns out to be correct, there will be a 20% penalty.
10. There will be a 20% penalty for late submission within a week, and no points after a week.
11. There will be no points for a task if its execution time exceeds the runtime limit in Section 5, Evaluation Metric.
12. The Scala bonus will be calculated only when your results from Python are correct. There is no partial credit for Scala. See the example situations below:

Task     Score for Python              Score for Scala (10% of previous column if correct)   Total
Task 1   Correct: 6 points             Correct: 6 * 10%                                      6.6
Task 1   Wrong: 0 points               Correct: 0 * 10%                                      0.0
Task 1   Partially correct: 3 points   Correct: 3 * 10%                                      3.3
Task 1   Partially correct: 3 points   Wrong: 0                                              3.0