InAssignment3,youwillcompletetwotasks.ThegoalistoletyoubefamiliarwithMinHash, Locality Sensitive
Hashing (LSH), and different types of collaborative-filtering recommendation systems. The datasetyouaregoing
toplaywithisasubsetfromtheYelpdataset(https://www.yelp.com/dataset)used in the previous assignments.
2.1 Programming Language and Library Requirements
a. YoumustusePythontoimplementalltasks.YoucanonlyusestandardPythonlibraries(i.e.,external librarieslike
Numpy or Pandas are not allowed). There will be a 10% bonus for each task (or case) if you also submit a Scala
b. You are requiredto onlyuse the Spark RDDto understandSpark operations. Youwill notreceive any pointif
youuse SparkDataFrame orDataSet.
2.2 Programming Environment
your codeduetothe libraryversioninconsistency.
2.3 Write your own code
Do not share your code with other students!!
2.4 What you need to turn in
Your submission must be a zip file with the naming convention: firstname_lastname_hw3.zip (all lowercase,
e.g., yijun_lin_hw3.zip). You should pack the following required (and optional) files in a folder named
firstname_lastname_hw3 (all lowercase, e.g., yijun_lin_hw3)inthezipfile(Figure1,only thefiles in the red boxes are
a. [REQUIRED] two Python scripts containing the main function, named:
b1. [OPTIONAL] two Scala scripts containing the main function, named:
b2. [OPTIONAL] one Jar package, named:
c. [OPTIONAL]Youcanincludeotherscriptstosupportyourprograms(e.g.,callablefunctions),butyou needto
d. Youdon’tneedtoincludeanyresult.Wewillgradeyourcodeusingourtestingdata.Ourtestingdata willbe in
the same format asthevalidationdataset.
Figure 1: The folder structure after your submission file is unzipped.
We generated the following two datasets from the original Yelp review dataset with some filters such as the
condition: “state” == “CA”. We randomly took 60% of the data as the training dataset, 20% of the data as the
c. We do notshare the testingdataset.
In thistask, you will implement the Locality Sensitive Hashing algorithm with Jaccard similarity using
yelp_train.csv. You can refer to sections 3.3 – 3.5 of the Mining of Massive Datasets book.
Inthistask,wefocusonthe“0or1”ratingsratherthantheactualratings/starsfromtheusers.Specifically, if auserhas
rated a business, the user’s contribution in the characteristic matrix is 1. If the user hasn’t rated the business, the
contribution is 0.Youneedtoidentify similarbusinesseswhose similarity >= 0.5.
f(x)=(ax+b)%m or f(x) = ((ax +b)%p)%m
where p is any prime number andmisthe number of bins. You can use any combination forthe parameters
After youhavedefinedallthehashing functions, youwillbuildthe signaturematrix. Thenyouwilldivide thematrix
into b bandswithrrows each, where b x r =n(nisthenumber ofhash functions). You should carefully selectagood
combinationofbandrinyourimplementation.Rememberthattwoitemsarea candidate pairiftheirsignatures
Example of Jaccard Similarity:
user1 user2 user3 user4
business1 0 1 1 1
business2 0 1 0 0
Jaccard Similarity (business1, business2) = #intersection / #union = 1/3
Input format: (we will use the following command to execute your code)
Param: input_file_name:the name ofthe inputfile (e.g., yelp_train.csv), including the file path. Param:
IMPORTANT:Pleasestrictly followtheoutputformatsince your codewillbegradedautomatically.We will not
regrade on formatting issues.
a. The output file is a CSV file, containing all business pairs you have found. The header is “business_id_1,
business_id_2,similarity”. Each pairitself must be in the alphabetical order. The entire file also needsto be in the
alphabeticalorder.Thereisnorequirementforthenumberofdecimalsforthesimilarity value. Please referto the
formatin Figure 2.
Figure 2: a CSV output example for task1
The ground truth dataset only contains the business pairs (from the yelp_train.csv) whose Jaccard
similarity>=0.5.Thebusinesspairitselfissortedinthealphabeticalorder,soeachpaironlyappearsonce in the file
partial creditwith0.95 <= recall < 0.98: Your runtime should be less than 120 seconds with 4G driver memory and 4G executor memory. This is the environment of the grading machine. If yourruntime is more than or equalto 120 seconds, you will not receive any pointforthistask. 4.2 Task2:RecommendationSystem(8.5points) In task 2, you are going to build differenttypes ofrecommendation systems using the yelp_train.csv to predictfor the ratings/stars for given user ids and business ids. You can make any improvement to your recommendation systemin terms ofthe speed and accuracy. You can use the validation dataset (yelp_val.csv)toevaluatethe accuracyofyourrecommendationsystems. There are two options to evaluate your recommendation systems. You can compare your results to the correspond ground truth and compute the absolute differences. You can divide the absolute differences into 5 levels andcountthenumberforeachlevel asfollowing: >=0and<1:12345 >=1 and <2: 123 >=2and<3:1234 >=3and<4:1234 >=4: 12
Thismeansthatthere are12345predictionswith<1difference fromthe groundtruth.Thisway youwill be able to know the error distribution of your predictions and to improve the performance of your recommendation systems. Additionally, you can compute the RMSE (Root Mean Squared Error) by using following formula: Where Predi is the prediction for business i and Ratei isthe true rating for business i. n isthe total number of the business you are predicting. In this task, you are required to implement: Case1:Model-basedCFrecommendationsystemwithSparkMLlib(2.5point) Case2:User-basedCFrecommendationsystem(3points) Case 3: Item-based CF recommendation system (3 points) 4.2.1. Model-based CF recommendation system YouwilluseSparkMLlibtoimplementthistask.YoucanlearnmoreaboutSparkMLlibbythislink: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html. 4.2.2. User-basedCFrecommendationsystem 4.2.3. Item-based CF recommendation system Input format: (we will use the following command to execute your code) Param:train_file_name:the name ofthe training file (e.g., yelp_train.csv), including the file path Param: test_file_name:thenameofthetestingfile(e.g.,yelp_val.csv),includingthefilepath Param: case_id:thecase number(i.e.,1,2,or 3) Param: output_file_name: the name of the prediction result file, including the file path Output format: a. The output file is a CSV file, containing all the prediction results for each user and business pair in the validation/testing data. The header is “user_id, business_id, prediction”. There is no requirement for the order in this task. There is no requirement for the number of decimals for the similarity values. Please refer to the format in Figure 3. Figure 2: Output example in CSV for task2 Grading: We will compare your prediction results againstthe ground truth. We will grade on allthe casesin Task2 based on your accuracy using RMSE. For yourreference,the table below showsthe RMSE baselines and suggested running timeforpredicting thevalidationdata. RMSE baseline and suggested running time for predicting the validation data Case 1 Case 2 Case 3 RMSE 1.31 1.12 1.10 Running Time 40s 150s 100s For grading, we will use the testing data to evaluate your recommendation systems. If you can pass the RMSE baselinesintheabovetable,youshouldbeabletopasstheRMSEbaselinesforthetestingdata. However,ifyour recommendationsystemonlypassestheRMSEbaselinesforthevalidationdata,youwill receive 50%ofthe points for each case. In case 2 and 3, if your RMSE can not pass the baseline, you will receive partial credit: case2: 100% RMSE <= 1.12, 80% 1.12 < RMSE <= 1.18 else 0% case3: 100% RMSE <= 1.10, 80% 1.10 < RMSE <= 1.17 else 0% 5. Grading Criteria (% penalty = % penalty of possible points you get) 1. Youcanuseyourfree5-dayextensionseparatelyortogether. 2. Therewillbe10%bonusif youuse bothScala andPython. 3. Ifwecannotrunyourprogramswiththecommandwespecified,therewillbean80%penalty. 4. IfyourprogramcannotrunwiththerequiredScala/Python/Sparkversions,therewillbea20% penalty. 5. Ifourgradingprogramcannotfindaspecifiedtag,therewillbenopointforthisquestion. 6. Iftheoutputsofyourprogramareunsortedorpartiallysorted,therewillbe50%penalty. 7. Iftheheaderoftheoutputfileismissing,therewillbe10%penalty. 8. Wecanregradeonyour assignmentswithinsevendaysoncethescores arereleased.No argueafter oneweek. Theregradingwillhavea20%penaltyifourgradingis correct. 9. Therewillbea20%penaltyforlatesubmissionswithinaweekandnopointafteraweek. 10. Therewillbenopointifthetotalexecutiontimeforalltasksexceeds20minutes. 11. OnlywhenyourresultsfromPythonarecorrect,thebonusofusingScalawillbecalculated.Thereis no partially pointfor Scala. Seetheexamplesbelow: Example situations Task Score for Python Score for Scala (10% of previous column if correct) Total Task1 Correct: 4points Correct: 4 * 10% 4.4 Task1 Wrong: 0 point Correct: 0 * 10% 0.0 Task1 Partially correct: 2 points Correct: 2 * 10% 2.2 Task1 Partially correct: 2 points Wrong: 0 2.0