Description
Set Similarity Join Using Spark on AWS
Problem Definition:
Given two collections of records R and S, a similarity function sim(., .), and
a threshold τ, the set similarity join between R and S, is to find all record
pairs r (from R) and s (from S), such that sim(r, s) >= τ.
In this project, you are required to use the Jaccard similarity function to
compute sim(r, s). Given the following example, and set τ=0.5,
the result pairs are (r1, s1) (similarity 0.75), (r2, s2) (similarity 0.5), (r3, s1)
(similarity 0.5), (r3, s2) (similarity 0.5).
Input files:
You are required to do the “self-join”, that is, a single input file is given, in
which each line is in format of:
“RecordId list”,
and this file serves as both R and S.
An example input file is as below (integers are separated by space):
0 1 4 5 6
1 2 3 6
2 4 5 6
3 1 4 6
4 2 5 6
5 3 5
This sample file “tiny-data.txt” can be downloaded at:
https://webcms3.cse.unsw.edu.au/COMP9313/18s1/resources/16524
Another sample input file “flickr_small.txt” can be downloaded at:
https://webcms3.cse.unsw.edu.au/COMP9313/18s1/resources/16525
Output:
The output file contains the similar pairs together with their similarities.
Each line is in format of “(RecordId1,RecordId2)\tSimilarity”
(RecordId1<RecordId2, and there are no duplicate pairs in the result). The
similarities are of double precision. The pairs are sorted in ascending order
(by the first record and then the second).
Given the example input data, the output file is like:
(0,2)\t0.75
(0,3)\t0.75
(1,4)\t0.5
(2,3)\t0.5
(2,4)\t0.5
Code format:
Name your java file as “SetSimJoin.scala” and put it in the package
“comp9313.ass4”. Your program should take three parameters: the input file,
the output folder, and the similarity threshold τ (double precision).
Cluster configuration:
Create an S3 bucket with name “comp9313.” in
AWS. Create a folder “project4” in this bucket for holding the input files.
This project aims to let you see the power of distributed computation. Your
code should scale well with the number of nodes used in a cluster. You are
required to create three clusters in AWS to run the same job:
Cluster1 – 2 nodes of instance type m3.xlarge;
Cluster2 – 3 nodes of instance type m3.xlarge;
Cluster3 – 4 nodes of instance type m3.xlarge.
Select release EMR-5.13.0 when creating each cluster. Unzip and upload the
following data set to your S3 bucket, and set τ to 0.85 to run your program:
https://webcms3.cse.unsw.edu.au/COMP9313/18s1/resources/16526
Record the runtime on each cluster, and draw a figure where the x-axis is the
number of nodes you used and the y-axis is the time of getting the result,
and store this figure in a file “Runtime.jpg”. Please also take a screenshot of
running your program on AWS in each cluster as a proof of the runtime.
Compress the three screenshots into a zip file “Screenshots.zip”. Briefly
describe your optimization techniques in a file “Optimization.pdf”.
Notes
Create a project locally in Eclipse, test everything in your local computer,
and finally do it in AWS EMR.
Documentation and code readability
Your source code will be inspected and marked based on readability and
ease of understanding. The efficiency and scalability of this project is very
important and will be evaluated as well. Below is an indicative marking
scheme:
Result correctness: 25
Efficiency and Scalability: 10
Code structure, Readability, and
Documentation: 5
Submission:
Deadline: Monday 28th May 09:59:59 PM
Log in any CSE server (williams or wagner), and use the give command
below to submit your solutions:
$ give cs9313 assignment4 SetSimJoin.scala Runtime.jpg Screenshots.zip
Optimization.pdf
Or you can submit through:
https://cgi.cse.unsw.edu.au/~give/Student/give.php
If you submit your assignment more than once, the last submission will
replace the previous one. To prove successful submission, please take a
screenshot as assignment submission instructions show and keep it by
yourself.
Late submission penalty
10% reduction of your marks for the 1st day, 30% reduction/day for the
following days.
Plagiarism:
The work you submit must be your own work. Submission of work partially
or completely derived from any other person or jointly written with any
other person is not permitted. The penalties for such an offence may include
negative marks, automatic failure of the course and possibly other academic
discipline. Assignment submissions will be examined manually.
Relevant scholarship authorities will be informed if students holding
scholarships are involved in an incident of plagiarism or other misconduct.
Do not provide or show your assignment work to any other person – apart
from the teaching staff of this subject. If you knowingly provide or show
your assignment work to another person for any reason, and work derived
from it is submitted you may be penalized, even if the work was submitted
without your knowledge or consent.