CS6350 Homework #2

In this homework, you will use Spark (PySpark or Scala) to solve the following
problems.
Q1:
Write a Spark script to find the total number of common friends for every pair of friends.
The key idea is that if two people are friends, they are likely to have many mutual/common friends.
For example,
Alice’s friends are Bob, Sam, Sara, Nancy
Bob’s friends are Alice, Sam, Clara, Nancy
Sara’s friends are Alice, Sam, Clara, Nancy
Alice and Bob are friends, so their mutual friend list is [Sam, Nancy].
Sara and Bob are not friends, so their mutual friend list is empty. (In this case you
may exclude them from your output.)
Input files:
1. mutual.txt
The input contains the adjacency list and has multiple lines. Each line holds a unique
integer ID corresponding to a unique user, followed by a comma-separated list of unique IDs
corresponding to the friends of that user. Note that the friendships are mutual (i.e., edges
are undirected): if A is a friend of B, then B is also a friend of A. The data provided is
consistent with that rule, as there is an explicit entry for each side of each edge. So when
you form the pairs, always consider either (A, B) or (B, A) for users A and B, but not both.
Output: The output should contain one line per friend pair, giving the unique IDs of
user A and user B (A and B are friends), separated by a comma, and the
< Mutual/Common Friend Number >, the total number of common friends between user
A and user B.
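The counting logic described for Q1 can be sketched in plain Python before porting it to Spark. This is only an illustration, not the expected solution: the tab separator between the user and the friend list, and the names `parse_line` and `common_friend_counts`, are assumptions. In Spark, the grouping step would roughly correspond to a `flatMap` followed by a reduce by key.

```python
from collections import defaultdict


def parse_line(line, sep="\t"):
    """Split one adjacency line into (user, set of friends).

    The separator between the user and the friend list is an assumption.
    """
    user, _, friends = line.strip().partition(sep)
    return user, {f for f in friends.split(",") if f}


def common_friend_counts(lines):
    """Return {(A, B): number of common friends} for every friend pair."""
    grouped = defaultdict(list)
    for line in lines:
        user, friends = parse_line(line)
        for friend in friends:
            # Sorting the pair keeps (A, B) and never also (B, A).
            pair = tuple(sorted((user, friend)))
            grouped[pair].append(friends)

    counts = {}
    for pair, friend_sets in grouped.items():
        # A pair is complete only when both users have an entry in the input.
        if len(friend_sets) == 2:
            counts[pair] = len(friend_sets[0] & friend_sets[1])
    return counts


if __name__ == "__main__":
    sample = [
        "Alice\tBob,Sam,Sara,Nancy",
        "Bob\tAlice,Sam,Clara,Nancy",
        "Sara\tAlice,Sam,Clara,Nancy",
    ]
    for (a, b), n in sorted(common_friend_counts(sample).items()):
        print(f"{a}, {b}\t{n}")
```

Because Sara and Bob never list each other, no pair is generated for them, which matches the note that non-friends may be excluded.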
Q2.
Please answer this question using the dataset from Q1.
Find the friend pair(s) whose number of common friends is greater than the average number
over all the pairs.
Output Format:
,
Please use the following dataset.
1. mutual.txt
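A minimal plain-Python sketch of the Q2 filter follows. It assumes the Q1 result is already available as a mapping from a friend pair to its common-friend count; the function name `pairs_above_average` and the sample IDs are hypothetical.

```python
def pairs_above_average(counts):
    """Keep the pairs whose count is strictly greater than the mean count."""
    if not counts:
        return {}
    average = sum(counts.values()) / len(counts)
    return {pair: n for pair, n in counts.items() if n > average}


if __name__ == "__main__":
    # Hypothetical Q1 output: the average here is (4 + 1 + 1) / 3 = 2.
    counts = {("1", "2"): 4, ("1", "3"): 1, ("2", "3"): 1}
    print(pairs_above_average(counts))
```

Whether pairs with zero common friends count toward the average is a choice the assignment text does not settle; this sketch simply averages over whatever pairs it is given.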
Please use Apache Spark to derive some statistics from the Yelp dataset.
Data set info:
The dataset files are as follows, and columns are separated using ‘::’
business.csv. review.csv. user.csv.
Data set Description.
The data set comprises three csv files, namely user.csv, business.csv and review.csv.
business.csv file contains basic information about local businesses.
business.csv file contains the following columns:
“business_id”::“full_address”::“categories”
‘business_id’: (a unique identifier for the business)
‘full_address’: (localized address),
‘categories’: [(localized category names)]
review.csv file contains the star rating given by a user to a business. Use user_id to
associate this review with others by the same user. Use business_id to associate this review
with others of the same business.
review.csv file contains the following columns:
“review_id”::“user_id”::“business_id”::“stars”
‘review_id’: (a unique identifier for the review)
‘user_id’: (the identifier of the authoring user),
‘business_id’: (the identifier of the reviewed business),
‘stars’: (star rating, integer 1-5), the rating given by the user to a business
user.csv file contains aggregate information about a single user across all of Yelp.
user.csv file contains the following columns:
“user_id”::“name”::“url”
‘user_id’: (unique user identifier),
‘name’: (first name, last initial, like ‘Matt J.’), this column has been made anonymous to
preserve privacy
‘url’: url of the user on yelp
Note: :: is Column separator in the files.
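As a small illustration of the ‘::’ separator, the sketch below splits one line into a record keyed by the column names listed above. It is plain Python rather than Spark code, the helper name `parse_record` is hypothetical, and it assumes the files carry no header row or quoting around values.

```python
# Column names as listed in the data set description.
BUSINESS_COLUMNS = ("business_id", "full_address", "categories")
REVIEW_COLUMNS = ("review_id", "user_id", "business_id", "stars")
USER_COLUMNS = ("user_id", "name", "url")


def parse_record(line, columns):
    """Split a '::'-separated line and pair each value with its column name."""
    values = line.rstrip("\n").split("::")
    return dict(zip(columns, values))


if __name__ == "__main__":
    # The IDs below are made up for illustration.
    print(parse_record("r1::u1::b1::4", REVIEW_COLUMNS))
```

In Spark, the same split would typically be applied to each line of the text file inside a `map` transformation.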
Q3:
Please list the ‘name’ and ‘rating’ of users who reviewed businesses located
in “Stanford”.
This will require you to use all three files.
Sample output:

Username Rating
John Snow 4.0
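The three-file join behind Q3 can be sketched in plain Python as below. Rows are assumed to be tuples in the column order given in the data set description, matching “Stanford” by substring in the address is an assumption, and the function name and sample rows are made up.

```python
def stanford_reviewer_ratings(businesses, reviews, users, place="Stanford"):
    """Return (name, rating) for each review of a business located in `place`."""
    # Step 1: IDs of businesses whose address mentions the place.
    business_ids = {b_id for b_id, address, _ in businesses if place in address}
    # Step 2: lookup table from user ID to user name.
    names = {u_id: name for u_id, name, _ in users}
    # Step 3: keep the matching reviews and attach the reviewer's name.
    return [
        (names[u_id], float(stars))
        for _, u_id, b_id, stars in reviews
        if b_id in business_ids and u_id in names
    ]


if __name__ == "__main__":
    businesses = [("b1", "1 Hypothetical Way, Stanford, CA 94305", "List(Cafe)"),
                  ("b2", "2 Sample St, Claremont, CA 91711", "List(Gym)")]
    reviews = [("r1", "u1", "b1", "4.0"), ("r2", "u2", "b2", "5.0")]
    users = [("u1", "John Snow", "url1"), ("u2", "Ann B.", "url2")]
    for name, rating in stanford_reviewer_ratings(businesses, reviews, users):
        print(name, rating)
```

In Spark, steps 1 to 3 would usually become a filter on the business data followed by two joins, one on business_id and one on user_id.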
Q4:
List the business_id, full address, and categories of the top 10 businesses by
average rating.
This will require you to use the review.csv and business.csv files.
Sample output:
business id       full address    categories                                    avg rating
xdf12344444444    CA 91711        List [‘Local Services’, ‘Carpet Cleaning’]    5.0
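One possible way to compute the Q4 ranking is sketched below in plain Python. Rows are assumed to be tuples in the documented column order; the function name `top_businesses` and the tie-break on business ID are assumptions, since the assignment does not say how ties should be ordered.

```python
from collections import defaultdict


def top_businesses(businesses, reviews, n=10):
    """Return (business_id, full_address, categories, avg_rating) for the top n."""
    totals = defaultdict(lambda: [0.0, 0])  # business_id -> [sum of stars, count]
    for _, _, b_id, stars in reviews:
        totals[b_id][0] += float(stars)
        totals[b_id][1] += 1
    averages = {b_id: total / count for b_id, (total, count) in totals.items()}

    info = {b_id: (address, categories) for b_id, address, categories in businesses}
    # Highest average first; ties are broken by business ID (an assumption).
    ranked = sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(b_id, *info[b_id], avg) for b_id, avg in ranked if b_id in info][:n]


if __name__ == "__main__":
    businesses = [("b1", "addr1", "List(A)"), ("b2", "addr2", "List(B)")]
    reviews = [("r1", "u1", "b1", "5"), ("r2", "u2", "b1", "3"), ("r3", "u1", "b2", "5")]
    for row in top_businesses(businesses, reviews):
        print(row)
```

In Spark, the sum-and-count step maps naturally onto an aggregation by business_id, followed by a join with the business data.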
Q5:
For each state, find the total number of businesses.
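A rough plain-Python sketch of the per-state count is given below. The assignment does not say how to extract the state, so this assumes the state is the second-to-last whitespace-separated token of full_address (as in an address ending in “CA 91711”); the helper names and sample addresses are hypothetical.

```python
from collections import Counter


def state_of(full_address):
    """Assumption: the address ends with '<STATE> <ZIP>'."""
    tokens = full_address.split()
    return tokens[-2] if len(tokens) >= 2 else None


def businesses_per_state(businesses):
    """Count distinct business IDs per state."""
    seen = set()
    counts = Counter()
    for b_id, address, _ in businesses:
        if b_id in seen:  # guard against repeated rows, in case they occur
            continue
        seen.add(b_id)
        state = state_of(address)
        if state:
            counts[state] += 1
    return counts


if __name__ == "__main__":
    businesses = [("b1", "1 Sample St Claremont, CA 91711", "x"),
                  ("b2", "2 Sample Ave Stanford, CA 94305", "y"),
                  ("b3", "3 Sample Rd Phoenix, AZ 85001", "z")]
    print(businesses_per_state(businesses))
```

If the real addresses follow a different layout, only `state_of` needs to change.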
Q6:
Find the names of the top 10 users who contributed the most reviews. The
contribution is measured by the percentage of the total reviews that a particular user
gave.
For example:
+-------------+--------------+
|name         |contribution  |
+-------------+--------------+
| John Snow   |22%           |
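The contribution percentage can be sketched in plain Python as follows. Rows are assumed to be tuples in the documented column order; the function name `top_contributors`, the tie-break on user ID, and the sample rows are assumptions for illustration.

```python
from collections import Counter


def top_contributors(reviews, users, n=10):
    """Return (name, percentage of all reviews) for the n most active users."""
    per_user = Counter(u_id for _, u_id, _, _ in reviews)
    total = sum(per_user.values())
    names = {u_id: name for u_id, name, _ in users}
    # Most reviews first; ties are broken by user ID (an assumption).
    ranked = sorted(per_user.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return [(names.get(u_id, u_id), 100.0 * count / total) for u_id, count in ranked]


if __name__ == "__main__":
    reviews = [("r1", "u1", "b1", "5"), ("r2", "u1", "b2", "4"),
               ("r3", "u1", "b3", "3"), ("r4", "u2", "b1", "2")]
    users = [("u1", "John Snow", "url1"), ("u2", "Ann B.", "url2")]
    for name, share in top_contributors(reviews, users):
        print(f"{name}\t{share:.0f}%")
```

Users missing from user.csv fall back to their ID here; dropping them instead would be an equally reasonable choice.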
Q7.
Write a Spark program that constructs an inverted index in the following way.
The map and additional transformations parse each line in an input file, userdata.txt, and emit a
sequence of (word, line number) pairs. The reduce and additional transformations accept all
pairs for a given word, sort the corresponding line numbers, and emit a (word, list of line numbers) pair. The set of all the output pairs forms a simple inverted index. Your task is to emit
those pairs that have the highest frequency.
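The map and reduce steps of the inverted index can be sketched in plain Python as below. Several details are assumptions, not requirements stated in the assignment: words are split on whitespace, line numbers start at 1, repeated occurrences on one line are kept, and “frequency” is read as the number of occurrences of a word. The function names are hypothetical.

```python
from collections import defaultdict


def inverted_index(lines):
    """Map each word to the sorted list of line numbers where it occurs."""
    index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        for word in line.split():  # "map": emit (word, line number)
            index[word].append(line_number)
    # "reduce": sort the line numbers collected for each word
    return {word: sorted(numbers) for word, numbers in index.items()}


def most_frequent(index):
    """Keep only the entries whose word occurs the greatest number of times."""
    if not index:
        return {}
    top = max(len(numbers) for numbers in index.values())
    return {word: numbers for word, numbers in index.items() if len(numbers) == top}


if __name__ == "__main__":
    lines = ["a b a", "b c", "a"]
    print(most_frequent(inverted_index(lines)))
```

In Spark, pairing each line with its number and grouping by word would replace the explicit loop, while the final selection stays a simple filter on list length.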
What to submit
(i) Submit the source code via the eLearning website. (ii) Submit the output file for each question.