## Description

Task 1 (12 points)

This task is about data scraping, data wrangling and natural language processing.

(a) (2 points) Formulate one clearly-defined question to be answered by studying tweets.

(b) (1 point) Scrape as many data points and as many Twitter users as Twitter allows.

(c) Process the data and answer the question in (a). Your analysis must include

i. (2 points) Forming a rectangular, tidy dataset

ii. (1 point) Analysing word frequency

iii. (2 points) Sentiment analysis

iv. (1 point) Visualisation

(d) (3 points) Present your findings in a report format. Submit your R code on Canvas.
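The steps in (c) can be sketched on a toy example. This is only an illustration in base R, assuming three made-up tweets in place of scraped data; the column names (`tweet_id`, `text`) and the five-word sentiment lexicon are hypothetical, and a real analysis would likely use a published lexicon and a dedicated text-mining package.

```r
# Toy stand-in for scraped tweets; the data and column names are hypothetical.
tweets <- data.frame(
  tweet_id = 1:3,
  text = c("I love this phone, great camera",
           "Terrible battery, I hate it",
           "Great screen but bad battery"),
  stringsAsFactors = FALSE
)

# (i) Tidy form: one row per (tweet, word) pair.
clean <- tolower(gsub("[^A-Za-z ]", " ", tweets$text))
words_by_tweet <- strsplit(clean, " +")
tidy <- data.frame(
  tweet_id = rep(tweets$tweet_id, lengths(words_by_tweet)),
  word = unlist(words_by_tweet),
  stringsAsFactors = FALSE
)
tidy <- tidy[nzchar(tidy$word), ]  # drop empty tokens, if any

# (ii) Word frequency, most common words first.
word_freq <- sort(table(tidy$word), decreasing = TRUE)

# (iii) Sentiment: sum of word scores per tweet, using a tiny
# hand-made lexicon (illustrative only).
lexicon <- c(love = 1, great = 1, terrible = -1, hate = -1, bad = -1)
score <- unname(lexicon[tidy$word])
score[is.na(score)] <- 0  # words outside the lexicon count as neutral
tidy$score <- score
sentiment <- aggregate(score ~ tweet_id, data = tidy, FUN = sum)
```

For (iv), `word_freq` and `sentiment` are already in a shape that base R's `barplot()` or a plotting package can take directly.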

Task 2 (10 points)

This task is about model formulation and prediction.

The goal is to use past movie ratings to predict how users will rate movies they haven't watched yet. This type of prediction algorithm forms the underpinning of recommendation engines, such as the ones used by many streaming providers. We have two sources of data:

| Original | Additional | Additional (cont.) |
|---|---|---|
| 1_movie_names.tsv | 2_credits.csv | 2_movies_metadata.csv |
| 1_movie_ratings.csv | 2_keywords.csv | 2_ratings_small.csv |
| 1_users.csv | 2_links_small.csv | 2_ratings.csv |
| 1_predict.csv | 2_links.csv | |

The dataset 1_predict.csv contains the ratings for you to predict, with all ratings initially set to 0. Every predicted rating must be between 1 and 5. You are restricted to using only the given datasets, but you are free to choose any approach or model to make your rating predictions. For example, the choice between using integer ratings and allowing any real-number rating is yours. Present your findings in a report format. Submit your R code on Canvas.
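As one possible starting point, the sketch below shows a simple baseline predictor: the global mean rating plus a per-movie and a per-user offset, clamped to the allowed range of 1 to 5. It is only one of many admissible approaches, not a prescribed method. The toy data and the column names (`user_id`, `movie_id`, `rating`) are assumptions and may not match the supplied files.

```r
# Toy ratings; the real files may use different column names (assumption).
ratings <- data.frame(
  user_id  = c(1, 1, 2, 2, 3),
  movie_id = c(10, 20, 10, 30, 20),
  rating   = c(5, 3, 4, 2, 1)
)
to_predict <- data.frame(
  user_id  = c(3, 1, 4),
  movie_id = c(10, 30, 99),
  rating   = 0
)

# Global mean rating.
mu <- mean(ratings$rating)

# Movie bias: average deviation of each movie's ratings from the global mean.
movie_bias <- sapply(split(ratings$rating - mu, ratings$movie_id), mean)

# User bias: average residual once the global mean and movie bias are removed.
resid <- ratings$rating - mu - movie_bias[as.character(ratings$movie_id)]
user_bias <- sapply(split(resid, ratings$user_id), mean)

# Look up a bias by id; unseen users or movies get a bias of 0.
lookup <- function(bias, ids) {
  b <- unname(bias[as.character(ids)])
  b[is.na(b)] <- 0
  b
}

pred <- mu +
  lookup(movie_bias, to_predict$movie_id) +
  lookup(user_bias, to_predict$user_id)

# Clamp every prediction to the allowed range [1, 5].
to_predict$rating <- pmin(pmax(pred, 1), 5)
```

Wrapping the result in `round()` would give integer ratings instead of real-valued ones; which of the two to report is left open by the task.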

Task 3 (8 points)

This task is about working with a big dataset. You need to form a group of 3-4 for the task.

The Million Song Dataset (280 GB) provides a million contemporary popular music tracks. The dataset consists of feature analysis and metadata for the songs. It does not include any audio, only the derived features. We will use only a subset of it, and revisit the whole dataset using Hadoop later.

(a) (2 points) Load the subset of the Million Song Dataset into R.

(b) (5 points) Use three of your computers to form a cluster using the H2O package, and perform k-means for k = 1, 2, 3, 4, 5 to partition the 10,000 songs in the subset. Use as much information in the subset as possible. We recommend using data.table to wrangle the data so that it is appropriate for k-means.

(c) (1 point) Present your findings individually. Submit your R code on Canvas.
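The k-means loop in (b) can be illustrated on a single machine. The sketch below deliberately swaps in base R's `stats::kmeans` for the H2O cluster (the task itself requires the `h2o` package, e.g. its `h2o.kmeans` function) and uses synthetic data; the feature names `tempo`, `loudness` and `duration` are hypothetical stand-ins for numeric columns in the subset.

```r
set.seed(1)

# Synthetic stand-in for numeric song features (hypothetical columns);
# two well-separated groups of 50 songs each.
songs <- data.frame(
  tempo    = c(rnorm(50, 90, 5),   rnorm(50, 140, 5)),
  loudness = c(rnorm(50, -12, 1),  rnorm(50, -5, 1)),
  duration = c(rnorm(50, 200, 20), rnorm(50, 260, 20))
)

# k-means is distance-based, so standardise each feature first.
features <- scale(songs)

# Fit k-means for k = 1, ..., 5 and record the total within-cluster
# sum of squares for each k.
ks <- 1:5
tot_withinss <- sapply(ks, function(k) {
  kmeans(features, centers = k, nstart = 10)$tot.withinss
})

results <- data.frame(k = ks, tot_withinss = tot_withinss)
```

Plotting `tot_withinss` against `k` gives the usual elbow plot for comparing the five partitions.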