Description
Q1 [40 points] Collecting and visualizing The Movie DB (TMDb) data
Q1.1 [25 points] Collecting Movie Data
You will use “The Movie DB” API to: (1) download data about movies and (2) for each movie, download its 5
similar movies.
You will write some Python 3 code (not Python 2.x) in script.py in this question. You will need an API key to
use the TMDb data. Your API key will be an input to script.py so that we can run your code with our own API
key to check the results. Running the following command should generate the CSV files specified in part b and
part c:
python3 script.py
Please refer to this tutorial to learn how to parse command line arguments. Please DO NOT leave your API key
written in the code.
Note: The Python Standard Library and the requests library are allowed. Python wrappers (or modules) for the
TMDb API may NOT be used for this assignment. Pandas also may NOT be used we are aware that it is a
useful library to learn. However, to make grading more manageable and to enable our TAs to provide better,
more consistent support to our students, we have decided to restrict the libraries to the more “essential” ones
mentioned above.
a. How to use TheMovieDB API:
● Create a TMDb account and request for an API key https://www.themoviedb.org/account/signup.
Refer to this document for detailed instructions.
● Refer to the API documentation https://developers.themoviedb.org/3/gettingstarted/introduction , as
you work on this question.
Note:
The API allows you to make 40 requests every 10 seconds. Set appropriate timeout intervals in
your code while making requests. We recommend you think about how much time your script will run
for when solving this question, so you will complete it on time.
The API endpoint may return different results for the same request.
b. [10 points] Search for movies in the “Comedy” genre released in the year 2000 or later. Retrieve the 300 most
popular movies in this genre. The movies should be sorted from most popular to least popular. Hint: Sorting
based on popularity can be done in the API call.
● Documentation for retrieving movies: https://developers.themoviedb.org/3/discover/moviediscover https://developers.themoviedb.org/3/genres/getmovielist
● Save the results in movie_ID_name.csv.
Each line in the file should describe one movie, in the following format NO space after comma, and do
not include any column headers:
movieID,moviename
For example, a line in the file could look like:
353486,Jumanji: Welcome to the Jungle
Note:
You may need to make multiple API calls to retrieve all 300 movies. For example, the results may
be returned in “pages,” so you may need to retrieve them page by page.
Please use the “primary_release_date” parameter instead of the “release_date” parameter in the
API when retrieving movies released in the year 2000 or later. The “release_date” parameter will
incorrectly return a movie if any of its release dates fall within the years listed.
c. [15 points] For each of the 300 movies, use the API to find its 5 similar movies. If a movie has fewer than 5
similar movies, the API will return as many as it can find. Your code should be flexible to work with however
many movies the API returns.
● Documentation for obtaining similar movies: https://developers.themoviedb.org/3/movies/getsimilarmovies
● Save the results in movie_ID_sim_movie_ID.csv.
Each line in the file should describe one pair of similar movies NO space after comma, and do not
include any column headers:
movieID,similarmovieID
Note: You should remove all duplicate pairs after the similar movies have been found. That is, if both the
pairs A,B and B,A are present, only keep A,B where A < B. For example, if movie A has three similar
movies X, Y and Z; and movie X has two similar movies A and B, then there should only be four lines in
the file.
A,X
A,Y
A,Z
X,B
You do not need to fetch additional similar movies for a given movie, if one or more of its pairs were
removed due to duplication.
Deliverables: Place all the files listed below in the Q1 folder.
● movie_ID_name.csv: The text file that contains the output to part b.
● movie_ID_sim_movie_ID.csv: The text file that contains the output to part c.
● script.py: The Python 3 (not Python 2.x) script you write that generates both
movie_ID_name.csv and movie_ID_sim_movie_ID.csv.
Note : Q1.2 builds on the results of Q1.1. Specifically, Q1.2 asks that the “Source,Target” be added to the
resulting file from Q1.1. If you have completed both Q1.1 and Q1.2, your csv would have the header row —
please submit this file. If you have completed only Q1.1, but not Q1.2 (for any reasons), then please submit the
csv file without the header row.
Q1.2 [15 points] Visualizing Movie Similarity Graph
Using Gephi, visualize the network of similar movies obtained. You can download Gephi here. Ensure your
system fulfills all requirements for running Gephi.
a. Go through the Gephi quickstart guide.
b. [2 points] Insert Source,Target as the first line in movie_ID_sim_movie_ID.csv. Each line now
represents a directed edge with the format Source,Target. Import all the edges contained in the file
using Data Laboratory in Gephi.
Note: Remember to check the “create missing nodes” option while importing since we do not have an
explicit nodes file.
c. [8 points] Using the following guidelines, create a visually meaningful graph:
● Keep edge crossing to a minimum, and avoid as much node overlap as possible.
● Keep the graph compact and symmetric if possible.
● Whenever possible, show node labels. If showing all node labels create too much visual
complexity, try showing those for the “important” nodes.
● Using nodes’ spatial positions to convey information (e.g., “clusters” or groups).
Experiment with Gephi’s features, such as graph layouts, changing node size and color, edge thickness,
etc. The objective of this task is to familiarize yourself with Gephi and hence is a fairly open ended task.
d. [5 points] Using Gephi’s builtin functions, compute the following metrics for your graph:
● Average node degree (run the function called “Average Degree”)
● Diameter of the graph (run the function called “Network Diameter”)
● Average path length (run the function called “Avg. Path Length”)
Briefly explain the intuitive meaning of each metric in your own words.
You will learn about these metrics in the “graphs” lectures.
Deliverables: Place all the files listed below in the Q1 folder.
● For part b: movie_ID_sim_movie_ID.csv (with Source,Target as its first line).
● For part c: an image file named “graph.png” (or “graph.svg”) containing your visualization and a text
file named “graph_explanation.txt” describing your design choices, using no more than 50 words.
● For part d: a text file named “metrics.txt” containing the three metrics and your intuitive explanation
for each of them, using no more than 100 words.
Q2 [35 pt] SQLite
The following questions will help refresh your memory about SQL and get you started with SQLite a
lightweight, serverless embedded database that can easily handle up to multiple GBs of data. As mentioned in
class, SQLite is the world’s most popular embedded database. It is convenient to share data stored in an SQLite
database just one crossplatform file, and no need to parse (unlike CSV files).
You will modify the given Q2.SQL.txt file to add SQL statements and SQLite commands to it.
We will autograde your solution by running the following command that generates Q2.db and
Q2.OUT.txt (assuming the current directory contains the data files).
$ sqlite3 Q2.db < Q2.SQL.txt > Q2.OUT.txt
We will generate the Q2.OUT.txt using the above command.
You may not receive any points if we are unable to generate the 2 output files.
You may also lose points if you do not strictly follow the output format specified in each
question below. The output format corresponds to the headers/column names for your SQL
command output.
We have added some lines of code in the Q2.SQL.txt file which are for the purpose of autograding. .
DO NOT REMOVE/MODIFY THESE LINES. You may not receive any points if these statements are
modified in any way (our autograder will check for changes). There are clearly marked regions in the
Q2.SQL.txt file where you should add your code. We have also provided a
Q2.OUT.SAMPLE.txt which gives an example of how your final Q2.OUT.txt should look like after
running the above command. Please avoid printing unnecessary items in your final submission as it
may affect autograding and you may lose points. Purpose of some lines of code in file Q2.SQL.txt are
as follows:
● .headers off. : After each question, an output format has been given with a list of
column names/headers.This command ensures that such headers are not displayed in the
output.
● .separator ‘,’ : To specify that the input file and the output are commaseparated.
● select ‘’: This command prints a blank line. After each question’s query, this command
ensures that there is a new line between each result in the output file.
WARNING: Do not copy and paste any code/command from this PDF for use in the sqlite command
prompt, because PDFs sometimes introduce hidden/special characters, causing SQL error. Manually
type out the commands instead.
Note: For the questions in this section, you must use only INNER JOIN when you perform a join
between two tables. Other types of join may result in incorrect outputs.
Note: Do not use .mode csv in your Q2.SQL.txt file. This will cause quotes to be printed in the output
of each select ‘’; statement.
a. Create tables and import data.
i. [2 points] Create the following two tables (“movies” and “cast”) with columns having the indicated
data types:
● movies
○ id (integer)
○ name (text)
○ score (integer)
● cast
○ movie_id (integer)
○ cast_id (integer)
○ cast_name (text)
ii. [1 point] Import the provided movienamescore.txt file into the movies table, and moviecast.txt into the cast table. You can use SQLite’s .import command for this. Please use relative
paths while importing files since absolute/local paths are specific locations that exist only on your
computer and will cause the autograder to fail right in the beginning.
b. [2 points] Create indexes. Create the following indexes which would speed up subsequent operations
(improvement in speed may be negligible for this small database, but significant for larger databases):
i. scores_index for the score column in movies table
ii. cast_index for the cast_id column in cast table
iii. movie_index for the id column in movies table
c. [2 points] Calculate average score. Find the average score of all movies having a score >= 5
Output format:
average_score
d. [3 points] Find poor movies. List the five worst movies (lowest scores). Sort your output by score from
lowest to highest, then by name in alphabetical order.
Output format:
id,name,score
e. [4 points] Find laid back actors. List ten cast members (alphabetically by cast_name) with exactly two
movie appearances.
Output format:
cast_id,cast_name,movie_count
f. [6 points] Get high scoring actors. Find the top ten cast members who have the highest average movie
scores. Sort your output by score (from high to low). In case of a tie in the score, sort the results based
on the name of the cast member in alphabetical order. Skip movies with score <40 in the average score calculation. Exclude cast members who have appeared in two or fewer movies. Output format: cast_id,cast_name,average_score g. [8 points] Creating views. Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who have appeared in at least three movies together AND the average score of these movies is >= 50.
The view should have the format:
good_collaboration(
cast_member_id1,
cast_member_id2,
movie_count,
average_movie_score)
For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric
value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any
self pairs (cast_member_id1 == cast_member_id2).
Full points will only be awarded for queries that use joins.
Remember that creating a view will not produce any output, so you should test your view with a few
simple select statements during development. One such test has already been added to the code as part
of the autograding.
Optional Reading: Why create views?
h. [4 points] Find the best collaborators. Get the five cast members with the highest average scores from
the good_collaboration view made in the last part, and call this score the collaboration_score.
This score is the average of the average_movie_score corresponding to each cast member,
including actors in cast_member_id1 as well as cast_member_id2. Sort your output in a
descending order of this score (and alphabetically in case of a tie).
Output format:
cast_id,cast_name,collaboration_score
i. SQLite supports simple but powerful Full Text Search (FTS) for fast textbased querying (FTS
documentation). Import movie overview data from the movieoverview.txt into a new FTS table called
movie_overview with the schema:
movie_overview (
id integer,
name text,
year integer,
overview text,
popularity decimal)
NOTE: Create the table using fts3 or fts4 only. Also note that keywords like NEAR, AND, OR and NOT
are case sensitive in FTS queries.
1. [1 point] Count the number of movies whose overview field contains the word “fight”.
Output format:
count_overview
2. [2 points] List the id’s of the movies that contain the terms “love” and “story” in the
overview field with no more than 5 intervening terms in between.
Output format:
id
Deliverables: Place all the files listed below in the Q2 folder
● Q2.SQL.txt: Modified file additionally containing all the SQL statements and SQLite commands you
have used to answer questions a i in the appropriate sequence.
● Q2.OUT.txt: Output of the queries in Q2.SQL.txt. Check above for how to generate this file.
Q3 [15 pt] D3 Warmup and Tutorial
● Go through the D3 tutorial here before attempting this question.
● Complete steps 0116 (Complete through “16. Axes”).
● This is a simple and important tutorial which lays the groundwork for Homework 2.
Note: We recommend using Mozilla Firefox or Google Chrome, since they have relatively robust builtin
developer tools.
Deliverables: Place all the files/folders listed below in the Q3 folder
● A folder named d3 containing file d3.v3.min.js (download)
● index.html : When run in a browser, it should display a scatterplot with the following specifications:
a. [12 pt] Generate and plot 60 objects: 30 upwardpointing equilateral triangles and 30 crosses.
Each object’s X and Y coordinates should be a random integer between 0 and 100 inclusively
(i.e., [0, 100]). An object’s X and Y coordinates should be independently computed.
Each object’s size will be a value between 5 and 50 inclusively (i.e., [5, 50]). You should use the
“symbol.size()” function of d3 to adjust the size of the object. Use the object’s X coordinate to
determine the size of the object. You should use a linear scale for the size, to map the domain of
X values to the range of [5,50]. Objects with larger x coordinate values should have larger sizes.
This link explains how size is interpreted by symbol.size(). You may want to look at this
example for the usage of “symbol.size()” function.
All objects with size greater than the average size of all scatterplot objects should be colored blue
and all other objects should be colored green.
All these objects should be filled (Please see the figure below) and the entire graph should fit in
the browser window (no scrolling).
b. [2 pt] The plot must have visible X and Y axes that scale according to the generated objects.
The ticks on these axes should adjust automatically based on the randomly generated scatterplot
objects.
c. [1 pt] Your full name (in upper case) should appear above the scatterplot. Set the HTML title
tag (