Description
Movie reviews are commonly used by consumers to decide whether a movie is worth the price and
time. There are different ways to create reviews about movies. One of them is having different users
rate the movies. GroupLens Research has collected and made available rating datasets from the
MovieLens web site (http://movielens.org). We also use additional information about movies:
Dennis Schwartz’s reviews.
In this assignment, you will implement a Python program that analyzes the GroupLens data and
compares it with Dennis Schwartz’s reviews. This program will create HTML files for the movies
that appear both in Dennis Schwartz’s reviews and in the GroupLens data, and it will try to guess
the genres of movies based on the data obtained from those movies.
Fall 2016
BBM103: Introduction to Programming Laboratory 1
T.A. : Res. Assist. (Necva BOLUCU, Selma DILEK, Burcu YALCINER, Selim YILMAZ)
Stage 1: Create HTML Files for Movies
Step 1: Understand the GroupLens data
In this assignment, we will give you different files to analyze. The most important stage is
understanding the data.
u.item
Information about the items (movies).
The last 19 fields are the genres: a 1 indicates the movie is of that genre, a 0 indicates it is not.
Movies can be in several genres at once.
The movie ids are the ones used in u.data.
movie id | movie title | video release date | IMDb URL | unknown |
Action | Adventure | Animation | Children’s | Comedy | Crime |
Documentary | Drama | Fantasy | Movie-Noir | Horror | Musical |
Mystery | Romance | Sci-Fi | Thriller | War | Western |
Example: The content of the data
1|Toy Story (1995)|01-Jan-1995|http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995|http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995|http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995|http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
1176|Welcome To Sarajevo (1997)|01-Jan-1997|http://us.imdb.com/M/title-exact?Welcome+To+Sarajevo+(1997)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|1|0
Analyzing a line:
Movie id : 1176
Movie title : Welcome To Sarajevo (1997)
Release date : 01-Jan-1997
IMDB Link : http://us.imdb.com/M/title-exact?Welcome+To+Sarajevo+(1997)
Genre : 0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|1|0
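The line analysis above can be sketched in Python as follows. This is only a minimal illustration, not a required solution: the helper name parse_item_line and the dictionary keys are hypothetical, and the sketch assumes the pipe-separated layout shown in the example lines.

```python
GENRE_COUNT = 19  # the last 19 fields of each u.item line are genre flags


def parse_item_line(line):
    """Split one u.item line into named parts (hypothetical helper)."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    return {
        "movie_id": int(fields[0]),
        "title": fields[1],
        "release_date": fields[2],
        # The URL is the field right before the 19 genre flags.
        "imdb_url": fields[-GENRE_COUNT - 1],
        "genre_flags": [int(flag) for flag in fields[-GENRE_COUNT:]],
    }


sample = ("1176|Welcome To Sarajevo (1997)|01-Jan-1997|"
          "http://us.imdb.com/M/title-exact?Welcome+To+Sarajevo+(1997)|"
          "0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|1|0")
movie = parse_item_line(sample)
print(movie["movie_id"], movie["title"], movie["genre_flags"])
```

Taking the genre flags from the end of the line keeps the sketch working even if a title or date field contains unexpected spacing.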
u.genre
This file contains a list of the genres.
genre | genre id
You will use this file to format the genre field taken from u.item.
Example: Convert genre by taking genre names from the u.genre file
Movie id : 1176
Movie title : Welcome To Sarajevo (1997)
Genre : Drama War
u.user
This file contains demographic information about the users. (The user ids are the ones used in the
u.data data set.)
user id | age | gender | occupation id | zip code
u.occupation
This file consists of a list of the occupations. (The occupation ids are the ones used in the u.user
data set.)
occupation id | occupation
Analyzing a line of the u.user file by using the u.occupation file:
User id : 1
User Age : 24
Gender : M
Occupation : technician
Zip Code : 85711
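The two lookups described above (genre flags to genre names, and occupation id to occupation) could be sketched like this. The helper names are hypothetical, the genre lines are built from the field list given for u.item, and the occupation id 19 used for "technician" is a made-up value for illustration only.

```python
def load_genre_names(genre_lines):
    """Map genre id -> genre name from 'genre|genre id' lines."""
    names = {}
    for line in genre_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, genre_id = line.rsplit("|", 1)
        names[int(genre_id)] = name
    return names


def decode_genres(flags, genre_names):
    """Return the names of the genres whose flag is 1."""
    return [genre_names[i] for i, flag in enumerate(flags) if flag == 1]


def describe_user(user_line, occupations):
    """Parse 'user id|age|gender|occupation id|zip code' into a dict."""
    user_id, age, gender, occupation_id, zip_code = user_line.strip().split("|")
    return {
        "user_id": int(user_id),
        "age": int(age),
        "gender": gender,
        "occupation": occupations[occupation_id],
        "zip_code": zip_code,
    }


# Genre order taken from the u.item field list; ids are assumed to start at 0.
GENRES = ["unknown", "Action", "Adventure", "Animation", "Children's",
          "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Movie-Noir",
          "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller",
          "War", "Western"]
genre_names = load_genre_names(
    ["{}|{}".format(name, i) for i, name in enumerate(GENRES)])

flags = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(" ".join(decode_genres(flags, genre_names)))

# Made-up occupation id; the real ids come from u.occupation.
occupations = {"19": "technician"}
user = describe_user("1|24|M|19|85711", occupations)
print(user)
```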
u.data
This file contains the full data set: 100000 ratings by 943 users on 1682 items.
This is a tab-separated list of
user id   movie id   rating   timestamp.
Step 2: Understand Dennis Schwartz’s data
Dennis Schwartz’s review data is taken from https://www.cs.cornell.edu/people/pabo/moviereview-data/
(you can look there to get information about Dennis Schwartz). This data consists of different txt files.
Example: Content of a file in this folder (16748.txt)
Each file contains the review of a single movie. These files are supposed to be in a folder named
film. You are expected to read these files one by one from the folder. The number of files can
change, so you must read them in a loop.
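Reading the review files in a loop and splitting a u.data line could look roughly like the sketch below. The function names are hypothetical, and the sketch assumes every review is a .txt file directly inside the folder. The rating line in the usage comment is for illustration only.

```python
import os


def read_reviews(folder):
    """Return {file name: review text} for every .txt file in folder."""
    reviews = {}
    try:
        file_names = sorted(os.listdir(folder))
    except OSError as error:
        print("Could not list", folder, "-", error)
        return reviews
    for file_name in file_names:
        if not file_name.endswith(".txt"):
            continue  # ignore anything that is not a review file
        path = os.path.join(folder, file_name)
        try:
            with open(path, encoding="utf-8", errors="replace") as handle:
                reviews[file_name] = handle.read()
        except OSError as error:
            print("Could not read", path, "-", error)
    return reviews


def parse_rating_line(line):
    """Split one tab-separated u.data line into four integers."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("\t")
    return int(user_id), int(movie_id), int(rating), int(timestamp)


# Usage (the number of files in the folder does not matter):
#   reviews = read_reviews("film")
#   user_id, movie_id, rating, timestamp = parse_rating_line("196\t242\t3\t881250949\n")
```

Because the file names come from os.listdir, the loop handles any number of review files without changes to the code.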
Example: film folder
Step 3: Combine the GroupLens’ data and Dennis Schwartz’s data
In order to create HTML files for movies, you must combine the datasets. You are expected to create
HTML files for the movies which are in the film folder. In this step, you are expected to use list
comprehensions.
First, compare both datasets (the movies in the film folder and those in u.item) and select the
movies which are in both. You will create a review.txt file and write messages to it for movies
which are in u.item but not in the film folder, and for movies which are found in the folder. Use a
user-defined exception to produce the messages.
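One possible shape for the comparison described above is sketched here. The exception name, the message wording, and the case-insensitive title comparison are all assumptions made for illustration. The assignment does not fix how titles are matched, and the sample titles are made up.

```python
class ReviewNotFoundError(Exception):
    """Raised when a movie from u.item has no review in the film folder."""


def check_review(title, review_titles):
    """Return a 'found' message, or raise if the title has no review."""
    if title.lower() not in review_titles:
        raise ReviewNotFoundError(title + " is not found in folder.")
    return title + " is found in folder."


def match_movies(item_titles, review_titles):
    """Return (titles present in both datasets, messages for review.txt)."""
    wanted = {title.lower() for title in review_titles}
    messages = []
    for title in item_titles:
        try:
            messages.append(check_review(title, wanted))
        except ReviewNotFoundError as error:
            messages.append(str(error))
    # List comprehension selecting the movies that are in both datasets.
    matched = [title for title in item_titles if title.lower() in wanted]
    return matched, messages


# Made-up sample titles, already stripped of the release year.
matched, messages = match_movies(["Toy Story", "GoldenEye"], ["TOY STORY"])
print(matched)
print(messages)
```

The messages list can then be written to review.txt, one message per line, inside a try/except block.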
Example: review.txt
After selecting the movies, you will find the ids of the users who rated them in u.data and get
detailed information about these users from u.user.
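Collecting the raters of a selected movie can be expressed with list comprehensions, as in this sketch. The function names and the in-memory data shapes (a list of rating tuples and a dictionary of users) are hypothetical, and the sample values are made up.

```python
def ratings_for_movie(movie_id, ratings):
    """Return (user_id, rating) pairs for one movie.

    ratings is a list of (user_id, movie_id, rating) tuples read from u.data.
    """
    return [(user, rate) for user, movie, rate in ratings if movie == movie_id]


def raters_with_details(movie_id, ratings, users):
    """Attach the u.user details to every user who rated the movie."""
    return [(user, rate, users[user])
            for user, rate in ratings_for_movie(movie_id, ratings)
            if user in users]


# Made-up sample data for illustration.
ratings = [(1, 1176, 4), (2, 1176, 3), (1, 5, 2)]
users = {1: {"age": 24, "gender": "M", "occupation": "technician", "zip_code": "85711"}}

print(ratings_for_movie(1176, ratings))
print(raters_with_details(1176, ratings, users))
```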
Step 4: Write the review to an HTML file
Once you have extracted the information about the movies from the given data, you will use it to
create HTML files located in the filmList folder. Each HTML file must show the following fields:
! The file name must be the film id given in u.item.
NAME OF THE FILM (Times New Roman, size="6", color="red", bold)
Genre
IMDB Link
Review (Times New Roman, size="4", color="black", bold; taken from Dennis Schwartz’s data)
Total User/Total Rate
User who rate the film:
User id   User rate
User Detail: Age – Gender – Occupation – Zip Code
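The layout above might be produced with a sketch like the following. The function names and the exact markup are assumptions, as is the reading of "Total Rate" as the sum of all ratings for the movie. The review text and the rating in the sample call are placeholders.

```python
import html
import os


def build_movie_html(title, genres, imdb_url, review, raters):
    """Return the page for one movie as an HTML string.

    raters is a list of (user_id, rating, detail) tuples, where detail is a
    ready-made "age - gender - occupation - zip code" string.
    """
    total_users = len(raters)
    # Assumption: "Total Rate" is the sum of all ratings for the movie.
    total_rate = sum(rating for _, rating, _ in raters)
    rater_rows = "".join(
        "<p>User id: {} User rate: {}<br>User Detail: {}</p>\n".format(
            user_id, rating, html.escape(detail))
        for user_id, rating, detail in raters
    )
    return (
        "<html>\n<head><title>{title}</title></head>\n<body>\n"
        '<font face="Times New Roman" size="6" color="red"><b>{title}</b></font>\n'
        "<p>Genre: {genres}</p>\n"
        '<p>IMDB Link: <a href="{url}">{title}</a></p>\n'
        '<font face="Times New Roman" size="4" color="black"><b>Review</b></font>\n'
        "<p>{review}</p>\n"
        "<p>Total User: {users} / Total Rate: {rate}</p>\n"
        "<p>User who rate the film:</p>\n"
        "{rows}"
        "</body>\n</html>\n"
    ).format(
        title=html.escape(title),
        genres=html.escape(" ".join(genres)),
        url=html.escape(imdb_url),
        review=html.escape(review),
        users=total_users,
        rate=total_rate,
        rows=rater_rows,
    )


def write_movie_html(folder, movie_id, page):
    """Write the page to <folder>/<movie_id>.html; return True on success."""
    try:
        os.makedirs(folder, exist_ok=True)
        path = os.path.join(folder, "{}.html".format(movie_id))
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(page)
        return True
    except OSError as error:
        print("Could not write HTML file:", error)
        return False


page = build_movie_html(
    "Welcome To Sarajevo (1997)",
    ["Drama", "War"],
    "http://us.imdb.com/M/title-exact?Welcome+To+Sarajevo+(1997)",
    "Review text goes here.",           # placeholder review text
    [(1, 4, "24 - M - technician - 85711")],  # made-up rating
)
# write_movie_html("filmList", 1176, page) would create filmList/1176.html
```

Escaping the text with html.escape keeps characters such as & or < in a review from breaking the page.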
Example: One of the output files
Stage 2: Guess Genre of Movie Based on Film Given in Film Folder
In this stage, you are expected to guess a movie’s genre or genres based on the movies given in the
film folder.
Step 1: Getting Unique Words Without Stop Words
First, you should extract the unique words for each genre by taking the movies’ genres and their
review data, then eliminate stop words using the stop word list taken from stopwords.txt.
Step 2: Guess Genres of Movies
In this step, you will read the movie reviews from the filmGuess folder and apply Step 1 to get
their unique words. After getting the words, if the size of the intersection between the words of a
movie and the words of a genre is greater than or equal to 20, we assign that genre to the movie.
Step 3: Write to File
After getting the genre or genres for all the movies in the filmGuess folder, you will write the film
names and genres to filmGenre.txt.
Example: Movies in the filmGuess folder
18485.txt STAR WARS: EPISODE I–THE PHANTOM MENACE
21168.txt THE GAME
18687.txt A.I. ARTIFICIAL INTELLIGENCE
29852.txt SERENDIPITY
Output: filmGenre.txt for the given movies in the filmGuess folder
Notes specific to this assignment
• Feel free to employ any built-in function.
• Use list comprehensions and a user-defined exception to produce messages in your project.
• Be careful to open and create files in a try/except block to avoid an IOError.
• Be sure your submitted work exactly matches the hierarchy detailed below, as submissions with a
score of 0 will not be considered for evaluation.
• Should you have a question, do not hesitate to ask, but consider the office hours of the TA in
charge of BBM103 (Necva BÖLÜCÜ).
• You will use static file and folder names, but be careful to name files correctly.
• Due date for this assignment is 04.01.2017
Notes
• Do not miss the deadline.
• Compile your code on dev.cs.hacettepe.edu.tr before submitting your work to make sure it
compiles without any problems on our server.
• Save all your work until the assignment is graded.
• The assignment must be original, individual work. Duplicate or very similar assignments are both
going to be considered as cheating.
• You can ask your questions via Piazza (https://piazza.com/hacettepe.edu.tr/fall2016/bbm101) and
you are supposed to be aware of everything discussed in Piazza. You cannot share algorithms or
source code. All work must be individual! Assignments will be checked for similarity, and there
will be serious consequences if plagiarism is detected.
• The submissions whose upload score is 0 will not be considered for evaluation.
• Do not include any other text files in your submission. Input files will have the same names but
different content than those you worked on. Output files should be created when your program is
executed.
• You will submit your work from https://submit.cs.hacettepe.edu.tr/index.php with the file
hierarchy as below:
assignment5.py
This file hierarchy must be zipped before it is submitted (not .rar; only .zip files are supported by
the system).
Who is Dennis Schwartz?
Dennis Schwartz is the editor of the Vermont-based movie magazine “Ozus’s World Movie Reviews.”
He has been a prolific online movie reviewer since 1998, also contributing to various publications
all over the globe and maintaining an active website, where it is not uncommon for him to review
as many as 365 movies a year.
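Returning to Stage 2, the genre-guessing steps can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the way a review is split into words (lowercase letters and apostrophes only) is one possible choice that the assignment does not prescribe. Only the threshold of 20 shared words comes from the assignment text.

```python
import re


def unique_words(text, stop_words):
    """Return the set of distinct lowercase words in text, minus stop words."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words - stop_words


def build_genre_vocabulary(labelled_reviews, stop_words):
    """Collect the unique words seen in the reviews of each genre.

    labelled_reviews is a list of (genres, review text) pairs taken from the
    movies in the film folder.
    """
    vocabulary = {}
    for genres, review in labelled_reviews:
        words = unique_words(review, stop_words)
        for genre in genres:
            vocabulary.setdefault(genre, set()).update(words)
    return vocabulary


def guess_genres(review, vocabulary, stop_words, threshold=20):
    """Return every genre sharing at least `threshold` words with the review."""
    words = unique_words(review, stop_words)
    return [genre for genre, genre_words in vocabulary.items()
            if len(words & genre_words) >= threshold]


# Usage, with stop_words read from stopwords.txt into a set:
#   vocabulary = build_genre_vocabulary(labelled_reviews, stop_words)
#   genres = guess_genres(review_text, vocabulary, stop_words)
```

Using sets makes both the stop word elimination (set difference) and the comparison with each genre (set intersection) one-line operations.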