Description
In this assignment, you will practice sentiment analysis with textual data.
You are provided with a dataset, “MovieReview-Sample.csv”, which contains 2,000 movie reviews, each labeled with a sentiment. Label “0” is Negative and label “1” is Positive.
Question 1: Performance Comparisons
You are asked to use four approaches taught in Lab 2 to perform sentiment analysis on the dataset: 1) Using Bing Liu’s Lexicon; 2) Using LM dictionary; 3) Using TextBlob; and 4) Using Vader (either from NLTK or from Vader directly).
Please report the following:
- Report the Precision, Recall, and F-measure achieved by each tool. Calculate them by comparing your predictions against the gold standard (labels 0 and 1). Please present the results in a comparison table and highlight the highest performance.
(Hint: you should report precision, not accuracy. This means you need to calculate the positive precision and the negative precision, and then average the two.)
- Provide your analysis of the performances. If you are in charge of identifying the appropriate software to perform sentiment analysis for movie reviews, which one will you choose? Give 1-2 reasons.
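The averaging described in the hint can be sketched as follows. This is a minimal illustration rather than a required implementation: the function names `per_class_metrics` and `macro_average` are hypothetical, the toy labels are made up for demonstration, and averaging the two per-class F-measures is one common convention that the assignment does not prescribe.

```python
def per_class_metrics(gold, pred, label):
    """Precision, recall and F-measure for one class (label 0 or 1)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


def macro_average(gold, pred, labels=(0, 1)):
    """Unweighted mean of the per-class precision, recall and F-measure."""
    rows = [per_class_metrics(gold, pred, label) for label in labels]
    return tuple(sum(column) / len(rows) for column in zip(*rows))


# Toy data for illustration only (not taken from the dataset).
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
avg_precision, avg_recall, avg_f = macro_average(gold, pred)
```

The same function can be called once per tool, with one row of the comparison table filled from each result.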
Question 2: Ensemble
You are going to use an ensemble method to improve on the performance of the individual tools. Can you think of a way to ensemble three of the methods/tools to improve the performance?
(Hint 1: you may choose the 3 best-performing algorithms to ensemble. There is no need to include inferior algorithms from the previous step. Hint 2: the simplest form of ensemble is a majority vote, or a weighted majority vote based on the algorithm performances.) Report your performance improvement (in percentage) over any single model.
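The two voting schemes mentioned in Hint 2 can be sketched as below. The function names, the toy predictions, and the weights are all hypothetical. Reading "improvement in percentage" as a relative gain over a single model's score is an assumption; reporting the difference in percentage points would be another reasonable reading.

```python
def majority_vote(predictions):
    """predictions: one list of 0/1 labels per tool, all the same length.
    A review is labeled 1 when more than half of the tools vote 1."""
    n_tools = len(predictions)
    return [1 if 2 * sum(votes) > n_tools else 0
            for votes in zip(*predictions)]


def weighted_vote(predictions, weights):
    """Like majority_vote, but each tool's vote counts with its weight
    (for example, its F-measure from Question 1). Ties go to label 0."""
    total = sum(weights)
    result = []
    for votes in zip(*predictions):
        positive = sum(w for v, w in zip(votes, weights) if v == 1)
        result.append(1 if 2 * positive > total else 0)
    return result


def relative_improvement(ensemble_score, single_score):
    """Relative gain of the ensemble over one model, in percent (assumed reading)."""
    return 100.0 * (ensemble_score - single_score) / single_score


# Toy predictions from three hypothetical tools on four reviews.
tool_a = [1, 0, 1, 0]
tool_b = [1, 1, 0, 0]
tool_c = [0, 1, 1, 0]
voted = majority_vote([tool_a, tool_b, tool_c])
weighted = weighted_vote([tool_a, tool_b, tool_c], [0.9, 0.2, 0.2])
```

With three tools a plain majority vote can never tie. With weights, a heavily weighted tool can override the other two, as `tool_a` does on the second review in this toy example.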
Bonus: I also provide the original full dataset “Movie_review_Polarity_CSV.zip”. Please note that this file contains pos.csv and neg.csv. You may run your algorithm on the full dataset and see if the performance from the sample dataset holds.
Submission:
- Word Report
- Python program. Please make sure your Python program can run successfully.
Other instructions:
- DO NOT submit your dataset. Submit only the Word report and the Python program.
- Do not use an absolute path to read your input data (it won’t run on your TA’s computer).
- Name all your files FirstName_LastName.xxx. This will make our grading easier.
- Do not zip your files. Submit the two files directly.
Thank you!