COMP 370 Homework 1 – Mini Data Science Project

1. Data Collection

a. Download the raw tweet data. You will ONLY be using the data from the first file
(IRAhandle_tweets_1.csv).
b. Looking at only the first 10,000 tweets in the file, keep those that (1) are in English and (2) don’t
contain a question. This will be our dataset. To filter the right tweets out, take a look at the
columns.

i. There are specific columns that call out the language. You can trust these.
ii. Assume that a tweet which contains a question contains a “?” character.
c. Create a new file (I would suggest in TSV – tab-separated-value – format) containing these
tweets.
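
The collection steps above can be sketched as follows. This is a minimal sketch, not a reference solution: it assumes the CSV has a column named `language` whose value for English tweets is the string `English`, and a column named `content` holding the tweet text. The function names `filter_tweets` and `collect` and the output path are hypothetical.

```python
import csv
from itertools import islice


def filter_tweets(rows, limit=10_000):
    """Look at the first `limit` rows; keep English tweets with no '?'.

    Assumes each row is a dict with 'language' and 'content' keys
    (column names and the 'English' value are assumptions).
    """
    kept = []
    for row in islice(rows, limit):
        if row["language"] == "English" and "?" not in row["content"]:
            kept.append(row)
    return kept


def collect(src_path, dst_path, limit=10_000):
    """Read the raw CSV, filter it, and write the kept rows as TSV."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        kept = filter_tweets(reader, limit)
        fieldnames = reader.fieldnames
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)
```

A call such as `collect("IRAhandle_tweets_1.csv", "filtered.tsv")` would apply the filter to the first 10,000 data rows, not to the first 10,000 rows that pass it.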

2. Data Annotation

a. To do our analysis, we need to add one new feature: whether or not the tweet mentioned
Trump. This feature “trump_mention” is Boolean (“T”/“F”). A tweet mentions Trump if and only
if it contains “Trump” (case-sensitive) as a whole word. This means that it is separated from
other alphanumeric characters by either whitespace OR non-alphanumeric characters (e.g., “anti-Trump protesters” contains “Trump”, but “I got trumped” does not).
b. Create a new version of your dataset that contains this additional feature.
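
One way to express the word rule above is a regular expression with explicit lookarounds. The sketch below assumes "alphanumeric" means ASCII letters and digits; it deliberately avoids `\b`, because `\b` also treats the underscore as a word character. The names `TRUMP_RE`, `trump_mention`, and `annotate` are hypothetical.

```python
import re

# "Trump" not immediately preceded or followed by an ASCII letter or digit.
# Case-sensitive by default, so "trump" does not match.
TRUMP_RE = re.compile(r"(?<![A-Za-z0-9])Trump(?![A-Za-z0-9])")


def trump_mention(text):
    """Return "T" if the text contains "Trump" as a standalone word, else "F"."""
    return "T" if TRUMP_RE.search(text) else "F"


def annotate(rows):
    """Build annotated rows with the four columns the assignment asks for.

    Assumes each input row is a dict with 'tweet_id', 'publish_date'
    and 'content' keys; input order is preserved.
    """
    return [
        {
            "tweet_id": row["tweet_id"],
            "publish_date": row["publish_date"],
            "content": row["content"],
            "trump_mention": trump_mention(row["content"]),
        }
        for row in rows
    ]
```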

3. Analysis

a. Using your newly annotated dataset, compute the statistic: % of tweets that mention Trump.
b. It turns out that our approach isn’t counting tweets properly … meaning that some tweets are
getting counted more than once. Go through and look at your annotated data. Identify where
the counting problem is coming from.
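
The statistic in (a) is a simple ratio, sketched below along with a small helper for part (b). The helper only lists values that occur more than once in a chosen column; it is offered as one possible starting point for the inspection and makes no claim about where the counting problem actually comes from. Both function names are hypothetical, and rows are assumed to be dicts as produced by the annotation step.

```python
from collections import Counter


def frac_trump_mentions(rows):
    """Fraction of rows whose 'trump_mention' value is "T" (0.0 if no rows)."""
    total = len(rows)
    if total == 0:
        return 0.0
    mentions = sum(1 for row in rows if row["trump_mention"] == "T")
    return mentions / total


def repeated_values(rows, column):
    """Map each value that appears more than once in `column` to its count."""
    counts = Counter(row[column] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}
```

Multiplying the returned fraction by 100 gives the percentage; calling `repeated_values(rows, "content")` or `repeated_values(rows, "tweet_id")` shows which values repeat in that column.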

Submission Instructions
To be considered complete, your submission must contain the following files, and each must reflect a
non-trivial attempt to provide a solution.
– README.md
o In 3 sentences or fewer, explain where the counting problem is coming from.
– dataset.tsv
o This should be the output of your Data Annotation phase.
o Format is tab-separated value, UTF-8 (as long as you don’t do anything fancy, it will be in UTF-8)
o The first line should be a header line
o The file should contain the following columns, in this order: tweet_id, publish_date, content, and
trump_mention. Tweets should appear in the same order they appeared in the original file from
538.
– results.tsv
o Format is tab-separated value
o The first line should be a header line, with headers “result” and “value”.
o The second line should contain the result for “frac-trump-mentions”. If necessary, truncate your
answer to three decimal places.
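
The results.tsv layout described above can be sketched as follows. Truncation (as opposed to rounding) is done with `Decimal` and `ROUND_DOWN` to avoid floating-point surprises. The names `truncate3` and `write_results` are hypothetical, and always emitting exactly three decimals (for example `0.500`) is an assumption about what the grader accepts.

```python
import csv
from decimal import Decimal, ROUND_DOWN


def truncate3(value):
    """Truncate (not round) a number to three decimal places, as a string."""
    return str(Decimal(str(value)).quantize(Decimal("0.001"), rounding=ROUND_DOWN))


def write_results(path, frac):
    """Write a two-line TSV: a header row, then the frac-trump-mentions row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["result", "value"])
        writer.writerow(["frac-trump-mentions", truncate3(frac)])
```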