Description
Problem #1: This problem contains two tasks.
Task 1: Cluster Setup – Apache Spark Framework on GCP
Using your GCP cloud account, configure and initialize an Apache Spark cluster
(follow the tutorials provided in the lab session; a scripted alternative is sketched after this task).
Create a flowchart or write a half-page explanation of how you completed the task, and include this
part in your PDF file.
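For reference, cluster creation can also be scripted. The sketch below uses the google-cloud-dataproc
Python client; the project ID, region, cluster name, and machine types are placeholder assumptions,
and the console/gcloud route shown in the lab tutorial is equally valid.

    # Minimal sketch: create a small Dataproc (managed Spark) cluster on GCP.
    # Assumes `pip install google-cloud-dataproc` and authenticated credentials.
    from google.cloud import dataproc_v1

    PROJECT_ID = "my-gcp-project"  # assumption: replace with your project ID
    REGION = "us-central1"         # assumption: replace with your region

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    cluster = {
        "project_id": PROJECT_ID,
        "cluster_name": "csci5408-spark",  # hypothetical cluster name
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }
    operation = client.create_cluster(
        request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
    )
    print(operation.result().cluster_name, "is ready")  # blocks until provisioned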
Task 2: Data Extraction and Preprocessing Engine: Sources – Twitter and Reuters articles
Steps for Twitter Operation
Step 1: Create a Twitter developer account
Step 2: Explore the Twitter search and streaming APIs and data format
Step 3: Write a well-formed script/program in Java or Python to extract data from Twitter (the
Extraction Engine), and execute/run the program on GCP (a minimal sketch follows the notes below).
(Do not use any program code or scripts found online. You may only use the API specification code
provided by Twitter.)
o The search keywords are “covid”, “emergency”, “immune”, “vaccine”, “flu”, “snow”.
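For reference, a minimal sketch of the extraction call is shown below. It assumes Twitter API v2
recent search access and a bearer token exported as TWITTER_BEARER_TOKEN; adapt the endpoint and
fields to whatever access level your developer account is granted.

    # Sketch of the Extraction Engine core (assumption: Twitter API v2 recent search).
    import os
    import requests

    KEYWORDS = ["covid", "emergency", "immune", "vaccine", "flu", "snow"]
    URL = "https://api.twitter.com/2/tweets/search/recent"
    HEADERS = {"Authorization": f"Bearer {os.environ['TWITTER_BEARER_TOKEN']}"}

    def search(keyword, max_results=100):
        """Return one page of tweets (with basic metadata) for one keyword."""
        params = {
            "query": keyword,
            "max_results": max_results,             # v2 allows 10-100 per page
            "tweet.fields": "created_at,geo,lang",  # metadata: time, location, language
        }
        resp = requests.get(URL, headers=HEADERS, params=params)
        resp.raise_for_status()
        return resp.json().get("data", [])

    all_tweets = {kw: search(kw) for kw in KEYWORDS}
    print({kw: len(ts) for kw, ts in all_tweets.items()})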
Step 4: You need to include a flowchart/algorithm of your tweet extraction program in the PDF file.
Step 5: You need to extract the tweets and metadata related to the given keywords.
o For some keywords, you may get fewer tweets, which is not a problem.
Collectively, you should get approximately 3000 to 5000 tweets.
Step 6: If you get less data, run your method/program with a scheduler module to extract more
data points from Twitter at different time intervals. Note: working on small datasets will not consume
large amounts of cloud resources or local cluster memory.
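A simple standard-library loop is enough for the scheduling. The sketch below reuses KEYWORDS and
search() from the extraction sketch above; the one-hour interval is an assumption.

    # Scheduler sketch: repeat extraction rounds until the corpus target is met.
    import time

    TARGET = 3000            # lower bound from Step 5
    INTERVAL_SECONDS = 3600  # assumption: one extraction round per hour

    collected = []
    while len(collected) < TARGET:
        for kw in KEYWORDS:               # defined in the extraction sketch
            collected.extend(search(kw))
        print(f"{len(collected)} tweets collected so far")
        if len(collected) < TARGET:
            time.sleep(INTERVAL_SECONDS)  # wait before the next round

In practice you would also deduplicate by tweet id between rounds, since recent search can return
the same tweets more than once.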
Step 7: You should extract tweets and retweets along with the provided metadata, such as location,
time, etc.
Step 8: The captured raw data should be kept (programmatically) in files. Each file should not
contain more than 100 tweets. These files will be needed for Problem #2.
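A possible chunking scheme (the file prefix and naming are assumptions):

    # Sketch: persist raw tweets as JSON files of at most 100 tweets each.
    import json

    CHUNK_SIZE = 100  # hard limit from Step 8

    def save_in_chunks(tweets, prefix="raw_tweets"):
        """Write raw_tweets_000.json, raw_tweets_001.json, ... (100 tweets per file)."""
        for i in range(0, len(tweets), CHUNK_SIZE):
            with open(f"{prefix}_{i // CHUNK_SIZE:03d}.json", "w") as f:
                json.dump(tweets[i:i + CHUNK_SIZE], f)

    save_in_chunks(collected)  # `collected` comes from the scheduler sketch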
Step 9: Your program (Filtration Engine) should automatically clean and transform the data stored
in the files, and then upload each record to a new MongoDB database, myMongoTweet (a sketch
follows the bullets below).
o For cleaning and transformation, remove special characters, URLs, emoticons, etc.
o Write your own regular expression logic. You cannot use libraries such as
BeautifulSoup, jsoup, or JTidy.
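A minimal Filtration Engine sketch, assuming a local MongoDB instance and a collection named
tweets (both assumptions; only the database name myMongoTweet is fixed by the assignment):

    # Sketch: hand-written regex cleaning (no HTML libraries), then load into MongoDB.
    import glob
    import json
    import re
    from pymongo import MongoClient

    URL_RE = re.compile(r"https?://\S+")          # strip URLs
    NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")  # strip special chars and emoticons
    SPACES_RE = re.compile(r"\s+")                # collapse leftover whitespace

    def clean(text):
        text = URL_RE.sub(" ", text)
        text = NON_ALNUM_RE.sub(" ", text)
        return SPACES_RE.sub(" ", text).strip().lower()

    client = MongoClient("mongodb://localhost:27017")  # assumption: local MongoDB
    collection = client["myMongoTweet"]["tweets"]      # collection name is an assumption

    for path in glob.glob("raw_tweets_*.json"):
        with open(path) as f:
            for tweet in json.load(f):
                tweet["clean_text"] = clean(tweet.get("text", ""))
                collection.insert_one(tweet)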
Step 10: You need to include a flowchart/algorithm of your tweet cleaning/transformation program
in the PDF file.
Problem #2: This problem contains two tasks.
Task 1: Data Processing using Spark – MapReduce to perform frequency counts
Step 1: Write a MapReduce program (WordCounter Engine) to count the frequency of the
following substrings or words. Your MapReduce job should perform the frequency count on the
stored raw tweet files (a sketch follows the bullets below):
o “flu”, “snow”, “emergency”
o You need to include a flowchart/algorithm of your MapReduce program in the PDF file.
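A minimal PySpark sketch of the WordCounter Engine is shown below. It counts whole-word
occurrences; if you interpret the requirement as substring matching, swap the tokenizer/filter
accordingly. The file glob matches the naming used in the Step 8 sketch.

    # WordCounter Engine sketch: MapReduce-style frequency count in PySpark.
    import re
    from pyspark import SparkContext

    TARGETS = {"flu", "snow", "emergency"}

    sc = SparkContext(appName="WordCounter")
    counts = (
        sc.textFile("raw_tweets_*.json")                              # raw files from Step 8
          .flatMap(lambda line: re.findall(r"[a-z]+", line.lower()))  # map: tokenize
          .filter(lambda w: w in TARGETS)                             # keep target words
          .map(lambda w: (w, 1))                                      # emit (word, 1)
          .reduceByKey(lambda a, b: a + b)                            # reduce: sum counts
          .collect()
    )
    print(sorted(counts, key=lambda kv: kv[1], reverse=True))         # highest to lowest
    sc.stop()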
Step 2: In your PDF file, report the words that have the highest and lowest frequencies.
Task 2: Data Visualization using Graph Database – Neo4j for graph generation
Step 3: Explore the Neo4j graph database, understand the concept, and learn the Cypher query language.
Step 4: Using Cypher, create graph nodes named “flu”, “snow”, and “emergency” (a starter sketch
follows the bullets below).
You should add properties to the nodes. To choose the properties, check the
relevant tweet collections.
o Check if there are any relationships between the nodes.
o If there are relationships between nodes, then determine their direction.
o Include your Cypher queries and the generated graph in the PDF file.
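A starter Cypher sketch; the Keyword label, the tweetCount property, and the CO_OCCURS_WITH
relationship type are illustrative assumptions to be replaced by whatever your tweet collections
actually support:

    // Create the three nodes with an example property.
    CREATE (f:Keyword {name: "flu", tweetCount: 0}),
           (s:Keyword {name: "snow", tweetCount: 0}),
           (e:Keyword {name: "emergency", tweetCount: 0});

    // If the collections show, e.g., tweets mentioning both words, record a
    // directed relationship between the corresponding nodes:
    MATCH (f:Keyword {name: "flu"}), (e:Keyword {name: "emergency"})
    CREATE (f)-[:CO_OCCURS_WITH]->(e);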
Assignment 3 Submission Format:
1) Compress all your reports/files into a single .zip file and give it a meaningful name.
You are free to choose any meaningful file name, preferably BannerId_Lastname_firstname_5408_A3,
but avoid generic names like assignment-3.
2) Submit your reports only in PDF format.
Please avoid submitting .doc/.docx and submit only the PDF version. You can merge all the reports into
a single PDF or keep them separate. You should also include output (if any) and test cases (if any) in the
PDF file.
3) Your executable code/script needs to be submitted on https://git.cs.dal.ca/