Description
Background
In the development and delivery of non-trivial software systems, working as part of a team is
the norm. This assignment is very much a group project. Students will be put into
software teams to work on the implementation of the system described below. These will be teams
of up to 5 students. In this assignment, students need to organize their team and their collective
involvement throughout. There is no team leader as such, but teams may decide to set up processes
for agreeing on the work and who does what. Understanding the dependencies between individual
efforts and their successful integration is key to the success of the work and for software
engineering projects more generally. If teams have issues, please let me know as soon as possible
and I will help resolve them.
Assignment Description
The software engineering activity builds on the lecture materials describing Cloud systems,
especially the UniMelb/NeCTAR Research Cloud and its use of OpenStack; on Instagram data
(provided); on the Twitter APIs; on CouchDB and the kinds of data analytics (e.g. MapReduce) that
CouchDB supports; and on data from the Australian Urban Research Infrastructure Network
(AURIN – https://portal.aurin.org.au).
The focus of this assignment is to explore the Seven Deadly Sins
(https://en.wikipedia.org/wiki/Seven_deadly_sins) through social media analytics. There has been a
huge amount of work on sentiment analysis of social media, e.g. are people happy or sad as recorded
by their tweets, but far less work on other aspects of human nature and emotion: greed, lust, laziness
etc. Teams will explore one or more deadly sins and collect social media data that captures some
aspect of that sin and compares it with official data from AURIN. A few examples of the deadly sins
might be:
• Pride: how many selfies are taken in particular areas, how many tweets/images about
make-up/personal care, …
• Greed: tweets/images about food/drink or about money/income, …
• Lust: tweets that include the likes of “I want…”, “I love…”, “I’m jealous…” or images of a
“certain” adult nature;
• Envy: tweets that include the likes of “I wish…”, “I need…”, “I desire…”
• Gluttony: tweets/images that show overweight people or about dietary issues, e.g. fast food
restaurants such as #maccas or related products such as #bigmac etc;
• Wrath: tweets that include the likes of “I hate…”, “I’m angry…” or about crime or areas with
high levels of negative emotion (sentiment) on particular topics etc
• Sloth: tweets that mention sleep, laziness, or areas where there are more/less tweets at
night/early morning.
These are just examples and students are encouraged to be creative. Indeed, a prize will be awarded
for the most original scenario.
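As a starting point for the phrase-based examples above, the following is a minimal sketch of tagging a tweet's text with candidate sins. The phrase lists are taken loosely from the examples in this description ("lazy" is an added variant), the names SIN_PHRASES, normalise and tag_sins are illustrative only, and simple phrase matching will misfire on sarcasm, negation and quoted text.

```python
import re

# Example phrases based on the list above; a real study would need a much
# richer vocabulary plus some handling of negation and sarcasm.
SIN_PHRASES = {
    "lust": ["i want", "i love", "i'm jealous"],
    "envy": ["i wish", "i need", "i desire"],
    "wrath": ["i hate", "i'm angry"],
    "sloth": ["sleep", "lazy", "laziness"],
}


def normalise(text):
    """Lower-case the text and replace typographic apostrophes."""
    return text.lower().replace("\u2019", "'")


def tag_sins(text):
    """Return the sorted list of sins whose example phrases occur in text."""
    cleaned = normalise(text)
    found = set()
    for sin, phrases in SIN_PHRASES.items():
        for phrase in phrases:
            # \b stops "i want" from matching inside "wiki wanted".
            if re.search(r"\b" + re.escape(phrase) + r"\b", cleaned):
                found.add(sin)
                break
    return sorted(found)
```

A call such as tag_sins("I hate Mondays") returns a list of sin names that can then be stored alongside the tweet or emitted from an analytics view.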
The goal of the assignment is to harvest tweets from across the cities of Australia on the UniMelb
Research Cloud and undertake a variety of data analytics scenarios that tell interesting stories of life
in Australian cities related to one or more deadly sins. Importantly, the scenarios should show how
the Twitter and pre-collected Instagram data can be compared with, used alongside, or used to
augment the data available within the AURIN platform to assess/validate these sins. For example,
• The VicHealth survey identifies individuals and their sedentary patterns, e.g. how many
hours do people sleep or spend seated per day. How does this correlate with Sloth data?
• PHIDU data includes information on prevalence of health risk factors such as obesity. How
does this correlate with tweets/posts related to food and/or fast food in particular for
Gluttony data?
• The Crime Statistics Agency in Victoria contains data related to sexual offences and stalking.
How does this relate to tweets/images related to Lust?
• The Crime Statistics Agency in Victoria also contains data related to assaults. How does this
relate to tweets/images related to Wrath?
• The Victoria Commission for Licensing and Gambling Registration Authority contains data
related to money spent on gambling and where it is spent. How does this relate to Greed?
• The Australian Bureau of Statistics includes data on the population and household income
statistics. Is there a correlation with make-up/personal care in areas with more wealthy
females? Do more young people take selfies? Are there other Pride scenarios that can be
explored?
• The Australian Bureau of Statistics has information on homelessness and areas of high
household income. Are there any scenarios that explore social media use related to Envy that
connect these two societal issues?
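Several of the questions above ask how a social media measure correlates with an official statistic. One simple way to quantify this, sketched below, is the Pearson correlation coefficient computed over the areas present in both data sets. The function names, the area codes and the numbers are placeholders made up to show the call shape; they are not real AURIN or Twitter values, and a correlation on its own says nothing about causation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlate_by_area(tweet_counts, official_values):
    """Correlate two {area_code: value} dicts over the areas they share."""
    shared = sorted(set(tweet_counts) & set(official_values))
    return pearson([tweet_counts[a] for a in shared],
                   [official_values[a] for a in shared])


# Made-up placeholder numbers, purely to show the call shape.
tweets_per_1000_people = {"area_a": 1.0, "area_b": 2.0, "area_c": 3.0, "area_d": 9.0}
official_rate = {"area_a": 10.0, "area_b": 20.0, "area_c": 30.0}
r = correlate_by_area(tweets_per_1000_people, official_rate)
```

Areas that appear in only one data set (area_d here) are ignored, so it is worth reporting how many areas were actually compared.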
The above are examples – students may decide to create their own analytics based on the data they
obtain. Students are not expected to build advanced “general purpose” data analytic services that
can support any scenario, but show how tools like CouchDB with targeted data analysis capabilities
like MapReduce when provided with suitable inputs can be used to capture the essence of life in
Australia. Teams are encouraged to combine Twitter data with AURIN data and potentially other data
of relevance to the city, e.g. information on weather, sport events, TV shows, visiting celebrities,
stock market rise/falls, images from Instagram etc.
Teams can download data from the AURIN platform, e.g. as JSON, CSV or as Shapefiles, or use the
AURIN openAPI (https://aurin.org.au/aurin-apis/). This data can (and should) be loaded into the
team’s CouchDB database for analysis with Twitter and Instagram data.
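One possible way to load a downloaded AURIN CSV extract into CouchDB is to turn each row into a document and send the whole set to the database's _bulk_docs endpoint. The sketch below only builds the request body; the function name, the column names (area_code, median_income) and the source label are hypothetical and would need to match the actual extract.

```python
import csv
import io
import json


def csv_to_bulk_docs(csv_text, id_column, source):
    """Turn CSV rows into a CouchDB _bulk_docs payload.

    Each row becomes one document whose _id is derived from id_column,
    so the same area cannot end up stored under two different ids.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row in reader:
        doc = dict(row)  # note: csv values are strings, not numbers
        doc["_id"] = f"{source}:{row[id_column]}"
        doc["type"] = source
        docs.append(doc)
    return {"docs": docs}


# Hypothetical extract: column names and values are placeholders.
sample = "area_code,median_income\nA001,52000\nA002,61000\n"
payload = csv_to_bulk_docs(sample, id_column="area_code", source="aurin_income")
body = json.dumps(payload)  # would be POSTed to /<db>/_bulk_docs
```

Numeric columns arrive as strings from the csv module, so they should be converted before any view tries to sum or average them.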
The teams should develop a Cloud-based solution that exploits a multitude of virtual machines
(VMs) across the UniMelb Research Cloud for harvesting tweets through the Twitter APIs (using
both the Streaming and the Search API interfaces). The teams should produce a solution that can be
run (in principle) across any node of the Cloud to harvest and store tweets. Teams have been
allocated four medium-sized VMs with 8 virtual CPUs (36GB memory total) and 250GB of volume
storage. More storage can be provided if required. All students should have access to the NeCTAR
Research Cloud as individual users and can test/develop their applications using their own (small)
VM instances. Remember that there is no persistence in these small, free and dynamically
allocated VMs.
The solution should include a Twitter harvesting application for one or more of the cities of
Australia. The teams are expected to have multiple instances of this application running on the Cloud
together with an associated CouchDB database containing the amalgamated collection of Tweets
from the harvester applications. The CouchDB setup may be a single node or adopt a cluster setup. A
key aspect of this work is in removing duplicate tweets, i.e. the system should be designed such that
duplicate tweets will not arise.
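A common way to design duplicates out, sketched below, is to use the tweet's own identifier (the id_str field of a tweet object) as the CouchDB document _id. CouchDB rejects the creation of a second document with an existing _id with a 409 Conflict response, so several harvesters can write to the same database without coordinating. The FakeCouch class is only an in-memory stand-in used to illustrate that behaviour; it is not a CouchDB client, and the function names are illustrative.

```python
def tweet_to_doc(tweet):
    """Copy a tweet dict and use its unique id as the CouchDB document _id."""
    doc = dict(tweet)
    doc["_id"] = tweet["id_str"]
    return doc


class FakeCouch:
    """In-memory stand-in that mimics CouchDB's one-document-per-_id rule."""

    def __init__(self):
        self.docs = {}

    def put(self, doc):
        # CouchDB answers 409 Conflict when a new document reuses an _id.
        if doc["_id"] in self.docs:
            return 409
        self.docs[doc["_id"]] = doc
        return 201


def store(db, tweets):
    """Store tweets; return (number of new tweets, number of duplicates)."""
    new = dupes = 0
    for tweet in tweets:
        if db.put(tweet_to_doc(tweet)) == 201:
            new += 1
        else:
            dupes += 1
    return new, dupes
```

With a real database the harvester would treat a 409 response as "already stored" and move on, rather than as an error.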
Students may want to explore other social media APIs for collection of data, e.g. Foursquare and
Flickr; however, these are not compulsory to complete the work. A corpus of Instagram posts will be
made available for data analytics, but again teams may decide that they only wish to focus on Twitter
data. (See appendix for how to access/download Instagram data). It is noted that social media
providers such as Instagram are evolving their APIs as well as the policies on access to and use of
their data, so bear this in mind if you wish to harvest data from other social media providers. Such
data is not compulsory and it is suggested that teams focus on Twitter. Teams may also find other
data “on the web” that augments the AURIN data sets and scenarios that can be told.
A front-end web application is required for visualising these data sets/scenarios.
For the implementation teams are recommended to use a commonly understood language across
team members – most likely Java or Python. Information on building and using Twitter harvesters
can be found on the web, e.g. see https://dev.twitter.com/ and related links to resources such as
Tweepy and Twitter4J. Teams are free to use any pre-existing software systems that they deem
appropriate for the analysis. This can include sentiment analysis libraries, gender identification
libraries, and machine learning systems as well as front-end Javascript libraries and visualisation
capabilities, e.g. Googlemaps.
Error Handling
Issues and challenges in using the UniMelb Research Cloud for this assignment should be
documented. You should describe in detail the limitations of mining Twitter content and language
processing (e.g. sarcasm). You should outline any solutions developed to tackle such scenarios.
Duplicate tweets must be removed; the database may, however, contain re-tweets.
You should demonstrate how you tackled working within the quota imposed by the Twitter APIs
through the use of the Cloud.
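One way to approach the quota question, sketched here under assumptions, is to give each harvester its own request budget per time window and to split the cities across harvesters so that no two instances poll the same city. The class and function names are illustrative, and the limit and window values passed in are placeholders; the actual quotas should be taken from the Twitter API documentation for the endpoint and credentials in use.

```python
import time


class WindowBudget:
    """Allow at most `limit` calls per `window` seconds (fixed window)."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the logic can be tested
        self.window_start = clock()
        self.used = 0

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if now)."""
        now = self.clock()
        if now - self.window_start >= self.window:
            # A new window has begun: reset the counter.
            self.window_start = now
            self.used = 0
        if self.used < self.limit:
            return 0.0
        return self.window - (now - self.window_start)

    def record_call(self):
        """Count one API request against the current window."""
        self.used += 1


def assign_cities(cities, harvesters):
    """Round-robin cities over harvester names so no two poll the same city."""
    plan = {h: [] for h in harvesters}
    for i, city in enumerate(cities):
        plan[harvesters[i % len(harvesters)]].append(city)
    return plan
```

A harvester loop would sleep for wait_time() seconds before each request and call record_call() afterwards.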
Final packaging and delivery
You should collectively write a team report on the application developed and include the
architecture, the system design and the discussions that led to the design. You should describe
the role of the team members in the delivery of the system and where the team worked well and
where issues arose and how they were addressed. The team should illustrate the functionality of the
system through a range of scenarios and explain why you chose the specific examples. Teams are
encouraged to write this report in the style of a paper that can ultimately be submitted to a
conference/journal.
Each team member is also expected to complete a confidential report on their role in the project and
the experiences in working with their individual team members. This will be handed in separately to
the final team report. (This is not to be used to blame people, but to ensure that all team members
are able to provide feedback and to ensure that no team has any member that does nothing!!!).
The length of the team report is not fixed. Given the level of complexity of the assignment and total
value of the assignment a suitable estimate is a report in the range of 20-25 pages. A typical report
will comprise:
● A description of the system functionalities, the scenarios supported and why, together with
graphical results, e.g. pie-charts/graphs of Tweet analysis and snapshots of the web
apps/maps displaying certain Tweet scenarios;
● A simple user guide for testing (including system deployment and end user invocation/usage
of the systems);
● System design and architecture and how/why this was chosen;
● A discussion on the pros and cons of the UniMelb Research Cloud and tools and processes for
image creation and deployment;
● Teams should also produce a video of their system that is uploaded to YouTube (these videos
can last longer than the UniMelb deployments unfortunately!);
● Reports should also include a link to the source code (github or bitbucket).
It is important to put your collective team details (team, city, names, surnames, student ids) in:
● the head page of the report;
● as a header in each of the files of the software project.
Individual reports describing your role and your team’s contributions should be submitted
separately. A link to the Qualtrics system will be sent separately for this.
Implementation Requirements
Teams are expected to use:
● a version-control system such as GitHub or Bitbucket for sharing source code.
● MapReduce based implementations for analytics where appropriate, using CouchDB’s built
in MapReduce capabilities. You may also consider using Hadoop/Spark for this task if
desired.
● The entire system should have scripted deployment capabilities. This means that your team
will provide a script, which, when executed, will create and deploy the virtual machines and
orchestrate the set up of all necessary software on said machines (e.g. CouchDB, the twitter
harvesters, web servers etc.) to create a ready-to-run system. Note that this setup need not
populate the database, but demonstrate your ability to orchestrate the necessary software
environment on the UniMelb Research Cloud. Teams should use Ansible
(http://www.ansible.com/home) for this task.
● Teams may wish to utilise Docker and related technologies, but this is not mandatory.
● The server side of your analytics web application may expose its data to the client through a
ReSTful design. Authentication or authorization is NOT required for the web front end.
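To illustrate the MapReduce requirement above, the sketch below shows what a CouchDB design document with one view might look like, expressed as a Python dictionary ready to be serialised to JSON. The map function itself is JavaScript (CouchDB's default view language) and _count is one of CouchDB's built-in reduce functions. The design document name, the view name and the city field are assumptions about how a harvester might store tweets, and run_view is only a local Python emulation for reasoning about the expected result, not how CouchDB executes views.

```python
import json
from collections import Counter

# Design document as it could be PUT to /<db>/_design/analytics.
# "city" is a hypothetical field the harvester is assumed to have stored.
DESIGN_DOC = {
    "_id": "_design/analytics",
    "views": {
        "tweets_per_city": {
            "map": "function (doc) { if (doc.city) { emit(doc.city, 1); } }",
            "reduce": "_count",
        }
    },
}


def map_doc(doc):
    """Python mirror of the JavaScript map function above."""
    if doc.get("city"):
        yield doc["city"], 1


def run_view(docs):
    """Local emulation of querying the view with group=true."""
    counts = Counter()
    for doc in docs:
        for key, _value in map_doc(doc):
            counts[key] += 1  # _count counts the emitted rows per key
    return dict(sorted(counts.items()))


design_json = json.dumps(DESIGN_DOC)
```

Keeping views like this in version control and installing them from the deployment script means the analytics are reproducible on a freshly created database.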
Teams are also encouraged to describe:
● How fault-tolerant is your software setup? Is there a single point-of-failure?
● Can your application and infrastructure dynamically scale out to meet demand?
Deadline
One copy of the team assignment is to be submitted through the LMS. The zip file must be named
with your team, i.e. <CCC2018-n>.zip.
Individual reports describing your role and your team’s contributions should be submitted via the
Qualtrics system (link to be provided separately). These individual reports will be based on the
completion of web-based forms and do not require extensive Word/PDF documents etc.
The deadline for submitting the team assignment is Wednesday 15th May (by 1pm!).
Marking
The marking process will be structured by evaluating whether the assignment (application + report)
is compliant with the specification given. This implies the following:
● A working demonstration of the Cloud-based solution with dynamic deployment – 25%
marks
● A working demonstration of tweet harvesting and CouchDB utilization for specific analytics
scenarios and any novel analytics capabilities required for the data, e.g. image recognition
support – 30% marks
● Detailed documentation on the system architecture and design – 20%
● Report and write up discussion including pros and cons of the UniMelb Research Cloud and
supporting twitter data analytics – 15% marks
● Proper handling of the errors and removal of duplicate tweets – 10% marks
The (confidential) assessment by your peers will be used to weight your individual scores
accordingly.
Timeliness in submitting the assignment in the proper format is important. A 10% deduction per
day will be made for late submissions.
Demonstration Schedule and Venue
The student teams are required to give a presentation (with a few slides) and a demonstration of the
working application. This should include the key data analytics scenarios supported as well as the
design and implementation choices made. Each team has up to 15 minutes to present their work.
This will take place on Wednesday 15th May (12 teams present) and 22nd May (12 teams
present). Note that given the numbers of teams this year, not all teams will be able to present
– however all teams should be prepared to present on 15th May!!! I will randomly identify a
team on the day (using a random number generator for fairness!!!). Note this is the same day as
submission hence the deadline for submission is a hard one!!!
As a team, you are free to develop your system(s) wherever you are most comfortable (at home,
on your PC/laptop, in the labs…) but obviously the demonstration should work on the UniMelb
Research Cloud.
Appendix – Access to Twitter and/or Instagram Data
Note: you do not have to use Instagram data, but if you want to for your scenarios then follow the
recipe below. We have been collecting Instagram data on the UniMelb Research Cloud. There are
around 19 million posts. We have divided them by location and date of harvesting; the breakdown
can be requested with the “summary” view requests further below (note that curl needs to be installed).
Twitter and Instagram data can be selected by location (each of the main capital cities) and time
interval (down to the day):
curl "http://45.113.232.90/couchdbro/twitter/_design/twitter/_view/summary" \
-G \
--data-urlencode 'start_key=["perth",2014,1,1]' \
--data-urlencode 'end_key=["perth",2014,12,31]' \
--data-urlencode 'reduce=false' \
--data-urlencode 'include_docs=true' \
--user "readonly:ween7ighai9gahR6" \
-o /tmp/twitter.json
curl "http://45.113.232.90/couchdbro/instagram/_design/instagram/_view/summary" \
-G \
--data-urlencode 'start_key=["perth",2014,1,1]' \
--data-urlencode 'end_key=["perth",2014,12,31]' \
--data-urlencode 'reduce=false' \
--data-urlencode 'include_docs=true' \
--user "readonly:ween7ighai9gahR6" \
-o /tmp/instagram.json
To aggregate tweets by city, the “summary” view can be used:
curl -XGET "http://45.113.232.90/couchdbro/twitter/_design/twitter/_view/summary" \
-G \
--data-urlencode 'reduce=true' \
--data-urlencode 'include_docs=false' \
--data-urlencode 'group_level=1' \
--user "readonly:ween7ighai9gahR6"
For Instagram data:
curl -XGET "http://45.113.232.90/couchdbro/instagram/_design/instagram/_view/summary" \
-G \
--data-urlencode 'reduce=true' \
--data-urlencode 'include_docs=false' \
--data-urlencode 'group_level=1' \
--user "readonly:ween7ighai9gahR6"
The following request extracts GeoJSON limited to an area around Melbourne CBD (“r1r0” to
“r1r1” geohash):
curl -XGET "http://45.113.232.90/couchdbro/twitter/_design/twitter/_list/geojson/geoindex?reduce=false" \
-G \
--data-urlencode 'start_key=["r1r0",null,null,null]' \
--data-urlencode 'end_key=["r1r1",{},{},{}]' \
--user "readonly:ween7ighai9gahR6"
The GeoHash code can be found at http://geohash.gofreerange.com/.
The key components are not independent, hence they cannot be queried separately or in a
different order (think of it as one single key built as a concatenation, as in: “r1r0_2019_13_2”).
On Windows, apostrophes work in a different way, hence double quotes have to be used instead, as
in:
curl -XGET "http://45.113.232.90/couchdbro/twitter/_design/twitter/_list/geojson/geoindex?limit=10&reduce=false" \
-G \
--data-urlencode "start_key=[\"r1r0\",null,null,null]" \
--data-urlencode "end_key=[\"r1r1\",{},{},{}]" \
--user "readonly:ween7ighai9gahR6"
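Because shell quoting differs between platforms, it may be simpler to build these requests in code. The sketch below assembles the same kind of view URL in Python using only the standard library: the keys are JSON-encoded and then URL-escaped, which is what --data-urlencode does in the curl commands above. The function name view_url is illustrative, and the sketch only builds the URL; sending it (with the read-only credentials given above) is left to whichever HTTP client the team prefers.

```python
import json
from urllib.parse import urlencode


def view_url(base, start_key, end_key, **options):
    """Build a CouchDB view URL with JSON-encoded, URL-escaped keys."""
    params = {
        "start_key": json.dumps(start_key, separators=(",", ":")),
        "end_key": json.dumps(end_key, separators=(",", ":")),
    }
    # Options such as reduce or include_docs are sent as JSON literals,
    # so Python's False/True become false/true.
    for name, value in options.items():
        params[name] = json.dumps(value)
    return base + "?" + urlencode(params)


SUMMARY = "http://45.113.232.90/couchdbro/twitter/_design/twitter/_view/summary"
GEOINDEX = "http://45.113.232.90/couchdbro/twitter/_design/twitter/_list/geojson/geoindex"

# Tweets from Perth during 2014, one row per document.
url = view_url(SUMMARY, ["perth", 2014, 1, 1], ["perth", 2014, 12, 31],
               reduce=False, include_docs=True)

# Geohash range around the Melbourne CBD; null sorts before and {} after
# other values in CouchDB's collation order, so the range covers every
# key that starts with a geohash between "r1r0" and "r1r1".
geo_url = view_url(GEOINDEX, ["r1r0", None, None, None],
                   ["r1r1", {}, {}, {}], reduce=False)
```

Since the key is treated as a single composite value, the range always has to start from the leftmost component (city or geohash) and narrow from there.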