Description
Project Proposal
Include the following information:
Project title
Data set
Project idea
The approach you will use
Software you will use
References.
Teammate (if any).
Timeline.
Suggestions of Project Ideas:
Machine Learning for COVID-19:
Data and Resources:
https://sites.google.com/view/data-science-covid-19/data-and-resources?authuser=0
Text
Autonomous Tagging of StackOverflow Questions
o Make a multi-label classification system that automatically assigns tags for
questions posted on a forum such as StackOverflow or Quora.
o Dataset: StackLite or 10% sample
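A minimal sketch of how multi-label targets are usually encoded: each question gets one 0/1 entry per tag, so a single question can be positive for several tags at once. The three questions and their tags below are made-up placeholders, not rows from StackLite, and the helper names are hypothetical.

```python
def build_tag_index(tagged_questions):
    """Map every distinct tag to a column index, in sorted order."""
    all_tags = sorted({tag for _, tags in tagged_questions for tag in tags})
    return {tag: i for i, tag in enumerate(all_tags)}


def to_multi_hot(tags, tag_index):
    """Encode one question's tags as a 0/1 vector with one entry per known tag."""
    row = [0] * len(tag_index)
    for tag in tags:
        row[tag_index[tag]] = 1
    return row


# Toy data (assumed for illustration only).
questions = [
    ("How do I merge two dicts?", ["python", "dictionary"]),
    ("Why is my SQL join slow?", ["sql", "performance"]),
    ("Speeding up a pandas groupby", ["python", "performance"]),
]

tag_index = build_tag_index(questions)
targets = [to_multi_hot(tags, tag_index) for _, tags in questions]
```

One simple baseline on top of this encoding is to train an independent binary classifier per tag column; stronger approaches also model correlations between tags.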
Keyword/Concept identification
o Identify keywords from millions of questions
o Dataset: StackOverflow question samples by Facebook
Topic identification
o Multi-label classification of printed media articles to topics
o Dataset: Greek Media monitoring multi-label classification
Natural Language Understanding
Automated essay grading
o The purpose of this project is to implement and train machine learning algorithms
to automatically assess and grade essay responses.
o Dataset: Essays with human graded scores
Sentence to Sentence semantic similarity
o Can you identify question pairs that have the same intent or meaning?
o Dataset: Quora question pairs with similar questions marked
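As a starting baseline for this task, one option is plain word overlap (Jaccard similarity) between the two questions. The sketch below assumes simple lowercase alphanumeric tokenization; the function names are illustrative, and a real submission would likely need embeddings or a trained model, since word overlap misses paraphrases that share few words.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity in [0, 1]: |A and B| / |A or B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty questions are treated as identical
    return len(ta & tb) / len(ta | tb)


score = jaccard("How do I learn Python?", "How can I learn Python?")
```

A pair can then be flagged as a duplicate when the score exceeds a threshold tuned on the labeled pairs.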
Fight online abuse
o Can you confidently and accurately tell whether a particular comment is abusive?
o Dataset: Toxic comments on Kaggle
Open Domain question answering
o Can you build a bot which answers questions according to the student’s age or her
curriculum?
o Facebook’s FAIR is built in a similar way for Wikipedia.
o Dataset: NCERT books for K-12/school students in India, NarrativeQA by
Google DeepMind and SQuAD by Stanford
Social Chat/Conversational Bots
o Can you build a bot which talks to you just like people talk on social networking
sites?
o Reference: Chat-bot architecture
o Dataset: Reddit Dataset
Automatic text summarization
o Can you create a summary with the major points of the original document?
o Abstractive (write your own summary) and Extractive (select pieces of text from
original) are two popular approaches
o Dataset: CNN and DailyMail News Pieces by Google DeepMind
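To make the extractive approach concrete, here is a small sketch that scores each sentence by the average corpus frequency of its non-stopword terms and keeps the top k sentences in their original order. The stopword list, the regex-based sentence splitter, and the function names are simplifying assumptions for illustration, not a recommended production pipeline.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
             "that", "on", "for", "was", "with", "as", "at", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def extractive_summary(text, k=1):
    """Return the k highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        ws = content_words(sentence)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

An abstractive system, by contrast, would generate new sentences rather than select existing ones, which typically requires a sequence-to-sequence model.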
Copy-cat Bot
o Generate plausible new text which looks like some other text
o Obama Speeches? For instance, you can create a bot which writes some new
speeches in Obama’s style
o Trump Bot? Or a Twitter bot which mimics @realDonaldTrump
o Narendra Modi bot saying “doston”? Start by scraping his Hindi speeches
from his personal website
o Example Dataset: English Transcript of Modi speeches
Check mlm/blog for some hints.
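One of the simplest ways to generate text that looks like a source corpus is a word-level Markov chain: record which word follows each pair of words, then sample from those records. The sketch below is only a baseline under that assumption (neural language models produce far more fluent text); the corpus string is an invented placeholder, not a real speech, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, length=20, seed=None):
    """Sample up to `length` words by repeatedly following the chain."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:  # dead end: this state only occurs at the corpus end
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)


# Placeholder corpus (made up for illustration).
corpus = ("we can build a better future and we can build a stronger nation "
          "and we will build it together")
chain = build_chain(corpus, order=2)
sample = generate(chain, length=12, seed=0)
```

A larger `order` copies longer stretches of the source verbatim, while a smaller one produces more novel but less coherent text.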
Sentiment Analysis
o Do Twitter Sentiment Analysis on tweets sorted by geography and timestamp.
o Dataset: Tweets sentiment tagged by humans
De-anonymization
o Can you classify the text of an e-mail message to decide who sent it?
o Dataset: 150,000 Enron emails
Forecasting
Univariate Time Series Forecasting
o How much will it rain this year?
o Dataset: 45 years of rainfall data
Multi-variate Time Series Forecasting
o How polluted will your town’s air be? Pollution Level Forecasting
o Dataset: Air Quality dataset
Demand/load forecasting
o Produce a short-term forecast of the electricity consumption of a single home
o Dataset: Electricity consumption of a household
Predict Blood Donation
o We’re interested in predicting if a blood donor will donate within a given time
window.
o More on the problem statement at Driven Data.
o Dataset: UCI ML Datasets Repo
Recommendation systems
Movie Recommender
o Can you predict the rating a user will give to a movie?
o Do this using the movies that user has rated in the past, as well as the ratings
similar users have given similar movies.
o Dataset: Netflix Prize and MovieLens Datasets
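The idea of using ratings from similar users can be sketched as user-based collaborative filtering: measure how similar two users are on the movies both have rated, then predict a missing rating as the similarity-weighted average of other users' ratings. The ratings dictionary, user names, and function names below are invented for illustration, and ratings are assumed to be positive numbers (for example 1 to 5).

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two users over the movies both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`.

    Returns None when no other user has rated the movie.
    """
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[movie]
        den += abs(sim)
    return num / den if den else None


# Toy ratings (assumed): ann agrees with bob and disagrees with cat.
ratings = {
    "ann": {"A": 5, "B": 1},
    "bob": {"A": 5, "B": 1, "C": 4},
    "cat": {"A": 1, "B": 5, "C": 2},
}
guess = predict(ratings, "ann", "C")
```

Here the prediction for ann lands closer to bob's rating than to cat's, because bob's past ratings match ann's. Common refinements include subtracting each user's mean rating and using matrix factorization instead of neighbor lookups.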
Search + Recommendation System
o Predict which Xbox game a visitor will be most interested in based on their search
query
o Dataset: BestBuy
Can you predict Influencers in the Social Network?
o How can you predict social influencers?
o Dataset: PeerIndex
Vision
Image classification
o Object recognition or image classification is the task that drove Deep Learning’s
present-day resurgence
o Datasets:
CIFAR-10
ImageNet
MS COCO is the modern replacement for the ImageNet challenge
MNIST Handwritten Digit Classification Challenge is the classic entry
point
Character recognition (digits) is the good old Optical Character
Recognition problem
Bird Species Identification from an Image using the Caltech-UCSD Birds
dataset
o Diagnosing and Segmenting Brain Tumours and Phenotypes using MRI Scans
Dataset: MICCAI Machine Learning Challenge aka MLC 2014
o Identify endangered right whales in aerial photographs
Dataset: NOAA Right Whale
o Can computer vision spot distracted drivers?
Dataset: State Farm Distracted Driver Detection on Kaggle
Bone X-Ray competition
o Can you automatically identify whether a hand is broken from an X-ray radiograph,
with better-than-human performance?
o Stanford’s Bone XRay Deep Learning Competition with MURA Dataset
Image Captioning
o Can you caption/explain a photo the way a human would?
o Dataset: MS COCO
Image Segmentation/Object Detection
o Can you extract an object of interest from an image?
o Dataset: MS COCO, Carvana Image Masking Challenge on Kaggle
Large-Scale Video Understanding
o Can you produce the best video tag predictions?
o Dataset: YouTube 8M
Video Summarization
o Can you select the semantically relevant/important parts from the video?
o Example: Fast-Forward Video Based on Semantic Extraction
o Dataset: No standard dataset or agreed-upon metric that we are aware of;
YouTube 8M might be a good starting point.
Style Transfer
o Can you recompose images in the style of other images?
o Dataset: fzliu on GitHub shared target and source images with results
Chest XRay
o Can you detect if someone is sick from their chest XRay? Or guess their radiology
report?
o Dataset: MIMIC-CXR at Physionet
Face Recognition
o Can you identify whose photo this is? Similar to Facebook’s photo tagging or
Apple’s Face ID
o Dataset: face-rec.org, or facedetection.com
Clinical Diagnostics: Image Identification, classification & segmentation
o Can you help build open-source software for lung cancer detection to help
radiologists?
o Link: Concept to clinic challenge on DrivenData
Satellite Imagery Processing for Socioeconomic Analysis
o Can you estimate the standard of living or energy consumption of a place from
night time satellite imagery?
o Reference for Project details: Stanford Poverty Estimation Project
Satellite Imagery Processing for Automated Tagging
o Can you automatically tag satellite images with human features such as buildings,
roads, waterways and so on?
o Help reduce the manual effort of tagging satellite imagery: Kaggle Dataset by DSTL,
UK
Reinforcement Learning
Deep Q Learning
o Can you make an AI agent play games and automate tasks by learning in an environment?
o Environments (the Reinforcement Learning counterpart of a dataset): OpenAI Gym
o T-REX Chrome Dino BOT Git Repo
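Before moving to Deep Q Learning, it can help to see the underlying update rule in its tabular form. The sketch below is plain tabular Q-learning, not a deep network and not the OpenAI Gym API: the environment is a hypothetical five-state corridor where the agent starts at state 0 and earns a reward of 1 for reaching state 4. All names and hyperparameters are assumptions chosen for illustration.

```python
import random

N_STATES = 5      # states 0..4; reaching state 4 ends the episode
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Toy corridor dynamics: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def greedy(q_row, rng):
    """Pick a highest-valued action, breaking ties at random."""
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])


def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Epsilon-greedy tabular Q-learning; returns the learned Q-table."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap so an episode cannot run forever
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy(q[state], rng)
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next-state value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

Deep Q Learning keeps the same target but replaces the table with a neural network that maps observations (for example, game frames) to action values, which is what makes it workable for environments with very large state spaces.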
Music
Music/Audio Recommendation Systems
o Can you tell if two songs are similar using their sound or lyrics?
o Dataset: Million Songs Dataset and its 1% sample.
o Example: Anusha et al
Music Genre recognition using neural networks
o Can you identify the musical genre using their spectrograms or other sound
information?
o Datasets: FMA or GTZAN on Keras
o Get started with Librosa for feature extraction
Other Dataset Suggestions
UCI also has a collection of datasets sorted for various tasks (Classification,
Regression, etc)
Data.gov: U.S. Government’s open data
KDD Cup: http://www.kdd.org/kdd-cup, annual competition in data mining,
like Kaggle
Google public datasets.
NYC Taxi data for 2013 (FOILed by Chris Wong). 2013 Trip Data (11.0GB).
2013 Fare Data (7.7GB). Visualization for a day’s trips.
Yahoo WebScope
Freebase
Yelp
Numerous APIs from Google (e.g., Maps, Freebase, YouTube, etc.)
Trulia, Zillow: real estate listing sites
Numerous graph datasets (large and small): SNAP, Konect
Movies data: Rotten Tomatoes, IMDB
List of lists of datasets for recommendations.
Million song dataset by Echo Nest.
It contains not only the basic information of songs (artist, genre, year, length,
etc.), but also some musical features (like tempo, pitch, key, brightness).
The Free ‘Big Data’ Sources Everyone Should Know
Quandl – a dataset search engine for time-series data.
Amazon AWS Public Data Sets (Thanks Jonathan!)
Academic domain: Microsoft Academic Search, DBLP
Retrosheet: MLB statistics (Game/Play logs)
Classification datasets
Various geophysical datasets for the oceans (magnetism, gravity, seismology,
etc).
Social trends
Beer data
Academic torrents (terabytes)
Article Search API from the New York Times (all the way back to 1851!)