Project Proposal
Include the following information:
 Project title
 Data set
 Project idea
 The approach you will use
 Software you will use
 References.
 Teammate (if any).
 Timeline.
Suggestions of Project Ideas:
Machine Learning for COVID-19:
 Data and Resources:
https://sites.google.com/view/data-science-covid-19/data-and-resources?authuser=0
Text
 Autonomous Tagging of StackOverflow Questions
o Make a multi-label classification system that automatically assigns tags for
questions posted on a forum such as StackOverflow or Quora.
o Dataset: StackLite or 10% sample
 Keyword/Concept identification
o Identify keywords from millions of questions
o Dataset: StackOverflow question samples by Facebook
 Topic identification
o Multi-label classification of printed media articles to topics
o Dataset: Greek Media monitoring multi-label classification
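The tagging and topic-identification ideas above are multi-label problems: each question or article can carry several tags at once. A minimal sketch of how such labels can be encoded as binary indicator vectors and scored with micro-averaged F1 follows. The three-tag vocabulary and the predictions are made up for illustration and do not come from any of the listed datasets.

```python
def binarize(tag_sets, vocabulary):
    """Encode each set of tags as a 0/1 indicator vector over the vocabulary."""
    return [[1 if tag in tags else 0 for tag in vocabulary] for tags in tag_sets]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool the counts over every (sample, tag) cell."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example data (assumption, for illustration only).
VOCABULARY = ["python", "pandas", "sql"]
true_tags = [{"python", "pandas"}, {"sql"}]
predicted_tags = [{"python"}, {"sql", "pandas"}]

y_true = binarize(true_tags, VOCABULARY)
y_pred = binarize(predicted_tags, VOCABULARY)
score = micro_f1(y_true, y_pred)
```

Micro-averaging is only one reasonable choice here; macro-averaged F1 or Hamming loss may suit a project better when rare tags matter.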
Natural Language Understanding
 Automated essay grading
o The purpose of this project is to implement and train machine learning algorithms
to automatically assess and grade essay responses.
o Dataset: Essays with human graded scores
 Sentence to Sentence semantic similarity
o Can you identify question pairs that have the same intent or meaning?
o Dataset: Quora question pairs with similar questions marked
 Fight online abuse
o Can you confidently and accurately tell whether a particular comment is abusive?
o Dataset: Toxic comments on Kaggle
 Open Domain question answering
o Can you build a bot which answers questions according to the student’s age or her
curriculum?
o Facebook’s FAIR group built a similar system for Wikipedia.
o Dataset: NCERT books for K-12/school students in India, NarrativeQA by
Google DeepMind and SQuAD by Stanford
 Social Chat/Conversational Bots
o Can you build a bot which talks to you just like people talk on social networking
sites?
o Reference: Chat-bot architecture
o Dataset: Reddit Dataset
 Automatic text summarization
o Can you create a summary with the major points of the original document?
o Abstractive (write your own summary) and Extractive (select pieces of text from
original) are two popular approaches
o Dataset: CNN and DailyMail News Pieces by Google DeepMind
 Copy-cat Bot
o Generate plausible new text which looks like some other text
o Obama Speeches? For instance, you can create a bot which writes some new
speeches in Obama’s style
o Trump Bot? Or a Twitter bot which mimics @realDonaldTrump
o Narendra Modi bot saying “doston”? Start by scraping his Hindi speeches
from his personal website
o Example Dataset: English Transcript of Modi speeches
Check mlm/blog for some hints.
 Sentiment Analysis
o Do Twitter Sentiment Analysis on tweets sorted by geography and timestamp.
o Dataset: Tweets sentiment tagged by humans
 De-anonymization
o Can you classify the text of an e-mail message to decide who sent it?
o Dataset: 150,000 Enron emails
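For the sentence-to-sentence similarity idea in this section, a useful first baseline is the cosine similarity between bag-of-words vectors of the two questions. The sketch below uses only the standard library. The tokenizer and the function names are illustrative assumptions, and a real project would likely compare this baseline against learned embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(question_a, question_b):
    """Cosine similarity between the word-count vectors of two questions."""
    counts_a = Counter(tokenize(question_a))
    counts_b = Counter(tokenize(question_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one question has no tokens
    return dot / (norm_a * norm_b)


# Hypothetical question pair (assumption, not taken from the Quora dataset).
similarity = cosine_similarity(
    "How do I learn Python quickly?",
    "What is the quickest way to learn Python?",
)
```

A pair can then be flagged as a duplicate when the score exceeds a threshold tuned on held-out labeled pairs.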
Forecasting
 Univariate Time Series Forecasting
o How much will it rain this year?
o Dataset: 45 years of rainfall data
 Multi-variate Time Series Forecasting
o How polluted will your town’s air be? Pollution Level Forecasting
o Dataset: Air Quality dataset
 Demand/load forecasting
o Produce a short-term forecast of the electricity consumption of a single home
o Dataset: Electricity consumption of a household
 Predict Blood Donation
o We’re interested in predicting if a blood donor will donate within a given time
window.
o More on the problem statement at Driven Data.
o Dataset: UCI ML Datasets Repo
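For the univariate forecasting ideas in this section (rainfall, household electricity), it helps to have a simple baseline before trying heavier models. The sketch below implements simple exponential smoothing for a one-step-ahead forecast. The function name, the default `alpha`, and the sample values are assumptions chosen for illustration.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """One-step-ahead forecast using simple exponential smoothing.

    The running level is a weighted average of the newest observation
    (weight alpha) and the previous level (weight 1 - alpha).
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical yearly totals (assumption, not real rainfall data).
history = [10.0, 20.0, 30.0]
forecast = exponential_smoothing_forecast(history, alpha=0.5)
```

With `alpha=1` the forecast reduces to the last observed value (the naive forecast), which makes a convenient sanity check for any more elaborate model.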
Recommendation systems
 Movie Recommender
o Can you predict the rating a user will give a movie?
o Do this using the movies that user has rated in the past, as well as the ratings
similar users have given similar movies.
o Dataset: Netflix Prize and MovieLens Datasets
 Search + Recommendation System
o Predict which Xbox game a visitor will be most interested in based on their search
query
o Dataset: BestBuy
 Can you predict Influencers in the Social Network?
o How can you predict social influencers?
o Dataset: PeerIndex
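The movie recommender idea in this section predicts a rating from the user's own history and from the ratings of similar users. One common way to do this is user-based collaborative filtering. The sketch below is a minimal version of that technique, assuming ratings are stored as nested dictionaries. The users, items, and scores are invented and are not from Netflix Prize or MovieLens.

```python
import math


def cosine(ratings_u, ratings_v):
    """Cosine similarity computed over the items both users have rated."""
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item.

    Returns None when no other user with nonzero similarity rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        weight_total += abs(similarity)
    return weighted_sum / weight_total if weight_total else None


# Hypothetical ratings on a 1-5 scale (assumption, for illustration only).
ratings = {
    "ann": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "cy": {"C": 2, "D": 1},
}
prediction = predict_rating(ratings, "ann", "C")
```

Because raw cosine similarity on positive ratings is never negative, many projects subtract each user's mean rating first; that refinement is left out to keep the sketch short.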
Vision
 Image classification
o Object recognition, or image classification, is the task that drove Deep Learning’s
present-day resurgence
o Datasets:
 CIFAR-10
 ImageNet
 MS COCO is the modern replacement for the ImageNet challenge
 MNIST Handwritten Digit Classification Challenge is the classic entry
point
 Character recognition (digits) is the good old Optical Character
Recognition problem
 Bird Species Identification from an Image using the Caltech-UCSD Birds
dataset
o Diagnosing and Segmenting Brain Tumours and Phenotypes using MRI Scans
 Dataset: MICCAI Machine Learning Challenge aka MLC 2014
o Identify endangered right whales in aerial photographs
 Dataset: NOAA Right Whale
o Can computer vision spot distracted drivers?
 Dataset: State Farm Distracted Driver Detection on Kaggle
 Bone X-Ray competition
o Can you automatically identify whether a hand is broken from X-ray radiographs with
better than human performance?
o Stanford’s Bone XRay Deep Learning Competition with MURA Dataset
 Image Captioning
o Can you caption/explain the photo the way a human would?
o Dataset: MS COCO
 Image Segmentation/Object Detection
o Can you extract an object of interest from an image?
o Dataset: MS COCO, Carvana Image Masking Challenge on Kaggle
 Large-Scale Video Understanding
o Can you produce the best video tag predictions?
o Dataset: YouTube 8M
 Video Summarization
o Can you select the semantically relevant/important parts from the video?
o Example: Fast-Forward Video Based on Semantic Extraction
o Dataset: No standard dataset or agreed-upon metrics that we know of;
YouTube 8M might be a good starting point.
 Style Transfer
o Can you recompose images in the style of other images?
o Dataset: fzliu on GitHub shared target and source images with results
 Chest XRay
o Can you detect if someone is sick from their chest XRay? Or guess their radiology
report?
o Dataset: MIMIC-CXR at Physionet
 Face Recognition
o Can you identify whose photo this is? Similar to Facebook’s photo tagging or
Apple’s Face ID
o Dataset: face-rec.org, or facedetection.com
 Clinical Diagnostics: Image Identification, classification & segmentation
o Can you help build open-source software for lung cancer detection to help
radiologists?
o Link: Concept to clinic challenge on DrivenData
 Satellite Imagery Processing for Socioeconomic Analysis
o Can you estimate the standard of living or energy consumption of a place from
nighttime satellite imagery?
o Reference for Project details: Stanford Poverty Estimation Project
 Satellite Imagery Processing for Automated Tagging
o Can you automatically tag satellite images with human features such as buildings,
roads, waterways and so on?
o Help reduce the manual effort of tagging satellite imagery: Kaggle Dataset by DSTL,
UK
Reinforcement Learning
 Deep Q Learning
o Can you make an AI play games and automate tasks by learning in an environment?
o Environments (the Reinforcement Learning counterpart of a dataset): OpenAI Gym
o T-REX Chrome Dino BOT Git Repo
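Deep Q-learning replaces a lookup table of action values with a neural network, but the underlying update rule is the same as in tabular Q-learning. The sketch below shows that update on a hypothetical five-state corridor where the agent starts at the left end and is rewarded for reaching the right end. The environment, hyperparameters, and function names are assumptions made for illustration; this is not an OpenAI Gym environment.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal
ACTIONS = (-1, +1)    # index 0 moves left, index 1 moves right


def step(state, action):
    """Move along the corridor; reaching the last state ends the episode."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(len(ACTIONS))   # explore
            else:
                best = max(q[state])                   # exploit, random tie-break
                action = rng.choice(
                    [i for i, value in enumerate(q[state]) if value == best]
                )
            next_state, reward, done = step(state, ACTIONS[action])
            # Q-learning target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
greedy_policy = [max(range(len(ACTIONS)), key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
```

After training, `greedy_policy` holds the preferred action index for each non-terminal state. In a deep variant, `q[state]` would be the output of a network evaluated on the state, trained to regress toward the same target.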
Music
 Music/Audio Recommendation Systems
o Can you tell if two songs are similar using their sound or lyrics?
o Dataset: Million Songs Dataset and its 1% sample.
o Example: Anusha et al
 Music Genre recognition using neural networks
o Can you identify the musical genre using their spectrograms or other sound
information?
o Datasets: FMA or GTZAN on Keras
o Get started with Librosa for feature extraction
Other Dataset Suggestions
 UCI also has a collection of datasets sorted for various tasks (Classification,
Regression, etc)
 Data.gov: U.S. Government’s open data
 KDD Cup: http://www.kdd.org/kdd-cup, annual competition in data mining,
like Kaggle
 Google public datasets.
 NYC Taxi data for 2013 (FOILed by Chris Wong). 2013 Trip Data (11.0GB).
2013 Fare Data (7.7GB). Visualization for a day’s trip.
 Yahoo WebScope
 Freebase
 Yelp
 Numerous APIs from Google (e.g., Maps, Freebase, YouTube, etc.)
 Trulia, Zillow: real estate listing sites
 Numerous graph datasets (large and small): SNAP, Konect
 Movies data: Rotten Tomatoes, IMDB
 List of lists of datasets for recommendations.
 Million song dataset by Echo Nest.
It contains not only the basic information of songs (artist, genre, year, length
etc.), but also some musical features (like tempo, pitch, key, brightness).
 The Free ‘Big Data’ Sources Everyone Should Know
 Quandl – a dataset search engine for time-series data.
 Amazon AWS Public Data Sets (Thanks Jonathan!)
 Academic domain: Microsoft Academic Search, DBLP
 Retrosheet: MLB statistics (Game/Play logs)
 Classification datasets
 Various geophysical datasets for the oceans (magnetism, gravity, seismology,
etc).
 Social trends
 Beer data
 Academic torrents (terabytes)
 Article Search API from the New York Times (all the way back to 1851!)

COEN 140 Final Project solved
