Description
Learning Objectives
The purpose of this assignment is to provide a refresher on fundamental concepts that we will use throughout this course and to provide an opportunity to develop any related skills that may be unfamiliar to you. Through the course of completing this assignment, you will…
- Refresh your knowledge of probability theory, including properties of random variables, probability density functions, cumulative distribution functions, and key statistics such as mean and variance.
- Revisit common linear algebra and matrix operations and concepts such as matrix multiplication, inner and outer products, inverses, the Hadamard (element-wise) product, eigenvalues and eigenvectors, orthogonality, and symmetry.
- Practice numerical programming, core to machine learning, by loading and filtering data, plotting data, vectorizing operations, profiling code speed, and debugging and optimizing performance. You will also practice computing probabilities based on simulation.
- Develop or refresh your knowledge of Git version control, which will be a core tool used in the final project of this course.
- Apply all of these skills together through an exploratory data analysis to practice data cleaning, data manipulation, interpretation, and communication.
We will build on these concepts throughout the course, so use this assignment as a catalyst to deepen your knowledge and seek help with anything unfamiliar.
If references on these topics would be helpful, I recommend the following resources:
- Mathematics for Machine Learning by Deisenroth, Faisal, and Ong
- Deep Learning; Part I: Applied Math and Machine Learning Basics by Goodfellow, Bengio, and Courville
- The Matrix Calculus You Need For Deep Learning by Parr and Howard
- Dive Into Deep Learning; Appendix: Mathematics for Deep Learning by Werness, Hu, et al.
Note: don’t worry if you don’t understand everything in the references above – some of these books dive into significant minutiae of each of these topics.
Probability and Statistics Theory
Note: for all assignments, write out equations and math using markdown and LaTeX. For this assignment, show ALL math work for questions 1-4, meaning that you should include any intermediate steps necessary to understand the logic of your solution.
1
[3 points]
Let
$$f(x)=\begin{cases}0 & x<0\\ \alpha x^{2} & 0\le x\le 2\\ 0 & 2<x\end{cases}$$
For what value of $\alpha$ is $f(x)$ a valid probability density function?
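As a reminder, a valid PDF must satisfy two conditions:

$$f(x)\ge 0 \text{ for all } x, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$$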
ANSWER
2
[3 points] What is the cumulative distribution function (CDF) that corresponds to the following probability density function? Please state the value of the CDF for all possible values of $x$.
$$f(x)=\begin{cases}\frac{1}{3} & 0<x<3\\ 0 & \text{otherwise}\end{cases}$$
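Recall that the CDF is obtained from the PDF by integration:

$$F(x)=P(X\le x)=\int_{-\infty}^{x} f(t)\,dt$$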
ANSWER
3
[6 points] For the probability density function of the random variable $X$,
$$f(x)=\begin{cases}\frac{1}{3} & 0<x<3\\ 0 & \text{otherwise}\end{cases}$$
what are the (a) expected value and (b) variance of $X$? Show all work.
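Recall the definitions for a continuous random variable, which may help structure your work:

$$E[X]=\int_{-\infty}^{\infty} x\,f(x)\,dx, \qquad \mathrm{Var}(X)=E\!\left[(X-E[X])^2\right]=E[X^2]-\left(E[X]\right)^2$$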
ANSWER
4
[6 points] Consider the following table of data that provides the values of a discrete data vector $\mathbf{x}$ of samples from the random variable $X$, where each entry in $\mathbf{x}$ is given as $x_i$.
Table 1. Dataset of $N=5$ observations

|  | $x_0$ | $x_1$ | $x_2$ | $x_3$ | $x_4$ |
|---|---|---|---|---|---|
| $\mathbf{x}$ | 2 | 3 | 10 | -1 | -1 |
What are the (a) mean and (b) variance of the data?
Show all work. Your answer should include the definitions of mean and variance in the context of discrete data. In this case, use the sample variance, since the sample size is quite small.
ANSWER
Linear Algebra
5
[5 points] A common task in machine learning is a change of basis: transforming the representation of our data from one space to another. A prime example of this is the process of dimensionality reduction, as in Principal Components Analysis, where we often seek to transform our data from one space (of dimension $n$) to a new space (of dimension $m$) where $m<n$. Assume we have a sample of data of dimension $n=4$ (as shown below) and we want to transform it into a space of dimension $m=2$.
$$x=\begin{bmatrix}x_1\\ x_2\\ x_3\\ x_4\end{bmatrix}$$
(a) What are the dimensions of a matrix, $A$, that would linearly transform our sample of data, $x$, into a space of $m=2$ through the operation $Ax$?
(b) Express this transformation in terms of the components of $x$ ($x_1$, $x_2$, $x_3$, $x_4$) and the matrix $A$, where each entry in the matrix is denoted as $a_{i,j}$ (e.g. the entry in the first row and second column would be $a_{1,2}$). Your answer will be in the form of a matrix expressing the result of the product $Ax$.
Note: please write your answers here in LaTeX
ANSWER
6
[14 points] Matrix manipulations and multiplication. Machine learning involves working with many matrices, so this exercise will provide you with the opportunity to practice those skills.
Let $A=\begin{bmatrix}1 & 2 & 3\\ 2 & 4 & 5\\ 3 & 5 & 6\end{bmatrix}$, $b=\begin{bmatrix}-1\\ 3\\ 8\end{bmatrix}$, $c=\begin{bmatrix}4\\ -3\\ 6\end{bmatrix}$, and $I=\begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{bmatrix}$
Compute each of the following using Python, or indicate that it cannot be computed. Refer to NumPy’s tools for handling matrices. While all answers should be computed using Python, your response as to whether each item can be computed should refer to the underlying linear algebra: there may be circumstances when Python produces an output even though, based on the dimensions of the matrices involved, the linear algebra operation is not possible. For each operation that is invalid, explain why.
When the quantity can be computed, please provide both the Python code AND the output of that code (this need not be in LaTeX).
- $AA$
- $AA^T$
- $Ab$
- $Ab^T$
- $bA$
- $b^TA$
- $bb$
- $b^Tb$
- $bb^T$
- $b+c^T$
- $b^Tb^T$
- $A^{-1}b$
- $A \circ A$
- $b \circ c$
Note: The element-wise (or Hadamard) product is the product of each element in one matrix with the corresponding element in another matrix, and is represented by the symbol “$\circ$”.
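If helpful as a starting point, here is a minimal NumPy sketch of how these quantities might be set up and computed. Treating $b$ and $c$ as column vectors of shape (3, 1) is an assumption consistent with how they are written above; which of the listed operations are valid is left for your answer.

```python
import numpy as np

# Matrix and vectors from the problem statement.
A = np.array([[1, 2, 3],
              [2, 4, 5],
              [3, 5, 6]])
b = np.array([[-1], [3], [8]])   # column vector, shape (3, 1)
c = np.array([[4], [-3], [6]])   # column vector, shape (3, 1)

# A few representative operations:
print(A @ A)                 # matrix-matrix product, AA
print(A @ b)                 # matrix-vector product, Ab
print(b.T @ b)               # inner product b^T b (a 1x1 result)
print(np.linalg.inv(A) @ b)  # A^{-1} b, provided A is invertible
print(A * A)                 # Hadamard (element-wise) product, A ∘ A
```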
ANSWER
7
[8 points] Eigenvectors and eigenvalues. Eigenvectors and eigenvalues are useful for some machine learning algorithms, but the concepts take time to solidly grasp. They are used extensively in machine learning, and in this course we will encounter them in relation to Principal Components Analysis (PCA) and clustering algorithms. For an intuitive review of these concepts, explore this interactive website at Setosa.io. Also, the series of linear algebra videos by Grant Sanderson of 3Blue1Brown is excellent and can be viewed on YouTube here. For these questions, numpy may once again be helpful.
- Calculate the eigenvalues and corresponding eigenvectors of the matrix $A$ above, from the last question.
- Choose one of the eigenvector/eigenvalue pairs, $v$ and $\lambda$, and show that $Av=\lambda v$. This relationship extends to higher orders: $AAv=\lambda^2 v$.
- Show that the eigenvectors are orthogonal to one another (i.e., their inner product is zero); this holds for eigenvectors of real, symmetric matrices. In three dimensions or fewer, this means that the eigenvectors are perpendicular to each other. Typically we use the orthogonal basis of our standard x, y, and z Cartesian coordinates, which allows us, if we combine them linearly, to represent any point in 3D space; but any three orthogonal vectors can do the same. We will see this property used in PCA to identify the dimensions of greatest variation in our data when we discuss dimensionality reduction.
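A minimal sketch of how NumPy’s eigendecomposition might be applied here; `np.linalg.eig` returns the eigenvalues and a matrix whose columns are the corresponding eigenvectors:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 5],
              [3, 5, 6]])

# `values[i]` pairs with the eigenvector in column `vectors[:, i]`.
values, vectors = np.linalg.eig(A)

# Verify Av = lambda*v for the first pair (allowing for floating point error).
v, lam = vectors[:, 0], values[0]
print(np.allclose(A @ v, lam * v))

# Check orthogonality of two eigenvectors via their inner product
# (approximately zero, since A is real and symmetric).
print(vectors[:, 0] @ vectors[:, 1])
```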
ANSWER
Numerical Programming
8
[10 points] Loading data and gathering insights from a real dataset
In data science, we often need to develop a sense of the idiosyncrasies of the data, how they relate to the questions we are trying to answer, and how to use that information to determine what approach, such as machine learning, we may need to apply to achieve our goal. This exercise provides practice in exploring a dataset and answering questions that might arise from applications related to the data.
Data. The data for this problem can be found in the `data` subfolder in the `assignments` folder on github. The filename is `a1_egrid2016.xlsx`. This dataset is the Environmental Protection Agency’s (EPA) Emissions & Generation Resource Integrated Database (eGRID), containing information about all power plants in the United States: the amount of generation they produce, what fuel they use, the location of each plant, and many more quantities. We’ll be using a subset of those data.
The fields we’ll be using include:
| field | description |
|---|---|
| SEQPLT16 | eGRID2016 Plant file sequence number (the index) |
| PSTATABB | Plant state abbreviation |
| PNAME | Plant name |
| LAT | Plant latitude |
| LON | Plant longitude |
| PLPRMFL | Plant primary fuel |
| CAPFAC | Plant capacity factor |
| NAMEPCAP | Plant nameplate capacity (megawatts, MW) |
| PLNGENAN | Plant annual net generation (megawatt-hours, MWh) |
| PLCO2EQA | Plant annual CO2 equivalent emissions (tons) |
For more details on the data, you can refer to the eGRID technical documents. For example, you may want to review page 45 and the section “Plant Primary Fuel (PLPRMFL)”, which gives the full names of the fuel types, including WND for wind, NG for natural gas, BIT for bituminous coal, etc.
There also are a couple of “gotchas” to watch out for with this dataset:
- The headers are on the second row and you’ll want to ignore the first row (they’re more detailed descriptions of the headers).
- NaN values represent blanks in the data. These will appear regularly in real-world data, so getting experience working with these sorts of missing values will be important.
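A minimal loading sketch with pandas that accounts for both gotchas; the relative path is an assumption, so adjust it to wherever you store the file:

```python
import pandas as pd

# header=1 skips the first row of long-form descriptions and uses the
# second row as column names; blanks are read in as NaN automatically.
egrid = pd.read_excel('data/a1_egrid2016.xlsx', header=1)

print(egrid.shape)         # rows x columns
print(egrid.isna().sum())  # count of missing (NaN) values per column
```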
Your objective. For this dataset, your goal is to answer the following questions about electricity generation in the United States:
(a) Which plant has generated the most energy (measured in MWh)?
(b) What is the name of the northern-most power plant in the United States?
(c) What is the state where the northern-most power plant in the United States is located?
(d) Create a bar plot showing the amount of energy produced by each fuel type across all plants.
(e) From the plot in (d), which fuel type produces the most energy (MWh) in the United States?
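For part (d), one way the aggregation might be sketched with pandas and matplotlib, using the column names from the field table above; treat this as a starting point rather than the definitive approach:

```python
import pandas as pd
import matplotlib.pyplot as plt

egrid = pd.read_excel('data/a1_egrid2016.xlsx', header=1)

# Sum annual net generation (MWh) by primary fuel type; NaN entries are
# skipped by default, then plot as a bar chart.
by_fuel = egrid.groupby('PLPRMFL')['PLNGENAN'].sum().sort_values(ascending=False)
by_fuel.plot.bar()
plt.xlabel('Primary fuel type')
plt.ylabel('Annual net generation (MWh)')
plt.show()
```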
ANSWER
9
[6 points] Vectorization. When we first learn to code and think about iterating over an array, we often use loops. If implemented correctly, that does the trick. In machine learning, however, we iterate over so much data that those loops can lead to significant slowdowns if they are not computationally efficient. In Python, vectorizing code and relying on matrix operations with efficient tools like numpy is typically the faster approach. Of course, numpy relies on loops to complete the computation, but these run at a lower level of programming (typically in C) and are therefore much more efficient. This exercise will explore the benefits of vectorization. Since many machine learning techniques rely on matrix operations, it’s helpful to begin thinking about implementing algorithms using vector forms.
Begin by creating an array of 10 million random numbers using numpy’s `random.randn` function. Compute the sum of the squares of those random numbers, first in a for loop, then using NumPy’s `dot` function to perform an inner (dot) product. Time how long each takes to compute and report the results. How many times faster is the vectorized code than the for loop approach? (Note: your results may vary from run to run.)
Your output should use the `print()` function as follows (where the # symbols represent your answers, to a reasonable precision of 4-5 significant figures):
Time [sec] (non-vectorized): ######
Time [sec] (vectorized): ######
The vectorized code is ##### times faster than the nonvectorized code
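A minimal sketch of the timing pattern, using `time.perf_counter` as the timer (any wall-clock timer works):

```python
import time
import numpy as np

x = np.random.randn(10_000_000)  # 10 million standard normal samples

# Non-vectorized: accumulate the sum of squares in a Python loop.
start = time.perf_counter()
total = 0.0
for value in x:
    total += value * value
t_loop = time.perf_counter() - start

# Vectorized: the same quantity as an inner (dot) product.
start = time.perf_counter()
total_vec = np.dot(x, x)
t_vec = time.perf_counter() - start

print(f"Time [sec] (non-vectorized): {t_loop:.4f}")
print(f"Time [sec] (vectorized): {t_vec:.4f}")
print(f"The vectorized code is {t_loop / t_vec:.1f} times faster than the nonvectorized code")
```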
ANSWER
10
[10 points] This exercise will walk through some basic numerical programming and probabilistic thinking exercises, two skills which are frequently used in machine learning for answering questions from our data.
- Synthesize $n=10^4$ normally distributed data points with mean $\mu=2$ and a standard deviation of $\sigma=1$. Call these observations from a random variable $X$, and call the vector of observations that you generate $\mathbf{x}$.
- Calculate the mean and standard deviation of $\mathbf{x}$ to validate (1) and provide the result to a precision of four significant figures.
- Plot a histogram of the data in $\mathbf{x}$ with 30 bins.
- What is the 90th percentile of $\mathbf{x}$? The 90th percentile is the value below which 90% of observations can be found.
- What is the 99th percentile of $\mathbf{x}$?
- Now synthesize $n=10^4$ normally distributed data points with mean $\mu=0$ and a standard deviation of $\sigma=3$. Call these observations from a random variable $Y$, and call the vector of observations that you generate $\mathbf{y}$.
- Create a new figure and plot the histogram of the data in $\mathbf{y}$ on the same axes as the histogram of $\mathbf{x}$, so that both histograms can be seen and compared.
- Using the observations from $\mathbf{x}$ and $\mathbf{y}$, estimate $E[XY]$.
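A minimal sketch of the key tools involved: scaling and shifting `randn` output, `np.percentile`, and overlaid histograms (variable names are just suggestions):

```python
import numpy as np
import matplotlib.pyplot as plt

n = 10_000
x = 2 + 1 * np.random.randn(n)  # mean 2, standard deviation 1
y = 0 + 3 * np.random.randn(n)  # mean 0, standard deviation 3

print(f"mean = {x.mean():.4g}, std = {x.std():.4g}")
print(f"90th percentile = {np.percentile(x, 90):.4g}")

# Overlaid histograms; alpha < 1 keeps both visible on the same axes.
plt.hist(x, bins=30, alpha=0.5, label='x')
plt.hist(y, bins=30, alpha=0.5, label='y')
plt.legend()
plt.show()

# E[XY] can be estimated as the sample mean of the element-wise
# products of the paired observations.
print(f"E[XY] estimate = {np.mean(x * y):.4g}")
```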
ANSWER
Version Control via Git
11
[4 points] Git is efficient for collaboration, an expectation in industry, and one of the best ways to share results in academia. You can even use some Git repository hosts (e.g. Github) to serve websites, as with the course website. Proficiency with Git is expected of a data scientist with experience in machine learning. We will interact with Git repositories (a.k.a. repos) throughout this course, and your final project will require the use of Git repos for collaboration.
Complete the Atlassian Git tutorial, specifically the sections listed below. Try each concept that’s presented. For this tutorial, instead of using Bitbucket as your remote repository host, you may use your preferred platform, such as Github or Duke’s GitLab.
- What is version control
- What is Git
- Install Git
- Setting up a repository
- Saving changes
- Inspecting a repository
- Undoing changes
- Rewriting history
- Syncing
- Making a pull request
- Using branches
- Comparing workflows
I have also created two videos on the topic to help you understand some of these concepts: Git basics and a step-by-step tutorial.
For your answer, affirm that you either completed the tutorials above OR have previous experience with ALL of the concepts above. Confirm this by typing your name below and selecting the situation that applies from the two options in brackets.
ANSWER
I, [your name here], affirm that I have [completed the above tutorial / previous experience that covers all of the content in this tutorial]
Exploratory Data Analysis
12
[15 points] Here you’ll bring together some of the individual skills that you demonstrated above and create a Jupyter notebook-based blog post on your exploratory data analysis. Your goal is to identify a question or problem and to work towards solving it or providing additional information or evidence (data) related to it through your data analysis. Below, we walk through a process to follow for your analysis. Additionally, you can find an example of a well-done exploratory data analysis from past years here.
- Find a dataset that interests you and relates to a question or problem that you find intriguing.
- Describe the dataset, the source of the data, and the reason the dataset was of interest. Include a description of the features, data size, data creator and year of creation (if available), etc. What question are you hoping to answer through exploring the dataset?
- Check the data to see if they need to be cleaned: are there missing values? Are there clearly erroneous values? Do two tables need to be merged together? Clean the data so they can be visualized. If the data are already clean, state how you know they are clean (what did you check?). A brief sketch of common first-pass checks follows this list.
- Plot the data, demonstrating interesting features that you discover. Are there any relationships between variables that were surprising, or patterns that emerged? Please exercise creativity and curiosity in your plots. You should have at least ~3 plots exploring the data in different ways.
- What insights are you able to take away from exploring the data? Is there a reason why analyzing the dataset you chose is particularly interesting or important? Summarize this for a general audience (imagine you’re publishing a blog post online) – boil down your findings in a way that is accessible, but still accurate.
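A minimal sketch of common first-pass cleaning checks with pandas; the filename is a hypothetical placeholder:

```python
import pandas as pd

df = pd.read_csv('your_dataset.csv')  # hypothetical placeholder path

df.info()                     # column types and non-null counts
print(df.isna().sum())        # missing values per column
print(df.describe())          # value ranges help flag erroneous entries
print(df.duplicated().sum())  # number of exact duplicate rows
```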
Your analysis will be evaluated based on:
- Motivation: was the purpose of the choice of data clearly articulated? Why was the dataset chosen and what was the goal of the analysis?
- Data cleaning: were any issues with the data investigated and, if found, were they resolved?
- Quality of data exploration: were at least 4 unique plots included, and did those plots demonstrate interesting aspects of the data? Was there a clear purpose and takeaway from EACH plot?
- Interpretation: Were the insights revealed through the analysis and their potential implications clearly explained? Was there an overall conclusion to the analysis?