Description
Linear Regression on Lake Mendota Ice

The Wisconsin State Climatology Office keeps a record of the number of days Lake Mendota was covered by ice at http://www.aos.wisc.edu/~sco/lakes/Mendota-ice.html.

1 Question 1: Data Curation

As with any real problem, the data is not as clean or as organized as one would like for machine learning. Curate a clean data set starting from year 1855-56 and ending in year 2020-21. We care about the following aspects of the data:

• x, the starting year: for 1855-56, x = 1855; for 2017-18, x = 2017; and so on.
• y, the number of ice days in that year. For Mendota in 1855-56, y = 118; for 2017-18, y = 94; and so on. Some years, such as 2001-02, have multiple freeze-thaw cycles; you should use the aggregated number of days. In the table, this appears as a year followed by a " under it. You'll notice exactly one of the two lines has a number for ice days. That number is the one you'll use as the y for that year. For example, for year 2001-02 your data point will be: x = 2001, y = 21.

Save your data set as “hw5.csv”. We gave you an example toy.csv with the correct format (but the numbers are fake). Your file should follow this standard format for “.csv” files. For example, the first 4 lines of your “hw5.csv” would be the following:

    year,days
    1855,118
    1856,151
    1857,121

Output: Create (possibly manually) a file named “hw5.csv”, as described above.

2 Question 2: Visualize Data

For this and the following questions, you need to write a Python program hw5.py. Your code should take as its argument the name of the csv file that you want to read (e.g., toy.csv or hw5.csv). To get the first argument you can use the following code:

    import sys
    sys.argv[1]  # this contains the first argument as a string

Your hw5.py then needs to produce the outputs in the order described in Question 2 through Question 6.

Plot year vs. ice days from your data set. You should save the plot as plot.jpg.
You can save the plot by using the following code:

    plt.savefig("plot.jpg")

For reference, we gave you the output plot.jpg for toy.csv.

Homework 5 CS540 Spring 2022

Note: You do not need to fully match the plot style, but should have x-axis labels, y-axis labels, and the curve.

Output: Your “hw5.py” would need to produce a “plot.jpg” file.

3 Question 3: Linear Regression

Using the whole data set as the training set, train a linear regression model:

    \[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x. \]

Recall, this means finding the closed-form MLE solution for $\beta = (\beta_0, \beta_1)^\top$:

    \[ \hat{\beta} = \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i^\top \beta - y_i)^2. \]

To write the solution in matrix form, we first augment the feature vector:

    \[ \mathbf{x}_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}. \]

Note boldface $\mathbf{x}_i$ is a vector while $x_i$ is a scalar. Now organize the features in an $n \times 2$ array, where $n$ is the number of data points (roughly the number of rows in your computed csv):

    \[ X = \begin{pmatrix} \mathbf{x}_1^\top \\ \vdots \\ \mathbf{x}_n^\top \end{pmatrix}. \]

And the y values into a vector:

    \[ Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}. \]

The MLE solution can be written as

    \[ \hat{\beta} = \arg\min_{\beta} \| X\beta - Y \|^2. \]

By setting the gradient to zero, we arrive at the closed-form MLE:

    \[ \hat{\beta} = (X^\top X)^{-1} X^\top Y. \]

This involves the inverse of $X^\top X$, which for our problem is invertible. Your program should compute $\hat{\beta}$ as specified here. You should break down this process into several steps as follows:

3.1 Q3a:

Represent the data as a matrix X, which will have dimension $n \times 2$. Recall that for each individual data point $x_i$, you should transform the point into a full feature vector

    \[ \mathbf{x}_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}. \]

Then, you make each feature vector become a row of the overall data matrix X:

    \[ X = \begin{pmatrix} \mathbf{x}_1^\top \\ \vdots \\ \mathbf{x}_n^\top \end{pmatrix}. \]

Your output for this section is formed by printing out X as follows:

    print("Q3a:")
    print(X)

3.2 Q3b:

Next, you need to place all the corresponding $y_i$ values into a vector

    \[ Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}. \]

Your output for this section is formed by printing out Y as follows:

    print("Q3b:")
    print(Y)

3.3 Q3c:

Next, compute the matrix product $Z = X^\top X$.
Your output for this section is formed by printing out Z as follows:

    print("Q3c:")
    print(Z)

3.4 Q3d:

Next, compute the inverse of $X^\top X$, which we call I. Your output for this section is formed by printing out I as follows:

    print("Q3d:")
    print(I)

3.5 Q3e:

Next, compute what we call the pseudo-inverse of X, which we call PI. Mathematically,

    \[ PI = (X^\top X)^{-1} X^\top. \]

Your output for this section is formed by printing out PI as follows:

    print("Q3e:")
    print(PI)

3.6 Q3f:

Lastly, compute $\hat{\beta}$ using the results from the previous parts. Recall,

    \[ \hat{\beta} = (X^\top X)^{-1} X^\top Y. \]

Your output for this section is formed by printing out $\hat{\beta}$ as follows:

    print("Q3f:")
    print(hat_beta)

Q3 Summary

Output: Your program should output the matrices $X$, $Y$, $X^\top X$, $(X^\top X)^{-1}$, $(X^\top X)^{-1} X^\top$, and $\hat{\beta}$. If you do each part correctly, your code will print out the answers in the following format, where each variable will be replaced with the actual computed value:

    Q3a:
    $X$
    Q3b:
    $Y$
    Q3c:
    $X^\top X$
    Q3d:
    $(X^\top X)^{-1}$
    Q3e:
    $(X^\top X)^{-1} X^\top$
    Q3f:
    $\hat{\beta}$

We gave you the complete sample output for toy.csv at the end of this file. You should test your code on this file.

4 Question 4: Prediction

Using your $\hat{\beta}$, predict the number of ice days for winter 2021-22. Equivalently, we have a test item $x_{test} = 2021$ and you should predict $\hat{y}_{test} = \hat{\beta}_0 + \hat{\beta}_1 x_{test}$. The Wisconsin State Climatology Office does have official data for the number of 2021-22 ice days; you can see how close your prediction was. You may also use your model to predict the number of ice days for winter 2022-23 and see how close you are when the official number is released (likely sometime in March 2023).

Output: Print the following, in the same style as Q3 but without the new line:

    Q4: $\hat{y}_{test}$

You can do this with the following code:

    print("Q4: " + str(y_test))

You will use the same formatting to print the answers to the remaining questions.

5 Question 5: Model Interpretation

(a) What is the sign of your $\hat{\beta}_1$?
Print a symbol, which should be >, <, or =, depending on whether the sign is positive, negative, or zero.
(b) Interpret, in English, the meaning of this sign for Mendota ice. Print a short-answer explanation.

Output: Your program should print the following lines:

    Q5a: Symbol
    Q5b: Short Answer

6 Question 6: Model Limitation

(a) Given your MLE $\hat{\beta}$, predict the year $x^*$ by which Lake Mendota will no longer freeze. That is, $0 = \hat{\beta}_0 + \hat{\beta}_1 x^*$. Note $x^*$ will in general be a real number instead of an integer.
(b) Discuss whether $x^*$ is a compelling prediction based on the trends in the data, and why.

Output: Your program should print the following lines:

    Q6a: $x^*$
    Q6b: Answer

where Answer should be replaced with your answer for part (b).

Summary

You will create two files named hw5.csv and hw5.py that you will zip into one zip file, hw5.zip.

hw5.csv: This will be a csv file containing two column headers, “year” and “days”, in that order. Then, each row will contain the corresponding x and y defined in Question 1, in the same order as they appear in the original Mendota data set, starting with year 1855-56.

hw5.py: This will be a Python file that produces a plot for Q2 (save it; do not plt.show() it!) and then prints out the answers to the following questions:

    Q3a: Compute $X$
    Q3b: Compute $Y$
    Q3c: Compute $X^\top X$
    Q3d: Compute $(X^\top X)^{-1}$
    Q3e: Compute $(X^\top X)^{-1} X^\top$
    Q3f: Compute $\hat{\beta}$
    Q4: Compute a prediction for year $x_{test} = 2021$ using your linear regression model.
    Q5a: Compute the sign of $\hat{\beta}_1$.
    Q5b: Explain what this sign could mean.
    Q6a: Solve the equation $0 = \hat{\beta}_0 + \hat{\beta}_1 x^*$ for $x^*$.
    Q6b: Discuss whether this $x^*$ makes sense given what we see in the data trends.

Submission Details

• Please submit your files in a zip file named hw5.zip.
• Inside your zip file, there should be two files named hw5.csv and hw5.py.
• All code should be contained in functions or under an if __name__ == "__main__": block.
• Be sure to remove all debugging output before submission.
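Before zipping, it can help to confirm that a file matches the “year,days” format described above. The following is only a minimal sketch using Python's standard csv module; the function name load_rows is hypothetical, the sample text is just the first lines quoted in Question 1 (not the full data set), and the increasing-year check is an assumption based on the row order of the original table.

```python
import csv
import io


def load_rows(text):
    """Parse CSV text with a 'year,days' header into (year, days) integer pairs."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != ["year", "days"]:
        raise ValueError(f"unexpected header: {header}")
    # Skip blank lines; int() raises ValueError on non-numeric cells.
    rows = [(int(r[0]), int(r[1])) for r in reader if r]
    # Assumption: years are strictly increasing, as in the original table.
    years = [year for year, _ in rows]
    if years != sorted(set(years)):
        raise ValueError("years are not strictly increasing")
    return rows


# First lines of hw5.csv as quoted in Question 1 (not the full data set).
sample = "year,days\n1855,118\n1856,151\n1857,121\n"
rows = load_rows(sample)
print(rows)
```

To check a real file, the text would instead come from reading the path given in sys.argv[1].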
Example Input/Output

toy.csv:

    year,days
    1800,120
    1801,155
    1802,99

Run parameters:

    python3 hw5.py toy.csv

Output (note plot.jpg should be produced but is not shown here):

    Q3a:
    [[   1 1800]
     [   1 1801]
     [   1 1802]]
    Q3b:
    [120 155  99]
    Q3c:
    [[      3    5403]
     [   5403 9730805]]
    Q3d:
    [[ 1.62180083e+06 -9.00500000e+02]
     [-9.00500000e+02  5.00000000e-01]]
    Q3e:
    [[ 9.00833334e+02  3.33333333e-01 -9.00166667e+02]
     [-5.00000000e-01  0.00000000e+00  5.00000000e-01]]
    Q3f:
    [ 1.90351667e+04 -1.05000000e+01]
    Q4: -2185.3333340429162
    Q5a: <
    Q5b: Answer
    Q6a: 1812.8730158716883
    Q6b: Answer
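The closed-form steps from Question 3 through Question 6 can be traced on the toy.csv values with a short NumPy sketch. This is one possible illustration, not a reference solution: the toy values are hard-coded rather than read from sys.argv[1], plotting and the required "Q3a:"-style printing are omitted, and the names symbol and x_star are illustrative (the others follow the print statements in the assignment).

```python
import numpy as np

# toy.csv values from the example (hard-coded here instead of read from a file)
years = np.array([1800, 1801, 1802])
days = np.array([120, 155, 99])

# Q3a / Q3b: prepend a column of ones to form the n x 2 matrix X; Y is the targets
X = np.column_stack((np.ones_like(years), years))
Y = days

# Q3c to Q3f: closed-form least squares, one step at a time
Z = X.T @ X              # X^T X
I = np.linalg.inv(Z)     # (X^T X)^{-1}
PI = I @ X.T             # (X^T X)^{-1} X^T
hat_beta = PI @ Y        # [beta_0, beta_1]

# Q4: prediction for x_test = 2021
x_test = 2021
y_test = hat_beta[0] + hat_beta[1] * x_test

# Q5a: sign of the slope
symbol = ">" if hat_beta[1] > 0 else "<" if hat_beta[1] < 0 else "="

# Q6a: solve 0 = beta_0 + beta_1 * x_star for x_star
x_star = -hat_beta[0] / hat_beta[1]

print(hat_beta, y_test, symbol, x_star)
```

On the toy data these values should agree with the sample output up to floating-point rounding; because $X^\top X$ is poorly conditioned when years are used unscaled, the last few printed digits may differ between machines.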