ISTA 331 HOMEWORK 3: CURVE FITTING


Introduction. This homework is intended to refresh your knowledge of linear regression and introduce you
to more complicated curve-fitting. It is also intended to refresh your knowledge of pandas, numpy, and
matplotlib.
Instructions. Create a module named hw3.py. Below is the spec for 9 functions. Implement them and
upload your module to the D2L Assignments folder.
Testing. Download hw3_test.py and auxiliary files and put them in the same folder as your hw3.py module.
Run it from the command line to see your current correctness score. Each of the first 8 functions is worth
12.5% of your correctness score. The ninth can only hurt you. You must plot the 5 curves and have a correct
legend. Details like color and legend placement don’t matter, but the curves must be correct. The data must
be markers, not a line (see the screenshot closeup). Missing/incorrect curves/legend are each a -3 deduction
from your hw grade. You can examine the test module in a text editor to understand better what your code
should do. The test module is part of the spec. The test file we will use to grade your program will be
different and may uncover failings in your work not evident upon testing with the provided file. Add any
necessary tests to make sure your code works in all cases.
Documentation. Your module must contain a header docstring containing your name, your section leader’s
name, the date, ISTA 331 Hw3, and a brief summary of the module. Each function must contain a docstring.
Each function docstring should include a description of the function’s purpose, the name, type, and purpose
of each parameter, and the type and meaning of the function’s return value.
Grading. Your module will be graded on correctness, documentation, and coding style. Code should be
clear and concise. You will only lose style points if your code is a real mess. Include inline comments to
explain tricky lines and summarize sections of code.
Collaboration. Collaboration is allowed. You are responsible for your learning. Depending too much on
others will hurt you on the tests. “Helping” others too much harms them in reality. Cite any sources or
collaborators in your header docstring. Leaving this out is dishonest.
Resources. Some links that may be useful:
• https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html
• http://www.statsmodels.org/stable/examples/notebooks/generated/ols.html
• http://www.statsmodels.org/stable/examples/index.html
• https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
• http://aa.usno.navy.mil/cgi-bin/aa_rstablew.pl?ID=AA&year=2018&task=0&state=AZ&place=Tucson
• https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
• https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
Function specifications.
(1) read_frame. This function should read the file sunrise_sunset.csv into a DataFrame with two
columns for each month. The columns should be named Jan_r, Jan_s, Feb_r, Feb_s, …, and
contain the sunrise and sunset times for each day of the corresponding month. Keep the data
type as str; don’t let pandas convert it to float or int.
For reference, the upper left corner of the CSV file looks like this:
The data frame should look like this:
(2) get_daylength_series. This function should take the data frame produced by read_frame as an
argument and return a Series containing the length of each day in the data frame, indexed from 1
to 365 (we ignore the month).
Hint/suggestion: concatenate (using pd.concat) the appropriate columns from the data frame
into a Series containing all of the sunrise times, and another Series containing all of the sunset
times. You will have to clean the data by removing NaNs (which occur at dates that don’t exist, like
Apr 31) and converting the time strings (in hhmm format) into raw minutes. Then subtract.
The data in the resulting Series should be integers.
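The cleaning steps in the hint can be sketched on a toy frame. This is only an illustration under stated assumptions: the frame below is invented (two months, three rows), the column names are written with underscores, and the helper name hhmm_to_minutes is hypothetical. The real data and the test module decide what your function must actually handle.

```python
import pandas as pd

# Toy stand-in for the frame from read_frame: times are "hhmm" strings.
# All values are invented; None marks a date that does not exist.
toy = pd.DataFrame({
    "Jan_r": ["0724", "0725", "0725"],
    "Jan_s": ["1730", "1731", "1732"],
    "Feb_r": ["0716", "0715", None],
    "Feb_s": ["1757", "1758", None],
})


def hhmm_to_minutes(times):
    """Convert a Series of 'hhmm' strings to integer minutes after midnight."""
    hours = times.str[:-2].astype(int)
    minutes = times.str[-2:].astype(int)
    return hours * 60 + minutes


# Stack the sunrise columns month by month and drop the missing dates,
# then do the same for the sunset columns.
rise = pd.concat([toy["Jan_r"], toy["Feb_r"]]).dropna()
sset = pd.concat([toy["Jan_s"], toy["Feb_s"]]).dropna()

# Series arithmetic aligns on the index, so renumber both before subtracting.
rise_min = hhmm_to_minutes(rise).reset_index(drop=True)
set_min = hhmm_to_minutes(sset).reset_index(drop=True)

daylength = set_min - rise_min
daylength.index = range(1, len(daylength) + 1)  # day numbers start at 1
```

With the full frame, the same pattern would cover all twelve months, and the final index would run from 1 to 365.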
(3) best_fit_line. This function takes a Series of day lengths as an argument, fits a linear model
to it using statsmodels.OLS, and returns a tuple containing results.params, results.rsquared,
results.mse_resid**0.5, results.fvalue, results.f_pvalue (here, results is the RegressionResults
object returned by fitting the statsmodels.OLS model; if you name this object something different,
make sure to return the right values).
(4) best_fit_parabola. This function does the same thing as the previous function, except it fits a
quadratic ($y = ax^2 + bx + c$) instead of a line.
(5) best_fit_cubic. Same as before, but this one fits a cubic ($y = ax^3 + bx^2 + cx + d$).
(6) r_squared. $R^2$, also called the coefficient of determination, is a common measure of goodness of fit,
i.e. a measure of how well a model fits a collection of observed values. This function takes a Series
and a function and returns $R^2$. The Series is the set of observations (the index is the x values
and the data is the corresponding y values). The function argument is the model. statsmodels
calculated $R^2$ for us in the models we have fit to our data so far, but we can’t use statsmodels to
fit a sine curve, so we are going to write a function to calculate it ourselves.
Use the following ingredients in the calculation:
• The total sum of squares (proportional to the sample variance of y):
$$SS_{tot} = \sum_{i=1}^{N} (y_i - \bar{y})^2$$
• The model sum of squares:
$$SS_{model} = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2$$
(remember $\hat{y}_i$ is the predicted value of y at $x = x_i$)
• The sum of squares of residuals:
$$SS_{res} = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
• Finally:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
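As a quick check of the formula, here is a minimal sketch that computes 1 minus the ratio of the residual sum of squares to the total sum of squares on four invented observations. The function name r_squared_sketch and the data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def r_squared_sketch(observed, model):
    """Return 1 - SS_res / SS_tot for a Series of observations and a model."""
    x = observed.index.to_numpy(dtype=float)   # x values live in the index
    y = observed.to_numpy(dtype=float)         # y values are the data
    y_hat = model(x)                           # predictions at each x
    ss_res = np.sum((y_hat - y) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot


# Invented observations that lie exactly on y = 2x + 1.
obs = pd.Series([3.0, 5.0, 7.0, 9.0], index=[1, 2, 3, 4])

perfect = r_squared_sketch(obs, lambda x: 2 * x + 1)           # exact model
flat = r_squared_sketch(obs, lambda x: np.full_like(x, 6.0))   # predicts the mean
```

A model that reproduces every observation gives a value of 1, and a model that always predicts the mean of y gives 0, which is a handy sanity check for your own implementation.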
(7) best_fit_sine. This one fits a sine of the form $y = a \sin(bx + c) + d$. As we saw in class, we can’t
use statsmodels to do this directly because two of the parameters that we want to optimize (b and c)
sit inside the sine rather than acting as linear coefficients, and OLS can only estimate linear
coefficients. So we will use scipy.optimize.curve_fit().
You may refer to the Jupyter notebook curve_fitting_2.ipynb for an example; the application of
curve_fit there is very similar.
In addition to the required arguments, we need to pass in starting estimates for [a, b, c, d].
Otherwise, curve_fit will start them all as 1 by default, and that is too far away from the correct
values for the optimization to find the best solution. You will have to examine the data to choose
reasonable starting estimates. They don’t have to be perfect, just vaguely close.
curve_fit returns two values. The first one is the optimized parameters. If this value is stored
in a variable called popt, use this syntax to define a function to pass to your r_squared function:
f = lambda x: popt[0] * np.sin(popt[1] * x + popt[2]) + popt[3]
Lambdas are one-line, “anonymous” functions.[1] They are very handy in situations like this. Alternatively, you can define a model function explicitly, similar to what is done in the in-class notebook.
Return all of the same info as in the previous model derivation functions. For the F-statistic and
the F-statistic p-value, return 813774.14839414635 and 0.0, respectively.
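To make the role of the starting estimates concrete, here is a hedged sketch on synthetic, noise-free data. The "true" parameters, the name sine_model, and the way the guesses are read off the data are all assumptions for illustration; the real day lengths will call for your own estimates.

```python
import numpy as np
from scipy.optimize import curve_fit


def sine_model(x, a, b, c, d):
    """Sine with amplitude a, angular frequency b, phase c, and offset d."""
    return a * np.sin(b * x + c) + d


# Synthetic stand-in for a year of day lengths (values invented).
x = np.arange(1, 366, dtype=float)
true_params = (100.0, 2 * np.pi / 365, -1.4, 730.0)
y = sine_model(x, *true_params)

# Rough starting estimates read off the data, not tuned:
a0 = (y.max() - y.min()) / 2   # half the peak-to-trough range
b0 = 2 * np.pi / 365           # one cycle per year
c0 = -np.pi / 2                # curve starts near its minimum
d0 = y.mean()                  # vertical center of the curve

popt, pcov = curve_fit(sine_model, x, y, p0=[a0, b0, c0, d0])

# Model function built from the fitted parameters, as in the spec.
f = lambda x: popt[0] * np.sin(popt[1] * x + popt[2]) + popt[3]
```

The frequency guess matters most: a value near one cycle per year keeps the optimizer in the right basin, whereas the default start of 1 corresponds to a cycle of roughly six days.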
(8) get_results_frame. This function takes a daylength Series and returns a data frame containing
the coefficients, $R^2$, RMSE, F-statistic, and ANOVA p-value for each of the four models above. The
frame should look like this:
(9) make_plot. This function takes a daylength Series and a results frame and creates the following
image (the second image is a closeup of part of the first, to illustrate that the data should be plotted
as points and the fitted models as curves). Your function should end with plt.show() (this ensures
that the plot is displayed on the screen).
The plot should look something like this:
[1] This one isn’t really “anonymous” because we assigned it a name, but we used the same syntax.
Zooming in, you should be able to see that the data are plotted as points and the models as
curves:
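The markers-versus-curves distinction comes down to the format string passed to plot. The sketch below is only an illustration with invented data and a single fitted line, not the required five-curve figure. It selects the non-interactive Agg backend so that it runs without a display and therefore omits plt.show(); your function should use the default backend and end with plt.show() as specified.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Invented stand-in for the daylength data.
x = np.arange(1, 366)
y = 730 + 100 * np.sin(2 * np.pi * x / 365 - 1.4)

# A straight-line fit, used here only to have one model curve to draw.
coeffs = np.polyfit(x, y, 1)

fig, ax = plt.subplots()
ax.plot(x, y, "o", markersize=2, label="data")           # markers only, no line
ax.plot(x, np.polyval(coeffs, x), "-", label="linear")   # fitted model as a curve
ax.legend()
```

Each additional model would be one more ax.plot call with a line format and its own label.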