W4242 Assignment 0

All the datasets mentioned in the assignment are available on the courseworks
website except the last one.
1. Consider the following 100 numbers:
364 142 865 945 453 556 602 78 784 562
197 589 34 27 338 19 431 678 73 378
524 810 84 646 666 457 100 833 929 91
730 790 916 770 996 357 435 310 698 816
116 651 532 970 552 297 268 332 175 271
751 124 696 275 564 112 169 998 64 864
592 63 412 270 535 114 450 792 39 910
413 565 537 209 370 233 96 557 471 467
261 23 762 775 741 199 786 127 276 662
60 362 240 327 874 746 81 859 133 629
Try sorting them by hand, then write a description of your sorting process.
Suppose another person will do exactly what you tell them to do; write
directions for that person to sort the numbers. Finally, write an R
script that reads the numbers (the file “sort data.txt”) and sorts them. How
does your algorithm compare in computational complexity with the selection sort
and merge sort algorithms discussed in class?
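As a point of comparison for the hand-written directions, here is a minimal sketch of selection sort. The assignment asks for an R script; Python is used here only to illustrate the logic, and the helper names `selection_sort` and `parse_numbers` are hypothetical. Selection sort performs on the order of n^2 comparisons, whereas merge sort needs on the order of n log n.

```python
def selection_sort(values):
    """Return a new list with the values in ascending order."""
    items = list(values)  # copy so the caller's list is left untouched
    for i in range(len(items) - 1):
        # Find the position of the smallest value in the unsorted tail.
        smallest = i
        for j in range(i + 1, len(items)):
            if items[j] < items[smallest]:
                smallest = j
        # Move that value to the front of the unsorted tail.
        items[i], items[smallest] = items[smallest], items[i]
    return items


def parse_numbers(text):
    """Turn whitespace-separated text into a list of integers."""
    return [int(token) for token in text.split()]


# First row of the table above, used as a small example.
first_row = parse_numbers("364 142 865 945 453 556 602 78 784 562")
print(selection_sort(first_row))
```

The same `parse_numbers` call would work on the full contents of the data file once it has been read into a string.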
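2. There are two matrices $X$, $Y$ and a vector $\beta$ as follows:
$$
X = \begin{pmatrix} -1 & -2 & 1 \\ 4 & 0 & -6 \\ -8 & -7 & 9 \end{pmatrix}, \quad
Y = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}, \quad
\beta = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}
$$
Find $XY$, $YX$, $X\beta$, $X^T$, $X^{-1}$, and the rank of $X$. Do these calculations by hand,
and then write R code to do them.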
3. Download the dataset “prostate data.csv” from courseworks. The data come
from a study of the relation between the level of prostate specific antigen (PSA)
and a number of clinical measures. For example, lcavol, lweight and lbph
represent cancer volume, prostate weight and benign prostatic hyperplasia amount,
respectively. Excluding the categorical variables “svi” and “gleason”, fit a simple
linear regression for each pair of the remaining seven variables. Plot the
data and the regression line for each pair. Try to combine all the plots into
one figure.
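The pairwise structure of this task can be sketched as follows, in Python rather than R and with plotting omitted. The sketch assumes every column other than the excluded ones is numeric, and it fits one regression per unordered pair (first variable as predictor), which is only one reading of "each pair". The helper names and the three toy rows are hypothetical; they are not taken from the real dataset.

```python
import csv
import io
from itertools import combinations


def fit_line(x, y):
    """Least squares slope and intercept for y regressed on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx
    return slope, y_bar - slope * x_bar


def pairwise_fits(csv_text, excluded=("svi", "gleason")):
    """Fit a line for every unordered pair of non-excluded columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = [name for name in rows[0] if name not in excluded]
    columns = {name: [float(row[name]) for row in rows] for name in names}
    return {(a, b): fit_line(columns[a], columns[b])
            for a, b in combinations(names, 2)}


# Hypothetical toy rows, only to show the shape of the result.
toy = "lcavol,lweight,svi,lbph\n1,3,0,5\n2,5,1,3\n3,7,0,1\n"
fits = pairwise_fits(toy)
for pair, (slope, intercept) in fits.items():
    print(pair, slope, intercept)
```

With seven variables this loop yields 21 fits, which maps naturally onto a grid of panels in a single figure.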
4. Recall the least squares estimator problem from the quiz. Now let's derive the
formula for it. Consider the simple linear regression model:
$$ y_i = a_0 x_i + b_0 + \epsilon_i \quad (1 \le i \le n) $$
The model says that the relation between $y_i$ and $x_i$ is almost linear
(perturbed by the noise $\epsilon_i$). Given the data $(x_i, y_i)_{i=1}^n$, how do we find
the straight line that best captures the true relation between $x_i$ and $y_i$
(that is, how do we estimate $a_0$ and $b_0$)? Intuitively, if there is no noise
term $\epsilon_i$, we can simply find $a$ and $b$ such that
$\sum_{i=1}^n (y_i - a x_i - b)^2 = 0$. In the presence of noise, we can estimate
$a$ and $b$ by minimizing $\sum_{i=1}^n (y_i - a x_i - b)^2$ as a denoising process.
The least squares estimator is exactly
$$ (\hat a, \hat b) = \arg\min_{a, b \in \mathbb{R}} \sum_{i=1}^n (y_i - a x_i - b)^2 $$
Prove that
$$ \hat a = \frac{\sum_{i=1}^n (x_i - \bar x)\, y_i}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat b = \bar y - \hat a \bar x $$
where $\bar x = \frac{1}{n} \sum_{i=1}^n x_i$ and $\bar y = \frac{1}{n} \sum_{i=1}^n y_i$.
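One possible route to the proof, sketched here rather than given in full, is to set both partial derivatives of the objective to zero. The name $S(a, b)$ for the objective is introduced only for this sketch, and the argument assumes the $x_i$ are not all equal, so that the denominator is nonzero.

```latex
\begin{align*}
S(a, b) &= \sum_{i=1}^n (y_i - a x_i - b)^2 \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^n (y_i - a x_i - b) = 0
  \;\Longrightarrow\; \hat b = \bar y - \hat a \bar x \\
\frac{\partial S}{\partial a} &= -2 \sum_{i=1}^n x_i (y_i - a x_i - b) = 0
  \;\Longrightarrow\; \sum_{i=1}^n x_i \bigl( (y_i - \bar y) - \hat a (x_i - \bar x) \bigr) = 0 \\
\hat a &= \frac{\sum_{i=1}^n x_i (y_i - \bar y)}{\sum_{i=1}^n x_i (x_i - \bar x)}
        = \frac{\sum_{i=1}^n (x_i - \bar x)\, y_i}{\sum_{i=1}^n (x_i - \bar x)^2}
\end{align*}
```

The last equality uses $\sum_{i=1}^n (x_i - \bar x) = 0$, which gives $\sum x_i (y_i - \bar y) = \sum (x_i - \bar x) y_i$ and $\sum x_i (x_i - \bar x) = \sum (x_i - \bar x)^2$. Because $S$ is a convex quadratic in $(a, b)$, this stationary point is a minimizer.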
5. Suppose that a tuberculosis (TB) skin test is 95 percent accurate. That is, if
the patient is TB-infected, then the test will be positive with probability 0.95,
and if the patient is not infected, then the test will be negative with probability
0.95. From a research study, we know that 1 in 1000 subjects in the population
is infected. Now suppose that a person takes the skin test. What is the
probability that this person gets a positive test result? If we know the
result is positive, what is the probability that the person is infected?
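The two questions correspond to the law of total probability and Bayes' rule. A minimal sketch, in Python for illustration, with hypothetical function and parameter names (`prevalence`, `sensitivity`, `specificity` are labels introduced here for the three numbers given above):

```python
def positive_rate(prevalence, sensitivity, specificity):
    """P(positive): true positives plus false positives."""
    return prevalence * sensitivity + (1 - prevalence) * (1 - specificity)


def posterior_infected(prevalence, sensitivity, specificity):
    """P(infected | positive) by Bayes' rule."""
    true_positive = prevalence * sensitivity
    return true_positive / positive_rate(prevalence, sensitivity, specificity)


# Numbers from the problem statement: 1 in 1000 infected, 0.95 both ways.
p_positive = positive_rate(0.001, 0.95, 0.95)
p_infected = posterior_infected(0.001, 0.95, 0.95)
print(round(p_positive, 4), round(p_infected, 4))
```

The posterior is small because false positives among the many uninfected people far outnumber true positives among the few infected ones.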
6. Go to the website http://www.kaggle.com. Find the competition named
“Titanic: Machine Learning from Disaster”. Download the file “train.csv”. Try
to summarize the dataset with plots and summary statistics. Report what you
find from this descriptive analysis.
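One kind of summary statistic is a rate broken down by group. The sketch below is in Python for illustration and assumes the file has columns named `Survived` (0 or 1) and `Sex`; check these against the header of the downloaded file. The function name and the five sample rows are hypothetical and are not real passenger records.

```python
import csv
import io
from collections import defaultdict


def survival_rate_by(rows, key):
    """Share of survivors within each distinct value of the given column."""
    counts = defaultdict(lambda: [0, 0])  # value -> [survivors, total]
    for row in rows:
        counts[row[key]][0] += int(row["Survived"])
        counts[row[key]][1] += 1
    return {value: survived / total
            for value, (survived, total) in counts.items()}


# Hypothetical rows standing in for the contents of the downloaded file.
sample = "Survived,Sex\n1,female\n0,male\n1,female\n0,female\n1,male\n"
rows = list(csv.DictReader(io.StringIO(sample)))
rates = survival_rate_by(rows, "Sex")
print(rates)
```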