ECE 445: Machine Learning for Engineers Exercise #1 

1. Heart Failure Prediction Dataset
Our first dataset is the Heart Failure Prediction Dataset, which can hypothetically be used to determine the
likelihood of a death by heart failure event. A machine learning model trained on such a dataset can then potentially
be used by hospitals to assess the severity of illness in patients with cardiovascular diseases. You can read further about this
dataset at Kaggle using the link provided below. The dataset is stored as a csv file, which is also provided to
you as part of this exercise.
Dataset link: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data
Dataset csv filename: heart_failure_dataset.csv
1. Load the dataset from the csv file as follows (you are free to choose variable names of your liking):
import pandas as pd
heart_df = pd.read_csv('heart_failure_dataset.csv')
Note that the variable heart_df is of type pandas.DataFrame.
a) Print the shape, axes, and dtypes attributes of the heart_df dataframe. [1 point]
b) Print the first 10 rows of the heart_df dataframe using the pandas.DataFrame.head() function. [1 point]
c) What is each row of the heart_df dataframe termed in machine learning parlance? [1 point]
d) Based on your knowledge of the dataset and its stated usage, do you think we are dealing with an
unsupervised learning problem or a supervised learning problem? Justify your answer. [2 points]
e) How many independent variables (features, attributes, predictors, etc.) does this dataset have? List down
the names of these variables. [1 point]
f) How many dependent variables (if any) does this dataset have? List down the names of these variables.
[1 point]
g) How many of the variables in this dataset are categorical variables? List down the names of these
variables. [1 point]
h) What type of encoding do the categorical variables in the dataset follow? [1 point]
i) How many samples in the dataset correspond to deceased patients and how many samples correspond
to the remaining patients? [1 point]
j) How many samples in the dataset correspond to female patients and how many samples correspond to
male patients? [1 point]
k) How many samples in the dataset correspond to smokers and how many samples correspond to non-smokers? [1 point]
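The inspection steps in parts (a), (b), and (i) through (k) can be sketched on a small stand-in dataframe. The values below are invented for illustration and only a few columns are shown; the column names sex and smoking are assumed to match the header of the csv file.

```python
import pandas as pd

# Toy stand-in for heart_df; all values are invented for illustration only.
toy_df = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 65.0, 90.0],
    "sex": [1, 1, 1, 0, 0, 1],
    "smoking": [0, 0, 1, 0, 0, 1],
    "DEATH_EVENT": [1, 1, 1, 0, 0, 1],
})

print(toy_df.shape)    # (number of rows, number of columns)
print(toy_df.axes)     # [row index, column index]
print(toy_df.dtypes)   # one dtype per column
print(toy_df.head(3))  # first 3 rows; head(10) would give the first 10

# value_counts() tallies how often each distinct value occurs in a column.
death_counts = toy_df["DEATH_EVENT"].value_counts()
smoking_counts = toy_df["smoking"].value_counts()
print(death_counts)
print(smoking_counts)
```

The same value_counts() call applies to any binary column, so the counts for parts (i), (j), and (k) differ only in the column name.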
2. Compute pairwise correlations between variables in the dataset using the pandas.DataFrame.corr() function.
a) What two variables are the most positively correlated with the DEATH_EVENT variable? [1 point]
b) What two variables are the most negatively correlated with the DEATH_EVENT variable? [1 point]
c) Based on your knowledge of the dataset, why do you think it makes sense that the second-most positively
correlated variable with the DEATH_EVENT variable should have been positively correlated? [1 point]
d) Based on your knowledge of the dataset, why do you think it makes sense that the two most negatively
correlated variables with the DEATH_EVENT variable should have been negatively correlated? [2 points]
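One way to read off the variables most correlated with DEATH_EVENT is to take the DEATH_EVENT column of the correlation matrix and sort it. The sketch below uses invented toy values, and the column name ejection_fraction is an assumption about the csv header; it does not reproduce the correlations of the real dataset.

```python
import pandas as pd

# Toy data, invented for illustration; not the real dataset.
toy_df = pd.DataFrame({
    "age": [40, 50, 60, 70, 80, 90],
    "ejection_fraction": [60, 55, 40, 35, 25, 20],
    "DEATH_EVENT": [0, 0, 0, 1, 1, 1],
})

# Pairwise Pearson correlations between all columns.
corr_matrix = toy_df.corr()

# Correlation of every other variable with the target, largest first.
target_corr = (
    corr_matrix["DEATH_EVENT"]
    .drop("DEATH_EVENT")          # the target correlates perfectly with itself
    .sort_values(ascending=False)
)
print(target_corr)
```

The top of the sorted series holds the most positively correlated variables and the bottom holds the most negatively correlated ones.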
3. Write a commented code cell to validate that all entries in the dataset are ‘valid’ and have not been filled in
with inconsistent values. If the code finds any invalid values, it should convert them to NaN and store
the resulting dataframe as a csv file named heart_failure_dataset_NaNs.csv. Justify your logic by
explaining it in a markdown cell. [5 points]
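A possible shape for such a validation step is a set of per-column rules applied with Series.where(), which keeps entries that satisfy a condition and replaces the rest with NaN. The rules below (binary columns must be 0 or 1, age must lie in a plausible range) and the bounds themselves are hypothetical; what counts as ‘valid’ is a judgment the exercise asks you to justify.

```python
import pandas as pd

# Toy data with deliberately bad entries (invented for illustration).
toy_df = pd.DataFrame({
    "age": [60.0, -5.0, 72.0, 45.0],
    "smoking": [0, 1, 2, 1],
    "DEATH_EVENT": [1, 0, 0, 1],
})

# Hypothetical validity rules; the real bounds are a judgment call.
binary_cols = ["smoking", "DEATH_EVENT"]
range_rules = {"age": (0, 120)}

# Cast to float so that every column can hold NaN.
validated = toy_df.astype(float)

# Binary columns: anything other than 0 or 1 becomes NaN.
for col in binary_cols:
    validated[col] = validated[col].where(validated[col].isin([0, 1]))

# Numeric columns: values outside [low, high] become NaN.
for col, (low, high) in range_rules.items():
    validated[col] = validated[col].where(validated[col].between(low, high))

print(validated)
print(validated.isna().sum())  # number of invalidated entries per column
```

Writing the result out is then a single call such as validated.to_csv('heart_failure_dataset_NaNs.csv', index=False).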
4. Write commented code to process the validated and potentially NaN-converted dataframe so that each non-categorical
independent variable in the dataset has empirically zero mean and unit variance. This processing
should ignore any NaN entries in the dataframe. Print the first 20 rows of the processed dataframe. [5 points]
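Standardization subtracts each column's mean and divides by its standard deviation. A minimal sketch on invented values follows; the list of non-categorical columns is an assumption and would need to be chosen for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data with missing entries (invented for illustration).
toy_df = pd.DataFrame({
    "age": [50.0, 60.0, np.nan, 70.0],
    "serum_sodium": [130.0, np.nan, 140.0, 135.0],
    "smoking": [0, 1, 0, 1],  # categorical, left untouched
})

numeric_cols = ["age", "serum_sodium"]  # assumed non-categorical columns

scaled = toy_df.copy()
# mean() and std() skip NaN entries by default (skipna=True).
# ddof=0 selects the population standard deviation, so that the
# empirical variance of each scaled column is exactly 1.
means = scaled[numeric_cols].mean()
stds = scaled[numeric_cols].std(ddof=0)
scaled[numeric_cols] = (scaled[numeric_cols] - means) / stds

print(scaled.head(20))
```

Note that pandas defaults to ddof=1 (the sample standard deviation); with that default the scaled columns have unit sample variance rather than unit population variance. Either convention can be defended as long as it is stated.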
5. Write commented code to modify the processed dataframe so that the DEATH_EVENT variable is encoded
using one-hot encoding. The code must be written from scratch, i.e., you cannot use a library. Store the final
pre-processed dataframe as a csv file named heart_failure_dataset_processed.csv. [3 points]
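One-hot encoding replaces a categorical column with one indicator column per class. The sketch below reads "from scratch" as ruling out ready-made encoders such as pandas.get_dummies while still allowing ordinary dataframe operations; that reading is an assumption, as are the generated column names DEATH_EVENT_0 and DEATH_EVENT_1.

```python
import pandas as pd

# Toy data (invented for illustration).
toy_df = pd.DataFrame({
    "age": [60.0, 45.0, 72.0],
    "DEATH_EVENT": [1, 0, 1],
})

encoded = toy_df.copy()

# One new indicator column per class; exactly one of them is 1 in each row.
for label in sorted(toy_df["DEATH_EVENT"].unique()):
    encoded[f"DEATH_EVENT_{label}"] = (toy_df["DEATH_EVENT"] == label).astype(int)

# The original column is now redundant.
encoded = encoded.drop(columns="DEATH_EVENT")
print(encoded)
```

This sketch assumes the target column contains no NaN entries; rows with a missing label would need separate handling.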
2. Pima Indians Diabetes Dataset
Our second dataset is the Pima Indians Diabetes Dataset, which can hypothetically be used to predict the onset
of diabetes based on several diagnostic measures. A machine learning model trained on such a dataset can then
potentially be used by physicians to monitor their patients for early signs of diabetes. This dataset is peculiar in the
sense that all patients in it are adult females, at least 21 years old, of Pima Indian heritage. You can read further
about this dataset at Kaggle using the link provided below. This dataset is also stored as a csv file, which is being
provided to you as part of this exercise.
Dataset link: https://www.kaggle.com/uciml/pima-indians-diabetes-database
Dataset csv filename: diabetes_dataset.csv
1. Load the dataset from the csv file into a pandas dataframe.
a) Print the shape, axes, and dtypes attributes of the dataframe. [1 point]
b) Print the first 10 rows of the dataframe using the pandas.DataFrame.head() function. [1 point]
c) Based on your knowledge of the dataset and its stated usage, do you think we are dealing with an
unsupervised learning problem or a supervised learning problem? Justify your answer. [2 points]
d) How many independent variables (features, attributes, predictors, etc.) does this dataset have? List down
the names of these variables. [1 point]
e) How many dependent variables (if any) does this dataset have? List down the names of these variables.
[1 point]
f) How many of the variables in this dataset are categorical variables? List down the names of these
variables. [1 point]
g) What type of encoding do the categorical variables in the dataset follow? [1 point]
h) How many samples in the dataset correspond to the following age groups: [2 points]
Young adults (Ages 21–40)
Middle-aged adults (Ages 41–60)
Old-aged adults (Ages 61 and older)
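Counting samples per age group amounts to binning the age column and tallying the bins. A minimal sketch using pandas.cut on invented ages is given below; the column name Age is assumed to match the csv header.

```python
import pandas as pd

# Toy ages (invented for illustration).
toy_df = pd.DataFrame({"Age": [21, 33, 40, 41, 59, 60, 61, 81]})

# Bins are right-inclusive by default: (20, 40], (40, 60], (60, inf].
age_groups = pd.cut(
    toy_df["Age"],
    bins=[20, 40, 60, float("inf")],
    labels=["Young adults", "Middle-aged adults", "Old-aged adults"],
)

group_counts = age_groups.value_counts()
print(group_counts)
```

Boolean masks such as (toy_df["Age"] >= 41) & (toy_df["Age"] <= 60) followed by .sum() give the same counts and make the boundaries explicit.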
2. Write a commented code cell to validate that all entries in the dataset are ‘valid’ and have not been filled in
with inconsistent values. If the code finds any invalid values, it should convert them to NaN and print the first
20 rows of the processed dataframe. Justify your logic by explaining it in a markdown cell. [3 points]
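In this dataset, a zero in a physiological measurement such as glucose level or BMI is commonly treated as a placeholder for a missing value, whereas a zero pregnancy count is legitimate. The sketch below illustrates that idea on invented values; which columns belong in the list is an assumption to be justified in the markdown cell, and the column names are assumed to match the csv header.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration).
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BMI": [33.6, 26.6, 0.0],
    "Pregnancies": [6, 0, 8],  # zero is a legitimate value here
    "Outcome": [1, 0, 1],
})

# Assumed list of columns in which a zero cannot be a real measurement.
zero_invalid_cols = ["Glucose", "BMI"]

validated = toy_df.copy()
# Replace zeros with NaN only in the selected columns.
validated[zero_invalid_cols] = validated[zero_invalid_cols].replace(0, np.nan)

print(validated.head(20))
```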
3. Write commented code to process the validated and potentially NaN-converted dataframe so that each non-categorical
independent variable in the dataset has empirically zero mean and unit variance. This processing
should ignore any NaN entries in the dataframe. Print the first 20 rows of the processed dataframe. [2 points]
4. Write commented code to modify the processed dataframe so that the Outcome variable is encoded using
one-hot encoding. [1 point]
5. Replace the NaN values in the dataset for each variable with the empirical median of that variable, where only
the median corresponding to the same Outcome should be used for replacement purposes. Store the final
pre-processed dataframe as a csv file named diabetes_dataset_processed.csv. [4 points]