Description
About The Data

We'll be using the Breast Cancer Wisconsin (Diagnostic) Data Set from Kaggle for this lab, but feel free to follow along with your own dataset. The dataset contains a total of 32 columns, with the following attribute information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features computed for each cell nucleus:
   a) radius (mean of distances from center to points on the perimeter)
   b) texture (standard deviation of gray-scale values)
   c) perimeter
   d) area
   e) smoothness (local variation in radius lengths)
   f) compactness (perimeter^2 / area - 1.0)
   g) concavity (severity of concave portions of the contour)
   h) concave points (number of concave portions of the contour)
   i) symmetry
   j) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius. Our goal is to predict the diagnosis (benign or malignant).

Exploratory Data Analysis

Let's begin by importing some necessary libraries that we'll be using to explore the data. Our first step is to load the data into a pandas DataFrame. There's an odd column, "Unnamed: 32", which we'll go ahead and drop since it's full of NaN values. We also won't need the id label, so we can drop that as well. Since many of the features in this dataset are hard to interpret without domain knowledge of cancer or tumor cells, we'll only do a few visualizations here, but feel free to explore as much as you'd like before constructing a model.
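The loading and cleaning steps described above can be sketched as follows. Since the Kaggle `data.csv` may not be on hand, a tiny hand-made DataFrame stands in for it here; only the `Unnamed: 32` and `id` columns mirror the real file.

```python
import numpy as np
import pandas as pd

# Minimal stand-in for the Kaggle CSV: a trailing all-NaN column
# ("Unnamed: 32") and an "id" column, as described above.
df = pd.DataFrame({
    "id": [842302, 842517, 84300903],
    "diagnosis": ["M", "M", "M"],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Drop the all-NaN artifact column and the identifier, as in the lab.
df = df.drop(columns=["Unnamed: 32", "id"])
print(df.columns.tolist())  # ['diagnosis', 'radius_mean']
```

With the real file you would replace the DataFrame literal with `pd.read_csv('data.csv')`; the drop call is the same.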
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   diagnosis                569 non-null    object
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  569 non-null    float64
 15  smoothness_se            569 non-null    float64
 16  compactness_se           569 non-null    float64
 17  concavity_se             569 non-null    float64
 18  concave points_se        569 non-null    float64
 19  symmetry_se              569 non-null    float64
 20  fractal_dimension_se     569 non-null    float64
 21  radius_worst             569 non-null    float64
 22  texture_worst            569 non-null    float64
 23  perimeter_worst          569 non-null    float64
 24  area_worst               569 non-null    float64
 25  smoothness_worst         569 non-null    float64
 26  compactness_worst        569 non-null    float64
 27  concavity_worst          569 non-null    float64
 28  concave points_worst     569 non-null    float64
 29  symmetry_worst           569 non-null    float64
 30  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), object(1)
memory usage: 137.9+ KB

Calling .info() we see that there are no missing values in this dataset.

There seems to be pretty good distinction between the diagnoses (blue & orange) in most of the attributes above. The majority of our observations belong to the benign class. area_mean could be a good predictor of whether a tumor is malignant or benign, since there is pretty good separation here: most benign tumors (orange) have an area_mean of around 500 or lower. Some strong correlations are present.
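The class-balance and area_mean observations above come from the plots, but they can also be checked numerically. A sketch on a toy frame standing in for breast_cancer_df (the values are made up to mirror the separation described, not taken from the real data):

```python
import pandas as pd

# Toy stand-in for breast_cancer_df; values chosen to mirror the
# separation described above, not taken from the real dataset.
df = pd.DataFrame({
    "diagnosis": ["B", "B", "B", "M", "M"],
    "area_mean": [420.0, 480.0, 510.0, 1001.0, 1326.0],
})

# Class balance: most observations are benign.
print(df["diagnosis"].value_counts())

# Per-class summary: benign area_mean sits near or below ~500,
# malignant well above it.
print(df.groupby("diagnosis")["area_mean"].mean())
```

On the real frame, the same `value_counts()` and `groupby(...).mean()` calls quantify what the countplot and scatterplot show.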
(Note the very bright squares in the correlation heatmap, for example.)

Pre-Processing

Let's go ahead and scale our data before creating and training our model.

Creating our Model

We're now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be done using sklearn's train_test_split(X, y, test_size) function. It takes in your features (X), the target variable (y), and the test_size you'd like (generally a test size of around 0.3 is good enough), and returns a tuple of X_train, X_test, y_train, y_test. We will train our model on the training set and then use the test set to evaluate it.

SVC()

Model Evaluation

Now that we've finished training, we can make predictions on the test data and evaluate our model's performance against the corresponding test labels.

Confusion matrix

 [[104   1]
  [  3  63]]

True Positives(TP) = 104
True Negatives(TN) = 63
False Positives(FP) = 1
False Negatives(FN) = 3

              precision    recall  f1-score   support

           B       0.97      0.99      0.98       105
           M       0.98      0.95      0.97        66

    accuracy                           0.98       171
   macro avg       0.98      0.97      0.98       171
weighted avg       0.98      0.98      0.98       171

Hyperparameter Tuning

Finding the right parameters (like what C or gamma values to use) is a tricky task, but luckily we can be a little lazy, try a bunch of combinations, and see what works best. This idea of creating a 'grid' of parameters and trying out all the possible combinations is called a grid search. The method is common enough that scikit-learn has it built in with GridSearchCV; the CV stands for cross-validation. GridSearchCV takes a model to train and a dictionary describing the parameters to try: the keys are the parameter names and the values are the settings to be tested. Let's go ahead and try a few different parameters to see which set is best. You should add refit=True and set verbose to whatever number you want.
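Concretely, the grid dictionary and the search object look like this (a sketch mirroring the code cells at the end of this lab):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# The grid of settings to try: keys are SVC parameter names, values
# are the candidate settings to be tested for each one.
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

# refit=True retrains the best combination on the full training set;
# verbose=3 prints a line for every cross-validation fit.
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
```

Calling `grid.fit(X_train, y_train)` then runs the search.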
The higher the number, the more verbose (verbose just means the text output describing the process). What fit does here is a bit more involved than usual. First, it runs the same loop with cross-validation to find the best parameter combination. Once it has the best combination, it runs fit again on all the data passed to fit (without cross-validation), building a single new model with the best parameter setting.

Note: This process may take a while. The more parameters we test, the longer it takes, since every combination must be tried in order to find the best set.

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.637, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.925, total=   0.0s
...
[CV] C=1000, gamma=0.0001, kernel=rbf ................................
[CV] .... C=1000, gamma=0.0001, kernel=rbf, score=0.987, total=   0.0s
[Parallel(n_jobs=1)]: Done 125 out of 125 | elapsed:    1.0s finished

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf']},
             verbose=3)

You can inspect the best parameters found by GridSearchCV using the best_params_ attribute, and the best estimator using the best_estimator_ attribute. Here we see that the best set of parameters from the ones we specified is C=10, gamma=0.01, and the 'rbf' kernel.

{'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}

Then you can re-run predictions on this grid object just like you would with a normal model.
[[105   0]
 [  2  64]]

              precision    recall  f1-score   support

           B       0.98      1.00      0.99       105
           M       1.00      0.97      0.98        66

    accuracy                           0.99       171
   macro avg       0.99      0.98      0.99       171
weighted avg       0.99      0.99      0.99       171

Nice! We got a slight improvement using these parameters, though our original accuracy was already very good. Keep grid search in mind when you need to do hyperparameter tuning; it can save you a lot of time.

Congrats! You now know how to use SVM and hyperparameter tuning in sklearn. Try using this on your own dataset and refer back to this lecture if you get stuck.

breast_cancer_df.head() (truncated):

   id        diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  compactness_mean  ...
0  842302    M          17.99        10.38         122.80          1001.0     0.11840          0.27760           ...
1  842517    M          20.57        17.77         132.90          1326.0     0.08474          0.07864           ...
2  84300903  M          19.69        21.25         130.00          1203.0     0.10960          0.15990           ...
3  84348301  M          11.42        20.38          77.58           386.1     0.14250          0.28390           ...
4  84358402  M          20.29        14.34         135.10          1297.0     0.10030          0.13280           ...

5 rows × 33 columns

X.head() after scaling (truncated):

   radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  compactness_mean  concavity_mean  concave points_mean  ...
0     1.097064     -2.073335        1.269934   0.984375         1.568466          3.283515        2.652874             2.532475  ...
1     1.829821     -0.353632        1.685955   1.908708        -0.826962         -0.487072       -0.023846             0.548144  ...
2     1.579888      0.456187        1.566503   1.558884         0.942210          1.052926        1.363478             2.037231  ...
3    -0.768909      0.253732       -0.592687  -0.764464         3.283553          3.402909        1.915897             1.451707  ...
4     1.750297     -1.151816        1.776573   1.826229         0.280372          0.539340        1.371011             1.428493  ...

5 rows × 30 columns

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
sns.set_style('darkgrid')

In [3]:
breast_cancer_df = pd.read_csv('data.csv')
breast_cancer_df.head()

In [4]:
breast_cancer_df.drop(labels=['Unnamed: 32', 'id'], axis=1, inplace=True)

In [5]:
breast_cancer_df.info()

In [6]:
sns.pairplot(breast_cancer_df, hue='diagnosis',
             vars=['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
                   'smoothness_mean', 'compactness_mean', 'concavity_mean',
                   'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean'])
plt.show()

In [7]:
sns.countplot(x=breast_cancer_df['diagnosis'])
plt.show()

In [8]:
sns.scatterplot(x='area_mean', y='smoothness_mean', hue='diagnosis', data=breast_cancer_df)
plt.show()

In [9]:
plt.figure(figsize=(20,10))
sns.heatmap(breast_cancer_df.corr(), annot=True)
plt.show()

In [10]:
from sklearn.preprocessing import StandardScaler

# all columns except 'diagnosis'
X = breast_cancer_df.drop('diagnosis', axis=1)
y = breast_cancer_df['diagnosis']

# create our scaler object
scaler = StandardScaler()

# use our scaler object to transform/scale our data and save it into X_scaled
X_scaled = scaler.fit_transform(X)

# reassign X to a new DataFrame using the X_scaled values
X = pd.DataFrame(data=X_scaled, columns=X.columns)

In [11]:
X.head()

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [13]:
from sklearn.svm import SVC

# instantiate the model with default parameters
model = SVC()

# fit/train
model.fit(X_train, y_train)

In [14]:
predictions = model.predict(X_test)

In [15]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, predictions)

print('Confusion matrix\n\n', cm)
print('\nTrue Positives(TP) = ', cm[0,0])
print('\nTrue Negatives(TN) = ', cm[1,1])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])

In [16]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

In [17]:
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

In [19]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)

In [20]:
grid.fit(X_train, y_train)

In [21]:
grid.best_params_

In [23]:
grid_predictions = grid.predict(X_test)

In [24]:
print(confusion_matrix(y_test, grid_predictions))
print(classification_report(y_test, grid_predictions))
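To recap, the whole workflow (scale, split, grid search, evaluate) fits in one short, self-contained sketch. The data below is synthetic (make_classification standing in for the tumor features), so the best parameters and scores will differ from the lab's results:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Synthetic stand-in for the 30 scaled tumor features.
X, y = make_classification(n_samples=569, n_features=30, random_state=101)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# Cross-validated search over the same grid as the lab; refit=True
# retrains the best combination on all of X_train, and the refit
# model is then evaluated once on the held-out test set.
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

Swapping in your own X and y is all it takes to reuse this on another dataset.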