ECE 661: Homework 5 Adversarial Attacks and Defenses

1 True/False Questions (10 pts)

For each question, please provide a short explanation to support your judgment.
Problem 1.1 (1 pt) In an evasion attack, the attacker perturbs a subset of training instances, which
prevents the DNN from learning an accurate model.

Problem 1.2 (1 pt) In general, modern defenses not only improve robustness to adversarial attack, but
they also improve accuracy on clean data.

Problem 1.3 (1 pt) In a backdoor attack, the attacker first injects a specific noise trigger to a subset
of data points and sets the corresponding labels to a target class. Then, during deployment, the attacker
uses a gradient-based perturbation (e.g., Fast Gradient Sign Method) to fool the model into choosing the
target class.

Problem 1.4 (1 pt) Outlier exposure is an Out-of-Distribution (OOD) detection technique that uses OOD
data during training, unlike the ODIN detector.

Problem 1.5 (1 pt) It is likely that an adversarial example generated on a ResNet-50 model will also
fool a VGG-16 model.

Problem 1.6 (1 pt) The perturbation direction used by the Fast Gradient Sign Method attack is the
direction of steepest ascent on the local loss surface, which is the most efficient direction towards the
decision boundary.

Problem 1.7 (1 pt) The purpose of the projection step of the Projected Gradient Descent (PGD) attack
is to prevent a misleading gradient due to gradient masking.

Problem 1.8 (1 pt) Analysis shows that the best layer for generating the most transferable feature space
attacks is the final convolutional layer, as it is the convolutional layer that has the most effect on the
prediction.

Problem 1.9 (1 pt) The DVERGE training algorithm promotes a more robust model ensemble, but the
individual models within the ensemble still learn non-robust features.

Problem 1.10 (1 pt) On a backdoored model, the exact backdoor trigger must be used by the attacker
during deployment to cause the proper targeted misclassification.

2 Lab 1: Environment Setup and Attack Implementation (20 pts)

In this section, you will train two basic classifier models on the FashionMNIST dataset and implement a few
popular untargeted adversarial attack methods. The goal is to prepare an “environment” for attacking in
the following sections and to understand how the adversarial attack’s ϵ value influences the perceptibility
of the noise.

All code for this set of questions will be in the “Model Training” section of HWK5_main.ipynb
and in the accompanying attacks.py file. Please include all of your results, figures, and observations in
your PDF report.

(a) (4 pts) Train the given NetA and NetB models on the FashionMNIST dataset. Use the provided
training parameters and save two checkpoints: “netA_standard.pt” and “netB_standard.pt”. What
is the final test accuracy of each model? Do both models have the same architecture? (Hint: accuracy should be around 92% for both models).
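
For reference, a minimal sketch of a standard training loop is below. It assumes the NetA/NetB classes, data loaders, and hyperparameters defined in the provided "Model Training" code; the optimizer choice, epochs, and learning rate shown here are placeholders, not the required settings.

    import torch
    import torch.nn as nn
    import torch.optim as optim

    def train_standard(model, train_loader, device, epochs=10, lr=1e-3):
        """Plain cross-entropy training; epochs/lr are placeholder values."""
        model.to(device).train()
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in train_loader:
                x, y = x.to(device), y.to(device)
                optimizer.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()
                optimizer.step()
        return model

    # e.g., torch.save(netA.state_dict(), "netA_standard.pt")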

(b) (8 pts) Implement the untargeted L∞-constrained Projected Gradient Descent (PGD) adversarial
attack in the attacks.py file. In the report, paste a screenshot of your PGD_attack function and
describe what each of the input arguments is controlling. Then, using the “Visualize some perturbed samples” cell in HWK5_main.ipynb, run your PGD attack using NetA as the base classifier
and plot some perturbed samples using ϵ values in the range [0.0, 0.2].

At about what ϵ does the
noise start to become perceptible/noticeable? Do you think that you (or any human) would still
be able to correctly predict samples at this ϵ value? Finally, to test one important edge case, show
that at ϵ = 0 the computed adversarial example is identical to the original input image. (HINT:
We give you a function to compute input gradient at the top of the attacks.py file)
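
For part (b), a minimal sketch of an untargeted L∞ PGD attack is given below. It assumes inputs scaled to [0, 1] and a cross-entropy loss; the argument names are illustrative only and do not reproduce the provided gradient helper or the exact signature expected in attacks.py.

    import torch
    import torch.nn.functional as F

    def PGD_attack(model, device, x, y, eps, alpha, iters, rand_start=False):
        """Untargeted L-inf PGD sketch: eps = perturbation budget, alpha = step size,
        iters = number of gradient steps, rand_start = random init in the eps-ball."""
        x, y = x.to(device), y.to(device)
        x_adv = x.clone().detach()
        if rand_start:
            x_adv = torch.clamp(x_adv + torch.empty_like(x_adv).uniform_(-eps, eps), 0.0, 1.0)
        for _ in range(iters):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            # Ascend along the gradient sign, then project back onto the
            # eps-ball around the clean input and the valid pixel range.
            x_adv = x_adv.detach() + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
            x_adv = torch.clamp(x_adv, 0.0, 1.0)
        return x_adv.detach()

Note that at ϵ = 0 every step is projected back onto the clean input, so the returned tensor matches the original image, which is the edge case the question asks you to verify.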

(c) (4 pts) Implement the untargeted L∞-constrained Fast Gradient Sign Method (FGSM) attack and
random start FGSM (rFGSM) in the attacks.py file. (Hint: you can treat the FGSM and rFGSM
functions as wrappers of the PGD function).

Please include a screenshot of your FGSM_attack
and rFGSM_attack function in the report. Then, plot some perturbed samples using the same ϵ
levels from the previous question and comment on the perceptibility of the FGSM noise. Do
the FGSM and PGD perturbations appear visually similar?
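
Following the hint in part (c), the two single-step attacks can be written as thin wrappers around the PGD sketch above (they are not self-contained; PGD_attack refers to that earlier sketch, and the signatures are again illustrative).

    def FGSM_attack(model, device, x, y, eps):
        # One full-size step (alpha = eps), no random start.
        return PGD_attack(model, device, x, y, eps, alpha=eps, iters=1, rand_start=False)

    def rFGSM_attack(model, device, x, y, eps):
        # Same single step, but starting from a random point inside the eps-ball.
        return PGD_attack(model, device, x, y, eps, alpha=eps, iters=1, rand_start=True)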

(d) (4 pts) Implement the untargeted L2-constrained Fast Gradient Method attack in the attacks.py
file. Please include a screenshot of your FGM_L2_attack function in the report. Then, plot some
perturbed samples using ϵ values in the range of [0.0, 4.0] and comment on the perceptibility of
the L2 constrained noise.

How does this noise compare to the L∞ constrained FGSM and PGD
noise visually? (Note: This attack involves a normalization of the gradient, but since these attack
functions take a batch of inputs, the norm must be computed separately for each element of the
batch).
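
For part (d), a minimal sketch of the untargeted L2-constrained single-step attack is below; the per-sample gradient normalization follows the note above, and a small constant guards against division by zero. The function name and signature are placeholders.

    import torch
    import torch.nn.functional as F

    def FGM_L2_attack(model, device, x, y, eps):
        x, y = x.to(device), y.to(device)
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Normalize the gradient separately for each element of the batch.
        flat = grad.reshape(grad.size(0), -1)
        norms = flat.norm(p=2, dim=1).clamp_min(1e-12)
        grad_unit = grad / norms.view(-1, *([1] * (grad.dim() - 1)))
        x_adv = x_adv.detach() + eps * grad_unit
        return torch.clamp(x_adv, 0.0, 1.0).detach()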

3 Lab 2: Measuring Attack Success Rate (30 pts)

In this section, you will measure the effectiveness of your FGSM, rFGSM, and PGD attacks. Remember,
the goal of an adversarial attacker is to perturb the input data such that the classifier outputs a wrong
prediction while keeping the noise minimally perceptible to a human observer. All code for this set of questions
will be in the “Test Attacks” section of HWK5_main.ipynb and in the accompanying attacks.py file. Please
include all of your results, figures, and observations in your PDF report.

(a) (2 pts) Briefly describe the difference between whitebox and blackbox adversarial attacks. Also,
what is it called when we generate attacks on one model and input them into another model that
has been trained on the same dataset?

(b) (3 pts) Random Attack – To get an attack baseline, we use random uniform perturbations in
the range [−ϵ, ϵ]. We have implemented this for you in the attacks.py file. Test at least eleven ϵ
values across the range [0, 0.1] (e.g., np.linspace(0,0.1,11)) and plot two accuracy vs epsilon
curves (with y-axis range [0, 1]) on two separate plots: one for the whitebox attacks and one for
blackbox attacks. How effective is random noise as an attack? (Note: in the code, whitebox and
blackbox accuracy is computed simultaneously)
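
One way to organize the sweeps in parts (b)-(d) is sketched below: for each ϵ, attack the whitebox model once and score the same adversarial batch on both models, which is what the note about computing whitebox and blackbox accuracy simultaneously refers to. The attack signature, loader, and helper name are assumptions, not the notebook's actual code.

    import numpy as np
    import torch

    def accuracy_vs_epsilon(attack_fn, whitebox, blackbox, loader, device, eps_values):
        """attack_fn(model, device, x, y, eps) -> perturbed batch (assumed signature)."""
        whitebox.eval(); blackbox.eval()
        wb_acc, bb_acc = [], []
        for eps in eps_values:
            wb_correct = bb_correct = total = 0
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                x_adv = attack_fn(whitebox, device, x, y, eps)  # generated on the whitebox
                with torch.no_grad():
                    wb_correct += (whitebox(x_adv).argmax(1) == y).sum().item()
                    bb_correct += (blackbox(x_adv).argmax(1) == y).sum().item()
                total += y.size(0)
            wb_acc.append(wb_correct / total)
            bb_acc.append(bb_correct / total)
        return wb_acc, bb_acc

    # e.g., eps_values = np.linspace(0, 0.1, 11)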

(c) (10 pts) Whitebox Attack – Using your pre-trained “NetA” as the whitebox model, measure the
whitebox classifier’s accuracy versus attack epsilon for the FGSM, rFGSM, and PGD attacks. For
each attack, test at least eleven ϵ values across the range [0, 0.1] (e.g., np.linspace(0,0.1,11))
and plot the accuracy vs epsilon curve. Please plot these curves on the same axes as the whitebox
plot from part (b). For the PGD attacks, use perturb_iters = 10 and α = 1.85 ∗ (ϵ/perturb_iters).

Comment on the difference between the attacks. Do any of the attacks induce the equivalent
of “random guessing” accuracy? If so, which attack and at what ϵ value? (Note: in the code,
whitebox and blackbox accuracy is computed simultaneously)

(d) (10 pts) Blackbox Attack – Using the pre-trained “NetA” as the whitebox model and the pre-trained
“NetB” as the blackbox model, measure the ability of adversarial examples generated
on the whitebox model to transfer to the blackbox model. Specifically, measure the blackbox
classifier’s accuracy versus attack epsilon for the FGSM, rFGSM, and PGD attacks. Use the same
ϵ values across the range [0, 0.1] and plot the blackbox model’s accuracy vs epsilon curve. Please
plot these curves on the same axes as the blackbox plot from part (b). For the PGD attacks, use
perturb_iters = 10 and α = 1.85 ∗ (ϵ/perturb_iters). Comment on the difference between the
blackbox attacks. Do any of the attacks induce the equivalent of “random guessing” accuracy?
If so, which attack and at what ϵ value? (Note: in the code, whitebox and blackbox accuracy is
computed simultaneously)

(e) (5 pts) Comment on the difference between the attack success rate curves (i.e., the accuracy vs.
epsilon curves) for the whitebox and blackbox attacks. How do these compare to the effectiveness of
the naive uniform random noise attack? Which is the more powerful attack and why? Does this
make sense? Also, consider the epsilon level you found to be the “perceptibility threshold” in Lab
1.b. What is the attack success rate at this level and do you find the result somewhat concerning?

4 Lab 3: Adversarial Training (40 pts + 10 Bonus)

In this section, you will implement a powerful defense called adversarial training (AT). As the name suggests, this involves training the model against adversarial examples. Specifically, we will be using the AT
described in https://arxiv.org/pdf/1706.06083.pdf, which formulates the training objective as
min_θ E_{(x,y)∼D} [ max_{δ∈S} L(f(x + δ; θ), y) ]

Importantly, the inner maximizer specifies that all of the training data should be adversarially perturbed
before updating the network parameters. All code for this set of questions will be in the HWK5_main.ipynb
file. Please include all of your results, figures, and observations in your PDF report.
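
A minimal sketch of this training loop is shown below, assuming an attack function with the same (illustrative) signature used in the Lab 1 sketches; the only change from standard training is that the inner maximization perturbs every batch before the parameter update, and the optimizer settings are placeholders.

    import torch
    import torch.nn as nn
    import torch.optim as optim

    def adversarial_train(model, attack_fn, train_loader, device, epochs, eps, **attack_kwargs):
        """attack_fn(model, device, x, y, eps, **attack_kwargs) returns a perturbed batch."""
        model.to(device).train()
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=1e-3)  # placeholder optimizer/lr
        for _ in range(epochs):
            for x, y in train_loader:
                x, y = x.to(device), y.to(device)
                # Inner maximization: adversarially perturb the training batch.
                x_adv = attack_fn(model, device, x, y, eps, **attack_kwargs)
                # Outer minimization: update the parameters on the perturbed data.
                optimizer.zero_grad()
                loss = criterion(model(x_adv), y)
                loss.backward()
                optimizer.step()
        return model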

(a) (5 pts) Starting from the given “Model Training” code, adversarially train a “NetA” model using
a FGSM attack with ϵ = 0.1, and save the model checkpoint as “netA_advtrain_fgsm0p1.pt”.
What is the final accuracy of this model on the clean test data? Is the accuracy less than the
standard trained model? Repeat this process for the rFGSM attack with ϵ = 0.1, saving the
model checkpoint as “netA_advtrain_rfgsm0p1.pt”. Do you notice any differences in training
convergence when using these two methods?

(b) (5 pts) Starting from the given “Model Training” code, adversarially train a “NetA” model using a PGD attack with ϵ = 0.1, perturb_iters = 4, α = 1.85 ∗ (ϵ/perturb_iters), and save the
model checkpoint as “netA_advtrain_pgd0p1.pt”. What is the final accuracy of this model on the
clean test data? Is the accuracy less than the standard trained model? Are there any noticeable
differences in the training convergence between the FGSM-based and PGD-based AT procedures?

(c) (15 pts) For the model adversarially trained with FGSM (“netA_advtrain_fgsm0p1.pt”) and rFGSM
(“netA_advtrain_rfgsm0p1.pt”), compute the accuracy versus attack epsilon curves against
the FGSM, rFGSM, and PGD attacks (as whitebox methods only). Use ϵ = [0.0, 0.02, 0.04, . . . , 0.14].
Please use a different plot for each adversarially trained model (i.e., two plots, three curves each).
Is the model robust to all types of attack? If not, explain why one attack might be better than
another. (Note: you can run this code in the “Test Robust Models” cell of the HWK5_main.ipynb
notebook).
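
One way to organize the evaluations in parts (c) and (d) is sketched below, reusing the attack sketches and the accuracy_vs_epsilon helper assumed in Lab 2 (the robust model is passed as both arguments since only its own whitebox accuracy is needed). NetA and the checkpoint name come from this handout; device, test_loader, and the helper functions are assumptions.

    import numpy as np
    import matplotlib.pyplot as plt
    import torch

    eps_values = np.arange(0.0, 0.141, 0.02)   # 0.0, 0.02, ..., 0.14
    robust = NetA().to(device)
    robust.load_state_dict(torch.load("netA_advtrain_fgsm0p1.pt", map_location=device))
    robust.eval()

    attacks = {
        "FGSM":  FGSM_attack,
        "rFGSM": rFGSM_attack,
        # alpha tied to eps as specified: alpha = 1.85 * (eps / perturb_iters)
        "PGD":   lambda m, d, x, y, e: PGD_attack(m, d, x, y, e,
                                                  alpha=1.85 * (e / 10), iters=10),
    }
    for name, attack_fn in attacks.items():
        acc, _ = accuracy_vs_epsilon(attack_fn, robust, robust, test_loader, device, eps_values)
        plt.plot(eps_values, acc, marker="o", label=name)
    plt.xlabel("attack epsilon"); plt.ylabel("accuracy"); plt.ylim(0, 1)
    plt.legend(); plt.title("netA_advtrain_fgsm0p1.pt (whitebox attacks)"); plt.show()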

(d) (15 pts) For the model adversarially trained with PGD (“netA_advtrain_pgd0p1.pt”), compute the
accuracy versus attack epsilon curves against the FGSM, rFGSM, and PGD attacks (as whitebox
methods only). Use ϵ = [0.0, 0.02, 0.04, . . . , 0.14], perturb_iters = 10, α = 1.85∗(ϵ/perturb_iters).
Please plot the curves for each attack in the same plot to compare against the two from part (c). Is
this model robust to all types of attack? Explain why or why not. Can you conclude that one
adversarial training method is better than the other? If so, provide an intuitive explanation as to
why (this paper may help explain: https://arxiv.org/pdf/2001.03994.pdf). (Note: you can
run this code in the “Test Robust Models” cell of the HWK5_main.ipynb notebook).

(e) (Bonus 5 pts) Using PGD-based AT, train at least three more models with different ϵ values.
Is there a trade-off between clean data accuracy and training ϵ? Is there a trade-off between
robustness and training ϵ? What happens when the PGD attack’s ϵ is larger than the ϵ used for
training? In the report, provide answers to all of these questions along with evidence (e.g., plots
and/or tables) to substantiate your claims.

(f) (Bonus 5 pts) Plot the saliency maps for a few samples from the FashionMNIST test set as measured on both the standard (non-AT) and PGD-AT models. Do you notice any difference in
saliency? What does this difference tell us about the representation that has been learned?
(Hint: plotting the gradient w.r.t. the data is often considered a version of saliency, see
https://arxiv.org/pdf/1706.03825.pdf)
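
For this bonus part, a common recipe is to take the gradient of the loss with respect to the input pixels and visualize its magnitude (the SmoothGrad paper linked above averages this map over several noise-perturbed copies of the input). A minimal sketch under those assumptions, with an illustrative function name:

    import torch
    import torch.nn.functional as F

    def saliency_map(model, device, x, y):
        """Absolute input-gradient saliency for a batch x with labels y."""
        model.eval()
        x = x.to(device).clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y.to(device))
        grad = torch.autograd.grad(loss, x)[0]
        # Per-pixel magnitude; for multi-channel inputs take the max over channels.
        return grad.abs().amax(dim=1)   # shape (N, H, W); plot with plt.imshow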