1 Written Questions [52 pts]
Answer the following questions in the template provided, then upload your solutions to Gradescope. You may use LaTeX, or print the template, hand-write your answers, and scan them in. Failure to use the template may result in a penalty. There are 52 points and 12 questions.
1.1 Semantic Segmentation
Figure 1.1: Input Image
Figure 1.2: Image Segmentation
Let's assume we have a 4 × 4 grid consisting of RGB pixels. Our task is to perform image segmentation, i.e., assign a label to each pixel from the set {foreground, background}. Intuitively, neighboring pixels should have similar values in y; for example, pixels associated with a mountain should form one continuous blob. This knowledge can be naturally modeled via a Potts model, which consists of unary potentials φ(y_i, x) that encode the likelihood that any given pixel is from our subject, and pairwise potentials φ(y_i, y_j) that encourage adjacent y's to have the same value with high probability. If visualized as a graphical model, we would have a node for each pixel and an edge for each pair of pixels that are touching vertically or horizontally, but not diagonally. We will refer to these sets as V and E respectively.
As seen in class, solving this task corresponds to finding y such that

$$\hat{y} = \operatorname{argmax}_y \log p_w(y \mid x) = \operatorname{argmax}_y \sum_{j \in V} w^\top \phi(x_j, y_j) + \sum_{(j,k) \in E} w^\top \phi(y_j, y_k)$$
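As a quick sanity check on this setup (our own illustration, not part of the assignment), the 4 × 4 grid gives |V| = 16 and |E| = 24, which a few lines of Python can enumerate:

```python
# Enumerate V and E for the 4x4 pixel grid (illustration only).
n = 4
V = [(r, c) for r in range(n) for c in range(n)]
E = [((r, c), (r, c + 1)) for r in range(n) for c in range(n - 1)]   # horizontal edges
E += [((r, c), (r + 1, c)) for r in range(n - 1) for c in range(n)]  # vertical edges
print(len(V), len(E))  # 16 nodes, 24 edges (no diagonal neighbors)
```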
NOTE: This is not the same model used in the programming section. We will provide the set-up for that
separately.
1.2 MAP Inference with Integer Linear Programming
In this section, we will formulate our problem as an Integer Linear Program (ILP) and use it to predict
the segmentation of a small patch of our image. For convenience, denote the background class with 0 and
the foreground class with 1. Furthermore, suppose we have variables α and β that we will be optimizing over. Let α_i(y) represent an indicator function that is one when pixel i is labeled class y. Let β_ij(y_n, y_m) represent an indicator function that is one when pixel i is set to class y_n and pixel j is set to class y_m.
First we will need to translate our requirements into a constraint set.
1. (2 points) Short answer: Write constraints on the variables αi(y) such that each pixel i is assigned
one and only one class y.
2. (4 points) Short answer: Write constraints on the variables β_ij(y_n, y_m) such that the pairwise variables are consistent with the per-pixel assignments of classes y_n and y_m in the pairwise potentials. Note that these are only defined over ij pairs in our edge set E.
3. (2 points) Short answer: Using the score functions and goal defined above, write an ILP that represents
our constraint set and objective such that if solved we would have a solution to our image segmentation
task.
Suppose we had the following unary feature functions:
φ1(yellow pixel, background) = 1
φ2(yellow pixel, foreground) = 2
φ3(brown pixel, foreground) = 3
φ4(brown pixel, background) = 0
In addition suppose we had the following non-zero pairwise feature functions:
φ5(background, background) = 1
φ6(foreground, background) = 2
φ7(background, foreground) = 0
φ8(foreground, foreground) = 1
Figure 1.3: Extracted Image Patch and Segmentation
4. (7 points) Short answer: In this question we will walk through the process of solving the ILP for the 1 × 2 patch in the upper left-hand corner of our image, shown in Figure 1.3. Assume an initial weight vector w = [1, 1, 1, . . . , 1].
(a) What are the scores assigned by the model for all possible y configurations shown below?
                  | Yellow: Background | Yellow: Foreground
Brown: Background |                    |
Brown: Foreground |                    |
(b) What are the settings of the variables α that solve the above ILP?
(c) What would be the updated weight vector w after running one iteration of structured perceptron?
5. (5 points) Short answer: Assume the same set-up as in question 4. Also assume an initial weight vector w = [1, 1, 1, . . . , 1] and the Hamming loss as ℓ.
(a) What are the loss-augmented scores for all possible y configurations shown below?

                  | Yellow: Background | Yellow: Foreground
Brown: Background |                    |
Brown: Foreground |                    |
(b) What would be the updated weight vector w after running one iteration of structured SVM?
1.3 Loss-Augmented Inference and Learning
1. (1 point) Select all that apply: In which of the following does Structured Perceptron differ from Structured SVM?
□ Training
□ Inference
□ Evaluation
2. (4 points) Suppose you are given an MRF’s factor graph G = (C, ψ, y) consisting of a set of factors C,
potential functions ψ, and variables y ∈ Y, where Y is the output space. Each factor c ∈ C touches
a subset of the variables $y_c = [y_{c_1}, y_{c_2}, \ldots, y_{c_m}]^\top$. The probability distribution defined by this graph decomposes multiplicatively according to the factors:

$$p_G(y) = \frac{1}{Z} \prod_{c \in C} \psi_c(y_c)$$
Further, you are given a loss function that decomposes additively according to the factors:
$$\ell(y^*, \hat{y}) = \sum_{c \in C} \ell_c(y^*_c, \hat{y}_c)$$
Show that you can define a new factor graph $G'$ (with the same factors and variables as $G$, but different potential functions $\psi'$) such that the most probable assignment $\hat{y}' \triangleq \operatorname{argmax}_{y \in \mathcal{Y}} p_{G'}(y)$ solves the loss-augmented MAP inference problem for $G$ with loss $\ell$.
1.4 Empirical Questions
The following questions should be completed after you work through the programming portion of this
assignment (Section 2).
1. (12 points) Report the mean IOU and pixel accuracy scores for all models. For this question, you should run the FCN for 2 epochs and the linear SVM and structured SVM for 3 epochs each. Round each value to two significant figures.
Model             | Mean IOU | Pixel Accuracy
FCN (Train)       |          |
FCN (Test)        |          |
FCN+l-SVM (Train) |          |
FCN+l-SVM (Test)  |          |
FCN+s-SVM (Train) |          |
FCN+s-SVM (Test)  |          |
2. (10 points) Plot training and testing curves for the hinge loss (FCN+l-SVM) and structured hinge loss
(FCN+s-SVM) respectively over 3 epochs. Note: Your plot must be machine generated.
3. (3 points) For any image of your choice from the test set, plot the labels predicted by FCN, FCN+l-SVM, and FCN+s-SVM as grayscale images. Use the visualization function provided in the code to plot images. Do you see any remarkable differences between the model predictions?
1.5 Wrap-up Questions
1. (1 point) Multiple Choice: Did you correctly submit your code to Autolab?
Yes
No
2. (1 point) Numerical answer: How many hours did you spend on this assignment?
1.6 Collaboration Policy
After you have completed all other components of this assignment, report your answers to the collaboration
policy questions detailed in the Academic Integrity Policies for this course.
1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full
details including names of people who helped you and the exact nature of help you received.
2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details
including names of people you helped and the exact nature of help you offered.
3. Did you find or come across code that implements any part of this assignment? If so, include full
details including the source of the code and how you used it in the assignment.
2 Programming [48 pts]
Your goal in this assignment is to implement a structured SVM on top of a convolutional neural network
for image segmentation. Given an image as a 32×32 grid of pixels, you will assign a label from the set
{foreground, background} to each pixel.
Your solution for this section must be implemented with PyTorch and Google OR-Tools (https://developers.google.com/optimization) using the data files we have provided to you. This restriction is because we will be grading your code by hand to check your understanding as well as your model's performance.
2.1 Task Background
Semantic segmentation is the task of assigning a class label to each pixel in an image. This task can be
thought of as image classification, but at the pixel level. Figure 2.1 shows a sample image with its corresponding segmented image.
Figure 2.1: Sample image from the PASCAL VOC-2012 Image Segmentation dataset: (a) original image, (b) segmented image
Semantic segmentation is useful for a variety of applications such as autonomous vehicles, photo-editing
tools and many tasks in robotics. In this question, we will first train a simple semantic segmentation model
using a fully convolutional network (FCN). Next, we will use this FCN as a feature extractor to generate
100-dimensional features for each pixel in the image. Using these pixel features, we will build two semantic
segmentation models. The first baseline model is a simple linear support vector machine (SVM) which uses
the FCN features and predicts a class label for each pixel in the image independently. The second model
is a structured SVM which takes in FCN features and predicts class labels for all pixels in the image while
incorporating interactions between neighboring pixels. The following subsections provide more details about the dataset, metrics, and model structure.
2.2 Data
For this task, we will be using the semantic segmentation dataset from the PASCAL VOC 2012 Challenge (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html).
This dataset consists of 2913 examples in total (2330 for training and 583 for testing). For each example
in this dataset, we have a pair of images: the original image and the segmented image. To make it easier
to work with the data, we have converted segmented images into numpy arrays of class labels. To ensure
reasonable runtime, we have also cropped and downsampled all images to 32×32 pixels. Finally, instead of
having multiple class labels (e.g., person, aeroplane, etc.), we will treat this as a binary classification task with the label set {foreground, background}.
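As an illustration of the expected shapes (synthetic stand-ins only; the channel ordering and dtypes here are assumptions, and the real arrays come from the provided data files):

```python
import numpy as np

# Synthetic stand-ins for the provided data, showing the assumed shapes.
train_images = np.zeros((2330, 32, 32, 3), dtype=np.uint8)  # RGB pixel values
train_labels = np.zeros((2330, 32, 32), dtype=np.int64)     # 0 = background, 1 = foreground

assert train_labels.shape == train_images.shape[:3]
```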
2.3 Metrics
To evaluate model performance on this dataset, we will be using two metrics:
• Pixel Accuracy: Pixel-level classification accuracy for each image, computed as

$$\text{Pixel Accuracy} = \frac{\#\text{CorrectPixels}}{\#\text{TotalPixels}} \qquad (2.1)$$

• Mean IOU (Intersection-over-Union): Mean of per-class IOU (intersection-over-union) across all classes for each image. For one class (e.g., foreground), IOU is computed as

$$\text{IOU} = \frac{\#\text{CorrectForegroundPixels}}{\#\text{GoldForegroundPixels} + \#\text{PredictedForegroundPixels} - \#\text{CorrectForegroundPixels}} \qquad (2.2)$$

We will average these metrics across all images.
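For reference, a minimal NumPy sketch of both metrics for a single image (function names are ours, not part of the starter code; pred and gold are 32×32 arrays of {0, 1} labels, and the behavior when a class is absent from both arrays is our own convention):

```python
import numpy as np

def pixel_accuracy(pred, gold):
    """Fraction of pixels whose predicted label matches the gold label, per (2.1)."""
    return (pred == gold).mean()

def class_iou(pred, gold, cls):
    """Intersection-over-union for a single class, per (2.2)."""
    correct = np.logical_and(pred == cls, gold == cls).sum()   # intersection
    union = (gold == cls).sum() + (pred == cls).sum() - correct
    return correct / union if union > 0 else 1.0  # our convention if class is absent

def mean_iou(pred, gold, classes=(0, 1)):
    """Average the per-class IOU over both classes (0 = background, 1 = foreground)."""
    return np.mean([class_iou(pred, gold, c) for c in classes])
```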
2.4 Fully Convolutional Network
As a first step towards building a semantic segmentation model, we will train a fully convolutional network.
A fully convolutional network differs from a regular convolutional neural network in the latter half of its architecture. While a convolutional neural network returns a dense representation for each image (e.g., a 6×6 representation for a 32×32 image), a fully convolutional network returns a representation of the same size as the original image. FCNs use regular convolution layers to first compress an image into a dense representation, then deconvolve that dense representation to get a tensor of the same size as the original image (as shown in Figure 2.2).
Figure 2.2: Fully convolutional network for semantic segmentation
For this assignment, we have provided you with code to train and test an FCN. Our FCN implementation uses the first 5 layers from AlexNet as convolution layers and an interpolation layer to perform the deconvolution. After running the FCN on a 32×32 image, you should be able to extract a 32×32×100 tensor which contains a 100-dimensional feature vector for each pixel. We add an additional 100×2 linear layer to predict (foreground, background) scores for each pixel and train this network using cross-entropy loss. After training, we discard the linear layer and use the FCN weights to generate pixel-level features for a given image.
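The FCN code is provided for you, but as a rough sketch of the computation's shape (the layer sizes below are stand-ins, not the provided AlexNet-based architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Sketch of an FCN feature extractor: conv layers compress the image,
    then bilinear interpolation upsamples back to the input resolution."""
    def __init__(self, feat_dim=100):
        super().__init__()
        # Stand-in conv stack; the provided code uses AlexNet's first layers instead.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):                       # x: (N, 3, 32, 32)
        z = self.conv(x)                        # (N, 100, 8, 8) dense representation
        z = F.interpolate(z, size=(32, 32), mode="bilinear", align_corners=False)
        return z                                # (N, 100, 32, 32): per-pixel features
```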
2.5 Baseline
The first model you will be implementing is a simple linear support vector machine. This model takes as input a 32×32×100 tensor computed by the FCN and uses weights w_fc to calculate per-class scores for each pixel independently. This is analogous to simply adding a linear layer on top of the FCN. However, since the baseline is a linear SVM, it is trained using hinge loss. (Hint: To implement hinge loss in PyTorch, you might find it helpful to look up MultiMarginLoss().)
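As a sketch of what this baseline can look like (variable names and shapes are ours, not the required interface):

```python
import torch
import torch.nn as nn

# Hypothetical inputs: FCN features for one image and its gold pixel labels.
feats = torch.randn(32, 32, 100)          # 32x32x100 FCN feature tensor
labels = torch.randint(0, 2, (32, 32))    # gold labels in {0, 1}

w_fc = nn.Linear(100, 2, bias=False)      # per-pixel linear SVM weights
scores = w_fc(feats.view(-1, 100))        # (1024, 2): per-class scores per pixel

# Multi-class hinge loss over all pixels, treated independently.
loss = nn.MultiMarginLoss()(scores, labels.view(-1))
loss.backward()                           # gradients flow into w_fc
```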
2.6 Main Model
The second model you will be implementing is a structured support vector machine. Unlike the linear SVM,
this model does not predict labels for each pixel independently. This is useful because adjacent pixels tend to
belong to the same class, indicating that there may be some benefits to treating this problem as a structured
prediction problem and leveraging pixel interactions. In this assignment, we will model the structure of the output using a formulation based on associative Markov networks (https://ai.stanford.edu/~koller/Papers/Taskar+al:ICML04.pdf). Associative Markov networks are a class of undirected graphical models where each node is a discrete variable and clique potentials tend to favor the same label for all nodes in the clique. In our task, we treat each pixel as a node and only model pairwise interactions between pixels (i.e., no cliques larger than 2 exist). For tractability, we further restrict pairwise interactions to non-diagonal neighboring pixels (up, down, left, right). This results in a lattice structure as shown in the figure:
Figure 2.3: Output structure used by structured SVM
Given an input image as a 32×32×100 feature tensor T extracted from the FCN, an SVM over this structured output space can be formulated as follows:
• Variables: x_{i,c} where i ∈ [0, 32 ∗ 32), c ∈ {0, 1}, and y_{i,j,c} where i, j ∈ [0, 32 ∗ 32), c ∈ {0, 1}. x_{i,c} and y_{i,j,c} are pixel assignment and edge assignment variables respectively. x_{i,c} = 1 indicates that the pixel at position i belongs to class c, while y_{i,j,c} = 1 indicates that the pixels in neighboring positions i and j both belong to class c and are linked to one another.
• Weights: w_{f,c} and w_{f′,c}, where f ∈ [0, 100), f′ ∈ [0, 200), c ∈ {0, 1}. w_{f,c} and w_{f′,c} are pixel weights and edge weights respectively. These weights can be used to compute unary pixel potentials and edge potentials as φ_{i,c} = Σ_f w_{f,c} T(i)_f and φ_{i,j,c} = Σ_{f′} w_{f′,c} concat(T(i), T(j))_{f′}. Recall that we used a similar process in homework 2 to compute unary/edge potentials for nodes in the parse tree from LSTM hidden states.
• ILP formulation for MAP inference: Using the pixel and edge potentials computed from the weights and FCN features, the highest scoring output structure (i.e., the assignment of values to x_{i,c}, y_{i,j,c}) is chosen by maximizing the following objective (an OR-Tools sketch of this ILP appears after this list):
$$\max_{x,y}\; \sum_i \sum_c \phi_{i,c}\, x_{i,c} \;+\; \sum_i \sum_j \sum_c \phi_{i,j,c}\, y_{i,j,c}$$
subject to the following constraints:

$$\sum_c x_{i,c} = 1, \qquad y_{i,j,c} \le x_{i,c}, \qquad y_{i,j,c} \le x_{j,c}$$

The first constraint ensures that each pixel is assigned to exactly one class, while the next two constraints ensure that if an edge exists between two pixels, they both belong to the same class.
• Loss-augmented Inference: To define loss-augmented inference, we need a loss function. We choose ℓ(x̂, x*) to be the Hamming loss between the two outputs:

$$\ell(\hat{x}, x^*) = \frac{1}{|i||c|} \sum_i \sum_c \mathbb{I}\{\hat{x}_{i,c} \ne x^*_{i,c}\}$$
Setting up an ILP for loss-augmented inference uses the exact same constraints as the MAP inference ILP above, but we change the objective function to the following:

$$\max_{x,y}\; \sum_i \sum_c \phi_{i,c}\, x_{i,c} + \sum_i \sum_j \sum_c \phi_{i,j,c}\, y_{i,j,c} + \frac{1}{|i||c|} \sum_i \sum_c \left[ x^*_{i,c}(1 - x_{i,c}) + (1 - x^*_{i,c})\, x_{i,c} \right]$$
• Loss function: The structured hinge loss for this SVM is as follows:

$$L = \max\left(0,\; S(\hat{x}, \hat{y}, T) + \ell(\hat{x}, x^*) - S(x^*, y^*, T)\right)$$

where the scoring function $S(x, y, T) = \sum_i \sum_c \phi_{i,c} x_{i,c} + \sum_i \sum_j \sum_c \phi_{i,j,c} y_{i,j,c}$ is the same as the ILP objective (note that the potentials φ in the objective are computed using vectors from T), (x̂, ŷ) is the highest scoring structure returned by loss-augmented inference, and (x*, y*) is the ground-truth assignment to the pixel/edge variables.
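To make the formulation concrete, here is a minimal sketch of this ILP using OR-Tools' linear solver wrapper (function and variable names are ours; the potentials are assumed precomputed from the weights and T, and loss_weight stands in for the 1/(|i||c|) factor when gold labels are supplied):

```python
from ortools.linear_solver import pywraplp

def map_inference(phi_pixel, phi_edge, edges, n_pixels, n_classes=2,
                  gold=None, loss_weight=0.0):
    """Sketch of the MAP/loss-augmented ILP. phi_pixel[i][c] and
    phi_edge[(i, j)][c] are precomputed potentials; edges is the lattice
    edge list. If gold (one-hot labels) is given, the Hamming term is added."""
    solver = pywraplp.Solver.CreateSolver("CBC")

    # Pixel variables x[i, c] and edge variables y[i, j, c].
    x = {(i, c): solver.BoolVar(f"x_{i}_{c}")
         for i in range(n_pixels) for c in range(n_classes)}
    y = {(i, j, c): solver.BoolVar(f"y_{i}_{j}_{c}")
         for (i, j) in edges for c in range(n_classes)}

    # Each pixel takes exactly one class.
    for i in range(n_pixels):
        solver.Add(sum(x[i, c] for c in range(n_classes)) == 1)
    # Edge variables must agree with both endpoint pixels.
    for (i, j) in edges:
        for c in range(n_classes):
            solver.Add(y[i, j, c] <= x[i, c])
            solver.Add(y[i, j, c] <= x[j, c])

    obj = sum(phi_pixel[i][c] * x[i, c] for (i, c) in x)
    obj += sum(phi_edge[(i, j)][c] * y[i, j, c] for (i, j, c) in y)
    if gold is not None:  # loss augmentation: Hamming term from the objective above
        obj += loss_weight * sum(
            gold[i][c] * (1 - x[i, c]) + (1 - gold[i][c]) * x[i, c]
            for (i, c) in x)
    solver.Maximize(obj)
    solver.Solve()
    return {k: v.solution_value() for k, v in x.items()}
```

With gold=None this solves plain MAP inference (the test-time setting); passing the gold one-hot labels yields the loss-augmented objective above.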
Putting everything together, given an image I, you should run the following procedure (a sketch of the loss computation follows the list):
1. Extract feature tensor T = FCN(I)
2. Compute pixel and edge potentials φ_{i,c}, φ_{i,j,c} using T
3. Perform loss-augmented MAP inference via ILP to get the highest scoring structure (x̂, ŷ)
4. Compute the structured hinge loss with (x̂, ŷ)
5. Backpropagate through the structured loss to update weights
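To make steps 4–5 concrete, here is a runnable PyTorch sketch of the structured hinge loss (shapes and names are illustrative, not the starter code's interface; in the real model the potentials are computed from the SVM weights and T, so the gradient reaches those weights):

```python
import torch

def structured_hinge(phi_pixel, phi_edge, x_hat, y_hat, x_gold, y_gold, hamming):
    """L = max(0, S(x_hat, y_hat, T) + l(x_hat, x*) - S(x*, y*, T)).
    phi_* are potentials (differentiable w.r.t. the SVM weights);
    the x/y assignments are fixed 0/1 tensors from inference."""
    def S(x, y):  # the ILP objective evaluated at a fixed assignment
        return (phi_pixel * x).sum() + (phi_edge * y).sum()
    return torch.clamp(S(x_hat, y_hat) + hamming - S(x_gold, y_gold), min=0)

# Tiny usage example with hypothetical shapes (2 pixels, 1 edge, 2 classes):
phi_pixel = torch.randn(2, 2, requires_grad=True)  # phi[i, c]
phi_edge = torch.randn(1, 2, requires_grad=True)   # phi[e, c] for each edge e
x_hat = torch.tensor([[1., 0.], [0., 1.]])         # loss-augmented MAP output
y_hat = torch.tensor([[0., 0.]])                   # edge inactive (labels differ)
x_gold = torch.tensor([[1., 0.], [1., 0.]])        # both pixels background
y_gold = torch.tensor([[1., 0.]])                  # gold edge active for class 0
hamming = torch.tensor(0.5)                        # 2 mismatched entries / (2 pixels * 2 classes)
loss = structured_hinge(phi_pixel, phi_edge, x_hat, y_hat, x_gold, y_gold, hamming)
loss.backward()  # gradients reach the potentials, hence the SVM weights
```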
At test time, you only need to return the most probable assignment as your output (i.e., step 3 below does not augment the objective with the loss).
1. Extract feature tensor T = FCN(I)
2. Compute pixel and edge potentials φ_{i,c}, φ_{i,j,c} using T
3. Perform MAP inference via ILP to get the highest scoring structure (x̂, ŷ)