The bag-of-words (BoW) approach, which you learned about in class, has been applied to a myriad of
recognition problems in computer vision. For example, two classic ones are object recognition [5, 7] and
scene classification [6, 8]1
Beyond that, a great deal of study has aimed at improving the BoW representation. You will see a large
number of approaches that remain in the spirit of BoW but improve upon the traditional approach which
you will implement here. For example, two important extensions are pyramid matching [2, 4] and feature
An illustrative overview of the homework is shown in Figure. 2. In Section. 1, we will build the visual
words from the training set images. With the visual words, i.e. the dictionary, in Section. 2 we will represent
an image as a visual-word vector. Then the comparison between images is realized in the visual-word vector
space. Finally, we will build a scene recognition system based on the visual bag-of-words approach to classify
a given image into 8 types of scenes.
Visual word vocabulary
Visual word vocabulary Feature
Image represented as
histogram of visual-words
Building the dictionary
Represent images as histograms of visual words and compare images
Figure 2: An overview of the bags-of-words approach to be implemented in the homework. First, given
the training set of images, we extract the visual features of the images. In our case, we will use the filter
responses of the pre-defined filter bank as the visual features. Next, we build visual words, i.e. a dictionary,
by finding the centers of clusters of the visual features. To classify new images, we first represent each image
as a vector of visual words, and then compare new images to old ones in the visual-word vector space – the
nearest match provides a label!
What you will be doing: You will implement a scene classification system that uses the bag-of-words
approach with its spatial pyramid extension. The paper that introduced the pyramid matching kernel  is
K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. ICCV 2005. http://www.cs.utexas.edu/
Spatial pyramid matching  is presented in
S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid
Matching for Recognizing Natural Scene Categories, CVPR 2006. http://www.di.
1This homework is largely self-contained, but reading the listed papers (or even just skimming them) will likely be helpful.
Figure 3: The provided multi-scale filter bank
You will be working with a subset of the SUN database2
. The data set contains 1600 images from various
scene categories like “aquarium, “desert” and “kitchen”. And to build a recognition system, you will:
• take responses of a filter bank on images and build a dictionary of visual words, and then
• learn a model for images based on the bag of words (with spatial pyramid matching ), and use
nearest-neighbor to predict scene classes in a test set.
In terms of number of lines of code, this assignment is fairly small. However, it may take a few hours
to finish running the baseline system, so make sure you start early so that you have time to debug things.
Also, try each component on a subset of the data set first before putting everything together.
We provide you with a number of functions and scripts in the hopes of alleviating some tedious or
error-prone sections of the implementation. You can find a list of files provided in Section 4.
Notice that, we include num workers as input for some functions you need to implement. Those are not
necessary, but can be used with multi-threading python libraries to speed up your code.
This homework was tested with Python 3.6.7, installed through Miniconda3.
1 Representing the World with Visual Words
1.1 Extracting Filter Responses (Rishi)
We want to run a filter bank on an image by convolving each filter in the bank with the image and concatenating all the responses into a vector for each pixel. In our case, we will be using 20 filters consisting of 4 types of filters in 5 scales. The filters are: (1) Gaussian, (2) Laplacian of Gaussian, (3)
derivative of Gaussian in the x direction, and (4) derivative of Gaussian in the y direction. The convolution function scipy.ndimage.convolve() can be used with user-defined filters, but the functions
scipy.ndimage.gaussian filter() and scipy.ndimage.gaussian laplace() may be useful here for improved efficiency. The 5 scales we will be using are 1, 2, 4, 8, and 8√
2, in pixel units.
Q1.1.1 (5 points): What properties do each of the filter functions pick up? (See Figure 3) Try to group
the filters into broad categories (e.g. all the Gaussians). Why do we need multiple scales of filter responses?
Answer in your write-up.
Q1.1.2 (10 points): For the code, loop through the filters and the scales to extract responses. Since color
images have 3 channels, you are going to have a total of 3F filter responses per pixel if the filter bank is of
size F. Note that in the given dataset, there are some gray-scale images. For those gray-scale images, you
Figure 4: An input image and filter responses for all of the filters in the filter bank. (a) The input image (b)
The filter responses in Lab colorization, corresponding to the filters in Figure. 3
can simply duplicate them into three channels using the command repmat. Then output the result as a 3F
channel image. Complete the function
visual words.extract filter responses(image)
and return the responses as filter responses. We have provided you with template code, with detailed
instructions commented inside.
Remember to check the input argument image to make sure it is a floating point type with range [0, 1],
and convert it if necessary. Be sure to check the number of input image channels and convert it to 3-
channel if it is not. Before applying the filters, use the function skimage.color.rgb2lab() to convert your
image into the Lab color space, which is designed to more effectively quantify color differences with respect
to human perception. (See here for more information.) If the input image is an M × N × 3 matrix, then
filter responses should be a matrix of size M ×N ×3F. Make sure your convolution function call handles
image padding along the edges sensibly.
Apply all 20 filters on aquarium/sun aztvjgubyrgvirup.jpg, and visualize the responses as an image
collage as shown in Figure 4. You can use the included helper function util.display filter responses()
(which expects a list of filter responses with those of the Lab channels grouped together with shape M×N ×3)
to create the collage. Submit the collage of 20 images in your write-up.
1.2 Creating Visual Words (Rishi)
You will now create a dictionary of visual words from the filter responses using k-means. After applying
k-means, similar filter responses will be represented by the same visual word. You will use a dictionary with
a fixed size. Instead of using all of the filter responses (which might exceed the memory capacity of
your computer), you will use responses at α random pixels3
. If there are T training images, then you
should collect a matrix filter responses over all the images that is αT × 3F, where F is the filter bank
size. Then, to generate a visual words dictionary with K words, you will cluster the responses with k-means
using the function sklearn.cluster.KMeans as follows:
kmeans = sklearn.cluster.KMeans(n clusters=K).fit(filter responses)
dictionary = kmeans.cluster centers
If you like, you can pass the n jobs argument into the KMeans() object to utilize parallel computation.
Q1.2 (10 points): Write the functions
visual words.compute dictionary one image(args),
visual words.compute dictionary().
Given a list of images, these functions generate a dictionary. The overall goal of compute dictionary() is
to load the training data, iterate through the paths to the image files to read the images, and extract αT
filter responses over the training files, and call k-means. This can be slow to run; however, the images can
be processed independently and in parallel. Inside compute dictionary one image(), you should read an
image, extract the responses, and save to a temporary file. Here, args is a collection of arguments passed into
the function. Inside compute dictionary(), you should load all the training data and create subprocesses
to call compute dictionary one image(). After all the subprocesses finish, load the temporary files back,
collect the filter responses, and run k-means. A sensible initial value to try for K is between 100 and 300,
and for α is between 50 and 500, but they depend on your system configuration and you might want to play
with these values.
Finally, execute compute dictionary(), and go do some push-ups while you wait for it to complete. If
all goes well, you will have a file named dictionary.npy that contains the dictionary of visual words (in
the same directory as the code). If the clustering takes too long, reduce the number of clusters and samples.
To debug, try passing in a small number of training files manually.
1.3 Computing Visual Words (Shumian)
Q1.3 (10 points): We want to map each pixel in the image to its closest word in the dictionary. Complete
the following function to do this:
visual words.get visual words(image,dictionary)
and return wordmap, a matrix with the same width and height as image, where each pixel in wordmap is
assigned the closest visual word of the filter response at the respective pixel in image. We will use the standard
Euclidean distance to do this; to do this efficiently, use the function scipy.spatial.distance.cdist().
Some sample results are shown in Fig. 5.
Visualize 3 wordmaps for each of three images from any one category. Include these in your write-up,
along with the original RGB images. Include some comments on these visualizations: do the
“word” boundaries make sense to you? The visualizations should look similar to the ones in Figure 5.
2 Building a Recognition System
We have formed a convenient representation for recognition. We will now produce a basic recognition system
with spatial pyramid matching. The goal of the system is presented in Fig. 1: given an image, classify (i.e.,
recognize/name) the scene depicted in the image.
3Try using numpy.random.permutation().
Figure 5: Visual words over images. You will use the spatially un-ordered distribution of visual words in a
region (a bag of visual words) as a feature for scene classification, with some coarse information provided by
spatial pyramid matching 
Traditional classification problems follow two phases: training and testing. At training time, the computer
is given a pile of formatted data (i.e., a collection of feature vectors) with corresponding labels (e.g., “desert”,
“park”) and then builds a model of how the data relates to the labels (e.g., “if green, then park”). At test
time, the computer takes features and uses these rules to infer the label (e.g., “this is green, therefore it is
In this assignment, we will use the simplest classification method: nearest neighbor. At test time, we
will simply look at the query’s nearest neighbor in the training set and transfer that label. In this example,
you will be looking at the query image and looking up its nearest neighbor in a collection of training images
whose labels are already known. This approach works surprisingly well given a huge amount of data. (For
a cool application, see the work by Hays & Efros ).
The key components of any nearest-neighbor system are:
• features (how do you represent your instances?) and
• similarity (how do you compare instances in the feature space?).
You will implement both.
2.1 Extracting Features (Shumian)
We will first represent an image with a bag of words. In each image, we simply look at how often each word
Q2.1 (10 points): Write the function
visual recog.get feature from wordmap(wordmap,dict size)
that extracts the histogram4 of visual words within the given image (i.e., the bag of visual words). As inputs,
the function will take:
• wordmap, a H × W image containing the IDs of the visual words
• dict size, the maximum visual word ID (i.e., the number of visual words, the dictionary size). Notice
that your histogram should have dict size different bins.
As output, the function will return hist, an “L1 normalized” dict size-length histogram The L1 normalization makes the sum of the histogram equal to 1. You may wish to load a single visual word map, visualize
it, and verify that your function is working correctly before proceeding.
4Look into numpy.histogram()
2.2 Multi-resolution: Spatial Pyramid Matching (Harsh)
A bag of words is simple and efficient, but it discards information about the spatial structure of the image
and this information is often valuable. One way to alleviate this issue is to use spatial pyramid matching .
The general idea is to divide the image into a small number of cells, and concatenate the histogram of each
of these cells to the histogram of the original image, with a suitable weight.
Here we will implement a popular scheme that chops the image into 2l × 2
cells where l is the layer
number. We treat each cell as a small image and count how often each visual word appears. This results
in a histogram for every single cell in every layer. Finally to represent the entire image, we concatenate all
the histograms together after normalization by the total number of features in the image. If there are L + 1
layers and K visual words, the resulting vector has dimensionality K
l = K
(L+1) − 1
Now comes the weighting scheme. Note that when concatenating all the histograms, histograms from
different levels are assigned different weights. Typically (and in the original work ), a histogram from
layer l gets half the weight of a histogram from layer l + 1, with the exception of layer 0, which is assigned
a weight equal to layer 1. A popular choice is to set the weight of layers 0 and 1 to 2−L, and set the rest of
the weights to 2l−L−1
(e.g., in a three layer spatial pyramid, L = 2 and weights are set to 1/4, 1/4 and 1/2
for layer 0, 1 and 2 respectively. See Fig. 6 for an illustration of a spatial pyramid. Note that the L1 norm
(absolute values of all dimensions summed up together) for the final vector is 1.
Figure 6: Spatial Pyramid Matching: From . Toy example of a pyramid for L = 2. The image has
three visual words, indicated by circles, diamonds, and crosses. We subdivide the image at three different
levels of resolution. For each level of resolution and each channel, we count the features that fall in each
spatial bin. Finally, weight each spatial histogram.
Q2.2 (15 points): Implement a function getImageFeaturesSPM that form a multi-resolution representation
of the given image.
visual recog.get feature from wordmap SPM(wordmap,layer num,dict size)
As inputs, the function will take:
• layer num, the number of layers in the spatial pyramid, i.e., L + 1
• wordmap, a H × W image containing the IDs of the visual words
• dict size, the maximum visual word ID (i.e., the number of visual words / the dictionary size)
As output, the function will return hist all, a vector that is L1 normalized. Please implement this
function for any given L. However, we use a 3-layer spatial pyramid (L = 2) for all of the
following recognition tasks.
One small hint for efficiency: a lot of computation can be saved if you first compute the histograms of
the finest layer, because the histograms of coarser layers can then be aggregated from finer ones. Make sure
you normalize the histogram after aggregation.