Description
1 Problem Statement
In this assignment, you are required to implement a pipeline for dense depth reconstruction from a sequence
of images captured using a calibrated smartphone camera.
2 Tasks
1. Structure from Motion. In the shared drive folder, there is a .mat file named ‘matchedPoints.mat’.
The file contains the matched keypoints (in image coordinates) across the sequence of 25 frames. Use
these points and the camera intrinsic matrix (also provided in the shared drive), to perform bundle
adjustment and obtain the 3D points and the camera poses for the different views. To make the
optimization process simpler, use the following two assumptions:
• You can make the small-angle approximation, i.e., sin θ ≈ θ and cos θ ≈ 1. As a result of this
assumption, your 3D rotation matrix for each view can be written as follows:

           ⎡  1       −θi^z     θi^y ⎤
     Ri ≈  ⎢  θi^z     1       −θi^x ⎥                                          (1)
           ⎣ −θi^y     θi^x     1    ⎦

Here Θi = [θi^x, θi^y, θi^z] is the angular displacement of the camera in the i-th view.
• Instead of solving for all the coordinates (X, Y and Z) of the 3-D points, you can parametrize
the points by their inverse depth, reducing the number of unknowns and making the optimization
easier. Specifically, if (xj, yj) is the projection of the j-th 3-D point in the reference frame¹, then the
same j-th 3-D point can be represented as Pj = (1/wj) [xj, yj, 1]^T. Here, wj = 1/zj is the inverse
depth of the j-th 3-D point.
The projection of the 3-D point Pj in the i-th image is pij = [pij^x, pij^y]^T. π : R³ → R² is the
projection function, i.e., π([x, y, z]^T) = [x/z, y/z]^T. Correspondingly, the cost function for bundle
adjustment becomes,
F = Σ_{i=1}^{Nc} Σ_{j=1}^{Np} ||pij − π(Ri Pj + Ti)||²                          (2)
Here Np is the number of 3-D points to reconstruct and Nc is the number of views (or frames). You can
use your favorite optimization routine (like lsqnonlin() in MATLAB or scipy.optimize.least_squares()
in Python) to minimize F in Equation 2 with respect to the pose for each view and the inverse depth
wj for each point. The steps involved in SfM are as follows:
(a) The matched key-points provided in the .mat file are in image coordinates. Write a function that
converts the matched points given in image coordinates to the normalized camera coordinates
compatible with equation 2. To do this, subtract the principal point from the matched points
provided to you and divide the difference by the focal length. For example, if one of the matched
points is mij = [mij^x, mij^y] in image coordinates, then to obtain pij = [pij^x, pij^y] in normalized
camera coordinates, perform the following operations: pij^x = (mij^x − cx)/fx and pij^y = (mij^y − cy)/fy.
Here [cx, cy] is the principal point and fx and fy are the focal lengths. These can be obtained
from the intrinsic matrix provided to you.
(b) Write a projection function that takes a 3D point in world coordinates (Pj), rotation angles (Θi),
and a translation vector (Ti), and projects the point to normalized camera coordinates, i.e., performs
π(RiPj + Ti).
(c) Using the projection function just defined, write a function that calculates the reprojection error
given the rotation angles, translation vector of all the views, and the inverse depth of all the 3D
points.
(d) Using an optimization routine (e.g. lsqnonlin in MATLAB or scipy.optimize.least_squares in
Python), minimize the reprojection error function defined above with respect to the rotation
angles (Θi) and translation (Ti) for each view and the inverse depth (wj) of each 3D point. You
can initialize Θi and Ti as zero vectors and wj as a vector of ones. You can choose the first
frame as the reference frame.
Once you obtain the 3-D points, plot them as a 3-D point cloud. Perform this experiment for 5, 15
and 25 frames and report the point cloud obtained in each case.
2. Plane Sweep. Minimizing F in Equation 2 provided us with the depths of a sparse set of 3-D points with
respect to the reference frame. To obtain a dense depth reconstruction, use the camera rotation matrices
and translation vectors obtained from the bundle adjustment problem of Equation 2, along with the
sequence provided in the shared folder, in a plane-sweeping framework. For the plane-sweeping algorithm,
one needs to find the plane-induced homography. For a plane at depth d and with normal vector n,
the plane-induced homography mapping the reference frame to the i-th frame is given as

Hd,i = K [Ri − (Ti n^T)/d] K^(−1)                                               (3)

As we are only concerned with the planes parallel to the reference frame, n = [0, 0, −1]^T. K is the
intrinsic matrix for the camera. You can use a set of 10 candidate depth planes between the minimum
and the maximum depths obtained from the above bundle adjustment problem. The steps involved in
plane sweeping are as follows:

¹(xj, yj) is in normalized camera coordinates, i.e., after subtraction of the principal point from the image coordinates
and division of the difference by the focal length in each direction.
(a) Map the i-th frame to the reference frame through the inverse plane-induced homography Hd,i^(−1). Perform
this warping for each frame and each candidate depth plane. Stack the warped frames into a
tensor. This tensor will be of size H × W × D × Nc, where H and W are the height and width of
the frames, D is the number of candidate depth planes, and Nc is the number of views or frames.
(b) Find the variance across all the warped frames for each candidate depth. This will reduce the
tensor to a cost volume of dimension H × W × D. Find the depth for each pixel as the one that
gives the minimum variance.
Plot the obtained depth map alongside the reference frame. Perform this experiment for 5, 15 and 25
frames and report the depth map obtained in each case. How does increasing the number of frames
change the quality of the depth map reconstruction?
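The plane-sweep steps above can be sketched as follows, assuming grayscale frames stored as 2-D NumPy arrays and the poses (Ri, Ti) recovered by bundle adjustment, with frames[0] as the reference view. The nearest-neighbour warping below is a deliberately simple stand-in for interp2- or cv2.warpPerspective-style remapping, and all names are illustrative.

```python
import numpy as np

def plane_homography(K, R, T, d, n=np.array([0.0, 0.0, -1.0])):
    """Equation 3: homography induced by the fronto-parallel plane at
    depth d, mapping reference-frame pixels to i-th-frame pixels."""
    return K @ (R - np.outer(T, n) / d) @ np.linalg.inv(K)

def warp_to_reference(img, H):
    """Warp frame i into the reference view: look up each reference
    pixel in frame i at H @ [x, y, 1] (inverse warping realizes the
    Hd,i^(-1) mapping of step (a)); out-of-frame samples become NaN."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    q = H @ np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    u = np.rint(q[0] / q[2]).astype(int)   # nearest-neighbour sampling
    v = np.rint(q[1] / q[2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.full(h * w, np.nan)
    out[ok] = img[v[ok], u[ok]]
    return out.reshape(h, w)

def plane_sweep(frames, K, Rs, Ts, depths):
    """Steps (a)-(b): build the H x W x D x Nc tensor of warped frames,
    reduce it to an H x W x D cost volume by variance across the views,
    and pick the minimum-variance depth for each pixel."""
    h, w = frames[0].shape
    D, Nc = len(depths), len(frames)
    tensor = np.empty((h, w, D, Nc))
    for k, d in enumerate(depths):
        for i in range(Nc):
            Hdi = plane_homography(K, Rs[i], Ts[i], d)
            tensor[:, :, k, i] = warp_to_reference(frames[i], Hdi)
    cost = np.nanvar(tensor, axis=3)   # cost volume, H x W x D
    best = np.nanargmin(cost, axis=2)  # index of the best depth plane
    return np.asarray(depths)[best]
```

With the reference frame itself included (identity pose), every pixel keeps at least one valid sample, so the NaN-aware variance and arg-min are always defined; near the image borders, where few views overlap, the estimate is naturally less reliable.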