COLMAP: A software for 3D Spare & Dense Reconstruction

3D reconstruction is one of the important applications in Computer Vision. With the input being a series of photos taken around an object, the output requirement is a 3D spatial image of the object. This construction process requires many steps and is complex. In this section, we will introduce you to COLMAP software - a versatile software that helps you easily construct the 3D space of the object. Through the article, we hope that readers can find new inspiration from computer vision problems.

1. Dataset and steps

COLMAP constructs the object space through the following steps: 1) Feature Detection, COLMAP uses the SIFT algorithm to extract image features. 2) Feature Matching, considering each image in turn, COLMAP will find a set of images whose parts match the image being considered. 3) Spare Reconstruction, using known images with overlapping regions in the previous step, COLMAP solves the Triangulation problem to find the image depth, thereby constructing a sparse space. 4) Dense Reconstruction: Constructs a dense 3D space from a sparse space.

In this presentation, we use the South Building data, which consists of 128 large-scale photographs around the building as shown below. We will use Colmap software to reconstruct the dense 3D space as shown on the right:

2. Feature Detection

The default feature extraction algorithm of COLMAP is SIFT. Therefore, we will introduce this algorithm before presenting the results from COLMAP.
Scale-Invariant Feature Transform (SIFT) is a local image extraction algorithm proposed by researcher David Lowe in computer vision. The algorithm works with stable results with different scales of images, besides, it can also be said that this algorithm is rotation-invariant. The algorithm is widely used in Object recognition, Image stitching, .... This algorithm includes the following steps:

1. Scale-space Extrema Detection: In this step, the algorithm finds the location of important information points in the image using the Laplacian of Gaussian (LoG) filter. However, these features can be changed for images with different scales. Therefore, the author proposes to use multiple filters of different sizes at the same time on images with different sizes. In addition, because the LoC filter is complicated to calculate, the author proposes to approximate this filter as the Difference of Gaussian: Take the difference of two consecutive Gaussians of magnitude. The algorithm continues to perform a data filtering operation to get the final result.

2. Keypoint Localization: In this step, the author proposes a better way to refine the positions of the key points in the previous step. Furthermore, the author also removes any key points that are suspected to be edges in this step.

3. Orientation assignment: Besides the scaling issue, the features of a point may differ when performing rotation. To avoid this situation, the author proposes to calculate the rotation direction of the feature. Thanks to that, two features will be rotated to the same angle before being compared. The author takes the majority direction of all pixels in the 16x16 square around the feature point

4. Keypoint Descriptor: SIFT describes the features of an important point with a 128-dimensional vector. Considering a 16x16 region around this point, SIFT divides this region into 4 adjacent 4x4 squares, each of which is not in the middle. For each square, SIFT plots the histogram in the direction divided into 8 dimensions, each dimension representing an angle of 45 degrees. The histograms of these 4 regions combine to form a 4×4×8=128-dimensional vector.

3. Feature Matching

In this step, COLMAP will use the extracted features from the Feature Detection step to find images with overlapping regions. Colmap provides many different matching algorithms including:

Exhaustive matching: Algorithm suitable for small data. Each image will be performed with ALL images in the database. This algorithm gives high accuracy, but has high computational cost.
Sequential Matching: Suitable when the data is known to be in order.
Vocabulary Tree Matching: The most popular algorithm, has fast execution speed with large data sets.
Spatial Matching: Suitable when each sample has coordinate information. Colmap will take images with this information close together for matching.
Transitive Matching: This algorithm is suitable when the data already has information that some images have been matched together.
Custom Matching

In this section, we will use the Vocabulary Tree Matching algorithm. We will first quickly demonstrate how to create a Vocabulary Tree, and use it to solve the Image Retrieval problem. Then, we will present the results from Colmap.

Vocabulary Tree

Vocabulary Tree is built by segmenting each layer of feature space from the training image dataset. Specifically, Vocabulary Tree is built step by step as follows:

1. Given a training image dataset, the dataset can be the set of images used to reconstruct the current scene, or another dataset (e.g. ImageNet). The entire image will be transformed into the feature space using an algorithm, e.g. SIFT detection.

2. With the extracted feature space, a hierarchical clustering algorithm is used and groups similar features into clusters. This is the Vocabulary Tree.

After building the Vocabulary Tree, Colmap performs the following search for similar image pairs:

1. Extract features from images used for construction, use vocabulary tree to find corresponding leaf node for feature. Each leaf node will store information: How many images have this leaf feature? How many features belong to this leaf does this image have? This information is used to calculate distance to query image.

2. Considering any image (query image), Colmap extracts and finds the corresponding leaf node for each feature. The more features the image has in the same cluster as the query image, the more similar it is to the query image (In fact, Colmap uses the score calculation from the proposed algorithm). The algorithm will select the set of K images with the highest score with the most query image in the database.

3. After finding the pair of images, Colmap will perform Feature Matching: Find similar keypoints between the two images. To limit noise, the algorithm performs an additional step of Image Verification: Remove outliers using the RANSAC algorithm.

4. Spare reconstruction

After finding the image pairs as well as the corresponding keypoints of the two matching images, COLMAP will reconstruct the sparse 3D space by projecting these keypoints onto the space. Constructing the sparse space belongs to the Structure From Motion (SFM) problem, Colmap performs the following steps to solve this problem:

Step 1: Initialization

Colmap picks two random images (we think the pair of images with the largest number of keypoint matches). With the keypoint matches of the two images calculated, Colmap will solve the following problems:

Find the relative position between two cameras: by solving the problem 1) Find Fundamental matrix, 2) find Essential matrix from Fundamental matrix, 3) Find camera pose from Essential matrix, this step gives 4 results, and 4) choose the result that satisfies the Cheirality condition. Noise filtering methods such as RANSAC are applied in steps 1, 2, 3.
Project these points onto 3D space, this step could have been done in step 1.4

Step 2: Image Registration:

After constructing the sparse space from two images, Colmap will add more images to continue constructing the sparse space. The selected image must have keypoints matching the available keypoints from the constructed images. In this step, Colmap solves the following problems:

Determine which keypoints of the added image match the keypoints and points that have been constructed in which 3D space in the constructed images.
Perspective-n-Point problem: Find the relative position of the added camera compared to the cameras.

Step 3: Triangulation

Colmap will project the keypoints of the added unconstructed image into 3D space.

Step 4: Bundle Adjustment

In step 2, the camera position is optimized from the available point cloud points. In this step, Colmap will simultaneously change the camera position and the point cloud to get the most optimal result.

Step 5: Outlier filtering

After optimizing the point clouds and camera positions, Colmap will remove the point clouds that give large errors when projected onto the 2D image plane.
Colmap continues to step two until all images have been added.
Colmap performs the Incremental Structure from Motion algorithm. There are also some other algorithms such as Global, Hierarchical SFM.

Result from Colmap:

5. Dense Reconstruction

In the previous step, Colmap calculated the relative positions between the cameras, and constructed the sparse space. In this step, Colmap will construct the dense space that fills the gaps of the spare space. This stage is turned off in the next step:

Estimate the depth and normal map for each image. This value must be the same for corresponding pixels in different views. Compared to the spare reconstruction step, the point clouds are constructed from all pixels instead of just keypoints, so Colmap can construct a denser sparse space.
Get the space between point cloud points using Poisson Surface Reconstruction algorithm.

6. Conclusion

In this article, we have presented the steps of implementing COLMAP software from a set of input images to a dense 3D space. At each step, we summarize the algorithm COLMAP uses including SIFT (Feature Detection), Vocabulary Tree (Feature Matching), Structure From Motion. The results from the dense space show that COLMAP reconstructs the 3D space very well with only 128 data images. Readers interested in COLMAP can find other results on the software homepage.

7. Reference

Large-scale 3D modeling. CVPR 2017 Tutorial - Large-scale 3D Modeling from Crowdsourced Data. (n.d.) - https://demuc.de/tutorials/cvpr2017/.
Vocabulary Tree - https://www.micc.unifi.it/delbimbo/wp-content/uploads/2013/12/A62_Vocabulary_tree.pdf.
Chahat Deep Singh. “Structure from Motion”, CMSC 426 Computer Vision.