[ECCV 2020] Training neural networks to predict visual overlap of images, through interpretable non-metric box embeddings
Anita Rau, Guillermo Garcia-Hernando, Danail Stoyanov, Gabriel J. Brostow and Daniyar Turmukhambetov – ECCV 2020 (Spotlight presentation)
To what extent are two images picturing the same 3D surfaces? Even when the scene is known, the answer typically
requires an expensive search across scale space, with matching and geometric verification of large sets of
local features. This expense is further multiplied when a query image is evaluated against
a gallery, e.g. in visual relocalization. While we don’t obviate the need for geometric
verification, we propose an interpretable image-embedding that cuts the search in scale space to essentially a lookup.
Neural networks can be trained to predict vector representations of images, such that the relative camera position between
a pair of images is approximated by a distance in vector space. Several variants of such relations exist, but
unfortunately they are not interpretable.
We propose to capture camera position relations through normalized surface overlap (NSO).
The NSO measure is not symmetric, but it is interpretable.
We propose to represent images as boxes, not vectors. Two boxes can intersect, and boxes can have different volumes.
The ratio of intersection over volume can be used to approximate normalized surface overlap. So, box representation
allows us to model non-symmetric (non-metric) relations between pairs of images. The result is that with box embeddings
we can quickly identify, for example, which test image is a close-up version of another.
Next, we plot the predicted NSO relationship between a test query image and a set of test images. We say "enclosure" for the NSO of query pixels visible in the retrieved image, and "concentration" for the NSO of retrieved-image pixels visible in the query.
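To make the box intuition concrete, here is a minimal sketch (with made-up 2-D box coordinates, not values produced by the trained network) of how intersection-over-volume yields the two asymmetric NSO estimates:

```python
import numpy as np

def box_overlap(a, b):
    """Fraction of box `a`'s volume covered by box `b`.

    Boxes are (min_corner, max_corner) pairs of d-dimensional arrays.
    Note the asymmetry: box_overlap(a, b) != box_overlap(b, a) in
    general, mirroring the asymmetry of normalized surface overlap.
    """
    a_min, a_max = a
    b_min, b_max = b
    edges = np.minimum(a_max, b_max) - np.maximum(a_min, b_min)
    inter = np.prod(np.clip(edges, 0.0, None))  # 0 if boxes are disjoint
    return inter / np.prod(a_max - a_min)

# Made-up 2-D boxes: a small box (a close-up query) nested
# inside a larger box (a wide shot of the same scene).
query = (np.array([2.0, 2.0]), np.array([3.0, 3.0]))      # volume 1
retrieved = (np.array([0.0, 0.0]), np.array([4.0, 4.0]))  # volume 16

enclosure = box_overlap(query, retrieved)      # all query pixels visible: 1.0
concentration = box_overlap(retrieved, query)  # small slice of retrieved: 0.0625
```

Because the ratio is normalized by the first box's volume, the close-up nested inside the wide shot scores an enclosure of 1.0 but a concentration of 1/16, an asymmetry that a single symmetric vector distance could not express.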
Create and activate the conda environment:

```bash
conda env create -f environment.yml -n boxes
conda activate boxes
```
Replace `path_sfm` and `path_depth` with the correct paths on your machine in (each of) the dataset files in `data/dataset_jsons/megadepth/<scene name>`.

The dataset splits are stored in `data/overlap_data/megadepth/<scene name>/`. Each file (`train.txt`, `val.txt` and `test.txt`) contains the filenames for the corresponding split.

If you would like to generate the overlap data yourself, we provide the package `src/datasets/dataset_generator`.
The package has two main components: i) `compute_normals.py` and ii) `compute_overlap.py`. `compute_normals.py` computes the surface normals using the available depth images; the list of images with depth available for each scene is in `data/overlap_data/megadepth/<scene name>/images_with_depth`. `compute_overlap.py` computes the normalized surface overlap between image pairs given the surface normals. Don't forget to update the paths before running. We provide an example script, `generate_dataset.sh`.
NOTE: Normal data is stored uncompressed, at roughly 50 MB per image on average, so the storage requirements can escalate quickly.

To train a model, run, for example:

```bash
python -m src.train \
--name my_box_model \
--dataset_json data/dataset_jsons/megadepth/bigben.json \
--box_ndim 32 \
--batch_size 32 \
--model resnet50 \
--num_gpus 1 \
--backend dp
```
`box_ndim` is the dimensionality of the embedding space. `backend` is the PyTorch Lightning distributed backend, which is flexible (we have only tested this implementation with `dp` and `ddp`) and can be used with different values of `num_gpus`. We provide an example script, `train.sh`.
By default, TensorBoard logs and models are saved in a folder with the same name as the experiment, `/<name>`.

To evaluate a trained model, run, for example:

```bash
python -m src.test \
--model_scene bigben \
--model resnet50 \
--dataset_json data/dataset_jsons/megadepth/bigben.json
```

We provide an example script, `test.sh`, and a notebook, `relative_scale_example.ipynb`.
| Scene | Input size and model | Filesize | Link |
|---|---|---|---|
| Big Ben | 256 x 456, ResNet50 | 95 MB | Download 🔗 |
| Notre Dame | 256 x 456, ResNet50 | 95 MB | Download 🔗 |
| Venice | 256 x 456, ResNet50 | 95 MB | Download 🔗 |
| Florence | 256 x 456, ResNet50 | 95 MB | Download 🔗 |
If you find our work useful or interesting, please consider citing our paper:
```
@inproceedings{rau-2020-image-box-overlap,
  title = {Predicting Visual Overlap of Images Through Interpretable Non-Metric Box Embeddings},
  author = {Anita Rau and
            Guillermo Garcia-Hernando and
            Danail Stoyanov and
            Gabriel J. Brostow and
            Daniyar Turmukhambetov},
  booktitle = {European Conference on Computer Vision ({ECCV})},
  year = {2020}
}
```
Copyright © Niantic, Inc. 2020. Patent Pending. All rights reserved. Please see the license file for terms.