Self-supervised Learning of Contextualized Local Visual Embeddings

Thalles Santos Silva, Helio Pedrini, Adín Ramírez Rivera

4th Visual Inductive Priors for Data-Efficient Deep Learning Workshop (ICCV 2023)


Abstract

We present Contextualized Local Visual Embeddings (CLoVE), a self-supervised, convolutional-based method that learns representations suited for dense prediction tasks. CLoVE departs from current methods by optimizing a single loss function that operates at the level of contextualized local embeddings learned from the output feature maps of convolutional neural network (CNN) encoders. To learn contextualized embeddings, CLoVE introduces a normalized multi-head self-attention layer that combines local features from different parts of an image based on their similarity. We extensively benchmark CLoVE's pre-trained representations on multiple datasets. CLoVE reaches state-of-the-art performance for CNN-based architectures on four dense prediction downstream tasks: object detection, instance segmentation, keypoint detection, and dense pose estimation.
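As a rough illustration of the idea in the abstract, the sketch below flattens a CNN feature map into a sequence of local embeddings and contextualizes them with multi-head self-attention followed by l2-normalization. This is a minimal reading of the abstract, not the authors' implementation; the head count, embedding width, and the placement of the normalization are assumptions.

    # Sketch of a normalized multi-head self-attention layer applied to the
    # local embeddings of a CNN feature map. Illustrative only: layer sizes
    # and the final l2-normalization are assumptions, not the paper's design.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContextualizedEmbeddings(nn.Module):
        def __init__(self, dim=2048, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, feature_map):
            # feature_map: (B, C, H, W) output of a CNN encoder, e.g. ResNet-50.
            B, C, H, W = feature_map.shape
            # Flatten spatial locations into a sequence of local embeddings.
            tokens = feature_map.flatten(2).transpose(1, 2)   # (B, H*W, C)
            # Each location attends to all others, mixing local features from
            # different parts of the image according to their similarity.
            context, _ = self.attn(tokens, tokens, tokens)    # (B, H*W, C)
            # Normalize so the loss operates on unit-length contextualized embeddings.
            return F.normalize(context, dim=-1)

    encoder_out = torch.randn(2, 2048, 7, 7)  # dummy ResNet-50 stage-5 features
    print(ContextualizedEmbeddings()(encoder_out).shape)  # torch.Size([2, 49, 2048])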

Pre-trained models

Models trained with ResNet-50 encoders.

Model   Epochs   Multicrop       URL
CLoVE   50       2x224 + 6x96    Checkpoints
CLoVE   200      2x224 + 6x96    Checkpoints
CLoVE   400      2x224 + 6x96    Checkpoints
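To reuse a released checkpoint for a downstream task, loading the backbone weights into a torchvision ResNet-50 might look like the sketch below. The file name and the checkpoint layout (top-level "state_dict" key, "module."/"encoder." prefixes) are hypothetical; inspect the actual file and adapt the prefix stripping accordingly.

    # Hypothetical loading of a CLoVE ResNet-50 checkpoint into torchvision.
    # The file name and key layout are assumptions; check ckpt.keys() first.
    import torch
    from torchvision.models import resnet50

    ckpt = torch.load("clove_r50_400ep.pth", map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)  # some checkpoints nest the weights

    # Keep only backbone weights and drop any framework-specific prefix.
    backbone = {k.removeprefix("module.").removeprefix("encoder."): v
                for k, v in state_dict.items()}

    model = resnet50()
    missing, unexpected = model.load_state_dict(backbone, strict=False)
    print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")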

Important links

Code · arXiv · OpenReview

Poster

The poster is available for download as a PDF file.

Reference


@inproceedings{silva2023self,
    title={Self-supervised Learning of Contextualized Local Visual Embeddings},
    author={Silva, Thalles and Pedrini, Helio and Ram{\'\i}rez Rivera, Ad{\'\i}n},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
    pages={177--186},
    year={2023}
}
            

Authors

Thalles Santos Silva

Helio Pedrini

Adín Ramírez Rivera

Acknowledgements

The computations were performed in part on resources provided by Sigma2, the National Infrastructure for High Performance Computing and Data Storage in Norway, through Project NN8104K. This work was funded in part by the Research Council of Norway, through its Centre for Research-based Innovation funding scheme (grant no. 309439), and Consortium Partners. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brasil (CAPES), Finance Code 001.