Compact Hash Codes for Efficient Visual Descriptors Retrieval in Large Scale Databases (IEEE 2017).


In this paper we present an efficient method for visual descriptors retrieval based on compact hash codes computed using a multiple k-means assignment. The method has been applied to the problem of approximate nearest neighbor (ANN) search of local and global visual content descriptors, and it has been tested on different datasets: three large scale standard datasets of engineered features of up to one billion descriptors (BIGANN) and, supported by recent progress in convolutional neural networks (CNNs), on the CIFAR-10, MNIST, INRIA Holidays, Oxford 5K and Paris 6K datasets; the recent DEEP1B dataset, composed of one billion CNN-based features, has also been used. Experimental results show that, despite its simplicity, the proposed method obtains a very high performance that makes it superior to more complex state-of-the-art methods.

Index Terms—Retrieval, nearest neighbor search, hashing, SIFT, CNN.


Efficient nearest neighbor (NN) search is one of the main issues in large scale information retrieval for multimedia and computer vision tasks. Even methods designed for multidimensional indexing obtain a performance that is only comparable to exhaustive search when they are used to index high dimensional features [1]. A typical solution to avoid an exhaustive comparison is to employ methods that perform approximate nearest neighbor (ANN) search, that is, finding the elements of a database that have a high probability of being nearest neighbors. ANN algorithms have typically been evaluated based on the trade-off between efficiency and search quality, but the advent of web-scale datasets has also introduced the problems of memory usage and speed. For these reasons the most recent approaches typically use feature hashing to reduce the dimensionality of the descriptors, and compute Hamming distances over these hash codes. These methods generally use inverted files, e.g. implemented using hash tables, to index the database, and require hash codes several tens of bits long to obtain a reasonable retrieval performance; a typical length is 64 bits. Their application to systems with relatively limited memory (e.g. mobile devices, which still have only 1-2 GB of RAM) or to large-scale media indexing systems, which need to keep the index in main memory to provide an adequate response time, requires the use of compact hash codes.
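The passage above relies on the fact that Hamming distance over compact hash codes is extremely cheap: a bitwise XOR followed by a population count. A minimal sketch in Python, assuming 64-bit integer codes (function names are illustrative, not from the paper):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hash codes:
    XOR marks the positions where the codes differ, then the
    set bits are counted."""
    return bin(a ^ b).count("1")

def nearest_codes(query: int, database: list, top_k: int = 5) -> list:
    """Indices of the top_k database codes closest to the query,
    ranked by Hamming distance (exhaustive scan, for clarity)."""
    dists = [(hamming_distance(query, c), i) for i, c in enumerate(database)]
    dists.sort()
    return [i for _, i in dists[:top_k]]
```

In a real system the exhaustive scan would be replaced by an inverted file or hash table lookup, as the text notes; the sketch only shows the distance computation the hash codes enable.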


Previous works on visual feature hashing can be divided into methods based on hashing functions, scalar quantization, vector quantization and, more recently, neural networks. a) Hashing functions: Weiss et al. [2] have proposed to treat the problem of hashing as a particular form of graph partitioning, in their Spectral Hashing (SH) algorithm. Li et al. [3] have improved the application of SH to image retrieval by optimizing, during the hash function learning process, the graph Laplacian that is built from pairwise similarities of images, without requiring a distance metric to be learned in a separate step. Heo et al. [4] have proposed to encode high-dimensional data points using hyperspheres instead of hyperplanes; Jin et al. [5] have proposed a variation of LSH, called Density Sensitive Hashing, that does not use random projections but instead uses projective functions that are more suitable for the distribution of the data. Du et al. [6] have proposed the use of Random Forests to perform linear projections, along with a metric that is not based on Hamming distance. Lv et al. [7] address the problem of large scale image retrieval by learning two hashes in their Asymmetric Cyclical Hashing (ACH) method: a short one (k bits) for the images of the database and a longer one (mk bits) for the query. Hashes are obtained using similarity-preserving Random Fourier Features, and retrieval is performed by computing the Hamming distance between the long query hash and the cyclically m-times concatenated compact hash code of the stored image.


The proposed method exploits a novel version of the k-means vector quantization approach, introducing the possibility of assignment of a visual feature to multiple cluster centers during the quantization process. This approach greatly reduces the number of required cluster centers, as well as the required training data, performing a sort of quantized codebook soft assignment for an extremely compact hash code of visual features. Table I summarizes the symbols used in the following. In the standard k-means hashing approach, features are assigned to the nearest cluster center, whose code is used as the hash code. Considering the case of 128-dimensional visual content descriptors, like SIFT or the FC7 layer of the VGG-M-128 CNN [33], this means that compressing them to 64-bit codes requires k = 2^64 centroids. In this case the computational cost of learning a k-means based quantizer becomes expensive in terms of memory and time because: i) a quantity of training data several times larger than k is needed, and ii) the execution time of the algorithm becomes unfeasible. Using hierarchical k-means (HKM) makes it possible to reduce execution time, but the problem of memory usage and size of the required learning set affects this approach as well. Since the quantizer is defined by the k centroids, the use of quantizers with a large number of centroids may not be practical or efficient: if a feature has dimension D, then k × D values must be stored to represent the codebook of the quantizer. A possible solution to this problem is to reduce the length of the hash signature, but this typically affects retrieval performance negatively. The use of product k-means quantization, proposed originally by Jégou et al. [1], overcomes this issue.
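The key idea of the multiple assignment can be sketched as follows: instead of using 2^64 centroids to produce a 64-bit code, one centroid is associated with each bit of the code, and a feature sets to 1 the bits of the centroids it is assigned to. A minimal sketch, assuming the nearest-m variant of the assignment (the function name and the choice of Euclidean distance are illustrative; the paper's t1/t2 variants use a distance threshold instead of a fixed m):

```python
import numpy as np

def multi_kmeans_hash(feature, centroids, m=2):
    """Sketch of multiple k-means assignment hashing.

    centroids: array of shape (k, D), one centroid per bit of
    the hash code, so k is the code length (e.g. 64).
    The feature is assigned to its m nearest centroids, and the
    corresponding bits are set to 1.
    """
    dists = np.linalg.norm(centroids - feature, axis=1)
    code = np.zeros(len(centroids), dtype=np.uint8)
    code[np.argsort(dists)[:m]] = 1
    return code
```

Note how only k centroids (e.g. 64) need to be learned and stored for a k-bit code, which is what makes the codebook, and the training set needed to learn it, so much smaller than in the standard single-assignment scheme.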


BIGANN Dataset [1], [36] is a large-scale dataset commonly used to compare methods for visual feature hashing and approximate nearest neighbor search [1], [15], [16], [18], [20], [36], [37]. The dataset is composed of three different sets of SIFT and GIST descriptors, each one divided into three subsets: a learning set, a query set and a base set; each query has corresponding ground truth results in the base set, computed exhaustively with Euclidean distance and ordered from the most similar to the most different. For SIFT1M and SIFT1B, query and base descriptors have been extracted from the INRIA Holidays images [38], while the learning set has been extracted from Flickr images. For GIST1M, query and base descriptors are from the INRIA Holidays and Flickr 1M datasets, while learning vectors are from [39]. In all cases query descriptors are from the query images of INRIA Holidays (see Figure 2). The characteristics of the dataset are summarized in Table II. DEEP1B Dataset [22] is a recent dataset produced using a deep CNN based on the GoogLeNet [40] architecture and trained on the ImageNet dataset [41]. Descriptors are extracted from the outputs of the last fully-connected layer, compressed using PCA to 96 dimensions, and l2-normalized. The characteristics of the dataset are summarized in Table III. CIFAR-10 Dataset [42] consists of 60,000 colour images (32 × 32 pixels) in 10 classes, with 6,000 images per class (see Figure 3). The dataset is split into training and test sets, composed of 50,000 and 10,000 images respectively. A retrieved image is considered relevant for the query if it belongs to the same class. This dataset has been used for ANN retrieval using hash codes in [24], [29]. MNIST Dataset [43] consists of 70,000 handwritten digit images (28 × 28 pixels, see Figure 4). The dataset is split into 60,000 training examples and 10,000 test examples. Similarly to CIFAR-10, a retrieved image is considered relevant if it belongs to the same class as the query.
This dataset has been used for ANN retrieval in [24], [29]. Image retrieval datasets INRIA Holidays [38], Oxford 5K [44] and Paris 6K [45] are three datasets typically used to evaluate image retrieval systems. For each dataset, a number of query images is given, along with the associated ground truth.
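The class-based relevance criterion used for CIFAR-10 and MNIST above reduces evaluation to comparing labels: a retrieved item counts as relevant when its class matches the query's. A minimal sketch of precision under this protocol (the function name is illustrative, not from the paper):

```python
def retrieval_precision(query_label, retrieved_labels):
    """Fraction of retrieved items that share the query's class,
    following the class-based relevance protocol used for
    CIFAR-10 and MNIST."""
    if not retrieved_labels:
        return 0.0
    relevant = sum(1 for lbl in retrieved_labels if lbl == query_label)
    return relevant / len(retrieved_labels)
```

For BIGANN and DEEP1B, by contrast, the ground truth consists of the exact Euclidean nearest neighbors, so evaluation is done against those ranked lists rather than class labels.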


We have proposed a new version of the k-means based hashing schema called multi-k-means – with 4 variants: m-k-means-t1, m-k-means-t2, m-k-means-n1 and m-k-means-n2 – which uses a small number of centroids, guarantees a low computational cost and results in a compact quantizer. These characteristics are achieved thanks to the association of the centroids with the bits of the hash code, which greatly reduces the number of centroids needed to produce a code of the required length. Another advantage of the method is that it has no parameters in its m-k-means-t1 and m-k-means-t2 variants, and only one parameter in the other two variants; in any case, as shown by the experiments, the method is quite robust to variations of this parameter, as well as to the hash code length. Our compact hash signature is able to represent high dimensional visual features, obtaining a very high efficiency in approximate nearest neighbor (ANN) retrieval, on both local and global visual features.