Ligang Zheng and Chao Song
Abstract: There has been a steep increase in data encoded as symmetric positive definite (SPD) matrices in the past decade. The set of SPD matrices forms a Riemannian manifold that constitutes an open convex cone in the vector space of matrices, which we sometimes call the SPD manifold. One of the fundamental problems in applications of the SPD manifold is finding the nearest neighbor of a query SPD matrix. Hashing is a popular method for nearest neighbor search. However, hashing cannot be directly applied to the SPD manifold due to its non-Euclidean intrinsic geometry. Inspired by the idea of the kernel trick, a new hashing scheme for the SPD manifold based on random projection and quantization in an expanded data space is proposed in this paper. Experimental results in large-scale near-duplicate image detection show the effectiveness and efficiency of the proposed method.
Keywords: Riemannian manifold, congruent transformation, hashing, kernel trick.
A huge amount of multimedia (especially images, videos and photos) has been flooding websites such as YouTube, Facebook, Google Video and many others. On one hand, easy access to multimedia big data brings much enjoyment to the public. On the other hand, the deluge of multimedia content has undoubtedly created problems such as copyright infringement and wasteful usage of storage space and network bandwidth.
Website owners can easily take measures to prevent users from uploading exactly identical images or videos by using hash codes (for example, MD5). According to some statistics, many images and videos on the Internet are near-duplicates rather than exact copies. Auxiliary textual information associated with an image or video is of no use in determining whether it has been illegally copied. Therefore, it is very challenging to detect near-duplicate images or videos.
A promising approach to tackle this problem is to look directly into the visual content of the videos or images. This is the so-called content-based copy detection (CBCD) approach [Zheng, Lei, Qiu et al. (2012); Lei, Qiu, Zheng et al. (2014a)].
Unlike watermarking, CBCD extracts a small number of features (called the fingerprint or signature) from the original image or video content instead of inserting some external information [Wang, Zhu and Shi (2018); Ma, Luo, Li et al. (2018); Li, Castilione and Dong (2018)] into the image or video content prior to its distribution. It is based on content similarity and relies on the assumption that an image or video will share a significant amount of information with its copies and will be distinguishable from non-copies. A major challenge of CBCD lies in the fact that a copy is not necessarily an identical or near replication, but rather a photometric or geometric transformation of the original that remains recognizable [Poullot, Crucianu and Buisson (2008); Douze, Jegou, Sandhawalia et al. (2009); Lei, Qiu, Zheng et al. (2014a); Lei, Zheng and Huang (2014b)]. The transformations may include changing color/brightness, camera recording, blurring, inserting logos/subtitles, cropping, flipping, etc.
A key issue in the successful detection of a copied video or image lies in the design of an effective image or video content descriptor. These descriptors can be classified into global and local statistical descriptors [Zheng, Lei, Qiu et al. (2012)]. Global statistics are generally efficient to compute and compact to store, but less accurate in terms of retrieval quality. On the other hand, local descriptors (such as SIFT [Lowe (2004)]) are relatively more robust to image transformations, such as occlusion, cropping, etc.
Salient covariance (SCOV) [Zheng, Lei, Qiu et al. (2012)] is a compact and robust descriptor based on visual saliency and region covariance matrices for near-duplicate image and video detection. The SCOV descriptor has shown its advantages of being compact, discriminative and robust over other state-of-the-art global descriptors [Zheng, Qiu and Huang (2018)]. Like other region covariance-based descriptors, SCOVs are symmetric positive definite (SPD) matrices, which form a Riemannian manifold of non-positive curvature. Therefore, the Euclidean computational framework cannot be directly applied to SPD matrices. Instead, the Riemannian computational framework should be adopted, and the similarities of the descriptors have to be measured using a Riemannian metric such as the affine-invariant Riemannian metric (AIRM) [Pennec, Fillard and Ayache (2006)], the log-Euclidean Riemannian metric (LERM) [Arsigny, Fillard, Pennec et al. (2006)] or the Jensen-Bregman LogDet divergence (JBLD) [Cherian, Sra, Banerjee et al. (2011)].
As a result of the nonlinear computational framework of the SPD manifold, it is usually very time-consuming to conduct nearest neighbor (NN) search on it. Many papers have made efforts to design efficient NN search algorithms [Cherian, Sra, Banerjee et al. (2011)]. In spite of hashing's success in visual similarity search, existing techniques have an important restriction: current methods generally assume the data lie in a multidimensional vector space. In this paper, we investigate hashing-based methods for NN search on the SPD manifold. Inspired by the idea of the kernel trick, this paper maps the SPD matrices into a high-dimensional Euclidean space and then uses random projection and quantization to obtain a binary representation of each SPD matrix. Experimental results show the effectiveness and efficiency of the proposed method.
In this part, we give a short introduction to a covariance matrix-based descriptor [Tuzel, Porikli and Meer (2006)], the salient covariance (SCOV), which is used for near-duplicate image/video detection. Given an image $I$, let $F$ be a $W \times H \times d$-dimensional feature image derived from $I$. There are many ways to derive the feature image. Usually, $F$ can be set as $F(x, y) = \phi(I, x, y)$, where the mapping $\phi$ extracts features such as pixel coordinates, intensity, color and gradient magnitudes.
Let $S$ denote the set of salient pixels and $f_k$ the feature vector at the $k$-th salient pixel. The covariance matrix of the salient features is defined as
$$C_S = \frac{1}{|S| - 1} \sum_{k \in S} (f_k - \mu)(f_k - \mu)^{\top},$$
where $\mu$ is the mean of the salient feature vectors.
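As a minimal illustration, the sketch below computes such a covariance descriptor in Python. The feature choice (pixel coordinates, intensity and first-order derivatives) and the given boolean saliency mask are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def salient_covariance(image, saliency_mask):
    """Covariance descriptor over the salient pixels of a grayscale image.

    image: (H, W) float array; saliency_mask: (H, W) boolean array.
    Returns a d x d SPD matrix (here d = 5: x, y, I, |Ix|, |Iy|).
    """
    H, W = image.shape
    ys, xs = np.mgrid[0:H, 0:W]                      # pixel coordinate grids
    Iy, Ix = np.gradient(image)                      # first-order derivatives
    feats = np.stack([xs, ys, image, np.abs(Ix), np.abs(Iy)], axis=-1)
    f = feats[saliency_mask]                         # |S| x d matrix of salient features
    mu = f.mean(axis=0)
    C = (f - mu).T @ (f - mu) / (len(f) - 1)         # sample covariance
    return C + 1e-6 * np.eye(C.shape[1])             # regularize to ensure strict positivity
```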
Locality-sensitive hashing (LSH) reduces the dimensionality of high-dimensional data. LSH hashes input items so that similar items map to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items). LSH differs from conventional and cryptographic hash functions because it aims to maximize the probability of a collision for similar items. Locality-sensitive hashing has much in common with data clustering and nearest neighbor search.
By using a set of hash functions, we can easily obtain a binary code for a vector $x$.
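A minimal sketch of sign random projection, the classic LSH family for angular similarity (all variable names are illustrative): each hash function thresholds the inner product with a random Gaussian direction, so nearby vectors collide with high probability.

```python
import numpy as np

def lsh_code(x, R):
    """Binary code for vector x from projection matrix R (k x d).

    Bit i is 1 if <r_i, x> >= 0 and 0 otherwise (sign random projection).
    """
    return (R @ x >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
d, k = 128, 32                    # input dimension, code length
R = rng.standard_normal((k, d))   # one random Gaussian direction per bit
x = rng.standard_normal(d)
print(lsh_code(x, R))             # k-bit code; Hamming distance approximates angle
```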
The inner product between two vectors $x$ and $y$ is written as $\langle x, y \rangle$. A matrix $A$ is called positive if $\langle x, Ax \rangle > 0$ for all $x \neq 0$; this is usually denoted by $A \succ 0$. Denote by $\mathrm{Sym}(n)$ the space of all $n \times n$ symmetric matrices and denote $\mathrm{Sym}^{+}(n) = \{A \in \mathrm{Sym}(n) \mid A \succ 0\}$. Thus, any $A \in \mathrm{Sym}^{+}(n)$ is an SPD matrix. The space $\mathrm{Sym}^{+}(n)$ forms a convex subset of $\mathbb{R}^{n \times n}$, but it is not closed under multiplication with a negative scalar. Therefore, the space of SPD matrices, although a subset of a vector space, is not itself a vector space. Instead, the set of SPD matrices forms a differentiable Riemannian manifold of non-positive curvature [Zheng, Kim, Adluru et al. (2017); Bhatia (2007); Hiai and Petz (2009)], which is usually called the SPD manifold. It can be identified with the quotient space $GL(n)/O(n)$, where $GL(n)$ denotes the general linear group (the group of $n \times n$ nonsingular matrices) and $O(n)$ is the orthogonal group (the group of $n \times n$ orthogonal matrices).
The inner product of two tangent vectors $u, v \in T_P \mathrm{Sym}^{+}(n)$ at a point $P$ is as follows:
$$\langle u, v \rangle_P = \mathrm{tr}\left(P^{-1} u\, P^{-1} v\right).$$
This defines the Fisher-Rao metric (also known as the affine-invariant Riemannian metric [Pennec, Fillard and Ayache (2006)]) in the statistical model of multivariate distributions, which gives $\mathrm{Sym}^{+}(n)$ the structure of a Riemannian homogeneous space of negative curvature. This metric is the starting point of this paper.
The geodesic distance under the log-Euclidean Riemannian metric is as follows:
$$d_{LE}(X, Y) = \left\| \log(X) - \log(Y) \right\|_F.$$
The manifold exponential operator $\exp_P$ maps a tangent vector $v$ to the location on the manifold reached by the geodesic starting at $P$ in the tangent direction $v$. Its inverse, the Riemannian logarithm operator $\log_P$, gives the vector in $T_P \mathrm{Sym}^{+}(n)$ corresponding to the geodesic from $P$ to $Q$. The matrix logarithm $\log(\cdot)$ and matrix exponential $\exp(\cdot)$ of SPD matrices are calculated via the eigendecomposition $X = U\,\mathrm{diag}(\lambda_1, \ldots, \lambda_n)\, U^{\top}$ as
$$\log(X) = U\,\mathrm{diag}(\log \lambda_1, \ldots, \log \lambda_n)\, U^{\top}, \qquad \exp(X) = U\,\mathrm{diag}(e^{\lambda_1}, \ldots, e^{\lambda_n})\, U^{\top}.$$
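For SPD matrices both operators reduce to an eigendecomposition. A minimal sketch (helper names are illustrative; scipy.linalg.logm/expm would serve equally well):

```python
import numpy as np

def spd_log(X):
    """Matrix logarithm of an SPD matrix via eigendecomposition X = U diag(s) U^T."""
    s, U = np.linalg.eigh(X)
    return U @ np.diag(np.log(s)) @ U.T

def spd_exp(X):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    s, U = np.linalg.eigh(X)
    return U @ np.diag(np.exp(s)) @ U.T

def lerm_distance(X, Y):
    """Log-Euclidean geodesic distance d(X, Y) = ||log X - log Y||_F."""
    return np.linalg.norm(spd_log(X) - spd_log(Y), 'fro')
```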
For a unitary matrix $U$ (i.e., $U U^{\top} = I$), we have $\log(U X U^{\top}) = U \log(X)\, U^{\top}$, and hence
$$d_{LE}(U X U^{\top}, U Y U^{\top}) = \left\| U (\log X - \log Y)\, U^{\top} \right\|_F = d_{LE}(X, Y).$$
This formula means that the congruent transformation with unitary matrices (elements of the orthogonal group) preserves the geodesic distance between two SPD matrices. The unitary matrices can be seen as elements of the special orthogonal group $SO(n)$. Therefore, the congruent transformation merely changes the orientation of the original ellipsoid while preserving its shape.
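This invariance is easy to verify numerically. The sketch below (illustrative; it reuses the `lerm_distance` helper from the previous sketch) draws a random orthogonal matrix via QR decomposition and checks that the log-Euclidean distance is unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
X, Y = A @ A.T + n * np.eye(n), B @ B.T + n * np.eye(n)   # two random SPD matrices

Q, _ = np.linalg.qr(rng.standard_normal((n, n)))           # random orthogonal (unitary) matrix

d_before = lerm_distance(X, Y)
d_after = lerm_distance(Q @ X @ Q.T, Q @ Y @ Q.T)          # congruent transformation
assert np.isclose(d_before, d_after)                       # geodesic distance is preserved
```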
Since no single data structure can capture the diversity and richness of high-dimensional data, an ensemble strategy can be adopted to improve diversity. Therefore, we can obtain a set of randomized SPD matrices by choosing a set of stochastic unitary matrices.
Hashing techniques have achieved great success in visual similarity search. However, they apply only to data assumed to lie in a multidimensional vector space. Given data lying on a Riemannian manifold instead of in a linear vector space, the usual approach is to embed the manifold-valued data into a reproducing kernel Hilbert space (RKHS), which is Euclidean while still honoring the manifold structure.
The kernels usually have an infinite-dimensional embedding, making it seemingly impossible to build a random hyperplane. Generally speaking, two methods have been proposed to deal with the kernel mapping so far. One is to construct the hyperplane as a weighted sum of a subset of the database items [Kulis and Grauman (2009)]. The other is to use a random feature map to approximate the kernel function [Mukuta and Harada (2016)].
In this part, we propose a new method to deal with the kernel mapping, namely data space expansion (DSE). We use a set of randomized congruent transformations to approximate the kernel embedding. The procedure is as follows.
1). Create a set of $m$ random unitary matrices $\{U_i\}_{i=1}^{m}$.
2). Apply each unitary matrix to the input SPD matrix via the congruent transformation $U_i X U_i^{\top}$, yielding a set of new SPD matrices.
3). Transform each new SPD matrix into the log-Euclidean space by LERM and concatenate the results sequentially to obtain a new vector $x$. The dimension of $x$ is $m \cdot d$, where $d$ is the dimension of a single vectorized log-Euclidean matrix.
We can formalize the procedure as the following mapping:
$$\Phi(X) = \left[ \mathrm{vec}\!\left(\log(U_1 X U_1^{\top})\right);\ \ldots;\ \mathrm{vec}\!\left(\log(U_m X U_m^{\top})\right) \right].$$
For a single SPD matrix, we thus obtain a very high-dimensional vector by sequentially concatenating the vectors in log-Euclidean space. This process mimics the kernel embedding: the new expanded space can be seen as an approximation of the kernel space. As the set of unitary matrices is randomly generated, the vector $x$ is also a randomized vector in a high-dimensional space. As stated in Subsection 3.1, the congruent transformation preserves the geodesic distance; therefore, the vectorial combination preserves the Euclidean distance. In the new space, a random Gaussian matrix $G$ is adopted as the projection matrix. The projection and quantization are conducted by
$$b = \mathrm{sign}\left(G\, x\right),$$
where the sign function maps each projected coordinate to one bit.
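Putting the pieces together, the following is a minimal sketch of the DSE hashing pipeline described above. The names, sizes and the use of full vectorization of the log matrix are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random unitary (orthogonal) matrix via QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

def dse_hash(X, Us, G):
    """DSE hash of an SPD matrix X.

    Us: list of m unitary matrices; G: (k x m*n^2) Gaussian projection matrix.
    Each congruent copy U X U^T is mapped to log-Euclidean space, the vectorized
    logs are concatenated, then sign random projection yields k bits.
    """
    logs = []
    for U in Us:
        s, V = np.linalg.eigh(U @ X @ U.T)           # eigendecomposition of congruent copy
        logs.append((V @ np.diag(np.log(s)) @ V.T).ravel())
    x = np.concatenate(logs)                          # expanded data-space vector
    return (G @ x >= 0).astype(np.uint8)              # projection + quantization

rng = np.random.default_rng(2)
n, m, k = 8, 10, 300                                  # SPD size, #transforms, code length
Us = [random_orthogonal(n, rng) for _ in range(m)]
G = rng.standard_normal((k, m * n * n))
A = rng.standard_normal((n, n)); X = A @ A.T + n * np.eye(n)
code = dse_hash(X, Us, G)                             # 300-bit binary signature
```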
The proposed algorithm is applied to near-duplicate image detection. The evaluation dataset contains 1) the INRIA Copydays dataset of 157 images [Douze, Jégou, Sandhawalia et al. (2009)] as the testing dataset, and 2) 25,000 Flickr images plus another 40,000 images (65,000 images in total) as the distractor dataset. Examples of INRIA images are shown in Fig. 1. We create another 46 copies of each testing image; therefore, there are 72,222 (65,000 + 157 × 46) images in total. For the transformations and parameters used to create the near-duplicate images, please refer to Zheng et al. [Zheng, Lei, Qiu et al. (2012)]. For each image, an SCOV (an SPD matrix) [Zheng, Lei, Qiu et al. (2012)] is extracted to represent the image.
Figure 1: Examples of testing images used in the paper
Two well-known and widely used evaluation methods are adopted to evaluate the algorithm: the receiver operating characteristic (ROC) and the mean average precision (mAP). The ROC curve is a graphical plot of the true positive rate versus the false positive rate. The mAP is the mean of the per-query average precision, defined as follows. For each query $q$, the top $N$ images are returned ($N$ is fixed in our experiments). The average precision (AP) is then calculated as
$$\mathrm{AP}(q) = \frac{1}{R_q} \sum_{k=1}^{N} P_q(k)\, \mathrm{rel}_q(k), \qquad \mathrm{mAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q),$$
where $P_q(k)$ is the precision at cut-off $k$, $\mathrm{rel}_q(k)$ equals 1 if the image at rank $k$ is a true copy of $q$ and 0 otherwise, and $R_q$ is the number of true copies of $q$.
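For concreteness, a minimal sketch of the AP/mAP computation under this standard definition (names are illustrative; `relevant` marks which of the returned images are true copies of the query):

```python
import numpy as np

def average_precision(relevant, n_positives):
    """AP for one query: mean of precision@k over the ranks where a copy appears.

    relevant: boolean array over the ranked returned list;
    n_positives: total number of true copies of the query.
    """
    relevant = np.asarray(relevant, dtype=float)
    precisions = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
    return (precisions * relevant).sum() / max(n_positives, 1)

def mean_average_precision(per_query):
    """mAP: mean of per-query APs; per_query is a list of (relevant, n_positives) pairs."""
    return float(np.mean([average_precision(r, n) for r, n in per_query]))
```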
4.2.1 Accuracy analysis
In this part, the proposed method is compared with other kernel-based methods, namely KLSH [Kulis and Grauman (2009)] and the Nystrom method. As we know, KLSH and the Nystrom method both approximate the kernel by selecting a subset of data samples from the training dataset. In contrast, the proposed method directly approximates the kernel by data space expansion (DSE). The brute-force method and LSH [Gionis, Indyk and Motwani (1999)] are used as benchmarks; LSH is conducted in the log-Euclidean space. The length of the hashing code is set to 300 for all methods. In the DSE method, 10 congruent transformations are used.
Fig. 2(a) gives the results of LSH, KLSH, the Nystrom method and the proposed data space expansion (DSE) method. From the results, we can see that brute-force search using LERM undoubtedly obtains the best results compared with the hashing-based methods. Kernel-based methods, including the Nystrom method and KLSH, achieve better results than LSH. Hashing with data space expansion (DSE) achieves the best results among all hashing-based methods, which demonstrates the effectiveness of the proposed scheme.
Besides the ROC curve, the mAP of each method is also calculated and shown in Fig. 2(b). It can be seen from the figure that the brute-force method (LERM) obtains the best performance, while LSH, which hashes in a very low-dimensional log-Euclidean space, obtains the worst. The DSE method achieves the best results among all kernel-related methods.
Figure 2: Fig. 2(a) gives the ROC performance comparison of state-of-the-art methods. Fig. 2(b) gives the mean average precision (mAP) performance comparison of state-of-the-art methods. The length of the hashing code is 300. The benchmark methods are kernelized LSH, LSH, the Nystrom method and brute-force search using LERM
4.2.2 Parameter analysis
It is known that the length of the hashing code affects the precision of nearest neighbor search. In this part, we give a precision comparison of different state-of-the-art methods while varying the bit length from 100 to 300. The results are given in Fig. 3(a).
From the figure, we can see that almost all algorithms achieve better precision when the length of the hashing code increases. The DSE method has the best performance when the length of the hashing code is greater than 150. In this comparison, 10 congruent transformation matrices are adopted.
The number of congruent transformations determines the dimension of the expansion space. This paper also investigates the effect of the number of congruent matrices on precision. Fig. 3(b) gives a precision comparison of DSE with different numbers of congruent transformations. We can see from the figure that more congruent transformations yield better precision. The length of the hashing code together with the number of congruent transformations determines the precision of the near-duplicate image detection. However, the precision does not improve much once the number of congruent transformations reaches a certain value, for example 10 in this experiment.
Figure 3: Fig. 3(a) gives a precision comparison of state-of-the-art methods with different bit lengths. Fig. 3(b) gives a precision comparison of DSE with different numbers of congruent transformations.
4.2.3 Time efficiency
In large-scale near-duplicate image detection, we usually pay a lot of attention to detection efficiency. The SCOV descriptors and hashing codes can be generated off-line, so we focus on the detection time in this part. In the detection stage, each image is finally represented by a 300-bit binary code. As a comparison, the brute-force LERM method is used as a benchmark. Tab. 1 shows the average time over the 157 testing images. From the table, we can see that the proposed method improves the time efficiency by more than 100-fold. As we know, some hashing algorithms improve time efficiency at the cost of poor search accuracy. However, the accuracy degradation is not very large when using 300 bits in our experiments, as can be seen from Figs. 2(a) and 2(b).
Table 1: Time comparison between the hashing-based method and brute-force LERM
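The speed-up comes from replacing expensive LERM evaluations with Hamming distance computations over packed binary codes. A minimal illustrative sketch (database size and codes are hypothetical):

```python
import numpy as np

def hamming_distances(query_code, db_codes):
    """Hamming distance between one packed code and a database of packed codes.

    query_code: (B,) uint8 array (300 bits packed into bytes);
    db_codes: (N, B) uint8 array.
    """
    xor = np.bitwise_xor(db_codes, query_code)       # differing bits
    return np.unpackbits(xor, axis=1).sum(axis=1)    # popcount per database item

rng = np.random.default_rng(3)
bits = rng.integers(0, 2, size=(1000, 300), dtype=np.uint8)
db = np.packbits(bits, axis=1)                       # 300 bits -> 38 bytes per image
d = hamming_distances(db[0], db)                     # distances to all database items
```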
Due to the non-Euclidean nature of the SPD manifold, hashing techniques designed for vector spaces cannot be directly used on the SPD manifold. In this paper, a novel hashing scheme for the SPD manifold is investigated. Inspired by the idea of the kernel trick, the proposed scheme maps the SPD matrix into a high-dimensional Euclidean space and then uses random projection and quantization to obtain the binary representation of the SPD matrix. The congruent transformation, which preserves the geodesic distance, is used in this process. Experiments on large-scale near-duplicate image detection show the effectiveness and efficiency of the proposed method.
Acknowledgement: The authors wish to express their appreciation to the reviewers for their helpful suggestions, which greatly improved the presentation of this paper. The authors also thank Fufang Li for the valuable discussions. This work is supported in part by NSFC (61472092, 61300205) and Guangzhou BEY (670230174).