Xia Chen, Zhan-Li Sun, Member, IEEE, Kin-Man Lam, Senior Member, IEEE, and Zhigang Zeng, Fellow, IEEE
Abstract—In many traditional non-rigid structure from motion (NRSFM) approaches, the estimates of some feature points may deviate significantly from their true values because only the overall estimation error is considered in the models. To address this issue, a local deviation-constrained-based column-space-fitting approach is proposed in this paper to alleviate the estimation deviation. In our work, an effective model is first constructed with two terms: the overall estimation error, which is computed by a linear subspace representation, and a constraint term, which is based on the variance of the reconstruction error for each frame. Furthermore, an augmented Lagrange multipliers (ALM) iterative algorithm is presented to optimize the proposed model. Moreover, a three-step convergence analysis is performed for the optimization process. As both the overall estimation error and the local deviation are utilized, the proposed method can achieve a good estimation performance and a relatively uniform estimation error distribution across feature points. Experimental results on several widely used synthetic and real sequences demonstrate the effectiveness and feasibility of the proposed algorithm.
Nowadays, recovering 3D object shapes from 2D images has become a valuable way to enhance computer vision tasks such as object detection [1]–[3], human-computer interaction [4], and image annotation [5]–[7]. As a fundamental method of 3D reconstruction [8], [9], non-rigid structure from motion (NRSFM) jointly estimates 3D object shapes and relative camera motions from corresponding 2D points in a sequence of images [10]–[12]. Because prior information about the 3D shape deformation is lacking, NRSFM remains a very complex and ill-posed problem.
In order to alleviate this uncertainty, various priors and constraints have gradually been introduced into 3D reconstruction models. Notably, a matrix factorization method was proposed in [13] to represent the unknown 3D shapes as linear combinations of a small number of 3D shape bases. In the matrix factorization method, the decomposed motion factor and shape basis are constrained to form a low-rank (rank-3K) matrix. Subsequently, many works have been proposed based on the low-rank shape model. In [14], a closed-form solution was presented by combining a rotation constraint and a low-rank constraint. In [15], a Gaussian prior was assumed for the shape coefficients, and the optimization was solved using the expectation-maximization (EM) algorithm. Considering the approximate symmetry of facial feature points, an effective depth estimation model was proposed in [16] based on constrained independent component analysis. A multilinear factorization based algorithm was proposed in [17] to deal with the NRSFM problem under orthographic cameras by combining the low-rank prior and a latent-smoothness prior.
In [18], a non-rigid structure from motion factorization model was proposed by solving a very small semi-definite programming problem and a nuclear-norm minimization problem. The same overall cost function and nuclear norm used in [18] were also adopted in [11], [19] to deal with the dense NRSFM problem under orthographic cameras based on the Grassmann manifold. A reconstruction-based metric learning method was presented in [20] to learn a discriminative distance metric for unconstrained face verification. A sequential non-rigid structure from motion model was proposed in [21] by utilizing the physical priors of an object's surface. When a non-rigid object has degenerate deformations, the extra degree of freedom yields spurious shape deformations due to non-negligible noise in real applications. To deal with this problem, a low-rank shape deformation model was proposed in [12] to represent 3D structures of degenerate deformations by considering both the rank-deficient nature and the low-rank property. In [22], a dense NRSFM model was proposed that incorporates a physical, discontinuity-preserving deformation prior. Instead of a single object, multiple objects are considered in [23], [24], where they are assumed to be clustered a priori.
In order to decrease the number of unknown parameters, the 3D point trajectories were compactly modeled with a discrete cosine transform (DCT) basis under a smoothing constraint [25], [26]. Nevertheless, due to the limitation of rank 3K, high-frequency deformation cannot be well modeled by the trajectory representation. In [27], a smoothly deforming 3D shape was modeled as a single point moving along a smooth time-trajectory within a linear shape space. This representation provides a better reconstruction of high-frequency deformation without relaxing the rank-3K constraint. A column-space-fitting (CSF) method was developed to obtain the optimized solution [28]. Simulations on multiple sequences have demonstrated that the CSF algorithm can achieve a very good estimation performance for deformable objects.
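The effect of a truncated DCT trajectory basis can be illustrated with a minimal sketch. It assumes an orthonormal DCT-II basis and toy 1D trajectories; the function names `dct_basis` and `project` and the sizes `T = 16`, `d = 4` are hypothetical and are not taken from [25]–[28]. A slowly varying trajectory is reproduced closely by the first few basis vectors, whereas a rapidly alternating one is not.

```python
import math

def dct_basis(T, d):
    """Return a T x d matrix whose columns are the first d orthonormal DCT-II vectors."""
    basis = []
    for t in range(T):
        row = []
        for k in range(d):
            scale = math.sqrt(1.0 / T) if k == 0 else math.sqrt(2.0 / T)
            row.append(scale * math.cos(math.pi * (2 * t + 1) * k / (2 * T)))
        basis.append(row)
    return basis

def project(traj, basis):
    """Project a length-T trajectory onto the span of the basis columns."""
    T, d = len(basis), len(basis[0])
    coeffs = [sum(basis[t][k] * traj[t] for t in range(T)) for k in range(d)]
    return [sum(basis[t][k] * coeffs[k] for k in range(d)) for t in range(T)]

T = 16
B = dct_basis(T, 4)                         # keep only 4 low-frequency vectors
smooth = [0.1 * t for t in range(T)]        # slow drift (toy data)
jitter = [(-1.0) ** t for t in range(T)]    # fastest possible alternation (toy data)

# Maximum absolute reconstruction error of each trajectory.
err_smooth = max(abs(a - b) for a, b in zip(smooth, project(smooth, B)))
err_jitter = max(abs(a - b) for a, b in zip(jitter, project(jitter, B)))
print(err_smooth, err_jitter)
```

The second error is much larger than the first, which is the behavior the paragraph above attributes to a low-frequency trajectory basis.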
In most traditional orthographic camera based NRSFM models [26]–[30], the 3D shape is generally estimated by minimizing the overall error of the feature points. Because only the overall estimation error is considered, the estimates of some feature points may deviate significantly from their true values. As a result, the reconstructed 3D shape may be deformed in a local area. Taking one frame of the sequence walking as an example, Fig. 1 compares the original 3D shape with the 3D shape reconstructed by CSF when the estimates of some feature points deviate significantly from their true values. Compared to the original 3D shape, the part marked with the rectangle shows an obvious deformation in the reconstructed 3D shape. Therefore, it is necessary to make the estimation errors more uniform across feature points.
Fig. 1. A comparison of the original 3D shape and the 3D shape reconstructed by CSF when the estimates of some feature points deviate significantly from their true values.
In order to solve this problem, a local deviation-constrained-based column-space-fitting approach is presented in this paper to decrease the estimation deviation. In the proposed method, an effective model is constructed by considering both the overall estimation error and the variance of the reconstruction errors for each frame. Moreover, an augmented Lagrange multipliers (ALM) iterative algorithm is developed to optimize the local deviation-constrained-based estimation model. In addition, a convergence analysis is carried out in detail for the model optimization.
The remainder of the paper is organized as follows. A detailed description of the proposed method is presented in Section II. Experimental results are given in Section III. Finally, conclusions are drawn in Section IV.
reconstruction errors for N feature points can be computed as
and
respectively. Furthermore, for the $t$th frame, the standard deviations $\sigma_{tx}$ and $\sigma_{ty}$ of the re-projection errors can be computed as
and
For different feature points, we can see from (10) and (11) that the estimation results are closer to the true values as a whole when $\sigma_{tx}$ and $\sigma_{ty}$ are smaller. Thus, $\sigma_{tx}$ and $\sigma_{ty}$ can be used as indices to constrain the extent of local deviation of the estimation results.
In terms of (2), the local deviation-constraint-based column-space-fitting (LDS-CSF) model can be formulated as
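As an illustration of how such per-frame indices behave, the following minimal sketch computes a standard deviation of the x and y re-projection errors for each frame. It assumes the population form of the standard deviation (division by N) and toy error values; the function name `frame_error_std` and the nested-list layout `errors[t][j]` are hypothetical.

```python
from statistics import pstdev

def frame_error_std(errors_x, errors_y):
    """errors_x[t][j], errors_y[t][j]: re-projection error of point j in frame t.

    Returns one standard deviation per frame for each coordinate
    (population form, an assumption made for this sketch).
    """
    sigma_x = [pstdev(frame) for frame in errors_x]
    sigma_y = [pstdev(frame) for frame in errors_y]
    return sigma_x, sigma_y

# Toy data: 2 frames, 4 feature points.
ex = [[0.1, 0.1, 0.1, 0.1],    # frame 0: errors spread evenly over the points
      [0.0, 0.4, 0.0, 0.4]]    # frame 1: errors concentrated on two points
ey = [[0.2, -0.2, 0.2, -0.2],
      [0.0, 0.0, 0.0, 0.0]]

sigma_x, sigma_y = frame_error_std(ex, ey)
print(sigma_x, sigma_y)
```

In frame 0 the x errors are identical for all points, so the index is zero; in frame 1 the errors are concentrated on two points and the index is larger, which is the kind of local deviation the constraint is meant to penalize.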
where $\hat{W} = MS$. In the proposed model (12), other forms, e.g., an inequality constraint, can also be adopted as the local deviation constraint. The goal of the constraint in (12) is to
For convenience, we first define some simplified notation before solving the model (12). Let $w_j \in \mathbb{R}^{2T \times 1}$ and $s_j \in \mathbb{R}^{3K \times 1}$ denote the $j$th column of the 2D observation matrix $W$ and the $j$th point in the 3D shape basis $S$, respectively. The 2D re-projection error $r_j$ of the $j$th column of $W - \hat{W}$ can be defined as
where $M^{\dagger}$ denotes the pseudo-inverse of $M$ [28]. The symbols $w_j$ and $r_j$ are the 2D trajectory and the 2D re-projection error of the $j$th point, respectively. Furthermore, referring to [28], denote
where $(\cdot)^{T}$ denotes the transpose of a matrix, and
Then, the LDS-CSF model (12) can be rewritten as
As done in [28], the rotation matrix D is computed via a Euclidean upgrade method [25].
It can be seen from (2) that $\Omega_d$ is a predefined DCT basis matrix. Once the factor $X$ is given, the factor $M$ can be determined. Given $M$, the $j$th point $s_j$ of the shape basis $S$ can be estimated by
This indicates that X is the only parameter to be optimized.
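The role of the pseudo-inverse can be sketched numerically. The example below assumes the standard least-squares forms used in column-space fitting, namely $s_j = M^{\dagger} w_j$ and $r_j = w_j - M M^{\dagger} w_j$, and uses a random stand-in for $M$ rather than a motion factor built from $D$, $\Omega_d$, and $X$; the sizes `T`, `K`, and `N` are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, N = 6, 2, 5                            # toy sizes (assumptions)

M = rng.standard_normal((2 * T, 3 * K))      # stand-in for the motion factor
S_true = rng.standard_normal((3 * K, N))     # stand-in for the shape basis
W = M @ S_true                               # noise-free 2D observations

M_pinv = np.linalg.pinv(M)                   # Moore-Penrose pseudo-inverse of M
S_est = M_pinv @ W                           # column j is s_j = pinv(M) @ w_j
R = W - M @ S_est                            # column j is r_j = w_j - M pinv(M) w_j

print(np.abs(S_est - S_true).max(), np.abs(R).max())
```

With noise-free data and a full-column-rank `M`, the shape basis is recovered and the residual vanishes up to rounding. Since `S_est` is a function of `M` alone, only the parameters that determine `M` remain to be optimized.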
With the ALM iterative algorithm [31]–[33], the LDS-CSF model (16) can be reformulated as
where ρ>0 and λ are the weights of the penalty term and Lagrange multiplier, respectively.
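For readers unfamiliar with ALM, the interplay of the penalty weight ρ and the multiplier λ can be seen on a toy equality-constrained problem. The sketch below is not the LDS-CSF objective: it assumes a hypothetical quadratic cost f(x) = (x0 - 2)^2 + (x1 - 1)^2 with the constraint g(x) = x0 + x1 - 1 = 0, and minimizes L = f + λ g + (ρ/2) g^2 by plain gradient descent in the inner loop.

```python
def f_grad(x):
    """Gradient of the toy cost f(x) = (x0 - 2)^2 + (x1 - 1)^2."""
    return [2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)]

def g(x):
    """Toy equality constraint g(x) = x0 + x1 - 1 = 0."""
    return x[0] + x[1] - 1.0

def alm(rho=10.0, outer=20, inner=300):
    x, lam = [0.0, 0.0], 0.0
    step = 1.0 / (2.0 + 2.0 * rho)      # 1 / largest Hessian eigenvalue of L
    for _ in range(outer):
        # Inner loop: minimize L(x) = f + lam * g + (rho / 2) * g^2 over x.
        for _ in range(inner):
            gf = f_grad(x)
            mult = lam + rho * g(x)     # factor multiplying grad g = (1, 1)
            x = [x[0] - step * (gf[0] + mult),
                 x[1] - step * (gf[1] + mult)]
        # Outer loop: multiplier update.
        lam += rho * g(x)
    return x, lam

x_opt, lam_opt = alm()
print(x_opt, lam_opt)
```

The iterates approach the constrained minimizer (1, 0), the point on the line x0 + x1 = 1 closest to (2, 1), and the constraint violation shrinks as the multiplier is updated.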
According to the Gauss-Newton method, the first-order partial derivative of $L$ with respect to $X$ can be given by
Furthermore, the second-order partial derivative of $L$ with respect to $X$ can be computed as
According to (14), we can obtain the first-order and second-order partial derivatives of $f_1$, i.e.,
In terms of (15), we can obtain the first-order and second-order partial derivatives of $f_2$, i.e.,
Note that the second-order term $\partial^2 r_j$ is neglected in (22) and (24). Define
Equation (13) can be rewritten as
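Separately from the derivation, the Gauss-Newton approximation used in this section, in which the second-order derivative of the residual is dropped so that the Hessian is approximated by products of first-order derivatives, can be illustrated on a toy one-parameter fit. The sketch assumes a hypothetical residual r_i(b) = exp(b t_i) - y_i, which is unrelated to the re-projection residual $r_j$ of the model.

```python
import math

def gauss_newton_exp_fit(t, y, b0=0.0, iters=20):
    """Fit y ~ exp(b * t) by Gauss-Newton on the residuals r_i = exp(b t_i) - y_i."""
    b = b0
    for _ in range(iters):
        r = [math.exp(b * ti) - yi for ti, yi in zip(t, y)]
        J = [ti * math.exp(b * ti) for ti in t]        # dr_i / db
        grad = sum(Ji * ri for Ji, ri in zip(J, r))    # gradient of 0.5 * sum r_i^2
        hess = sum(Ji * Ji for Ji in J)                # J^T J only; the term
                                                       # sum r_i * d2r_i/db2 is neglected
        b -= grad / hess
    return b

t = [0.0, 1.0, 2.0, 3.0]
y = [math.exp(0.5 * ti) for ti in t]                   # noise-free toy data, true b = 0.5
b_hat = gauss_newton_exp_fit(t, y)
print(b_hat)
```

Because the neglected term is weighted by the residuals, the approximation becomes exact as the residuals approach zero, which is why the iteration still converges quickly near a good fit.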
The performance of the proposed method is evaluated on twelve widely used motion sequences. Among these data, there are eight synthetic image sequences (jaws, walking, face2, face1, stretch, pickup, yoga, and drink) and four real-image sequences (dance, cubes, matrix, and dinosaur). For these sequences, the corresponding number of frames (T) and the number of points tracked (N) are listed in Table I. Note that these sequences are publicly available from [18], [27], [28], [35]. Figs. 2 and 3 show one frame of the eight synthetic image sequences and the four real-image sequences, respectively. All the simulations were conducted in the MATLAB environment, on a personal computer with an Intel i5-2320 CPU and 4 GB RAM.
TABLE I The Numbers of Frames (T) and the Numbers of Point Tracks (N) For Twelve Motion Capture Sequences
Fig. 2. One frame of the eight synthetic image sequences.
Fig. 3. One frame of the four real-image sequences.
where
In order to evaluate the effectiveness of the proposed method (LDS-CSF), we compare it with several existing NRSFM algorithms, including the well-known block matrix method (denoted as BMM) [18], the consensus of non-rigid reconstructions (denoted as CNR) [36], the kernel shape trajectory approach (denoted as KSTA) [37], the rotation invariant kernel (denoted as RIK) [35], the column-space-fitting method (denoted as CSF) [27], and the CSF2 method [28].
Except for CNR, the low-rank parameter K has a significant influence on the final estimation performance. For a fair comparison, the parameter K is successively set to 1, 2, ..., 13 for the six methods. The value corresponding to the smallest estimation error is selected as the approximate optimum value of K, as shown in Table II.
Table III shows the mean and standard deviation of the 3D reconstruction errors of the seven methods for the twelve sequences. In Table III, the best result and the second-best result for each sequence are highlighted in red and blue, respectively. We can see from Table III that the reconstruction errors of LDS-CSF are the smallest or the second-smallest for most sequences. Thus, as a whole, LDS-CSF performs better than the other methods. Moreover, the reconstruction errors of LDS-CSF are mostly lower than those of CSF and CSF2. Thus, the local deviation constraint can effectively decrease the reconstruction errors of the column-space-fitting approach.
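The selection of K described above amounts to a simple sweep. The sketch below assumes a hypothetical callable `error_for_k` that runs one method with a given K and returns its reconstruction error; the toy error curve stands in for real results and is not data from the experiments.

```python
def select_rank(error_for_k, candidates=range(1, 14)):
    """Evaluate every candidate K (1, 2, ..., 13 by default) and keep the best one."""
    errors = {k: error_for_k(k) for k in candidates}
    best_k = min(errors, key=errors.get)    # K with the smallest error
    return best_k, errors[best_k]

def toy_error(k):
    """Hypothetical error curve with its minimum at K = 5 (illustration only)."""
    return 0.01 * (k - 5) ** 2 + 0.1

best_k, best_err = select_rank(toy_error)
print(best_k, best_err)
```

Running the same sweep independently for each method and each sequence gives every method its own best K, which is what makes the comparison fair.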
TABLE II The Approximate Optimal Values of K of Twelve Sequences for Six Methods
In addition, from Table III, we can see that the standard deviations of LDS-CSF are lower than those of the other approaches for most sequences. Thus, the local deviation constraint can make the estimation errors more uniform across feature points. Taking one frame from yoga as an example, Fig. 4 compares the z-coordinate reconstruction error $\varepsilon_z$ of each feature point between LDS-CSF and the other methods, i.e.,
TABLE III The Mean Values and the Standard Deviations (μ±σ) of the 3D Reconstruction Error ε of Twelve Sequences for Seven Methods
Fig. 4. Comparison of the z-coordinate reconstruction error $\varepsilon_z$ of the feature points for one frame of the sequence yoga between LDS-CSF and the other methods.
reconstructed z-coordinate of the jth 3D point observed on the tth image, respectively. For the methods CNR, BMM, RIK, CSF, KSTA, and CSF2, we can see that the reconstruction errors of one group of feature points are small, whereas those of another group are large. This indicates that the estimates of some feature points deviate significantly from their true values. In contrast, it can be seen from Fig. 4 that the reconstruction errors of the LDS-CSF model are more evenly distributed across feature points than those of the other methods. This means that the proposed method can effectively decrease the local deviations.
Fig. 5. Comparison of the reconstructed results for one frame of the sequence cubes between LDS-CSF and the other methods. The symbols “?” and “+” represent the observed ground truth and the reconstructed points, respectively.
TABLE IV The Mean Values and the Standard Deviations (μ±σ) of the 3D Reconstruction Error ε of Twelve Sequences With Noise for Seven Methods
As an example, Fig. 5 compares the reconstructed results for one frame of the sequence cubes between LDS-CSF and the other methods. It can be seen from Fig. 5 that the reconstruction results of LDS-CSF are closer to the true feature points than those of the other methods.
In order to investigate the robustness to noise, we perform experiments by adding Gaussian noise to the original sequences. The parameter α is varied from 0.1 to 0.2 to control the noise level. Table IV tabulates the mean and standard deviation (μ±σ) of the 3D reconstruction errors when α is set to 0.1. The reconstruction errors of LDS-CSF are clearly lower than those of the other methods for most sequences. Moreover, taking the sequence yoga as an example, Fig. 6 shows the 3D reconstruction errors ε of the seven methods for different values of α. The reconstruction errors of LDS-CSF are mostly lower than those of the other methods across these values of α.
Fig. 6. The 3D reconstruction errors ε of the sequence yoga for the seven methods when α is set to different values.
A local deviation-constrained-based column-space-fitting approach is presented in this paper to alleviate the estimation deviation. The proposed method is shown to achieve a better and more even estimation performance as a whole compared to CSF. Moreover, the local deviation constraint is verified to be effective in enhancing the estimation stability across feature points. The experimental results on widely used synthetic and real image sequences demonstrate the effectiveness and feasibility of the proposed algorithm.
IEEE/CAA Journal of Automatica Sinica, 2020, Issue 5