

        BaMBNet: A Blur-Aware Multi-Branch Network for Dual-Pixel Defocus Deblurring

IEEE/CAA Journal of Automatica Sinica, 2022, Issue 5

Pengwei Liang, Junjun Jiang, Xianming Liu, and Jiayi Ma

Abstract—Reducing the defocus blur that arises from the finite aperture size and short exposure time is an essential problem in computational photography. It is very challenging because the blur kernel is spatially varying and difficult to estimate by traditional methods. Owing to their great breakthroughs in low-level tasks, convolutional neural networks (CNNs) have been introduced to the defocus deblurring problem and have achieved significant progress. However, previous methods apply the same learned kernel to different regions of the defocus blurred images, so it is difficult for them to handle nonuniform blurred images. To this end, this study designs a novel blur-aware multi-branch network (BaMBNet), in which different regions are treated differentially. In particular, we estimate the blur amounts of different regions by the internal geometric constraint of the dual-pixel (DP) data, which measures the defocus disparity between the left and right views. Based on the assumption that different image regions with different blur amounts have different deblurring difficulties, we leverage different networks with different capacities to treat different image regions. Moreover, we introduce a meta-learning defocus mask generation algorithm to assign each pixel to a proper branch. In this way, we can expect to maintain the information of the clear regions well while recovering the missing details of the blurred regions. Both quantitative and qualitative experiments demonstrate that our BaMBNet outperforms the state-of-the-art (SOTA) methods. On the dual-pixel defocus deblurring (DPD)-blur dataset, the proposed BaMBNet achieves a 1.20 dB gain over the previous SOTA method in terms of peak signal-to-noise ratio (PSNR) and reduces learnable parameters by 85%. The code and dataset details are available at https://github.com/junjun-jiang/BaMBNet.

        I. INTRODUCTION

DEFOCUS blurring is inevitable when scene regions (with a wider depth range) are out of focus due to hardware limitations, i.e., a camera with a finite-size aperture can only focus on a shallow depth of field (DoF) at a time, and the remaining scene regions will contain blur [1]. Removing this blur and recovering defocused image details are challenging due to the spatially-varying point spread functions (PSFs) [2]–[4]. Recently, some studies have addressed this problem using dual-pixel (DP) sensors found on most modern cameras [5]. Although DP sensors were originally designed to facilitate autofocus [6]–[8], they have been found to be very useful in a wide range of applications, such as depth estimation [9], defocus deblurring [10], reflection removal [11], and synthetic DoF [12]. DP sensors provide a pair of photodiodes for each pixel location to capture two sub-aperture views of the same scene [13], [14]. Compared with the single photodiode per pixel in a traditional sensor, the two sub-aperture photodiodes provide more information for spatially-varying blur detection and defocus deblurring [15].

As shown in Fig. 1(a), the blurred image can be approximately divided into two categories, in-focus and out-of-focus, which correspond to the sharp regions and blurred regions, respectively. We expect the deblurred results to keep the details of the in-focus regions while sharpening the blurred regions. Thanks to the immense success of deep learning, in recent years some deep neural networks, such as the dual-pixel defocus deblurring network (DPDNet) [1] and the DP-based depth and deblur network (DDDNet) [10], have achieved pleasing deblurring results. However, they tackle sharp regions and blurred regions with the same deep convolution network, and it is a great challenge for a single network to balance keeping the details in the in-focus regions with deblurring the out-of-focus regions. For instance, the highlighted patches in Figs. 1(c) and 1(d) indicate that methods based on a single network may fail to handle scenes with large depth variation.

According to [16], the blur amounts that measure the blur levels of an image vary with respect to different regions and can be determined via the circle of confusion (COC) size c(d):

$$c(d) = \frac{|d - d_f|}{d} \cdot \frac{f_0^2}{N(d_f - f_0)} \quad (1)$$

where $f_0$ is the focal length, $N$ is the focus (aperture) setting, $d$ is the subject-to-camera distance, and $d_f$ is the focus distance.
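To make the dependence of (1) on the focus distance concrete, the following Python sketch evaluates the thin-lens COC model; the function name and the sample lens parameters are our own illustrative choices.

```python
def coc_size(d, d_f, f0, N):
    """Circle-of-confusion size from the thin-lens model in (1).

    d   : subject-to-camera distance
    d_f : focus distance
    f0  : focal length
    N   : aperture (focus) setting, i.e., the f-number
    All distances share the same unit; the result is in that unit.
    """
    return abs(d - d_f) / d * f0 ** 2 / (N * (d_f - f0))

# A 50 mm f/1.8 lens focused at 2 m: an object at 5 m gets a much
# larger CoC (stronger blur) than an object near the focus plane.
print(coc_size(5.0, 2.0, 0.05, 1.8))  # ~4.3e-4 m
print(coc_size(2.2, 2.0, 0.05, 1.8))  # ~6.5e-5 m
```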

Fig. 1. Schematic illustration of defocus deblurring. (a) and (b) show a pair of blurred input images. For convenience, we will only show the left view as the input image in the following figures. (c)–(f) show the highlighted deblurred results of DDDNet [10], DPDNet [1], BaMBNet (Ours), and ground truth (GT), respectively.

To combat these challenges, we propose a blur-aware multi-branch network to address the defocus deblurring problem. In practice, we first estimate the COC map of the input image pair and then transform the COC map into defocus masks by a meta-learning mechanism to assign different image pixels to proper branch networks. Based on the assumption that recovering the blurred regions requires considerable learning parameters while maintaining clear regions requires only a few, we apply different branch networks to different regions under the guidance of the defocus masks. In this way, the lightest branch with the fewest learning parameters attends only to the in-focus regions and maintains the clear regions of the input images. In contrast, the heaviest branch with the most parameters is used to reconstruct the missing details and recover sharp parts from the regions with a large amount of blur. The main idea of this strategy is to decompose the source problem into multiple easy sub-problems, so our model is easier to optimize with the assistance of the defocus masks. We carry out comprehensive comparison experiments to demonstrate the effectiveness of BaMBNet, covering both traditional handcrafted methods and convolutional neural network (CNN)-based defocus deblurring approaches.

        The contributions of this work can be summarized as follows:

1) We propose a blur-aware multi-branch network (BaMBNet) to address the problem of non-uniform blur distribution in realistic defocus images. Different regions with different blur amounts are treated by different branches with different capacities; as a result, our method can well maintain the information of the clear regions while recovering the missing details of the blurred regions. Extensive experiments demonstrate that our method outperforms existing state-of-the-art (SOTA) approaches, such as DPDNet [1] and DDDNet [10].

2) We effectively improve the COC map estimation method used in [15] by straightforwardly training a convolutional neural network in an unsupervised way. To effectively guide the optimization of the multi-branch network, we introduce a meta-learning strategy to generate the defocus masks from the estimated COC map. Comprehensive ablation studies also verify the effectiveness of the meta-learning strategy and the multiple branches in the proposed method.

The remainder of this paper is organized as follows. Section II introduces the dual pixel and reviews the existing defocus deblurring literature. Section III presents our image defocus deblurring network and the proposed COC estimation and assignment strategies. Section IV provides the comparison experiments with SOTAs and demonstrates the technical contributions of the proposed method in the ablation studies. Section V concludes this paper.

        II. RELATED WORK

In recent years, the dual pixel has come into fashion in low-level vision tasks such as depth estimation and defocus deblurring [1], [9], [10]. In this section, we briefly introduce the dual pixel, summarize various defocus deblurring methods, and give an overview of multiple branch networks.

        A. Dual-Pixel Camera Model

A dual-pixel (DP) sensor allocates a microlens and a pair of photodiodes to each pixel, as shown in Fig. 2. Each photodiode can record the light rays independently. In other words, a DP camera captures two views of the same scene, called the left view and the right view [5], [11], [13]. When a region is far away from the focus plane, the left/right views show a detectable disparity, called the defocus disparity. By measuring the level of defocus disparity, the autofocus routine can adjust the lens movement to bring the out-of-focus regions into focus. Recently, some studies have shown that the defocus disparity can be used for depth estimation, reflection removal, defocus deblurring, etc. Next, we will give details of the defocus disparity with two representative examples.

Fig. 2. Optical geometry of the dual-pixel camera based on the thin-lens model. The DP unit is the basic sensor unit at the pixel level and consists of two photodiodes; according to their spatial arrangement, we call them the right/left units [9].

Fig. 2 illustrates an interesting phenomenon. As can be seen, there is an object recorded by the DP camera located in the DoF region, i.e., the character “C”. In this case, the light rays striking from different angles are projected onto the surface of the microlens and arrive averaged at each photodiode. As a result, the generated left and right views are very close, i.e., the character “C” is clear. For a traditional non-DP camera, “C” located in the DoF region also results in a clear image. In the other case, if the object is placed far away from the DoF region, e.g., the character “N”, the light rays originating from the object will converge at a point away from the plane of the microlens, resulting in a blur several pixels wide on the sensor, e.g., the blurred results of the character “N” in Fig. 2. Since the two photodiodes record different striking angles, the final left and right blurred views show the defocus disparity. Now let us consider what occurs if we replace the DP camera with a traditional non-DP camera while keeping the same settings. Since more light rays strike each photodiode in the non-DP camera, the blurred regions exhibit larger blur amounts.
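A toy 1-D simulation makes the disparity mechanism concrete. The sketch below rests on our own simplifying assumption (a box-shaped full-aperture kernel whose two mirrored halves feed the two photodiodes, rather than the exact optics of Fig. 2): convolving a step edge with the two half-kernels shifts the edge in opposite directions, which is exactly the defocus disparity.

```python
import numpy as np

def dp_views_1d(signal, coc):
    """Toy dual-pixel simulation: an out-of-focus signal is blurred by
    a box kernel of width `coc`; the left/right photodiodes each see one
    mirrored half of that kernel, so their outputs shift apart."""
    if coc <= 1:                      # in focus: both views coincide
        return signal.copy(), signal.copy()
    full = np.ones(coc) / coc
    half = coc // 2
    k_left, k_right = full.copy(), full.copy()
    k_left[half:] = 0.0               # left view: left half of the aperture
    k_right[:coc - half] = 0.0        # right view: the mirrored half
    k_left /= k_left.sum()
    k_right /= k_right.sum()
    return (np.convolve(signal, k_left, mode="same"),
            np.convolve(signal, k_right, mode="same"))

edge = np.zeros(64)
edge[32:] = 1.0                       # a step edge in the scene
left, right = dp_views_1d(edge, coc=9)
# The steepest-rise locations differ: that offset is the defocus disparity.
print(np.argmax(np.gradient(left)), np.argmax(np.gradient(right)))
```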

        B. Defocus Deblurring

According to the deblurring procedure in the testing phase, defocus deblurring techniques can be summarized into two categories: 1) two-stage cascade approaches consisting of two steps, where the first step is defocus estimation and the second step is non-blind deblurring by deconvolution [17]–[22]; 2) one-stage methods [1], [10].

In the two-stage methods, a common strategy is to first estimate the defocus map and then use deconvolution to deblur the out-of-focus regions indicated by the estimated defocus map. The estimation of the defocus map is the more important of the two stages. Representative works include Karaali and Jung [18], who used identifiable handcrafted features such as image gradients to calculate the difference between the original image edges and the reblurred image edges. Other similar methods include the edge representation used by Shi et al. [23] and the local binary pattern used to measure focus sharpness by Yi and Eramian [24]. Recently, some studies have used learning-based networks to estimate the defocus map. For example, Park et al. [20] combined deep features and hand-crafted features together to estimate the blur amounts on edges. Following the work of Park et al. [20], Zhao et al. [25], [26] proposed a fully convolutional network that is robust to scale transformation. In addition, Lee et al. [19] introduced a large-scale dataset for CNN-based training and estimated dense defocus maps via domain adaptation. Nevertheless, the common disadvantage of these methods is that the information of the estimated defocus map is not fully utilized, because the estimated defocus map has to be converted into a binary mask before the second stage. To avoid this disadvantage, our BaMBNet directly produces the sharp image from the blurred DP image pairs in the testing phase, which classifies it as a one-stage method.

Among the one-stage learning-based methods, Abuolaim and Brown [1] first introduced DPDNet to address defocus deblurring on DP images, and they simultaneously released a supervised in-the-wild defocus deblurring dataset. DPDNet achieves better performance than the two-stage deep learning-based methods. However, DPDNet neither explicitly extracts the latent blur amounts of the DP pairs nor treats regions with different blur amounts discriminatively. Since DPDNet uses kernels with the same number of learned parameters to deblur regions with various blur amount distributions, it is hard for it to achieve a good balance between preventing high-frequency artifacts in the clean regions and deblurring the seriously blurred regions. After that, Pan et al. [10] proposed to jointly perform defocus deblurring and depth estimation on DP images, where the defocus deblurring is guided by the depth estimation maps. However, the depth provided by DDDNet is simulated by the RGB-depth (RGB-D) model, which does not sufficiently use the prior information of the blurred DP images, and there is also a domain gap between the training RGB-D dataset and the dual-pixel defocus deblurring (DPD)-blur dataset. Moreover, these methods are all based on a single network to handle different regions in the DP image, and they seem to overlook the prior knowledge that deblurring is spatially varying for DP data. To address this issue, our method adopts multiple branch networks with different capacities to handle regions with different blur amounts, respectively. In this way, the information in the clear regions can be well maintained (with a lighter branch network), and the missing details in the blurred regions are more readily recovered (with a heavier branch network).

        C. Multiple Branch Network

Recently, multiple branch networks have been explored in low-level vision tasks, such as super-resolution [27]–[29], image denoising [30], [31], and single image deblurring [32]. The basic idea of these methods is the divide-and-conquer scheme, in which every pixel is treated individually. By carefully designing dynamic branch selection or feature fusion strategies, multiple branch networks have shown great success in solving the spatially-varying image reconstruction problem.

Taking super-resolution as an example, Zhang et al. [27] proposed the pixel-aware deep function-mixture network for spectral super-resolution. They used three branch networks with different numbers of filters to extract three groups of outputs from the same input. Then, the multiple outputs are fused by a weight map learned from another subnet to generate the final result. This method tries to adaptively determine the receptive field size for the input. However, we do not know whether the learned weight map provides effective guidance, because the weight map lacks prior supervision information. To address this issue, Xie et al. [28] used frequency-domain information obtained by the discrete cosine transform to divide the input into multiple parts in the single image super-resolution task. After the multiple parts are fed into corresponding branches with different computational burdens, the generated separated features are recombined into complete and spatially continuous features as input to the next block.

Fig. 3. The workflow of our proposed BaMBNet. We stack the left view I_l and right view I_r and feed them into the head module to extract the basic features. The COC map is used to generate the defocus masks, which can guide the multi-branch network to extract diversified residual features from the regions with different blur amounts.

Beyond super-resolution tasks, the multiple branch network has also been developed for image denoising and single image deblurring. Yu et al. [30] proposed Path-Restore, which can dynamically select the proper route (branch) for each image patch. Path-Restore is driven by a pathfinder trained with a tailored reinforcement learning algorithm. In addition, Xu et al. [31] used a specific network as a pixel-wise multi-class classifier according to gradient statistics; the classification results determine the assignment of convolutions with different learned weights to extract features. The multiple branch network has also been used for single image deblurring: Shen et al. [32] used a triple-branch encoder-decoder architecture to learn the motion blur in foreground humans, background details, and global domain information, respectively. To distinguish the foreground from the background, they provided a labeled mask to guide the network training.

Although current multiple branch networks have shown great potential in various tasks, there are still some limitations in their application. On the one hand, some multi-branch networks directly determine how to assign input features to different branches without considering prior knowledge about the data. On the other hand, the convolution layers usually take spatially corrupted features as input due to the split operator applied to the input. To circumvent this, in our method the COC map is employed to explicitly exploit the characteristics of the DP data for optimizing the multiple branch networks, while the split operator is moved from the input to the output to avoid losing the semantic information of the input.

        III. THE PROPOSED BAMBNET METHOD

From the definition of the COC size in (1), given the focal length $f_0$, focus setting $N$, and subject-to-camera distance $d$, the blur amounts only rely on the focus distance $d_f$. Only when the focus distance $d_f$ is proper, i.e., the COC size is within the allowable range, are the projected regions in the final image clear. If the focus distance $d_f$ is either too far or too near, defocus blur results. Furthermore, when the scene exceeds this limited range and gradually moves away from or closer to the lens, the COC size becomes larger, and the projected region shows increasingly serious defocus blur. In other words, different regions of the image with different COC sizes have different blur amounts, which are mainly determined by the subject-to-camera distance.

Following the above observations, we propose a blur-aware multi-branch network (BaMBNet) that consists of multiple branches with different parameters. The image regions are assigned to different branches under the guidance of the COC map. Specifically, we introduce an assignment strategy for in-the-wild images with non-uniform defocus blur. Here, the assignment strategy transforms the COC map into defocus masks, where the continuous blur amounts are converted into several discrete states. Generally, we expect the lighter branch with few parameters to maintain the source in-focus regions and the heavier branch with more parameters to recover the image regions with larger COC sizes.

        In the following, we will first present the details of the BaMBNet. Then, we will introduce how to predict a COC map and how to automatically transform the COC map to the defocus mask.

        A. Blur-Aware Multi-Branch Network

The workflow of the proposed multi-branch network is shown in Fig. 3. Our method takes the 6-channel DP data $I_{con}$ as input, which is generated by stacking the right and left views (the two RGB images have a total of 6 channels). The proposed BaMBNet consists of a head module $\phi_{head}$, multiple branch residual bottleneck modules $\Phi = \{\phi_m; m = 1,...,M\}$, and a tail module $\phi_{tail}$. In this framework, the design intention of the head module is simply to transform the image from image space to the latent representation space, and similarly for the tail module. Moreover, the multiple bottleneck modules are used to extract residual features to adaptively reconstruct the details.

In this paper, we denote the basic features extracted by the head module as $F_{enc}$. Afterward, the bottleneck modules $\Phi$ take $F_{enc}$ as input and output a group of residual features guided by the defocus masks $\{D_m; m = 1,...,M\}$, as shown in Fig. 3. The defocus masks indicate the blur level of the image and are computed by combining the COC maps with the thresholds. We give details on estimating the COC maps and solving for the optimal thresholds in Sections III-B and III-C, respectively. Finally, the group of residual features $\{F_m^{res}; m = 1,...,M\}$ is summed up to obtain the global residual features, which are added to the basic features $F_{enc}$ as the input to the tail module. Afterwards, we obtain the target deblurred image $\hat{I}$ from the tail module. In summary, we can formulate the process as follows:

$$F_{enc} = \phi_{head}(I_{con}), \quad F_m^{res} = D_m \odot \phi_m(F_{enc}), \quad \hat{I} = \phi_{tail}\Big(F_{enc} + \sum_{m=1}^{M} F_m^{res}\Big). \quad (2)$$

Note that the architecture of the head/tail module is quite simple: both are composed of only a convolution layer and a rectified linear unit (ReLU) layer. The bottleneck modules come in two main types: a fully convolutional network (FCN)-like network (the lightest branch) and U-Net-like networks (the other branches) [33]–[35]. Both types of networks are symmetric structures. The FCN-like network consists only of eight 1×1 convolution layers and activation layers, while the U-Net-like network contains 8 blocks: 4 encoder blocks and 4 decoder blocks. We expect the FCN-like network to focus on the clear and slightly blurred image regions. Since the FCN-like network is the lightest branch, it does not need a large receptive field to recover the image regions; its goal is to maintain the information of the image regions and to remove some slight noise. Compared with the FCN-like network, the U-Net-like network has a more complex structure that introduces upsampling layers and max-pooling layers. Therefore, it can better capture the multi-scale features of the DP images. Furthermore, the U-Net-like network employs the residual channel attention block (RCAB) module, which is widely used as a basic unit in super-resolution tasks [36]. To recover image regions with different blur amounts, the U-Net-like networks (different branches) are required to have different capacities, i.e., parameters. In practice, we simply change the number of RCAB modules in each block to adapt the capacities of the different branches. From the lightest U-Net-like bottleneck module to the heaviest, the number of RCAB modules in each block increases from 1 to 3. In addition, the output of every bottleneck module has the same channel number as the basic feature $F_{enc}$, allowing the residual features $F_m^{res}$ to be added to $F_{enc}$.
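The mask-guided multi-branch structure described above can be summarized in the PyTorch sketch below. It is a structural simplification under our own assumptions: the heavier branches here are plain convolution stacks standing in for the U-Net-like modules with 1-3 RCABs per block, and, following (2), each mask gates its branch's output so that every branch still sees the spatially intact feature map.

```python
import torch
import torch.nn as nn

class BaMBNetSketch(nn.Module):
    """Structural sketch of the blur-aware multi-branch design."""

    def __init__(self, feats=32):
        super().__init__()
        # Head/tail: a single convolution plus ReLU each.
        self.head = nn.Sequential(nn.Conv2d(6, feats, 3, padding=1), nn.ReLU())
        self.tail = nn.Sequential(nn.Conv2d(feats, 3, 3, padding=1), nn.ReLU())
        # Lightest branch: 1x1 convolutions only (tiny receptive field,
        # meant to preserve in-focus regions).
        light = nn.Sequential(*[m for _ in range(4)
                                for m in (nn.Conv2d(feats, feats, 1), nn.ReLU())])
        # Heavier branches: stand-ins for the U-Net-like modules; depth
        # grows with the blur amount the branch is responsible for.
        def heavy(depth):
            return nn.Sequential(*[m for _ in range(depth)
                                   for m in (nn.Conv2d(feats, feats, 3, padding=1),
                                             nn.ReLU())])
        self.branches = nn.ModuleList([light, heavy(2), heavy(4), heavy(6)])

    def forward(self, dp_pair, masks):
        # dp_pair: (B, 6, H, W) stacked left/right views I_con
        # masks:   list of four (B, 1, H, W) defocus masks D_m
        f_enc = self.head(dp_pair)
        # Masks gate the branch *outputs* (split at the output, not the input).
        res = sum(d * b(f_enc) for b, d in zip(self.branches, masks))
        return self.tail(f_enc + res)
```

During training the masks come from thresholding the estimated COC map; the annealing strategy in Section IV-A gradually replaces these hard masks with an all-ones mask, which is why COC maps are not required at test time.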

In the training phase, we use the $L_1$ loss function between the output $\hat{I}$ and the ground truth $I_{gt}$:

$$\mathcal{L}_1 = \frac{1}{n} \sum_{i=1}^{n} \big\| \hat{I}_i - I_i^{gt} \big\|_1 \quad (3)$$

where $n$ denotes the number of samples.

        B. The COC Estimation

Unlike existing one-stage deep networks [1], [10], which apply a single network to handle different regions in the DP image, the proposed BaMBNet treats different regions (with different blur amounts) with different networks (with different capacities). Therefore, how to estimate the blur amounts in different regions of the image is a crucial step in our proposed method. In this section, we introduce a method to estimate the COC map, which indicates the blur amounts of DP image pairs, in an unsupervised way.

Firstly, according to the analysis in [15], the blur kernels of the right and left views should be mirror-symmetric:

$$k_l(x, y) = k_r(-x, y) \quad (4)$$

        Fig. 4. An example of the blur kernel model in [15].

Here {·} denotes an aggregation operator that gathers the per-pixel estimates into a map, i.e., the COC map, whose height and width are the same as those of $I_l$ and $I_r$. Due to the symmetry property of the blur kernel in (4), only one COC map is generated for the two views of one DP image pair.

As can be seen, the radius of the local neighborhood is assumed to be equal to the COC size, where the neighborhood image patch is assumed to have constant depth. Since the COC size is usually small compared with the image size, this assumption can be regarded as an application of the Riemann integral; that is, every coordinate in the DP image can be assigned its own COC size. Compared with the fixed-size image patch in (5), the extended version (6) is more reasonable and accurate.

Loss function: According to (6), we can intuitively formulate the loss function as follows:

$$\mathcal{L}_{gem} = \sum_{x} \big| \big(I_l * k_r^{c(x)}\big)(x) - \big(I_r * k_l^{c(x)}\big)(x) \big| \quad (7)$$

where $k_l^{c(x)}$ and $k_r^{c(x)}$ denote the left/right blur kernels for the COC size $c(x)$ estimated at coordinate $x$: by the symmetry in (4), cross-blurring the two views agrees only where $c(x)$ matches the true local blur amount.
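A brute-force reading of this consistency loss is sketched below with NumPy/SciPy. This is our own illustrative interpretation (the paper instead trains a network to regress the COC map against this signal), and `kernel_fn` is a placeholder for the parametric DP kernel model of [15].

```python
import numpy as np
from scipy.ndimage import convolve

def cross_blur_loss(I_l, I_r, c, kernel_fn):
    """Per-pixel discrepancy for one candidate COC size c: blur the
    left view with the right kernel and vice versa; by the kernel
    symmetry in (4) the two agree where c matches the true blur."""
    k_l = kernel_fn(c)
    k_r = k_l[:, ::-1].copy()         # mirrored right-view kernel
    return np.abs(convolve(I_l, k_r) - convolve(I_r, k_l))

def estimate_coc_map(I_l, I_r, candidates, kernel_fn):
    """Brute-force per-pixel argmin over the candidate COC sizes."""
    losses = np.stack([cross_blur_loss(I_l, I_r, c, kernel_fn)
                       for c in candidates])
    return np.asarray(candidates)[losses.argmin(axis=0)]
```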

        C. Meta-Learning Defocus Mask Generation

        Given the COC map, we expect that clear regions can be preserved with lighter branches and that blurred regions can be well recovered with heavier branches. In this section, we will introduce a defocus mask generation method to divide the continuous COC values into a limited number of levels. As shown in Fig. 3, all elements in a defocus mask belong to the same level.

An intuitive strategy is to divide the COC sizes into different levels with some pre-defined thresholds. However, hand-crafted thresholds are sub-optimal and not suitable for all images. In the following, we present an optimization method that jointly optimizes both the thresholds and the multi-branch network parameters. Here, the thresholds, which play a key role in the performance of the network, can be seen as hyper-parameters of the network. In this paper, we introduce a method capable of adaptively learning the thresholds directly from a small amount of meta-data, so they can be finely updated simultaneously with the learning of the network parameters.

Algorithm 1 Meta-Learning Defocus Mask Generation Algorithm
Input: The number of levels M, multi-branch bottleneck modules Φ = {φ_m; m = 1,...,M}, candidate solution set T′ = {t′_0,...,t′_L}, DP dataset, and estimated COC maps.
Output: The optimal thresholds T = {t_0,...,t_M}.
1: Set τ to 0.1 and L to 250.
2: Set M to 4, and fix the minimum threshold to t_0 = 0 and the maximum threshold to t_4 = 25 according to [15]. Initialize the thresholds t_1, t_2, t_3 uniformly.
while not converged do
3: Generate defocus masks from the COC maps using the current threshold set T = {t_0, t_1, t_2, t_3, t_4}.
4: Guided by the defocus masks, train the multi-branch bottleneck modules Φ on the training dataset for 2 epochs. To update the thresholds T, find the optimal branch φ_m for each minimal interval of T′ on V_small according to (9).
5: Normally, every branch is selected as the optimal one by a group of continuous intervals; the upper and lower bounds of these continuous intervals give the updated thresholds t_1, t_2, t_3.
6: Keep the thresholds t_0, t_4 constant, and update the thresholds t_1, t_2, t_3. The convergence condition is satisfied when t_1, t_2, t_3 no longer need to be updated.
end while
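The threshold-update step (lines 4-6 of Algorithm 1) can be paraphrased in a few lines of NumPy. The sketch below is our own reading: it assumes a precomputed (L, M) table of per-interval validation losses on V_small, picks the optimal branch per fine-grained interval of T′, and reads the new thresholds off the boundaries between consecutive branch assignments; the monotonicity handling is our simplification.

```python
import numpy as np

def update_thresholds(interval_losses, t0=0.0, tM=25.0):
    """interval_losses: (L, M) validation loss of each of the M branches
    on pixels whose COC size falls into each of the L candidate
    intervals of T'. Returns the M + 1 thresholds {t_0, ..., t_M}."""
    L, M = interval_losses.shape
    best = interval_losses.argmin(axis=1)        # optimal branch per interval
    edges = np.linspace(t0, tM, L + 1)           # interval boundaries
    thresholds = [t0]
    for m in range(1, M):
        hits = np.flatnonzero(best >= m)
        # The first interval claimed by branch m (or a heavier one) sets t_m.
        thresholds.append(edges[hits[0]] if hits.size else thresholds[-1])
    thresholds.append(tM)
    return thresholds

# Toy example with L = 5 intervals and M = 4 branches.
rng = np.random.default_rng(0)
print(update_thresholds(rng.random((5, 4))))
```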

        IV. EXPERIMENTS AND RESULTS

In this section, we evaluate our proposed BaMBNet model on the defocus deblurring task. First, we introduce the implementation details of the proposed method. Then, we compare our method with four state-of-the-art methods on the DPD-blur dataset [1] and the datasets of [9], [40]. Finally, we conduct extensive ablation studies to demonstrate the effectiveness of the proposed method.

        A. Implementation Details

We train our proposed BaMBNet in two steps: 1) training the COC estimation network to obtain the COC map; 2) jointly training the defocus deblurring network and determining the M + 1 thresholds.

As described in Section III-B, COC map estimation is trained with a common U-Net architecture from which the last activation layer is removed. During training, we set the input size of the DP images and the batch size to 512 × 512 and 2, respectively. We run 10 epochs with a learning rate of 2e−5 to minimize the loss in (8), where the hyperparameter λ is set to 10. Following the experimental settings of [15], we set the available range of the estimated COC size from −25 to +25, where the sign (±) only indicates the relative position between the captured scene and the focus plane. Since the range is fixed, we can use a lookup table to speed up the computation of the loss L_gem. In practice, we first build a complete table for each input pair, filled with the loss values computed for all viable COC sizes. We achieve this with a single generic parameterized convolution layer, which is essentially a vectorized operation. To calculate the loss L_gem, we then only need to select the matching value according to the currently estimated COC size at each coordinate. Note that the final estimated COC maps after training are generated by applying an absolute-value operator to the network outputs.
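A minimal PyTorch sketch of this lookup-table trick follows. The tensor shapes are our assumptions (single-channel views and one stacked weight tensor holding a blur kernel per candidate COC size): a single batched convolution fills the whole table, and per-pixel loss selection becomes a gather.

```python
import torch
import torch.nn.functional as F

def build_loss_table(I_l, I_r, kernels_l, kernels_r):
    """Precompute the L_gem value for every viable COC size at every
    pixel. I_l, I_r: (B, 1, H, W); kernels_l/r: (C, 1, k, k), one
    kernel per candidate size (right kernels are the mirrored left
    ones). One convolution per view replaces C separate passes."""
    blur_l = F.conv2d(I_l, kernels_r, padding="same")   # (B, C, H, W)
    blur_r = F.conv2d(I_r, kernels_l, padding="same")
    return (blur_l - blur_r).abs()

def gather_loss(table, coc_idx):
    """Pick, per pixel, the table entry matching the currently
    estimated (integer-indexed) COC size of shape (B, 1, H, W)."""
    return table.gather(1, coc_idx)

# Toy usage: 11 candidate sizes, 5x5 normalized kernels.
I_l, I_r = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
k_l = torch.rand(11, 1, 5, 5)
k_l /= k_l.sum(dim=(2, 3), keepdim=True)
table = build_loss_table(I_l, I_r, k_l, k_l.flip(-1))
loss = gather_loss(table, torch.zeros(1, 1, 64, 64, dtype=torch.long))
```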

After finishing the COC map estimation training, we determine the thresholds through the meta-learning strategy: the number of bottleneck modules M is set to 4, and the minimum t_0 and maximum t_4 thresholds are set to 0 and 25, respectively. We uniformly initialize the thresholds t_1, t_2, t_3 and update them every 2 epochs until the iteration converges (at around 18 epochs). The initial learning rate is set to 2e−4. We then fix the thresholds to train the multi-branch network, with an initial learning rate of 2e−4 that is halved every 60 epochs. Note that the V_small dataset used in meta-learning is split from the original training dataset rather than taken from the original validation dataset; its size is 10% of the training dataset. In addition, we use an annealing strategy to train our network. Specifically, in the early training phase, the generated output Î relies on the guidance of the defocus masks D_1, D_2, D_3, D_4. As the model gradually converges, we gradually reduce the weights associated with the masks until they are zero. To decrease the effect of continuously reducing the defocus mask weights, we introduce an all-ones mask 11^T into the process of generating the residual features.
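One way to realize this annealing is to blend each hard mask with the all-ones mask, as in the short sketch below. The linear schedule for the blending weight is our own assumption; the text only states that the mask weights are reduced to zero as training converges.

```python
def annealed_masks(masks, alpha):
    """Blend the hard defocus masks D_m toward the all-ones mask 11^T:
    alpha decays from 1 to 0, so early training is fully mask-guided
    while late training lets every branch see all regions."""
    return [alpha * d + (1.0 - alpha) for d in masks]  # scalar broadcasts as ones

# Illustrative linear decay over the first 60 epochs (our choice).
def alpha_at(epoch, total=60):
    return max(0.0, 1.0 - epoch / total)
```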

        B. Evaluation on Dual Pixel Datasets

We pre-process the DPD-blur training dataset exactly following the settings of DPDNet [1]. Specifically, a sliding window of 512 × 512 pixels with 60% overlap is applied to crop image patches from the 1680 × 1120 training images. By computing the sharpness energy, we discard the 30% most homogeneous cropped patches, which has been experimentally validated in [1] to achieve the best performance. In addition, the V_small dataset used in meta-learning is randomly selected from the training dataset.
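This preprocessing pipeline can be sketched in NumPy as below. The stride follows directly from the 60% overlap; the gradient-magnitude `sharpness_energy` is our stand-in, since the exact energy measure used in [1] is not restated here.

```python
import numpy as np

def sharpness_energy(patch):
    """Mean gradient magnitude as a proxy for sharpness energy."""
    gy, gx = np.gradient(patch.mean(axis=-1))
    return float(np.hypot(gx, gy).mean())

def crop_patches(img, size=512, overlap=0.6):
    """Sliding-window crops with 60% overlap (stride = 40% of size)."""
    stride = int(size * (1.0 - overlap))
    h, w = img.shape[:2]
    return [img[y:y + size, x:x + size]
            for y in range(0, h - size + 1, stride)
            for x in range(0, w - size + 1, stride)]

def drop_homogeneous(patches, ratio=0.3):
    """Discard the 30% most homogeneous (lowest-energy) patches."""
    energy = np.array([sharpness_energy(p) for p in patches])
    keep = energy.argsort()[int(ratio * len(patches)):]
    return [patches[i] for i in keep]

# Usage on one 1680 x 1120 RGB training image:
img = np.random.rand(1120, 1680, 3)
patches = drop_homogeneous(crop_patches(img))
```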

        TABLE I QUANTITATIVE EVALUATION RESULTS IN TERMS OF PSNR, SSIM, AND MAE FOR DIFFERENT DEFOCUS DEBLURRING METHODS ON THE DPD-BLUR DATASET. THE BOLD NUMBERS INDICATE THE BEST RESULTS WHILE THE SECOND BESTS ARE MARKED BY UNDERLINES

To verify the effectiveness of the proposed method, we compare against four methods: the edge-based defocus blur (EBDB) method [18], the defocus map estimation network (DMENet) [19], the dual-pixel defocus deblurring network (DPDNet) [1], and the DP-based depth and deblur network (DDDNet) [10].

Note that EBDB [18] and DMENet [19] were proposed for defocus map estimation and cannot be directly applied to defocus deblurring. Following the advice of [1], we additionally leverage a non-blind deblurring method [42] with the estimated defocus map to deblur the defocused images. Since DPDNet [1] shares the same experimental settings as our method, we directly evaluate the released trained model with the best reported performance. DDDNet requires first training on the NYU dataset for depth estimation, then training on its own private dataset for deblurring, and finally fine-tuning on the DPD-blur dataset. For convenience, we use the pretrained model provided by the authors to evaluate on the DPD-blur dataset.

Evaluation metrics: All methods are evaluated by five metrics: peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [43], mean absolute error (MAE), learned perceptual image patch similarity (LPIPS) [44], and Fréchet inception distance (FID) [45]. The PSNR, SSIM, and MAE provide traditional standard measurements of reconstruction errors, while LPIPS and FID supplement them with similarity judgments from human and semantic perception.
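For reference, the three traditional metrics can be computed with scikit-image as sketched below (LPIPS and FID require pretrained networks and are omitted); images in the [0, 1] range are our assumption.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def traditional_metrics(pred, gt):
    """PSNR / SSIM / MAE between a prediction and its ground truth,
    both float arrays of shape (H, W, 3) with values in [0, 1]."""
    return {
        "PSNR": peak_signal_noise_ratio(gt, pred, data_range=1.0),
        "SSIM": structural_similarity(gt, pred, channel_axis=-1,
                                      data_range=1.0),
        "MAE": float(np.abs(gt - pred).mean()),
    }
```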

1) Quantitative Results: The defocus deblurring results of the different methods are reported in Table I. Our method achieves significant improvements over the other SOTAs. As can be seen, the DPD-blur test dataset is divided into three scene categories: indoor, outdoor, and combined. The indoor scenes consist of 37 pairs of DP images and have smaller depth variations than the outdoor scenes. The combined scenes are the combination of indoor and outdoor scenes and involve 76 DP image pairs in total. In general, our method achieves the best performance in all scene categories under both the PSNR and SSIM evaluations. Specifically, although DPDNet and our method use exactly the same training dataset, our method improves the performance by 1.20 dB in terms of PSNR, and the parameters of our method are also reduced by 85%. The improvement is further confirmed by the SSIM (+0.035) on the combined scenes. The reason why the parameters of BaMBNet are much fewer than those of DPDNet is that the depth and width of BaMBNet are much smaller. Moreover, we use residual blocks with 1×1 convolution layers (which reduce and then restore dimensions) to decrease the input/output dimensions of the 3×3 convolution layers, obviously reducing the parameters of BaMBNet. Since COC maps are not required during the testing phase, the total number of parameters of our method does not include the parameters of the COC estimation network. Regarding runtime, the proposed BaMBNet achieves results comparable to those of DPDNet and DDDNet. Here, the runtime of EBDB and DMENet is not available because the deconvolution in these two methods requires massive computation on the CPU. Nevertheless, our method still surpasses these two SOTAs, mainly because our multiple branches can cope with larger blur variations while keeping the in-focus region details.

Aside from the traditional metrics that mainly measure reconstruction errors, we also provide two recent perceptual metrics, LPIPS and FID, which are widely used to evaluate the perceptual quality of generated images in low-level vision tasks. Perceptual metrics leverage deep semantic representations that consider context-dependent and high-order image structures, which are closer to human judgments of similarity. As shown in Table II, compared to the results of DPDNet, our LPIPS and FID values decrease by 37.7% and 16.8%, respectively. It is well known that LPIPS is more sensitive to blur and that FID reflects the similarity of images in a high-dimensional space. Therefore, the better perceptual metrics indicate that our method generates sharper images. When compared with DDDNet [10], the latest and most competitive method, the proposed method achieves a considerable improvement in all metrics.

        TABLE II QUANTITATIVE RESULTS IN TERMS OF LPIPS AND FID FOR DIFFERENT DEFOCUS DEBLURRING METHODS ON THE COMBINED DPD-BLUR DATASET

        Fig. 5. Qualitative results of state-of-the-art deblurring methods on DPDBlur dataset. The second row is the estimated COC maps by our unsupervised method given the inputs of the first row. We highlight the cropped patches by green and red boxes.

2) Qualitative Results: In Fig. 5, we present the defocus deblurring results of different methods on the DPD-blur dataset for five representative scenes, i.e., the 1st–4th columns show outdoor scenes, and the 5th column shows an indoor scene. The proposed BaMBNet produces more appealing deblurring results than the comparison methods. This demonstrates that our deblurring method is robust to varying scenes and that our results have better visual quality.

        Fig. 6. Qualitative comparisons of various defocus deblurring methods on the dataset [40].

        Fig. 7. Qualitative comparisons of various defocus deblurring methods on the dataset [9].

Taking the first column as an example, we can see that DPDNet [1] not only generates unexpected artifacts in regions with mild blur amounts, but also fails to remove the defocus in regions with serious blur amounts, highlighted by the red and green boxes, respectively. In contrast, our method appropriately reconstructs the blurred regions and well maintains the sharp regions, as shown by the highlighted zoomed-in boxes.

In addition, let us focus on the sharp regions that are not affected by blur and check whether these regions are fully preserved in the deblurring results. For instance, the regions highlighted by the red boxes in the last two columns demonstrate that the sharp regions can be well preserved. For the deblurring region shown in the fourth column, we expect the region, “grass”, to remain the same as the input, while DPDNet generates artifacts and damages the details. A similar issue appears in the last column, where the recovered result generated by DPDNet loses some details of the edge. The performance of DDDNet is worse than that of DPDNet, and DDDNet introduces artifacts into the deblurring results, as highlighted in the last column. In contrast, thanks to the lightest branch in the multi-branch design, our method preserves the sharp information of the input without additional artifacts. This demonstrates that our network handles varying scenes more effectively than its counterparts.

In the existing literature, only the DPD-blur dataset provides the full source data and corresponding ground truth. Therefore, all comparison methods are trained or fine-tuned on the DPD-blur dataset. To further evaluate the generalizability of these methods, we test the pretrained comparison models on the datasets of [40] and [9]. Qualitative comparisons are shown in Figs. 6 and 7, respectively. On the dataset of [40], our recovered results show more fine details, and the words are easier to distinguish, as shown in the first example of Fig. 6. In contrast, only a few words can be recognized in the best deblurring result generated by the comparison methods. In the second example, the proposed BaMBNet generates clearer background textures, while the comparison methods cannot recover the blurred region well and fail to reconstruct some contents from the blur. For the dataset of [9], the images show much smaller defocus blur than the other dataset. It is a great challenge to remove the blur while reconstructing appropriate high-frequency edge textures. As shown in the first example in Fig. 7, DPDNet and DDDNet tend to generate opposite types of deblurring results: DPDNet hallucinates excessive texture details and introduces artifacts, while DDDNet removes insufficient blur from the input, resulting in indistinct edges in the recovered image.

The deblurred results confirm that our method surpasses the other methods, as our results show fewer high-frequency artifacts while keeping sharper edges. By testing on these more challenging datasets, we can see that the proposed BaMBNet is more competitive in generalization ability.

        C. Ablation Studies

To comprehensively verify the components of the proposed network, we perform multiple groups of ablation studies, as shown in Table III.

Branch number is used to verify the effectiveness of multiple branches, where the comparison networks with different numbers of branches have the same total number of network parameters. In general, the performance of the network increases as the number of branches increases; when the number of branches is equal to 4, the network reaches its best performance. Note that the 2-branch and 4-branch settings have nearly identical results in terms of PSNR and MAE. However, the 4-branch setting shows better performance in terms of SSIM and LPIPS, which indicates that the recovered results show clearer textures and better visual perception. Beyond that, increasing the number of branches does not yield better results, and the performance even decreases. This demonstrates that setting the number of branches to 4 is efficient.

Defocus masks include two strategies for validating the effectiveness of the guidance information from the COC map. One strategy is to ignore the defocus masks during training; the other is to place the defocus masks behind the head module to handle the features before they are fed into the multiple branches. When we remove the guidance information, the network shows the worst performance among all comparison methods with the complete network architecture. This proves that the defocus masks play a key role in the defocus deblurring task. In addition, the network suffers a performance drop when we send different regions to different branches by moving the defocus masks behind the head module. The reason why the defocus masks do not work well there may be that the semantic information fed into the different branches is broken.

        TABLE III BAMBNET ABLATION EXPERIMENTS. THE DEFAULT IS: THE DEFOCUS MASKS ARE GENERATED FROM META-LEARNING STRATEGY AND WORK ON FEATURE MAP EXTRACTED BY THE MULTIPLE BRANCHES, THE NUMBER OF BRANCHES IS 4, AND ALL RESIDUAL BRANCHES ARE USED TO DEBLUR. DEFAULT SETTINGS ARE MARKED IN GRAY

Threshold analysis is used to demonstrate the advantages of the meta-learning strategy, in which we use two intuitive strategies, called random assignment and manual assignment, for comparison with the meta-learning strategy. Due to the assistance of prior information, manual assignment achieves better results than random assignment. Nevertheless, neither of these two intuitive strategies surpasses the meta-learning strategy, which demonstrates its effectiveness.

Residual branch provides direct certification of the usefulness of each branch in the network. In this group, we extract the residual features of each branch from one pre-trained model, denoted the 1st, 2nd, 3rd, and 4th, respectively. Instead of using all residual features for testing, we select part of them as a testing strategy for defocus deblurring. To show the usefulness of the residual features in the multiple branches, we feed selected parts of the residual features from the same pre-trained network into the tail module. According to the results reported in Table III(d), we can see that every branch provides a gain. Besides, the heaviest branch, the “4th residual”, has an obvious effect on the blurred regions, as indicated by the change in LPIPS. Furthermore, we also show qualitative results for the residual branches in Fig. 8, which will be discussed later.

Fig. 8. Qualitative results of the network with various residual branch bottlenecks. (a) and (b) show the input view and the ground truth, respectively. (c) shows the result of our network. To investigate what role each branch plays in deblurring, we extract the residual features of each branch from a pre-trained model and select only a part of them for the tail module to generate deblurring results. To make the description clear, we denote the branches from the lightest to the heaviest as 1st, 2nd, 3rd, and 4th, respectively. In the second row, the selected residual features increase from the light branch to the heavy branch; in the third row, they increase from the heavy branch to the light branch.

Fig. 9. Comparisons between defocus masks generated at different optimization epochs, and qualitative comparison results of the COC map. (b) and (c) show the COC maps estimated by the model of [15] and by our model. Yellow, light blue, blue, and purple represent heavy, medium, light, and slight blur amounts, respectively.

        Fig. 10. Qualitative comparisons are designed to validate the effectiveness of defocus masks. (a) shows the input view. (b) shows the result of without defocus masks. (c) shows the result of our normal network. We also display the zoom-in cropped patches for comparison. Since (b) and (c) share totally the same network architecture, (c) shows better deblurring performance than (b) under the guidance of defocus masks. (d) shows the corresponding ground truth.

Apart from the quantitative comparisons, we also show the defocus masks generated at different iteration epochs. As shown in Fig. 9, the initial defocus masks split the entire image into four separate groups (here we directly divide the entire COC interval into four equal parts). After several iterations, the masks are adjusted to be more reasonable. For example, the regions marked in yellow correspond to the image regions with large blur amounts. Besides, the defocus masks do not require pixel-wise precision because the masks work on deep features with inexact spatial position information. Generally, the regions with larger blur amounts tend to use the heavier branches for restoration, while the slightly blurred regions are assigned to the lighter branches, which have small receptive fields to avoid artifacts. These results indicate that the refined defocus masks offer effective guidance information (i.e., the blur distribution of the image) for the multi-branch bottleneck modules.

Qualitative results depicting the advantage of the defocus masks are shown in Fig. 10. The network trained without the defocus masks fails to recover the word, i.e., “market”; the deblurred characters are difficult to recognize. This is because a network trained without the assistance of defocus masks treats the blurred characters and other clean regions equally, making it difficult for the network to focus on deblurring the blurred characters. In contrast, the network trained with the assistance of defocus masks easily distinguishes blurred characters from other clear regions and thus focuses on these blurred regions (by assigning a heavy branch with large capacity to them). In summary, our network achieves a performance improvement both in objective metrics and in visual results with the assistance of the defocus masks.

We adopt an extremely strict way to explore the role played by the residual features extracted from each branch bottleneck. Fig. 8 shows the qualitative results of various ablation strategies. Instead of retraining the network with a different number of branches for comparison, the models employed by the ablation strategies directly drop the residual features of some branches from the pre-trained model. Therefore, the deblurring results generated by the network with incomplete residual features have slight color distortions. In our design scheme, the heavy branch is used to focus on the blurred regions, so when the residual features extracted from the heavy branch are dropped, the blurred regions cannot be recovered sharply. Just as expected, when the lighter branch takes part in deblurring, the network pays attention to preserving the details in the in-focus region. For instance, the in-focus region highlighted by the green box keeps a lot of texture information, as shown in Fig. 8(d). As more residual features from the heavier branches gradually participate in the deblurring, the defocused blurred regions become gradually sharper and recover more texture details. Aside from exploring the effect of the lightest branch, we also present the results of the ablation strategy with only the heaviest branch in Fig. 8(g). Compared with the result in Fig. 8(d), the blurred region of Fig. 8(g) shows a remarkable improvement, highlighted by the red box, while the in-focus region of Fig. 8(g) has fewer details. This successfully demonstrates that the heavier residual branch focuses on the regions with larger blur. Our method adaptively assigns the regions to the proper branches according to their blur amounts, which is beneficial for simultaneously preserving the details in the in-focus region and deblurring the blurred region.

D. Some Failure Cases in Terms of PSNR

Fig. 11. Example failure cases of our method. The number shown on the images in the first two rows is the PSNR measure compared with the GT. δ is the enhanced residual map of the given images.

Overall, the performance of our proposed BaMBNet is much better than that of the SOTA methods under the PSNR metric. However, when traversing the PSNR results on the test dataset, we find that some of our samples show a considerable gap compared with the SOTAs. To investigate the reasons, we present the two worst cases (e.g., a 2 dB decrease in terms of PSNR) and provide zoomed-in patches of the most blurred and the clearest regions in Fig. 11. Although our method shows the worst evaluation results in terms of the PSNR metric, our deblurring results dramatically alleviate the blur in the most blurred regions, shown in the red boxes. Moreover, our results do not show visible differences in the clear regions highlighted by the green boxes. Therefore, our method tends to obtain clear and sharp results even when the PSNR result is poor. To explain the difference between the PSNR measurement and human perception, we present the residual images between the results and the ground truths in the last two rows of Fig. 11. We find an interesting phenomenon: the values of the residual images computed from the input images and the ground truth are not zero in the in-focus region. These residual values (which should be zero) are even higher than the residual values computed between the DPDNet results and the ground truth. Therefore, we conjecture that both the ground truth and the input images have been disturbed by noise or that they are misaligned.

Advantages and disadvantages: The key idea of our method is to use prior information, i.e., the blur amount at each pixel, to guide the network to tackle the blurred regions with different branches of different learned parameter budgets. The advantage of the method is that we avoid using too many parameters, which would overfit the clear regions and fit the noise, and avoid using insufficient parameters, which would underfit the regions with heavy defocus blur. Therefore, our method always maintains more details without artifacts in the slightly defocused regions and generates sharper edges in the seriously blurred regions. However, there may be a data imbalance problem in the distribution of blur amounts in the dataset. Since the blur amount is closely related to the depth variations, the distribution of blur amounts is long-tailed. For the heaviest branch, there are not enough samples to train the network. As a result, although the proposed method can generate sharp images in heavily defocused regions, the generated sharp regions are still obviously different from the ground truth.

        E. Discussion

Our method focuses on treating regions with different blur amounts differently. To measure the blur amounts of DP images, we estimate the COC map, which provides explicit guidance for defocus deblurring. Subsequently, we explore how to effectively incorporate the estimated COC map into the defocus deblurring task. To this end, we employ a meta-learning strategy to generate a proper zoning of the COC map, and our method achieves interaction and collaboration among the multiple branches. As a consequence, our method requires solving two auxiliary tasks before training for defocus deblurring, so the performance of the proposed BaMBNet depends on the effectiveness of these preceding tasks. During the COC map estimation, we only keep the L_gem loss active while training the network in the first several iterations, and then replace the single loss L_gem with the full loss (8) to continue training. The advantage of starting the optimization with only the loss L_gem is to ensure that the network focuses on estimating the COC size first. When finding the optimal thresholds with the meta-learning strategy, we refer to the distribution of blur amounts to initialize the thresholds. In detail, we compute thresholds that equally split the blur amounts among the branches of BaMBNet. Compared with a random initialization, this specially designed initial value remarkably boosts the optimization process.

Limitations and broader impacts: Although the proposed BaMBNet does not need to estimate the COC map during the test phase, there are two inevitable preparation steps before training for defocus deblurring. The performance of these additional preparations strongly affects the optimization of the deblurring, which makes the paradigm somewhat complicated to run and not very friendly to those interested in it. Since the COC map is also computed from the DP input, in the future we plan to use a unified framework that fuses the additional preparation work into the deblurring task. The advantage of a unified network is that it simplifies the current overall framework, and the parameters of the network can be jointly optimized by training together.

The proposed method uses a multiple branch architecture to discriminatively recover blurred regions with different blur amounts. In that case, spatially varying data can be quantified into different complexity levels and treated with different strategies. In the computer vision community, spatially varying data are widely present in various tasks, such as image denoising [46], light estimation [47], and high dynamic range imaging [48]. If we can measure the distribution of the spatial variation, BaMBNet can directly serve as an available method. Therefore, it is worth exploring the application of BaMBNet to other low-level image processing tasks.

        V. CONCLUSIONS

In this paper, we propose a blur-aware multi-branch network (BaMBNet) to deblur real defocused images with nonuniform blur amounts. BaMBNet automatically applies the lighter branches to the clear regions to maintain the detailed information, while assigning the heavier branches to the blurred regions to recover the latent structural edges. In particular, we first devise an unsupervised scheme to estimate the COC map from the DP data, then generate defocus masks from the COC map by a meta-learning strategy. The defocus masks define the mapping from regions to branches. Finally, we employ the defocus masks to guide the multi-branch bottlenecks with different parameters to handle blurred regions with different recovery complexities. In this way, the overall complex optimization problem is divided into multiple simple subproblems; each branch only needs to solve its assigned subproblems, which avoids overfitting on the clear regions and underfitting on the most blurred regions. Experimental results show that our BaMBNet is robust to defocus blur over a wide depth range and achieves a better balance between maintaining the information of clear regions and generating sharp details for the blurred regions.
