Yichen Zhang and Lu Yu
(Department of Information Science & Electronic Engineering, Zhejiang University, Hangzhou 310007, China)
Abstract The Moving Picture Experts Group (MPEG) has been developing a 3D video (3DV) coding standard for depth-based 3DV data representations, especially the multiview video plus depth (MVD) format. With MVD, depth-image-based rendering (DIBR) is used to synthesize virtual views from a few transmitted pairs of texture and depth data. In this paper, we discuss ongoing 3DV standardization and summarize the coding tools proposed in the responses to MPEG's call for proposals on 3DV coding.
Keywords: 3DV coding; call for proposals; auto-stereoscopic; depth map
Depth-based 3D video (3DV), including multiview video plus depth (MVD), has attracted interest from industry and academia [1]. A 3DV sequence comprises texture videos from several views and their associated depth maps. With depth data, virtual views can be synthesized from the transmitted views using depth-image-based rendering (DIBR) [2].
Depth-based 3DV formats have some advantages over conventional multiview formats. In a multiview video shot with a camera rig, a stereo pair can be presented on a stereoscopic display, but the baseline of that pair is fixed. In this case, the 3D experience may not suit all users because they have different preferences for depth intensity (which is mainly dictated by the baseline). With view synthesis, arbitrary viewpoints between two coded views can easily be interpolated, and new stereo pairs with the desired baseline distances can be generated. This enables disparity-adjustable stereoscopic video. Autostereoscopic displays, which provide a glasses-free 3D experience, require five, nine, or even 28 views as input. Coding all views one by one (simulcasting) or using multiview video coding (MVC) is inefficient. With 3DV, most of the required views can be rendered from a few coded views, and a video in 3DV format consumes less bandwidth than one in a multiview format.
At the 96th MPEG meeting in Geneva, a call for proposals (CfP) on 3DV coding technology was issued [3]. The CfP marked the start of standardization of depth-based 3D formats, among which MVD was the first priority.
In section 2, we introduce the CfP, including all requirements and test conditions. In section 3, we summarize the responses to the CfP, give a brief overview of the proposed 3DV coding tools, and introduce three representative coding algorithms. In section 4, we discuss the standardization schedule of 3DV. Section 5 concludes the paper.
In the CfP, two classes of test sequences (MVD format) were used as test material. One class comprised four sequence sets: Poznan_Hall2, Poznan_Street, Undo_Dancer, and GT_Fly, at 1920 × 1088 resolution and 25 frames per second (fps). The other class comprised four sequence sets: Kendo, Balloons, Lovebird1, and Newspaper, at 1024 × 768 resolution and 30 fps. The individual sequences in each set were eight or ten seconds long. For each sequence set, two and three specified views of texture and depth data were the input of the two-view and three-view test scenarios, respectively.
3DV coding must be compatible with existing H.264/AVC (advanced video coding) or the emerging high-efficiency video coding (HEVC). The compressed data format in 3DV must be compatible with that of H.264/AVC, which supports mono or stereo video: existing AVC decoders must be able to reconstruct samples of the mono (AVC-compatible) or stereo (MVC-compatible) views from the compressed bit streams. Similarly, the 3DV compressed data format must also be compatible with the HEVC standard, which was close to completion when the 3DV CfP was issued: HEVC decoders must be able to reconstruct at least one view from the compressed bit streams [4].
Two test categories were defined in the CfP: AVC-compatible, and HEVC-compatible and unconstrained. In the former, proposals must be AVC-compatible; in the latter, proposals are HEVC-compatible or have no compatibility constraints [3]. Fig. 1 shows examples of AVC-compatible and MVC-compatible coding schemes.
For the AVC compatibility test, anchors for the objective and subjective measurements were generated by encoding the test sequences with an MVC encoder (JMVC version 8.3.1). For the HEVC compatibility test, anchors were generated with an HEVC encoder (HM version 2.0). Both encoders used a high-efficiency, random-access configuration.
At the time the CfP was issued, the JMVC and HM encoders were the state-of-the-art reference software for the AVC and HEVC standards.
For the AVC compatibility test, MVC was applied separately to the texture data and the depth data. For the HEVC compatibility test, HEVC simulcast was used for each view of texture and depth data. To calculate objective rate-distortion (RD) performance and provide appropriate material for subjective evaluation, four rate points (R1, R2, R3, and R4) were determined for each test sequence; R1 was the lowest rate point, and R4 was the highest. The rate points differed from sequence to sequence according to the coding results of previous exploration experiments, which had been conducted to develop software and test coding configurations for 3DV standardization [5]. The ratio of R4 to R1 varied from 2 to 5. In all submissions for the CfP, bit rates were required to stay below the corresponding target rate points.
Submissions for the CfP (including decoded and synthesized videos) were subjectively evaluated against the anchors using stereoscopic and autostereoscopic displays.
Stereoscopic evaluation was performed as follows. In the two-view test scenario, the stereo pair for stereoscopic viewing comprised one of the two decoded views and a synthesized view rendered from the two decoded views. In the three-view test scenario, two stereo pairs were selected. One was centered on the center view of the three decoded views; the other was randomly selected and had the same baseline distance between the left and right views.
For subjective tests with 28-view autostereoscopic displays, all input views were formed from the three decoded views and 28 synthesized views. The synthesized views could be produced using the VSRS software or another synthesis method. The 28 synthesized views were distributed evenly among the three coded views. Thus, the subjective tests fully took into account the quality of both reconstructed and synthesized texture videos.
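The even distribution of synthesized viewpoints can be sketched as follows. The camera positions and spacing rule here are illustrative assumptions, not values from the CfP test conditions.

```python
# Sketch: place synthesized viewpoints evenly across the span of the
# coded views. Camera positions below are illustrative assumptions;
# the real tests used per-sequence camera parameters.

def synthesized_positions(coded, n_synth):
    """Return n_synth viewpoint positions spaced evenly between the
    leftmost and rightmost coded views (coded positions excluded)."""
    left, right = min(coded), max(coded)
    step = (right - left) / (n_synth + 1)
    return [left + step * (i + 1) for i in range(n_synth)]

coded_views = [1.0, 3.0, 5.0]              # three coded camera positions
views = synthesized_positions(coded_views, 28)
print(len(views))                          # 28 synthesized positions
```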
Twenty-three proposals were submitted for the CfP. Proposals from Nokia, Sony, Nagoya University, Fraunhofer HHI, NICT, Qualcomm, Philips and Ghent University-IBBT, NTT, Sharp, Samsung, MERL, and Zhejiang University were in the AVC-compatible category. Proposals from RWTH Aachen University, Sony, Fraunhofer HHI (two proposals with different encoder and renderer configurations), Disney and HHI, LG Electronics, Ericsson, ETRI and Kwangwoon University, Samsung (two proposals with different coding tools and prediction structures), and Poznan University of Technology were in the HEVC and unconstrained category.
Prior to the 98th MPEG meeting, the submitted test materials were subjectively assessed in 13 test laboratories around the world [6]. The subjective evaluations showed that, for most test sequences, the subjective quality of R3 of the best-performing proposal was better than R1 of the anchor. This suggests a significant improvement in coding efficiency compared with the anchor. In terms of objective performance, more than 25% rate saving was reported by several proponents.
Existing video coding standards, for example, H.264/AVC or HEVC, can be used to encode the depth-based 3DV format. However, these standards are optimized for single-view 2D video coding. MVC is an extension of H.264/AVC designed for coding a number of texture video sequences with inter-view prediction. It could be a good candidate for encoding depth-based 3DV; however, it does not take into account the unique functionality or statistical properties of depth data, and it does not exploit the coherence between texture and depth signals. New coding tools were included in some of the coding platforms of CfP submissions, and these tools are designed to improve coding efficiency over MVC-based 3DV. These coding tools, which stand apart from those usually found in AVC, MVC, or HEVC, can be classified into five categories:
▲Figure 1. (a) AVC-compatible category for the two-view case, (b) AVC-compatible category for the three-view case, and (c) MVC-compatible category for the three-view case.
1) Texture coding of dependent views, independent of depth. This involves coding the texture images of a side view. A side view is any view other than the first view in coding order. The first view (also called the base view) is expected to be fully compatible with AVC or HEVC; a side view additionally uses inter-view texture information. Tools in this category include motion-parameter prediction and coding, and inter-view residual prediction.
2) Texture coding of dependent views, dependent on depth. This applies to side-view texture, where original or reconstructed depth information is used to further exploit the correlation between texture images and their associated depth maps. Tools in this category include view synthesis prediction for texture and depth-assisted in-loop filtering of texture.
3) Depth coding that is independent of texture. Inter-view depth information or neighboring reconstructed depth values are used to compress the current macroblock of the depth map. Tools in this category include depth intra coding; synthesis-based inter-view prediction and inter-sample prediction; and in-loop filtering for depth.
4) Depth coding that depends on texture. Original or reconstructed texture information is used to further exploit the correlation between texture images and their associated depth maps. Tools in this category include prediction-parameter coding, intra-sample prediction, and coding of quantization parameters.
5) Encoder optimization. Tools in this category include new rate-distortion optimization (RDO) techniques for depth and texture encoding. They do not affect syntax or semantics.
Tools in the first four categories are used for both encoding and decoding. Tools in the last category are used for encoder optimization only [7].
The proposal from Zhejiang University focused on depth coding that is AVC-compatible and MVC-compatible. Several tools were proposed, and the average rate reduction for both the two-view and three-view cases was 8% [8]. Three coding tools from this submission are introduced in the following subsections. Two of them, view synthesis prediction (VSP) and joint rate-distortion optimization (JRDO), were also proposed in several other submissions.
3.2.1 View Synthesis Prediction for Depth
VSP for depth is a depth-coding tool that synthesizes inter-view depth reference pictures. A depth map of a side view is rendered from the reconstructed depth picture of another view (often the base view) and is then inserted into the reference picture list. The rendering involves 1) projecting depth pixels onto the target side view according to the pixels' depth values and the corresponding camera parameters, and 2) filling the holes in the warped image.
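The two rendering steps can be sketched as follows for a simplified 1-D parallel camera arrangement. The linear disparity model (disparity proportional to the depth-map value) and hole filling from the nearest left neighbor are simplifying assumptions; the actual renderer uses full camera parameters.

```python
import numpy as np

# Sketch of VSP for depth under an assumed 1-D parallel camera rig:
# each depth pixel is shifted horizontally by a disparity derived from
# its own depth value; disocclusion holes are then filled.

def warp_depth(depth, scale):
    """Forward-warp a depth map to a target view.
    disparity = scale * depth (simplified parallel-rig model)."""
    h, w = depth.shape
    warped = np.full((h, w), -1, dtype=np.int32)   # -1 marks holes
    for y in range(h):
        for x in range(w):
            tx = x + int(round(scale * depth[y, x]))
            # Larger depth-map value = closer object, so it wins conflicts
            if 0 <= tx < w and depth[y, x] > warped[y, tx]:
                warped[y, tx] = depth[y, x]
    # Step 2: fill holes from the nearest non-hole pixel to the left
    for y in range(h):
        last = 0
        for x in range(w):
            if warped[y, x] < 0:
                warped[y, x] = last
            else:
                last = warped[y, x]
    return warped

depth = np.array([[10, 10, 50, 50],
                  [10, 10, 50, 50]])
print(warp_depth(depth, 0.02))
```

The foreground region (value 50) shifts by one pixel, opening a one-pixel disocclusion hole that the fill step closes with the background value.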
Inter-view prediction with VSP is better than inter-view prediction in MVC, which is based directly on reconstructed pictures from other views. View synthesis compensates for the disparities present in MVC-style inter-view reference pictures. A depth map synthesized to the target view may be closer to the depth map being coded than the reconstructed depth map of another view, provided that the original depth maps have high inter-view coherence.
3.2.2 Joint Rate-Distortion Optimization
JRDO is a new RDO method for depth coding. Distortion is not measured as the loss of depth fidelity due to coding, that is, as the mean-squared error between the original and reconstructed depth signals. Instead, JRDO measures distortion based on reconstructed texture and depth values in order to estimate the distortion in synthesized views. The quality of a synthesized view is more important than that of the depth data, which is never viewed directly.
Specifically, the distortion measurement in JRDO is a function of the depth distortion (between original and reconstructed values) and the corresponding texture gradient. This measurement is given by
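One representative form of this measurement, a sketch consistent with the symbols defined below (the proposal's exact weighting may differ), is:

```latex
D_{\mathrm{JRDO}} = \sum_{(x,y)\in B}
\bigl| D(x,y) - d(x,y) \bigr|
\Bigl( \bigl| t(x+1,y) - t(x-1,y) \bigr|
     + \bigl| t(x,y+1) - t(x,y-1) \bigr| + c \Bigr)
```

Here the absolute depth error at each pixel is scaled by the magnitude of the horizontal and vertical texture differences, so depth errors in textured regions contribute more to the total distortion.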
where D(x,y) and d(x,y) denote the original and reconstructed depth values of pixel (x,y), respectively, and t(x,y) denotes the reconstructed texture value. B is the current coding block, and c is a constant.
The distortion measurement reflects the fact that the same depth distortion generally causes larger synthesis errors in highly textured regions than in textureless regions.
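A minimal sketch of such a gradient-weighted distortion follows. The weighting (absolute depth error times texture-gradient magnitude plus a constant) matches the description above, but the exact formula in the proposal may differ, and boundary handling here simply follows `np.gradient`.

```python
import numpy as np

# Sketch of a JRDO-style distortion: depth errors are weighted by the
# local texture gradient, so errors in textured regions cost more.
# The exact weighting in the proposal may differ; c is a constant.

def jrdo_distortion(D, d, t, c=1.0):
    """D: original depth block, d: reconstructed depth block,
    t: reconstructed texture block (all float arrays of the same shape)."""
    # Texture gradients: central differences inside the block,
    # one-sided differences at its borders (np.gradient's behavior).
    gx = np.abs(np.gradient(t, axis=1))
    gy = np.abs(np.gradient(t, axis=0))
    weight = gx + gy + c
    return float(np.sum(np.abs(D - d) * weight))

D = np.full((3, 3), 100.0)                     # original depth block
d = D.copy(); d[0, 0] = 98.0; d[0, 2] = 102.0  # two equal-size depth errors
t = np.array([[10., 10., 90.]] * 3)            # texture edge on the right
print(jrdo_distortion(D, d, t))                # 164.0: the error at the
                                               # textured edge dominates
```

Both depth errors have magnitude 2, but the one at the textured edge is weighted by a gradient of 80, while the one in the flat region is weighted only by c.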
3.2.3 BBDS in CABAC for Coding the mb_type Element
The distribution of mb_type in depth coding is different from that in texture coding. Therefore, a new binarization based on the distribution of the syntax element (BBDS) is proposed for mb_type. The context-model selection (CS) also changes with the binarization.
As in Huffman coding, high-frequency symbols are assigned short codes, and this concept is applied to the binarization process. BBDS does not use a Huffman code directly because the distribution of a given syntax element is irregular, so the resulting code would usually be irregular as well. Instead, BBDS uses a code tree similar to configurable variable-length code (CVLC) [9] for binarization.
The binarization and CS process has four steps:
1) Remove the values of mb_type that correspond to chroma, because the depth image is grayscale.
2) Use a one-to-one mapping that translates mb_type into a new syntax element called mb_type_index. Values of mb_type with higher probability are mapped to smaller values of mb_type_index.
3) As in CVLC, a group of codes is used for the binarization. These codes reflect an approximate distribution of mb_type_index.
4) For each bin of the string, one or more context models are chosen as the model candidates.
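The four steps above can be sketched as follows. The probability ranking, the code tree, and the context assignment are illustrative assumptions; the proposal derives them from measured depth-coding distributions.

```python
# Sketch of BBDS binarization for mb_type in depth coding.
# The probability ranking and the code tree below are illustrative
# assumptions; the proposal derives them from measured distributions.

# Steps 1+2: the remaining (luma-only) mb_type values, ranked by
# assumed probability, mapped one-to-one onto mb_type_index.
MB_TYPE_BY_PROBABILITY = [0, 1, 2, 3, 8]       # hypothetical ranking
MB_TYPE_TO_INDEX = {t: i for i, t in enumerate(MB_TYPE_BY_PROBABILITY)}

# Step 3: a CVLC-like prefix code tree reflecting that ranking:
# shorter bin strings for smaller (more probable) indices.
BIN_STRINGS = ["0", "10", "110", "1110", "1111"]

def binarize_mb_type(mb_type):
    index = MB_TYPE_TO_INDEX[mb_type]          # step 2: one-to-one mapping
    bins = BIN_STRINGS[index]                  # step 3: binarization
    # Step 4: pick a context-model candidate per bin; here simply one
    # model per bin position, capped at position 3.
    contexts = [min(pos, 3) for pos in range(len(bins))]
    return bins, contexts

print(binarize_mb_type(0))   # most probable type -> shortest bin string
```

The most probable mb_type gets a one-bin string, while the least probable gets four bins, mirroring the Huffman-like idea described above.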
Standardization plans were established after the Geneva meeting.
The main goal of this work item is to enable 3D enhancements while maintaining MVC stereo compatibility. Block-level changes to AVC or MVC syntax and decoding processes will not be considered in this item. However, high-level syntax that enables efficient depth-data coding will be supported.
A short-term goal is to significantly improve coding efficiency for 3D enhancements in systems that only require 2D AVC compatibility. The syntax and decoding processes for non-base texture views and depth information may differ from AVC and MVC at the block level, provided this results in a marked improvement in coding efficiency. Coding efficiency is expected to improve by 30-40% over existing AVC and MVC technology.
A third goal is to extend the emerging HEVC design to enable efficient stereoscopic/multiview video coding and to support depth coding. Coding efficiency is expected to improve by 40-60% over the base specification of HEVC.
At the 98th MPEG meeting, a tentative timeline was established for the standardization of the three extensions (Table 1) [10]. The timeline may be slightly adjusted depending on how the standards develop.
The MVC-compatible extension is due to be finalized soon.The AVC-compatible second track will proceed at a similar pace to HEVC 3D extensions.
This paper provides an overview of recent 3DV standardization activities in MPEG. It summarizes the various coding tools proposed in submissions for the CfP on 3D video coding. In particular, three depth-coding tools proposed by Zhejiang University are described in some detail. An 8% rate reduction is possible using only a few depth-coding tools, and further reduction is possible with tools for depth-assisted texture coding and with tools that exploit the correlation between texture and depth. The best-performing codec in the CfP reduced the rate by more than 25%, which is encouraging evidence supporting the feasibility of AVC and HEVC 3D extensions.

Manuscript received: February 29, 2012
▼Table 1.Timeline for the standardization of 3DV(three categories)