Siwei Ma, Shiqi Wang, and Wen Gao
(Institute of Digital Media, Peking University, Beijing 100871, China)
Abstract Following the success of the Audio Video Standard (AVS) for 2D video coding, the China AVS workgroup started developing 3D video (3DV) coding techniques in 2008. In this paper, we discuss the background, technical features, and applications of AVS 3DV coding technology. We introduce two core techniques used in AVS 3DV coding: inter-view prediction and enhanced stereo packing coding. We elaborate on these techniques, which are used in the AVS real-time 3DV encoder. An application of the AVS 3DV coding system is presented to show the great practical value of this system. Simulation results show that the advanced techniques used in AVS 3DV coding provide remarkable coding gain compared with the techniques used in a simulcast scheme.
Keywords AVS; 3D video coding; inter-view prediction; stereo packing
The China Audio Video Standard (AVS) video coding standard is developed by the AVS workgroup, whose role is to establish general technical standards for the compression, decoding, processing, and representation of digital audio and video [1]. Over ten years, the AVS workgroup has developed a series of standards for different applications, and these standards have attracted the attention of both industry and academia. In 2007, AVS was accepted by the ITU-T IPTV focus group as one of four video formats. With the fast development of display technologies and the rapidly growing demand for 3D video (3DV) applications, high-efficiency 3DV compression is needed. The most straightforward 3DV coding scheme is simulcast, in which compression and transmission are performed separately for each view. However, simulcast ignores inter-view correlation and produces double the amount of data compared with traditional video. Thus, simulcast is not the optimal solution for 3DV coding. In 2008, the AVS workgroup launched the 3DV coding project to satisfy the demand for higher resolution and better quality that had arisen as a result of widespread 3DV usage [2].
Currently, the AVS workgroup is focused on stereoscopic video coding because of the rapidly growing 3DV market and number of applications. Two advanced stereoscopic video coding schemes have been adopted: inter-view prediction and enhanced stereo packing coding [3]. In inter-view prediction, the redundancy between the two channels is greatly reduced by allowing disparity compensation from the inter-view frame. Enhanced direct-mode prediction and enhanced motion-vector prediction further improve coding performance [4]. In enhanced stereo packing, the stereoscopic image of each view is down-sampled by half, and the two views are merged into a single frame. Prediction between the two views within the frame is allowed in order to improve coding efficiency. This technique is backward compatible with the existing 2D video coding infrastructure. To flexibly support these two 3DV coding schemes, AVS defines high-level syntax at both the system layer and the video layer.
Because of the fast development of microelectronic techniques, there is an urgent need for a dedicated AVS 3DV real-time encoder chip capable of the huge throughput and mass computation required in consumer applications. Although AVS is designed for optimized coding and low complexity, compressing high-definition (HD) stereoscopic video in real time is a very big challenge. Several key techniques have therefore been proposed, including parallel pipelined video coding, advanced rate control, and inter-view synchronization. In this paper, we review these key techniques used in encoder chip design and propose an AVS 3D system that incorporates a real-time encoder for TV broadcasting. Our proposal is the first end-to-end AVS 3DV system, and it was successfully used to broadcast the Guangzhou Asian Games 2010 on 3D TV.
In section 2, we introduce the inter-view prediction and stereo packing schemes used in AVS. In section 3, we propose the AVS 3D coding system, including the core techniques for designing an AVS 3D real-time encoder chip and broadcasting system. In section 4, we give the results of experiments conducted with these technologies. Section 5 concludes the paper.
▲Figure 1. System structure of inter-view-based AVS 3DV coding.
Fig. 1 shows the basic concept of the AVS inter-view stereoscopic coding system. The input signal comprises left and right views captured by a stereo camera. These views are coded using an AVS 3D encoder, and the resulting bitstreams are multiplexed to form the final bitstream packet. At the receiver, the bitstream packet is decoded with the AVS 3D decoder for stereo display. To ensure compatibility with AVS 2D, the sub-bitstream representing the independent view can also be decoded using an AVS 2D decoder and displayed on a conventional 2D display system.
Inter-view prediction uses the already coded data in the other view to efficiently represent the current view [3]. One of the two views, referred to as the base view or independent view, is coded independently using an unmodified AVS Part 2 video coder. Fig. 2 shows the coding structure of inter-view prediction. To ensure compatibility with monoview AVS, the number of reference frames for both the base view and the dependent view is restricted to two. The base view can be decoded independently for 2D display.
To exploit the inter-view correlation, the first frame of the dependent view is inter-predicted from the reconstructed I frame in the base view. Other P frames in the dependent view can reference either the previous P frame in the same view or the simultaneously displayed P frame in the base view. Inter-view prediction for the B frame does not noticeably improve coding performance; therefore, references for the B frame can only be reconstructed frames from the forward and backward directions in the same view.
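The reference-selection rules above can be sketched as follows. This is a minimal illustration, not standard syntax: the view labels, picture-order counts, and function name are hypothetical.

```python
def dependent_view_references(frame_type, poc, is_first_frame):
    """Return candidate reference frames for a dependent-view frame.

    poc is an illustrative picture order count; each reference is a
    (view, poc) pair.
    """
    if is_first_frame:
        # The first dependent-view frame is predicted from the
        # reconstructed I frame of the base view.
        return [("base", poc)]
    if frame_type == "P":
        # P frames may use the previous P frame in the same view or the
        # simultaneously displayed frame in the base view.
        return [("dependent", poc - 1), ("base", poc)]
    if frame_type == "B":
        # B frames reference only reconstructed frames from the forward
        # and backward directions within the same view.
        return [("dependent", poc - 1), ("dependent", poc + 1)]
    raise ValueError(frame_type)
```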
Because the AVS inter-view coding structure changes the reference-frame mechanism of the dependent view, related view-prediction techniques must also be developed. Recently, two advanced techniques were adopted by the AVS workgroup: enhanced motion-vector prediction and enhanced direct mode [4]. These techniques exploit the correlation between the base view and the dependent view in order to improve coding performance.
Conventional AVS motion-vector prediction for monoview uses scaled motion vectors from four neighboring blocks. However, for a P frame of the dependent view, it is not desirable to use the motion vectors of neighboring blocks if they refer to a different channel. To resolve this problem, enhanced motion-vector prediction is proposed. Suppose the current block is A. If A is temporally predicted, any inter-view predicted neighboring block is marked unavailable. Similarly, if A is inter-view predicted, any temporally predicted neighboring block is marked unavailable. This approach ensures that only vectors of the same prediction type are used for prediction.
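The availability rule can be expressed as a simple filter over the neighboring blocks. This sketch is illustrative; the function name and the data layout are assumptions, not from the standard.

```python
def available_neighbor_mvs(current_is_interview, neighbors):
    """Filter neighboring blocks' vectors by prediction type.

    neighbors: list of (vector, is_interview) pairs for the four
    neighboring blocks. A neighbor is usable only when its prediction
    type (temporal vs. inter-view) matches that of the current block.
    """
    return [v for v, is_iv in neighbors if is_iv == current_is_interview]
```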
In monoview AVS coding, the motion vectors of direct mode for the B frame are derived from the motion vector of the co-located block in the backward reference [5]. However, in inter-view prediction, the farthest reference frame of the backward reference is substituted by the inter-view frame. To obtain accurate motion vectors when the co-located block of the backward reference is inter-view predicted, the motion vectors of the neighboring blocks are used instead of the co-located disparity vectors.
Stereo-packing mode provides backward compatibility with the 2D TV infrastructure and improves coding performance. Fig. 3 shows how stereo-packing mode is used in AVS 3DV coding. At the encoder, each view is first decimated by half using down-sampling; then the two down-sampled frames are merged into one frame, which is the input of a conventional AVS 2D encoder. At the decoder, the bitstream can be decoded using an AVS 2D decoder and then separated into the two views. Each view is up-sampled to support 3D display. Two key techniques in stereo-packing mode are the down-sampling and up-sampling algorithms and view merging [6]. Because the sampling algorithms are non-normative for the video coding standard, various algorithms can be supported, depending on the application scenario. For more details on the sampling algorithms, refer to [6]. Currently, AVS supports two merging approaches: side-by-side and top-to-bottom (Fig. 4). These two approaches make the down-sampling and up-sampling algorithms more flexible.
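The decimate-and-merge step can be sketched as below. This is a minimal illustration on frames stored as lists of rows; plain decimation is used here for simplicity, whereas a real encoder would apply an anti-aliasing down-sampling filter, and the function names are hypothetical.

```python
def decimate(view, horizontal):
    """Halve a view by dropping every other column (horizontal) or
    every other row (vertical). view is a list of rows."""
    if horizontal:
        return [row[::2] for row in view]
    return view[::2]

def pack(left, right, mode):
    """Merge two decimated views into one frame of the original size."""
    if mode == "side_by_side":
        l, r = decimate(left, True), decimate(right, True)
        return [lrow + rrow for lrow, rrow in zip(l, r)]
    if mode == "top_bottom":
        l, r = decimate(left, False), decimate(right, False)
        return l + r
    raise ValueError(mode)
```

Either way, the packed frame has the same dimensions as one original view, which is what lets a conventional AVS 2D encoder process it unchanged.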
▲Figure 2. Inter-view prediction structure in AVS.
Coding efficiency in the stereo-packing scheme can be further improved by exploiting inter-view redundancy. Similar to inter-view prediction, AVS allows inter-prediction between the two decimated views (Fig. 5). The limitation of this technique is that encoding of the dependent view's decimated frame cannot begin until the base view has been encoded.
▲Figure 3. Stereo packing scheme for AVS 3DV coding.
▲Figure 4. (a) Side-by-side and (b) top-bottom view merging employed in AVS 3DV coding.
▲Figure 5. Inter-view coding structure in stereo packing.
To support the two kinds of AVS 3D coding, high-level syntax is designed at both the system layer and the video layer. Three AVS 3D coding schemes are signaled by the descriptor: simulcast compression, inter-view prediction, and stereo packing. We define a syntax element, view_organizing_type, for describing each coding scheme. If this syntax element is zero, both simulcast compression and inter-view prediction are supported. If it is one, only stereo packing is supported.
The syntax in the video layer indicates the different merging approaches for stereo packing mode (Table 1). Stereo packing mode is fully compatible with monoview coding when the stereo packing mode is set to zero. Moreover, a reserved value is also defined for future extension.
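The two-layer signaling described above can be sketched as a small lookup. The numeric values assigned to the packing modes and the field names are illustrative assumptions, not quoted from Table 1 or the standard text.

```python
# Assumed value assignment for the video-layer stereo packing mode;
# only mode 0 (monoview-compatible) and a reserved value are stated in
# the text, the rest are illustrative.
PACKING_MODES = {
    0: "mono",          # fully compatible with monoview coding
    1: "side_by_side",  # assumed value for side-by-side merging
    2: "top_bottom",    # assumed value for top-to-bottom merging
    3: "reserved",      # reserved for future extension
}

def coding_scheme(view_organizing_type, stereo_packing_mode=0):
    """Interpret the system-layer descriptor (field names assumed)."""
    if view_organizing_type == 0:
        # Zero covers both simulcast and inter-view prediction.
        return "simulcast or inter-view prediction"
    if view_organizing_type == 1:
        return "stereo packing (" + PACKING_MODES[stereo_packing_mode] + ")"
    raise ValueError("unknown view_organizing_type")
```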
From production to broadcasting, 3D TV usually goes through acquisition, encoding, multiplexing, modulation, demodulation, demultiplexing, decoding, and display. Among these, the most important process is real-time encoding of the HD stereoscopic video. In this section, we discuss AVS 3DV encoder chip design techniques. We also present a 3D TV broadcasting system and discuss potential applications of the AVS 3DV coding standard.
Fig. 6 shows the AVS 3D real-time encoding system for HD stereoscopic video. In the encoder, the left and right views are down-sampled and merged into a single frame. The down-sampling direction can be horizontal, to support the side-by-side merging approach, or vertical, to support the top-to-bottom merging approach. The syntax for these approaches is defined in section 2.3. The packed frame is fed into the AVS HD encoder, which generates the AVS bitstream. Finally, the AVS bitstream is packaged into transport-stream format for storage or transmission.
The computing power required by an SD/HD encoder is far beyond the capacity of a single central processing unit (CPU). Fortunately, multicore processors make real-time encoding possible. To fully exploit multicore processors, parallel encoding algorithms are highly desirable in the encoder design.
The motion estimation (ME) module generally takes more than 60% of the total encoding time, and this is a bottleneck for real-time compression. We therefore isolate the ME module for parallel processing. Because the ME module frequently needs to exchange data with other modules, it is not appropriate to use macroblock-level or finer-grained parallel processing for ME. We propose a frame-level parallel ME algorithm that exchanges ME information with other modules only after the ME of the whole frame is finished. Fig. 7 shows the architecture of the proposed dual-pipeline parallel scheme. The ME process is completely isolated and forms the first-level encoding process. The output of the ME module is used by the other modules in second-level encoding.
▼Table 1. Stereo packing mode in AVS 3DV coding
▲Figure 6. AVS 3DV encoder based on stereo packing.
The main obstacle in the proposed dual-pipeline parallel video coding scheme is the generation of the reference frame. Conventional video coding uses the reconstructed frame as the reference in the ME process, which means the frame-level ME process cannot start until the reconstructed frame has been obtained. This is problematic for frame-level parallel ME because the ME of the next frame could only begin after the current frame had been fully encoded. Fortunately, the original frame can be used as the ME reference in the encoder. Because the output of the ME module is only the motion-vector information, no error drifting is incurred by this approach; the reconstructed frames are still used in the residual calculation. Although this approach does not ensure that the motion vectors obtained in ME are optimal, it is a practical approach to frame-level parallel ME and strikes a good balance between computational complexity and coding performance.
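The two-level pipeline can be sketched with a worker thread and a queue. This is a structural illustration only: the ME function is a placeholder recording which reference it used, and all names are hypothetical.

```python
import queue
import threading

def motion_estimation(frame, ref_frame):
    # Placeholder: a real ME module searches ref_frame for the best
    # match; here we only record which reference was used.
    return {"frame": frame, "ref": ref_frame}

def encode_stream(frames):
    """First-level ME feeds second-level encoding through a queue, so ME
    of frame n+1 can overlap with second-level encoding of frame n."""
    mv_queue = queue.Queue(maxsize=1)

    def me_worker():
        for i, frame in enumerate(frames):
            # Use the previous ORIGINAL frame as the ME reference: ME
            # never waits for a reconstructed frame, which is what makes
            # frame-level parallelism possible.
            ref = frames[i - 1] if i > 0 else None
            mv_queue.put(motion_estimation(frame, ref))
        mv_queue.put(None)  # sentinel: no more frames

    threading.Thread(target=me_worker, daemon=True).start()

    coded = []
    while (item := mv_queue.get()) is not None:
        # Second-level encoding consumes ME output; residuals would
        # still be computed against reconstructed frames here.
        coded.append(item)
    return coded
```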
Rate control is important in a practical encoder design. Without rate control, there would be a mismatch between the source bit rate and the channel capacity, causing buffer underflow or overflow. To accurately control the rate of AVS 3DV coding, we propose a window-based scheme that tackles the interference between rate control and rate-distortion optimization (RDO). In this scheme, rate-quantization step (R-Q) and distortion-quantization step (D-Q) models are used to allocate the appropriate number of bits to each coding unit and to adjust the quantization parameter so that each unit is properly encoded with the allocated bits. With the proposed D-Q model, distortion can be estimated, and the optimal coding mode can be obtained by comparing the rate-distortion costs of the candidate modes. Scene switching is also considered in the window-based scheme because it may cause large bit-rate fluctuations.
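A minimal sketch of the allocate-then-quantize flow is given below. The proportional allocation and the first-order R-Q model form (R = k / Qstep) are assumptions chosen for illustration; the paper does not specify the exact model.

```python
def allocate_bits(window_budget, complexities):
    """Distribute a window's bit budget across coding units in
    proportion to their estimated complexity (assumed policy)."""
    total = sum(complexities)
    return [window_budget * c / total for c in complexities]

def qstep_from_rate(target_bits, k):
    """Assumed first-order R-Q model: R = k / Qstep, so the quantization
    step that hits the target rate is Qstep = k / R."""
    return k / target_bits
```

Given the budget for a window, each unit receives bits proportional to its complexity, and the R-Q model is inverted to pick the quantization step that should spend roughly that budget.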
Besides these encoder control techniques, synchronization between the two views in 3DV coding is also very important. To synchronize the two views, we use a clock-control mechanism and design a scheme based on the AVS system-layer specification. We define a transport-stream program map table (PMT) that creates a relationship between the program and its elements. Using this map, we attach a timestamp to each frame and synchronize the timestamps of the two views for synchronized 3DV display.
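The timestamp attachment can be sketched as pairing matching left/right frames under a shared clock. The 90 kHz clock is the MPEG-2 transport-stream convention; its use here, the frame rate, and the function name are assumptions for illustration.

```python
def timestamp_frames(left_frames, right_frames, fps=25, clock=90_000):
    """Attach one shared presentation timestamp to each left/right frame
    pair so the decoder can display the two views simultaneously."""
    step = clock // fps  # clock ticks per frame
    return [
        {"pts": i * step, "left": l, "right": r}
        for i, (l, r) in enumerate(zip(left_frames, right_frames))
    ]
```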
We have incorporated the AVS 3D real-time encoder into a live 3D broadcasting system. This system was the first end-to-end broadcasting system of its kind and an example of the practical application of AVS 3DV coding. It was used to broadcast 3D TV programs from the Guangzhou Asian Games in 2010 and successfully delivered an immersive entertainment experience.
Fig. 8 shows the architecture of the broadcasting system, including the content acquisition, encoding, and display modules. Two-channel high-definition serial digital interface (HD-SDI) audio and video signals are the input. These signals are first transmitted to the switching station for editing, where other program content, such as captions, can be integrated. Then, the uncompressed audio and video signals are fed into the 3DV processor, where the left and right views are adjusted to ensure the two views match exactly. The signals are then transmitted to the Quantel 3D broadcasting system, where the programs can be edited and reviewed. Finally, the signals are fed into the real-time AVS 3D encoder for compression. In the display system, the 3D program stream is input into the set-top box for decoding, and the decoded signal can be displayed by various 3D display systems, such as a 3D TV or projector.
▲Figure 7. Architecture of the dual-pipeline parallel video coding scheme.
▲Figure 8. AVS 3D live broadcasting system.
This system is an optimal, low-cost solution for a smooth transition from monoview to 3D TV broadcasting. It also highlights the great value of the AVS 3DV coding standard in practical applications such as 3D mobile phone TV, remote interviewing, video surveillance, and remote learning. The whole 3DV industry chain, from acquisition to display, will benefit from the development of AVS 3DV coding technology.
▲Figure 9. Horizontal downsampling for stereo packing.
▲Figure 10. Vertical downsampling for stereo packing.
▼Table 2.Performance of inter-view prediction and simulcast schemes
Inter-view prediction, enhanced motion-vector prediction, and enhanced direct mode are integrated into the AVS reference software RM52k_r2. The coding parameters are set according to the general test conditions [7]. The RD performance comparison in [8] is shown in Table 2. As Table 2 shows, inter-view prediction can reduce the bit rate by up to 30% for the same peak signal-to-noise ratio (PSNR). The reason for the superior coding performance is that the correlation between the two channels is exploited to reduce inter-view redundancy.
Side-by-side and top-to-bottom stereo packing make the coding process very flexible because various downsampling and upsampling algorithms can be used. The downsampling and upsampling methods greatly affect coding efficiency. Figs. 9 to 11 show the horizontal, vertical, and diamond downsampling algorithms, respectively.
▲Figure 11. Diamond downsampling for stereo packing.
For each of the down-sampling algorithms, several corresponding up-sampling algorithms can be used, including bilinear, cubic, and AVC-based interpolation. In Fig. 12, Down0, Down1, and Down2 denote horizontal, vertical, and diamond down-sampling, respectively, and Up0, Up1, and Up2 denote bilinear, cubic, and AVC-based up-sampling, respectively. Combinations of these algorithms are integrated into the AVS 3D stereo packing scheme, which is implemented in RM52k_r2. The RD performance is shown in Fig. 12. For the sequence Cafe, horizontal down-sampling with AVC-based up-sampling performs best. For the sequence PoznanHall, vertical down-sampling performs better than horizontal down-sampling. This suggests that the performance of a down-sampling method depends greatly on the properties of the video sequence. At low bit rates, where quantization causes most of the distortion, the stereo packing scheme achieves significant coding gain over the simulcast scheme and provides the best RD trade-off.
▲Figure 12. Performance of different sampling algorithms compared with the simulcast scheme.
In this paper, we have discussed the background, technical features, and applications of the AVS 3DV coding standard. AVS 3DV coding greatly advances the coding efficiency and backward compatibility of standard video coding technology. We have also introduced the two main features of AVS 3DV coding: inter-view prediction and stereo packing. The AVS 3D TV live broadcasting system shows that the adopted schemes provide great flexibility for effective use over broad application domains. In the future, more new applications will be developed on existing and future AVS 3DV coding technology.