Lab of Application Specific Instruction-set Processors, Beijing Institute of Technology, Beijing 100081, China
* The corresponding author, email: dake@bit.edu.cn
The size of a micro base station (mBS) in a 5th-generation (5G) Ultra-Dense Network (UDN) will be small when mmWave beamforming (BF) technology is used. The inter-site distance is small and the mBS deployment density is high, so the cost of an mBS is very important for operators. Since the baseband is one major cost component of an mBS and its implementation cost is so far unknown, in this paper we study the baseband design to expose the cost of different baseband solutions.
In this paper, the authors studied the baseband hardware system design and implementation for low-cost mBS. The authors also proposed feasible chip-level solutions for the baseband of a BS system with up to 128 antennas, and estimated their implementation cost.
There have been a number of studies on the deployment of UDN for 5G networks. In [1][2] the key requirements and characteristics of UDN are put forward, and the potential of using the mmWave band for transmission is demonstrated. In [3][4] technology issues including interference, mobility, backhauling and network parameters are discussed for UDN. In [5] the UDN small-cell BS distribution model and deployment scenarios are discussed, and system performance is evaluated through simulations. The standardization of 5G UDN is an ongoing topic in 3GPP, with many technical issues [6][7] still under discussion and evaluation. The simulation results so far seem sufficient for early standardization, yet details on implementation constraints are lacking. To obtain implementation cost information, we therefore draw on the cited publications and summarize the requirements that serve as the starting points of our research:
1) Supporting data rates of up to 10 Gb/s in urban and suburban environments, with 95% availability of user experience. The access-network latency is on the order of 1 ms.
2) Covering scenarios of indoor and outdoor deployment, with distances between access nodes ranging from a few meters up to a hundred meters using low-cost CMOS power amplifiers.
3) Realizing increased capacity, energy efficiency, and better spectrum efficiency.
To meet the UDN requirements, it is commonly recognized that the deployment of UDN should incorporate small cells into a macro cell, similar to the heterogeneous cellular network (HetNet) [8]. BF for massive multiple-input multiple-output (MIMO) systems is a key technology to compensate for the high propagation loss in the mmWave band. To obtain sufficient spectrum resources for high-speed data transmission, very high bandwidth and capacity are needed from the Application Specific Integrated Circuits / Application Specific Instruction-set Processors (ASICs/ASIPs) used for baseband signal processing. This indicates that a powerful baseband system is required for the UDN mBS to process signals from possibly hundreds of antennas and multiple parallel broadband data streams, which induces a heavy computing load.
Compared with the cost of the radio frequency (RF) module, analog baseband, and power amplifier [9], the cost of the ASIC/ASIP digital baseband is the most uncertain in an mBS system. The function flow and implementation experience of 5G UDN baseband will be rather different from those of 4G. Thus, low-cost baseband design and implementation is a great challenge. It includes the selection of feasible baseband algorithms, the performance and cost trade-off among different BF architectures, and chip-level partitioning based on the physical constraints of power and pins. So far, we have not found publications on practical UDN BS implementation or related discussions on cost or feasibility. Many algorithms exist for 5G, e.g., BF algorithms for multi-user massive MIMO, yet they have not been specifically implemented for UDN.
The goal of this paper is to provide a guiding reference, both for 5G communication researchers to select baseband algorithms of reasonable performance and implementation complexity for UDN, and for system hardware integrated circuit (IC) designers to understand critical implementation challenges and feasible solutions. The research scope is the baseband feasibility study and architectural analysis for UDN mBS based on mmWave massive BF, with different numbers of antennas, different BF scenarios, and implementations of different baseband algorithms. The RF frequency is not the focus of the research.
The main contributions of this paper are:
1) We specified hardware (HW) structures of UDN baseband through algorithm analyses and selections. To our knowledge, this paper is the first to provide a comprehensive study on the design and implementation of mmWave UDN baseband HW with performance, power consumption, and cost analyses.
2) Specifically, we identified that the zero-forcing (ZF) algorithm is the reasonable BF estimation algorithm for UDN BS employing full-digital BF, under the defined scope of the paper.
3) Furthermore, different implementation schemes (Low-Voltage Differential Signaling (LVDS) I/O or high-speed I/O with serializer/deserializer (SerDes), with or without ventilation) are proposed.
4) Finally, we studied hybrid BF architectures. Taking the 128-antenna massive MIMO system as an example, we demonstrated that the 4×32 hybrid architecture (i.e., the number of RF chains is 32) is the reasonable one among the proposed solutions under low-cost consideration.
In the next section, we introduce the method of this research, including the constraints of hardware implementation and the design flow based on power and cost analysis. The design flow can be used as a method for further research. Section 3 extracts the requirement specification of the UDN mBS; different BF scenarios are collected and the baseband functions are specified following the listed scenarios. In Section 4, feasible algorithms are selected based on performance, complexity, and power analyses. In Section 5, we propose the architecture, discuss implementation issues, and analyze the implementation cost. Section 6 concludes the paper.
The research flow in this paper is illustrated in Figure 1. We start from the definitions of the BS and the possible associated User equipment (UE). Following the definitions, there are mainly 3 parts: implementation constraints modeling, functional design, and iterations for chip design.
Fig. 1 Research flow
To guide the chip design, implementation modeling for UDN baseband is studied first, including modeling of silicon cost and power consumption. Following the system requirement specification, the transceiver function flow is designed. Early-stage hardware design is then carried out in an iterative process, scheduled into 3 iterations. In each iteration, we conduct power and/or cost analysis to discuss the design feasibility. The analyses are conducted first on the algorithm level and then on the hardware level, to speed up design iterations. Details are given in the rest of this section.
In this paper, the time division duplex (TDD) UDN mBS and UE are defined as follows (TDD is chosen for spectrum efficiency):
UDN mBSs: indoor or outdoor access points providing high-speed data services for users with low mobility (less than 3 km/h), with an inter-site distance of not more than 100 meters;
UEs: terminals in the UDN (e.g., mobile phones) with a single isotropic antenna.
For hardware design and implementation, silicon cost and chip power consumption are the main constraints to be considered. Silicon cost and power models are given as follows. Note that some parameters in our models come from engineering experience, and the models are high-level with a limited number of parameters. For early prediction of future implementations, we consider the models complete and adequate.
Silicon cost: As the density of BS deployment becomes much higher for 5G UDN, the BS cost is a sensitive factor. In general, the baseband silicon cost is one of the major parts of the implementation cost.
The cost of the baseband chip-set includes the silicon cost measured by silicon die size (C_silicon), the packaging cost (C_packaging), and the share of the NRE (non-recurring engineering) cost, mainly comprising the design cost (C_design) and the mask cost (C_mask) of a silicon wafer, distributed over each chip:
IP purchase cost (the I/O cost) is trivial and not counted for baseband.
Packaging cost is related to the number of pins of a chip (N_pin). Here we choose 1 US cent per pair of pins of a high-quality plastic package for cost estimation.
Silicon cost can be estimated by:
where C_wafer is the wafer cost under a given silicon technology; y is the yield; A_wafer and A_chip are the wafer area and chip area, respectively. For 16 nm technology, a 12-inch (70686 mm²) wafer can cost about 4800 US$ in 2018 [9].
Mask cost is roughly 3 million US dollars for a 12-inch FinFET wafer. Total design cost of the baseband chip-set can be up to 20 million US dollars. When the production volume reaches 10 million chips per year and the yield is 80%, the share of the mask cost is 0.3 US$ per chip, and the share of the design cost is 2 US$ per chip-set. Finally, the implementation cost of a single chip and of the chip-set is:
where C_chip is the cost per chip without counting NRE; N_pin is the pin count of the chip; A_chip is the silicon area of the chip; N_chip is the number of chips of the same type in the chip-set.
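As an illustration of how the cost terms above combine, the following Python sketch assumes the simplest reading of the model: silicon cost is the wafer cost divided by the number of good dies (wafer area over chip area, times yield, with edge loss ignored), packaging costs 0.5 US cent per pin (1 cent per pin pair), and the NRE shares are 0.3 US$ per chip (mask) and 2 US$ per chip-set (design). The function names and the 50 mm², 1000-pin example chip are hypothetical and are not taken from the paper's equations.

```python
# Constants quoted in the text above.
WAFER_COST_USD = 4800.0    # 12-inch wafer, 16 nm
WAFER_AREA_MM2 = 70686.0   # 12-inch wafer area
YIELD = 0.8
COST_PER_PIN_USD = 0.005   # 1 US cent per pair of pins
MASK_SHARE_USD = 0.3       # mask cost share per chip
DESIGN_SHARE_USD = 2.0     # design cost share per chip-set


def silicon_cost(chip_area_mm2):
    """ASSUMPTION: wafer cost divided by the number of good dies (no edge loss)."""
    good_dies = YIELD * WAFER_AREA_MM2 / chip_area_mm2
    return WAFER_COST_USD / good_dies


def chip_cost(chip_area_mm2, n_pins):
    """Cost of one chip without NRE: silicon plus packaging."""
    return silicon_cost(chip_area_mm2) + COST_PER_PIN_USD * n_pins


def chipset_cost(chip_types):
    """chip_types: iterable of (chip_area_mm2, n_pins, n_chips) per chip type."""
    total = DESIGN_SHARE_USD
    for area, pins, count in chip_types:
        total += count * (chip_cost(area, pins) + MASK_SHARE_USD)
    return total


# Hypothetical chip-set: two identical 50 mm^2 chips with 1000 pins each.
example = chipset_cost([(50.0, 1000, 2)])
print(f"chip: {chip_cost(50.0, 1000):.2f} US$, chip-set: {example:.2f} US$")
```

With these figures the packaging term (5 US$ for 1000 pins) is comparable to the silicon term for a mid-sized die, which is one reason the pin count matters for cost as well as for power.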
Power consumption: Without fan-cooling, the power consumption of each plastic-packaged chip is within the limit of 4 W according to the normal package selection rules from package providers, i.e., for each single chip, P_chip ≤ 4 W. For solutions that allow fan-cooling, the single-chip power can be up to 10 W.
Power consumption of the baseband chip-set is determined by the power of logic and memory modules (P_logic&memory), the power of pins (P_pins), and the power of all the overhead such as control, wiring and static power (represented by the overhead factor k in the equation) [10]:
Commercial power estimation tools are based on Register Transfer Level (RTL) design. For early estimation without RTL coding, P_logic&memory can be estimated according to the complexities of the baseband algorithms. It is thus replaced by the algorithm-level power P_alg in our model. According to the power modeling method demonstrated in our previous work [10], P_alg is estimated based on the basic operation counts of the algorithm, the power of each kind of operation calibrated with data precision, the clock frequency, and the required latency for running the algorithm. The formula for a specific algorithm a is given by:
where N_pd(i) is the architectural parallelization constrained by computing latency and computing complexity (its derivation is detailed in section 4.1); P_com(i) is the power consumption of each kind of arithmetic operation (calibrated under a given silicon technology and data precision at clock frequency F); P_mem is the average power consumption per memory cell, including memory peripheral power; O_mem is the number of memory cells used by the algorithm.
Pin power is given by:
where P_normal is the power of full-swing output pins (including functional and non-functional pins); P_IO is the power of I/O pins (high-speed LVDS or high-speed I/O with SerDes) including driver and receiver (the receiver pin power is only for voltage level conversion, and is included in the power modeling of its driving pin). Non-functional output pins include host, clock, reset, and joint test action group (JTAG) pins, etc.
The first iteration in the design process (denoted by ① in Figure 1) is for the complexity and power analyses of a set of baseband algorithms. The inputs are the popular algorithms collected for UDN baseband; the output is the early result of algorithm-level power analysis. The goal is to select feasible algorithms and remove algorithms with excessive complexity.
The second iteration (denoted by ②) is the analysis of baseband architectures. The inputs are different architectures and the algorithms to be implemented on them, and the output is the chip-level implementation proposals with cost analysis results. The goal is to provide architectural cost estimations so that designers can make comparisons and cost-performance trade-offs. In this stage, we first propose possible baseband architectures for full-digital and hybrid BF. Then we perform hardware-level power analysis to determine the minimum required number of chips for each part. Chip partition and pin allocation are performed based on the analysis. Implementation costs of the proposed chips after partition are then analyzed.
The third iteration (denoted by ③) is the analysis of different BS configurations. The inputs are the solutions obtained from the previous iteration; the output is the cost versus performance of these solutions. Performance figures of the algorithms/architectures are collected from published references.
Table I Requirements and configurations of UDN mBS
Table II Alternative schemes and BF control scenarios
Table III BF control scenarios: definition and specification
The requirements and configurations of UDN mBS are summarized in Table I. The mBS is assumed to have an M-antenna array (32 ≤ M ≤ 160 for low cost and small array size) providing Space Division Multiple Access (SDMA) for no more than K = M/10 reception zones [11] (a zone is a service area covered by a beam) through beamforming. Millimeter wave bands will be used to offer sufficient bandwidth.
For the TDD system, channel estimation is performed on the BS side based on orthogonal pilots transmitted by all terminals on the uplink, during a specified coherence interval. Through TDD reciprocity, the uplink channel state information (CSI) is used for the downlink. For a BF-based communication system, there can be different control scenarios for the beam and channel estimation, based on the selected combination of pilot allocation scheme and multi-user estimation scheme, as listed in Table II. We therefore have to supply a programmable solution. The definitions and specifications of the 4 scenarios (S1~S4) are listed in Table III. Available BF algorithms corresponding to each of the scenarios will be discussed in section 4.3. Among the 4 scenarios, S4 has the highest computing complexity; thus, in the following analyses, we focus on S4 to obtain the performance constraints.
In this section we look into the function flow of UDN mBS. There are 2 fundamental approaches for the BF implementation: full-digital and hybrid. The function flow of UDN mBS with full-digital BF architecture is illustrated in Figure 2(a).
As shown in Figure 2(a), baseband algorithms are categorized into 3 independent groups: 1) BF Receiver (BFR) algorithms at the upper left of the figure; 2) User Transceiver and Estimator (UTE) algorithms for each beam, in the middle of the figure; 3) BF transmitter (BFT) algorithms at the bottom of the figure.
The BFR flow chart is shown in Figure 3. Starting from the received RF analog signal, a symbol consisting of I and Q components is received by each of the M antennas and is then fed into Analog-Digital Converters (ADCs). Cyclic Prefix (CP) removal, signal decimation, rotation, and band-pass filtering are then performed in the “Filter” module in the figure, followed by Fast Fourier Transforms (FFTs) for orthogonal frequency division multiplexing (OFDM) demodulation. After FFT, pilots for channel estimation are extracted as p subcarriers. Then one set of M×K×4096 BF weights W is applied on each antenna and each subcarrier, to separate beams into K reception zones. Weights are estimated through typical MIMO BF algorithms such as zero-forcing, maximum ratio combining, and singular value decomposition (SVD) precoding (using the channel matrix H from BS channel estimation). Weights can also be obtained from codebook-based beam selection (based on time-domain pilots or the beam indicator fed back from the UE), corresponding to the different schemes listed in Table III. Estimation algorithms will be discussed in section 4.3.
Fig. 2 Function flow of the UDN mBS (a) full-digital BF (b) N_a×N_c hybrid BF (transmitter)
Fig. 3 BF receiver (BFR) flow
Fig. 4 User transceiver and estimator (UTE) flow
Fig. 5 BF transmitter (BFT) flow
A UTE consists of three parts: the beam estimator, the receiver and the transmitter.
The UTE can be divided into 2 sub-flows: the single-user flow and the multi-user flow. In the single-user flow, the transceiver for each reception zone serves one user and offers it the maximum bandwidth. In the multi-user flow for beam estimation and equalization, users are separated in either the time domain (a symbol chain is assigned to a user) or the frequency domain (a block of subcarriers is assigned to a user). The complexity of the 2 kinds of flows could be slightly different, and it has been much discussed in previous studies [12]. For simplicity, we only estimate the transceiver cost for one user using the full beam bandwidth. The multi-user flow illustrated in Figure 4 is only for functional specification.
Following user separation, in the uplink/receiver flow of the UTE, demapping, de-interleaving, forward error correction (FEC) and cyclic redundancy check (CRC) are performed successively. In the downlink flow, the bit stream to be transmitted is provided by the media access control (MAC) interface. CRC coding, channel coding (CC) and x-ary quadrature amplitude modulation (xQAM) are included [13].
In the UTE, channel estimation, weight estimation and p-to-4096 interpolation are performed in parallel. The generated M×K×4096 weight matrix is sent to the BFR and BFT modules.
Figure 5 shows the BFT flow. Signals for each transmit zone are weighted by the zone's own BF weights. The weighted I and Q signals for each transmit zone are then summed and sent to each antenna after passing through M IFFTs, digital pre-distortion (DPD), 2 (I and Q) × M symbol shaping filters, and DACs, respectively.
When M is 100 or more, the baseband power consumption using full-digital BF can become too high. Figure 2(b) shows the N_a×N_c hybrid BF transmitter architecture [14] (a similar architecture can be applied to the BF receiver), which can relax baseband complexity with acceptable performance degradation, where N_c stands for the number of DACs/RF chains connected to the digital baseband, and N_a stands for the number of analog RF BF chains. The complexity reduction benefits for the digital baseband are the relaxed digital hardware for BF and fewer digital I/O pins, as pins can consume considerable, even dominant, power when the I/O bandwidth is high. The digital baseband function of the hybrid architecture is equivalent to that of an N_c-antenna full-digital baseband.
In this section, we discuss the selection of feasible BF estimation algorithms that can possibly be implemented on an mBS. The analyses of algorithms are based on complexity, algorithm-level power estimation, and performance figures from references.
Before conducting the algorithm analysis, we first plan and select the transmission parameters, which are dedicated to UDN hotspots operating above 25 GHz with short cell range and walking-speed users (e.g., less than 3 km/h). The coherence time is τ_c = 10 ms. The bandwidth per beam is BW = 250 MHz, achieving 1 Gbps per beam with 16QAM modulation. As BW is less than 1% of the carrier frequency, phase compensation for BF is not required. The length of the FFT for OFDM modulation is 4096, and the subcarrier spacing is 62.5 kHz. The cyclic prefix size is set to 704, i.e., the total symbol size with CP is 4096 + 704 = 4800.
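The transmission parameters above fix several derived quantities that are used later for latency allocation. The short Python sketch below works them out. The sampling rate, occupied subcarrier count and symbols-per-coherence-interval figures are derived here for illustration and are not quoted in the text. The 1 Gbps figure is read as a raw rate (bandwidth times 4 bits per 16QAM symbol, ignoring CP and coding overhead), which is an assumption.

```python
FFT_SIZE = 4096
SUBCARRIER_SPACING_HZ = 62.5e3
CP_SAMPLES = 704
BEAM_BANDWIDTH_HZ = 250e6
BITS_PER_16QAM_SYMBOL = 4
COHERENCE_TIME_S = 10e-3

# Derived quantities (not stated explicitly in the text).
sample_rate_hz = FFT_SIZE * SUBCARRIER_SPACING_HZ            # baseband sampling rate
useful_symbol_s = 1.0 / SUBCARRIER_SPACING_HZ                # OFDM symbol without CP
total_symbol_s = (FFT_SIZE + CP_SAMPLES) / sample_rate_hz    # OFDM symbol with CP
occupied_subcarriers = int(BEAM_BANDWIDTH_HZ / SUBCARRIER_SPACING_HZ)
cp_overhead = CP_SAMPLES / (FFT_SIZE + CP_SAMPLES)

# ASSUMPTION: raw rate = bandwidth x bits per symbol, no CP or coding overhead.
raw_rate_bps = BEAM_BANDWIDTH_HZ * BITS_PER_16QAM_SYMBOL
symbols_per_coherence = int(COHERENCE_TIME_S / total_symbol_s)

print(f"sampling rate     : {sample_rate_hz / 1e6:.0f} MHz")
print(f"symbol (no CP)    : {useful_symbol_s * 1e6:.2f} us")
print(f"symbol (with CP)  : {total_symbol_s * 1e6:.2f} us")
print(f"occupied carriers : {occupied_subcarriers} of {FFT_SIZE}")
print(f"raw rate per beam : {raw_rate_bps / 1e9:.1f} Gbps")
```

The 16 μs symbol duration without CP is the value used below as the latency budget for per-symbol algorithms such as filtering and FFT.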
Power estimation in the following subsections is carried out for 16 nm CMOS technology, whose cost will be reasonable by year 2018 according to the prediction of ITRS [15]. The FinFET-based (16 nm silicon technology) power values of adders, multipliers, D flip-flops and SRAM are taken from recent references and scaled to the corresponding values at 0.6 V and 500 MHz, as shown in Table IV.
The main baseband algorithms for BFR, UTE and BFT, with their specifications, parameters and data precision, are listed in Table V. Baseband algorithms can be implemented with various degrees of parallelism. If the latency of a baseband algorithm to be implemented is too high, architecture optimization for a higher parallel degree is required to reduce the computing latency. When latency is reduced in this way, power consumption increases if all other IC parameters are kept constant. The conceptual relationship between parallel degree (N_pd), power (P_alg) and algorithm latency (T_L) is illustrated in Figure 6.
Based on this principle, power estimation for baseband algorithms proceeds in the following steps:
1) Algorithm complexity analysis. The computing complexity O_com(i) is obtained first; it is the count of each kind of arithmetic operation when the algorithm is executed once. It is then regarded as the number of clock cycles required for that arithmetic operation assuming no parallel execution:
2) Algorithm latency specification. The maximum latency T_L of each baseband algorithm is specified such that the total uplink/downlink latency requirements are satisfied. Under the clock frequency F, the maximum latency in cycle counts is obtained by:
3) Parallel degree specification. According to 1) and 2), the parallel degree is derived by:
4) Power estimation under the specified parallelism. With N_pd and the parameters in Table IV, power is estimated using equation (5).
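The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the cycle budget is taken as T_L × F, the parallel degree as the operation count divided by that budget and rounded up, and the algorithm power as the sum over operation types of N_pd(i) × P_com(i) plus P_mem × O_mem, following the symbol definitions given for the power model. The operation counts and per-unit power values in the example are placeholders, not the Table IV or Table VI figures.

```python
import math


def max_latency_cycles(latency_s, clock_hz):
    """Step 2: latency budget expressed in clock cycles."""
    return int(round(latency_s * clock_hz))


def parallel_degree(op_count, latency_s, clock_hz):
    """Step 3 (ASSUMED form): units needed to finish op_count ops within the budget."""
    return math.ceil(op_count / max_latency_cycles(latency_s, clock_hz))


def algorithm_power_w(op_counts, unit_power_w, latency_s, clock_hz,
                      mem_cells, mem_power_w):
    """Step 4 (ASSUMED form): sum of N_pd(i) * P_com(i) plus memory power."""
    compute = sum(
        parallel_degree(count, latency_s, clock_hz) * unit_power_w[op]
        for op, count in op_counts.items()
    )
    return compute + mem_cells * mem_power_w


# Placeholder example: one algorithm with a 16 us budget at 500 MHz.
CLOCK_HZ = 500e6
ops = {"mul": 100_000, "add": 8_000}      # hypothetical operation counts
unit_w = {"mul": 2e-3, "add": 1e-3}       # hypothetical power per unit (W)
p_alg = algorithm_power_w(ops, unit_w, 16e-6, CLOCK_HZ,
                          mem_cells=1000, mem_power_w=1e-6)
print("cycle budget:", max_latency_cycles(16e-6, CLOCK_HZ))
print("multiplier parallel degree:", parallel_degree(ops["mul"], 16e-6, CLOCK_HZ))
print(f"algorithm power: {p_alg * 1e3:.1f} mW")
```

Tightening the latency budget shrinks the cycle budget and therefore raises the parallel degree and the power, which is the relationship the text attributes to Figure 6.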
Fig. 6 Relationship of parallel degree, power and latency
The latency allocation described in 2) is specified for each algorithm listed in Table V. For channel and weight estimation, T_max is 5% of the 10 ms channel estimation interval, i.e., 500 μs, which is chosen to be much less than the margin of the coherence interval; for symbol processing algorithms like filtering and FFT, 16 μs is allocated, which is the OFDM symbol duration; for CRC, T_max can be 30 μs using the parallel architecture in [16]; for FEC, T_max can be as low as 10 μs for coding, and 256 μs for decoding when Turbo decoding with 2 iterations is processed with the 12-SISO (soft-in soft-out) architecture proposed in [17]. Latencies are specified for a total uplink latency of 369 μs and a total downlink latency of 104 μs; thus the 0.5 ms latency constraint is satisfied, including the BS MAC latency (another 0.5 ms is assigned to the computing of the UE).
Table IV Baseband IC parameters for analysis
The main algorithms for BFR functions include Finite Impulse Response (FIR) filtering, Fast Fourier Transform (FFT) and BF weighting. The algorithm complexity is analyzed in Table VI. “N_m” is the number of duplicated algorithm modules for multiple antennas/beams. The complexities of additions (O_add), multiplications (O_mul), register access measured by the size of D flip-flops (O_dff (bits)), and memory consumption (O_mem (bits)) are listed. O_mem for FFT includes buffers for one symbol and the corresponding twiddle factors; for BF weighting, the memory size is estimated for one BF weighting matrix. The parallel degrees “N_pd” are given following the complexities.
Table V Algorithm specification for UDN baseband
Table VI Algorithm complexity for BFR
The main algorithms and parameters for UTE include channel and weight estimation, interpolation, demapping, interleaving, FEC and CRC. The BF weight estimation algorithms considered in this paper are:
Full-digital BF estimation: For the estimation, four popular algorithms are considered: zero-forcing (ZF) [18], maximum-ratio transmission (MRT) [19], singular value decomposition (SVD) [20] precoding and codebook (CB)-based BF [21][22]. Algorithm principles and equations are as follows.
Zero-forcing: a linear BF algorithm which can suppress inter-beam interference by computing the weights for each beam while treating signals from all other beams as interference. It is realized by computing the pseudo-inverse of the channel matrix:
Maximum-ratio transmission: the MRT algorithm directly applies the channel transpose for BF weighting, thus enhancing the MIMO gain:
However, inter-beam interference is not considered, which could cause performance degradation.
Singular value decomposition: in the SVD algorithm, SVD is applied to the channel matrix H:
where U and V are unitary matrices, and Σ is a diagonal matrix consisting of the singular values of matrix H. By applying V at the transmitter and U^H at the receiver, downlink BF transmission is achieved, i.e., W_SVD = V. If perfect channel knowledge is assumed at the BS, the SVD algorithm achieves the theoretical limit of MIMO capacity.
Codebook-based: the CB-based algorithm utilizes a group of pre-defined BF weighting vectors w_i, each representing a specific beam direction. In the beam selection phase, uplink reference signals are sent by the UEs, and the BS selects the best beam direction for each user by scanning through all w_i and choosing the one that yields the maximum BF gain. Inter-beam interference is not considered.
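To make the four full-digital weight estimators concrete, the following NumPy sketch computes each of them for a random channel. It is an illustration under assumptions rather than the formulation used for the complexity tables: H is taken as a K×M downlink channel matrix (K single-antenna users, M BS antennas), ZF is the right pseudo-inverse H^H (H H^H)^-1, MRT is taken as the conjugate transpose of H, SVD uses the reduced decomposition, and the codebook is a DFT codebook, which the text does not specify. All function names are hypothetical.

```python
import numpy as np


def zf_weights(H):
    """Zero-forcing: right pseudo-inverse of the K x M channel matrix."""
    Hh = H.conj().T
    return Hh @ np.linalg.inv(H @ Hh)


def mrt_weights(H):
    """Maximum-ratio transmission (ASSUMED: conjugate transpose of H)."""
    return H.conj().T


def svd_weights(H):
    """Reduced SVD H = U diag(s) V^H; returns U, s and W_SVD = V."""
    U, s, Vh = np.linalg.svd(H, full_matrices=False)
    return U, s, Vh.conj().T


def dft_codebook(M):
    """ASSUMED codebook: M unit-norm DFT steering vectors as columns."""
    n = np.arange(M)
    return np.exp(-2j * np.pi * np.outer(n, n) / M) / np.sqrt(M)


def codebook_weights(H, codebook):
    """Per user, pick the codeword with the largest gain |h_k w_i|."""
    best = np.argmax(np.abs(H @ codebook), axis=1)
    return codebook[:, best], best


# Hypothetical example: K = 4 users, M = 32 antennas, i.i.d. Rayleigh channel.
rng = np.random.default_rng(0)
K, M = 4, 32
H = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)

W_zf = zf_weights(H)
W_mrt = mrt_weights(H)
U, s, W_svd = svd_weights(H)
W_cb, beam_index = codebook_weights(H, dft_codebook(M))

# H @ W_zf is the identity: each beam sees no inter-beam interference.
print("ZF residual:", np.abs(H @ W_zf - np.eye(K)).max())
```

ZF needs one K×K inversion per subcarrier group on top of the matrix products, whereas MRT and codebook selection need none. This is in line with the different complexity levels the text assigns to these algorithms.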
Hybrid BF estimation: For hybrid BF architectures, the weighting matrix is the product of the matrix W for the digital beamformer and the matrix F for the analog phase shifters (PS). The weights of F correspond only to the steering of beam directions, while the weights of W provide both beam steering and inter-beam interference cancellation. The design of a hybrid BF algorithm aims at solving the joint optimization problem for W and F.
There has been much investigation on the design of mmWave hybrid BF algorithms. A well-known example is spatially sparse precoding [23][24] based on basis pursuit or orthogonal matching pursuit (OMP). This method leverages the spatially sparse characteristic of realistic mmWave channels. However, the discussions in [23][24] are limited to single-user MIMO with large antenna arrays on both the BS side and the MS (mobile station) side. In our discussion, reference hybrid algorithms are chosen for multi-user MIMO with single-antenna terminals, to fit the scenario defined in this paper. Two typical algorithms are given as follows.
Phased-ZF: the phased-ZF (PZF) algorithm [25] is proposed for low-complexity multi-user hybrid BF. Its analog weighting matrix is generated from the phases φ(i,j) of the elements (i,j) of the conjugate-transposed multi-user aggregated downlink channel matrix:
The baseband weight matrix is estimated by applying ZF algorithm to the equivalent channel observed at the baseband:
where H_eq = HF is the equivalent channel at the digital baseband, H is the composite channel, and Λ is the diagonal matrix introduced for power normalization. Inter-beam interference is managed at the baseband.
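The two PZF steps can be sketched with NumPy as below. The sketch assumes a K×M downlink channel H with as many RF chains as users, an analog matrix whose entries are exp(jφ(i,j))/sqrt(M) with φ taken from H^H, and a Λ that normalizes each column of F·W to unit norm. The exact scaling and normalization in [25] may differ, so these choices are labeled assumptions.

```python
import numpy as np


def phased_zf(H):
    """Phased-ZF sketch: phase-only analog stage followed by baseband ZF."""
    K, M = H.shape
    # Analog stage (ASSUMED scaling 1/sqrt(M)): keep only the phases of H^H.
    F = np.exp(1j * np.angle(H.conj().T)) / np.sqrt(M)        # M x K
    # Equivalent channel seen by the digital baseband.
    H_eq = H @ F                                              # K x K
    # Baseband ZF on the equivalent channel.
    W_zf = H_eq.conj().T @ np.linalg.inv(H_eq @ H_eq.conj().T)
    # Power normalization (ASSUMED): unit norm per column of F @ W.
    Lam = np.diag(1.0 / np.linalg.norm(F @ W_zf, axis=0))
    return F, W_zf @ Lam


# Hypothetical example: K = 4 users, M = 32 antennas.
rng = np.random.default_rng(1)
K, M = 4, 32
H = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)

F, W = phased_zf(H)
effective = H @ F @ W          # diagonal: inter-beam interference is removed
off_diag = effective - np.diag(np.diag(effective))
print("largest off-diagonal magnitude:", np.abs(off_diag).max())
```

Only the K×K equivalent channel is inverted at the baseband, which is why this approach reduces digital complexity relative to full-digital ZF over all M antennas.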
Maximize-SE: in [26][27] an algorithm is proposed based on solving the optimization problem of maximizing the overall SE (spectral efficiency, denoted by R):
where W_t and F_t are the digital and analog weighting matrices at the transmitter, W_r and F_r are the digital and analog weighting matrices at the receiver, and P is the transmitter power budget. Inter-beam interference is taken into consideration when solving the problem.
For the performance comparison of the different estimation algorithms for UDN, we refer to [28][29][30], which provide performance/quality measurements. We will evaluate their performance together with the power and implementation cost issues in section 5.3.
The algorithm complexity for UTE is analyzed in Table VII. O_mem for the different weight estimation algorithms is estimated based on the sizes of the intermediate matrices in the computation. The scenarios available for the different BF algorithms are given in the last column. The complexities of (de)interleaving, CRC and FEC are not included in this table, since their basic operations are different from those in our power model. For these algorithms, we take reference values from [16][17][31] for estimation. Power values for different numbers of beams are shown in Figure 7.
Figure 7 also compares the power of the UTE with different BF estimation algorithms. The difference among the ZF, MRT and CB-based estimation algorithms is negligible (around 4%), while SVD estimation consumes up to 280% more power than ZF.
The main algorithms for BFT functions include transmitter BF weighting, inverse FFT (IFFT), digital pre-distortion (DPD) and shaping filtering. The algorithm complexity is analyzed in Table VIII. O_mem is estimated similarly to that in BFR. The power estimation result is shown in Figure 7.
Table VII Algorithm complexity for UTE
Table VIII Algorithm complexity for BFT
Quantitatively, we perform hardware-level analysis on power and pins for chip partition, taking M = 128 and K = 10 as an example. We choose 3 architectures for analysis: the full-digital architecture, the 2×64 hybrid architecture and the 4×32 hybrid architecture. The minimum N_c is chosen to be 32 to keep the performance loss acceptable. When N_c is reduced to 16, the performance loss becomes significant [27]. The performance issue will be detailed in subsection 5.3.
Chip-level partition for the UDN baseband ASIC/ASIP is illustrated in Figure 8. The numbers of BFR chips (N_BFR), UTE chips (N_UTE) and BFT chips (N_BFT) in this figure vary according to the digital/hybrid structure and to the I/O and cooling technologies. We provide 3 possible implementation schemes: A. LVDS I/O; B. high-speed I/O with SerDes and no fan-cooling; C. high-speed I/O with SerDes and low-cost fan-cooling.
Scheme A: LVDS I/O. Most of today’s LVDS technologies provide data transmission up to 2 Gbps [32][33]. When LVDS is implemented for the I/O of baseband chips, the number of I/O pins will be huge for massive MIMO. For example, for a full-digital baseband receiver, 128 (antenna number) × 10 (ADC resolution) × 2 (I/Q signal pair) × 2 (LVDS pin pair) = 5120 input pins in total are required at 600 Mbps (the ADC speed). Although the pin number could be halved by doubling the LVDS speed to 1.2 Gbps (transmitting the I and Q signals serially over the same pins), 2560 is still much larger than 1000, which is the maximum pin count of a Ball Grid Array (BGA) package of normal cost according to current experience. Therefore, the pin number is the boundary condition for chip partition.
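The pin arithmetic in the example above can be reproduced with a few lines of Python. The function name is hypothetical, and the `iq_lanes` parameter is introduced here only to switch between separate I and Q lanes at 600 Mbps and one shared lane at 1.2 Gbps.

```python
MAX_BGA_PINS = 1000   # maximum pin count of a normal-cost BGA package (from the text)


def lvds_adc_input_pins(n_antennas, adc_bits=10, iq_lanes=2, pins_per_pair=2):
    """Input pins needed to bring all ADC samples into the digital baseband."""
    return n_antennas * adc_bits * iq_lanes * pins_per_pair


pins_600mbps = lvds_adc_input_pins(128, iq_lanes=2)    # separate I and Q lanes
pins_1200mbps = lvds_adc_input_pins(128, iq_lanes=1)   # I and Q serialized on one lane

print("pins at 600 Mbps :", pins_600mbps)
print("pins at 1.2 Gbps :", pins_1200mbps)
print("fits one package :", pins_1200mbps <= MAX_BGA_PINS)
```

Even before output, power and control pins are counted, the ADC inputs alone exceed one package, so the receiver has to be split across several chips.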
Fig. 7 Algorithm level power estimation in each part
Fig. 8 Chip partition
We demonstrate the chip partition scheme based on pin constraints, taking the partitioning of BFR chips as an example. The pin constraint for BFR chips is given by:
Pin_lvds includes the input LVDS pins Pin_lvds(in), which consist of the pins from the ADCs (Pin_adc) and the pins for the weight matrix (Pin_weights) (1 pair of weight pins for each antenna is sufficient, according to the size of W and the transmission time constraint). Pin_adc is equal to 2 (LVDS pin pair) × 10 (data width from ADC) × M_BFR, where M_BFR is the number of antenna ports of each BFR chip. The output LVDS pins serve K = 10 users with data width W_data = 16 bit, i.e., Pin_lvds(out) = 2×16×10 = 320. Pin_power&ground = 200 is the number of power and ground pins; 200 pins are sufficient for a chip with 4 W power under 0.6 V supply voltage running at 500 MHz. Pin_non-functional = 16 is the number of non-functional pins, including host, clock, reset, and test pins, etc. Therefore, from (16) we get M_BFR ≤ 20, i.e., the maximum number of Rx antennas attached to each BFR chip is 20. Otherwise, the total pin count would exceed 1000. Thus the minimum number of BFR chips for the full-digital, 2×64, and 4×32 architectures is 7, 4 and 2, respectively.
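The following Python sketch tallies the pin budget of one BFR chip and derives the chip counts quoted above. It takes the stated limit of 20 antenna ports per chip as given. The weight-pin term (one LVDS pair, i.e., 2 pins, per antenna port) is an assumption about how that term enters the budget, so the tally is indicative only. Function names are hypothetical.

```python
import math

MAX_PINS = 1000
PORTS_PER_BFR_CHIP = 20   # limit stated in the text


def bfr_pin_count(m_bfr, k_users=10, adc_bits=10, out_width=16,
                  power_ground=200, non_functional=16):
    """Indicative pin tally for one BFR chip under Scheme A."""
    pin_adc = 2 * adc_bits * m_bfr        # 2 (LVDS pair) x 10 bits x ports
    pin_weights = 2 * m_bfr               # ASSUMPTION: one LVDS pair per port
    pin_out = 2 * out_width * k_users     # 2 x 16 x 10 = 320
    return pin_adc + pin_weights + pin_out + power_ground + non_functional


def min_bfr_chips(n_ports, ports_per_chip=PORTS_PER_BFR_CHIP):
    """Chips needed to serve n_ports digital antenna ports."""
    return math.ceil(n_ports / ports_per_chip)


print("pins at 20 ports:", bfr_pin_count(20))
for name, n_ports in [("full-digital", 128), ("2x64", 64), ("4x32", 32)]:
    print(f"{name:12s}: {min_bfr_chips(n_ports)} BFR chips")
```

For the hybrid architectures only the N_c digital ports (64 or 32) reach the BFR chips, which is why the chip count drops from 7 to 4 and 2.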
The minimum numbers of UTE chips and BFT chips are derived through the same method. Pin allocation and power estimation are shown in Table IX. The estimated power includes the logic and memory power (represented by the pre-estimated algorithm power), the overhead power, and the pin power. The power overhead factor k is set to 2.3 for all control, global wiring and static power (i.e., 70% overhead ratio). Pin power is estimated based on 6.9 mW per LVDS driver pair (for output LVDS signals) [32]; 2.0 mW per LVDS receiver pair (for input voltage level conversion) [33]; and 1.0 mW per normal pin (2.5 V full-swing pins driving a 5 pF load at 125 Mbps toggling).
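A rough Python sketch of how these figures combine into a per-chip power estimate is given below. It rests on two assumptions that should be read as such: the overhead is modeled as k times the algorithm power, so that logic, memory and overhead together amount to (1 + k) × P_alg (this reading makes k = 2.3 correspond to an overhead share of about 70%), and the pin power is a plain sum over pin types. The 500 mW algorithm power and the pin mix for a 20-port BFR chip are placeholders, not Table IX values.

```python
LVDS_DRIVER_PAIR_MW = 6.9     # per output LVDS pair [32]
LVDS_RECEIVER_PAIR_MW = 2.0   # per input LVDS pair [33]
NORMAL_PIN_MW = 1.0           # per full-swing pin
OVERHEAD_K = 2.3


def pin_power_mw(n_driver_pairs, n_receiver_pairs, n_normal_pins):
    """ASSUMED form: plain sum of per-pin-type power."""
    return (n_driver_pairs * LVDS_DRIVER_PAIR_MW
            + n_receiver_pairs * LVDS_RECEIVER_PAIR_MW
            + n_normal_pins * NORMAL_PIN_MW)


def chip_power_mw(p_alg_mw, p_pins_mw, k=OVERHEAD_K):
    """ASSUMED form: overhead is k times the algorithm power."""
    return (1.0 + k) * p_alg_mw + p_pins_mw


overhead_share = OVERHEAD_K / (1.0 + OVERHEAD_K)   # about 0.70

# Placeholder BFR chip with 20 antenna ports and K = 10 users.
m_bfr = 20
receiver_pairs = 10 * m_bfr + m_bfr   # ADC pairs plus one weight pair per port
driver_pairs = 16 * 10                # 16-bit output for each of 10 users
p_pins = pin_power_mw(driver_pairs, receiver_pairs, n_normal_pins=16)
p_chip = chip_power_mw(500.0, p_pins)  # 500 mW algorithm power is hypothetical

print(f"overhead share : {overhead_share:.2f}")
print(f"pin power      : {p_pins:.0f} mW")
print(f"chip power     : {p_chip:.0f} mW")
```

Under these assumptions the LVDS pins account for a large part of the chip power, which matches the observation that pins can dominate power when the I/O bandwidth is high.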
According to the analysis results in Table IX, pin counts are the main limitation, and the BFR chip approaches the power limit under the full-digital BF architecture. The full-digital architecture requires at least 7 (BFR) + 4 (UTE) + 10 (BFT) = 21 digital chips, with a total power consumption of 64.2 W; the 2×64 hybrid architecture requires 4+3+5 = 12 chips; and the 4×32 architecture requires 2+2+3 = 7 chips.
Scheme B: High-speed I/O with SerDes and no fan-cooling. High-speed I/O with SerDes circuits is capable of more than 10 Gbps data transmission with acceptable power and cost [34][35]. When high-speed I/O (HSIO) is implemented instead of LVDS, each data sample can be transferred serially through the I/O with the help of SerDes. Therefore, the pin number is no longer the limiting factor. Instead, the 4 W power constraint (with no fan-cooling) becomes the determining constraint for chip partition.
The power constraint is given by:
where Palg and Pnormal are estimated as in Scheme A. To estimate PHSIO, we first determine the number of I/O pins required for serial data transmission (with the maximum transmission rate limited to 10 Gbps). Then, according to [35], 10 mW per pin pair is chosen for the power of both Tx (including serializer) and Rx (including de-serializer).
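The HSIO estimate can be sketched as a two-step calculation: count the serial pairs needed for a given aggregate data rate, then multiply by the per-pair power. The example rate below (20 antenna ports × 10 bit × 0.5 GS/s) is a hypothetical input chosen only to exercise the functions; the sample rate in particular is not taken from the paper. The sketch also assumes that the 10 Gbps limit applies per pin pair.

```python
import math

HSIO_MAX_RATE_GBPS = 10.0       # maximum serial rate (assumed to be per pin pair)
HSIO_POWER_MW_PER_PAIR = 10.0   # Tx or Rx power per pin pair, from the text


def hsio_pairs(aggregate_rate_gbps: float) -> int:
    """Number of serial pin pairs needed to carry the aggregate rate."""
    return math.ceil(aggregate_rate_gbps / HSIO_MAX_RATE_GBPS)


def hsio_power_mw(aggregate_rate_gbps: float) -> float:
    """HSIO power in mW at one end of the link."""
    return hsio_pairs(aggregate_rate_gbps) * HSIO_POWER_MW_PER_PAIR


# Hypothetical input: 20 ports x 10 bit x 0.5 GS/s = 100 Gbps.
example_rate_gbps = 20 * 10 * 0.5
print(hsio_pairs(example_rate_gbps), hsio_power_mw(example_rate_gbps))
```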
Through the estimation, we obtain the minimum number of chips satisfying (17) and give their pin allocation schemes, as shown in Table IX. In this scheme, the full-digital architecture requires 4 + 2 + 4 = 10 chips, with a total power of 36.2 W; the 2×64 hybrid architecture requires 2 + 2 + 2 = 6 chips, with 20.8 W power; and the 4×32 architecture requires 1 + 2 + 1 = 4 chips, with 13.2 W power.
Scheme C: High-speed I/O with SerDes and low-cost fan-cooling. When low-cost fan-cooling can be added, the power consumption of a single chip can be as large as 10 W. Under this constraint, the same chip partitioning problem as in Scheme B is solved. The results are shown in Table IX. In this scheme, the full-digital architecture requires 2 + 1 + 2 = 5 digital chips, with a total power consumption of 35.5 W; the 2×64 hybrid architecture requires 1 + 1 + 1 = 3 chips, with 20.5 W power; and the 4×32 architecture requires 1 + 1 + 1 = 3 chips, with 13.2 W power consumption.
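The chip counts and total power figures quoted for Schemes B and C can be cross-checked against the per-chip power limits (4 W without a fan, 10 W with one). The sketch below only verifies that the average power per chip stays within the limit. That is a necessary condition rather than a sufficient one, because the per-chip breakdown of Table IX is not reproduced here.

```python
# (scheme, architecture, (BFR, UTE, BFT) chip counts, total power in W),
# all values as quoted in the text.
RESULTS = [
    ("B", "full-digital", (4, 2, 4), 36.2),
    ("B", "2x64 hybrid", (2, 2, 2), 20.8),
    ("B", "4x32 hybrid", (1, 2, 1), 13.2),
    ("C", "full-digital", (2, 1, 2), 35.5),
    ("C", "2x64 hybrid", (1, 1, 1), 20.5),
    ("C", "4x32 hybrid", (1, 1, 1), 13.2),
]

# Per-chip power limits from the text (W).
POWER_LIMIT_W = {"B": 4.0, "C": 10.0}


def summarize(results):
    """Return (scheme, arch, total chips, average W per chip, within limit)."""
    rows = []
    for scheme, arch, chips, total_w in results:
        n_chips = sum(chips)
        avg_w = total_w / n_chips
        rows.append((scheme, arch, n_chips, round(avg_w, 2),
                     avg_w <= POWER_LIMIT_W[scheme]))
    return rows


summary = summarize(RESULTS)
for row in summary:
    print(row)
```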
Table IX Pin and power analysis for each scheme
To analyze the cost of each chip, we first estimate the silicon area based on algorithm complexity. Then, based on the cost estimation model described in Section 2.3, the implementation cost of each chip and of the chipset is estimated.
Chip silicon area is estimated by adding up the area of the functional hardware modules (including logic and memory) Afunc and the area of overhead Aov. We assume 2.8K gates per multiplier, 1.6K gates per adder, 10 gates per DFF, and 0.5 gate per memory cell, and use 0.7 mm²/M-gate (including wiring overheads) to compute Afunc. The area cost of high-speed I/O circuits is chosen as 0.10 mm² per Tx and 0.18 mm² per Rx [35]. Aov is computed according to the following equation:
Here Raddr&control = 0.3 is the area overhead ratio of the addressing and control path, and Rwiring = 1.5 is the area overhead ratio of global and local interconnect wires. The estimation results are shown in Table X.
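The functional-area part of this estimate is a weighted gate count scaled by the area per million gates. The sketch below implements Afunc and the HSIO area with the coefficients quoted above. The overhead term Aov is deliberately left out because its equation is not reproduced in this text. The component counts in the example are hypothetical and serve only to exercise the functions.

```python
# Gate-count and area coefficients quoted in the text.
GATES_PER_MULTIPLIER = 2800
GATES_PER_ADDER = 1600
GATES_PER_DFF = 10
GATES_PER_MEMORY_CELL = 0.5
MM2_PER_MGATE = 0.7   # includes wiring overheads
HSIO_TX_MM2 = 0.10
HSIO_RX_MM2 = 0.18


def functional_area_mm2(multipliers: int, adders: int,
                        dffs: int, memory_cells: int) -> float:
    """Afunc: area of logic and memory, in mm^2."""
    gates = (multipliers * GATES_PER_MULTIPLIER
             + adders * GATES_PER_ADDER
             + dffs * GATES_PER_DFF
             + memory_cells * GATES_PER_MEMORY_CELL)
    return gates / 1e6 * MM2_PER_MGATE


def hsio_area_mm2(tx_count: int, rx_count: int) -> float:
    """Area of the high-speed I/O circuits, in mm^2."""
    return tx_count * HSIO_TX_MM2 + rx_count * HSIO_RX_MM2


# Hypothetical component counts, for illustration only.
a_func = functional_area_mm2(multipliers=1000, adders=1000,
                             dffs=100_000, memory_cells=1_000_000)
a_hsio = hsio_area_mm2(tx_count=4, rx_count=4)
print(f"Afunc = {a_func:.2f} mm^2, HSIO = {a_hsio:.2f} mm^2")
```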
The power and cost estimation results for different schemes and architectures are compared in Figure 9.
In this section we discuss the performance of the different BF specifications in Table X and consider the power and cost trade-off in selecting among them.
Firstly, for full-digital BF, ZF outperforms MRT and CB-based BF [28][29][30] and is nearly optimal as the number of MIMO antennas scales up [11]. On the other hand, as shown in Figure 7, the power of UTE using the ZF algorithm is 1/3 of that of SVD precoding for M = 128, yet comparable with the power of MRT and CB-based BF. Therefore, ZF is a reasonable reference algorithm for mBS with full-digital BF.
Secondly, for the proposed hybrid solutions of the 128-antenna system, we discuss the performance degradation compared to the full-digital solution. Performance results are obtained from [25][26]. The system parameters for evaluation and the performance loss are shown in Table XI.
Table X Cost analysis for single chips and the chip-set (US$)
As the third column of Table XI shows, even when Nc is small (less than M/8), the performance loss of hybrid BF can be small when ideal (infinite-resolution) analog phase shifters (PSs) are used. In practice, however, when finite-resolution PSs are considered, the performance loss is not negligible, as shown in the last column of Table XI. With 1-bit quantization, the performance loss is 32.4% when using the PZF algorithm with Nc = M/32 RF chains, and 22.9% when the maximize-SE algorithm is used with Nc ≈ M/16.
A problem arises in determining the minimum Nc that maintains acceptable performance. In practice, this problem is a trade-off between cost and performance. However, for ideal analog PSs, a theoretical answer is given by the following theorem [27] on the minimum number of RF chains needed to realize the optimal full-digital beamformer:
Theorem 1. For spatial multiplexing transmission in MIMO systems, i.e., K > 1, the optimal full-digital precoder can be realized using the hybrid structure if Nc ≥ 2K.
This theorem is proved by finding a generalized set of hybrid solutions that is numerically equivalent to Vopt (a weighting matrix for BF; it is the optimal full-digital precoder that maximizes the overall data rate subject to a total power constraint). In this work, the minimum Nc that satisfies the condition of the theorem is 2 × K = 20. This is why only the 2×64 and 4×32 hybrid architectures are studied in this paper, and architectures with smaller Nc (such as the 8×16 architecture) are not considered.
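The selection rule implied by Theorem 1 can be written as a simple filter over candidate architectures. The sketch below reads an "a×b" architecture as a antennas per RF chain and b RF chains, so that a × b = M = 128; this reading is inferred from the text, and the names used are illustrative.

```python
M = 128  # total antennas
K = 10   # simultaneously served users


def min_rf_chains(num_users: int) -> int:
    """Minimum Nc from Theorem 1 (ideal phase shifters): Nc >= 2K."""
    return 2 * num_users


# Candidate hybrid architectures: name -> (antennas per RF chain, RF chains Nc).
candidates = {"2x64": (2, 64), "4x32": (4, 32), "8x16": (8, 16)}

meets_theorem = {}
for name, (antennas_per_chain, nc) in candidates.items():
    # Every candidate must account for all M antennas.
    assert antennas_per_chain * nc == M
    meets_theorem[name] = nc >= min_rf_chains(K)

print(meets_theorem)  # {'2x64': True, '4x32': True, '8x16': False}
```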
Fig. 9 Analysis results comparison for (a) power and (b) cost
Table XI Hybrid BF algorithm performance evaluations
According to the discussion above, the performance of the hybrid BF algorithm can be optimal for the 4×32 hybrid architecture with ideal PSs; when 1-bit quantized PSs are considered, its performance loss is still lower than 22.9%, according to the reference results in Table XI. For a UDN mBS with 128 antennas, this performance loss can be acceptable. On the other hand, the baseband implementation cost is much lower than that of the full-digital BF architecture. The extra cost on each RF chain includes two extra pins for beam control, slightly higher design cost, and a few percent of extra silicon area. Compared with the significant cost reduction on baseband and ADC/DAC (roughly three times lower), the extra RF cost is negligible [36]. Thus we suggest the 4×32 hybrid architecture as the feasible solution for practical UDN baseband implementation.
Finally, the discussions are summarized as follows (see Table XII):
Table XII Summarized cost and performance comparisons
1. For full-digital BF, the ZF algorithm is a reasonable selection for mmWave UDN mBS: it yields near-upper-bound performance, while its power is up to 2.8x lower than that of SVD. Under high volume, the total baseband chipset comprises 21 chips at 159.23 US$ for the LVDS I/O scheme, or 5 chips at 78.52 US$ for the HSIO with fan-cooling scheme.
2. For hybrid BF architectures, we suggest the 4×32 architecture as the feasible low-cost solution for the UDN baseband, considering the trade-off between cost and performance. It requires 7 chips at an implementation cost of 47.30 US$ for the LVDS scheme, or 3 chips at 25.54 US$ for the HSIO with fan-cooling scheme (a 3.1–3.4x reduction compared to full-digital), under a volume assumption of 10 million chipsets.
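The quoted 3.1–3.4x reduction follows directly from the chipset costs listed in the two items above. The sketch below recomputes the ratios; the dictionary keys are illustrative labels for the two I/O schemes.

```python
# Chipset costs in US$ as quoted in the summary above.
FULL_DIGITAL_COST = {"LVDS": 159.23, "HSIO with fan-cooling": 78.52}
HYBRID_4X32_COST = {"LVDS": 47.30, "HSIO with fan-cooling": 25.54}

# Cost of full-digital divided by cost of the 4x32 hybrid, per scheme.
reduction = {
    scheme: FULL_DIGITAL_COST[scheme] / HYBRID_4X32_COST[scheme]
    for scheme in FULL_DIGITAL_COST
}

for scheme, ratio in reduction.items():
    print(f"{scheme}: {ratio:.1f}x")
# LVDS: 3.4x
# HSIO with fan-cooling: 3.1x
```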
In this paper, we studied the baseband ASIC/ASIP design for mBSs to be deployed under low-cost requirements in 5G UDN. Our study provides a comprehensive analysis and practical solutions for the mmWave UDN baseband ASIC/ASIP, covering its functional specification, algorithm selection, architecture analysis, and chip-level implementation proposals. For a 128-antenna BS system with 10 beamformed transceiver zones, we estimated the power and implementation cost of the baseband chipsets using full-digital, 2×64 hybrid, and 4×32 hybrid BF architectures combined with 3 different I/O and cooling schemes, and discussed the performance degradation. The results suggest that ZF is a good choice of algorithm for full-digital BF, and that 4×32 is a feasible hybrid BF architecture. We expect these results to facilitate the standardization and implementation of UDN with mmWave massive-MIMO BF in the 5G era. The proposed method can also be used for related research.
We thus propose that more research is needed on 4×32 mmWave BF systems, to further optimize the algorithm performance and analog/RF cost for low-cost 5G UDN implementation. Key challenges include low-power and low-cost RF design, hybrid algorithm implementation, reference signal design, interference control, and energy/spectral efficiency optimization.
The authors gratefully acknowledge financial support from the National High Technology Research and Development Program of China (863 Program), No. 2014AA01A705. We also thank Synopsys Co., Ltd. for the ASIP Designer tools used in this research.
[1] R. Baldemair, T. Irnich, K. Balachandran, et al., “Ultra-dense networks in millimeter-wave frequencies,” IEEE Commun. Mag., vol. 53, no. 1, pp. 202–208, Jan. 2015.
[2] M. Fallgren and B. Timus, “Scenarios, requirements and KPIs for 5G mobile and wireless system,” METIS Deliv. D, vol. 1, p. 1, 2013.
[3] H. Peng, Y. Xiao, Y. Ruyue, and Y. Yifei, “Ultra dense network: Challenges, enabling technologies and new trends,” China Commun., vol. 13, no. 2, pp. 30–40, Feb. 2016.
[4] S. S. Mwanje, J. Ali-Tolppa, and H. Sanneck, “On the limits of PCI auto configuration and reuse in 4G/5G ultra dense networks,” in 2015 11th International Conference on Network and Service Management (CNSM), 2015, pp. 92–98.
[5] S. Chen, X. Ji, C. Xing, Z. Fei, and H. Wang, “System-level performance evaluation of ultra-dense networks for 5G,” in TENCON 2015 - 2015 IEEE Region 10 Conference, 2015, pp. 1–4.
[6] 3GPP TR 36.872, “Technical Report - Small cell enhancements for E-UTRA and E-UTRAN - Physical layer aspects.” [Online]. Available: www.3gpp.org.
[7] 3GPP TR 36.897, “Technical Report - Study on elevation beamforming / Full-Dimension (FD) Multiple Input Multiple Output (MIMO) for LTE,” 2015. [Online]. Available: www.3gpp.org.
[8] A. Ghosh, N. Mangalvedhe, R. Ratasuk, et al., “Heterogeneous cellular networks: From theory to practice,” IEEE Commun. Mag., vol. 50, no. 6, pp. 54–64, Jun. 2012.
[9] H. Jia, et al., “Research on CMOS mm-wave circuits and systems for wireless communications,” China Commun., vol. 12, no. 5, pp. 1–13, 2015.
[10] “IC cost and price model,” 2016. [Online]. Available: http://www.icknowledge.com/products/costmodels.html.
[11] W. Wang, X. Li, D. Liu, Z. Cai, and C. Gong, “Multilevel power modeling of base station and its ICs,” China Commun., vol. 12, no. 5, pp. 22–33, May 2015.
[12] F. Rusek, D. Persson, E. G. Larsson, et al., “Scaling Up MIMO: Opportunities and Challenges with Very Large Arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–60, Jan. 2013.
[13] S. Shi, M. Schubert, and H. Boche, “Downlink MMSE Transceiver Optimization for Multiuser MIMO Systems: Duality and Sum-MSE Minimization,” IEEE Trans. Signal Process., vol. 55, no. 11, pp. 5436–5446, Nov. 2007.
[14] D. Liu, “Baseband ASIP design for SDR,” China Commun., vol. 12, no. 7, pp. 60–72, Jul. 2015.
[15] S. Han, C. I, Z. Xu, and C. Rowell, “Large-scale antenna systems with hybrid analog and digital beamforming for millimeter wave 5G,” IEEE Commun. Mag., vol. 53, no. 1, pp. 186–194, Jan. 2015.
[16] “ITRS 2013 report.” [Online]. Available: http://public.itrs.net/reports.html.
[17] Y. Huo, X. Li, W. Wang, and D. Liu, “High performance table-based architecture for parallel CRC calculation,” in The 21st IEEE International Workshop on Local and Metropolitan Area Networks, 2015, pp. 1–6.
[18] Z. Wu and D. Liu, “High-Throughput Trellis Processor for Multistandard FEC Decoding,” IEEE Trans. Very Large Scale Integr. Syst., vol. 23, no. 12, pp. 2757–2767, Dec. 2015.
[19] Q. H. Spencer, A. L. Swindlehurst, and M. Haardt, “Zero-Forcing Methods for Downlink Spatial Multiplexing in Multiuser MIMO Channels,” IEEE Trans. Signal Process., vol. 52, no. 2, pp. 461–471, Feb. 2004.
[20] H. Yang and T. L. Marzetta, “Performance of Conjugate and Zero-Forcing Beamforming in Large-Scale Antenna Systems,” IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 172–179, Feb. 2013.
[21] G. G. Raleigh and J. M. Cioffi, “Spatio-temporal coding for wireless communication,” IEEE Trans. Commun., vol. 46, no. 3, pp. 357–366, Mar. 1998.
[22] D. E. Berraki, S. M. D. Armour, and A. R. Nix, “Codebook based beamforming and multiuser scheduling scheme for mmWave outdoor cellular systems in the 28, 38 and 60GHz bands,” in 2014 IEEE Globecom Workshops (GC Wkshps), 2014, pp. 382–387.
[23] J. Singh and S. Ramakrishna, “On the Feasibility of Codebook-Based Beamforming in Millimeter Wave Systems With Multiple Antenna Arrays,” IEEE Trans. Wirel. Commun., vol. 14, no. 5, pp. 2670–2683, May 2015.
[24] O. El Ayach, S. Rajagopal, et al., “Spatially Sparse Precoding in Millimeter Wave MIMO Systems,” IEEE Trans. Wirel. Commun., vol. 13, no. 3, pp. 1499–1513, Mar. 2014.
[25] O. El Ayach, R. W. Heath, S. Abu-Surra, et al., “Low complexity precoding for large millimeter wave MIMO systems,” IEEE Int. Conf. Commun., pp. 3724–3729, 2012.
[26] L. Liang, W. Xu, and X. Dong, “Low-complexity hybrid precoding in massive multiuser MIMO systems,” IEEE Wirel. Commun. Lett., vol. 3, no. 6, pp. 653–656, 2014.
[27] F. Sohrabi and W. Yu, “Hybrid Digital and Analog Beamforming Design for Large-Scale Antenna Arrays,” vol. 4553, no. June 2015, pp. 1–13, 2016.
[28] F. Sohrabi and W. Yu, “Hybrid digital and analog beamforming design for large-scale MIMO systems,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, no. iii, pp. 2929–2933.
[29] E. G. Larsson, O. Edfors, F. Tufvesson, et al., “Massive MIMO for Next Generation Wireless Systems,” IEEE Commun. Mag., vol. 52, no. February, pp. 186–195, Apr. 2014.
[30] F. W. Vook, T. A. Thomas, and E. Visotsky, “Massive MIMO for mmWave systems,” in 2014 48th Asilomar Conference on Signals, Systems and Computers, 2014, pp. 820–824.
[31] F. W. Vook, A. Ghosh, and T. A. Thomas, “MIMO and beamforming solutions for 5G technology,” 2014 IEEE MTT-S Int. Microw. Symp., pp. 1–4, Jun. 2014.
[32] G. Wang, Y. Sun, J. R. Cavallaro, and Y. Guo, “High-throughput Contention-Free concurrent interleaver architecture for multi-standard turbo decoder,” in ASAP 2011 - 22nd IEEE Int Conf on Application-specific Systems, Architectures and Processors, 2011, pp. 113–121.
[33] J. Silva-Martinez, M. Nix, and M. E. Robinson, “Low-voltage low-power LVDS drivers,” IEEE J. Solid-State Circuits, vol. 40, no. 2, pp. 472–479, Feb. 2005.
[34] K. Kim, S. Hwang, J. Song, and C. Kim, “An 11.2-Gb/s LVDS Receiver With a Wide Input Range Comparator,” IEEE Trans. Very Large Scale Integr. Syst., vol. 22, no. 10, pp. 2156–2163, Oct. 2014.
[35] A. Nazemi, H. Maarefi, B. Çatlı, et al., “A 2.8 mW/Gb/s quad-channel 8.5–11.4 Gb/s quasi-digital transceiver in 28 nm CMOS,” 2013 Symp. VLSI Circuits, pp. 276–277, 2013.
[36] S. Yuan, L. Wu, Z. Wang, X. Zheng, et al., “A 70 mW 25 Gb/s Quarter-Rate SerDes Transmitter and Receiver Chipset With 40 dB of Equalization in 65 nm CMOS Technology,” IEEE Trans. Circuits Syst. I Regul. Pap., pp. 1–10, 2016.
[37] S. Emami, R. F. Wiser, E. Ali, et al., “A 60GHz CMOS phased-array transceiver pair for multi-Gb/s wireless communications,” Dig. Tech. Pap. - IEEE Int. Solid-State Circuits Conf., vol. 40, pp. 164–165, 2011.
[38] Q. Xie, X. Lin, Y. Wang, et al., “Performance Comparisons Between 7-nm FinFET and Conventional Bulk CMOS Standard Cell Libraries,” IEEE Trans. Circuits Syst. II Express Briefs, vol. 62, no. 8, pp. 761–765, Aug. 2015.
[39] P. M. Munson and J. G. Delgado-Frias, “A performance-power evaluation of FinFET flip-flops under process variations,” in 2011 IEEE 54th International Midwest Symposium on Circuits and Systems (MWSCAS), 2011, pp. 1–4.
[40] T. Fukuda, K. Kohara, T. Dozaka, et al., “A 7ns-access-time 25μW/MHz 128kb SRAM for low-power fast wake-up MCU in 65nm CMOS with 27fA/b retention current,” in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, vol. 3, pp. 236–237.