Reliability of NFV Using COTS Hardware
Li Mo
(CTO Group, ZTE Corporation, Richardson, TX 75080, USA)
This paper describes a study on the feasibility of using commercial off-the-shelf (COTS) hardware for telecom equipment. The study outlines the conditions under which COTS hardware can be utilized in a network function virtualization environment. The concept of silent-error probability is introduced to account for software errors and/or undetectable hardware failures, and it is included in both the theoretical work and the simulations. Silent failures are critical to overall system availability. Site-related issues arise from the combination of site maintenance and site failure. Site maintenance does not noticeably limit system availability unless there are also site failures. Because the theory becomes extremely involved when site failure is introduced, simulation is used to determine the impact of those factors that constitute the undesirable features of using COTS hardware.
Keywords: reliability; COTS hardware; NFV
1 Introduction
The use of commercial off-the-shelf (COTS) hardware in architectural frameworks such as the IP multimedia subsystem (IMS) and the evolved packet core (EPC) has drawn considerable attention in recent years [1]. However, some operators have legitimate concerns about the overall reliability of COTS hardware, including reduced mean time between failures (MTBF) and other undesirable attributes that are unfamiliar in the traditional telecom industry.
In previous reliability studies, such as [2], the focus has only been on hardware failures, which are characterized by the MTBF and the mean time to repair (MTTR). In this paper, the concept of silent error is introduced to account for failures, induced by software or hardware, that cannot be detected by the management system.
An analytical expression for the silent error is almost impossible to obtain. For hardware-related silent errors, the error rate depends on the hardware architecture and on the type of undetectable hardware failure. For software-related silent errors, the software architecture and coding practices need to be investigated. This variety of hardware and software issues is what makes a general analytical expression so difficult to obtain.
On the other hand, the silent-error probability is relatively easy to obtain from observational data. Any error that cannot be attributed to a known cause by the management system can be classified as a silent error. The probability of such an error is the number of errors without known causes divided by the total number of errors observed by the management system. Silent errors affect system availability in different ways depending on the scenario.
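As a minimal illustration of this estimate, the following Python sketch computes the silent-error probability from fault counts; the variable names and the counts themselves are invented placeholders, not data from this study.

# Counts below are invented placeholders, not data from this study.
total_errors = 1250              # all errors observed by the management system
errors_with_known_cause = 1175   # errors the management system could attribute

silent_errors = total_errors - errors_with_known_cause
s = silent_errors / total_errors  # silent-error probability

print(f"estimated silent-error probability s = {s:.3f}")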
In a typical system, a server running certain network functions has another dedicated server as backup. This is the normal master-slave (1+1 redundancy) configuration of telecom equipment.
The server with the network functions is called the master server, and the dedicated backup is called the slave server. A 1+1 redundancy scheme differs from a 1:1 load-sharing redundancy scheme in that the slave is dedicated to backing up the master. In a 1:1 load-sharing redundancy scheme, both servers run network functions and protect each other at the same time.
Here, we assume a single error for ease of discussion. In any protection scheme, system availability is not affected if the slave experiences a silent error and that error eventually becomes observable from the system's behavior. In this case, another slave is identified, and the master continues with its network functions. Before the new slave becomes fully functional, the system is less fault-tolerant.
If the master experiences a silent error, the data transmitted to the slave could be corrupted. In this case, system availability is affected when the error becomes observable in the system's behavior. When a silent error is detected in the master, both the master and the slave need time to recover. This recovery time is almost fixed in the network functions virtualisation (NFV) environment and equals the COTS MTTR. During this interval, the network functions are not available, and this is counted as downtime in availability calculations.
The MTBF of COTS hardware is shorter than that of typical telecom-grade hardware because COTS hardware has relaxed design criteria. The time needed to repair COTS hardware is not a random variable and is almost fixed in duration. The MTTR of COTS hardware is the mean time required to bring up a server so that it is ready to serve. With traditional telecom hardware, the time to repair is a random variable, and the MTTR is the mean of that random variable. Because manual intervention is normally required in the telecom environment, the MTTR of COTS hardware is usually assumed to be less than that of traditional telecom equipment.
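To make this trade-off concrete, the short Python sketch below compares single-server availability using the standard steady-state relation A = MTBF/(MTBF + MTTR); the parameter values are illustrative assumptions, not measurements from this study.

def availability(mtbf_hours, mttr_hours):
    # Steady-state availability of a repairable unit.
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative assumptions, not measured values:
# COTS: shorter MTBF but a short, essentially fixed MTTR (VM restart).
# Telecom-grade: longer MTBF but a longer, manual-repair MTTR.
a_cots = availability(mtbf_hours=5_000, mttr_hours=0.1)
a_telco = availability(mtbf_hours=50_000, mttr_hours=4.0)

print(f"COTS server availability:          {a_cots:.6f}")
print(f"Telecom-grade server availability: {a_telco:.6f}")

Under these assumed numbers, the much shorter repair time of a COTS server can offset its shorter MTBF, which is the intuition behind the rest of the analysis.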
The most obvious difference between COTS hardware and telecom-grade hardware is related to maintenance procedures and practices. With telecom-grade hardware, care is taken to minimize the effect of maintenance on system availability. With COTS hardware, maintenance is often done in a "cowboy" fashion; that is, reset first and ask questions later.
In this study, an analytical solution is proposed for the situation in which there are no site- or maintenance-related issues and one or two dedicated backup COTS servers are used in the NFV environment. An analytical solution is also proposed for the situation in which there is no site failure and only one dedicated backup server; here, site maintenance, which is unique to COTS hardware, is addressed. In order to study issues related to site failure, we constructed a simulator and observed system availability with one or two dedicated backup servers. The results show that COTS hardware, with all its undesirable features, still satisfies telecom requirements under reasonable conditions.
2 Overall System Availability
In the NFV environment, the reliability of the server part and of the network part can be analyzed separately. The server part provides the actual network functions, and the network part connects all the servers through the vSwitch (Fig. 1).
Overall system availability A_sys is the product of the availability of the server part of the system, A_S, and the availability of the network part of the system, A_N, so that A_sys = A_S × A_N.
Given that A_S < 1 and A_N < 1, the overall availability satisfies A_sys < A_S and A_sys < A_N.
▲Figure 1. Availability in the network and server parts of the system.
To improve availability in the network part, the 1+1 protection scheme is used (Fig. 1). It is possible for the vSwitch to span a long-distance transmission network so that multiple data centers can be connected.
The mechanisms in the server part that increase availability are not specified. In this study, it is assumed that one or two backup servers support one active server. Normally, if the active server is faulty, one of the backup servers takes over, and there is no loss of availability in the server part.
There is a significant difference between the NFV environment and an environment comprising traditional telecom equipment in terms of the time needed to recover from a server fault. With traditional telecom equipment, some component, e.g., a faulty board, usually needs to be manually changed, and the time needed for restoration after a fault (the MTTR) is long.
In an NFV environment, the MTTR is the time required to make another virtual machine (VM) available with the needed software and to re-synchronize the network state data with the server that is currently serving. Hence, the MTTR in the NFV environment is shorter than that of an environment comprising traditional telecom equipment and is essentially a fixed constant.
Also, multiple servers are active in order to share the workload. The failure of an individual server only affects a fraction of the users, and this has to be taken into account when considering overall system reliability.
Contrary to common belief, this arrangement neither increases nor decreases overall network availability if the active servers are supported by one or two backup servers. This fact is elaborated in later sections from both a theoretical point of view and through simulations.
3 Availability in the Network Part of the System
Availability in the network part can be analyzed in the traditional way and is affected by 1) the availability of the switches and routers that make up the vSwitch and 2) the maximum number of hops in the vSwitch. The vSwitch connects the VMs in the NFV environment.
If A_n is the availability of a network element, the availability of a vSwitch with a maximum of h hops is A_n^h. Considering the 1+1 configuration of the vSwitch, A_N then follows.
Table 1 shows network availability as a function of the number of hops and the per-element availability A_n. In order to achieve the five-nines reliability usually demanded by telecom operators, network element availability needs to be at least four nines if the hop count is greater than 10. In fact, in order to achieve five nines when per-element availability is only three nines, the hop count needs to be less than two, which is not practical.
▼Table 1. Availability in the network part for various network element availabilities and hop counts
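The calculation behind Table 1 can be sketched in Python as follows. The sketch assumes two independent 1+1-protected paths of h hops each, i.e., A_N = 1 − (1 − A_n^h)^2; this formula is an assumption consistent with the description above rather than one reproduced from the original equation, and the helper names and the placeholder server-part availability are invented.

def network_availability(a_n, hops):
    # Availability of one h-hop path, then 1+1 protection across two paths.
    path = a_n ** hops
    return 1.0 - (1.0 - path) ** 2

def system_availability(a_server_part, a_network_part):
    return a_server_part * a_network_part   # A_sys = A_S x A_N

for a_n in (0.999, 0.9999, 0.99999):
    for hops in (2, 5, 10, 20):
        print(f"A_n = {a_n}, h = {hops:2d} -> A_N = {network_availability(a_n, hops):.7f}")

# Combining with an assumed (placeholder) server-part availability:
print("A_sys =", system_availability(0.999999, network_availability(0.9999, 10)))

Under this assumption, three-nines elements support five-nines network availability only for very small hop counts, whereas four-nines elements tolerate considerably more hops, in line with the qualitative statements above.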
4 Redundancy Schemes for Increasing Availability in the Server Part
1+1 redundancy is common in the traditional telecom industry. In the NFV COTS environment, 1+1 redundancy dictates that a set of servers providing a network service is protected by another set of servers. All the servers need to be hosted on different physical machines or even in different data centers.
The concept of 1+1 redundancy can also be extended so that multiple servers are used for backup, e.g., two backup servers for one active server. Fig. 2 shows setups with one and with two dedicated backup servers.
▲Figure 2. 1+1 and 1+2 redundancy.
Again, the master and the slaves are not co-hosted on the same physical machine. If there is a fault with the master, any slave can become the new master. If there is a fault with a slave, a new VM is designated as the new slave. Before the new slave is ready, the network has reduced protection capability. The vSwitch in Fig. 1 provides connectivity between the master and the slaves.
In traditional telecom networks, 1:1 redundancy is commonly used to maximize resource utilization, and servers may share the load. A server therefore provides real-time network functions and at the same time protects another server. In the 1:1 configuration (Fig. 3), masters and slaves are co-hosted. If the physical machine hosting master 1 and slave 2 is faulty, slave 1 continues the work of master 1, and master 2 loses its protection.
▲Figure 3. 1:1 configuration.
5 Theoretical Analysis of the Server Part and Overall System Availability
In this section, a continuous-time Markov chain [3]-[5] is used to analyze system availability in the server part.
5.1 Availability for 1+1 Server Configuration
In a practical environment, critical applications, such as the virtualized IP Multimedia Subsystem (vIMS) and the virtual Evolved Packet Core (vEPC), are protected from single-server failure. As discussed in Section 4, there are a number of schemes to protect against single-server failure. In this section, the 1+1 scheme is discussed, and a continuous-time Markov chain is used to model the 1+1 protected system.
The most significant types of server failure are observable hardware failure, observable non-corrupting software failure, and silent failure.
With an observable hardware failure, the faulty component is obvious, and the backup server continues the work of the faulty server without interrupting service. In the worst case, a few services in a transient state, e.g., in the process of establishing a voice session, are affected; services in a stable state, e.g., an established voice session, are not affected.
With an observable, non-corrupting software failure, such as a runaway process, the state of the services is not corrupted in the backup server. In this case, fault recovery is the same as in the case of an observable hardware failure.
With a silent failure, the cause of the failure is not obvious. A silent failure affects service only if it occurs on the master, or current working, server. A silent failure affects system availability in the following two ways:
• When the silent failure becomes observable, the network states in both the active server and the backup server(s) have already been corrupted, and the whole system needs to be reset.
• When the silent failure becomes observable, the nature of the failure is not obvious, and the system continues to depend on the master, or current working, server to carry out the work because the backup server(s) are assumed to be faulty. Therefore, a reset of the whole system is inevitable.
Fig. 4 shows the Markov state transitions for a system with only one backup in a 1+1 configuration.
▲Figure 4. Markov state transition for a system with only one backup in a 1+1 configuration.
In Fig. 4, λ is the inverse of the server MTBF, and μ is the inverse of the server MTTR; the times between failures and the times to repair are exponentially distributed random variables. The silent-fault probability is denoted s.
To make the state transitions in Fig. 4 valid, the ranges of λ, s, and μ are restricted as in (3). The real limitation in (3) is that λ has to be less than 1, which means that the individual server MTBF must not be less than 1 hour.
The probabilities of being in states S0, S1, S2, and S3 in the steady state are denoted P0, P1, P2, and P3, respectively.
The Chapman-Kolmogorov equation for the continuous-time Markov chain can then be used to express P3 and, from it, the availability A_{1+1}^s of the 1+1 configuration given in (5).
When s = 1, there is a persistent silent fault on the master side, and the system operates as if there were no protection.
5.2 Availability for 1:1 Server Configuration
The Markov reliability model can also be used in the case of 1:1 load sharing. Fig. 5 shows the state transitions of significance. Trivial state transitions, e.g., S0 to S0 in the 1+1 configuration (Fig. 4), are omitted for clarity.
▲Figure 5. Significant Markov state transitions for a system with 1:1 load sharing.
The probability of being in state Si is denoted Pi, i = 0, 1, ..., 5. The global balance equations of the continuous-time Markov chain then yield a set of algebraic relationships among P0 through P5.
The server part of the system is unavailable in S5 and only half available in S1 and S4. The availability of a system with 1:1 load sharing therefore follows and is given in (12). Combining (5) and (12) leads to (13), which shows that the 1+1 and 1:1 schemes provide essentially the same availability.
5.3 Availability for a Server Configuration with Dual Backups
Fig. 6 shows the non-trivial Markov state transitions for a system with two backups.
▲Figure 6. Significant Markov state transitions for a system with two backups.
The probability of being in state Si is given by Pi, where i = 0, 1, ..., 7. As with 1+1 protection, the restrictions on λ, s, and μ in (3) are still assumed in this part of the study.
Using the Chapman-Kolmogorov equation for continuous-time Markov chains in the steady state, the probability P7 of being in state S7 can be obtained, and the availability of this configuration, A_{1+2}^s, follows; an equivalent expression for A_{1+2}^s can also be given.
When s = 1, i.e., when the master is always corrupting the data, system reliability is the same as when there is no protection at all.
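As a numerical companion to this section, the following Python sketch solves the global balance equations of a small continuous-time Markov chain that combines the same ingredients: failure rate λ, repair rate μ, and silent-error probability s. The four states and their transition rates are illustrative assumptions made for this sketch; they are not the exact chains of Figs. 4-6, and the printed values are therefore not the results given by (5) or the later equations.

import numpy as np

def availability_1plus1(lam, mu, s):
    # Illustrative states (not the exact chain of Fig. 4):
    #   0: master and backup healthy                      (up)
    #   1: one server failed and detected, being replaced (up)
    #   2: silent error surfaced, full reset in progress  (down)
    #   3: both servers failed, detected                  (down)
    Q = np.zeros((4, 4))
    Q[0, 2] = lam * s            # silent master failure -> full reset
    Q[0, 1] = lam * (2.0 - s)    # detected failure of either server
    Q[1, 0] = mu                 # replacement server ready
    Q[1, 2] = lam * s            # remaining server fails silently
    Q[1, 3] = lam * (1.0 - s)    # remaining server fails detectably
    Q[2, 0] = mu                 # master and backup restored together
    Q[3, 1] = mu                 # one server restored
    np.fill_diagonal(Q, -Q.sum(axis=1))   # generator rows sum to zero

    # Solve pi Q = 0 together with sum(pi) = 1.
    A = np.vstack([Q.T, np.ones(4)])
    b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi[0] + pi[1]          # system is up in states 0 and 1

lam, mu = 1.0 / 5000.0, 10.0      # assumed rates: MTBF 5000 h, MTTR 6 min
for s in (0.0, 0.01, 0.1, 0.5):
    print(f"s = {s:4.2f}:  A(1+1) ~ {availability_1plus1(lam, mu, s):.9f}")

Even with this simplified chain, the silent-error path quickly dominates unavailability as s grows, which is consistent with the later observation that the silent-error probability becomes the dominant factor in overall system availability.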
6 Site Maintenance
It is important to consider site maintenance when using COTS hardware. To evaluate the effect of site maintenance, the following assumptions are made:
• Site maintenance can only begin if there is no failure in either the working server or the backup server(s).
• Site recovery, i.e., reverting to the original setup, can only occur if there is no failure in either the working server or the backup server(s).
In the following analysis, two distinct cases are considered for the 1+1 configuration:
• Three or more sites of equal capability, with no reversion after maintenance. In this case, a third site is used after one site has been put into maintenance mode. During maintenance, the system is still protected by two sites. After maintenance, the system does not revert to its original operation sites.
• Two sites, with reversion after maintenance. During maintenance, the system has 1+1 protection, but both the master and the slave operate on the same site. After maintenance, the system reverts to its original mode of operation. This reversion only occurs if both servers are in operation mode.
6.1 Case 1: Non-Revertive After Maintenance with 1+1 Configuration
In the Markov transition diagram for this case (Fig. 7), the time between site maintenance events is assumed to be exponentially distributed, and the inverse of the mean time between maintenance is denoted η. The global balance equations yield the steady-state probabilities, from which the availability follows.
▲Figure 7. State transition for non-revertive operation after maintenance in 1+1 mode.
Given the constraints μ > 0, η > 0, and 1 > λ > 0, it can be verified that the degradation relative to A_{1+1}^s, the availability of the 1+1 backup case given by (5), is bounded as in (20). Equation (20) thus provides an upper bound on the degradation introduced by site maintenance. Given the nominal values μ = 10 and η = 1/1000 (i.e., maintenance every 1000 hours), the degradation is negligible. In the case of dual backup, the effect of site maintenance is expected to be smaller than in the 1+1 case and is therefore not pursued here.
6.2 Case 2: Revertive After Maintenance with 1+1 Configuration
In the case of two sites, operation needs to be revertive in order to maintain the same level of fault tolerance. Fig. 8 shows the Markov state transitions for the 1+1 configuration when the principles for initiating maintenance activity and revertive operation are observed.
In Fig. 8, the time between site maintenance events is again assumed to be exponentially distributed, with the inverse of the mean time between maintenance denoted η. The recovery time is also assumed to be exponentially distributed, and the inverse of the MTTR is denoted γ.
▲Figure 8. Markov state transitions with site maintenance and revertive operation after maintenance.
The global balance equations form a linear system (22); solving it gives the site availability (23). Given the constraints μ > 0, γ > 0, η > 0, and 1 > λ > 0, it can be verified that the degradation relative to A_{1+1}^s, the availability of the 1+1 backup case given by (5), is bounded as in (24). Equation (24) thus provides an upper bound on the degradation introduced by site maintenance with revertive operation. Given the nominal values μ = 10 and η = 1/1000 (i.e., maintenance every 1000 hours), the degradation is negligible. In the case of dual backup, the effect of site maintenance is expected to be smaller and is therefore not pursued here.
Therefore, even with the associated site maintenance, COTS hardware does not affect system availability.
7 Simulation of Availability in the Server Part
In the above theoretical analysis of availability in the server part, the combined effect of site maintenance (e.g., software upgrades or patching that affect the whole site) and site failure (e.g., failure caused by natural disasters) is not taken into account.
Although it is relatively easy to obtain a closed-form solution for system availability in the ideal case without site-related issues, it is extremely difficult to obtain an analytical solution when site issues are present. In this case, we resort to numerical simulation [6] under reasonable assumptions.
7.1 Methodology
A discrete event simulator was constructed to determine availability in the server part. In the simulator, an active server, i.e., a master that processes network traffic, is supported by a single backup or by two backups at another site or sites.
The probability of server failure is assumed to follow the bathtub failure-rate curve (Weibull distribution). In NFV management, we need to provide servers that operate on the flat part of the bathtub curve. In this case, the familiar exponential distribution can be used.
In the discrete event simulator, each server is scheduled to work for a certain period of time. This period is an exponentially distributed random variable, which is commonly used to model server behavior during the server's useful lifecycle, and its mean is given by the MTBF of the server.
The flat part of the bathtub curve corresponds to the normal server MTBF, and the failure density function is given by f(t) = λ·e^(−λt), t ≥ 0, with λ = 1/MTBF.
After the period of time that the server is scheduled to work, the server is down for a fixed period of time, i.e., the time needed to start another virtual machine that replaces the one in trouble. This differs from traditional telecom-grade equipment. Here, we assume that another server is always available to replace the one that goes down. Regardless of the nature of the fault, the downtime of a faulty server is fixed and equals the time needed for another server to be ready to take over.
Fig. 9 shows this arrangement for a system with only one backup. Although the server up-time varies, the server down-time is of fixed duration.
▲Figure 9. Lifecycle of the servers.
Servers are hosted at "sites," which are considered to be data centers. In this simulation, during the initial setup, the servers supporting each other are hosted at different sites in order to minimize the effects of site failure and site maintenance.
In order to model system behavior with one or two backups, the concept of a protection group is introduced. A protection group comprises a master with one or two slaves at another site or sites (Fig. 2). There may be multiple protection groups inside the network, and each protection group serves a proportion of the users.
A protection group is considered "down" if every server in the group is dead. While the protection group is down, network service is affected, and the network is considered to be down for the group of users that the protection group is responsible for.
The up-time and down-time of each protection group are recorded in the discrete event simulator. Availability in the server part is given by the weighted protection-group up-time divided by the total time elapsed, where the total time elapsed is the total simulation time in the discrete event simulator. The up-time weighting of a protection group is proportional to its workload; in the simulation setup, the workload is evenly divided among the given number of protection groups.
Fig. 10 shows the protection group, sites, and servers for a system with two backups. The protection group is an abstract concept, and a proportion of the network functions is unavailable if and only if all the servers in the protection group are not functioning.
▲Figure 10. Servers, sites, and protection group.
Even though the simulator allows each site to host a configurable number of servers, there is little use for this arrangement. System availability does not change regardless of how many servers per site are used to support the system, as long as the number of servers in each protection group does not change. An increase in the number of servers per site is essentially an increase in the number of protection groups, and over a long period each protection group experiences similar down-time for the same up-time, i.e., the same availability.
As in the theoretical analysis, a silent error caused by software or by a subtle hardware failure only affects the master server. If the master experiences a silent error, both the master and the slave incur the MTTR, which is the time to instantiate two VMs simultaneously. In this case, this part of the system (this protection group) is considered to be faulty.
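A deliberately simplified Python sketch of such a simulator is given below for a single protection group: up-times are exponential, down-times are fixed, and a silent master failure forces the whole group to restart. Site maintenance and site failure are omitted here, and all parameter values are assumptions rather than the settings used for the results in Section 7.2.

import random

def simulate_protection_group(n_servers=2, mtbf=5000.0, mttr=0.1,
                              silent_prob=0.1, horizon=10_000_000.0, seed=1):
    # One protection group: each server alternates an exponentially distributed
    # up-time (mean = MTBF) and a fixed down-time (MTTR).  A silent failure of
    # the working (master) server corrupts the backups, so the whole group
    # restarts and every server needs the fixed recovery time.
    rng = random.Random(seed)
    t = 0.0
    up = [True] * n_servers
    next_event = [rng.expovariate(1.0 / mtbf) for _ in range(n_servers)]
    down_time = 0.0
    all_down_since = None

    while t < horizon:
        i = min(range(n_servers), key=lambda k: next_event[k])
        t = next_event[i]
        if up[i]:
            # The lowest-indexed healthy server is treated as the master.
            is_master = i == min(k for k in range(n_servers) if up[k])
            up[i] = False
            next_event[i] = t + mttr
            if is_master and rng.random() < silent_prob:
                for k in range(n_servers):          # full group restart
                    up[k] = False
                    next_event[k] = t + mttr
        else:
            up[i] = True
            next_event[i] = t + rng.expovariate(1.0 / mtbf)
        # Accumulate the intervals during which every server is down.
        if not any(up) and all_down_since is None:
            all_down_since = t
        elif any(up) and all_down_since is not None:
            down_time += t - all_down_since
            all_down_since = None

    if all_down_since is not None:
        down_time += t - all_down_since
    return 1.0 - down_time / t

for backups in (1, 2):
    a = simulate_protection_group(n_servers=1 + backups)
    print(f"{backups} backup(s): availability ~ {a:.6f}")

Because each protection group is statistically identical under an even workload split, simulating one group over a long horizon is sufficient for this sketch; the weighted sum described above then reduces to the per-group availability.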
In this reliability study, the focus is on the number of backups for each protection group; a 1+1 configuration is the typical configuration with one backup. A load-sharing arrangement such as 1:1 can be viewed as two protection groups. For example, in Fig. 3, master 1 and slave 1 form one protection group, and master 2 and slave 2 form another.
From a theoretical point of view, the 1+1 and 1:1 schemes provide similar availability, as in (13). In this work, 1:1 load sharing is not simulated.
A site also undergoes maintenance, and traditional telecom-grade equipment and COTS hardware differ in this respect. With telecom-grade equipment, the impact of maintenance on system performance and availability has to be kept to a minimum during the maintenance window. With COTS hardware, maintenance can be more frequent.
To simulate the maintenance aspect of COTS hardware, the simulator puts a site "under maintenance" at a random time. The interval during which a site is working is assumed to be an exponentially distributed random variable whose mean is configurable in the simulator. The duration of the maintenance is a uniformly distributed random variable with a configured mean, minimum, and maximum.
In order to put a site under maintenance, there must be no faults inside the network, and all servers on the site under maintenance are moved to another site. Therefore, no traffic is affected, and the probability of site failure is reduced while the site is being maintained.
When a site comes back up after maintenance, it attempts to reclaim all the server responsibilities that were transferred because of the maintenance.
If every server in a protection group is working, the protection group rearranges its protection relationships so that each site hosts only one server of the group. The new server on the site that has returned from maintenance needs an MTTR before it is ready for backup. In this case, there is no loss of service.
If there is at least one working server and one faulty server in a protection group, one working server is added to the protection group. The new server on the site that has returned from maintenance needs an MTTR before it is ready for backup. In this case, there is no loss of service.
If no server in the protection group is working, the protection group gains a new working server from the site that has returned from maintenance. The new server needs an MTTR before it is ready for service, and the system provides service again when the new server is ready.
A site may also fail because of loss of power, thermal issues, or natural disasters. The simulator models such events by treating the site-up duration as an exponentially distributed random variable with configurable mean. The duration of a site failure is a uniformly distributed random variable with configurable mean, minimum, and maximum.
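Before turning to the results, the following sketch shows how the site maintenance and failure windows described above might be sampled. It is an illustrative reconstruction rather than the original simulator's code: working intervals are exponential, outage durations are drawn uniformly between the configured bounds, and the separately configured mean duration is not modeled. The default values reuse the figures quoted in Section 7.2 below.

import random

def sample_site_events(maint_interval=1000.0, maint_min=4.0, maint_max=48.0,
                       fail_interval=10_000.0, fail_min=4.0, fail_max=24.0,
                       horizon=100_000.0, seed=2):
    # Working intervals are exponentially distributed; each outage lasts a
    # uniformly distributed time between the given bounds.
    rng = random.Random(seed)
    events = []
    for kind, mean_up, lo, hi in (("maintenance", maint_interval, maint_min, maint_max),
                                  ("site failure", fail_interval, fail_min, fail_max)):
        t = rng.expovariate(1.0 / mean_up)
        while t < horizon:
            duration = rng.uniform(lo, hi)
            events.append((t, t + duration, kind))
            t += duration + rng.expovariate(1.0 / mean_up)
    return sorted(events)

for start, end, kind in sample_site_events()[:5]:
    print(f"{kind:12s} from t = {start:9.1f} h to t = {end:9.1f} h")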
7.2 Simulation Results
In this simulation, both site failure, with an MTBF of 10,000 hours, and site maintenance, with a mean time between site maintenance of 1000 hours, are considered. The mean duration of a site failure is assumed to be 12 hours, uniformly distributed between 4 hours and 24 hours. The mean duration of site maintenance is 3 hours, uniformly distributed between 4 hours and 48 hours.
The next step is to determine the effect of site failure and maintenance. The very bad site described above has a mean time between site failures that is double the server MTBF, and the mean time between site maintenance is assumed to be 0.1× the server MTBF. Table 2 shows the availability for a single-backup system.
▼Table 2. Availability in the server part with a single backup
Availability in the server part is affected by the silent error, and a single redundant piece of hardware provides an improvement when the silent-error probability is small. Table 3 shows precise data on availability in the server part with dual backup.
▼Table 3. Availability in the server part with dual backup
From the tables for single and dual backup, we can see that dual backup is only marginally beneficial when there are site issues. In practice, site issues are inevitable; therefore, a geographically distributed single-backup system is recommended for simplicity.
8 Conclusion
System availability can be divided into two parts: availability in the network and availability in the server. The maximum number of hops determines availability in the network part when the availability of the individual network elements is known; the fault-tolerance configuration is assumed to be 1+1. Availability in the server part is mainly determined by
• the availability of an individual server, characterized by its MTBF and MTTR
• the silent-error probability
• site-related issues, such as maintenance and faults
• the protection scheme, i.e., one or two dedicated backups.
The concept of silent error is introduced to account for software errors and hardware failures that cannot be detected by the management system. Availability in the server part is dominated by the silent error if the silent-error probability is more than 10%. This is shown in both the theory and the simulations. The dual-backup scheme is of marginal benefit, and the added complexity may mean that it is not warranted in a real network.
COTS hardware can provide the same high level of availability as traditional telecom hardware. The undesirable attributes of COTS hardware are modeled as site-related issues, such as site maintenance and site failure, which do not apply to traditional telecom hardware.
Unlike site failure, site maintenance does not noticeably degrade system availability. This applies in both the revertive and non-revertive cases. It is critical for the virtualization infrastructure management to provide as much information about hardware failures as possible in order to increase the availability of the applications. In both the theory and the simulation, the silent-error probability is a dominant factor in overall system availability. The silent-error probability can be reduced if the virtualization infrastructure management is capable of isolating faults.
References
[1] Network Functions Virtualisation (NFV), ETSI GS NFV, 2013.
[2] Applied R&M Manual for Defense Systems, GR-77, 2012.
[3] A. Leon-Garcia, Probability, Statistics, and Random Processes for Electrical Engineering, 3rd ed. Englewood Cliffs: Prentice Hall, 2008.
[4] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd ed. New York: McGraw-Hill, 1991.
[5] P. Bremaud, An Introduction to Probabilistic Modeling. New York: Springer, 2012.
[6] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C, 2nd ed. New York: Cambridge University Press, 1992.
Manuscript received: 2014-04-09
DOI: 10.3969/j.issn.1673-5188.2014.03.007
Published online 23 July 2014 at http://www.cnki.net/kcms/detail/34.1294.TN.20140723.1827.001.html
Biography
Li Mo (Mo.li@ztetx.com) is the chief architect of the CTO Group, ZTE Corporation.
He has more than 20 years' experience in the telecommunications industry. Prior to joining ZTE in 2001, he worked extensively with IBM, Nortel, and Fujitsu. His current research interests include SDN, NFV, P2P networks, next-generation core networks, IMS, and fixed-mobile convergence. He is an active member of the IEEE, ETSI, and ITU-T, where he edited two published recommendations. He has more than 10 approved US patents and many more in progress. Dr. Mo received his PhD degree in electrical engineering from Queen's University, Canada, in 1989.