Zhai ZHANG, Yao QIU, Xiaoliang YUAN, Rui YAO, Yan CHEN, Youren WANG
College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
KEYWORDS Bio-inspired hardware; Embryonics; Fault-cell reutilization; Reliability analysis; Self-healing strategy; Transient fault
Abstract The self-healing strategy is a key component in designing bio-inspired embryonics circuits built from cell arrays. However, existing self-healing strategies for embryonics circuits mainly focus on permanent faults inside cell modules such as the function module and the configuration register, while little attention is paid to transient faults. From the point of view of hardware utilization, permanently eliminating a cell that has only suffered a transient fault, which could be repaired by a configuration mechanism, is a considerable waste of hardware resources. This paper proposes a new self-healing strategy, the Fault-Cell Reutilization Self-healing Strategy (FCRSS), which provides a method for reusing transient fault-cells. The circuit structures of all the modules in the cells are described in detail. The new strategy combines the two processes of elimination and reconfiguration. During fault-cell elimination, cells with transient faults in the embryonics circuit array can be reused to take over the functions of the cells on their left side in the same row. Transient fault-cells in a transparent state can therefore be reconfigured to realize fault-cell reutilization. Finally, a circuit simulation, a resource consumption comparison, a reliability analysis and a detailed normalization analysis are presented. The FCRSS improves the hardware utilization rate and system reliability at the expense of a small amount of hardware resources and reconfiguration time. Based on these results, a method for determining the optimal self-healing strategy according to the environmental conditions is presented.
High reliability is an important requirement for electronic systems in harsh environments such as aviation and space. Embryonics is a novel type of fault-tolerant hardware circuit characterized by distributed self-testing and adaptive self-healing. The typical structure of the embryonics circuit is a two-dimensional array of identical electronic cells [1-5]. The self-healing strategy is one of the most important elements of cell circuit design because it controls the process of cell elimination and reconfiguration. Choosing an appropriate and optimized self-healing strategy therefore leads to a high hardware utilization rate and high reliability.
At present, methods for permanent fault-cell elimination and layout optimization of the cell array are the main directions in self-healing strategy research on embryonics circuits. Tyrrell et al. [6,7] were the earliest proponents of the row/column elimination strategy, which features simple cell circuit structures and simple working process control. However, its consumption of hardware resources is tremendous because it eliminates an entire row or column of cells for a partial fault in any single cell. Mange et al. [8,9] proposed the single cell elimination strategy, which needs a large amount of configuration information backup in the cell register, leading to serious hardware redundancy. Szasz et al. [10,11] put forward a self-healing strategy based on a fixed array of 9 cells; however, it also requires a large number of redundant cells and has a low hardware utilization rate. In Lala's strategy [12,13], spare cells and wiring cells are evenly distributed around the working cells, so the replacement efficiency of fault-cells is high; however, the accompanying rewiring process consumes significant resources. Li et al. [14] showed that a self-healing strategy based on the endocrine system can simplify the wiring reconfiguration, while its disadvantages include a complicated circuit model and difficult asynchronous cooperation between cells. Samie et al. [15,16] proposed a self-healing strategy based on the prokaryotic cell array model [17], which reduces the configuration information in the cell but increases the reconfiguration time. Cai et al. [18,19] studied a self-healing strategy based on the bus structure; however, the circular removal process requires a long time. The core idea of most of these strategies is to remove fault-cells directly, and only a few studies provide specific approaches for different types of faults. These strategies therefore lead to a low utilization rate of hardware resources, in particular because transient fault-cells that could have been healed by reconfiguration are permanently eliminated.
Ionizing radiation is the primary source of digital system failures in space. Transient faults (also known as soft errors), unlike manufacturing or design faults, do not occur consistently. Instead, they are caused by external events, such as highly energized particles striking the chip. These events do not lead to permanent physical damage to the chip; however, they can alter signal transfers or stored values and thus cause incorrect program execution [20]. For example, Single-Event Upsets (SEU) and Single-Event Transients (SET) account for around 90% of transient faults [16,21]. Moreover, these faults can be repaired by reconfiguration [22]. The traditional self-healing strategies of embryonics circuits hold the view that "all fault-cells should be eliminated no matter what kind of fault type they belong to". They thus ignore the recoverability of transient fault-cells, resulting in a waste of hardware resources, and this waste grows significantly as the proportion of transient faults increases. Based on an analysis of the principle of the traditional cell elimination process, this paper proposes a fault-cell reutilization self-healing strategy. Unlike the traditional strategies, which make the fault cells transparent without reconfiguring them, the new strategy adds a fault-cell reconfiguration stage to the period of function replacement. The structures and self-healing methods of all the modules inside the cells are expounded. The effectiveness of the new self-healing strategy and a method for design guidance are verified by simulation analysis and efficiency analysis in comparison with the cell elimination self-healing strategy, which is widely used in bio-inspired embryonics circuits and for which reliability models and reliability analyses are already available.
The Cell Elimination Self-healing Strategy (CESS) works as follows: when a working cell is detected as faulty (cell 01 in Fig. 1(a)), cell elimination is triggered, and the functions of the working cells are shifted to the cells on the right in the same row (Fig. 1(b)). The array can be repaired without spare rows when the number of spare cells is more than that of fault cells; otherwise row elimination is triggered, as shown in Figs. 1(c) and (d) [4].
The CESS has two shortcomings in dealing with fault-cells during cell elimination. The first is the waste of hardware resources caused by removing a whole cell for a partial internal fault. The second is that it does not distinguish between fault types; instead, it treats all failures as permanent faults. For transient fault-cells, removal without reutilization results in a considerable increase in hardware resource consumption and a decrease in array reliability.
The new strategy FCRSS combines the two mechanisms of elimination and reconfiguration,enabling the two processes to work simultaneously.The whole self-healing procedure consists of three stages:fault-cell self-detection,fault-cell elimination,and fault-cell reconfiguration.The latter two stages are called self-repairing.
Fault-cell self-detection stage: the fault signal can come from any module inside the cell. In this paper, only two modules, the configuration register and the Look-Up Table (LUT), are self-detected, using a Dual-Modular Redundancy (DMR) structure in which each module has a backup.
Fault-cell elimination stage: if a cell is self-detected as faulty, it is immediately configured to be transparent, and its function is taken over by the neighboring cell on the right side. This stage resembles the in-row fault-cell replacement stage of the CESS.
Fault-cell reconfiguration stage: in the FCRSS, the eliminated cells in the transparent state are regarded as spare cells. When new faults are detected in the cells on the left, the transient fault-cells on the right can be reconfigured with the backup configuration information stored in the neighboring cells on the left. In most cases, transient fault-cells in a transparent state can be repaired by reconfiguration; otherwise, the transparent cell (suffering permanent faults) remains in the transparent state permanently.
As shown in Fig.2,the self-healing process of FCRSS is introduced by an example with transient faults.
Fig.1 Principle of cell elimination self-healing strategy.
Fig.2 Process of fault-cell reutilization self-healing strategy.
(1)Fig.2(a)shows a regular working cell array.
(2)Working cell 12 suffers a transient fault,and the neighboring spare-cell on the right replaces its function(Fig.2(b)).Then,working cell 12 switches to the transparent state and becomes transparent cell 12,as shown in Fig.2(c).Cells in the transparent state do not connect to the cell array.They are waiting for the fault signals from cells on the left and then are re-enabled as spare cells.
(3)Another transient fault occurs in working cell 11,as shown in Fig.2(d).Transparent cell 12 can be set as a spare-cell by the fault signal of cell 11,and be configured to the function of cell 11 as a spare-cell,thus the cell array is repaired.The prerequisite for the successful replacement of cell 11's function by transparent cell 12 is that transparent cell 12 suffers a transient fault that can be repaired with reconfiguration.If,however,transparent cell 12 suffers faults which cannot be repaired with reconfiguration,the self-detection circuit will offer fault signals and trigger the elimination again.
(4) As shown in Fig. 2(e), cell 11 is replaced by transparent cell 12, so the cell array works normally after two transient faults have been self-healed in the middle row, which cannot be achieved with the CESS.
(5) Furthermore, when cell 10 fails (Fig. 2(f)), its function is shifted to transparent cell 11, as shown in Fig. 2(g).
(6) In this strategy, the function of a cell can only be taken over by a neighboring cell on its right. When the fourth fault occurs in the middle row, as in Fig. 2(h), there are two fault-cells on the far left, so the array cannot be repaired within the row and row elimination is triggered.
In strategy CESS, only one fault cell can be repaired because there is only one spare-cell inside the row,while the fault-cells in the new strategy FCRSS can be repaired up to 3 times.If the number of spare-cells in the row increases,the number of self-healing times in the row will increase significantly.All the transient faults like the SEUs can be repaired by the reconfiguration mechanism in LUT or the configuration register.Meanwhile,a permanent fault can also be repaired when the cell implements a new logic function after reconfiguration or when its output is not from the LUT unit with the permanent fault.
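The walk-through of Fig. 2 can be sketched in a few lines of Python. This is a minimal behavioral model, not the authors' circuit: the names `Cell`, `make_row`, `repair` and `count_repairs` are hypothetical, a row of three working cells plus one spare is assumed, and a permanently faulty transparent cell is simply skipped rather than being reconfigured and re-eliminated one clock later.

```python
from dataclasses import dataclass
from typing import List, Optional

WORKING, SPARE, TRANSPARENT = "working", "spare", "transparent"

@dataclass
class Cell:
    state: str
    function: Optional[int] = None  # index of the logic function hosted, if any
    permanent: bool = False         # True once a permanent fault has been injected

def make_row(m: int, spares: int) -> List[Cell]:
    """Row with m working cells on the left and `spares` spare cells on the right."""
    return [Cell(WORKING, i) for i in range(m)] + [Cell(SPARE) for _ in range(spares)]

def repair(row: List[Cell], pos: int, permanent: bool = False, reuse: bool = True) -> bool:
    """Inject a fault into the working cell at `pos`; return True if the row heals.

    reuse=True models FCRSS (recoverable transparent cells count as spares);
    reuse=False models CESS (transparent cells are never used again).
    """
    if row[pos].state != WORKING:
        raise ValueError("faults are only injected into working cells in this sketch")
    row[pos].permanent = row[pos].permanent or permanent
    target = None
    for q in range(pos + 1, len(row)):  # replacement is only allowed to the right
        cell = row[q]
        if cell.state == SPARE or (reuse and cell.state == TRANSPARENT and not cell.permanent):
            target = q
            break
    if target is None:
        return False  # nothing usable on the right: row elimination would follow
    # Shift the functions of the working cells in [pos, target) one host to the right.
    hosts = [i for i in range(pos, target) if row[i].state == WORKING]
    functions = [row[i].function for i in hosts]
    for i, f in zip(hosts[1:] + [target], functions):
        row[i].state, row[i].function = WORKING, f
    row[pos].state, row[pos].function = TRANSPARENT, None
    return True

def count_repairs(fault_positions, reuse: bool) -> int:
    """Number of faults healed in a 3+1 row before the first unrepairable fault."""
    row = make_row(3, 1)
    healed = 0
    for pos in fault_positions:
        if not repair(row, pos, reuse=reuse):
            break
        healed += 1
    return healed
```

With the fault sequence of Fig. 2 (physical positions 2, 1, 0 and then 1), `count_repairs([2, 1, 0, 1], reuse=True)` heals three faults, whereas the same sequence with `reuse=False` heals only one, which matches the comparison stated above for a row with a single spare cell.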
The typical architecture of the self-healing embryonics hardware is a two-dimensional cell array,with all cells having the same internal structure.Each cell is composed of four modules:Controller,Function module,I/O routing switch and Configuration register.The Controller controls all the operations of the cell.The Function module is the processing block,its core part being the LUT.The I/O routing switch is responsible for connecting and transferring data with the surrounding cells.The Configuration register deposits all the configuration information of cells.Fig.3 shows the classical structure of an embryonics cell array.
Fig.3 Two-dimensional cell array and internal modules.
The Controller module in FCRSS has been redesigned and is the main circuit innovation of this paper. The Function module and the I/O routing switch are basically the same as the circuits in CESS [23]. The FCRSS requires configuration registers to back up configuration information from the cell on the left in order to achieve reutilization. An Assistant Reroute (AR) module is used to reroute the cell connections between rows.
The Controller is the‘‘brain”of a cell.It is the core module to guarantee the correct self-healing process of the cell array,and is in charge of generating the shifting signal of the Configuration register and the switching signal of the I/O routing switch according to the changes of intracellular fault signals and the cell state.The state transition diagram of the Controller contains four states: initial state, working state,transparent state,and reconfiguration state,as shown in Fig.4.
In the initial state, the cell array configures the functions of all cells. There are only working cells and spare cells in the cell array in this state. When the cell array starts working, the self-detection circuit is enabled by the signal "set=1". In the working state, the intracellular fault detection module begins to work. After a fault is detected inside a working cell (self_fault="1"), the cell turns into the transparent state and sends a shift signal to the cell on the right. Upon receiving the shift signal from the cell on the left (front_shift="1"), a cell in the transparent state switches into the reconfiguration state and reconfigures itself with the backup configuration information. If there is no fault inside the cell after reconfiguration (self_fault="0"), it turns into the working state and the self-healing is finished. Otherwise, the cell returns to the transparent state and sends a shift signal to the cell on the right to continue the self-healing process.
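The state transitions described above can be summarized as a small transition function. This is a sketch under stated assumptions: the signal names `set`, `self_fault` and `front_shift` come from the text, while the `State` enum, the function `next_state` and the choice to return the `shift` output together with the next state are illustrative and do not describe the actual register-transfer implementation.

```python
from enum import Enum

class State(Enum):
    INITIAL = 0
    WORKING = 1
    TRANSPARENT = 2
    RECONFIGURATION = 3

def next_state(state, set_=0, self_fault=0, front_shift=0):
    """Return (next_state, shift) for one reconfiguration clock.

    shift is the output that asks the cell on the right to take over.
    """
    if state is State.INITIAL:
        # Self-detection is enabled by set=1 once the array starts working.
        return (State.WORKING, 0) if set_ else (State.INITIAL, 0)
    if state is State.WORKING:
        # A detected fault makes the cell transparent and raises shift.
        return (State.TRANSPARENT, 1) if self_fault else (State.WORKING, 0)
    if state is State.TRANSPARENT:
        # self_fault is ignored here; only a shift from the left re-enables the cell.
        return (State.RECONFIGURATION, 0) if front_shift else (State.TRANSPARENT, 0)
    # RECONFIGURATION: a fault that survives reconfiguration is treated as permanent,
    # so the cell becomes transparent again and passes the request to the right.
    return (State.TRANSPARENT, 1) if self_fault else (State.WORKING, 0)
```

A transient fault therefore follows the path working, transparent, reconfiguration, working, while a permanent fault loops back from reconfiguration to transparent and costs one additional clock.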
Fig.4 State transition diagram of Controller.
Fig.5 Timing diagram of Controller.
Fig.5 shows the timing diagram of the Controller.The input signals are reclk,front_shift,and self_fault,and the output signals are shift,bypass,and cell_state.‘‘Reclk”is the reconfiguration clock.‘‘Front_shift”,connected to‘‘shift”of the cell on the left,is the output signal of the adjacent cell on the left.When a fault is detected in the cell,its‘‘self_fault”is set at‘‘1”,representing the flag signal of the cell failure.The‘‘cell_state” signal represents the state of the cell.‘‘Cell_state=1”means the cell is working;otherwise it is‘‘0”.‘‘Shift”is the output signal to trigger reconfiguration process of the cell on the right, which is connected with‘‘front_shift”of the cell on the right.‘‘Bypass”is the control signal for the I/O routing switch and the AR module.
The working process shown in Fig.5 is as follows:
(1)In reclk 1,the cell is working faultlessly.
(2) At the rising edge of reclk 2, a fault is detected (self_fault="1"). At the same time, the signal bypass is set to "1". The cell then turns into the transparent state and sends a shift signal (shift="1") to the cell on the right.
(3) One clock later, the array has been self-repaired, and the signals "cell_state" and "shift" turn to "0". The fault-cell stays in the transparent state until the next "shift" signal is received. The transparent state lasts from reclk 3 through reclk 5 in Fig. 5, and the cell's self_fault signal remains ineffective until the cell is reused.
(4) At the rising edge of reclk 6, "front_shift" turns to "1", which indicates a fault in the cells on the left. The cell then switches into the reconfiguration state and the bypass signal is set to "0" for cell substitution.
(5) After one clock, "cell_state" turns to "1" and "self_fault" turns to "0". The self-healing is finished and the cell works normally.
Fig.6 Cyclic backup structure of Configuration register.
The self-detection circuit structure of the Function circuit is the DMR.The outputs of two identical function blocks which have the same inputs are connected to an XOR gate.The output of the XOR gate is the fault signal of the Function circuit.
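The DMR comparison can be illustrated with a bit-level sketch. The helper names `lut_eval` and `dmr_fault` are hypothetical, and the LUT is modeled as an integer truth table whose bit k holds the output for input index k; this encoding is an assumption made for illustration only.

```python
def lut_eval(table: int, inputs) -> int:
    """Evaluate a LUT whose truth table is packed into an integer."""
    index = 0
    for bit in inputs:
        index = (index << 1) | (bit & 1)  # first input is the most significant bit
    return (table >> index) & 1

def dmr_fault(table_a: int, table_b: int, inputs) -> int:
    """XOR the outputs of the two identical function blocks; 1 signals a fault."""
    return lut_eval(table_a, inputs) ^ lut_eval(table_b, inputs)

XOR2 = 0b0110                 # 2-input XOR truth table
upset = XOR2 ^ (1 << 3)       # flip the entry for inputs (1, 1), as an SEU might

print(dmr_fault(XOR2, XOR2, (1, 1)))   # healthy pair: outputs agree
print(dmr_fault(XOR2, upset, (1, 1)))  # corrupted entry addressed: outputs differ
print(dmr_fault(XOR2, upset, (0, 1)))  # corrupted entry not addressed: no flag
```

The last call shows a property of any comparison-based detector: a corrupted entry is flagged only when the current inputs address it.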
The I/O routing switch is used to switch signals inside the cells and change the propagation direction of the signals.In the original circuit we designed,the I/O routing switch has two inputs and two outputs in each direction.For larger cell arrays,the actual number of inputs and outputs should be equal to the number of spare cells in the row.The connection mode of the I/O routing switch depends on the configuration bits in the cells.When a cell works properly,the configuration information determines its connection relationship with the four surrounding directions.When the cell is in a transparent state,the I/O routing switch will directly communicate with the cells on both left and right sides.
The Configuration register deposits cell configuration information used to configure cell functions and control cell-to-cell connections.In addition to the innovative design of the controller module,the new self-healing strategy also requires the Configuration register to be able to hold the configuration information of the adjacent cell on the left and update the backup configuration information after the array reconfiguration.Fig.6 is a schematic diagram of the Configuration register with a cyclic backup structure. ‘‘mxx” represents the configuration information of cell xx.Light-colored boxes represent the backup configuration information of cells on the left(backup CR),and dark-colored ones express the configuration information of local cells(working CR).
Fig.7 shows the transfer process of the configuration information in the self-healing process.The connection used to bypass the fault configuration register is connected only when the cell is in a transparent state,and is represented by solid lines;otherwise it is expressed by dashed lines.
If the CR of cell 11 for the local cell fails(Fig.7(a)),cell 11 becomes transparent(Fig.7(b)),and the CR chain will directly connect cell 12 with cell 10.Simultaneously,the backup configuration information of all the cells on the right side of the fault cell is shifted into the working CR,as shown in Fig.7(c).In the next clock,the new CR chain moves the working configuration information of the cell on the left into the backup CR of the adjacent cell on the right,as in Fig.7(d).The configuration information of cell 10 will be copied into the backup CR of cell 12 and the configuration information of the latter will be transferred into cell 13 for backup.
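The two-clock transfer of Fig. 7 can be traced with plain lists. This is a sketch of the case described above only (one fault in a row whose spare cell is on the far right); the function name `fail_cell` is hypothetical, and leaving the leftmost backup CR empty is an assumption, since the text does not state what the first cell of the cyclic chain backs up.

```python
def fail_cell(working, backup, transparent, pos):
    """Update working and backup CR contents in place after cell `pos` fails."""
    transparent[pos] = True
    working[pos] = None
    # Clock 1: every active cell to the right of the fault loads its backup CR,
    # which holds the configuration of its left neighbor.
    for i in range(pos + 1, len(working)):
        if not transparent[i]:
            working[i] = backup[i]
    # Clock 2: backups are refreshed along the CR chain, which now bypasses
    # transparent cells, so each active cell backs up its nearest active left neighbor.
    prev = None
    for i in range(len(working)):
        if transparent[i]:
            continue
        backup[i] = prev
        prev = working[i]

# Cells 10, 11, 12 are working; cell 13 is the spare.
working = ["m10", "m11", "m12", None]
backup = [None, "m10", "m11", "m12"]
transparent = [False, False, False, False]

fail_cell(working, backup, transparent, 1)  # the CR of cell 11 fails
print(working)  # ['m10', None, 'm11', 'm12']
print(backup)   # [None, 'm10', 'm10', 'm11']
```

After the transfer, cell 12 holds m11 with m10 as backup and cell 13 holds m12 with m11 as backup, as described for Fig. 7(d). The backup CR of the transparent cell 11 is left untouched in this sketch.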
Fig.7 Transfer principle of configuration information in self-repairing process.
Fig.8 Connection diagram of assistant reroute module.
Cells in the cell array can connect the cells on the left with those on the right in the same row directly,and the upper and lower cells are connected with each other through the AR module which functions as a connection module between the rows.The bypass signal controls the connection switching between the upper and lower rows.Fig.8 shows a connection diagram of the AR module.Working cell 10 connects to cell 00 through AR0.Cell 11 is in a transparent state.Working cell 11 connects to working cell 01 by AR1 and AR2 due to the bypass signal of cell 11.
To verify the self-healing ability of the proposed fault-cell reutilization self-healing strategy,a 4-bit binary adder circuit was realized as an example on a twelve-electronic-cell array with a structure of 3×4.Working cells are settled in the left two columns of the array as shown in Fig.9,and the right two columns are treated as redundant spare cells.Xilinx ISE is used to synthesize the circuit and ISim for function simulation.
Fig. 10 shows the simulation results of the adder circuit, where x[3:0] and y[3:0] are the two addends, s[3:0] is the sum, and c is the carry. In this simulation, the fault signals are injected externally. In Fig. 10(a), two transient fault signals were injected into different cells (cell11_fault and cell10_fault), the transient characteristic being represented by a duration of one reconfiguration clock. In Fig. 10(b), two permanent faults with persistent signals were injected into different cells (cell11_fault and cell10_fault). The signals in the sequence chart have the following meanings: cellxx_fault is the fault signal of cell xx, and cellxx_configbits is the native configuration information of cell xx.
(1)Fig.10(a)shows the sequential chart of the self-repairing process for two transient faults.The details of the process are as follows:
Fig.9 Self-repairing process corresponding to self-healing strategy FCRSS for transient faults.
Fig.10 Simulation results of the FCRSS.
Fig.11 Self-repairing process corresponding to self-healing strategy FCRSS for permanent faults.
I. At 50 ns, working cell 11 is injected with a transient fault, and the corresponding cell array is shown in Fig. 9(a). In this situation, signal cell11_fault is "1" and the sum s[3:0] is in error during 50-55 ns. At 55 ns, spare cell 12 reconfigures itself with the backup configuration information and becomes the new working cell 11; cell12_configbits is rewritten to be the same as cell11_configbits. At 60 ns, cell 11 turns into the transparent state and the cell array is self-repaired, as shown in Fig. 9(b).
II.At 70 ns,working cell 10 is injected with another transient fault.In Fig.10(a),cell10_fault is set to‘‘1”and s[3:0]outputs the erroneous result during 70-75 ns.The transparent cell 11 will be reutilized and reconfigured to substitute the function of fault cell 10 at 75 ns,as shown in Fig.9(d).
(2) Fig. 10(b) shows the sequential chart of the self-repairing process for two permanent faults. The details of the process are as follows:
I.At 50 ns,working cell 11 is injected with a permanent fault,and the self-repairing process of the first fault in a row is the same as that of the transient fault.
II.At 70 ns,working cell 10 is injected with another permanent fault,as shown in Fig.11(c).At 75 ns,the transparent cell 11 is reconfigured completely, and cell11_configbits is rewritten.The self-repairing process of the array will be triggered after cell 11 has been reconfigured.Finally,the fault cells are replaced by spare cells 12 and 13 as shown in Fig.11(d),and the self-repairing process is not completed until 85 ns.
The Spartan3 XC3S400 chip was selected as an example to implement the cell array and analyze its performance. Table 1 compares the hardware resource consumption (according to the ISE results) of the FCRSS with that of the CESS, the most common strategy and the one whose module structure and principle are most similar to those of the FCRSS. The statistics cover slices, flip-flops and LUTs. For the 3×4 array, the FCRSS requires only a small increase in hardware resources, which are mainly used for the new Controller and for generating some cell status signals.
Table 1 Comparison of hardware resource consumption between two strategies.
We still take the same 3×4 cell array as an example.The new strategy FCRSS in this article needs one reconfiguration clock to replace fault cells with spare cells,which is the same as the traditional strategy CESS.However,the new self-healing strategy can reuse the fault cell by reconfiguring the transparent cell.If a transparent cell is a permanent fault cell which cannot be repaired by reconfiguration,the time consumption will increase by one reconfiguration clock for another elimination.Therefore,the self-repairing time is two reconfiguration clocks(as shown in Fig.9(b))because the permanent fault-cell would still be detected in failure after it has been reconfigured,and then the new elimination will be triggered.The general analysis of time consumption is presented in Section 5.2.
The CESS is the most common strategy in embryonics hardware,and most self-healing strategies are based on the optimization of it.There are comprehensive reliability models and optimal design research results of the CESS;therefore,to verify the effectiveness of the new self-healing strategy FCRSS,this paper gives its comparison results with the CESS in both reliability and time consumption.
An N×M cell array is considered to be working if a sub-array of size n×m is faultless, where N and n are the numbers of rows in the whole array and in the working cell array, and M and m are the corresponding numbers of columns (M ≥ m, N ≥ n). In the CESS, each row of the array is a k-out-of-m system and all columns of the array constitute another k-out-of-m system. To assess the dependability of a system in engineering, the Mean Time to Failure (MTTF) is generally used as the measure of system reliability [3]. Several new definitions in the reliability model of FCRSS are proposed as follows:
(1) Q is defined as the in-row repairable number, i.e., the number of repairable faults in a row. In the CESS, Q is a constant equal to M-m, while in the FCRSS it is not a constant and will be far larger than M-m. Because transient and permanent faults occur randomly in the cells, Q for the FCRSS is taken as the statistical average over 100 runs in the analysis.
(2) L is defined as the proportion of working cells in the row, so L=m/M.
(3) S is defined as the proportion of transient faults, calculated as the number of transient faults divided by the total number of faults (the sum of transient and permanent faults).
(4) Me is defined as the equivalent number of cells in a row. Q is equivalent to the number of spare cells in the calculation of reliability, and the number of working cells m remains unchanged; hence Me=m+Q.
In the new strategy FCRSS,in-row repairable number Q increases with the increase of transient fault-cells reutilized.Therefore,the reliability model should be modified as:
Hence,the row reliability Rr(t)will be calculated by Eq.(1).
The array reliability Ra(t)and array MTTF are shown in Eqs.(2)and(3):
where λ is a constant denoting the failure rate of cells, in units of 10^-6/h. In FCRSS, the repaired cells are treated the same as the normal working cells, and the same is true for their failure rates.
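A plausible form of Eqs. (1)-(3) can be written down from the definitions above, assuming the conventional k-out-of-n model with independent cells and exponentially distributed cell lifetimes. In this reading a row works while at least m of its M_e = m + Q equivalent cells are fault-free, and the array works while at least n of its N rows work. This is a reconstruction under those assumptions, not necessarily the exact expressions of the original derivation.

```latex
% Assumed k-out-of-n form; M_e = m + Q is the equivalent number of cells in a row.
R_r(t) = \sum_{i=m}^{M_e} \binom{M_e}{i}\, e^{-i\lambda t}\left(1 - e^{-\lambda t}\right)^{M_e - i} \tag{1}

R_a(t) = \sum_{j=n}^{N} \binom{N}{j}\, R_r(t)^{j}\left(1 - R_r(t)\right)^{N - j} \tag{2}

\mathrm{MTTF} = \int_{0}^{\infty} R_a(t)\,\mathrm{d}t \tag{3}
```

Setting Q = M - m recovers the corresponding model for the CESS, which is consistent with the statement that the two strategies coincide when no transient fault-cell is reused.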
Take M=N=100, m=n=80, λ=1×10^-6/h as an example. The changing trends of Q, array reliability and MTTF are calculated and analyzed in detail.
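For readers who want to reproduce trends of this kind numerically, the following sketch evaluates a k-out-of-n reliability model for the example parameters. It assumes that a row works while at least m of its m+Q equivalent cells survive, that the array works while at least n of N rows survive, and that cell lifetimes are exponential with rate λ. The function names and the trapezoidal integration are illustrative choices, and the resulting numbers are not claimed to match the published figures.

```python
from math import comb, exp

def row_reliability(t, m, q, lam):
    """Probability that at least m of the Me = m + q equivalent cells survive to time t."""
    me = m + q
    p = exp(-lam * t)  # survival probability of a single cell
    return sum(comb(me, i) * p**i * (1.0 - p)**(me - i) for i in range(m, me + 1))

def array_reliability(t, m, q, n, N, lam):
    """Probability that at least n of the N rows are still working at time t."""
    rr = row_reliability(t, m, q, lam)
    return sum(comb(N, j) * rr**j * (1.0 - rr)**(N - j) for j in range(n, N + 1))

def mttf(m, q, n, N, lam, t_max, steps=1000):
    """Trapezoidal estimate of the integral of Ra(t) over [0, t_max]."""
    h = t_max / steps
    total = 0.5 * (array_reliability(0.0, m, q, n, N, lam)
                   + array_reliability(t_max, m, q, n, N, lam))
    for k in range(1, steps):
        total += array_reliability(k * h, m, q, n, N, lam)
    return total * h

LAM = 1e-6  # failure rate per hour, as in the example
print(mttf(80, 20, 80, 100, LAM, t_max=1e6))  # Q = M - m = 20 (the CESS case)
print(mttf(80, 40, 80, 100, LAM, t_max=1e6))  # a larger Q, as reutilization allows
```

The upper limit `t_max` must be chosen large enough that Ra(t) has decayed to nearly zero; 10^6 h is sufficient for the two cases shown.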
(1)Analysis of Q variations
In the CESS, Q always equals M-m, while in the FCRSS, Q is a variable between M-m and M×(M-m) because of fault-cell reutilization. The value of Q is affected by the size of the array, the proportion of working cells L and the proportion of transient faults S.
I. Variation of Q with M and L when faults occur randomly in the cells, where S is 0.9 (i.e., 90% of the faults are transient).
Fig. 12(a) shows how Q changes with M at different values of L. To show the changing trends more clearly, M is taken as 200. As seen from these curves, Q is proportional to M when L is fixed. The larger M is, the more redundant spare cells there are in the row, and thus the greater the value of Q. When M is constant, an increase in L reduces the number of redundant spare cells, and hence Q decreases.
II. Variation of Q with S
As shown in Fig.12(b),the repairable number Q in the row grows faster and faster with the increase of the transient faults proportion S,and all the values are larger than those in the strategy CESS.In addition,if S is small,Q grows slowly;however,when S is large,Q rises with S significantly.Therefore,in those systems with large values of S,the repairable ability will be significantly improved in the new strategy.
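The dependence of Q on S can be explored with a small Monte Carlo experiment. The sketch below is not the authors' simulation: it assumes that faults hit a uniformly random working cell, that each fault is transient with probability S, that a faulty cell can only be replaced by the nearest spare or recoverable transparent cell on its right, and that Q is averaged over 100 runs. The names `repairable_count` and `average_q` are hypothetical.

```python
import random

def repairable_count(M, m, S, rng):
    """Number of faults one row can heal before no replacement exists on the right."""
    # "W" working, "S" spare, "T" transparent and reusable, "P" permanently faulty
    cells = ["W"] * m + ["S"] * (M - m)
    repairs = 0
    while True:
        working = [i for i, c in enumerate(cells) if c == "W"]
        pos = rng.choice(working)  # a fault hits a random working cell
        target = next((q for q in range(pos + 1, M) if cells[q] in ("S", "T")), None)
        if target is None:
            return repairs         # row elimination would be triggered
        cells[target] = "W"        # the spare or reused cell takes over
        cells[pos] = "T" if rng.random() < S else "P"
        repairs += 1

def average_q(M, m, S, trials=100, seed=0):
    """Average in-row repairable number over `trials` independent runs."""
    rng = random.Random(seed)
    return sum(repairable_count(M, m, S, rng) for _ in range(trials)) / trials

for s in (0.0, 0.5, 0.9):
    print(s, average_q(100, 80, s))
```

With S = 0 the model reduces to the CESS case and returns exactly M-m, because every repair consumes one spare cell. For S > 0 some transparent cells are reused and the average exceeds M-m.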
(2)Analysis of array reliability Ra(t)
Fig.13 shows the array reliability Ra(t)curves with different transient fault proportions S.As shown in the figure,the failure time(the starting time of Ra(t)<1)of the array is extended as S gradually increases,leading to synchronous increases of the array reliability.Moreover,both of them are larger than the values in strategy CESS where S is 0.
As S increases from 0 to 1, a growing share of the faults is transient and Q increases accordingly; consequently, both the array reliability Ra(t) and the failure-free time of the array increase.
Fig.14 shows the variation curves of MTTF with different proportions of transient faults S.As can be seen,when S is 0,the MTTF of the two self-healing strategies are the same.With the increase of S,the MTTF of strategy FCRSS increases gradually,and the larger S is,the faster the MTTF increases,while the MTTF of CESS is a constant.When S is greater than 0,the MTTF of the FCRSS is superior to that of the CESS;therefore,as long as there is a transient fault,the new strategy can achieve higher reliability,and the larger S is,the more obvious the increase of the MTTF.
The self-healing strategy FCRSS consumes one more clock for the permanent fault cells, since these cells would still be detected in failure after they have been reconfigured,while the cell elimination strategy CESS consumes only one clock during the self-repairing process.To quantify the time consumption of the new self-healing strategy,we still take the array of M=N=100 and m=n=80 as the example for analysis.
(1)Defining T as the time required to repair a fault in a cell.
(2) Defining Tc as the reconfiguration time of the strategy CESS; Tc is a constant.
(3) Defining Tn as the average reconfiguration time of the new self-healing strategy FCRSS. The reconfiguration time of a cell is a variable, which depends on the location at which the fault cell occurs.
As shown in Fig. 15, Tc of the strategy CESS is a constant, equal to 1. Five curves are used to represent Tn. It can be seen that Tn is much larger than Tc when S is small, while Tn decreases gradually as S increases, and the trend is more obvious when S > 0.7. Tn and Tc are equal when S=1. A large value of S means that most of the faulty cells are suffering transient faults, so the extra reconfiguration time consumed by the new self-healing strategy is reduced.
Fig.12 In-row repairable number curve.
Fig.13 Array reliability variation changes with different S.
Fig.14 MTTF variation changes with S in different strategies.
Fig. 15 shows five different cases of Tn, which differ in the limit on the number Q of cells that can be repaired inside the same row: Q ≤ 20, Q ≤ 30, Q ≤ 40, Q ≤ 60, and unrestricted Q (Q tends to be greater than 60 in this situation). In all five cases, if Q is smaller than the threshold value, it takes its actual value. From the viewpoint of design optimization, a smaller time consumption Tn is better; therefore, the curve obtained when Q ≤ 20 is the best result, which means that the time consumption is much smaller when repairing the earlier faults (the first 20 faults). This shows that the advantages of the new strategy FCRSS are more obvious in repairing earlier faults.
Fig.15 Average reconfiguration time changes with S.
To capture the combined influence of time consumption and reliability on the optimal design of the array, a normalization analysis was applied to the time consumption T, the in-row repairable number Q, and the reliability measure MTTF.
(1) Defining Pt as the ratio of the reconfiguration times of the two strategies: Pt=Tn/Tc.
(2) Defining Pr as the ratio of the in-row repairable numbers of the two strategies: Pr=Q/(M-m), where (M-m) is the in-row repairable number of the strategy CESS.
(3) Defining Pm as the ratio of the MTTFs of the two strategies: Pm=MTTF_FCRSS/MTTF_CESS.
In Fig. 16, the curves of Pr, Pm and Pt are the normalized results of Figs. 12(b), 14, and 15. Pr and Pm are favorable factors, so the larger they are, the better; Pt is a cost, so the smaller it is, the better. As seen from the figures, Pr and Pm increase with S, while Pt changes in the opposite direction. The intersection points distinguish the conditions under which the new strategy FCRSS is applicable. For values of S beyond the intersection points, Pr and Pm are higher than Pt, which means that the performance gain of the array outweighs the cost. The values of S at the intersections are 0.72 (for Pr) and 0.74 (for Pm) when Q ≤ 20, and 0.9 (for Pr) and 0.94 (for Pm) when Q takes the maximum value. Therefore, when only the early faults are to be repaired, the new strategy already achieves better results at a lower threshold value of S.
Fig.16 Comparison diagram of reconfiguration time proportion, in-row repairable number proportion, and MTTF proportion.
In the space environment, the proportion of transient faults is larger than 0.9 with high probability. Under such conditions, the FCRSS pays a small amount of extra reconfiguration time to obtain a much higher hardware utilization rate and reliability than the strategy CESS.
(1)This paper describes the principle of a new Fault-Cell Reutilization Self-healing Strategy(FCRSS)and gives the circuit design methods of all the cell modules.The effectiveness of the new self-healing strategy is verified by the circuit design,simulation and analysis of a 4-bit adder with a 3×4 cell array.The simulation results show that the cell array can implement the logic function correctly and accomplish the transient fault-cell reutilization.
(2)A new reliability model has been presented according to the characteristics of the new self-healing strategy,which takes the effect of transient fault-cells into consideration.
(3)Reliability analysis demonstrates that,compared with the representative self-healing strategy CESS,the new self-healing strategy FCRSS can increase the hardware utilization rate and system reliability at the expense of a small amount of hardware and time consumption.Particularly in space where the proportion of transient faults is greater than 0.9,the advantage of a high hardware utilization rate and reliability will be even more pronounced.
(4) Future work to improve the FCRSS will include reducing the reconfiguration time, exploring better structures for the configuration register, carrying out more in-depth analyses of reliability and time consumption for different fault types, array sizes and circuit functions, and summarizing more comprehensive design guidance methods. Particular emphasis will be placed on applying the self-healing design to commercial FPGAs.
Acknowledgements
This study is co-supported by the National Natural Science Foundation of China(Nos.61202001,61402226)and the Fundamental Research Funds for the Central Universities of NUAA of China(Nos.NS2018026,NS2012024).
Chinese Journal of Aeronautics, 2019, Issue 7