

        SA-RSR: a read-optimal data recovery strategy for XOR-coded distributed storage systems


        Xingjun ZHANG, Ningjing LIANG, Yunfei LIU, Changjiang ZHANG, Yang LI

        1School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China

        2Beijing Electronic Engineering General Research Institute, Beijing 100854, China

        E-mail: xjzhang@xjtu.edu.cn; l_ningjing@stu.xjtu.edu.cn; liuyunfei@stu.xjtu.edu.cn; zcj9527@stu.xjtu.edu.cn

        Received May 16, 2021; Revision accepted Aug. 16, 2021; Crosschecked Feb. 28, 2022

        Abstract: To ensure the reliability and availability of data, redundancy strategies are always required for distributed storage systems. Erasure coding, one of the representative redundancy strategies, has the advantage of low storage overhead, which facilitates its employment in distributed storage systems. Among the various erasure coding schemes, XOR-based erasure codes are becoming popular due to their high computing speed. When a single-node failure occurs in such coding schemes, a process called data recovery takes place to retrieve the failed node’s lost data from surviving nodes. However, data transmission during the data recovery process usually requires a considerable amount of time. Current research has focused mainly on reducing the amount of data needed for data recovery to reduce the time required for data transmission, but it has encountered problems such as significant complexity and local optima. In this paper, we propose a random search recovery algorithm, named SA-RSR, to speed up single-node failure recovery of XOR-based erasure codes. SA-RSR uses a simulated annealing technique to search for an optimal recovery solution that reads and transmits a minimum amount of data. In addition, this search process can be done in polynomial time. We evaluate SA-RSR with a variety of XOR-based erasure codes in simulations and in a real storage system, Ceph. Experimental results in Ceph show that SA-RSR reduces the amount of data required for recovery by up to 30.0% and improves the performance of data recovery by up to 20.36% compared to the conventional recovery method.

        Key words: Distributed storage system; Data reliability and availability; XOR-based erasure codes; Single-node failure; Data recovery

        1 Introduction

        With the rapid development of cloud computing and big data in recent years, distributed storage systems, such as Google’s GFS (Ghemawat et al., 2003), HDFS (Borthakur, 2007), Microsoft’s WAS (Calder et al., 2011), open source Ceph (Weil et al., 2006a), and Swift (Arnold, 2014), have been extensively applied. Considering the low reliability of the cheap commercial equipment in these systems and that storage nodes are prone to failure, some redundancy strategies are indispensable. Replication is a traditional redundancy technology and has been widely exploited in various storage systems. Replication places multiple copies of data on different nodes (in this paper, “node” refers to an independent failure domain, which can be a disk or a storage node), which is beneficial for data recovery because only one replica needs to be read and transmitted when a single-node failure occurs. However, replication consumes a massive amount of extra storage space.

        A redundancy strategy based on erasure coding has been proposed to save storage resources. In general, an erasure code is represented by (k, m), denoting that the original file is divided into k equally-sized data chunks and that m parity chunks are calculated from these k data chunks. Compared to replication, erasure codes usually have lower storage overhead under the same fault tolerance conditions. For example, Reed-Solomon (RS) code (4, 2) (Reed and Solomon, 1960) and 3-replication can both tolerate any two-node failure, while the former consumes half the storage space of the latter. Therefore, enterprises are gradually transferring to the use of erasure codes to provide efficient and reliable data storage services. Maximum distance separable (MDS) codes are especially popular for use in real storage systems, because they optimally balance fault tolerance and storage overhead.

        RS code, which tolerates multiple node failures, is a representative MDS code and has been deployed in various storage systems, e.g., Google’s GFS (Ghemawat et al., 2003) and Facebook’s HDFS storage clusters (Facebook, 2018). Furthermore, array codes (Blaum et al., 1996; Tamo et al., 2011, 2013; Gad et al., 2013; Hou et al., 2019a; Hou and Lee, 2020) have been studied extensively, where each chunk is a column of elements in a two-dimensional array. Binary MDS array codes with node elements in GF(2) are an important family of array codes. Sometimes these codes are also called XOR-based codes, because they perform only binary XOR operations during encoding and decoding procedures, which can be efficiently implemented in hardware and software. Therefore, XOR-based codes are more efficient than RS code in terms of computation complexity. More importantly, recovery can work at symbol granularity for XOR-based codes (array elements are also referred to as “symbols” for representation convenience). Cauchy RS (CRS) code (Roth and Lempel, 1989) has all the advantages of RS code and performs only XOR operations. Two-fault-tolerant erasure codes, known as RAID-6 codes, have received more attention in recent decades, such as EVENODD (Blaum et al., 1995), RDP (Corbett et al., 2004), Blaum_Roth (Blaum and Roth, 1993), Liberation code (Plank, 2008), Liber8Tion code (Plank, 2009), X-Code (Xu LH and Bruck, 1999), H-Code (Wu et al., 2011), P-Code (Jin et al., 2009), and HV Code (Shen and Shu, 2014). X-Code, H-Code, P-Code, and HV Code are vertical codes, which are seldom applied in real storage systems due to the complex rules for placement of parity chunks. Furthermore, triple-fault-tolerant XOR-based erasure codes, e.g., STAR (Huang and Xu, 2008), RTP (Goel and Corbett, 2012), and TIP (Zhang et al., 2015), have also been presented to offer higher reliability.

        MDS codes can greatly reduce the consumption of storage space; however, their recovery performance is far poorer than that of replication. For example, in a (k, m) MDS coded system, k chunks must be retrieved from surviving nodes to reconstruct any single failed chunk, while in a triple-replicated system, a failed chunk can be recovered by downloading any single surviving replica. This k-fold amount of data to be read and transmitted results in a long recovery time, which may dramatically affect system service performance. Single-node failures in distributed storage systems account for more than 99.75% of all node failures (Schroeder and Gibson, 2007). Therefore, current research aims to optimize the data recovery efficiency of a single-node failure.

        Two optimization ideas are available to optimize data recovery of an erasure-coded storage system. The first idea is to design new erasure codes with a low recovery cost, which can theoretically reduce the amount of data required during recovery. Such codes include local reconstruction code (Huang et al., 2012) deployed by Microsoft in the WAS cloud storage system, Facebook’s locally repairable code (Sathiamoorthy et al., 2013) applied in the HDFS-Xorbas system, and shingled erasure code (SHEC) (Miyamae et al., 2014; RedHat, 2018) exploited in the Ceph system. For these codes, the improved recovery performance is obtained at the expense of extra storage space, which is not desired in real systems. In contrast, regenerating codes (RGCs) (Jiekak et al., 2013; Hou et al., 2019b, 2019c; Ye et al., 2020) are erasure codes developed from network coding, which can greatly reduce repair bandwidth without increasing storage overhead. However, they are rarely implemented in real systems because they lack several key desirable properties that are crucial for practical systems (Pamies-Juarez et al., 2016; Vajha et al., 2018).

        Another idea is to optimize the repair processes of existing codes rather than trying to construct new codes. Some techniques that reduce the amount of data required have emerged for existing MDS codes, especially for XOR-based codes. For example, the minimum amount of data needed by RDP, EVENODD, and X-Code in recovering a single-node failure has been derived (Wang et al., 2010; Xiang et al., 2011), but these conclusions and recovery methods cannot be extended to other erasure codes because the codes have inherently different structures. Some general optimization algorithms have been proposed to recover any MDS XOR-based code, but these approaches are highly complex or easily fall into a local optimum solution. In addition, none of these techniques has been tested on real-world storage systems. To overcome these shortcomings, we focus on the issue of optimal recovery for any MDS XOR-based erasure code, and implement the read-optimal recovery approach in a real-world storage system to demonstrate its efficiency.

        2 Background and related works

        2.1 Background: XOR-based erasure codes

        We consider a distributed storage system consisting of n nodes, which are partitioned into k nodes that store data and m (m = n − k) nodes that store parity information. The data nodes are denoted as D0, D1, ..., D_{k−1}, and the parity nodes are represented as P0, P1, ..., P_{m−1}. A file written to the erasure-coded storage system is first divided into multiple data blocks. Each data block can be encoded simultaneously to obtain k data chunks and m parity chunks. The resultant set of n chunks is called a stripe. For XOR-based erasure codes, a chunk is composed of w symbols. The ith symbol in the jth data chunk of a stripe is labeled as d_{wj+i}, where w is the number of symbols in each chunk. Similarly, the ith symbol in the jth parity chunk is labeled as p_{wj+i}. The size of a symbol is typically a multiple of the sector size (e.g., 4096 bytes) and depends on the specific storage system implementation. An XOR-based erasure code can also be parameterized as (k, m, w). In storage systems, the encoding and decoding processes of each stripe are independent of each other, so we consider just a single stripe.

        Fig. 1 depicts an example stripe of one XOR-based erasure code. In Fig. 1, the stripe comprises k = 4 data chunks and m = 2 parity chunks. Each chunk consists of w = 3 symbols. Symbols d0–d2 are data symbols in the 0th data chunk belonging to data node D0, while p0–p2 are parity symbols in the 0th parity chunk belonging to parity node P0. For the sake of discussion, the notations of XOR-based erasure codes are listed in Table 1.

        Table 1 Notations used in this paper

        An XOR-based erasure code can be represented by the product of a bit matrix and a vector. The bit matrix is called a generator matrix in coding theory and can be divided into two parts. The first part is a kw × kw identity matrix, which can be considered to be a k × k matrix whose elements are w × w bit matrices. The second part, known as the coding distribution matrix (CDM), is an mw × kw matrix and determines the generation of parities. Specifically, parity symbol p_i (0 ≤ i ≤ mw − 1) is calculated from the data symbols whose corresponding columns are non-zero in the ith row of the CDM. Therefore, the CDM is able to define a unique XOR-based erasure code.
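        To make the CDM-to-parity mapping concrete, the following sketch encodes data symbols by XOR-summing along CDM rows. The (k, m, w) values and CDM bits are a toy example invented for illustration; they are not the paper’s CRS matrix.

```python
# Toy CDM-driven encoder: parity symbol p_i is the XOR of the data symbols
# whose columns are 1 in row i of the CDM.  (k=2, m=1, w=2 is assumed here.)
k, m, w = 2, 1, 2
CDM = [
    [1, 0, 1, 0],   # p0 = d0 ^ d2
    [0, 1, 1, 1],   # p1 = d1 ^ d2 ^ d3
]

def encode(data, cdm):
    parity = []
    for row in cdm:
        p = 0
        for j, bit in enumerate(row):
            if bit:
                p ^= data[j]       # XOR-accumulate the selected data symbols
        parity.append(p)
    return parity

data = [0b1010, 0b0110, 0b1111, 0b0001]    # kw = 4 data symbols
print(encode(data, CDM))                    # [5, 8]
```

        Each parity row thus costs only a handful of XORs, which is why XOR-based codes encode and decode so cheaply.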

        Fig. 2 shows the CRS code encoding process with k = 4, m = 2, and w = 3. In Fig. 2, the encoding equations representing the generation of all parity symbols are as follows:

        Fig. 1 A stripe of one XOR-based erasure code when k = 4, m = 2, and w = 3

        Fig. 2 Encoding process of CRS (4, 2, 3)

        We can observe that if only one symbol in one of the equations in Eq. (1) fails, the XOR of the other surviving symbols can reconstruct this failed symbol, and more than one equation is available for recovering the same symbol (e.g., d0 = p0 ⊕ d3 ⊕ d6 ⊕ d9 and d0 = p3 ⊕ d5 ⊕ d6 ⊕ d7 ⊕ d10 when d0 fails).
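        The reconstruction step itself is a single XOR fold over the surviving members of a recovery equation. The sketch below mirrors the first equation quoted above for d0; the symbol values are hypothetical, chosen only so the equation holds.

```python
from functools import reduce

def recover(survivors):
    """A recovery equation XORs to zero, so a lost symbol equals the XOR
    of the equation's surviving members."""
    return reduce(lambda a, b: a ^ b, survivors)

# Hypothetical symbol values; p0 is built so that d0 ^ d3 ^ d6 ^ d9 ^ p0 = 0,
# mirroring the equation d0 = p0 XOR d3 XOR d6 XOR d9 quoted in the text.
d0, d3, d6, d9 = 0x78, 0x12, 0x34, 0x56
p0 = d0 ^ d3 ^ d6 ^ d9
print(hex(recover([p0, d3, d6, d9])))   # 0x78, i.e., the lost d0
```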

        We focus on recovering failed symbols using the least amount of data instead of constructing a new erasure code. Our recovery approach can be applied to any XOR-based erasure code, and we conduct our experiments on three representative erasure codes, that is, CRS code, Blaum_Roth code, and Liber8Tion code. These codes have been implemented in the open-source library Jerasure-2.0, and readers can consult the manual (Pamies-Juarez et al., 2016) for details. CRS code is constructed from a Cauchy matrix, allows flexible configuration of (k, m, w), and has a dense generator matrix. Blaum_Roth code and Liber8Tion code (w = 8) are minimum density codes for RAID-6. Examples of CDMs for Blaum_Roth code and Liber8Tion code are shown in Fig. 3.

        Fig. 3 CDMs for Blaum_Roth code (4, 2, 4) (a) and Liber8Tion code (4, 2, 8) (b) when k = 4 and m = 2

        2.2 Related works

        2.2.1 Conventional recovery

        For storage systems that employ XOR-based erasure codes, conventional recovery methods use the first surviving parity node to recover the failed data node. Assume that the Fth data node fails and that all other nodes are alive. The ith failed symbol d_{Fw+i} can be recovered by XOR-summing the ith parity symbol of P0 and all the other surviving data symbols corresponding to the non-zero columns in the ith row of the CDM. Therefore, recovering each erased data symbol requires that k symbols be read for any XOR-based erasure code.

        If a parity node fails, the system simply performs a new encoding process to reconstruct the erased parity symbols. To the best of our knowledge, no research indicates that the number of symbols to be read can be further reduced when a parity node fails. Therefore, existing optimization methods focus mainly on data node failures, and so does our approach.

        2.2.2 Hybrid recovery

        Xiang et al. (2011) first investigated the optimal recovery problem related to a single-node failure for RDP code (Corbett et al., 2004). They derived the lower bound on the number of symbols needed for any single failure recovery of the RDP code and proposed an efficient recovery method. Wang et al. (2010) conducted related and independent studies for EVENODD, X-Code, and STAR code. The authors demonstrated that the lower bound on the number of symbols for X-Code is 3p² − 2p + 5. Later, the tight lower bound on the number of symbols for X-Code was proved by Xu SL et al. (2014), and load-balancing recovery among different nodes was considered. Liang et al. (2020) extended this idea to Liberation codes and evaluated their recovery scheme on a real distributed storage system. All the studies mentioned here follow the idea of hybrid recovery from Xiang et al. (2011), where symbols on different parity nodes are interchangeably employed to reconstruct the failed data symbols. Most importantly, this idea is fundamental to other optimization approaches, e.g., the enumeration algorithm (Khan et al., 2012) and the hill-climbing algorithm (Zhu et al., 2014).

        Example 1 The CDM of RDP (4, 2, 4) is shown in Fig. 4. RDP code is also represented by a (p − 1) × (p + 1) two-dimensional array, where w = p − 1, n = p + 1, and p is a prime. A two-dimensional array of RDP (4, 2, 4) is plotted in Fig. 5. In Fig. 5, the parity symbols are generated by the symbols with the same color and shape. We take this RDP code as an example to illustrate how conventional recovery and hybrid recovery work. Fig. 6 shows the symbols used in these two approaches. In Fig. 6a, the erased symbols are labeled “×,” and the surviving symbols to be read are marked with “○.” Conventional recovery uses parity node P0 to recover the symbols on D1, so all surviving data symbols and the parity symbols on P0 are read, and the total number of symbols read is wk = 16. In Fig. 6b, the symbols read via parity nodes P0 and P1 are labeled “○” and “□,” respectively. A symbol that is read by more than one equation to recover failed symbols is referred to as an overlapping symbol. Overlapping symbols are marked using both “○” and “□.” One of the optimal recovery schemes to recover D1 with hybrid recovery is as follows:

        Fig. 4 CDM of RDP code when k = 4, m = 2, and w = 4

        Fig. 5 Encoding of RDP code when k = 4, m = 2, and w = 4: (a) P0 parity layout; (b) P1 parity layout

        Fig. 6 The conventional (a) and hybrid (b) recoveries for RDP code when k = 4, m = 2, and w = 4

        The number of symbols read is wk − n_overlap = 12, where n_overlap is the number of overlapping symbols.

        The core idea of hybrid recovery is to maximize the number of overlapping symbols.

        This approach has the advantages of determining the optimal amount of data to read in advance and providing an optimal recovery scheme quickly. However, the lower bound on the number of symbols read has been deduced for only a few XOR-based erasure codes with regular structures, such as RDP, EVENODD, and X-Code. It is difficult to extend the optimal recovery conclusions of hybrid recovery to other codes, especially erasure codes with m ≥ 3.

        2.2.3 Enumeration recovery

        To solve the optimal recovery problem of general XOR-based erasure codes, Khan et al. (2012) suggested modeling the problem of minimizing the number of symbols read as a combinatorial optimization problem and obtained the globally optimal solution by enumeration.

        For the sake of description, the notion of a recovery equation is defined as follows:

        Definition 1 (Recovery equation) A set of symbols that are XOR-summed to a zero vector is referred to as a recovery equation.

        According to the definition of the recovery equation, any single symbol in a recovery equation can be reconstructed as long as the remaining symbols survive.

        Example 2 We use CRS (4, 2, 3) in Fig. 2 to explain the recovery of the enumeration method. Let F contain all the failed symbols. Each symbol f_i ∈ F owns a recovery equation set E_i. An equation e belongs to E_i only if e ∩ F = {f_i}. This means that any recovery equation in E_i can recover f_i because its remaining symbols survive. For example, we assume that D1 fails, so F = {d3, d4, d5}, and E1 = {e0, e1} is the recovery equation set of f1 = d4, where e0 = {p1, d1, d4, d7, d10} and e1 = {p5, d2, d4, d6, d9, d11}. Suppose that we can list all recovery equations for E0–E_{w−1}. The problem is formulated as follows: select one equation from each E_i to recover f_i and minimize the number of symbols in the union of the selected equations. In fact, the number of recovery equations is greater than mw, because the sum of two or more recovery equations in Eq. (1) is still zero; e.g., {d0, d1, d4, d5, d6, p1, p3} is also a recovery equation, obtained by adding the second and fourth equations in Eq. (1). There are up to 2^{mw} − 1 candidate equations, so this approach is time-consuming, especially for large m or w.
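        The selection problem can be sketched as follows. E[0] holds the one equation for d3 implied by the p0 equation quoted in Section 2.1, and E[1] holds e0 and e1 from this example; d5 and the extra equations formed by XOR-summing others are omitted to keep the search space small.

```python
from itertools import product

# Enumeration recovery sketch: choose one equation per failed symbol so that
# the union of symbols read (equation members minus failed symbols) is minimal.
E = [
    [{"p0", "d0", "d3", "d6", "d9"}],                # equations recovering d3
    [{"p1", "d1", "d4", "d7", "d10"},                # e0, recovering d4
     {"p5", "d2", "d4", "d6", "d9", "d11"}],         # e1, also recovering d4
]
failed = {"d3", "d4"}

best = None
for choice in product(*E):                           # try every combination
    read = set().union(*choice) - failed             # symbols actually read
    if best is None or len(read) < len(best):
        best = read
print(len(best))   # 7 symbols read (p0, p5, d0, d2, d6, d9, d11)
```

        With all 2^{mw} − 1 candidate equations included, `product` would explode combinatorially, which is exactly the cost the text describes.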

        2.2.4 Hill-climbing recovery

        Zhu et al. (2014) employed a heuristic algorithm to tackle the problem mentioned above. First, the algorithm reduces the search space from 2^{mw} − 1 equations to the mw recovery equations in the CDM. Then, it exploits a hill-climbing technique to search for an approximately optimal solution.

        Example 3 Using CRS (4, 2, 3) in Fig. 2, we next explain the principle of hill-climbing recovery. Assume that D1 fails. Let S_i be the set of parity symbols from parity node P_i, X be the set of w symbols currently used to recover the failed symbols, N_s be the number of symbols currently read, and Y be the collection of symbols available to replace the elements in X. Initially, S0 = {p0, p1, p2}, S1 = {p3, p4, p5}, X = S0, N_s = 12, and Y = S1. Then the algorithm starts to update X. In every update step, it tries to use y_j ∈ Y (0 ≤ j < 3) to replace one element x_i ∈ X (0 ≤ i < 3); if the replacement is valid (i.e., all failed symbols can still be recovered) and N_s is reduced, it conducts the replacement; otherwise, it does not. In the first update, supposing that we replace p1 by p5 and remove p5 from Y, we obtain X = {p0, p5, p2} and Y = {p3, p4}. We have N_s = 10 after this replacement. Next, in the second update, we try to replace p2 by p3, but N_s cannot be reduced further, so we stop the iteration process. Finally, the searched parity symbol set is {p0, p5, p2}.

        From the analysis above, we can see that the complexity of hill-climbing recovery is greatly reduced compared to enumeration recovery. However, the algorithm omits many important recovery equations and easily falls into a local optimum solution.
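        A minimal sketch of the greedy replacement loop follows. The per-parity-symbol read sets and the candidate pool are assumed values chosen to echo this example, not the actual CRS (4, 2, 3) equations.

```python
# Hill-climbing sketch: keep one parity symbol per failed symbol and greedily
# swap in alternatives while the number of symbols read decreases.
# Each set lists the *surviving* data symbols that equation reads (assumed).
eqs = {
    "p0": {"d0", "d6", "d9"},           # repairs d3
    "p1": {"d1", "d7", "d10"},          # repairs d4
    "p2": {"d2", "d8", "d11"},          # repairs d5
    "p5": {"d2", "d6", "d9", "d11"},    # also repairs d4
}

def cost(choice):
    read = set(choice.values())          # parity symbols read
    for p in choice.values():
        read |= eqs[p]                   # plus surviving data symbols
    return len(read)

choice = {"d3": "p0", "d4": "p1", "d5": "p2"}   # start from P0's symbols
candidates = {"d4": ["p5"]}                     # alternative repairs (assumed)
improved = True
while improved:
    improved = False
    for f, alts in candidates.items():
        for p in alts:
            trial = dict(choice, **{f: p})
            # accept only valid (no reused parity symbol) improving swaps
            if len(set(trial.values())) == len(trial) and cost(trial) < cost(choice):
                choice, improved = trial, True
print(cost(choice))   # 9
```

        Because the loop only ever accepts strictly improving swaps, it stops at the first plateau it reaches, which is precisely the local-optimum weakness discussed above.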


        3 Motivation

        In general, enumeration recovery has an extremely high time complexity, because minimizing the number of symbols to be read for a single data-node failure of XOR-based erasure codes is an NP-hard problem (Khan et al., 2012; Zhu et al., 2014). For an erasure code (k, m, w), the enumeration algorithm selects w from 2^{mw} − 1 recovery equations and guarantees that the total number of symbols required for recovery is minimum. Thus, the maximum search space of the enumeration algorithm can be as large as C(2^{mw} − 1, w) combinations. Therefore, it is not practicable to apply the enumeration algorithm to real-world storage systems.

        Hill-climbing recovery can obtain a suboptimal symbol-reading scheme in polynomial time, and its time complexity is O(m × w³). However, the algorithm omits several important recovery equations that generate the optimal solutions and adopts a greedy strategy during its execution, which has one major shortcoming: it easily falls into a local optimum solution.

        In this study, we propose a random search recovery algorithm based on simulated annealing (SA). The algorithm has low time complexity and overcomes the problem of being easily trapped in local optima during the search process.

        4 Detailed design of SA-RSR

        Based on this motivation, we propose a random search recovery algorithm named SA-RSR, which can be decomposed into two steps: parity symbol grouping and recovery solution search.

        4.1 Parity symbol grouping

        SA-RSR draws on the idea of grouping the recovery equations used in the enumeration algorithm. SA-RSR significantly reduces time complexity, which facilitates its implementation in real storage systems.

        It is easily observed from Eq. (1) that each recovery equation contains only one parity symbol. Hence, a parity symbol can represent a recovery equation, and grouping parity symbols is equivalent to grouping the corresponding recovery equations. Parity symbol grouping partitions the mw parity symbols into w groups, where the parity symbols in one group participate in the recovery of the same failed data symbol. Details of parity symbol grouping are shown in Algorithm 1. In summary, the goal of SA-RSR is to select w symbols from mw parity symbols to recover the w failed data symbols.

        To facilitate the analysis, the recovery group is defined as follows:

        Definition 2 (Recovery group) The collection of parity symbols whose corresponding recovery equations can be used to recover the same failed data symbol is known as a recovery group.

        AddToG(g, d_j, p_i): adds the parity symbol p_i to the recovery group g_{j−Fw} ∈ G, indicating that the recovery equation corresponding to the parity symbol p_i can be used to recover data symbol d_j.

        We initialize G to NULL (step 1), traverse the rows of the CDM M (step 2), and check whether the entry in the columns corresponding to the failed data symbols on each row is one (steps 3 and 4). If [M]_{i,j} = 1, the recovery equation corresponding to the ith parity symbol p_i can recover the jth failed data symbol d_j, and p_i is added to the recovery group g_{j−Fw} (step 5).

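        A sketch of Algorithm 1’s traversal, under an assumed toy CDM (k = 2, m = 2, w = 2) rather than the paper’s CRS matrix:

```python
# Parity symbol grouping sketch: scan the CDM and put parity symbol p_i into
# group g_{j - Fw} whenever column j of row i is 1 for a failed data symbol d_j.
k, m, w = 2, 2, 2
CDM = [
    [1, 0, 1, 0],   # p0 touches d0, d2
    [0, 1, 0, 1],   # p1 touches d1, d3
    [1, 1, 0, 1],   # p2 touches d0, d1, d3
    [0, 1, 1, 0],   # p3 touches d1, d2
]
F = 0                                   # data node D0 fails -> d0, d1 lost

def group_parity_symbols(cdm, F, w):
    G = [[] for _ in range(w)]          # one recovery group per failed symbol
    for i, row in enumerate(cdm):       # one row per parity symbol p_i
        for j in range(F * w, (F + 1) * w):
            if row[j] == 1:
                G[j - F * w].append(f"p{i}")
    return G

print(group_parity_symbols(CDM, F, w))  # [['p0', 'p2'], ['p1', 'p2', 'p3']]
```

        The double loop over mw rows and w failed columns is what gives Algorithm 1 its O(m × w × w) cost, analyzed in Section 4.3.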

        Example 4 We take Fig. 2 as an example. When data node D0 fails, three data symbols are erased, namely, d0, d1, and d2. For the row that corresponds to the parity symbol p0, we note that [M]_{0,0} = 1, which indicates that the recovery equation corresponding to p0 can recover the failed data symbol d0. Thus, we can add p0 to g0. Other parity symbols are processed in the same way. After parity symbol grouping, we obtain the following results:

        Furthermore, a parity symbol may belong to multiple groups simultaneously. For example, when data node D1 fails, data symbols d3, d4, and d5 are erased. p4 is simultaneously added to g0 and g2 because [M]_{4,3} = 1 and [M]_{4,5} = 1. The final recovery groups are as follows:

        If one parity symbol is selected per group but the same symbol is chosen more than once, some failed data symbols may not be recovered. For example, if p4, p1, and p4 are chosen from g0, g1, and g2 respectively, d3 and d5 will be unrecoverable, because two equations in Eq. (1) are not enough to uniquely solve three unknown variables. SA-RSR addresses this scenario by ensuring that one parity symbol appears at most once in the current search procedure.

        4.2 Recovery solution search

        After the grouping operation of Algorithm 1, we obtain w recovery groups, and the corresponding recovery equations in each group are responsible for repairing a specific failed data symbol. For example, in Eq. (2), g0 consists of p0 and p3, and their corresponding recovery equations can accomplish the recovery of data symbol d0.

        Definition 3 (Recovery solution) A set of w parity symbols that are selected from distinct recovery groups and whose corresponding recovery equations can repair all the failed symbols is known as a recovery solution Gr.

        Example 5 According to Eq. (2), if we select p0 from g0, p1 from g1, and p2 from g2 to recover d0, d1, and d2 respectively, we obtain the recovery solution Gr = {p0, p1, p2}. Gr is obtained by selecting one element from each g_i (0 ≤ i < w) to recover all failed symbols.

        The recovery solution search problem can be considered as a combinatorial optimization problem. The set of feasible solutions becomes enormous once the erasure code parameters grow even slightly. Let us use CRS (10, 4, 16) as an example; it has the same k and m parameter settings as the RS code used in HDFS-RAID at Facebook (Facebook, 2018). There are, on average, about 2.43 × 10^20 alternative recovery solutions when one data node fails for CRS (10, 4, 16), and the number increases to 2.53 × 10^23 if m increases from 4 to 5. Therefore, it is difficult to obtain the best solution in such a huge search space with a brute-force method. We instead choose an SA-based technique; SA is a stochastic optimization algorithm based on a Monte Carlo iteration strategy. Compared to the hill-climbing technique, accepting a deteriorating solution with a certain probability, and decreasing that probability over time, helps SA step out of stagnation effectively.

        Before presenting our detailed algorithm, we introduce the notion of a recovery sequence and provide a sketch of our algorithm. A recovery sequence consists of two parts: the second part, having mw elements, indicates which of the mw parity symbols is selected to recover the given failed symbol, and the first part, having kw elements, reveals the surviving data symbols that participate in the recovery of the failed symbol.

        Definition 4 Define a recovery sequence as R = {r0, r1, ..., r_{(k+m)w−1}}. If parity symbol p_i is selected to recover a failed symbol, then r_{kw+i} = 1, and r_j = 1 (j ∈ {e0, e1, ..., el}) for the surviving data symbols {d_{e0}, d_{e1}, ..., d_{el}} that belong to the recovery equation corresponding to the parity symbol p_i; otherwise, r_j = 0 (j ∉ {e0, e1, ..., el} ∪ {kw + i}).

        Example 6 Taking Eq. (2) as an example, if p3, p1, and p2 are chosen to recover d0, d1, and d2 respectively, the recovery sequences are as follows:

        We use S to represent a symbol-reading scheme to recover all failed data symbols. S is calculated using the following equation:

        S = R0 | R1 | ··· | R_{w−1}, (5)

        where “|” is the OR operation. If the jth (0 ≤ j < (k + m)w) position in the ith (0 ≤ i < w) recovery sequence R_i is 1, S_j is set to 1. This indicates that we need to read data symbol d_j if S_j = 1 (0 ≤ j < kw) and read parity symbol p_{j−kw} if S_j = 1 (kw ≤ j < (k + m)w). Based on Eq. (5), our objective function is to minimize the number of 1s in S:

        minimize Σ_{j=0}^{(k+m)w−1} S_j. (6)
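        The OR-combination of Eq. (5) and the objective can be sketched directly. The two recovery sequences below are hypothetical, with (k + m)w = 6 positions.

```python
# Sketch of Eq. (5): the symbol-reading scheme S is the bitwise OR of the w
# recovery sequences; the objective is the number of 1s in S.
R = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
]

def reading_scheme(sequences):
    S = [0] * len(sequences[0])
    for r in sequences:
        S = [s | b for s, b in zip(S, r)]   # position-wise OR, as in Eq. (5)
    return S

S = reading_scheme(R)
print(S, sum(S))   # [1, 1, 1, 0, 1, 1] 5  -> 5 symbols must be read
```

        Overlapping 1s across sequences are counted once, which is exactly why sharing symbols between recovery equations lowers the read cost.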

        Given the failed data symbols {d_{Fw}, d_{Fw+1}, ..., d_{Fw+w−1}}, we first obtain a set of recovery groups G = {g0, g1, ..., g_{w−1}}, where g_i is responsible for recovering d_{Fw+i}, e.g., g0 → d_{Fw} and g1 → d_{Fw+1}. We then initialize the current recovery solution Gr by appending a random symbol belonging to each g_i, and update Gr through multiple iterations based on the idea of SA. Finally, we stop the algorithm upon convergence and obtain the recovery scheme S corresponding to Gr, which reads the smallest number of symbols.

        The recovery solution search algorithm (Algorithm 2) is built on the following three functions:

        1. generate_solution(Gr): Given a recovery solution Gr, the function returns a symbol-reading scheme S for recovering the failed data symbols.

        2. replace_symbol(Gr, p_tmp_sid, g_tmp_gid): Given the parity symbol p_tmp_sid and the recovery group g_tmp_gid, the function replaces the symbol that belongs to g_tmp_gid with p_tmp_sid and returns a new recovery solution Gr_new.

        3. cal_symbol_num(S): Given a symbol-reading scheme S, the function returns the total number of symbols to be read to recover the failed data symbols.


        We initialize the temperature T as T0 and the current recovery solution Gr as NULL (step 1). For each failed data symbol, we randomly select a parity symbol from the corresponding recovery group g ∈ G to construct Gr (steps 2–8). Note that the selection process requires multiple attempts to ensure that no parity symbol is repeated (steps 2–6). According to Gr, we generate the symbol-reading scheme S (step 9). In each iteration of the outer loop, we generate a series of new symbol-reading schemes at the current temperature T and accept new schemes based on the idea of SA. At temperature T, we randomly select a symbol from all parity symbols, referred to as p_tmp_sid (step 12), and calculate the corresponding recovery groups gs (step 13). If p_tmp_sid does not belong to Gr, we perform the following update operation. p_tmp_sid may belong to multiple recovery groups, so we randomly select one recovery group g_tmp_gid from them (step 15). We obtain a new recovery solution Gr_new by replacing the parity symbol of g_tmp_gid with p_tmp_sid (step 16). The recovery scheme S_new corresponding to Gr_new is then generated (step 17). If S_new reads fewer symbols, we update S with S_new; otherwise, we accept the deteriorating solution S_new with a certain probability (steps 19–21). We repeat this process Ni times at each temperature T. T decreases according to a certain attenuation coefficient K until T is lower than the termination temperature Tend or the number of external iterations reaches the default value Ne (steps 24–28). Finally, the algorithm returns the scheme S whose number of symbols to be read is the smallest.

        Example 7 We consider the recovery groups shown in Eq. (3) as an example. First, suppose the initial recovery solution Gr = {p0, p1, p2}, obtained by randomly selecting one parity symbol from each recovery group, with the corresponding symbol-reading scheme S = {111000111111111000}. In addition, we initialize the temperature T, the numbers of internal iterations Ni and external iterations Ne, and the termination temperature Tend. The initial number of symbols to be read is 12. Then we perform the following procedure multiple times to update Gr and S at temperature T. Assume that we pick a random symbol p3 ∉ Gr from {p0, p1, ..., p5}. We replace p2 ∈ g2 with p3 ∈ g2 to generate Gr_new = {p0, p1, p3}, which corresponds to the symbol-reading scheme S_new = {110000110110110100}, and the number of symbols to be read is 9 (9 < 12). So we update Gr and S using Gr_new and S_new respectively. After Ni iterations at temperature T, we reduce T by multiplying by a factor K (K < 1) and continue performing the update procedure multiple times at each T until T < Tend or Ne is reached.
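        The whole search can be sketched as follows. The per-parity-symbol read sets, the recovery groups, and the annealing parameters (T0, Tend, K, Ni) are all assumed illustrative values; a full implementation would derive the groups with Algorithm 1 and compute costs via the symbol-reading scheme S.

```python
import math
import random

# Minimal SA sketch of the recovery solution search (Algorithm 2).
# Each set lists the surviving data symbols read by that parity symbol's
# recovery equation (assumed values for a 3-symbol failure).
eqs = {
    "p0": {"d0", "d6", "d9"},
    "p1": {"d1", "d7", "d10"},
    "p2": {"d2", "d8", "d11"},
    "p3": {"d0", "d6", "d7", "d10"},
    "p5": {"d2", "d6", "d9", "d11"},
}
groups = [["p0"], ["p1", "p5"], ["p2", "p3"]]    # one group per failed symbol

def cost(sol):
    read = set(sol)                  # parity symbols read
    for p in sol:
        read |= eqs[p]               # plus surviving data symbols
    return len(read)

def sa_search(T0=10.0, Tend=0.01, K=0.9, Ni=50, seed=1):
    rng = random.Random(seed)
    sol = [rng.choice(g) for g in groups]        # random initial solution
    best, best_cost = list(sol), cost(sol)
    T = T0
    while T > Tend:
        for _ in range(Ni):
            gid = rng.randrange(len(groups))
            cand = list(sol)
            cand[gid] = rng.choice(groups[gid])  # swap within one group
            if len(set(cand)) < len(cand):
                continue                         # a parity symbol is reused
            delta = cost(cand) - cost(sol)
            # accept improvements always, worse moves with prob. e^(-delta/T)
            if delta < 0 or rng.random() < math.exp(-delta / T):
                sol = cand
                if cost(sol) < best_cost:
                    best, best_cost = list(sol), cost(sol)
        T *= K                                   # cool down
    return best, best_cost

sol, n = sa_search()
print(n)   # 9 symbols read for the best solution found
```

        Unlike the hill-climbing loop, the exp(−delta/T) acceptance occasionally keeps a worse solution early on, which is what lets the search cross over local optima before the temperature decays.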

        4.3 Complexity analysis

        We analyze the complexities of Algorithms 1 and 2 based on the notations listed in Table 1.

        1. Complexity of Algorithm 1: Algorithm 1 is a traversal process on the CDM of erasure codes, and its complexity is O(m × w × w).

        2. Complexity of Algorithm 2: During the execution of Algorithm 2, two loops are executed, where the number of external iterations equals Ne and the number of internal iterations equals Ni. However, temperature T gradually decreases as the search proceeds, so the number of external iterations may not reach Ne. Therefore, the complexity of Algorithm 2 does not exceed O(Ne × Ni).

        3. Complexity of SA-RSR: We have mentioned that SA-RSR can be decomposed into parity symbol grouping and recovery solution search. Thus, the complexity of SA-RSR is O(m × w × w) + O(Ne × Ni).

        5 Simulations

        We conducted a series of intensive simulations to evaluate the performance of SA-RSR. Both the enumeration algorithm (Khan et al., 2012) and the hill-climbing algorithm Zpacr (Zhu et al., 2014) can reduce the amount of data required for recovery and accelerate the data recovery process for any XOR-based erasure code. We therefore chose these two algorithms as baselines to demonstrate the advantage of SA-RSR. Our goal is to show that SA-RSR can return a symbol-reading scheme in a very short time, and that the amount of data required by SA-RSR for data recovery is minimal. Simulations were performed on a physical storage server with an Intel Xeon E7 processor, 16 GB of memory, and several SATA HDDs. Each disk had a capacity of 600 GB and operated at 7200 r/min. The operating system was CentOS 7. For each erasure code parameter setting, we performed 300 simulation runs and report the average value.

        We implemented the enumeration algorithm, Zpacr, and SA-RSR in C++ and considered three kinds of XOR-based erasure codes as the simulation objectives: Blaum_Roth code (Blaum and Roth, 1993), Liber8Tion code (Plank, 2009), and CRS code (Roth and Lempel, 1989). Blaum_Roth code and Liber8Tion code belong to a class of minimum-density RAID-6 codes (Plank et al., 2011), provide optimal write performance, and can tolerate up to two node failures. All selected erasure codes were implemented in the Jerasure-2.0 library, an open-source library commonly employed in the erasure coding community (Plank et al., 2009). Table 2 lists the restrictions of these three coding schemes.

        Table 2 Erasure codes and the parameters we tested

        For the data recovery operation, three steps were performed: (1) determining the symbols involved in the data recovery process, (2) reading surviving symbols from the selected nodes, and (3) reconstructing the failed data symbols. In practice, SA-RSR acts primarily on step (1), which calculates the corresponding symbol-reading scheme while selecting the nodes to perform the data recovery. Our justification is that the performance of data recovery in a distributed storage system depends primarily on network transmission rather than computation. Previous studies (Khan et al., 2012; Zhu et al., 2014) have validated that data reading and transmission consume the majority of data recovery time. Thus, we need to ensure only that SA-RSR does not become a computational bottleneck in the data recovery operation.

        5.1 Search performance

        We compared the enumeration algorithm, Zpacr, and SA-RSR in terms of runtime to show that SA-RSR can rapidly return an optimal symbol-reading scheme. Table 3 displays the runtime of the different algorithms over various parameter settings. In the simulations, we randomly selected one data node and let it fail. When the parameters (k, m, w) were relatively large, the enumeration algorithm was too time-consuming to calculate an optimal symbol-reading scheme. From Table 3, we can see that the enumeration algorithm took more than 1 min to obtain results under most of the parameter settings. Conversely, both SA-RSR and Zpacr maintained a short computation time (less than 0.5 s) for all parameter settings. SA-RSR's execution time was slightly longer than that of Zpacr.

        Table 3 Runtime of enumeration, Zpacr, and SA-RSR recoveries

        5.2 Data required for data recovery

        We simulated the amount of data required for data recovery to verify that SA-RSR can obtain a symbol-reading scheme with the least amount of data. The computation time of the enumeration algorithm is excessive, and we believe that accelerating the data recovery operation using the enumeration algorithm in real large-scale distributed storage systems is impractical. Therefore, we did not compare the enumeration algorithm with SA-RSR in terms of the amount of data required for data recovery. After making a random data node fail, we executed the Zpacr and SA-RSR algorithms and compared the amount of data required by the symbol-reading schemes that they generated.

        Fig. 7 illustrates the percentage of data required for recovery by Zpacr and SA-RSR relative to the conventional recovery method for various double-failure-tolerant coding schemes, including Blaum_Roth code, Liber8Tion code, and CRS code (m=2). For Blaum_Roth code, SA-RSR reduced the required amount of data by up to 25.0%, while Zpacr reduced it by up to only 20.0%. For Liber8Tion code, SA-RSR reduced the required amount of data by up to 28.12%, but Zpacr reduced it by up to 25.0%. For CRS code, SA-RSR reduced the required amount of data by up to 30.0%, yet Zpacr reduced it by up to only 26.70%. Fig. 8 shows the results for three-failure-tolerant codes. SA-RSR reduced the required amount of data by up to 25.08%, while Zpacr reduced it by up to 22.22%.

        Fig. 7 Percentage of the data needed for Zpacr and SA-RSR over the conventional approach in terms of double-fault-tolerant codes: (a) Blaum_Roth code; (b) Liber8Tion code; (c) CRS code (m=2)

        Fig. 8 Percentage of the data needed for Zpacr and SA-RSR over the conventional approach in terms of triple-fault-tolerant codes: (a) CRS code with varying k (m = 3, w = 10); (b) CRS code with varying w (k = 4, m = 3)

        Figs. 7 and 8 demonstrate that, compared to Zpacr, SA-RSR can further reduce the amount of data required for data recovery across various kinds of codes. The reason is that the SA approach can prevent the search from being confined to suboptimal regions, unlike other local search methods such as the hill-climbing algorithm.

        5.3 Algorithm stability

        Another crucial metric we considered is algorithm stability. Russell and Norvig (2016) noted that the SA algorithm is a probability-based algorithm, for which one may obtain different answers when repeating the algorithm several times. We verified that the convergence of SA-RSR is satisfactory for the current algorithm implementation. We introduce a new concept of accuracy:

        Accuracy denotes the closeness between the optimal solution and the average solution obtained by SA-RSR. The result is a decimal number equal to or greater than 1; the closer the accuracy is to 1, the better the convergence of SA-RSR.
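Under this definition, accuracy can be computed as the ratio of the average to the minimum number of symbols read over repeated runs. The exact formula is not spelled out in the text, so the following is our reading of the definition:

```python
def accuracy(symbol_counts):
    """Accuracy = SN_avg / SN_min over repeated SA-RSR runs.
    Equals 1 when every run returns the optimal scheme."""
    sn_min = min(symbol_counts)
    sn_avg = sum(symbol_counts) / len(symbol_counts)
    return sn_avg / sn_min

# If all 300 runs read 9 symbols, accuracy is exactly 1.0
assert accuracy([9] * 300) == 1.0
# A few suboptimal runs push accuracy slightly above 1
assert 1.0 < accuracy([9, 9, 9, 10]) < 1.04
```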

        We evaluated SA-RSR's accuracy through experiments. In each experiment, we selected a data node randomly and made it fail. Table 4 shows the accuracies for various erasure code parameter settings, where SN_min, SN_max, and SN_avg represent the minimum, maximum, and average numbers of symbols, respectively. We observed that the accuracies for Blaum_Roth code and Liber8Tion code can reach 1, and the accuracy for CRS code did not exceed 1.04. The reason is that Blaum_Roth code and Liber8Tion code have strict parameter restrictions; their parameters are not large, which ensures that the total number of recovery equations is small. Thus, SA-RSR can obtain stable symbol-reading schemes in a very short time. We conclude that SA-RSR has good stability.

        Table 4 Accuracy of SA-RSR with different coding schemes and different erasure code parameter settings

        6 Experiments

        In Section 4, we described how SA-RSR reduces the amount of data required for data recovery. To verify the effectiveness of SA-RSR in accelerating the data recovery process in real distributed storage systems, we implemented SA-RSR in the open-source distributed storage system Ceph. Based on our implementation, we conducted substantial system experiments to test the recovery performance of SA-RSR in a distributed storage environment. In this section, we first give an overview of Ceph and its erasure coding scheme. We then present how to integrate SA-RSR into Ceph. Finally, we describe the experimental results.

        6.1 Overview of Ceph

        Ceph (Weil et al., 2006a) is an open-source distributed storage system with a decentralized architecture and no single point of failure. Ceph provides an infinitely scalable storage cluster based on the reliable, autonomic distributed object store (RADOS) (Weil et al., 2007), which consists of two types of daemons: the Ceph monitor (responsible for monitoring the entire storage cluster) and the Ceph object storage daemon (OSD) (which handles read/write operations on the storage disks). Generally, a single OSD manages a single HDD or SSD. Ceph clients and Ceph OSDs both use the CRUSH (Weil et al., 2006b) algorithm to efficiently determine an object's location.

        The Ceph storage system supports the notion of "pools," which are logical partitions for storing objects. Each storage pool has an access control and redundancy policy to provide a separate namespace for users and applications. Each storage pool contains numerous placement groups (PGs), which are the basic units of data storage and migration in Ceph. The settings of the PGs depend on the number of replicas or the settings of the erasure code. PGs are distributed over multiple OSDs and can be dynamically adjusted according to the cluster size and the number of objects. One primary OSD (p-OSD) is selected from each placement group. In a pool based on the replication scheme, the p-OSD forwards objects to all other OSDs in the placement group, while in a pool that uses erasure codes, the p-OSD must divide and encode the object and send the coded blocks to the other OSDs in the placement group. Ceph supports erasure coding via a pluggable interface that enables the use of a variety of traditional erasure codes (e.g., RS code and CRS code) and locally repairable codes.

        The encoding process of the erasure code in Ceph is performed as a real-time job; i.e., the data is encoded while being written into the system. When writing data into a Ceph storage cluster using an erasure-coded pool, the selected p-OSD divides the input data stream into many stripes. Each stripe is then divided into k data chunks and encoded to generate m = n - k parity chunks using an (n, k, w) erasure code. The stripe size in Ceph refers to the size of the k data chunks and can be specified in the configuration file. Furthermore, a chunk is composed of w symbols, the size of each being a multiple of 8192 bytes.
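The striping described above can be illustrated as follows. This is a sketch of the layout only, not Ceph's actual implementation; the chunk and symbol sizes are illustrative, and parity generation is omitted.

```python
def stripe_layout(data: bytes, k: int, w: int, symbol_size: int):
    """Split a data stream into stripes of k chunks, each chunk holding
    w symbols of symbol_size bytes (parity generation omitted)."""
    chunk_size = w * symbol_size
    stripe_size = k * chunk_size          # the "stripe size" set in Ceph's config
    stripes = []
    for s in range(0, len(data), stripe_size):
        stripe = data[s:s + stripe_size]
        chunks = [stripe[c:c + chunk_size]
                  for c in range(0, len(stripe), chunk_size)]
        symbols = [[ch[i:i + symbol_size]
                    for i in range(0, len(ch), symbol_size)]
                   for ch in chunks]
        stripes.append(symbols)
    return stripes

# Mirroring Fig. 9: k = 2 chunks per stripe, w = 2 symbols per chunk
obj = bytes(range(16))
layout = stripe_layout(obj, k=2, w=2, symbol_size=2)
assert len(layout) == 2            # two stripes
assert len(layout[0]) == 2         # chunk 0 and chunk 1
assert len(layout[0][0]) == 2      # X00 and X01
```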

        Fig. 9 shows an example of the encoding process of Ceph, where the original object is divided into two stripes, each of which is composed of two chunks labeled chunk 0 and chunk 1. Each chunk is further partitioned into two symbols, e.g., X00 and X01 in stripe 0, or Y00 and Y01 in stripe 1.

        6.2 SA-RSR in Ceph

        To integrate SA-RSR into the distributed storage system Ceph, we reconstructed the execution logic of the data recovery process for erasure codes in Ceph. Fig. 10 shows our implementation architecture. During data recovery, Ceph first selects one OSD node, usually the p-OSD node, in the PG to perform the data recovery operations. We refer to this OSD as the recovery node.

        Fig. 9 A Ceph encoding example when k=2 and m=2 (The object is divided into two stripes, and the symbols within each stripe are encoded)

        Fig. 10 Integration of SA-RSR in Ceph

        By default, Ceph uses the conventional recovery strategy mentioned in Section 2.2. In detail, according to the coding scheme in the PG and the number and location of the failed nodes, Ceph selects k surviving nodes, including data nodes and parity nodes, to participate in the data recovery operation. In our design, when performing data recovery, the recovery scheduling module in Ceph first executes SA-RSR to obtain an optimal symbol-reading scheme. Then, according to the symbol-reading scheme, the recovery node constructs a reading request for each node involved in the data recovery process. Once the reading requests are built, they are sent to the corresponding nodes.
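The construction of per-node read requests from a symbol-reading scheme can be sketched as follows. The function name and the symbol-to-node mapping convention (contiguous symbols per node) are ours for illustration, not Ceph's API:

```python
from collections import defaultdict

def build_read_requests(scheme: str, symbols_per_node: int):
    """Group the symbol ids selected by the symbol-reading scheme by the
    node that stores them; each node receives one request listing the
    identification numbers of the symbols it must read."""
    requests = defaultdict(list)
    for sid, bit in enumerate(scheme):
        if bit == "1":
            node = sid // symbols_per_node   # symbols laid out node by node
            requests[node].append(sid)
    return dict(requests)

# The 18-bit scheme from Example 7, assuming 3 nodes holding 6 symbols each
reqs = build_read_requests("110000110110110100", symbols_per_node=6)
assert reqs == {0: [0, 1], 1: [6, 7, 9, 10], 2: [12, 13, 15]}
```

Each surviving node then receives only the identifiers it needs, so nodes whose request lists are empty stay out of the recovery entirely.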

        A recovery read handler runs on each surviving node and handles the read requests sent by the recovery node. OSDs in Ceph are responsible for reading data blocks from the disk. Based on this system design, the recovery read handler performs data read operations according to the messages in the read requests, which contain the identification numbers of the symbols that the current node needs to read. All symbols needed for data recovery are sent back to the recovery node.

        The failed data symbols are then obtained by the decoding operation of the decoding module. Ceph uses an erasure code plug-in infrastructure that enables dynamic use of external erasure code libraries, and the XOR-based erasure codes are implemented in the Jerasure-2.0 library in Ceph. Therefore, we modified the decoding part of the Jerasure library to ensure that it can decode the lost data using less data.
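The core decoding operation for XOR-based codes is just an XOR of the surviving symbols in the chosen recovery equation, which is why reading fewer symbols directly shortens recovery. A minimal sketch (not the Jerasure implementation):

```python
def xor_decode(symbols):
    """Recover a lost symbol as the XOR of the surviving symbols in one
    recovery equation (the core of XOR-based erasure decoding)."""
    out = bytearray(len(symbols[0]))
    for sym in symbols:
        for i, b in enumerate(sym):
            out[i] ^= b
    return bytes(out)

# If the parity is p = d0 ^ d1, then a lost d1 is recovered as p ^ d0
d0, d1 = b"\x0f\x0f", b"\x33\x55"
p = xor_decode([d0, d1])
assert xor_decode([p, d0]) == d1
```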

        Our integration of SA-RSR is based on the Ceph Luminous v12.0.2 release. We consider CRS code, Blaum_Roth code, and Liber8Tion code as representative XOR-based erasure codes, all of which are implemented in the Jerasure library in Ceph.

        6.3 Results

        In this subsection, we evaluate the recovery performance of SA-RSR in Ceph. Ceph was deployed on a cluster of five Linux-based servers, each equipped with an Intel Xeon E7 CPU and 16 GB of RAM. Each storage device was a 600 GB SATA HDD operating at 7200 r/min. The servers were interconnected via a gigabit Ethernet switch. In the Ceph system, a physical disk can serve as a logical OSD storage node. Here, each of four servers was configured with three disks to form three logical OSD nodes; the remaining server managed the cluster and served as the monitoring node. Thus, we had at most 12 OSD nodes with a total capacity of 7.03 TB, depending on the chosen erasure coding scheme.

        In Section 5, we explained the significant complexity of the enumeration algorithm and concluded that it is not always suitable for real distributed storage systems, so we implemented Zpacr and SA-RSR in Ceph without implementing the enumeration algorithm. We did not specifically optimize our implementation to demonstrate absolute recovery performance. Instead, our goal was to evaluate the relative performance of SA-RSR compared with the conventional approach and Zpacr under fair conditions.

        We focused on the metric of recovery speed, i.e., the amount of data recovered per second. We wrote a 200-MB object into a Ceph storage cluster using a specified coding scheme (e.g., CRS code, Blaum_Roth code, or Liber8Tion code). Note that SA-RSR may read non-contiguous symbols on a disk, which results in random access to HDDs and performance degradation. However, Khan et al. (2012) pointed out that the enumeration algorithm with large stripes in cloud file systems can amortize the seek costs incurred when reading non-contiguous symbols. This also applies to our SA-RSR algorithm when implemented in a distributed storage system. Specifically, we set the symbol size to 1 MB to amortize the seek cost in our experiments. We made one node in the corresponding PG fail by setting the state of its OSD to "out," which triggered the data recovery operation of the Ceph system. The recovery operation includes performing SA-RSR, building read requests for the data recovery operation, reading the corresponding symbols from disks, and performing the decoding operation. The recovery operation was performed 20 times, and we report the overall average.

        Fig. 11 compares the recovery speed among the conventional approach, Zpacr, and SA-RSR for different double-failure-tolerant codes, including Blaum_Roth code, Liber8Tion code, and CRS code (m=2). For Blaum_Roth code, Liber8Tion code, and CRS code, the improvements of SA-RSR over the conventional approach for data recovery were up to 19.48%, 17.92%, and 20.36%, respectively, while Zpacr achieved acceleration rates of 16.88%, 13.87%, and 18.56% in recovery speed over the conventional approach, respectively. Fig. 12 shows the results for coding schemes that tolerate three or more node failures. We observed that SA-RSR improved the recovery speed by 19.05%, 18.01%, and 15.63% with k = 4, 5, and 6, respectively, while Zpacr improved it by 14.88%, 13.04%, and 13.13%, respectively. SA-RSR achieved 17.03%, 18.18%, and 19.05% improvements in recovery speed over the conventional approach with w = 8, 9, and 10, respectively, while Zpacr increased the recovery speed by 15.38%, 15.34%, and 14.88%, respectively.

        Fig. 11 Recovery speed comparison among the conventional approach, Zpacr, and SA-RSR for different codes that tolerate two node failures: (a) Blaum_Roth code; (b) Liber8Tion code; (c) CRS code (m=2)

        Fig. 12 Recovery speed comparison among the conventional approach, Zpacr, and SA-RSR for different codes that tolerate three or more node failures: (a) CRS code (m = 3, w = 10); (b) CRS code (k = 4, m = 3)

        Figs. 11 and 12 indicate that SA-RSR can effectively accelerate the data recovery process of erasure coding schemes in a real distributed storage system. The reason is that SA-RSR effectively reduces the amount of data transmitted during the data recovery process for various erasure coding schemes compared with the conventional approach, while Zpacr is prone to falling into a local optimum during execution. In Section 5, we noted that the recovery performance of erasure coding schemes in distributed storage systems is determined mainly by network transmission rather than computation, and we verified that the computational overhead of the SA-RSR search is very small. Thus, the improvement of SA-RSR over Zpacr can be more significant.

        7 Conclusions

        Data recovery is a costly process in distributed storage systems that employ XOR-based erasure codes because it occupies a large amount of network bandwidth. To accelerate the recovery of failed data, we proposed a random search recovery algorithm that provides optimal recovery performance by minimizing the data needed for single-node failure recovery. The algorithm reduces the search time for an optimal recovery solution to polynomial time. To demonstrate its effectiveness, we performed extensive simulations with a variety of XOR-based erasure codes. We also conducted experiments by implementing Zpacr and our algorithm in a real distributed storage system, Ceph. The results showed that our algorithm reduced the amount of data required for single-node failure recovery by up to 30.0% and improved the performance of data recovery by up to 20.36% in Ceph compared to the conventional recovery method.

        Contributors

        Xingjun ZHANG designed the research. Ningjing LIANG and Yunfei LIU processed the data. Ningjing LIANG drafted the paper. Changjiang ZHANG helped organize the paper. Ningjing LIANG, Changjiang ZHANG, and Yang LI revised and finalized the paper.

        Compliance with ethics guidelines

        Xingjun ZHANG, Ningjing LIANG, Yunfei LIU, Changjiang ZHANG, and Yang LI declare that they have no conflict of interest.
