Li Xiaowei (李曉維), Yan Guihai (鄢貴海), Han Yinhe (韓銀和)
Abstract: A high-throughput computing system consists of a massive number of computing nodes and storage nodes connected by an interconnection network. At this scale, reliability becomes a serious problem: component failure is the norm rather than the exception, so system design must account for fault tolerance. We aim to build a new reliability framework for high-throughput computing systems that meets the differing reliability requirements of each layer, and to study cross-layer reliable computing techniques spanning the chip level to the system level. Toward this goal, the research proceeds along three lines: (1) fault detection and fault-tolerant design methods for high-throughput processing chips; (2) failure detection and recovery methods for high-throughput computing systems; and (3) a supporting environment, from chip level to system level, for self-prediction, self-detection, self-localization, self-isolation and self-healing of faults (5S). As of 2013, the work was progressing steadily according to the plan in the task specification, and parts of it had reached their milestones. Breakthroughs were made in (1) online prediction of negative-bias temperature instability (NBTI) aging faults, (2) system fault prediction based on techniques such as deep learning, (3) register fault diagnosis, and (4) on-chip network communication isolation. Six papers were published in or accepted by IEEE Transactions journals, along with one paper in another journal. In terms of coverage, the deployed research points cover every research plan specified in the task specification, and some of them have been refined further.
Key Words: Reliability design; Fault detection; Deep learning; Online prediction; Communication isolation
Full-text link (real-name registration required): http://www.nstrs.cn/xiangxiBG.aspx?id=50730&flag=1