

2016-04-12  LUAN Yonghong, LIU Quan, ZHANG Peng
Modern Electronics Technique, 2016, No. 10

CLC number: TN911-34; TP18   Document code: A   Article ID: 1004-373X(2016)10-0014-04

        A random skill discovery algorithm in continuous spaces

LUAN Yonghong1,2, LIU Quan2,3, ZHANG Peng2

        (1. Suzhou Institute of Industrial Technology, Suzhou 215104, China; 2. Institute of Computer Science and Technology, Soochow University, Suzhou 215006, China; 3. MOE Key Laboratory of Symbolic Computation and Knowledge Engineering, Jilin University, Changchun 130012, China)

Abstract: To address the "curse of dimensionality" that arises in large-scale, continuous spaces as the number of state dimensions grows exponentially, an improved random skill discovery algorithm based on the Option hierarchical reinforcement learning framework is proposed. Random Options are defined to generate random skill trees, from which a set of random skill trees is constructed. The overall task is divided into sub-goals, and the exponential growth of learning parameters caused by the growth of the agent's state space is reduced by learning low-order Option policies. A simulation experiment was conducted on the task of planning the shortest path between two points in a two-dimensional continuous grid space with obstacles. The results show that, because the Options are defined randomly, the algorithm exhibits intermittent instability in its initial performance; however, as the set of random skill trees grows, it converges quickly to a near-optimal solution, effectively overcoming the difficulty of finding an optimal policy and the slow convergence caused by the curse of dimensionality.

        Keywords: reinforcement learning; Option; continuous space; random skill discovery

0 Introduction

Reinforcement learning (RL) [1-2] is the process by which an agent learns a policy mapping states to actions through direct interaction with its environment. Classical RL algorithms attempt to find a single optimal policy over the entire domain, which works well in small-scale or discrete environments, but in large-scale and continuous state spaces they face the "curse of dimensionality". To address this problem, researchers have proposed methods such as state clustering, search over restricted policy spaces, value function approximation, and hierarchical reinforcement learning [3]. The hierarchical structure in hierarchical reinforcement learning is essentially built by adding abstraction mechanisms on top of ordinary RL, that is, by combining the primitive actions of RL with higher-level skill actions [3] (also called Options).
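For intuition only, the following minimal sketch (not taken from the paper; the cell size, learning-rate values, and all function names are assumptions made for the example) shows the tabular Q-learning update that classical RL performs, and why indexing a table by discretized states becomes infeasible as the state dimension grows:

    import random
    from collections import defaultdict

    # Illustrative tabular Q-learning sketch (not the paper's algorithm).
    # A continuous state is mapped to a grid cell, so the Q-table is indexed by
    # (cell, action); its size grows exponentially with the state dimension.

    def discretize(state, cell_size=0.1):
        """Map a continuous state (a tuple of floats) to a discrete grid cell."""
        return tuple(int(x // cell_size) for x in state)

    def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
        """One-step Q-learning: Q(s,a) += alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """Explore with probability epsilon, otherwise act greedily with respect to Q."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    Q = defaultdict(float)   # (discretized state, action) -> estimated return

With d state variables and k cells per dimension the table already has on the order of k^d entries, which is exactly the exponential growth that the hierarchical methods discussed below try to avoid.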

One of the main research goals of hierarchical reinforcement learning is the automatic discovery of hierarchical skills. Although many hierarchical reinforcement learning methods have been studied in recent years, most of them look for hierarchical skills in relatively small, discrete domains. For example, Simsek and Osentoski et al. find subgoals by partitioning local state-transition graphs built from recent experience [4-5]. McGovern and Barto select subgoals according to the frequency with which states occur [6]. Matthew proposed taking frequently visited states on successful paths as subgoals, and Jong and Stone proposed selecting subgoals based on the irrelevance of state variables [7]. However, all of these methods target relatively small, discrete RL domains. In 2009, Konidaris and Barto et al. proposed a skill discovery method for continuous RL spaces, called skill chaining [8]. In 2010, Konidaris further proposed the CST algorithm, which segments each solution trajectory into skills using a change-point detection method over subgoals [9]; this approach is limited to cases where the trajectories are not too long and can actually be acquired.

This paper presents a random skill discovery algorithm for continuous RL domains. Exploiting the adaptive, hierarchically optimal properties of Option-based hierarchical reinforcement learning, each high-level skill is defined as an Option, and these Options are defined randomly; the complexity of the method scales with the number of Options constructed for a complex learning domain. Although a randomly chosen Option may not be the most suitable one, the Options are organized not into a single skill tree but into a set of skill trees, which compensates for this shortcoming.
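As a rough illustration of this idea, the sketch below builds a set of random skill trees over a bounded continuous state space. The axis-aligned random splits, the depth parameter, and all names are assumptions made for the example rather than the paper's exact construction (which is developed in its later sections); each leaf is where a low-order Option would subsequently be learned.

    import random

    class SkillTreeNode:
        """A node of one random skill tree over an axis-aligned region of the state space."""

        def __init__(self, low, high, depth, max_depth):
            self.low, self.high = list(low), list(high)   # region covered by this node
            self.left = self.right = None
            self.option = None                            # a low-order Option is attached to each leaf
            if depth < max_depth:
                self.dim = random.randrange(len(low))     # randomly chosen split dimension
                self.cut = random.uniform(low[self.dim], high[self.dim])   # randomly chosen split point
                left_high, right_low = list(high), list(low)
                left_high[self.dim] = self.cut
                right_low[self.dim] = self.cut
                self.left = SkillTreeNode(low, left_high, depth + 1, max_depth)
                self.right = SkillTreeNode(right_low, high, depth + 1, max_depth)

        def leaf_for(self, state):
            """Return the leaf (and hence the candidate Option) responsible for a state."""
            if self.left is None:
                return self
            child = self.left if state[self.dim] <= self.cut else self.right
            return child.leaf_for(state)

    def random_skill_tree_set(num_trees, low, high, max_depth):
        """Draw several independent random trees; the ensemble offsets individual unlucky splits."""
        return [SkillTreeNode(low, high, 0, max_depth) for _ in range(num_trees)]

    # Example: five random trees over a 2-D unit square, each three levels deep.
    trees = random_skill_tree_set(5, low=[0.0, 0.0], high=[1.0, 1.0], max_depth=3)
    candidate_leaves = [tree.leaf_for([0.3, 0.7]) for tree in trees]

Because the splits are random, any single tree may partition the space badly; the set of trees is what the passage above relies on to compensate for that.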

1 Hierarchical reinforcement learning and the Option framework

The core idea of hierarchical reinforcement learning (HRL) is to introduce abstraction mechanisms that decompose the overall learning task. In HRL methods, the agent can handle not only the given set of primitive actions but also high-level skills.
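In the standard Option framework, a high-level skill is the triple <I, pi, beta>: an initiation set, an internal policy, and a termination condition. The sketch below is a minimal rendering of that triple; the environment step function and all concrete callables are placeholders assumed for illustration, not definitions taken from this paper.

    import random
    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Option:
        """The standard Option triple <I, pi, beta>."""
        initiation: Callable[[Any], bool]      # I: may the option be started in this state?
        policy: Callable[[Any], Any]           # pi: the option's internal (low-level) policy
        termination: Callable[[Any], float]    # beta: probability of terminating in a state

    def run_option(option, state, env_step, max_steps=100):
        """Execute an option as a temporally extended action until beta fires or a step limit is hit."""
        total_reward, steps = 0.0, 0
        while steps < max_steps:
            action = option.policy(state)                  # low-level action from the option's policy
            state, reward, done = env_step(state, action)  # env_step is an assumed environment interface
            total_reward += reward
            steps += 1
            if done or random.random() < option.termination(state):
                break
        return state, total_reward, steps

From the higher-level learner's point of view, the whole call to run_option behaves like a single action of variable duration, which is what allows the agent to mix primitive actions and skills.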

4 Conclusion

The experimental results show that the RSD (random skill discovery) algorithm can significantly improve performance on RL problems in continuous domains by using a set of random skill trees and learning a low-order Option policy for each leaf. Compared with other skill discovery methods, the advantage of RSD is that it uses the Option framework to handle continuous RL domains more effectively, and it creates Options automatically without having to analyze graphs or value functions over a training set. It therefore reduces the burden of searching for specific Options, makes the method better suited to large-scale or continuous state spaces, and allows it to tackle more difficult domains.

References

[1] SUTTON R S, BARTO A G. Reinforcement learning: an introduction [M]. Cambridge, MA: MIT Press, 1998.

[2] KAELBLING L P, LITTMAN M L, MOORE A W. Reinforcement learning: a survey [EB/OL]. [1996-05-01]. http://www.cs.cmu.edu/afs/cs...vey.html.

[3] BARTO A G, MAHADEVAN S. Recent advances in hierarchical reinforcement learning [J]. Discrete Event Dynamic Systems, 2003, 13(4): 341-379.

[4] SIMSEK O, WOLFE A P, BARTO A G. Identifying useful subgoals in reinforcement learning by local graph partitioning [C]// Proceedings of the 22nd International Conference on Machine Learning. USA: ACM, 2005: 816-823.

[5] OSENTOSKI S, MAHADEVAN S. Learning state-action basis functions for hierarchical MDPs [C]// Proceedings of the 24th International Conference on Machine Learning. USA: ACM, 2007: 705-712.

[6] MCGOVERN A, BARTO A. Autonomous discovery of subgoals in reinforcement learning using diverse density [C]// Proceedings of the 18th International Conference on Machine Learning. San Francisco: Morgan Kaufmann, 2001: 361-368.

[7] JONG N K, STONE P. State abstraction discovery from irrelevant state variables [C]// Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI). 2005: 752-757.

[8] KONIDARIS G, BARTO A G. Skill discovery in continuous reinforcement learning domains using skill chaining [C]// Advances in Neural Information Processing Systems 22. 2009: 1015-1023.

[9] KONIDARIS G, KUINDERSMA S, BARTO A G, et al. Constructing skill trees for reinforcement learning agents from demonstration trajectories [C]// Advances in Neural Information Processing Systems 23. 2010: 1162-1170.

[10] LIU Quan, YAN Qicui, FU Yuchen, et al. A hierarchical reinforcement learning method based on a heuristic reward function [J]. Journal of Computer Research and Development, 2011, 48(12): 2352-2358. (in Chinese)

[11] SHEN Jing, LIU Haibo, ZHANG Rubo, et al. Multi-robot hierarchical reinforcement learning based on semi-Markov games [J]. Journal of Shandong University (Engineering Science), 2010, 40(4): 1-7. (in Chinese)

[12] KONIDARIS G, BARTO A. Efficient skill learning using abstraction selection [C]// Proceedings of the 21st International Joint Conference on Artificial Intelligence. Pasadena, CA, USA: [s.n.], 2009: 1107-1113.

[13] XIAO Ding, LI Yitong, SHI Chuan. Autonomic discovery of subgoals in hierarchical reinforcement learning [J]. The Journal of China Universities of Posts and Telecommunications, 2014, 21(5): 94-104.

[14] CHEN Chunlin, DONG Daoyi, LI Hanxiong, et al. Hybrid MDP based integrated hierarchical Q-learning [J]. Science China Information Sciences, 2011, 54(11): 2279-2294.
