

Coach-assisted multi-agent reinforcement learning framework for unexpected crashed agents

2022-07-26

Jian ZHAO, Youpeng ZHAO, Weixun WANG, Mingyu YANG, Xunhan HU, Wengang ZHOU, Jianye HAO, Houqiang LI

1 School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China

2 College of Intelligence and Computing, Tianjin University, Tianjin 300072, China

Abstract: Multi-agent reinforcement learning is difficult to apply in practice, partially because of the gap between simulated and real-world scenarios. One reason for the gap is that simulated systems always assume that agents can work normally all the time, while in practice, one or more agents may unexpectedly “crash” during the coordination process due to inevitable hardware or software failures. Such crashes destroy the cooperation among agents and lead to performance degradation. In this work, we present a formal conceptualization of a cooperative multi-agent reinforcement learning system with unexpected crashes. To enhance the robustness of the system to crashes, we propose a coach-assisted multi-agent reinforcement learning framework that introduces a virtual coach agent to adjust the crash rate during training. We have designed three coaching strategies (fixed crash rate, curriculum learning, and adaptive crash rate) and a re-sampling strategy for our coach agent. To our knowledge, this work is the first to study unexpected crashes in a multi-agent system. Extensive experiments on grid-world and StarCraft II micromanagement tasks demonstrate the efficacy of the adaptive strategy compared with the fixed crash rate strategy and curriculum learning strategy. The ablation study further illustrates the effectiveness of our re-sampling strategy.

Key words: Multi-agent system; Reinforcement learning; Unexpected crashed agents

        1 Introduction

Cooperative multi-agent systems widely exist in various domains, where a group of agents need to coordinate with each other to maximize the team’s reward (Busoniu et al., 2008; Tuyls and Weiss, 2012). Such a setting can be broadly applied in the control and operation of robots, unmanned vehicles, mobile sensor networks, and the smart grid (Zhang et al., 2021). Recently, many researchers have devoted their efforts to leveraging reinforcement learning techniques in multi-agent systems (Rashid et al., 2018; Sunehag et al., 2018; Wang JH et al., 2020; Wang YP et al., 2020). Despite the remarkable advancement in academia, multi-agent reinforcement learning (MARL) is still difficult to apply in practice. One non-trivial reason is that there always exists a gap between simulated and real-world scenarios, which degrades the performance of the policies once the models are transferred into real-world applications (Zhao et al., 2020).

To close this sim-to-real gap and accomplish more efficient policy transfer, multiple research efforts are now being directed to identifying the causes of the gap and proposing corresponding solutions. One main cause is the difference between the physics engine of the simulator and the real-world scenario. To alleviate this difference, research efforts have been directed to building more realistic simulators using mathematical models (Todorov et al., 2012; Furrer et al., 2016; Dosovitskiy et al., 2017; Shah et al., 2018; McCord et al., 2019; Wang YP et al., 2021). Another cause is the mismatch between the simulated environment’s data distribution and that of the real environment, which has inspired related research on domain adaptation (Higgins et al., 2017; Traoré et al., 2019; Arndt et al., 2020) and domain randomization (Tobin et al., 2017).

Generally, simulated systems assume that agents can work normally all the time. However, this assumption is usually not in line with reality. Because of inevitable hardware or software failures in practice, one or more agents may unexpectedly “crash” during the coordination process. If the agents are trained in an environment without crashes, they master only how to cooperate in a crash-free environment. Once some agents “break down” and take abnormal actions, the remaining agents can hardly maintain effective cooperation, which leads to performance degradation. Take a two-agent system as an example: two agents are required to finish two tasks in coordination. In the crash-free scenario, the optimal solution is for each agent to take responsibility for one task. When applying such a policy to a real-world application, the cooperation cannot be accomplished if either agent encounters a crash. This example indicates the necessity of considering unexpected crashes during training to obtain well-trained agents with high robustness.

To our knowledge, this work is the first to study crashes in multi-agent systems, which is more consistent with real-world scenarios. In this study, we give a formal conceptualization of a cooperative MARL system with unexpected crashes, where any agent has a certain probability of crashing during operation. We assume that, for each agent, the probability of crashing independently follows a Bernoulli distribution. To enhance the robustness of the system to unexpected crashes, the agents should be trained in an environment that includes crashes. The key challenge is how to adjust the crash rate during training.

In this work, we propose a coach-assisted MARL framework that introduces a virtual coach agent into the system. The coach agent is responsible for adjusting the crash rate during training. One straightforward coaching strategy is to set a fixed crash rate throughout training. Considering that it may be too difficult for agents to cooperate initially (Narvekar et al., 2020), gradually increasing the crash rate is another feasible strategy. In addition to these basic strategies, an experienced coach can automatically adjust the crash rate according to the overall performance during training. Specifically, if the performance exceeds a threshold, the crash rate is raised to increase the learning difficulty; otherwise, the crash rate is decreased. In this way, agents can learn coordination skills progressively while being exposed to unexpected crashes.

To test the effectiveness of our method, we conducted experiments on grid-world and StarCraft II micromanagement tasks. The results demonstrate that, compared to the fixed crash rate and curriculum learning strategies, the adaptive method achieves relatively stable performance under different crash rates. Furthermore, the ablation study shows the efficacy of our re-sampling strategy.

        2 Related works

In this section, we briefly summarize the works related to cooperative MARL. With the development of this field, researchers are paying increasing attention to MARL problems that are more consistent with real-world settings.

Early efforts treat the agents in a team independently and regard the team reward as the individual reward (Tan, 1993; Mnih et al., 2015; Foerster et al., 2017; Omidshafiei et al., 2017). Consequently, the MARL task is transformed into multiple single-agent reinforcement learning tasks. Although trivially providing a possible solution, these approaches pay insufficient attention to an essential characteristic of MARL: coordination among agents. In other words, they introduce non-stationarity, as agents cannot distinguish between the stochasticity of the environment and the exploitative behaviors of other co-learners (Lowe et al., 2017).

Another line of research focuses on centralized learning of joint actions, which can naturally handle coordination problems (Sukhbaatar et al., 2016; Peng et al., 2017). Most centralized learning approaches require communication during execution. For instance, in CommNet (Sukhbaatar et al., 2016), a centralized network is designed for agents to exchange information. BicNet (Peng et al., 2017) leverages bi-directional recurrent neural networks (RNNs) for information sharing. Considering the communication constraints in practice, SchedNet (Kim et al., 2019) was proposed, in which agents learn how to schedule themselves for message passing and how to select actions based on received partial observations. Another challenge of centralized learning is scalability, because the joint action space grows exponentially as the number of agents increases. Some researchers have investigated scalable strategies in centralized learning (Guestrin et al., 2001; Kok and Vlassis, 2006). Sparse cooperative Q-learning (Kok and Vlassis, 2006) allows only the necessary coordination between agents by encoding such dependencies. However, these methods require prior knowledge of the dependencies among agents, which is often inaccessible.

To study a more practical scenario with partial observability and communication constraints, an emerging stream is the paradigm of centralized training with decentralized execution (CTDE) (Oliehoek et al., 2008; Kraemer and Banerjee, 2016). To our knowledge, value decomposition networks (VDNs) (Sunehag et al., 2018) made the first attempt to decompose a central state-action value function into a sum of individual Q-values to allow for decentralized execution. VDN simply assumes equal contributions of agents and does not use additional state information during training. Based on VDN, Qatten (Yang et al., 2020) uses a multi-head attention structure to distinguish the contributions of agents, and linearly integrates the individual Q-values into the central Q-value. Instead of using linear monotonic value functions, QMIX (Rashid et al., 2018) and QTRAN (Son et al., 2019) employ a mixing network that satisfies the individual-global-max (IGM) principle (Son et al., 2019) to combine the individual Q-values non-linearly by leveraging state information. QPLEX (Wang JH et al., 2020) introduces a duplex dueling structure and decomposes the central Q-value into the sum of individual value functions and a non-positive advantage function.

However, all of the existing works assume that agents can continuously maintain normal operations, which is inconsistent with real-world scenarios. As a matter of fact, it is quite a common phenomenon that some agents encounter unexpected crashes because of hardware or software failures. To this end, we aim to study a more practical problem by considering unexpected crashed agents in the cooperative MARL task.

        3 Problem formulation

To better solve the problem of unexpectedly crashed agents, we define a Crashed Dec-POMDP model, given by a tuple M = ⟨N, S, A, Ω, P, O, R, γ, α⟩, where S, A, Ω, P, O, and R represent the state space, the action space, the observation space, a state transition function, an observation probability function, and a team reward, respectively. Each agent g_i ∈ N ≡ {g_1, g_2, ..., g_n} has a probability of crashing, and the crash rate is denoted as α. For simplicity, we assume that the crash occurs at the beginning of the episode and that the status of being crashed or not does not change throughout the episode. We define a binarized vector c = (c_1, c_2, ..., c_n) to denote the crashed state of the n agents, where c_i ~ Bernoulli(α). When the i-th agent crashes, c_i is 1; otherwise, c_i is 0. Note that c stays the same during an episode but may change throughout the task due to randomness.
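As a concrete illustration, the crashed-state vector can be sampled as below. This is a minimal sketch; the function name and the use of NumPy are our own choices, not part of the paper.

```python
import numpy as np

def sample_crash_vector(n_agents, alpha, rng=None):
    """Draw the binarized crashed-state vector c = (c_1, ..., c_n).

    Each c_i ~ Bernoulli(alpha) independently; c_i = 1 means agent i
    crashes at the start of the episode and stays crashed until it ends.
    """
    rng = rng if rng is not None else np.random.default_rng()
    return (rng.random(n_agents) < alpha).astype(np.int8)

# One draw for 8 agents at crash rate alpha = 0.25; the vector is fixed
# for the whole episode and re-drawn for the next one.
c = sample_crash_vector(8, 0.25)
```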

At each time step, each agent g_i receives a partial observation o_i ∈ Ω according to the observation probability function O(o_i | s). Each uncrashed agent chooses an action a_i ∈ A with its normal strategy, while the crashed agents take no-move or random actions, forming a joint action a. Given the current state s, the joint action a of the agents transits the environment to the next state s′ ∈ S according to the state transition function P(s′ | s, a). All of the agents share a team reward R(s, a). The learning goal of MARL is to optimize every agent’s individual policy π_i(a_i | τ_i), where τ_i is the agent’s action-observation history, to maximize the accumulated team reward E[Σ_t γ^t R(s_t, a_t)], where γ ∈ [0, 1) is a discount factor.

        4 Methods

In this section, we present our coach-assisted MARL framework for the Crashed Dec-POMDP problem and explain the rationale of our design.

        4.1 Overall framework

To simulate crash scenarios during training, we introduce a virtual agent into the system to act as a coach. The coach is responsible for deciding the crash rate during training. At the beginning of each episode t, the coach sets up a crash rate α_t. We assume that the probability of being crashed for each agent follows a Bernoulli(α_t) distribution. Given the current crash rate α_t, some of the agents crash and cannot take rational actions. The multi-agent system with crashed agents is then trained for T steps to learn coordination. The coach then receives the performance of the agents under the current crash rate, denoted as e_t, and resets the crash rate α_{t+1} for the next episode. The overall framework is illustrated in Fig. 1.
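The episode-level loop above can be sketched as follows. The coach, learner, and environment objects and their method names (`initial_rate`, `update`, `train`, `evaluate`) are hypothetical placeholders; only the control flow mirrors the framework.

```python
import random

def train_with_coach(coach, learner, env, n_episodes, steps_per_episode):
    """Sketch of one coach-assisted training run.

    `coach` exposes initial_rate() and update(alpha, e_t); `learner`
    exposes train(...) and evaluate(...); `env` exposes n_agents.
    All of these are assumed interfaces, not part of the paper.
    """
    alpha = coach.initial_rate()
    for t in range(n_episodes):
        # Each agent crashes independently with probability alpha_t.
        crashed = [int(random.random() < alpha) for _ in range(env.n_agents)]
        learner.train(env, crashed, steps=steps_per_episode)  # learn under the crash mask
        e_t = learner.evaluate(env, alpha)  # performance e_t at the current rate
        alpha = coach.update(alpha, e_t)    # coach sets alpha_{t+1} for the next episode
    return learner
```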

Fig. 1 An overview of the adaptive framework

        4.2 Coaching strategies

The main challenge for the coach is how to choose an effective crash rate during training. Here, we introduce three coaching strategies:

1. Fixed crash rate. The coach sets a fixed crash rate that is used throughout the training process. The agents, some of which are crashed, are required to learn coordination skills from scratch.

2. Curriculum learning. The coach linearly increases the crash rate during training. At the beginning, the agents are trained in a crash-free environment. For the t-th episode, the coach sets the crash rate to (t − 1)Δα, where Δα is a hyperparameter. This approach gradually increases the cooperation difficulty.

3. Adaptive crash rate. For the first two strategies, the coach does not take full advantage of the performance of the cooperative agents. An advanced strategy is for the coach to adaptively adjust the crash rate according to the performance of the agents at the current crash rate. The basic idea is that if the agents can cooperate well and achieve acceptable performance under the current crash situation, the crash rate should be increased; otherwise, it should be decreased. The adaptive strategy can be formulated as follows:

α_{t+1} = F(α_t, e_t),

where F(·) is a mapping function, and β represents the threshold on the performance of the specific evaluation metric. We can see that the fixed crash rate and curriculum learning strategies are two special cases of the adaptive strategy. For the fixed crash rate strategy,

F(α_t, e_t) = α_1.

For the curriculum learning strategy,

F(α_t, e_t) = α_t + Δα,

where α_1 = 0.

In this work, we use the following adaptive function:

F(α_t, e_t) = α_t + ρ · I(e_t − β),

where ρ is the learning rate of the crash rate, and the function I(·) is defined as follows:

I(x) = 1 if x ≥ 0, and I(x) = −1 otherwise.

In this adaptive function, if the performance e_t of the system does not reach the threshold value β, the crash rate for the following training will be reduced, and vice versa. Therefore, the crash rate during training can fit the skills of the system, thus facilitating the learning process. Note that our method is not limited to the use of the above function; a more efficient adaptive function can be further investigated.
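The three coaching strategies reduce to plain update rules, sketched below. The default β and ρ are the values used later in the grid-world experiment, and the clipping of α to [0, 1] is our own safeguard rather than part of the paper's formulation.

```python
def indicator(x):
    """I(x): +1 if x >= 0, -1 otherwise."""
    return 1.0 if x >= 0 else -1.0

def fixed_rate(alpha_t, e_t, alpha_1=0.1):
    # F(alpha_t, e_t) = alpha_1: the rate never moves.
    return alpha_1

def curriculum(alpha_t, e_t, delta_alpha=0.001):
    # F(alpha_t, e_t) = alpha_t + delta_alpha: linear increase from alpha_1 = 0.
    return alpha_t + delta_alpha

def adaptive(alpha_t, e_t, beta=0.75, rho=0.01):
    # F(alpha_t, e_t) = alpha_t + rho * I(e_t - beta): raise the rate when
    # performance e_t reaches the threshold beta, lower it otherwise.
    # Clipping to [0, 1] keeps alpha a valid probability (our addition).
    return min(max(alpha_t + rho * indicator(e_t - beta), 0.0), 1.0)
```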

        4.3 Re-sampling strategy

Randomly sampling from a Bernoulli(α) distribution may cause the proportion of crashed agents to exceed or fall below the current crash rate α. Therefore, we employ a re-sampling strategy to ensure that the number of crashed agents is no larger than the upper bound of n × α. Here, we explain the rationale behind the re-sampling strategy. For the samples with more crashed agents, it may be too difficult for the current model to learn the coordination skills, so these samples are discarded and new samples are generated. Samples with fewer crashed agents than expected can help the agents remember how to deal with the easier scenarios, and are therefore used during training.
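A minimal sketch of the re-sampling step is given below. Reading the upper bound n × α as ⌈n·α⌉, and falling back to a crash-free mask after a fixed number of rejected draws, are our assumptions.

```python
import math
import random

def resample_crash_mask(n_agents, alpha, max_tries=1000):
    """Draw crash masks until the number of crashed agents does not
    exceed the upper bound of n * alpha; harder draws are rejected,
    easier ones (fewer crashes) are kept for training."""
    bound = math.ceil(n_agents * alpha)  # rounding choice is an assumption
    for _ in range(max_tries):
        mask = [int(random.random() < alpha) for _ in range(n_agents)]
        if sum(mask) <= bound:
            return mask  # accepted: no harder than the target crash rate
    return [0] * n_agents  # fallback (our assumption): a crash-free sample
```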

        4.4 Overview of our algorithm

To describe our method clearly, we take the coaching strategy with an adaptive crash rate as an example; the complete procedure is shown in Algorithm 1.

        5 Experiments

In this section, we discuss the experiments conducted to demonstrate the effectiveness of the proposed methods. First, we conducted experiments in a grid-world environment as a toy example. Then we used the StarCraft Multi-Agent Challenge (SMAC) environment (Samvelyan et al., 2019), which has become a commonly used benchmark for evaluating state-of-the-art MARL approaches, as the test-bed to evaluate our methods. All experiments were conducted on an Ubuntu 18.04 server with four Intel Xeon Gold 6252 CPUs @ 2.10 GHz and a GeForce RTX 2080 Ti GPU. Our code is available at https://github.com/youpengzhao/Crashed_Agent.

        5.1 Grid-world example

        5.1.1 Settings

We used the grid-world example to intuitively show the consequences of ignoring unexpected crashes in real-world scenarios. We set up a 10×10 grid in which two agents needed to touch two buttons within a limited number of steps. The game terminated after 20 steps or when both buttons had been touched. The default reward at each step was -1, and if a button was touched, the agents were assigned a reward of five at that step. In this way, the agents were encouraged to touch the buttons as quickly as possible. At each step, each agent had five possible actions: up, down, left, right, and staying still. If there was an unexpected crash during the test, the crashed agent remained still for the whole episode, and only one agent crashed during the test. For simplicity, the initial locations of the agents and buttons were fixed, so the environment was deterministic. In addition, the observation of each agent was its own location, and the global state contained the locations of the two buttons and the two agents, so an agent could not tell from its own observation during execution whether its partner had crashed.
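The rules above can be captured in a minimal environment sketch. The concrete initial positions, button locations, and the choice to add the +5 button reward on top of the per-step -1 are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the 10x10 two-agent button grid world described above.
class ButtonGridWorld:
    ACTIONS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0), 4: (0, 0)}  # up, down, left, right, stay

    def __init__(self):
        self.agents = [[0, 0], [9, 9]]   # fixed initial locations (illustrative)
        self.buttons = {(2, 7), (7, 2)}  # fixed button locations (illustrative)
        self.t = 0

    def step(self, actions, crashed=(0, 0)):
        """Apply one joint action; a crashed agent always stays still."""
        self.t += 1
        reward = -1  # default per-step team reward
        for i, a in enumerate(actions):
            dx, dy = (0, 0) if crashed[i] else self.ACTIONS[a]
            x = min(max(self.agents[i][0] + dx, 0), 9)
            y = min(max(self.agents[i][1] + dy, 0), 9)
            self.agents[i] = [x, y]
            if (x, y) in self.buttons:   # touching a button pays +5
                self.buttons.discard((x, y))
                reward += 5
        done = self.t >= 20 or not self.buttons  # 20-step limit or both buttons touched
        return reward, done
```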

We used QMIX (Rashid et al., 2018), a state-of-the-art value-based MARL algorithm, as the base model in this toy example, and adopted the adaptive approach for comparison. Our implementation was based on the Pymarl algorithm library (Samvelyan et al., 2019), and the training schedules, optimizer, and training hyperparameters were kept the same as the defaults used in Pymarl. Our method includes two additional hyperparameters: the performance threshold β, which decides whether to increase or decrease the crash rate during training, and the learning rate of the crash rate ρ, which controls the step size of the adjustment of α. We set β to 0.75 and ρ to 0.01 in this experiment. These tasks were separately trained for two million steps.

        5.1.2 Performance evaluation and discussion

After training for the same number of steps, agents trained using these two methods managed to complete the task under normal scenarios. However, when an unexpected crash occurred, things were different. The results are illustrated in Fig. 2. The agent trained using QMIX learned to touch the button near it along the shortest path, but after that, it wandered aimlessly. Due to partial observation, the normal agent could not know that its partner was out of control, and it did not try to touch the other button. We assume that agents trained by QMIX learned to cooperate efficiently to complete this task, and therefore each just needed to touch the closest button. However, their excessive reliance on cooperation made the system fragile, and they failed to deal with the unexpected crash, which is common in realistic scenarios. In fact, the optimal strategy for the system when encountering an unexpected crash is for each agent to head for the other button after touching the button nearest itself. In this way, even if one agent “breaks down,” the system can still complete the task. As shown in Fig. 2, our method takes possible crashes into account during training, so the system can still fulfill the task even when one of the two agents encounters a crash. This toy example illustrates the drawback of over-reliance on cooperation and the necessity of considering possible crashes when training a multi-agent system.

Fig. 2 The trajectories of the agents during the test when one of them is crashed: (a) agent 1 is crashed; (b) agent 2 is crashed. The agents are represented by circles; the crashed one is marked with a dotted line and the normal one with a solid line. The colored grids symbolize the two buttons. The green arrow line is the trajectory of the agents trained using the original QMIX, and the orange one is achieved by our method. References to color refer to the online version of this figure

        5.2 StarCraft Multi-Agent Challenge

        5.2.1 Settings

In addition to the grid-world experiment, we conducted experiments on StarCraft II decentralized micromanagement tasks to show the effectiveness of our method. In this environment, we assumed that the crashed agents take random actions. In this experiment, we also used QMIX (Rashid et al., 2018) as the base model. We then compared the performance of QMIX with that of our coach-assisted framework under the fixed crash rate, curriculum learning, and adaptive crash rate coaching strategies. Our implementation was also based on the Pymarl algorithm library (Samvelyan et al., 2019) without changing the default training schedules. For the variants of QMIX with a fixed crash rate, we randomly sampled the crashed agents from a Bernoulli distribution during each episode; thus, the actual number of crashed agents ranged from 0 to n. In the curriculum learning coaching strategy, the crash rate increased linearly from 0, and the upper limit was set to 0.1 because we tested the models in scenarios whose crash rate was at most 0.1. We set the two hyperparameters β in {0.60, 0.65, 0.70, 0.75} and ρ in {0.001, 0.003, 0.005, 0.015}, and selected their optimal values by grid search when adopting our adaptive method. We repeated the experiments in each setting over five runs with different seeds and report the average results. For all the compared methods, each task was separately trained for two million steps. To obtain a relatively robust evaluation, each model was tested 128 times.

We chose two standard maps and designed two additional maps for the experiment: 3s_vs_5z, 3s5z_vs_3s5z, 8m_vs_5z, and 8s_vs_3s5z. The two standard maps were well-matched in strength, so a crash could result in some imbalance. To comprehensively show the performance of our method, we also designed two maps that guaranteed an appropriate gap in strength between the two sides, so that unexpected crashes would not lead to a significant change in difficulty. For more details about the maps, please refer to Samvelyan et al. (2019).

        5.2.2 Observation

In this part, we discuss the observations from the scenarios with crashed agents on StarCraft II micromanagement tasks, and show what must be considered to deal with the crash scenario.

The agents in Figs. 3a and 3b play the role of Marines (ours), which are good at long-range attacks, while Zealots (opponents) can attack only at short range; the two sides have the same moving speed. Because the health points of Marines are only half those of Zealots, the optimal strategy is to alternate fire to attract the enemies. In Fig. 3a, one agent (highlighted with a red rectangle) is out of control and starts to take random actions, while one of the remaining agents (highlighted with a yellow rectangle) is disrupted so that it cannot take a reasonable action. This case illustrates that the random crashes of some agents will undermine the coordination among the rest of the agents in the team, which is likely to cause a drop in the win rate. However, it can be observed in Fig. 3b that agents trained with our method can avoid such effects of an unexpected crash because they may be familiar with abnormal observations.

Figs. 3c and 3d describe another situation, in which Stalkers (ours) play against Zealots and Stalkers (opponents) on the map 8s_vs_3s5z. Stalkers are good at long-range attacks, while Zealots are skilled in short-range attacks, and Stalkers move faster than Zealots. Stalkers can win the game by simply attacking when the number of normal agents is sufficient, but they will fail if they use the same strategy in crashed scenarios. Fig. 3c illustrates that the Stalkers trained by QMIX learn only to attack continuously, because this simple policy achieves good performance in normal scenarios. However, if they split into two groups, i.e., some of them attract the Zealots and repeatedly kite (i.e., attack and step back) while the others focus fire to eliminate the enemy Stalkers and then attack the remaining enemies together, they are likely to achieve better performance (Fig. 3d). This case indicates that once a simple winning strategy exists, the learning algorithm has little incentive to explore other optimal strategies, leading to poor capability in the event of crashed agents. This observation implies that increasing the challenge during training may drive the agents to learn better policies.

Fig. 3 Illustration of the actions taken by agents trained using the original QMIX (left) and our adaptive approach (right) when tested in scenarios with crashes. In (a) and (b), the agent highlighted with a red rectangle is the crashed one and the agent highlighted with a yellow rectangle is the one that is affected. In (c) and (d), the system shows different behavior patterns after being trained with different approaches. References to color refer to the online version of this figure

        5.2.3 Performance evaluation and discussion

We evaluated the performance of the compared methods by testing the win rate under different crash rates; the results are shown in Table 1. It can be observed that on the standard maps, even a simple fixed crash rate strategy can help improve performance. In contrast, on our designed maps, this approach works badly when the crash rate is low. We assume this occurs because the maps we designed are relatively simple, so even the original MARL algorithms can handle the scenarios with a low crash rate. In this case, fixing a low crash rate may instead introduce noise, which affects the learning process. However, in scenarios with a high crash rate, this method still has a positive effect. The curriculum learning strategy tends to perform well in scenarios with a low crash rate. In summary, these two straightforward methods can, to some degree, help the system be more robust in the face of an unexpected crash, but they both have limits. In contrast, our adaptive approach improves the performance across different maps and crash rates, which demonstrates the effectiveness and generalization of our approach.

When compared with the baseline algorithm, our adaptive method tends to gain a larger margin as the crash rate increases, indicating the superiority of our adaptive strategy in dealing with unexpected crashes. This finding further supports the rationale of our adaptive strategy, which allows agents to learn how to handle crash scenarios step by step. In addition, the performance achieved by our method with re-sampling is consistently superior to that achieved without this strategy. We attribute this to the fact that without a re-sampling strategy, some samples may contain more crashed agents, creating more difficulty during training. This finding also proves the importance of adopting a re-sampling strategy in our coach-assisted framework.

        5.2.4 Hyperparameter analysis

In our adaptive framework, the performance threshold β and the learning rate of the crash rate ρ, which jointly decide the updating of the adaptive crash rate, are of vital importance to the performance of our method. In this subsection, we further analyze the influence of these two hyperparameters on the overall performance, with other parameters unchanged.

Here, we take the map 3s_vs_5z as an example. Table 2 reports the results of our method under different values of β and ρ. Given the same ρ, a large β means that agents must learn quite well under the current crash rate before exploring a more difficult scenario. We can see that, given ρ = 0.003, the overall win rate first increases and then decreases as β increases from 0.60 to 0.75, and the best performance is achieved when β = 0.65. Given the same β, the performance first increases and then degrades as ρ increases. The reason may be that if ρ is too small, the crash rate α is adjusted too slowly, so the agents cannot learn well within a limited number of steps; if ρ is too large, the sharply increasing crash rate may make it too difficult for the agents to learn coordination, and the adjustment of the difficulty will be coarse. In summary, the hyperparameters indeed have some effect on our framework, but our method achieves relatively stable performance when the hyperparameters vary within a small range, which proves the robustness of our method.

        Table 1 The performance of the compared methods in terms of the win rate (including mean and standard deviation) under different crash rates

        Table 2 The impact of performance threshold β and learning rate ρ

        6 Conclusions

Considering the common phenomenon that some agents may unexpectedly crash in real-world scenarios, this work is dedicated to a coach-assisted MARL framework that can close this sim-to-real gap. Our method simulates different random crash rates during the training process with the help of a coach, so that agents can master the skills necessary to deal with crashes. We conducted experiments on grid-world and StarCraft II micromanagement tasks to show the necessity of considering crashes during operation, and tested the effectiveness of our framework with three coaching strategies in scenarios with unexpected crashes. The results demonstrated the efficacy and generalization of our method under different crash rates. In the future, we will investigate the case in which crashed agents may take abnormal actions other than random actions, as well as more efficient coaching strategies.

        Contributors

Jian ZHAO designed the research and Weixun WANG gave advice. Youpeng ZHAO and Mingyu YANG conducted the experiments. Jian ZHAO and Youpeng ZHAO drafted the paper. Xunhan HU helped prepare the figures. Wengang ZHOU, Jianye HAO, and Houqiang LI revised and finalized the paper.

        Compliance with ethics guidelines

Jian ZHAO, Youpeng ZHAO, Weixun WANG, Mingyu YANG, Xunhan HU, Wengang ZHOU, Jianye HAO, and Houqiang LI declare that they have no conflict of interest.
