Update RL chapter (#349)

* fix chap12 render

* add distributed rl chapter

* fix bug

* fix issue #212

* fix typo

* update imgs

* fix chinese

* fix svg img

* update contents in rl chapter

* update marl sys

* fix a fig

* fix ref

* fix error

Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>
Zihan Ding
2022-05-23 09:04:08 -04:00
committed by GitHub
parent 7da0f4a781
commit 719de7d582
14 changed files with 1016 additions and 20 deletions


@@ -30,7 +30,7 @@ Ray :cite:`moritz2018ray` was initiated by several researchers at UC Berkeley as a
![RLlib distributed training](../img/ch12/ch12-rllib-distributed.svg)
:width:`800px`
:width:`600px`
:label:`ch12/ch12-rllib_dist`


@@ -12,7 +12,7 @@
From the introduction and definitions above, it is clear that multi-agent reinforcement learning is a more complex problem than single-agent reinforcement learning. In fact, with multiple agents present, each agent's decision problem is by no means just the sum of the individual single-agent decision problems; the real situation is far more complex than single-agent decision making. The study of multi-agent systems is actually an old discipline, closely related to game theory (Game Theory), and long before deep reinforcement learning became popular it had already accumulated a large body of research and many theoretically open problems. One typical problem is that the Nash equilibrium of a two-player non-zero-sum game cannot be computed in polynomial time; this is in fact a problem in the class PPAD (Polynomial Parity Argument, Directed version). (See the paper Settling the Complexity of Computing Two-Player Nash Equilibria, Xi Chen, et al.) Due to space limitations we cannot explore multi-agent problems in depth here; instead, we use a simple example to explain why multi-agent reinforcement learning problems cannot simply be solved by single-agent reinforcement learning algorithms.
:Scissors rock paper payoffs
:Scissors-rock-paper payoffs
| | Scissors | Rock | Paper |
| --- | ------- | ------- | ------- |
@@ -21,6 +21,19 @@
| Paper | (-1,+1) | (+1,-1) | (0,0) |
:label:`tab_ch12_ch12_marl`
We consider a game everyone is familiar with, "rock, paper, scissors", and the wins and losses when two players play it: scissors < rock < paper < scissors..., where "<" means the former pure strategy is completely dominated by the latter; we give the two players payoffs of -1 and +1 accordingly, and when they choose the same pure strategy both receive a payoff of 0. This yields the payoff table shown in :numref:`tab_ch12_ch12_marl`, where the horizontal axis is player 1, the vertical axis is player 2, and each pair in the table gives the payoffs player 1 and player 2 receive under the corresponding actions. By the antisymmetry of this matrix, the Nash equilibrium strategy of this problem is the same for both players: the strategy distribution $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$, i.e., playing scissors, rock, or paper each with probability $\frac{1}{3}$. If obtaining this Nash equilibrium strategy is taken as the goal of multi-agent learning, a simple analysis shows that it cannot be reached by a naive single-agent algorithm. Suppose the two players are randomly initialized with two arbitrary pure strategies, say player 1 plays scissors and player 2 plays rock. Holding player 2's strategy fixed, player 2 can be treated as part of a fixed environment, so any single-agent reinforcement learning algorithm can train player 1 to maximize its own payoff; player 1 then converges to the pure strategy paper. Fixing player 1 and training player 2, player 2 converges to the pure strategy scissors. And so on: the training process never converges, with players 1 and 2 each cycling through the 3 strategies without ever obtaining the correct Nash equilibrium strategy.
We consider a game everyone is familiar with, scissors-rock-paper, and the wins and losses when two players play it. We know the win/loss relation is: scissors < rock < paper < scissors..., where "<" means the former pure strategy is completely dominated by the latter; we give the two players payoffs of -1 and +1 accordingly, and when they choose the same pure strategy both receive a payoff of 0. This yields the payoff table shown in :numref:`tab_ch12_ch12_marl`, where the horizontal axis is player 1, the vertical axis is player 2, and each pair in the table gives the payoffs player 1 and player 2 receive under the corresponding actions. By the antisymmetry of this matrix, the Nash equilibrium strategy of this problem is the same for both players: the strategy distribution $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$, i.e., playing scissors, rock, or paper each with probability $\frac{1}{3}$. If obtaining this Nash equilibrium strategy is taken as the goal of multi-agent learning, a simple analysis shows that it cannot be reached by a naive single-agent algorithm. Suppose the two players are randomly initialized with two arbitrary pure strategies, say player 1 plays scissors and player 2 plays rock. Holding player 2's strategy fixed, player 2 can be treated as part of a fixed environment, so any single-agent reinforcement learning algorithm can train player 1 to maximize its own payoff; player 1 then converges to the pure strategy paper. Fixing player 1 and training player 2, player 2 converges to the pure strategy scissors. And so on: the training process never converges, with players 1 and 2 each cycling through the 3 strategies without ever obtaining the correct Nash equilibrium strategy.
The learning method we used in the example above is in fact the most basic one in multi-agent reinforcement learning, called self-play (Self-play). As we saw, self-play may fail to converge to the final goal we want under certain task settings. Precisely because such cyclic structures arise in multi-agent learning, more sophisticated training methods and learning schemes designed specifically for multiple agents are needed to reach our goal. Generally speaking, multi-agent reinforcement learning is a more complex class than single-agent reinforcement learning: for self-play methods, the single-agent reinforcement learning process can be regarded as a subtask of multi-agent reinforcement learning. In terms of the small game above, when player 1's strategy is fixed, player 1 plus the game environment constitutes player 2's effective learning environment; since this environment is fixed, player 2 can maximize its own payoff with single-agent reinforcement learning; then, fixing player 2's strategy, player 1 can again run single-agent reinforcement learning... In this way single-agent reinforcement learning is a subtask of the multi-agent task. Other algorithms, such as fictitious self-play (Fictitious Self-play), instead compute, in each single-agent reinforcement learning step, a best response to the average of the opponent's historical strategies, with the opponent trained in the same way; iterating this cycle provably converges to the Nash equilibrium strategy in games like scissors-rock-paper above.
![Illustration of the self-play algorithm.](../img/ch12/ch12-marl-sp.png)
:width:`600px`
:label:`ch12/ch12-marl-sp`
The learning method we used in the example above is in fact the most basic one in multi-agent reinforcement learning, called self-play (Self-play), shown in :numref:`ch12/ch12-marl-sp`. The self-play method fixes the current opponent's strategy and maximizes one agent's performance by single-agent optimization; the resulting strategy is called the best response strategy (Best Response Strategy). This best response is then fixed as that agent's strategy, and the other agent's strategy is optimized in turn, and so on in a cycle. As we saw, self-play may fail to converge to the final goal we want under certain task settings. Precisely because such cyclic structures arise in multi-agent learning, more sophisticated training methods and learning schemes designed specifically for multiple agents are needed to reach our goal. Generally speaking, multi-agent reinforcement learning is a more complex class than single-agent reinforcement learning: for self-play methods, the single-agent reinforcement learning process can be regarded as a subtask of multi-agent reinforcement learning. In terms of the small game above, when player 1's strategy is fixed, player 1 plus the game environment constitutes player 2's effective learning environment; since this environment is fixed, player 2 can maximize its own payoff with single-agent reinforcement learning; then, fixing player 2's strategy, player 1 can again run single-agent reinforcement learning... In this way single-agent reinforcement learning is a subtask of the multi-agent task. Other algorithms, such as fictitious self-play (Fictitious Self-play, :numref:`ch12/ch12-marl-fsp`), instead compute, in each single-agent reinforcement learning step, a best response to the average of the opponent's historical strategies, with the opponent trained in the same way; iterating this cycle provably converges to the Nash equilibrium strategy in games like scissors-rock-paper above (a small numerical sketch follows the figure below).
![Illustration of the fictitious self-play algorithm.](../img/ch12/ch12-marl-fsp.png)
:width:`600px`
:label:`ch12/ch12-marl-fsp`
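To make the contrast above concrete, the following sketch (illustrative only: the payoff matrix `A` and the `best_response` helper are our own notation, not code from this book) runs naive self-play and fictitious self-play on the scissors-rock-paper payoffs:

```python
import numpy as np

# Player 1's payoff A[i, j] when player 1 plays i and player 2 plays j
# (0 = scissors, 1 = rock, 2 = paper). The matrix is antisymmetric, so by
# symmetry the same best-response rule applies to both players.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def one_hot(i):
    e = np.zeros(3)
    e[i] = 1.0
    return e

def best_response(opponent_mix):
    """Pure-strategy best response to a (possibly mixed) opponent strategy."""
    return int(np.argmax(A @ opponent_mix))

# Naive self-play: alternately best-respond to the opponent's CURRENT strategy.
p1, p2 = 0, 1                                  # player 1: scissors, player 2: rock
for step in range(6):
    p1 = best_response(one_hot(p2))            # train player 1 with player 2 fixed
    p2 = best_response(one_hot(p1))            # train player 2 with player 1 fixed
    print("self-play:", (p1, p2))              # the pure strategies cycle forever

# Fictitious self-play: best-respond to the AVERAGE of the opponent's history.
counts1, counts2 = one_hot(0), one_hot(1)      # histories of past pure strategies
for step in range(30000):
    counts1 += one_hot(best_response(counts2 / counts2.sum()))
    counts2 += one_hot(best_response(counts1 / counts1.sum()))
print(np.round(counts1 / counts1.sum(), 3))    # -> approximately [1/3, 1/3, 1/3]
```

The first loop reproduces the scissors, paper, rock cycle described above, while the empirical strategy frequencies in the second loop approach the Nash equilibrium $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.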


@@ -10,10 +10,27 @@
We divide the difficulties in building multi-agent reinforcement learning systems into the following points for discussion:
* **Complexity from the number of agents**: The most direct change from a single-agent system to a multi-agent system is that the number of agents grows from 1 to more than 1. For an $N$-agent system with independent agents, the representation complexity of the policy space grows exponentially, i.e., $\tilde{O}(e^N)$. As a simple example, for a single-agent system with discrete spaces, suppose the state space has size $S$, the action space has size $A$, and the game horizon is $H$; the size of this discrete policy space is then $O(HSA)$. Directly extending the game to $N$ players, the size of the joint distribution space of all players' policies becomes $O(HSA^N)$. This is because the joint policy space is the product of the individual players' policy spaces, $\mathcal{A}=\mathcal{A}_1\times\dots\times\mathcal{A}_N$, which directly increases the search complexity of the algorithms.
* **Complexity from the number of agents**: The most direct change from a single-agent system to a multi-agent system is that the number of agents grows from 1 to more than 1. For an $N$-agent system with independent agents, the representation complexity of the policy space grows exponentially, i.e., $\tilde{O}(e^N)$. As a simple example, for a single-agent system with discrete spaces, suppose the state space has size $S$, the action space has size $A$, and the game horizon is $H$; the size of this discrete policy space is then $O(HSA)$. Directly extending the game to $N$ players, in the most general case, i.e., when all players have symmetric action spaces (each of size $A$) and share no structural information, the size of the joint distribution space of all players' policies becomes $O(HSA^N)$. This is because the joint policy space is the product of the individual players' policy spaces, $\mathcal{A}=\mathcal{A}_1\times\dots\times\mathcal{A}_N$, which directly increases the search complexity of the algorithms (see the first sketch after this list).
* **Complexity from game types**: From a game-theoretic perspective, the game types a multi-agent system can produce are complex. The most direct classification distinguishes competitive, cooperative, and mixed games. In competitive games, the most typical research model is the two-player zero-sum game, such as the scissors-rock-paper game in the previous subsection. Nash equilibrium strategies in such games are generally mixed, i.e., the equilibrium condition cannot be met by any single pure strategy; pure-strategy Nash equilibria exist only in a minority of zero-sum games. In cooperative games, multiple agents must cooperate to improve the overall reward. Research here generally follows the value-decomposition idea, distributing the joint reward received by all agents to individual agents as their own rewards; algorithms of this class include VDN :cite:`sunehag2017value`, COMA :cite:`foerster2018counterfactual`, QMIX :cite:`rashid2018qmix`, etc. In mixed games, some agents cooperate while some agents or coalitions of agents compete. A general non-zero-sum game that is not purely cooperative is a mixed game; a simple example is the prisoner's dilemma (Prisoner's Dilemma), whose payoff table is shown in :numref:`tab_ch12_ch12_marl_prison`. Each of the two players has two actions, staying silent and betraying. The payoffs can be read as the police interrogating two criminals, the absolute value of a payoff being the number of years of the sentence. Since the sum of the players' payoffs is not constant, this is a non-zero-sum game. It can be considered neither purely competitive nor purely cooperative: when one side stays silent and the other betrays, the two fail to cooperate, one receiving a payoff of 0 and the other -3; when both stay silent, a cooperative strategy, each receives -1. Although the latter looks better than the other strategies, it is not the Nash equilibrium of this game, because a Nash equilibrium assumes that players' strategies are chosen independently and cannot form a joint strategy distribution; this effectively cuts off communication and potential cooperation between players. The Nash equilibrium of the prisoner's dilemma is therefore for both players to betray each other (the second sketch after this list verifies this by enumerating best responses). Game types like these mean that single-agent reinforcement learning cannot be directly used to optimize each agent's policy in a multi-agent system: single-agent reinforcement learning is generally a process of finding an extremum, while solving for a Nash equilibrium in a multi-agent system is often a process of finding a max-min point, i.e., a saddle point, which is also different from an optimization perspective. Such complex relations require a more general system to express them, which poses challenges for building multi-agent systems. Multi-agent game types can also be classified along many other axes, such as single-round versus multi-round games and simultaneous versus sequential decision making, and each type of game has its own algorithms. Existing multi-agent systems, however, usually target a single game type or a single algorithm; general-purpose multi-agent reinforcement learning systems, especially distributed ones, are lacking.
:Prisoner's dilemma payoffs
| | Silent | Betray |
| --- | ------- | ------- |
| Silent | (-1,-1) | (-3,0) |
| Betray | (0,-3) | (-2,-2) |
:label:`tab_ch12_ch12_marl_prison`
* **Complexity from game types**: From a game-theoretic perspective, the game types a multi-agent system can produce are complex. The most direct classification distinguishes competitive, cooperative, and mixed games. In mixed games, some agents cooperate while some agents or coalitions of agents compete. Such complex relations require a more general system to express them, which poses challenges for building multi-agent systems. Multi-agent game types can also be classified along many other axes, such as single-round versus multi-round games and simultaneous versus sequential decision making, and each type of game has its own algorithms. Existing multi-agent systems, however, usually target a single game type or a single algorithm; general-purpose multi-agent reinforcement learning systems, especially distributed ones, are lacking.
* **Algorithm heterogeneity**: As the simple multi-agent algorithms introduced earlier, such as self-play and fictitious self-play, show, multi-agent algorithms are sometimes composed of many rounds of single-agent reinforcement learning, and the algorithm types differ across game types. For cooperative games, many algorithms build on the idea of credit assignment (Credit Assignment): how to reasonably distribute the joint reward earned by multiple agents to individual agents is the core of this class. By execution scheme these further divide into centralized training with centralized execution (Centralized Training Centralized Execution), centralized training with decentralized execution (Centralized Training Decentralized Execution), and decentralized training with decentralized execution (Decentralized Training Decentralized Execution), describing how unified the agents' training and execution processes are. For competitive games, various approximate methods for computing Nash equilibria are often used, such as the aforementioned fictitious self-play, Double Oracle, Mirror Descent, and so on, which treat a single-agent reinforcement learning run that obtains one optimal policy as an "action" and approximate the Nash equilibrium of the meta-game formed by these "actions". Existing algorithms differ greatly across similar problems, which makes building a unified multi-agent reinforcement learning system difficult.
* **Combining learning methods**: In work such as AlphaStar :cite:`vinyals2019grandmaster` mentioned earlier, obtaining a good policy in a multi-agent system often requires not only reinforcement learning but also the assistance of other learning methods such as imitation learning, e.g., building labeled training samples from the game records of top human players to pretrain the agents. Given the complexity of these large-scale games, this is often an effective way to quickly improve agent performance early in training. For the learning system as a whole, this requires combining different learning paradigms, such as switching sensibly between imitation learning and reinforcement learning. A large-scale multi-agent system is therefore not merely a reinforcement learning system, but needs many other learning and coordination mechanisms working together.
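For the first item in the list above: the joint policy space grows because the joint action space is the Cartesian product $\mathcal{A}_1\times\dots\times\mathcal{A}_N$. A tiny sketch (assuming, purely for illustration, the symmetric case with $A=3$ actions per player):

```python
from itertools import product

ACTIONS = [0, 1, 2]                                  # A = 3 actions per player (assumed)
for N in (1, 2, 4, 8):
    joint_actions = list(product(ACTIONS, repeat=N))  # A_1 x ... x A_N
    print(N, len(joint_actions))                      # 3, 9, 81, 6561: grows as A**N
```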
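For the prisoner's dilemma in the second item, a pure-strategy Nash equilibrium can be checked by enumeration: a strategy pair is an equilibrium if and only if neither player gains from a unilateral deviation. A minimal sketch of that check (the `is_nash` helper is our own illustration):

```python
from itertools import product

# Payoffs from the prisoner's dilemma table: 0 = stay silent, 1 = betray;
# R[i][j] = (player 1's payoff, player 2's payoff).
R = [[(-1, -1), (-3, 0)],
     [(0, -3), (-2, -2)]]

def is_nash(i, j):
    # Neither player can improve its own payoff by deviating alone.
    p1_ok = all(R[i][j][0] >= R[k][j][0] for k in (0, 1))
    p2_ok = all(R[i][j][1] >= R[i][k][1] for k in (0, 1))
    return p1_ok and p2_ok

print([(i, j) for i, j in product((0, 1), repeat=2) if is_nash(i, j)])
# -> [(1, 1)]: mutual betrayal, even though (0, 0) pays both players more.
```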
:numref:`ch12/ch12_marl_sys` shows a distributed multi-agent reinforcement learning system. The two agents in the figure extend analogously to more agents. Each agent contains multiple actors (Actor) for sampling and learners (Learner) for updating the model; these actors and learners can run in parallel to speed up training, following, e.g., the A3C and IMPALA architectures introduced in the single-agent distributed systems section. Trained models are stored and managed centrally in the model storage; whether each agent's models are stored separately depends on whether the agents are symmetric. Models in storage can be scored by the model evaluator in preparation for the model selector. The model selector chooses models according to the model evaluator, or according to a meta-learner (such as the PSRO algorithm :cite:`lanctot2017unified`) and equilibrium solvers, and distributes the chosen models to each agent's actors. We call this process league-based management (League-based Management). For interaction with the environment, the distributed system can use an inference server (Inference Server) to run centralized inference for the models in all parallel processes, sending actions (Action) computed from observations (Observation) to the environments; the environment side can also be parallel. The inference server sends the collected interaction trajectories back to each agent for model training. The above is one example of a distributed multi-agent system; in practice, designs may differ with the game type and algorithm structure (a schematic sketch follows the figure below).
![Distributed multi-agent reinforcement learning system](../img/ch12/ch12-marl-sys.png)
:width:`800px`
:label:`ch12/ch12_marl_sys`
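The control flow of :numref:`ch12/ch12_marl_sys` can be summarized in a short single-process sketch. All names below (ModelStorage, Evaluator, ModelSelector, InferenceServer) are hypothetical stand-ins for the figure's components, with placeholder policies and updates, not the API of any existing system; a real deployment would run actors, learners, and the inference server as separate processes or machines:

```python
import random

class ModelStorage:                        # model storage: versioned checkpoints
    def __init__(self):
        self.checkpoints = {}              # (agent_id, version) -> params

class Evaluator:                           # model evaluator: scores stored models
    def score(self, params):
        return random.random()             # placeholder; real systems play evaluation matches

class ModelSelector:                       # model selector: picks checkpoints to deploy
    def __init__(self, storage, evaluator):
        self.storage, self.evaluator = storage, evaluator

    def select(self, agent_id):
        keys = [k for k in self.storage.checkpoints if k[0] == agent_id]
        return max(keys, key=lambda k: self.evaluator.score(self.storage.checkpoints[k]))

class InferenceServer:                     # centralized inference for all parallel actors
    def __init__(self, params):
        self.params = params               # agent_id -> current model parameters

    def act(self, agent_id, observation):
        return random.choice([0, 1, 2])    # placeholder for policy(observation; params)

def collect_trajectories(server, horizon=8):        # actors: sample via the server
    return {a: [(obs, server.act(a, obs)) for obs in range(horizon)] for a in (1, 2)}

def learner_update(params, trajectory):             # learner: placeholder gradient step
    return params + len(trajectory)

storage, evaluator = ModelStorage(), Evaluator()
selector = ModelSelector(storage, evaluator)
params = {1: 0, 2: 0}                      # two (possibly asymmetric) agents' parameters
for version in range(3):
    server = InferenceServer(params)
    trajs = collect_trajectories(server)             # trajectories sent back to learners
    for agent in (1, 2):
        params[agent] = learner_update(params[agent], trajs[agent])
        storage.checkpoints[(agent, version)] = params[agent]     # model update/storage
    chosen = {a: selector.select(a) for a in (1, 2)}              # league-based selection
    params = {a: storage.checkpoints[chosen[a]] for a in (1, 2)}  # model distribution
```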


@@ -1,6 +1,6 @@
## Introduction to Reinforcement Learning
近年来强化学习作为机器学习的一个分支受到越来越多的关注。从2013年起DeepMind公司的研究人员就提出深度Q学习Deep Q-learning用于学习7个不同的电子游戏中对象的操作。自此以后以DeepMind为首的科研机构推出了像AlphaGo下围棋这类的引人瞩目的强化学习成果并在2016年与世界顶级围棋高手李世石的对战中取得胜利。自那以后强化学习领域连续取得了一系列成就如星际争霸游戏智能体AlphaStar、Dota 2游戏智能体OpenAI Five、多人零和博弈德州扑克的Pluribus、机器狗运动控制算法等。在这一系列科研成就的背后是整个强化学习领域算法在这些年内快速迭代进步的结果基于模拟器产生的大量数据使得对数据“饥饿”Data Hungry的深度神经网络能够表现出很好的拟合效果从而将强化学习算法的能力充分发挥出来在以上领域中达到或者超过人类专家的学习表现。目前强化学习已经从电子游戏逐步走向更广阔的应用场景如机器人控制、机械手灵巧操作、能源系统调度、网络负载分配、股票期货交易等一系列更加现实和富有意义的领域对传统控制方法和启发式决策理论发起冲击。
近年来强化学习作为机器学习的一个分支受到越来越多的关注。从2013年起DeepMind公司的研究人员就提出深度Q学习 :cite:`mnih2013playing`Deep Q-learning用于学习7个不同的电子游戏中对象的操作。自此以后以DeepMind为首的科研机构推出了像AlphaGo下围棋这类的引人瞩目的强化学习成果并在2016年与世界顶级围棋高手李世石的对战中取得胜利。自那以后强化学习领域连续取得了一系列成就如星际争霸游戏智能体AlphaStar、Dota 2游戏智能体OpenAI Five、多人零和博弈德州扑克的Pluribus、机器狗运动控制算法等。在这一系列科研成就的背后是整个强化学习领域算法在这些年内快速迭代进步的结果基于模拟器产生的大量数据使得对数据“饥饿”Data Hungry的深度神经网络能够表现出很好的拟合效果从而将强化学习算法的能力充分发挥出来在以上领域中达到或者超过人类专家的学习表现。目前强化学习已经从电子游戏逐步走向更广阔的应用场景如机器人控制、机械手灵巧操作、能源系统调度、网络负载分配、股票期货交易等一系列更加现实和富有意义的领域对传统控制方法和启发式决策理论发起冲击。
![Reinforcement learning framework](../img/ch12/ch12-rl.png)

BIN
img/ch12/ch12-marl-fsp.pdf Normal file


BIN
img/ch12/ch12-marl-fsp.png Normal file



BIN
img/ch12/ch12-marl-sp.pdf Normal file


BIN
img/ch12/ch12-marl-sp.png Normal file



BIN
img/ch12/ch12-marl-sys.png Normal file



782
img/ch12/ch12-marl-sys.svg Normal file

@@ -0,0 +1,782 @@
[SVG source elided: 782 lines of Inkscape-generated XML drawing the distributed multi-agent RL system figure. Its text labels are: Agent 1 / Agent 2 (智能体1/2) boxes with stacked model (模型), actor (行动者), and learner (学习者) blocks; a league-based management (联盟型管理) group containing the model selector (模型选择器), meta-learner (元学习者), equilibrium solvers etc. (均衡求解器等), model evaluator (模型评估器), and model storage (模型存储器); and a batched inference (批量推理) group with the environment (环境) and inference server (推理服务器) exchanging observations (观察量) and actions (动作). Arrows mark trajectories (轨迹), model distribution (模型分发), model update and storage (模型更新存储), centralized inference (集中推理), and returned trajectories (传回轨迹).]




@@ -8,7 +8,7 @@
version="1.1"
id="svg5"
inkscape:version="1.1.1 (1:1.1+202109281949+c3084ef5ed)"
sodipodi:docname="rllib_dist1.svg"
sodipodi:docname="ch12-rllib-distributed.svg"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns="http://www.w3.org/2000/svg"
@@ -28,8 +28,8 @@
fit-margin-right="0"
fit-margin-bottom="0"
inkscape:zoom="0.45291836"
inkscape:cx="442.68464"
inkscape:cy="325.66576"
inkscape:cx="301.37882"
inkscape:cy="473.59529"
inkscape:window-width="1848"
inkscape:window-height="1136"
inkscape:window-x="72"
@@ -174,9 +174,10 @@
id="path17845"
sodipodi:nodetypes="cccc" />
<path
style="fill:none;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
d="m 59.75092,108.92605 0.03189,8.81987"
id="path18293" />
style="fill:none;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow2Mend)"
d="m 59.75092,108.92605 -0.01464,7.19111"
id="path18293"
sodipodi:nodetypes="cc" />
<path
style="fill:none;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow2Mend)"
d="M 205.96492,71.048464 206.19379,39.397006 58.50728,38.32907 58.51278,81.318156"



@@ -20,6 +20,49 @@
publisher={MIT Press}
}
@article{lanctot2017unified,
title={A unified game-theoretic approach to multiagent reinforcement learning},
author={Lanctot, Marc and Zambaldi, Vinicius and Gruslys, Audrunas and Lazaridou, Angeliki and Tuyls, Karl and P{\'e}rolat, Julien and Silver, David and Graepel, Thore},
journal={Advances in neural information processing systems},
volume={30},
year={2017}
}
@article{mnih2013playing,
title={Playing atari with deep reinforcement learning},
author={Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Graves, Alex and Antonoglou, Ioannis and Wierstra, Daan and Riedmiller, Martin},
journal={arXiv preprint arXiv:1312.5602},
year={2013}
}
@article{sunehag2017value,
title={Value-decomposition networks for cooperative multi-agent learning},
author={Sunehag, Peter and Lever, Guy and Gruslys, Audrunas and Czarnecki, Wojciech Marian and Zambaldi, Vinicius and Jaderberg, Max and Lanctot, Marc and Sonnerat, Nicolas and Leibo, Joel Z and Tuyls, Karl and others},
journal={arXiv preprint arXiv:1706.05296},
year={2017}
}
@inproceedings{rashid2018qmix,
title={Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning},
author={Rashid, Tabish and Samvelyan, Mikayel and Schroeder, Christian and Farquhar, Gregory and Foerster, Jakob and Whiteson, Shimon},
booktitle={International Conference on Machine Learning},
pages={4295--4304},
year={2018},
organization={PMLR}
}
@inproceedings{foerster2018counterfactual,
title={Counterfactual multi-agent policy gradients},
author={Foerster, Jakob and Farquhar, Gregory and Afouras, Triantafyllos and Nardelli, Nantas and Whiteson, Shimon},
booktitle={Proceedings of the AAAI conference on artificial intelligence},
volume={32},
number={1},
year={2018}
}
@inproceedings{krizhevsky2012imagenet,
title={Imagenet classification with deep convolutional neural networks},
author={Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E},


@@ -1,9 +1,4 @@
@article{han2020tstarbot,
title={Tstarbot-x: An open-sourced and comprehensive study for efficient league training in starcraft ii full game},
author={Han, Lei and Xiong, Jiechao and Sun, Peng and Sun, Xinghai and Fang, Meng and Guo, Qingwei and Chen, Qiaobo and Shi, Tengfei and Yu, Hongsheng and Wu, Xipeng and others},
journal={arXiv preprint arXiv:2011.13729},
year={2020}
}
@inproceedings{wang2021scc,
title={SCC: an efficient deep reinforcement learning agent mastering the game of StarCraft II},
@@ -171,4 +166,149 @@ series = {ADKDD'14}
year={2017},
howpublished = "Website",
note = {\url{http://www.nvidia.com/object/volta-architecture-whitepaper.html}}
}
@inproceedings{mnih2016asynchronous,
title={Asynchronous methods for deep reinforcement learning},
author={Mnih, Volodymyr and Badia, Adria Puigdomenech and Mirza, Mehdi and Graves, Alex and Lillicrap, Timothy and Harley, Tim and Silver, David and Kavukcuoglu, Koray},
booktitle={International Conference on Machine Learning (ICML)},
pages={1928--1937},
year={2016}
}
@article{espeholt2018impala,
title={Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures},
author={Espeholt, Lasse and Soyer, Hubert and Munos, Remi and Simonyan, Karen and Mnih, Volodymir and Ward, Tom and Doron, Yotam and Firoiu, Vlad and Harley, Tim and Dunning, Iain and others},
journal={arXiv preprint arXiv:1802.01561},
year={2018}
}
@article{espeholt2019seed,
title={Seed rl: Scalable and efficient deep-rl with accelerated central inference},
author={Espeholt, Lasse and Marinier, Rapha{\"e}l and Stanczyk, Piotr and Wang, Ke and Michalski, Marcin},
journal={arXiv preprint arXiv:1910.06591},
year={2019}
}
@misc{horgan2018distributed,
title={Distributed Prioritized Experience Replay},
author={Dan Horgan and John Quan and David Budden and Gabriel Barth-Maron and Matteo Hessel and Hado van Hasselt and David Silver},
year={2018},
eprint={1803.00933},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@inproceedings{moritz2018ray,
title={Ray: A distributed framework for emerging $\{$AI$\}$ applications},
author={Moritz, Philipp and Nishihara, Robert and Wang, Stephanie and Tumanov, Alexey and Liaw, Richard and Liang, Eric and Elibol, Melih and Yang, Zongheng and Paul, William and Jordan, Michael I and others},
booktitle={13th $\{$USENIX$\}$ Symposium on Operating Systems Design and Implementation ($\{$OSDI$\}$ 18)},
pages={561--577},
year={2018}
}
@article{liang2017ray,
title={Ray rllib: A composable and scalable reinforcement learning library},
author={Liang, Eric and Liaw, Richard and Nishihara, Robert and Moritz, Philipp and Fox, Roy and Gonzalez, Joseph and Goldberg, Ken and Stoica, Ion},
journal={arXiv preprint arXiv:1712.09381},
pages={85},
year={2017}
}
@article{cassirer2021reverb,
title={Reverb: A Framework For Experience Replay},
author={Cassirer, Albin and Barth-Maron, Gabriel and Brevdo, Eugene and Ramos, Sabela and Boyd, Toby and Sottiaux, Thibault and Kroiss, Manuel},
journal={arXiv preprint arXiv:2102.04736},
year={2021}
}
@article{hoffman2020acme,
title={Acme: A research framework for distributed reinforcement learning},
author={Hoffman, Matt and Shahriari, Bobak and Aslanides, John and Barth-Maron, Gabriel and Behbahani, Feryal and Norman, Tamara and Abdolmaleki, Abbas and Cassirer, Albin and Yang, Fan and Baumli, Kate and others},
journal={arXiv preprint arXiv:2006.00979},
year={2020}
}
@article{vinyals2019grandmaster,
title={Grandmaster level in StarCraft II using multi-agent reinforcement learning},
author={Vinyals, Oriol and Babuschkin, Igor and Czarnecki, Wojciech M and Mathieu, Micha{\"e}l and Dudzik, Andrew and Chung, Junyoung and Choi, David H and Powell, Richard and Ewalds, Timo and Georgiev, Petko and others},
journal={Nature},
volume={575},
number={7782},
pages={350--354},
year={2019},
publisher={Nature Publishing Group}
}
@article{berner2019dota,
title={Dota 2 with large scale deep reinforcement learning},
author={Berner, Christopher and Brockman, Greg and Chan, Brooke and Cheung, Vicki and D{\k{e}}biak, Przemys{\l}aw and Dennison, Christy and Farhi, David and Fischer, Quirin and Hashme, Shariq and Hesse, Chris and others},
journal={arXiv preprint arXiv:1912.06680},
year={2019}
}
@article{han2020tstarbot,
title={Tstarbot-x: An open-sourced and comprehensive study for efficient league training in starcraft ii full game},
author={Han, Lei and Xiong, Jiechao and Sun, Peng and Sun, Xinghai and Fang, Meng and Guo, Qingwei and Chen, Qiaobo and Shi, Tengfei and Yu, Hongsheng and Wu, Xipeng and others},
journal={arXiv preprint arXiv:2011.13729},
year={2020}
}
@article{sunehag2017value,
title={Value-decomposition networks for cooperative multi-agent learning},
author={Sunehag, Peter and Lever, Guy and Gruslys, Audrunas and Czarnecki, Wojciech Marian and Zambaldi, Vinicius and Jaderberg, Max and Lanctot, Marc and Sonnerat, Nicolas and Leibo, Joel Z and Tuyls, Karl and others},
journal={arXiv preprint arXiv:1706.05296},
year={2017}
}
@inproceedings{rashid2018qmix,
title={Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning},
author={Rashid, Tabish and Samvelyan, Mikayel and Schroeder, Christian and Farquhar, Gregory and Foerster, Jakob and Whiteson, Shimon},
booktitle={International Conference on Machine Learning},
pages={4295--4304},
year={2018},
organization={PMLR}
}
@inproceedings{foerster2018counterfactual,
title={Counterfactual multi-agent policy gradients},
author={Foerster, Jakob and Farquhar, Gregory and Afouras, Triantafyllos and Nardelli, Nantas and Whiteson, Shimon},
booktitle={Proceedings of the AAAI conference on artificial intelligence},
volume={32},
number={1},
year={2018}
}
@article{lanctot2017unified,
title={A unified game-theoretic approach to multiagent reinforcement learning},
author={Lanctot, Marc and Zambaldi, Vinicius and Gruslys, Audrunas and Lazaridou, Angeliki and Tuyls, Karl and P{\'e}rolat, Julien and Silver, David and Graepel, Thore},
journal={Advances in neural information processing systems},
volume={30},
year={2017}
}
@article{mnih2013playing,
title={Playing atari with deep reinforcement learning},
author={Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Graves, Alex and Antonoglou, Ioannis and Wierstra, Daan and Riedmiller, Martin},
journal={arXiv preprint arXiv:1312.5602},
year={2013}
}
@article{ding2020efficient,
title={Efficient Reinforcement Learning Development with RLzoo},
author={Ding, Zihan and Yu, Tianyang and Huang, Yanhua and Zhang, Hongming and Li, Guo and Guo, Quancheng and Mai, Luo and Dong, Hao},
journal={arXiv preprint arXiv:2009.08644},
year={2020}
}
@article{makoviychuk2021isaac,
title={Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning},
author={Makoviychuk, Viktor and Wawrzyniak, Lukasz and Guo, Yunrong and Lu, Michelle and Storey, Kier and Macklin, Miles and Hoeller, David and Rudin, Nikita and Allshire, Arthur and Handa, Ankur and others},
journal={arXiv preprint arXiv:2108.10470},
year={2021}
}