
Approximate Policy Iteration With Deep Minimax Average Bellman Error Minimization

Document Details


Indexed in: SCIE

Affiliations:
[1] Duke NUS Med Sch, Cardiovasc & Metab Disorders Program, Singapore 169857, Singapore
[2] Wuhan Univ, Sch Math & Stat, Wuhan 430072, Peoples R China
[3] Wuhan Univ, Hubei Key Lab Computat Sci, Wuhan 430072, Peoples R China
[4] Duke NUS Med Sch, Ctr Quantitat Med, Singapore 169857, Singapore
[5] Huazhong Univ Sci & Technol, Tongji Hosp, Tongji Med Coll, Dept Anesthesiol, Wuhan 430030, Peoples R China

Keywords: alpha-mixing; deep approximate policy iteration (DAPI); deep neural networks; minimax loss; nonasymptotic error bound; reinforcement learning (RL)

Abstract:
In this work, we investigate the use of deep approximate policy iteration (DAPI) for estimating the optimal action-value function Q* in reinforcement learning, with rectified linear unit (ReLU) ResNet as the underlying function approximator. Each iteration of DAPI applies the minimax average Bellman error minimization principle, employing a ReLU ResNet to estimate the fixed point of the Bellman equation associated with the current estimated greedy policy. Through error propagation, we derive nonasymptotic error bounds between Q* and the estimated Q function induced by the output greedy policy of DAPI. To control the Bellman residual error, we bound both the statistical and approximation errors arising from the α-mixing dependent data generated by Markov decision processes, using techniques from empirical process theory and deep approximation theory, respectively. Furthermore, we present a novel generalization bound for ReLU ResNet under dependent data, as well as an approximation bound for ReLU ResNet over the Hölder class. Notably, the approximation bound improves the dependence on the ambient dimension from exponential to polynomial. The derived nonasymptotic error bounds depend explicitly on the sample size, the ambient dimension (polynomially), and the width and depth of the neural networks; they therefore serve as theoretical guidelines for setting these hyperparameters to achieve the desired convergence rate when training DAPI.
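For orientation, the display below sketches the per-iteration objective that "minimax average Bellman error minimization" typically denotes, written in the standard form used to avoid the double-sampling problem. The notation (function class F for the Q estimate, test-function class U, and the quadratic regularizer) reflects our assumption of the common formulation in the literature, not a verbatim reproduction of the paper's definitions.

% Sketch (assumed standard formulation; \mathcal{F} and \mathcal{U} would both
% be ReLU ResNet classes in the paper's setting). Given transitions
% (s_i, a_i, r_i, s_i'), discount factor \gamma, and current greedy policy \pi,
% define the empirical Bellman residual
\[
  \delta_i(Q) \;=\; Q(s_i, a_i) \;-\; r_i \;-\; \gamma\, Q\bigl(s_i', \pi(s_i')\bigr),
\]
% and fit the next Q estimate by solving the minimax problem
\[
  \widehat{Q} \;=\; \operatorname*{arg\,min}_{Q \in \mathcal{F}}\;
  \max_{u \in \mathcal{U}}\;
  \frac{1}{n} \sum_{i=1}^{n}
  \Bigl[\, \delta_i(Q)\, u(s_i, a_i) \;-\; \tfrac{1}{2}\, u(s_i, a_i)^2 \,\Bigr].
\]

For a fixed state-action pair, the inner maximum is attained at u(s, a) = E[δ(Q) | s, a], so the minimax value equals half the squared conditional (average) Bellman error; minimizing it over F is what the error-propagation analysis then converts into a bound between Q* and the Q function induced by the output greedy policy.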

CAS Journal Ranking:
Publication year [2023] edition:
Major category | Tier 1: Computer Science
Subcategories | Tier 1: Computer Science: Hardware; Tier 1: Computer Science: Theory & Methods; Tier 2: Computer Science: Artificial Intelligence; Tier 2: Engineering: Electrical & Electronic
Latest [2025] edition:
Major category | Tier 1: Computer Science
Subcategories | Tier 1: Computer Science: Hardware; Tier 1: Computer Science: Theory & Methods; Tier 1: Engineering: Electrical & Electronic; Tier 2: Computer Science: Artificial Intelligence
JCR Quartiles:
Publication year [2022] edition:
Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE; Q1 COMPUTER SCIENCE, THEORY & METHODS; Q1 ENGINEERING, ELECTRICAL & ELECTRONIC
Latest [2024] edition:
Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE; Q1 COMPUTER SCIENCE, THEORY & METHODS; Q1 ENGINEERING, ELECTRICAL & ELECTRONIC


First author's affiliation: [1] Duke NUS Med Sch, Cardiovasc & Metab Disorders Program, Singapore 169857, Singapore
