
Approximate Policy Iteration With Deep Minimax Average Bellman Error Minimization

Document Details


Indexed in: SCIE

Affiliations:
[1] Duke NUS Med Sch, Cardiovasc & Metab Disorders Program, Singapore 169857, Singapore
[2] Wuhan Univ, Sch Math & Stat, Wuhan 430072, Peoples R China
[3] Wuhan Univ, Hubei Key Lab Computat Sci, Wuhan 430072, Peoples R China
[4] Duke NUS Med Sch, Ctr Quantitat Med, Singapore 169857, Singapore
[5] Huazhong Univ Sci & Technol, Tongji Hosp, Tongji Med Coll, Dept Anesthesiol, Wuhan 430030, Peoples R China

Keywords: alpha-mixing; deep approximate policy iteration (DAPI); deep neural networks; minimax loss; nonasymptotic error bound; reinforcement learning (RL)

Abstract:
In this work, we investigate the use of deep approximate policy iteration (DAPI) for estimating the optimal action-value function Q* in reinforcement learning, with rectified linear unit (ReLU) ResNet as the underlying function approximator. Each iteration of DAPI applies the minimax average Bellman error minimization principle, employing a ReLU ResNet to estimate the fixed point of the Bellman equation associated with the current estimated greedy policy. Through error propagation, we derive nonasymptotic error bounds between Q* and the estimated Q function induced by the output greedy policy of DAPI. To control the Bellman residual error, we bound both the statistical and approximation errors arising from the α-mixing dependent data generated by Markov decision processes, using techniques from empirical process theory and deep approximation theory, respectively. Furthermore, we present a novel generalization bound for ReLU ResNet under dependent data, as well as an approximation bound for ReLU ResNet over the Hölder class. Notably, the approximation bound improves the dependence on the ambient dimension from exponential to polynomial. The derived nonasymptotic error bounds depend explicitly on the sample size, the ambient dimension (polynomially), and the width and depth of the neural networks; they therefore serve as theoretical guidelines for setting these hyperparameters to achieve the desired convergence rate when training DAPI.
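For orientation, the display below sketches the per-iteration objective that "minimax average Bellman error minimization" typically denotes, written in the standard form used to avoid the double-sampling problem. The notation (function class F for the Q estimate, test-function class U, and the quadratic regularizer) reflects our assumption of the common formulation in the literature, not a verbatim reproduction of the paper's definitions.

% Sketch (assumed standard formulation; \mathcal{F} and \mathcal{U} would both
% be ReLU ResNet classes in the paper's setting). Given transitions
% (s_i, a_i, r_i, s_i'), discount factor \gamma, and current greedy policy \pi,
% define the empirical Bellman residual
\[
  \delta_i(Q) \;=\; Q(s_i, a_i) \;-\; r_i \;-\; \gamma\, Q\bigl(s_i', \pi(s_i')\bigr),
\]
% and fit the next Q estimate by solving the minimax problem
\[
  \widehat{Q} \;=\; \operatorname*{arg\,min}_{Q \in \mathcal{F}}\;
  \max_{u \in \mathcal{U}}\;
  \frac{1}{n} \sum_{i=1}^{n}
  \Bigl[\, \delta_i(Q)\, u(s_i, a_i) \;-\; \tfrac{1}{2}\, u(s_i, a_i)^2 \,\Bigr].
\]

For a fixed state-action pair, the inner maximum is attained at u(s, a) = E[δ(Q) | s, a], so the minimax value equals half the squared conditional (average) Bellman error; minimizing it over F is what the error-propagation analysis then converts into a bound between Q* and the Q function induced by the output greedy policy.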

CAS Journal Ranking:
Publication year [2023] edition:
Major category | Tier 1: Computer Science
Subcategories | Tier 1: Computer Science: Hardware; Tier 1: Computer Science: Theory & Methods; Tier 2: Computer Science: Artificial Intelligence; Tier 2: Engineering: Electrical & Electronic
Latest [2025] edition:
Major category | Tier 1: Computer Science
Subcategories | Tier 1: Computer Science: Hardware; Tier 1: Computer Science: Theory & Methods; Tier 1: Engineering: Electrical & Electronic; Tier 2: Computer Science: Artificial Intelligence
JCR Quartiles:
Publication year [2022] edition:
Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE; Q1 COMPUTER SCIENCE, THEORY & METHODS; Q1 ENGINEERING, ELECTRICAL & ELECTRONIC
Latest [2024] edition:
Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE; Q1 COMPUTER SCIENCE, THEORY & METHODS; Q1 ENGINEERING, ELECTRICAL & ELECTRONIC


First author's affiliation: [1] Duke NUS Med Sch, Cardiovasc & Metab Disorders Program, Singapore 169857, Singapore
