Near Minimax-Optimal Distributional Temporal Difference Algorithms and The Freedman Inequality in Hilbert Spaces
CoRR (2024)
Abstract
Distributional reinforcement learning (DRL) has achieved empirical success in
various domains. One of the core tasks in the field of DRL is distributional
policy evaluation, which involves estimating the return distribution η^π
for a given policy π. The distributional temporal difference (TD) algorithm
has been accordingly proposed, which is an extension of the temporal difference
algorithm in the classic RL literature. In the tabular case,
Rowland et al. [2018] and Rowland et al. [2023] proved the
asymptotic convergence of two instances of distributional TD, namely
the categorical temporal difference algorithm (CTD) and the quantile
temporal difference algorithm (QTD), respectively. In this paper, we go a step further
and analyze the finite-sample performance of distributional TD. To facilitate
theoretical analysis, we propose a non-parametric distributional TD algorithm
(NTD). For a γ-discounted infinite-horizon tabular Markov decision
process, we show that for NTD we need
Õ(1/(ε^{2p}(1-γ)^{2p+1})) iterations
to achieve an ε-optimal estimator with high probability, when the
estimation error is measured by the p-Wasserstein distance. This sample
complexity bound is minimax optimal (up to logarithmic factors) in the case of
the 1-Wasserstein distance. To achieve this, we establish a novel Freedman
inequality in Hilbert spaces, which may be of independent interest. In
addition, we revisit CTD, showing that the same non-asymptotic convergence
bounds hold for CTD in the case of the p-Wasserstein distance.
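
For concreteness, the minimax-optimality claim concerns the 1-Wasserstein case; substituting p = 1 into the iteration bound above is a one-line specialization:

```latex
% Iteration bound of NTD specialized to the 1-Wasserstein distance (p = 1):
\widetilde{O}\!\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)\bigg|_{p=1}
= \widetilde{O}\!\left(\frac{1}{\varepsilon^{2}\,(1-\gamma)^{3}}\right)
```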
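The error metric throughout is the p-Wasserstein distance. On the real line, W_p equals the L^p distance between quantile functions, so for two equal-weight empirical distributions it can be computed by sorting both samples and pairing atoms by rank. Below is a minimal sketch of this metric; the function name and the synthetic data are illustrative, not part of the paper:

```python
import numpy as np

def wasserstein_p(samples_a, samples_b, p=1):
    """p-Wasserstein distance between two equal-size, equal-weight
    empirical distributions on the real line.

    In one dimension, W_p is the L^p distance between quantile
    functions; with n equal-weight atoms per distribution, this is
    obtained by sorting both samples and pairing atoms by rank.
    """
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    assert a.shape == b.shape, "this sketch assumes equal sample sizes"
    return np.mean(np.abs(a - b) ** p) ** (1.0 / p)

# Illustrative use: error between an estimated and a reference return
# distribution (both synthetic here).
rng = np.random.default_rng(0)
eta_hat = rng.normal(loc=1.0, scale=1.0, size=1000)
eta_ref = rng.normal(loc=1.2, scale=1.0, size=1000)
print(wasserstein_p(eta_hat, eta_ref, p=1))
```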
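CTD, one of the two instances the paper revisits, maintains a categorical distribution on a fixed support and, after each sampled transition, projects the distributional Bellman target back onto that support. The sketch below shows one such update in the standard categorical-projection style of the distributional RL literature; the signature and variable names are ours for illustration, not the paper's pseudocode:

```python
import numpy as np

def ctd_update(p_s, p_next, r, gamma, alpha, z):
    """One categorical TD (CTD) update for tabular policy evaluation.

    p_s    : current categorical estimate at state s, shape (K,)
    p_next : current estimate at the sampled next state s', shape (K,)
    r      : sampled reward;  gamma : discount;  alpha : step size
    z      : fixed, strictly increasing support z_1 < ... < z_K, shape (K,)

    Each support atom z_j is pushed forward to r + gamma * z_j, the result
    is projected back onto the fixed support by splitting each atom's mass
    linearly between its two neighbouring support points, and the estimate
    at s is moved toward this projected target with step size alpha.
    """
    K = len(z)
    g = np.clip(r + gamma * z, z[0], z[-1])      # pushed-forward atoms
    idx = np.searchsorted(z, g, side="right") - 1
    idx = np.clip(idx, 0, K - 2)                 # lower-neighbour index
    w = (g - z[idx]) / (z[idx + 1] - z[idx])     # mass fraction to upper atom
    target = np.zeros(K)
    np.add.at(target, idx, p_next * (1.0 - w))
    np.add.at(target, idx + 1, p_next * w)
    return (1.0 - alpha) * p_s + alpha * target
```

One such update is applied per observed transition; the non-asymptotic behaviour of this iteration is what the convergence bounds above describe.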