Constrained Policy Improvement for Safe and Efficient Reinforcement Learning

May 20, 2018 · Entered Twilight · 🏛 arXiv.org

🌅 TWILIGHT: Old Age
Predates the code-sharing era: a pioneer of its time

"Last commit was 5.0 years ago (โ‰ฅ5 year threshold)"

Evidence collected by the PWNC Scanner

Repo contents: .gitignore, README.md, __init__.py, actors.sh, agent.py, ape_agent.py, config.py, constrained_policy_improvement__supplementary_material.pdf, environment.py, evaluate.ipynb, experiment.py, install, learner.sh, logger.py, main.py, memory.py, memory_rnn.py, model.py, plot_results.py, ppo_agent.py, preprocess.py, r2d2_agent.py, rbi_agent.py, rbi_rnn_agent.py, runs

Authors: Elad Sarafian, Aviv Tamar, Sarit Kraus
arXiv ID: 1805.07805
Category: cs.LG (Machine Learning)
Cross-listed: cs.AI, stat.ML
Citations: 11
Venue: arXiv.org
Repository: https://github.com/eladsar/rbi ⭐ 1
Last checked: 1 month ago
Abstract
We propose a policy improvement algorithm for Reinforcement Learning (RL) called Rerouted Behavior Improvement (RBI). RBI is designed to take into account the evaluation errors of the $Q$-function. Such errors are common in RL when learning the $Q$-value from finite past experience data. Greedy policies, or even constrained policy optimization algorithms that ignore these errors, may suffer from an improvement penalty (i.e. a negative policy improvement). To minimize the improvement penalty, the RBI idea is to attenuate rapid policy changes for low-probability actions, which were sampled less frequently. This approach is shown to avoid catastrophic performance degradation and to reduce regret when learning from a batch of past experience. Through the example of a two-armed bandit with Gaussian-distributed rewards, we show that it also increases data efficiency when the optimal action has a high variance. We evaluate RBI in two tasks in the Atari Learning Environment: (1) learning from observations of multiple behavior policies and (2) iterative RL. Our results demonstrate the advantage of RBI over greedy policies and other constrained policy optimization algorithms, both as a safe learning approach and as a general data-efficient learning algorithm. An anonymous GitHub repository with our RBI implementation is available at https://github.com/eladsar/rbi.
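The abstract's core idea, attenuating policy changes for low-probability (hence rarely sampled) actions, can be illustrated with a small sketch. This is not the authors' implementation (that lives in the linked repo); it is a minimal, assumed version of a "reroute"-style constrained step: maximize the expected $Q$-value subject to per-action multiplicative bounds around the behavior policy, so no action's probability can change too rapidly. The function name, bound parameters `c_min`/`c_max`, and the greedy water-filling solver are all illustrative assumptions.

```python
import numpy as np

def reroute_policy(q, beta, c_min=0.5, c_max=1.5):
    """Illustrative constrained improvement step (not the paper's code).

    Maximizes sum_a pi(a) * q(a) subject to
        c_min * beta(a) <= pi(a) <= c_max * beta(a),  sum_a pi(a) = 1,
    which requires c_min <= 1 <= c_max for feasibility. Solved greedily:
    start every action at its floor, then pour the remaining probability
    mass into the highest-Q actions first, up to each action's ceiling.
    """
    q = np.asarray(q, dtype=float)
    beta = np.asarray(beta, dtype=float)
    pi = c_min * beta                        # each action keeps its floor
    budget = 1.0 - pi.sum()                  # probability mass left to place
    for a in np.argsort(q)[::-1]:            # best actions first
        room = (c_max - c_min) * beta[a]     # head-room up to the ceiling
        add = min(room, budget)
        pi[a] += add
        budget -= add
        if budget <= 0.0:
            break
    return pi
```

For example, with behavior policy `beta = [0.5, 0.5]` and `q = [1.0, 0.0]`, a greedy step would jump to `[1.0, 0.0]`, while the bounded step returns `[0.75, 0.25]`: the better arm gains mass, but the worse arm retains a floor in case its low estimate is an evaluation error.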
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

📜 Similar Papers

In the same crypt: Machine Learning