- Jul 2024
-
arxiv.org
-
2024 paper arguing that methods other than PPO may be better suited to "value alignment" of LLMs.
-
- Jul 2023
-
-
This paper introduces the DDPG algorithm, which builds on the existing deterministic policy gradient (DPG) algorithm by using deep neural networks for the actor and the critic. The main idea is to learn a deterministic policy that maps each state to a single action, which suits continuous-action settings where the environment is very sensitive to suboptimal actions and one action usually dominates in each state. DDPG showed good performance, but was generally not competitive with algorithms such as PPO until later off-policy methods such as SAC. Note that SAC adds an entropy bonus to the objective, which rewards rather than penalizes randomness in the policy, and it learns a stochastic policy rather than a deterministic one. With this change, off-policy actor-critic methods of this kind perform well.
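The deterministic policy gradient idea can be illustrated with a toy sketch. It is not the DDPG implementation from the paper: the critic here is a fixed, hand-written function Q(s, a) = -(a - target_gain * s)^2 rather than a learned network, the policy is a single scalar gain, and the names `dpg_actor_gradient`, `train_actor`, and `target_gain` are all hypothetical. It only shows the chain rule used for the actor update: the gradient of Q with respect to the action, multiplied by the gradient of the policy output with respect to its parameter.

```python
import random


def dpg_actor_gradient(theta, states, target_gain):
    """Average of dQ/da * dmu/dtheta over a batch of states.

    Assumed toy setup: policy mu(s) = theta * s and
    critic Q(s, a) = -(a - target_gain * s) ** 2.
    """
    grad = 0.0
    for s in states:
        action = theta * s                         # deterministic action
        dq_da = -2.0 * (action - target_gain * s)  # critic gradient w.r.t. action
        dmu_dtheta = s                             # policy gradient w.r.t. theta
        grad += dq_da * dmu_dtheta
    return grad / len(states)


def train_actor(theta=0.0, target_gain=1.5, lr=0.1, steps=200, seed=0):
    """Gradient ascent on Q(s, mu(s)) using the deterministic policy gradient."""
    rng = random.Random(seed)
    for _ in range(steps):
        states = [rng.uniform(-1.0, 1.0) for _ in range(64)]
        theta += lr * dpg_actor_gradient(theta, states, target_gain)
    return theta
```

In this toy problem the best action in every state is `target_gain * s`, so the learned `theta` should approach `target_gain`. DDPG itself also learns the critic from replayed transitions and uses target networks, which this sketch leaves out.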
-
-
arxiv.org
-
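Paper that introduced the PPO algorithm. PPO is, in a way, a response to the TRPO algorithm: it keeps the core idea of limiting how far each update moves the policy, but implements it with a simpler and more efficient algorithm.
TRPO poses each policy update as a constrained optimization problem (maximize a surrogate objective subject to a KL-divergence limit on the policy change), whereas PPO replaces the hard constraint with a clipped surrogate objective that works with ordinary first-order gradient steps.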
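PPO's simplification of TRPO is usually summarized by its clipped surrogate objective, which can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a full training loop: the function name `ppo_clip_objective` is hypothetical, log-probabilities and advantages are passed in as plain lists, and the default `clip_eps=0.2` is just a commonly used value.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    For each sample, the probability ratio r = pi_new(a|s) / pi_old(a|s)
    is clipped to [1 - clip_eps, 1 + clip_eps], and the smaller of the
    clipped and unclipped terms is kept.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

Taking the minimum means a sample with positive advantage stops contributing extra objective once its ratio exceeds `1 + clip_eps`, while a sample with negative advantage is still fully penalized when its ratio grows. That removes the incentive for large policy changes without solving a constrained problem at every update.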
-
-
arxiv.org
-
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. "Prioritized Experience Replay." ICLR, 2016.
-