3 Matching Annotations
  1. Jul 2023
    1. This paper introduces the DDPG algorithm, which builds on the earlier DPG algorithm from classical RL theory. The main idea is to learn a deterministic (or nearly deterministic) policy, suited to settings where the environment is very sensitive to suboptimal actions and a single action usually dominates in each state. DDPG showed good performance, but could not consistently beat algorithms such as PPO until the additions introduced by SAC. SAC augments the objective with an entropy bonus, which rewards policy entropy and so encourages exploration instead of premature commitment to a single action. With these additions, the actor-critic approach that grew out of deterministic policy gradients performs well; see the sketch below.
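
      For concreteness, here is a minimal PyTorch sketch of the deterministic actor update at the heart of DDPG: the actor is trained to ascend the critic's Q-value, which is exactly the deterministic policy gradient. The network sizes, learning rate, and random stand-in batch are illustrative placeholders, not values from the paper.

      ```python
      import torch
      import torch.nn as nn

      obs_dim, act_dim = 8, 2  # hypothetical dimensions for illustration

      # Deterministic actor mu(s) and critic Q(s, a).
      actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                            nn.Linear(64, act_dim), nn.Tanh())
      critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                             nn.Linear(64, 1))
      actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

      # One actor update on a batch of states (stand-in for replay-buffer data).
      states = torch.randn(32, obs_dim)
      q_values = critic(torch.cat([states, actor(states)], dim=-1))
      actor_loss = -q_values.mean()  # ascend Q: grad_a Q(s, a) * grad_theta mu(s)
      actor_opt.zero_grad()
      actor_loss.backward()
      actor_opt.step()  # only the actor is stepped here; the critic has its own update
      ```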

    1. Paper that introduced the PPO algorithm. PPO is, in a way, a response to TRPO: it keeps TRPO's core idea but implements it as a simpler and more efficient algorithm.

      TRPO frames each policy update as a constrained optimization problem (maximize a surrogate objective subject to a KL-divergence trust-region constraint), solved explicitly rather than learned through plain gradient steps. PPO replaces the hard constraint with a clipped surrogate objective; see the sketch below.
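
      As a reference point, a minimal sketch of PPO's clipped surrogate loss, negated so a standard optimizer can minimize it. The function name and tensor shapes are my own; eps = 0.2 matches the clipping parameter used in the paper's experiments.

      ```python
      import torch

      def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
          # Probability ratio r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t).
          ratio = torch.exp(logp_new - logp_old)
          unclipped = ratio * advantages
          # Clipping removes the incentive to push the ratio outside [1-eps, 1+eps].
          clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
          # Elementwise pessimistic minimum, averaged over the batch, then negated.
          return -torch.min(unclipped, clipped).mean()

      # Usage on dummy data: all three inputs are per-timestep tensors.
      loss = ppo_clip_loss(torch.randn(64), torch.randn(64), torch.randn(64))
      ```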