59 Matching Annotations
  1. Feb 2024
    1. T. Herlau, "Moral Reinforcement Learning Using Actual Causation," 2022 2nd International Conference on Computer, Control and Robotics (ICCCR), Shanghai, China, 2022, pp. 179-185, doi: 10.1109/ICCCR54399.2022.9790262. keywords: {Digital control;Ethics;Costs;Philosophical considerations;Toy manufacturing industry;Reinforcement learning;Forestry;Causality;Reinforcement learning;Actual Causation;Ethical reinforcement learning}

  2. Nov 2023
  3. Oct 2023
    1. (Chen, NeurIPS, 2021) Che1, Lu, Rajeswaran, Lee, Grover, Laskin, Abbeel, Srinivas, and Mordatch. "Decision Transformer: Reinforcement Learning via Sequence Modeling". Arxiv preprint rXiv:2106.01345v2, June, 2021.

      Quickly a very influential paper with a new idea of how to learn generative models of action prediction using SARSA training from demonstration trajectories. No optimization of actions or rewards, but target reward is an input.

    1. Wu, Prabhumoye, Yeon Min, Bisk, Salakhutdinov, Azaria, Mitchell and Li. "SPRING: GPT-4 Out-performs RL Algorithms byStudying Papers and Reasoning". Arxiv preprint arXiv:2305.15486v2, May, 2023.

    2. Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RLbaselines, trained for 1M steps, without any training.

      Them's fighten' words!

      I haven't read it yet, but we're putting it on the list for this fall's reading group. Seriously, a strong result with a very strong implied claim. they are careful to say it's from their empirical results, very worth a look. I suspect that amount of implicit knowledge in the papers, text and DAG are helping to do this.

      The Big Question: is their comparison to RL baselines fair, are they being trained from scratch? What does a fair comparison of any from-scratch model (RL or supervised) mean when compared to an LLM approach (or any approach using a foundation model), when that model is not really from scratch.

    1. Training language models to follow instructionswith human feedback

      Original Paper for discussion of the Reinforcement Learning with Human Feedback algorithm.

  4. Sep 2023
  5. Jul 2023
    1. IMPALA: Scalable Distributed Deep-RL with Importance WeightedActor-Learner Architectures

      (Espeholt, ICML, 2018) "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures"

    1. Yann LeCun released his vision for the future of Artificial Intelligence research in 2022, and it sounds a lot like Reinforcement Learning.

    1. Paper that evaluated the existing Double Q-Learning algorithm on the new DQN approach and validated that it is very effective in the Deep RL realm.

    1. This paper introduces the DDPG algorithm which builds on the existing DPG algorithm from classic RL theory. The main idea is to define a deterministic policy, or nearly deterministic, for situations where the environment is very sensitive to suboptimal actions, and one action setting usually dominates in each state. This showed good performance, but could not beat algorithms such as PPO until the additions of SAC were added. SAC adds an entropy penalty which essentially penalizes uncertainty in any states. Using this, the deterministic policy gradient approach performs well.

    1. This famous paper gives a great review of the DQN algorithm a couple years after it changed everything in Deep RL. It compares six different extensions to DQN for Deep Reinforcement Learning, many of which have now become standard additions to DQN and other Deep RL algorithms. It also combines all of them together to produce the "rainbow" algorithm, which outperformed many other models for a while.

    1. Arxiv paper from 2021 on reinforcement learning in a scenario where your aim is to learn a workable POMDP policy, but you start with a fully observable MDP and adjust it over time towards a POMDP.

    1. Paper that introduced the PPO algorithm. PPO is, in a way, a response to the TRPO algorithm, trying to use the core idea but implement a more efficient and simpler algorithm.

      TRPO defines the problem as a straight optimization problem, no learning is actually involved.

    1. Bowen Baker et. al. (Open AI) "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos" Arkiv, June 2022.

      Introduction of VPT : New semi-supervied pre-trained model for sequential decision making on Minecraft. Data are from human video playthroughs but are unlabelled.

    1. Liang, Machado, Talvite, Bowling - AAMAS 2016 "State of the Art Control of Atari Games Using Shallow Reinforcement Learning"

      Response paper to DQN showing that well designed Value Function Approximations can also do well at these complex tasks without the use of Deep Learning

      A great paper showing how to think differently about the latest advances in Deep RL. All is not always what it seems!

  6. Jun 2023
    1. Liang, Machado, Talvite, Bowling - AAMAS 2016 "State of the Art Control of Atari Games Using Shallow Reinforcement Learning"

      A great paper showing how to think differently about the latest advances in Deep RL. All is not always what it seems!

    1. LeBlanc, D. G., & Lee, G. (2021). General Deep Reinforcement Learning in NES Games. Canadian AI 2021. Canadian Artificial Intelligence Association (CAIAC). https://doi.org/10.21428/594757db.8472938b

  7. Apr 2023
    1. Bowen Baker et. al. (Open AI) "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos" Arkiv, June 2022.

      New supervised pre-trained model for sequential decision making on Minecraft. Data are from human video playthroughs but are unlabelled.

      reinforcement-learning foundation-models pretrained-models proj-minerl minecraft

  8. Mar 2023
    1. asks for the Minecraft domain.

      They demonstrate the model on a "minecraft-like" domain (introduced earlier by someone else) where there are resources in the world and the agent has tasks.

  9. Feb 2023
    1. Definition 3.2 (simple reward machine).

      The MDP does not change, it's dynamics are the same, with or without the RM, as they are with or without a standard reward model. Additionally, the rewards from the RM can be non-Markovian with respect to the MDP because they inherently have a kind of memory or where you've been, limited to the agents "movement" (almost "in it's mind") about where it is along the goals for this task.

    2. e thenshow that an RM can be interpreted as specifying a single reward function over a largerstate space, and consider types of reward functions that can be expressed using RMs

      So by specifying a reward machine you are augmenting the state space of the MDP with higher level goals/subgoals/concepts that provide structure about what is good and what isn't.

    3. However, an agent that hadaccess to the specification of the reward function might be able to use such information tolearn optimal policies faster.

      Fascinating idea, why not? Why are we hiding the reward from the agent really?

    4. Reward Machines: Exploiting Reward FunctionStructure in Reinforcement Learning

      [Icarte, JAIR, 2022] "Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning"

    1. Using Reward Machines for High-Level Task Specificationand Decomposition in Reinforcement Learning

      [Icarte, PMLR, 2018] "Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning"

  10. Dec 2022
  11. Sep 2022
    1. AAAI 2022 Paper : Decentralized Mean Field Games Happy to discuss online.

      S. Ganapathi Subramanian, M. Taylor, M. Crowley, and P. Poupart., “Decentralized mean field games,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-2022), vol. 36, pp. 9439–9447, February 2022. 1.

  12. Jul 2022
  13. Jun 2022
  14. May 2022
    1. Question: What happened to Eligibility Traces in the Deep RL era? This paper highlights some of the reasons they are not used widely and proposes a way they could still be effective.

    1. Question: What happened to Eligibility Traces in the Deep RL era? This paper highlights some of the reasons they are not used widely and proposes a way they could still be effective.

  15. Mar 2022
  16. Jul 2021
  17. Jun 2021
  18. Mar 2021
    1. Using chemicals to improve our economy of attention and become emotionally "fitter" is an option that penetrated public consciousness some time ago.

      Same is true of reinforcement learning algorithms.

  19. Sep 2020
  20. May 2020
  21. Apr 2020
  22. Mar 2019
  23. Feb 2019
    1. We present MILABOT: a deep reinforcement learning chatbot developed by theMontreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prizecompetition. MILABOT is capable of conversing with humans on popular smalltalk topics through both speech and text. The system consists of an ensemble ofnatural language generation and retrieval models, including template-based models,bag-of-words models, sequence-to-sequence neural network and latent variableneural network models. By applying reinforcement learning to crowdsourced dataand real-world user interactions, the system has been trained to select an appropriateresponse from the models in its ensemble. The system has been evaluated throughA/B testing with real-world users, where it performed significantly better thanmany competing systems. Due to its machine learning architecture, the system islikely to improve with additional data