 Sep 2023

arxiv.org arxiv.org

Adaptive Stress Testing with Reward Augmentation for Autonomous Vehicle Validation


arxiv.org arxiv.org

"SPRING: GPT4 Outperforms RL Algorithms byStudying Papers and Reasoning"

Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training.
Them's fightin' words!
I haven't read it yet, but we're putting it on the list for this fall's reading group. Seriously, a strong result with a very strong implied claim. They are careful to say it's from their empirical results; very worth a look. I suspect the amount of implicit knowledge in the papers, text, and DAG is helping to do this.
The Big Question: is their comparison to RL baselines fair? Are they being trained from scratch? What does a fair comparison of any from-scratch model (RL or supervised) mean when compared to an LLM approach (or any approach using a foundation model), when that model is not really from scratch?


www.nature.com www.nature.com

and coupled with new algorithms
Almost an afterthought here; I would cast it differently: the new algorithms are a major part of it as well.

Wang et al. "Scientific discovery in the age of artificial intelligence", Nature, 2023.
A paper about the current state of using AI/ML for scientific discovery, connected with the AI4Science workshops at major conferences.
(NOTE: since Springer/Nature don't allow public PDFs to be linked without a paywall, we can't use Hypothesis directly on the PDF of the paper; this link is to the website version of it.)


arxiv.org arxiv.org

Causal Parrots: Large Language Models May Talk Causality But Are Not Causal



Benyamin Ghojogh and Ali Ghodsi. "Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey"


arxiv.org arxiv.org

"Attention is All You Need" Foundational paper introducing the Transformer Architecture.


arxiv.org arxiv.org

LaMDA: Language Models for Dialog Applications
"LaMDA: Language Models for Dialog Applications" Google's introduction of the LaMDA v1 Large Language Model.


arxiv.org arxiv.org

"Are Pretrained Convolutions Better than Pretrained Transformers?"


d4mucfpksywv.cloudfront.net d4mucfpksywv.cloudfront.net

GPT-2 introduction paper
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. "Language Models are Unsupervised Multitask Learners" (2019).



GPT-3 introduction paper

 Aug 2023

arxiv.org arxiv.org

Title: Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics. Authors: Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho. Note: This paper seems cool, using older interpretable machine learning models (graphical models) to understand what is going on inside a deep neural network.

 Jul 2023

arxiv.org arxiv.org

“Rung 1.5” Pearl’s ladder of causation [1, 10] ranks structures in a similar way as we do, i.e., increasing a model’s causal knowledge will yield a higher place upon his ladder. Like Pearl, we have three different levels in our scale. However, they do not correspond one-to-one.
They rescale Pearl's ladder levels downwards and define a new scale, arguing that the original definition of counterfactual as a different level on its own actually combines multiple types of added reasoning complexity.

"Causal Deep Learning" Authors:Jeroen Berrevoets, Krzysztof Kacprzyk, Zhaozhi Qian, Mihaela van der Schaar
Very general and ambitious approach for representing the full continuous conceptual spectrum of Pearl's Causal Ladder, and ability to model and learning parts of this from Data.
Discussed by Prof. van der Shaar at ICML2023 workshop on Counterfactuals.




proceedings.mlr.press proceedings.mlr.press

They think BoN moves reward mass from low-reward samples to high-reward samples

We find empirically that for best-of-n (BoN) sampling
They found this relationship surprising, but it does seem to fit better than other functions which mimic the general shape.
Question: is there a good reason why?

They use sqrt since KL scales quadratically, so it gets rid of the power of 2.
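The BoN KL they fit against has a simple closed form, KL(BoN‖base) = log n − (n−1)/n, which is where the sqrt quantity comes from; a minimal sketch (the function name bon_kl is mine, not the paper's):

```python
import math

# KL divergence between best-of-n (BoN) sampling and the base policy:
#   KL = log(n) - (n - 1) / n
# The fits use d = sqrt(KL) as the x-axis "distance" from the initial
# policy, which cancels the roughly quadratic growth of KL.
def bon_kl(n: int) -> float:
    return math.log(n) - (n - 1) / n

for n in (1, 2, 4, 16, 64):
    kl = bon_kl(n)
    print(f"n={n:3d}  KL={kl:.3f}  d=sqrt(KL)={math.sqrt(kl):.3f}")
```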

RL
"for ... we don't see any overoptimization, we just see the .. monotonically improves"
For RL, though, the monotonic growth seen here might still bend down later.


arxiv.org arxiv.org

The MuZero paper: model-based learning when the model is not directly available.


arxiv.org arxiv.org

Llama 2 release paper


arxiv.org arxiv.org

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le. "Towards a Human-like Open-Domain Chatbot". Google Research, Brain Team.
Defined the SSI metric for chatbots used in the LaMDA paper by Google.


arxiv.org arxiv.org

LaMDA pretraining as a language model.
Does this figure really mean anything? There is no Figure 3 in the paper at all.

Safety does not seem to benefit much from model scaling without fine-tuning.
Safety does not seem to be improved by larger models.

How LaMDA handles groundedness through interactions with an external information retrieval system
Does LaMDA always ask these questions? How far down the chain does it go?

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.
SSI metric definitions

Using one model for both generation and discrimination enables an efficient combined generate-and-discriminate procedure.
bidirectional model benefits

LaMDA Mount Everest provides facts that could not be attributed to known sources in about 30% of responses
Even with all this work, it will hallucinate about 30% of the time


arxiv.org arxiv.org

linear
W_\theta


openreview.net openreview.net

Shayan Shirahmad Gale Bagi, Zahra Gharaee, Oliver Schulte, and Mark Crowley. "Generative Causal Representation Learning for Out-of-Distribution Motion Forecasting". In International Conference on Machine Learning (ICML), Honolulu, Hawaii, USA, July 2023.


arxiv.org arxiv.org

Because DDPG is an off-policy algorithm, the replay buffer can be large, allowing the algorithm to benefit from learning across a set of uncorrelated transitions.
Off-policy algorithms can have a larger replay buffer.

One challenge when using neural networks for reinforcement learning is that most optimization algorithms assume that the samples are independently and identically distributed. Obviously, when the samples are generated from exploring sequentially in an environment this assumption no longer holds. Additionally, to make efficient use of hardware optimizations, it is essential to learn in minibatches, rather than online. As in DQN, we used a replay buffer to address these issues
Motivation for minibatches of training experiences and for the use of replay buffers in Deep RL.
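The replay-buffer idea in this quote fits in a few lines; a toy sketch (the class name ReplayBuffer and the capacity are my choices, not the paper's code):

```python
import random
from collections import deque

# Minimal replay buffer: sequential, correlated transitions go in,
# uniformly sampled (approximately i.i.d.) minibatches come out,
# restoring the assumption SGD-style optimizers rely on.
class ReplayBuffer:
    def __init__(self, capacity: int = 1_000_000):
        self.buf = deque(maxlen=capacity)  # oldest transitions fall off

    def add(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size: int):
        return random.sample(self.buf, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(500):                       # correlated stream of transitions
    buf.add((t, "state", "action", "reward"))
batch = buf.sample(8)                      # decorrelated minibatch
print(len(batch))  # → 8
```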

The DPG algorithm maintains a parameterized actor function μ(s|θ^μ) which specifies the current policy by deterministically mapping states to a specific action. The critic Q(s, a) is learned using the Bellman equation as in Q-learning. The actor is updated by applying the chain rule to the expected return from the start distribution J with respect to the actor parameters:
∇_{θ^μ} J ≈ E_{s_t∼ρ^β}[ ∇_{θ^μ} Q(s, a|θ^Q) |_{s=s_t, a=μ(s_t|θ^μ)} ] = E_{s_t∼ρ^β}[ ∇_a Q(s, a|θ^Q) |_{s=s_t, a=μ(s_t)} ∇_{θ^μ} μ(s|θ^μ) |_{s=s_t} ]   (6)
Silver et al. (2014) proved that this is the policy gradient, the gradient of the policy’s performance
The original DPG algorithm (non-deep) takes the Actor-Critic idea and makes the Actor deterministic.

Interestingly, all of our experiments used substantially fewer steps of experience than was used by DQN learning to find solutions in the Atari domain.
Training with DDPG seems to require fewer steps/examples than DQN.

The original DPG paper evaluated the algorithm with toy problems using tile-coding and linear function approximators. It demonstrated data efficiency advantages for off-policy DPG over both on- and off-policy stochastic actor critic.
(Non-deep) DPG used tile coding and linear VFAs.

It can be challenging to learn accurate value estimates. Q-learning, for example, is prone to overestimating values (Hasselt, 2010). We examined DDPG’s estimates empirically by comparing the values estimated by Q after training with the true returns seen on test episodes. Figure 3 shows that in simple tasks DDPG estimates returns accurately without systematic biases. For harder tasks the Q estimates are worse, but DDPG is still able to learn good policies.
In simple tasks, DDPG avoids the overestimation problem that Q-learning has without using Double Q-learning, though its Q estimates degrade on harder tasks.

It is not possible to straightforwardly apply Q-learning to continuous action spaces, because in continuous spaces finding the greedy policy requires an optimization of a_t at every timestep; this optimization is too slow to be practical with large, unconstrained function approximators and nontrivial action spaces
Why it is not possible for pure Q-learning to handle continuous action spaces.

Our contribution here is to provide modifications to DPG, inspired by the success of DQN, which allow it to use neural network function approximators to learn in large state and action spaces online
Contribution of this paper.

Directly implementing Q-learning (equation 4) with neural networks proved to be unstable in many environments.

As with Q-learning, introducing nonlinear function approximators means that convergence is no longer guaranteed. However, such approximators appear essential in order to learn and generalize on large state spaces.
Why Q-learning with nonlinear function approximation can't have guaranteed convergence.

We refer to our algorithm as Deep DPG (DDPG, Algorithm 1).

This means that the target values are constrained to change slowly, greatly improving the stability of learning.
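The slow-moving targets come from the "soft" update θ′ ← τθ + (1 − τ)θ′ with τ ≪ 1; a minimal sketch on plain lists (τ = 0.5 here only to make the effect visible; the paper uses τ = 0.001):

```python
# Soft ("Polyak") target-network update from the DDPG paper:
#   θ_target ← τ·θ_online + (1 − τ)·θ_target, with τ << 1,
# so the target values drift slowly toward the online network.
def soft_update(target: list[float], online: list[float], tau: float = 0.001) -> list[float]:
    return [tau * o + (1 - tau) * t for t, o in zip(target, online)]

target = [0.0, 0.0]
online = [1.0, -1.0]
for _ in range(3):
    target = soft_update(target, online, tau=0.5)
print(target)  # → [0.875, -0.875]
```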

A major challenge of learning in continuous action spaces is exploration. An advantage of off-policy algorithms such as DDPG is that we can treat the problem of exploration independently from the learning algorithm.
Learning and exploration are handled separately.

but modified for actor-critic and using “soft” target updates, rather than directly copying the weights.

This simple change moves the relatively unstable problem of learning the action-value function closer to the case of supervised learning, a problem for which robust solutions exist.

One approach to this problem is to manually scale the features so they are in similar ranges across environments and units. We address this issue by adapting a recent technique from deep learning called batch normalization

minimize covariance shift

This paper introduces the DDPG algorithm, which builds on the existing DPG algorithm from classic RL theory. The main idea is to define a deterministic, or nearly deterministic, policy for situations where the environment is very sensitive to suboptimal actions and one action setting usually dominates in each state. This showed good performance, but could not beat algorithms such as PPO until the additions from SAC arrived. SAC adds an entropy term which essentially rewards uncertainty in the policy, encouraging exploration. Using this, the deterministic policy gradient approach performs well.

normalizes each dimension across the samples in a minibatch to have unit mean and variance
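A sketch of what that normalization does, in the standard zero-mean, unit-variance form (plain Python, no learned scale/shift; my sketch, not the paper's implementation):

```python
# Normalize each dimension across the samples in a minibatch so that
# features with very different scales (positions vs. velocities, say)
# end up in comparable ranges. Sketch only.
def batch_norm(batch: list[list[float]], eps: float = 1e-5) -> list[list[float]]:
    n, dims = len(batch), len(batch[0])
    means = [sum(x[d] for x in batch) / n for d in range(dims)]
    vars_ = [sum((x[d] - means[d]) ** 2 for x in batch) / n for d in range(dims)]
    return [[(x[d] - means[d]) / (vars_[d] + eps) ** 0.5 for d in range(dims)]
            for x in batch]

# Two features on wildly different scales become comparable:
normed = batch_norm([[1.0, 100.0], [3.0, 300.0]])
```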


proceedings.mlr.press proceedings.mlr.press

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
(Espeholt, ICML, 2018) "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures"

We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace.

we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters

the progress has been primarily in single-task performance

multitask reinforcement learning
Task: Multitask Reinforcement Learning

IMPALA (Figure 1) uses an actor-critic setup to learn a policy π and a baseline function V^π. The process of generating experiences is decoupled from learning the parameters of π and V^π. The architecture consists of a set of actors, repeatedly generating trajectories of experience, and one or more learners that use the experiences sent from actors to learn π off-policy.

an agent is trained on each task

scalability

separately

We are interested in developing new methods capable of mastering a diverse set of tasks simultaneously as well as environments suitable for evaluating such methods.
Task: train agents that can do more than one thing.

IMPALA actors communicate trajectories of experience (sequences of states, actions, and rewards) to a centralised learner

full trajectories of experience

aggressively parallelising all time independent operations

learning becomes off-policy

IMPALA achieves exceptionally high data throughput rates of 250,000 frames per second, making it over 30 times faster than single-machine A3C

With the introduction of very deep model architectures, the speed of a single GPU is often the limiting factor during training.

IMPALA is also more data efficient than A3C based agents and more robust to hyperparameter values and network architectures

IMPALA uses synchronised parameter updates, which are vital to maintain data efficiency when scaling to many machines

A3C


proceedings.mlr.press proceedings.mlr.press

This paper introduced the DPG Algorithm


openreview.net openreview.net

Link to page with information about the paper: https://openreview.net/forum?id=rJeXCo0cYX


openreview.net openreview.net

Yann LeCun released his vision for the future of Artificial Intelligence research in 2022, and it sounds a lot like Reinforcement Learning.


www.cs.toronto.edu www.cs.toronto.edu

The paper that introduced the DQN algorithm for using Deep Learning with Reinforcement Learning to play Atari games.


arxiv.org arxiv.org

Paper that evaluated the existing Double Q-Learning algorithm on the new DQN approach and validated that it is very effective in the Deep RL realm.

Q-learning (Watkins, 1989) is one of the most popular reinforcement learning algorithms, but it is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values
Q-learning tends to overestimate the value of an action.

noise

unify these views

we can learn a parameterized value function

insufficiently flexible function approximation

Both the target network and the experience replay dramatically improve the performance of the algorithm

The target used by DQN is then

show overestimations can occur when the action values are inaccurate, irrespective of the source of approximation error
They show overestimations occur when there is approximation error in the value function approximation for Q(s,a).

θt

upward bias

In the original Double Q-learning algorithm, two value functions are learned by assigning each experience randomly to update one of the two value functions, such that there are two sets of weights, θ and θ′

θ′t

while Double Q-learning is unbiased.

The orange bars show the bias in a single Q-learning update when the action values are Q(s, a) = V*(s) + ε_a and the errors {ε_a}_{a=1}^m are independent standard normal random variables. The second set of action values Q′, used for the blue bars, was generated identically and independently. All bars are the average of 100 repetitions.
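The bias this figure illustrates is easy to reproduce: taking the max over noisy value estimates is biased upward, while selecting the action with one set of estimates and evaluating it with the other (the Double Q-learning trick) is not. A toy version of the setup (sample sizes and seed are mine, not the paper's):

```python
import random
import statistics

# Single vs. double estimator of max_a E[Q(s,a)], mirroring the figure:
# true values are all 0; observed values add independent standard-normal
# noise. (Toy reproduction, not the paper's code.)
def single_estimate(m: int, rng: random.Random) -> float:
    q = [rng.gauss(0, 1) for _ in range(m)]
    return max(q)                              # max over noisy values: biased up

def double_estimate(m: int, rng: random.Random) -> float:
    q1 = [rng.gauss(0, 1) for _ in range(m)]
    q2 = [rng.gauss(0, 1) for _ in range(m)]
    best = max(range(m), key=lambda a: q1[a])  # select with one estimator
    return q2[best]                            # evaluate with the other: unbiased

rng = random.Random(0)
m = 10
single = statistics.mean(single_estimate(m, rng) for _ in range(2000))
double = statistics.mean(double_estimate(m, rng) for _ in range(2000))
print(round(single, 2), round(double, 2))  # single is well above 0, double near 0
```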


arxiv.org arxiv.org

DDPG

multiplying the rewards generated from an environment by some scalar

ELU

This is akin to clipping the rewards to [0, 1]

network structure of
different activation functions tried

Hyperparameters
hyperparameters: alpha, dropout prob, number of layers in your network, width of network layers, activation function (ReLU, ELU, tanh, ...), CNN?, RNN?, ..., epsilon (for the ε-greedy policy)
parameters: specific to the problem; parameters of Q(s,a) and policy π (θ, w), gamma (how important is the future?)

PPO


arxiv.org arxiv.org

TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of β that performs well across different problems

gradient estimator

we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.

not sufficient to simply choose a fixed penalty coefficient β and optimize the penalized objective Equation (5) with SGD

objective function (the “surrogate” objective) is maximized
PPO is a response to the TRPO algorithm, trying to use the core idea but implement a more efficient and simpler algorithm.
TRPO defines the problem as a straight optimization problem, no learning is actually involved.

Generalizing this choice, we can use a truncated version of generalized advantage estimation

Without a constraint, maximization of L^CPI would lead to an excessively large policy update; hence, we now consider how to modify the objective, to penalize changes to the policy that move r_t(θ) away from 1
The policy iteration objective proposes steps which are too large. It uses a likelihood ratio of the current policy and an older version of the policy, multiplied by the Advantage function. So it uses the change in the policy probability for an action to weight the Advantage function.

our goal of a first-order algorithm that emulates the monotonic improvement of TRPO,

A proximal policy optimization (PPO) algorithm that uses fixed-length trajectory segments is shown below. Each iteration, each of N (parallel) actors collects T timesteps of data. Then we construct the surrogate loss on these NT timesteps of data, and optimize it with minibatch SGD

The first term inside the min is L^CPI. The second term, clip(r_t(θ), 1 − ε, 1 + ε) Â_t, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving r_t outside of the interval [1 − ε, 1 + ε]. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective
The "clip" function cuts off the probability ratio so that changes which would inflate the objective beyond the trust interval are ignored.

Clipped Surrogate Objective

We can see that L^CLIP is a lower bound on L^CPI, with a penalty for having too large of a policy update
The clipped loss is a lower bound on the actual objective defined in TRPO. So it is simpler to compute, will provide at least some guidance, and will never overestimate the true objective.

These methods have the stability and reliability of trust-region methods but are much simpler to implement, requiring only a few lines of code change to a vanilla policy gradient implementation, applicable in more general settings

shows how several objectives vary as we interpolate along the policy update direction

Surrogate objectives, as we interpolate between the initial policy parameter θ_old, and the updated policy parameter, which we compute after one iteration of PPO.
Another figure to show intuition for the approach by showing how each component changes with respect to following the policy update along the gradient direction.

lower bound (i.e., a pessimistic bound

Paper that introduced the PPO algorithm. PPO is, in a way, a response to the TRPO algorithm, trying to use the core idea but implement a more efficient and simpler algorithm.
TRPO defines the problem as a straight optimization problem, no learning is actually involved.

The theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem maximize_θ Ê_t[ (π_θ(a_t|s_t) / π_θ_old(a_t|s_t)) Â_t − β KL[π_θ_old(·|s_t), π_θ(·|s_t)] ]  (5) for some coefficient β
This parameter β is a bit mysterious. PPO works very well generally, but setting β is tricky, and it influences other parts of the algorithm.


arxiv.org arxiv.org

New transitions

bias towards recent transitions

samples transitions with probability p_t relative to the last encountered absolute TD error
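That sampling rule can be sketched directly: transition i is drawn with probability proportional to (|TD error| + ε)^α, following the proportional variant of prioritized replay (constants and the name sample_index are mine):

```python
import random

# Prioritized experience replay (sketch): sample transition i with
# probability (|TD_i| + eps)^alpha / sum_j (|TD_j| + eps)^alpha.
# The small eps keeps zero-error transitions from never being replayed.
def sample_index(td_errors, alpha=0.6, eps=1e-3, rng=random):
    weights = [(abs(e) + eps) ** alpha for e in td_errors]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

errors = [0.01, 0.01, 5.0, 0.01]
counts = [0] * 4
rng = random.Random(0)
for _ in range(1000):
    counts[sample_index(errors, rng=rng)] += 1
print(counts.index(max(counts)))  # → 2 (the largest TD error dominates)
```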

RMSprop

This means that in the loss above, the time index t will be a random time index from the last million transitions, rather than the current time.

Multistep learning.

Prioritized replay.

Prioritized

parameters θ of the online network (which is also used to select actions

ablation

θ represents the parameters of a target network

a periodic copy of the online network which is not directly optimized.

Noisy Nets. The limitations of exploring using ε-greedy policies are clear in games such as Montezuma’s Revenge, where many actions must be executed to collect the first reward

t is a time step randomly picked from the replay memory

DDQN

This famous paper gives a great review of the DQN algorithm a couple years after it changed everything in Deep RL. It compares six different extensions to DQN for Deep Reinforcement Learning, many of which have now become standard additions to DQN and other Deep RL algorithms. It also combines all of them together to produce the "rainbow" algorithm, which outperformed many other models for a while.


arxiv.org arxiv.org

For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish

Bowen Baker et al. (OpenAI). "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos". arXiv, June 2022.
Introduction of VPT: a new semi-supervised pretrained model for sequential decision making in Minecraft. Data are from human video playthroughs but are unlabelled.

We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos.


arxiv.org arxiv.org

an agent is instructed to obtain a desired goal item
Problem: the agent must complete the instructed task in Minecraft

Curriculum learning (at least using current RL methods) is that the agent achieves a small success probability (within available/reasonable compute) on a new task after mastering a previous task.
Curriculum Learning

We study curriculum learning on a set of goal-conditioned Minecraft tasks, in which the agent is tasked to collect one out of a set of 107 items from the Minecraft tech tree

Results
Experiments: They compared a variety of policies and training approaches

It has 5 minutes (1500 time steps) to complete the task and obtains a reward of +1 upon success. After each success or failure a new task is selected without resetting the world or respawning the agent
- Agent has 5 min to find the item
- Next item is chosen without resetting the world

“Simon Says”

Learning progress curriculum
Approach: Curriculum Learning


arxiv.org arxiv.org

Arxiv paper from 2021 on reinforcement learning in a scenario where your aim is to learn a workable POMDP policy, but you start with a fully observable MDP and adjust it over time towards a POMDP.
Tags
Annotators
URL


arxiv.org arxiv.org

Liang, Machado, Talvitie, Bowling. AAMAS 2016. "State of the Art Control of Atari Games Using Shallow Reinforcement Learning"
Response paper to DQN showing that well-designed Value Function Approximations can also do well at these complex tasks without the use of Deep Learning
A great paper showing how to think differently about the latest advances in Deep RL. All is not always what it seems!


arxiv.org arxiv.org

Tom Schaul, John Quan, Ioannis Antonoglou and David Silver. "Prioritized Experience Replay", ICLR, 2016.


openaccess.thecvf.com openaccess.thecvf.com

Xu et al., ICCV, 2019. "Temporal Recurrent Networks for Online Action Detection"
arxiv: https://arxiv.org/abs/1811.07391 hypothesis: https://hyp.is/go?url=https%3A%2F%2Fopenaccess.thecvf.com%2Fcontent_ICCV_2019%2Fpapers%2FXu_Temporal_Recurrent_Networks_for_Online_Action_Detection_ICCV_2019_paper.pdf&group=world



Few-Shot (FS): the model is given a few demonstrations of the task at inference time as conditioning [RWC+19], but no weights are updated
Hints are given but the model is not updated

 Jun 2023


fuzzy
fuzzy!

[KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Ra
Justification for low learning rate in large language models.

As found in [KMH+20, MKAT18], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table A.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPUs on part of a high-bandwidth cluster. Details of the training process and hyperparameter settings are described in the appendix.
Why is this?

We use the same model and architecture as GPT2
What do they mean by "model" here? If they have retrained on more data, with a slightly different architecture, then the model weights after training must be different.


d4mucfpksywv.cloudfront.net d4mucfpksywv.cloudfront.net

While zero-shot performance establishes a baseline of the potential performance of GPT-2 on many tasks, it is not clear where the ceiling is with fine-tuning.
So fine-tuning could lead to better models.

13.19%
that's a lot!

The Bloom filters were constructed such that the false positive rate is upper-bounded by 1/10^8. We further verified the low false positive rate by generating 1M strings, of which zero were found by the filter
Bloom filters used to determine how much overlap there is between train and test set, to be more sure of their results.

Bloom filters
Bloom Filter:
The high-level idea is to map elements x ∈ X to values y = h(x) ∈ Y using a hash function h, and then test for membership of x′ in X by checking whether y′ = h(x′) ∈ Y, doing this with multiple hash functions h.
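A toy Bloom filter showing the idea: k hash functions set and check k bit positions, so false positives are possible but false negatives are not. This is my sketch, not the filters used in the paper:

```python
import hashlib

# Toy Bloom filter: k hash functions map each string to k bit positions;
# add() sets those bits, membership checks all k bits. A miss on any bit
# proves the item was never added (no false negatives); a full match may
# occasionally be a false positive.
class BloomFilter:
    def __init__(self, m_bits: int = 1 << 20, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):  # derive k independent-ish hashes
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("the quick brown fox")
print("the quick brown fox" in bf)  # → True (never a false negative)
```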

Recent work in computer vision has shown that common image datasets contain a non-trivial amount of near-duplicate images. For instance CIFAR-10 has 3.3% overlap between train and test images (Barz & Denzler, 2019). This results in an over-reporting of the generalization performance of machine learning systems.
CIFAR-10 performance results are overestimates since some of the training data is essentially in the test set.


www.fandm.edu www.fandm.edu

Liang, Machado, Talvitie, Bowling. AAMAS 2016. "State of the Art Control of Atari Games Using Shallow Reinforcement Learning"
A great paper showing how to think differently about the latest advances in Deep RL. All is not always what it seems!


arxiv.org arxiv.org

Introduction of the RoBERTa improved analysis and training approach to BERT NLP models.


arxiv.org arxiv.org

We also add 8,000 text instructions generated by the OpenAI API gpt-3.5-turbo model [38],
How does this work? Does it take images as well as input?


www.cs.mcgill.ca www.cs.mcgill.ca

Paper from 2016, soon after the DQN paper, about how to use eligibility traces to improve performance further.


assets.pubpub.org assets.pubpub.org

LeBlanc, D. G., & Lee, G. (2021). General Deep Reinforcement Learning in NES Games. Canadian AI 2021. Canadian Artificial Intelligence Association (CAIAC). https://doi.org/10.21428/594757db.8472938b


assets.pubpub.org assets.pubpub.org

hypothesis test for CANAI23 paper


jmlr.org jmlr.org

introducing a unified framework that converts all text-based language problems into a text-to-text format
This is their goal: to have a single model, including hyperparameters and setup, that can be used for any NLP task.

Paper introducing the T5 Text-to-Text transformer model from Google. (Raffel, JMLR, 2020)

 May 2023

www.semanticscholar.org www.semanticscholar.org

Training language models to follow instructions with human feedback
Original paper for discussion of the Reinforcement Learning from Human Feedback (RLHF) algorithm.

 Apr 2023

srush.github.io srush.github.io

The Annotated S4: Efficiently Modeling Long Sequences with Structured State Spaces. Albert Gu, Karan Goel, and Christopher Ré.
A new sequence-modeling approach, an alternative to transformers


arxiv.org arxiv.org

Efficiently Modeling Long Sequences with Structured State Spaces. Albert Gu, Karan Goel, and Christopher Ré. Department of Computer Science, Stanford University


arxiv.org arxiv.org

"Causal Triplet: An Open Challenge for Interventioncentric Causal Representation Learning" Yuejiang Liu1, 2,* YUEJIANG.LIU@EPFL.CH Alexandre Alahi2 ALEXANDRE.ALAHI@EPFL.CH Chris Russell1 CMRUSS@AMAZON.DE Max Horn1 HORNMAX@AMAZON.DE Dominik Zietlow1 ZIETLD@AMAZON.DE Bernhard Sch ̈olkopf1, 3 BS@TUEBINGEN.MPG.DE Francesco Locatello1 LOCATELF@AMAZON.DE


arxiv.org arxiv.org

Bowen Baker et al. (OpenAI). "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos". arXiv, June 2022.
New semi-supervised pretrained model for sequential decision making in Minecraft. Data are from human video playthroughs but are unlabelled.
reinforcement-learning foundation-models pretrained-models proj-minerl minecraft



OpenAI page describing their video pretraining for Minecraft.



It should not be used as a primary decision-making tool, but instead as a complement to other methods of determining the source of a piece of text.
This is true of any of these LLM models, actually, for any task.


en.wikipedia.org en.wikipedia.org

Wikipedia article for the Travelling Salesman Problem (TSP)


en.wikipedia.org en.wikipedia.org

In this way, an NTM can be thought of as simultaneously exploring all computational possibilities in parallel and selecting an accepting branch
Nondeterministic Turing Machines are able to get lucky and choose the single path to the answer in polynomial time, or be given a "hint" or "proof" or "certificate" for that path. This isn't realistic, but it separates verifying a solution and finding one into two tasks of different difficulty.

Computational problems: Intuitively, a computational problem is just a question that can be solved by an algorithm. For example, "is the natural number n prime?" is a computational problem. A computational problem is mathematically represented as the set of answers to the problem. In the primality example, the problem (call it PRIME) is represented by the set of all natural numbers that are prime: PRIME = { n ∈ ℕ : n is prime }. In the theory of computation, these answers are represented as strings; for example, in the primality example the natural numbers could be represented as strings of bits that represent binary numbers. For this reason, computational problems are often synonymously referred to as languages, since strings of bits represent formal languages (a concept borrowed from linguistics); for example, saying that the PRIME problem is in the complexity class NP is equivalent to saying that the language PRIME is in NP.
Explanation of why computational complexity class proofs with Turing Machines use "strings" instead of algorithms or programs.


en.wikipedia.org en.wikipedia.org

Presburger arithmetic is much weaker than Peano arithmetic, which includes both addition and multiplication operations. Unlike Peano arithmetic, Presburger arithmetic is a decidable theory. This means it is possible to algorithmically determine, for any sentence in the language of Presburger arithmetic, whether that sentence is provable from the axioms of Presburger arithmetic. The asymptotic running-time computational complexity of this algorithm is at least doubly exponential, however, as shown by Fischer & Rabin (1974).
This is an example of a decision problem that is at least doubly exponential, \(2^{2^n}\). It is a simpler form of arithmetic where the halting problem / incompleteness theorem does not apply.
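Just to make the doubly exponential lower bound concrete, a throwaway sketch (my own) of how fast \(2^{2^n}\) grows; decidable does not mean feasible.

```python
# Toy illustration only: the 2^(2^n) lower bound blows up almost immediately.
# By n = 6 the bound already exceeds 10^19 steps.
for n in range(1, 7):
    steps = 2 ** (2 ** n)
    print(f"n={n}: at least 2^{2 ** n} = {steps} steps")
```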


en.wikipedia.org en.wikipedia.org

NP-hard if everything in NP can be transformed in polynomial time into it even though it may not be in NP
Definition of an NP-hard problem

At present, all known algorithms for NP-complete problems require time that is superpolynomial in the input size, in fact exponential in \(O(n^k)\) for some \(k > 0\), and it is unknown whether there are any faster algorithms.
So how hard are NP-complete problems?
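For a sense of what that superpolynomial cost looks like, here is a brute-force CNF-SAT sketch of my own (hypothetical, DIMACS-like literal encoding), which simply tries all \(2^n\) assignments:

```python
from itertools import product

def brute_force_sat(clauses, n_vars):
    """Decide CNF satisfiability by trying all 2^n assignments.
    Clauses are lists of nonzero ints: +i means variable i,
    -i means its negation (DIMACS-style)."""
    for bits in product([False, True], repeat=n_vars):
        assign = {i + 1: b for i, b in enumerate(bits)}
        if all(any(assign[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assign
    return None  # unsatisfiable

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(brute_force_sat([[1, 2], [-1, 3], [-2, -3]], 3))
```

Doubling `n_vars` squares the running time; that is the wall NP-complete problems put up in practice.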

The subgraph isomorphism problem is NP-complete. The graph isomorphism problem is suspected to be neither in P nor NP-complete, though it is in NP. This is an example of a problem that is thought to be hard, but is not thought to be NP-complete. This class, called NP-intermediate, is non-empty if and only if P ≠ NP.
There might even be some problems in NP that are not in P and yet are not NP-complete.

NP-complete problems are often addressed by using heuristic methods and approximation algorithms.
usually solved with approximation algorithms
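As one example of the approximation route (my own sketch, not from the article): the classic 2-approximation for minimum vertex cover repeatedly picks an uncovered edge and takes *both* endpoints, guaranteeing a cover at most twice the optimal size.

```python
def vertex_cover_2approx(edges):
    """Greedy matching-based 2-approximation for vertex cover:
    for each edge not yet covered, add BOTH endpoints. The result
    is a valid cover of size at most 2x the optimum."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

edges = [(1, 2), (2, 3), (3, 4)]
print(vertex_cover_2approx(edges))  # a valid cover, size <= 2 * optimum
```

Finding the true minimum cover is NP-complete; this loop runs in linear time and its guarantee is unconditional.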


link.springer.com link.springer.com

Chapter 21 "Adversarial Autoencoders" from our book "Elements of Dimensionality Reduction and Manifold Learning", Springer 2023.


openai.com openai.com

My annotations for the OpenAI GPT-4 info page.

GPT-4 outperforms ChatGPT by scoring in higher approximate percentiles among test-takers.
oh, great.

40% more likely to produce factual responses than GPT-3.5
great, 40% more than what though?

We used GPT-4 to help create training data for model fine-tuning and iterate on classifiers across training, evaluations, and monitoring.
Interesting. You need to consider: is this like data augmentation, like bootstrapping, like adversarial training, or is it like overfitting to your own data?

 Mar 2023

arxiv.org arxiv.org

[Bengio, The Consciousness Prior, arXiv, 2018]


www.gatesnotes.com www.gatesnotes.com

"The Age of AI has begun: Artificial intelligence is as revolutionary as mobile phones and the Internet." Bill Gates, March 21, 2023. GatesNotes


www.inc.com www.inc.com

"There is a robust debate going on in the computing industry about how to create it, and whether it can even be created at all."
Is there? By whom? Why industry only and not government, academia and civil society?

Minda Zetlin. "Bill Gates Says We're Witnessing a 'Stunning' New Technology Age. 5 Ways You Must Prepare Now". Inc.com, March 2023.


arxiv.org arxiv.org

asks for the Minecraft domain.
They demonstrate the model on a "Minecraft-like" domain (introduced earlier by other authors) where there are resources in the world and the agent has tasks to complete.

 Feb 2023

arxiv.org arxiv.org

Definition 3.2 (simple reward machine).
The MDP does not change; its dynamics are the same with or without the RM, just as they are with or without a standard reward model. Additionally, the rewards from the RM can be non-Markovian with respect to the MDP because they inherently have a kind of memory of where you've been, limited to the agent's "movement" (almost "in its mind") about where it is along the goals for this task.

We then show that an RM can be interpreted as specifying a single reward function over a larger state space, and consider types of reward functions that can be expressed using RMs
So by specifying a reward machine you are augmenting the state space of the MDP with higher level goals/subgoals/concepts that provide structure about what is good and what isn't.

However, an agent that had access to the specification of the reward function might be able to use such information to learn optimal policies faster.
Fascinating idea; why not? Why are we really hiding the reward function from the agent?

U is a finite set of states,
Apply a set of logical rules to the state space to obtain a finite set of states.

state-reward function,
reward is a constant number assigned to each set of states
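To make the "finite set of RM states plus a reward per transition" idea concrete, here is a minimal reward-machine sketch of my own (hypothetical events and rewards, not code from the paper): the RM is extra finite-state memory carried alongside the MDP state, which is exactly how its rewards can look non-Markovian to the MDP alone.

```python
# Hypothetical reward machine for a "get coffee, then deliver to the
# office" task. Keys are (RM state, high-level event); values are the
# next RM state and the reward paid on that transition.
RM = {
    ("u0", "got_coffee"): ("u1", 0.0),
    ("u1", "at_office"):  ("u2", 1.0),  # task complete: reward 1
}

def rm_step(u, event):
    """Advance the reward machine on one event; unmatched events self-loop
    with zero reward."""
    return RM.get((u, event), (u, 0.0))

u, total = "u0", 0.0
for event in ["at_office", "got_coffee", "at_office"]:
    u, r = rm_step(u, event)
    total += r
print(u, total)  # reaching the office only pays off AFTER getting coffee
```

Note the same MDP observation ("at_office") yields different rewards depending on the RM state, i.e. on the history; over the product space (MDP state, RM state) the reward is Markovian again.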

Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning
[Icarte, JAIR, 2022] "Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning"


proceedings.mlr.press proceedings.mlr.press

Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning
[Icarte, PMLR, 2018] "Using Reward Machines for HighLevel Task Specification and Decomposition in Reinforcement Learning"


royalsocietypublishing.org royalsocietypublishing.org

Bell’s theorem is about correlations (joint probabilities) of stochastic real variables and therefore does not apply to quantum theory, which neither describes stochastic motion nor uses real-valued observables
strong statement; what do people think about this? Is it accepted by anyone, or dismissed?


arxiv.org arxiv.org

[Kapturowski, DeepMind, Sep 2022] "Human-level Atari 200x Faster"
Improving the 2020 Agent57 performance to be more efficient.

 Jan 2023

www.cs.princeton.edu www.cs.princeton.edu

"Finding Optimal Solutions to Rubik's Cube Using Pattern Databases" by Richard E. Korf, AAAI 1997.
The famous "Korf Algorithm" for finding the optimal solution to any Rubik's Cube state.



make up facts less often
but not "never"

On prompts submitted by our customers to the API,[1
really? so that's how they make money.
Question: what kind of bias does this introduce into the model?
 which topics and questions get trained on?
 what is the goal of training? truth? clickability?

Blog post from OpenAI in Jan 2022 explaining some of the approaches they use to train, reduce, and tune their LLM for particular tasks. This was all a precursor to the ChatGPT system we now see.


arxiv.org arxiv.org

Feng, 2022. "TrainingFree Structured Diffusion Guidance for Compositional TexttoImage Synthesis"
Shared and found via: Gowthami Somepalli @gowthami@sigmoid.social Mastodon > Gowthami Somepalli @gowthami StructureDiffusion: Improve the compositional generation capabilities of text-to-image #diffusion models by modifying the text guidance by using a constituency tree or a scene graph.
