 Feb 2021

arxiv.org arxiv.org

The mouselook action distribution is in turn also defined autoregressively: the first sampled action splits the window bounded by (−1, 1) × (−1, 1) in width and height into 9 squares. The second action splits the selected square into 9 further squares, and so on. Repeating this process several times allows the agent to express any continuous mouse movement up to a threshold resolution.
Interesting representation of a continuous action space!
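
A minimal sketch of how I read this scheme, to make the recursion concrete. The function name `decode_mouselook`, the row-major numbering of the 9 cells, and returning the centre of the final cell are all my own assumptions, not details from the paper.

```python
def decode_mouselook(choices, lo=-1.0, hi=1.0):
    """Map a sequence of 3x3 grid choices (each in 0..8) to a point in (lo, hi)^2.

    Assumption: cells are numbered row-major, and the decoded movement is the
    centre of the last selected cell.
    """
    x_lo, x_hi, y_lo, y_hi = lo, hi, lo, hi
    for c in choices:
        if not 0 <= c <= 8:
            raise ValueError("each choice must be in 0..8")
        row, col = divmod(c, 3)
        w = (x_hi - x_lo) / 3.0  # width of one sub-cell
        h = (y_hi - y_lo) / 3.0  # height of one sub-cell
        x_lo, x_hi = x_lo + col * w, x_lo + (col + 1) * w
        y_lo, y_hi = y_lo + row * h, y_lo + (row + 1) * h
    return (x_lo + x_hi) / 2.0, (y_lo + y_hi) / 2.0


if __name__ == "__main__":
    # Two refinement steps: bottom-right cell, then its top-left sub-cell.
    print(decode_mouselook([8, 0]))
```

Under these assumptions, after k choices the cell side is 2/3^k, so the resolution improves geometrically with the number of autoregressive steps.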


arxiv.org arxiv.org

effective dimensionality of a Bayesian neural network is inversely proportional to the variance of the posterior distribution.
I think you are talking about posterior contraction in parameter space, no?


www.shortscience.org www.shortscience.org

Yet this also implies non-i.i.d. samples! Indeed, even if one could directly sample from the state-action distribution (like having its analytical form or an infinite experience replay buffer) and thus draw i.i.d. samples, the dependency will occur across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample and then my next sample depends on my previous sample (since it conditioned my policy update).
But this isn't a problem if the examples come from a fixed expert no?

 Jan 2021

arxiv.org arxiv.org

Prefix-tuning prepends a sequence of continuous task-specific vectors to the input, which we call a prefix, depicted by red blocks in Figure 1 (bottom). For subsequent tokens, the Transformer can attend to the prefix as if it were a sequence of “virtual tokens”, but unlike prompting, the prefix consists entirely of free parameters which do not correspond to real tokens.
and are thus differentiable! yay


openlabflowers.inria.fr openlabflowers.inria.fr

I guess a stepping stone towards this would be to optimize morphological growth processes to generate a body with a particular form in 3D (that would be quite similar to the differentiable CA, except that here the “cells” move in 3D space and have physical interactions that depend on their internal parameters and states)

(and it would also be novel to use a population-based IMGEP using gradient descent for local optimization towards self-generated goals)
similar to SIREN+CLIP (Deep Daze)


www.semanticscholar.org www.semanticscholar.org

For this reason, we were unable to collect baselines such as an equivalent amount of high-quality human demonstrations for supervised baselines. See D for more discussion. We leave this ablation to future work.
so one possibility is that the feedback you got was of better quality than the data used for SL. Perhaps if you did SL on higher quality data you would match the performance of the human feedback model?

it’s unclear how much one can optimize against the reward model until it starts giving useless evaluations.
adversarial examples

Previous work on fine-tuning language models from human feedback [73] reported “a mismatch between the notion of quality we wanted our model to learn, and what the humans labelers actually evaluated”, leading to model-generated summaries that were high-quality according to the labelers, but fairly low-quality according to the researchers.
That is quite interesting

We rely on detailed procedures to ensure high agreement between labelers and us on the task, which we describe in the next section
is this necessarily a good thing? Could you not miss other notions of "quality" this way? I guess you want to ensure a consistent notion of quality, rather than asking the question of "what about other notions of quality?"


openai.com openai.com

We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams.
poor generalization


proceedings.mlr.press proceedings.mlr.press

In fact, without visiting any states at all, since the queries are synthetic.
grr, what about during the phase of training the generative model?

The x-axis represents the number of queries to the user, where each query elicits a label for a single state transition (s, a, s′).
but isn't sampling from the model less expensive than sampling by optimizing AFs? shouldn't that be taken into account?

having to visit unsafe states during the training process
it may have visited some during the training of the generative model no?
But I guess not that many, if the generative model has been pretrained, and it can generalize well

As discussed in Section 4.3 and illustrated in the rightmost plot of Figure 5, the baselines learn a reward model that incorrectly extrapolates that continuing up and to the right past the goal region is good behavior.
but if the baselines aren't visiting those high reward states, then they haven't actually fallen into reward hacking? I guess the idea is that they could in a new environment.
Takeaway is to do more exploration if you expect to be tested in new environments

\(\tau_{\text{query}} = \max_{z_0, a_0, z_1, \ldots, z_T} J(\tau) + \log p(\tau)\)
it's like a model-based version of DDPG + curiosity/exploration rewards?

Here, the state \(s \in \mathbb{R}^{64 \times 64 \times 3}\) is an RGB image with a top-down view of the car (Figure 3), and the action \(a \in \mathbb{R}^3\) controls steering, gas, and brake
In my experience, high-dimensional action spaces are even harder, especially when combined with high-dimensional state spaces

The idea is to elicit labels for examples that the model is least certain how to label, and thus reduce model uncertainty.
what if the user(s) the model is querying are also uncertain? Then the model shouldn't spend too much time on these. This is one thing that learning progress aims to avoid!

To simplify our experiments, we sample trajectories τ by following random policies that explore a wide variety of states. We use the observed trajectories to train a likelihood model
Seems like this may be an issue in more complex environments, as the random policies may not explore enough!
We probably want either human demonstrations and/or to iterate/refine the generative model with the later policies

(4) maximize novelty of trajectories regardless of predicted rewards, to improve the diversity of the training data.
could also do something based on learning progress

In complex domains, the user may not be able to anticipate all possible agent behaviors and specify a reward function that accurately describes user preferences over those behaviors
so is the assumption that the automated way of exploring agent behaviours is better than what a human would consider?

 Dec 2020

www.wikiwand.com www.wikiwand.com

it is far easier to obtain reliability beyond a certain margin by mechanisms in the end hosts of a network rather than in the intermediary nodes,[nb 4] especially when the latter are beyond the control of, and not accountable to, the former
this seems to me to be mostly saying that: it's hard to change the standards at the low level, so it's easier to program at the higher level.
This is true of not just networks, but of computers, etc too. But it may not always be the best approach!
Should have called it "rule of thumb" more than principle I think

 Nov 2020

arxiv.org arxiv.org

all causal explanations are necessarily robust in this extreme case
are they? Can you not have a thing that has a conditional causal effect?
Seems to me that causality should be a more quantitative thing (how robust is this predictor), rather than an either-or thing


proceedings.neurips.cc proceedings.neurips.cc

Goldblum et al. [119] which empirically observes that the large width behavior of Residual Networks does not conform to the infinite-width limit.
Oh interesting!

While CNN-VEC possess translation equivariance but not invariance (§3.11), we believe it can effectively leverage equivariance to learn invariance from data
How? if it doesn't imply anything about the output?

This is caused by poor conditioning of pooling networks. Xiao et al. [33] (Table 1) show that the conditioning at initialization of a CNN-GAP network is worse than that of FCN or CNN-VEC networks by a factor of the number of pixels (1024 for CIFAR-10). This poor conditioning of the kernel eigenspectrum can be seen in Figure 8. For linearized networks, in addition to slowing training by a factor of 1024, this leads to numerical instability when using float32
Interesting. Do models with a stronger bias, which may be associated with better generalization (see https://arxiv.org/abs/2002.02561 / https://arxiv.org/abs/1905.10843), also lead to poorer conditioning?
Hmm, but this did not affect the non-linearized model. Interesting. How does nonlinear GD avoid the issue?

regularization parameter
what regularization parameter?


arxiv.org arxiv.org

We add the superscript “all" to emphasize that gradient-based training of the networks is always performed on the entire dataset, while NNGP inference is performed on sub-sampled datasets.
ah hm, so the gradient method is given an advantage by being able to "look" at more data than the NNGP method?


arxiv.org arxiv.org

With few exceptions (Carlson et al., 2010), machine learning models have been confined to IID datasets that lack the structure in time from which humans draw correlations about long-range causal dependencies
All of RL studies non-IID data

how pretraining obfuscates our ability to measure generalization (Linzen, 2020)
How??

but even complex simulation action spaces can be discretized and enumerated.
What's the problem of enumerating and discretizing action spaces?
what about agents that can act via free text? like those in AI dungeon? those are in principle not enumerable

models the listener’s desires and experiences explicitly
what does it mean to model them explicitly versus implicitly?

Collecting data about rich natural situations is often impossible.
NOPE. VR.

Meanwhile, it is precisely human’s ability to draw on past experience and make zero-shot decisions that AI aims to emulate
which is what GPT-3 is doing

Second, current cross entropy training losses actively discourage learning the tail of the distribution properly, as statistically infrequent events are drowned out (Pennington et al., 2014; Holtzman et al., 2020).
That's what scaling is doing, shaving off those tails (as the scaling papers discuss)

it is unlikely that universal function approximators such as neural networks would ever reliably posit that people, events, and causality exist without being biased towards such solutions (Mitchell, 1980)
Why?

(which are usually thrown out before the dataset is released)
They shouldn't be! We should learn to probabilistically model the data

persistent enough to learn the effects of actions.
so we should aim for longer contexts? Yeah memory is important. There is research in extending transformers to have longer contexts

and active experimentation is key to learning that effect
why?

participate in linguistic activity, such as negotiation (Yang et al., 2019a; He et al., 2018; Lewis et al., 2017), collaboration (Chai et al., 2017), visual disambiguation (Anderson et al., 2018; Lazaridou et al., 2017; Liu and Chai, 2015), or providing emotional support (Rashkin et al., 2019).
do we need the agent itself to participate, or is not sufficient to feed it data from such types of interactions?

Framing, such as suggesting that a chatbot speaks English as a second language
Tbh I think that framing can be both misleading and illuminating (about the degree, or lack thereof, of capability of the agent)

Robotics and embodiment are not available in the same off-the-shelf manner as computer vision models.
I think VR can solve that

(Li et al., 2019b; Krishna et al., 2017; Yatskar et al., 2016; Perlis, 2016)
why don't you explain how these papers support the statement at least?

Models must be able to watch and recognize objects, people, and activities to understand the language describing them
why?

Learned, physical heuristics, such as the fact that a falling cat will land quietly, are generalized and abstracted into language metaphors like as nimble as a cat (Lakoff, 1980).
So you just conceded that a prime example of things that need physical interaction to be learnt can be expressed in words?
You should make your points clearer. The point I think is that there is a lot of subconscious knowledge, like the example you give, which we can't quite put into words!

Language learning needs perception, because perception forms the basis for many of our semantic axioms
could we not argue that language is all that we are conscious of? Even though it may be formed by external sensations, what we currently (consciously) know may be almost fully expressible by language, and therefore WS2 may be enough to learn all of conscious knowledge

As text pretraining schemes seem to be reaching the point of diminishing returns,
Not yet, in long scale IIRC

parked my car in the compact parking space because it looked (big/small) enough
Hmm, I think the answer is "big"? This seems learnable from text statistics?

Continuing to expand hardware, data sizes, and financial compute cost by orders of magnitude will yield further gains, but the slope of the increase is quickly decreasing.
Right, but it's nice that we have a reliable way to improve performance.

Scale in data and modeling has demonstrated that a single representation can discover both rich syntax and semantics without our help (Tenney et al., 2019).
It's not without our help. The data is our help?^^

You can’t learn language from the radio.
I think the question shouldn't be phrased as a dichotomy, but quantitatively: How much language (semantics) can you and can you not learn from the radio?

The futility of learning language from linguistic signal alone is intuitive, and mirrors the belief that humans lean deeply on non-linguistic knowledge (Chomsky, 1965, 1980).
Something being intuitive isn't a strong argument for it being true.

from their use by people to communicate
Let's gather massive datasets on that through VR ^^

Natural language processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks.
and figuring out how to scale up https://arxiv.org/abs/2001.08361

successful linguistic communication relies on a shared experience of the world. It is this shared experience that makes utterances meaningful
I think this is true, except for the language which communicates about language. I think there is meaning purely within the world of language too.
Though certainly a lot of meaning lies in the grounding of language too


www.ece.uvic.ca www.ece.uvic.ca untitled2

share attention
common context

Any smaller subset of these competencies is not sufficient to develop proper language/communication skills, and further, the development of language clearly bootstraps better motor and affordance learning and/or social learning.
This seems to be full of statements like this where they claim something is "obviously true" but really more justification is needed for these claims.


arxiv.org arxiv.org

Intuition
The way I think about their framework is as follows:
They shift perspective from bounding the error to "bounding" the learning curves
Learning curves are functions (of n), so there is no clear ordering between them as there is for the error at a particular n, which is just a number.
So instead of learning curves we look at {learning curves up to the equivalence relation of having the same asymptotic behaviour (up to a constant)}, which we call "rates".
For these there is a natural ordering, and one can provide a rate upper bound, that is uniform over P, for a particular hypothesis class, assuming realizability. This is what they do here, so it is basically uniform convergence, but of a different quantity, which is more representative of how ML works in practice, so that this framework is probably more useful.
However, their description of "PAC learning" is too restrictive I think; they don't seem to consider data-dependent generalization bounds which exist, and some of them are based on extensions to the uniform PAC bounds. For example, how does their framework compare to the PAC-Bayes framework?

H is not learnable at rate faster than R
So that the concept of universal learnability is characterizing the worst case learning curve rate. The constant is allowed to depend on P but not the function R. So it is nonuniform in that way. But really that's not the best way to think of it I think. The way I think of it is written in my page note titled "Intuition"

For simplicity of exposition, we have stated a definition corresponding to deterministic algorithms, to avoid the notational inconvenience required to formally define randomized algorithms in this context
IKR

erP
nice

That is, every nontrivial class H is either universally learnable at an exponential rate (but not faster), or is universally learnable at a linear rate (but not faster), or is universally learnable but necessarily with arbitrarily slow rates
what do they mean by "nontrivial" here?

for any learning algorithm, there is a realizable distribution P whose learning curve decays no faster than a linear rate (Schuurmans, 1997)
aren't we interested in the statement that for any realizable distribution P there is a learning algorithm whose learning curve decays no faster than a linear rate?


arxiv.org arxiv.org

\(S(\{O_\mu(x)\})\)
what do they mean by this quantity?
The number of states with the same energy as O_\mu(x)?

\(2^{-N} q(h^*) e^{N(h^* m - \log\cosh h^*)}\)
Isn't this missing the Hessian factor in Laplace's approximation? where has it gone?
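
A possible answer to my own question, as a sketch rather than the paper's argument. Assuming the quoted expression is the saddle point of an integral over h with exponent \(N g(h)\), where \(g(h) = h m - \log\cosh h\) (my reading of the notation), the full Laplace approximation carries the Hessian prefactor, but that prefactor is only polynomial in N:

```latex
\begin{align}
\int dh\, q(h)\, e^{N g(h)}
  &\approx q(h^*)\, e^{N g(h^*)} \sqrt{\frac{2\pi}{N\,|g''(h^*)|}},
  \qquad g'(h^*) = 0 \;\Rightarrow\; m = \tanh h^*, \\
g''(h) &= -\operatorname{sech}^2 h
  \;\Rightarrow\; |g''(h^*)| = 1 - m^2, \\
\log \int dh\, q(h)\, e^{N g(h)}
  &\approx N g(h^*) - \tfrac{1}{2}\log N + O(1).
\end{align}
```

So the Hessian factor contributes only a \(-\tfrac{1}{2}\log N\) correction to the log-probability, which is subleading next to the \(O(N)\) terms. My guess is that it was dropped because the statement is meant to hold to leading order in N.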

argument [10] converts Eq. (1) with α = 1 into the statement that, for a large system, N → ∞, the energy and entropy are exactly equal (up to a constant) to leading order in N.
I think this is the idea that Zipf's law is related to P(Energy) being a constant w.r.t. Energy hmm
tho really if both E and S are extensive in N (meaning linear in N), then they will scale equally with N, obviously? Tho is Zipf's law followed for extensive systems? Aren't those where parts are independent, and we expect to approach a uniform distribution?
Right I think E and S scaling the same does not imply Zipf, but the other way, it does, apparently. Need to check argument in [10]


arxiv.org arxiv.org

Because the exponent \(\alpha_N \ll 1\) for language models, we can approximate \(N^{-\alpha_N} \approx 1 - \alpha_N \log(N)\) to obtain equation 4.1.
If \(\alpha_N\log{(N)} \ll 1\) I don't see how E.4 will scale as equation 4.1?
wouldn't the constant \(L_U 1\) dominate?
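
For reference, the approximation in the quote is just the first-order Taylor expansion of the exponential. This is a generic identity, not something specific to the paper:

```latex
\begin{align}
N^{-\alpha_N} = e^{-\alpha_N \log N}
  = 1 - \alpha_N \log N + \tfrac{1}{2}\left(\alpha_N \log N\right)^2 - \dots
\end{align}
```

Truncating after the linear term is only accurate when \(\alpha_N \log N \ll 1\), which is a stronger condition than \(\alpha_N \ll 1\) alone, since \(\log N\) grows with model size.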

could be misleading if the models have not all been trained fully to convergence
you mean because perhaps the assumption that {in the limit of large N, they will perfectly model the data} may not hold if we don't train until convergence, and so the power law + constant assumption may not be justified. Yeah that makes sense

which makes the interpretation of L(N) difficult.
why?

\(m_{attn}\)
what is \(m_{attn}\)?

There we also show trends for the training loss, which do not adhere as well to a power-law form, perhaps because of the implicit curriculum in the frequency distribution of easy and hard problems
why would that affect the training loss scaling??

the poor loss on these modules would dominate the trends
could they show accuracy trends?..

easier problems will naturally appear more often than more difficult problems
interesting. I have some ideas on how this could be related to learning curve exponents

We sample the default mixture of easy, medium, and hard problems, without a progressive curriculum.
Did they look if curriculum learning had any effect on the learning curves?

context length of 3200 tokens per image/caption pair
isn't that the total length of an example? I thought the context was the part given before the token to be predicted?

We revisit the question “Is a picture worth a thousand words?” by comparing the information content of textual captions to the image/text mutual information
I think an issue with their analysis is that a picture's caption in a standard dataset does not capture all the info derivable from a picture


arxiv.org arxiv.org

but we will only apply it along the time dimension t.
what do you mean? I thought you were applying the normalizing flow at each time step individually, not convolving over time


arxiv.org arxiv.org

The key point of this work is that based on observing a single sample from a subpopulation, it is impossible to distinguish samples from “borderline” populations from those in the “outlier” ones. Therefore an algorithm can only avoid the risk of missing “borderline” subpopulations by also memorizing examples from the “outlier” subpopulations.
I just find it weird that we have to offer so much justification for fitting to 0 error, when I don't see much reason to believe it isn't a good idea?


arxiv.org arxiv.org

e over parameters and the function-space posterior covariance. Red indicates the underparameterized setting, yellow the critical regime with p ≈ n, and green the overparameterized regime.
isn't it the other way? Red is overparametrized and green is underparametrized?

We see wide but shallow models overfit, providing low train loss, but high test loss and high effective dimensionality.
it seems like it's mostly the number of parameters not the aspect ratio which determines the generalization performance? So that depth is not intrinsically helping generalization?

subspace and ensembling methods could be improved through the avoidance of expensive computations within degenerate parameter regimes
but how do you make sure you are sampling with the right probabilities?


arxiv.org arxiv.org

w
this should be transposed

Our theoryagain perfectly fits the experiments.
well you can see some deviations in this NN, probably because of the smaller width

K
i think here it should be \(\kappa_{\text{NTK}}\)

marginal training data point causes greater reduction in relative error for low frequency modes than for high frequency modes.
isn't this the opposite of what you said earlier??
"the marginal training data point causes a greater percent reduction in generalization error for modes with larger RKHS eigenvalues."

 Oct 2020

arxiv.org arxiv.org

Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches.
so the activations for the set of samples which use expert k should be sent to the right device which has expert k, right?
how much communication overhead is this?
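
A toy single-process sketch of the regrouping I have in mind, assuming top-1 routing so each token goes to exactly one expert. `dispatch_to_experts` and its arguments are hypothetical names, and the real system would do this as a cross-device exchange rather than array indexing.

```python
import numpy as np


def dispatch_to_experts(tokens, expert_ids, num_experts):
    """Group token activations by the expert each token was routed to.

    tokens:      (num_tokens, d_model) activations gathered from all replicas
    expert_ids:  (num_tokens,) index of the chosen expert for each token
    Returns a list with one (token_indices, token_batch) pair per expert.
    Assumption: top-1 routing, i.e. exactly one expert per token.
    """
    batches = []
    for k in range(num_experts):
        idx = np.nonzero(expert_ids == k)[0]  # positions routed to expert k
        batches.append((idx, tokens[idx]))
    return batches


if __name__ == "__main__":
    tokens = np.arange(12, dtype=float).reshape(6, 2)
    expert_ids = np.array([0, 2, 0, 1, 2, 2])
    for k, (idx, batch) in enumerate(dispatch_to_experts(tokens, expert_ids, 3)):
        print(k, idx.tolist(), batch.shape)
```

In this picture each routed token costs one activation vector of size d_model sent to the expert's device and one sent back, so the traffic should scale with (tokens per step) × d_model rather than with the number of expert parameters.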


localhost:8000 localhost:8000

A prior over parameters p(w) combines with the functional form of a model f(x;w) to induce a distribution over functions p(f(x;w)). It is this distribution over functions that controls the generalization properties of the model; the prior over parameters, in isolation, has no meaning.
Yep this is what we say in our paper too^^ https://arxiv.org/abs/1805.08522

Distance between the true predictive distribution and the approximation
you mean something like minus the distance? because you want this distance to be smaller for better approximations?


academic.oup.com academic.oup.com

coherent
coherent here just means that it will approach the true distribution eventually?


arxiv.org arxiv.org

As the effective dimensionality increases, so does the dimensionality of parameter space in which the posterior variance has contracted.
can you not have very confident models which are making wrong predictions?


arxiv.org arxiv.org

In the notation of Section 3, points ω ∈ Ω represent possible samples. In our setting, each sample represents a complete record of a machine learning experiment. An environment e specifies a distribution Pe on the space Ω of complete records. In the setting of supervised deep learning, a complete record of an experiment would specify hyperparameters, random seeds, optimizers, training (and held out) data, etc.
so each e represents an "experiment", which is a range/distribution of hyperparameters (or what they call a complete record of a machine learning experiment)


arxiv.org arxiv.org

We measure a simple empirical statistic, the gradient noise scale (essentially a measure of the signal-to-noise ratio of gradient across training examples), and show that it can approximately predict the largest efficient batch size for a wide range of tasks
how is this related to the difficulty of the task?
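
To make the statistic concrete for myself: as far as I recall, the paper's "simple" noise scale is tr(Σ)/|G|², with G the true gradient and Σ the per-example gradient covariance. The sketch below is a naive plug-in estimate from a matrix of per-example gradients; the function name is hypothetical and this is not the paper's estimator (I believe they compare gradient norms at two batch sizes to avoid the bias of plugging in the sample mean).

```python
import numpy as np


def simple_noise_scale(per_example_grads):
    """Naive plug-in estimate of tr(Sigma) / |G|^2.

    per_example_grads: (n_examples, n_params) array, one gradient per row.
    Assumption: the sample mean stands in for the true gradient G, which
    biases the estimate when n_examples is small.
    """
    g = np.asarray(per_example_grads, dtype=float)
    mean_grad = g.mean(axis=0)               # estimate of G
    trace_cov = g.var(axis=0, ddof=1).sum()  # sum of per-coordinate variances
    return trace_cov / np.dot(mean_grad, mean_grad)


if __name__ == "__main__":
    grads = np.array([[1.0, 0.0], [3.0, 0.0]])
    print(simple_noise_scale(grads))
```

On the task-difficulty question: in this form the scale is large when per-example gradients disagree a lot relative to their mean, which at least suggests it grows as the loss gets small and the remaining signal gets weaker.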


www.semanticscholar.org www.semanticscholar.org

nonzero entropy
what about entropy rate?

overfitting
OK, I THINK THEY ARE DEFINING OVERFITTING in the agnostic learning sense of \(L(f) - \min_{f' \in F} L(f')\). How badly am I doing relative to the best in the class!

we stop training early when the test loss ceases to improve and optimize all models in the same way
didn't they say earlier that they train for a fixed number of steps?

N increases and the model begins to overfit
well the increased overfitting is only visible in the smallest data size

S
should be N?

We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence
no overfitting^^ even for transfer learning

Although these models have been trained on the WebText2 dataset, their test loss on a variety of other datasets is also a power-law in N with nearly identical power, as shown in Figure 8.
probably significantly different datasets will show different power laws. The different datasets looked at here seem quite similar

(approximately twice the compute as the forwards pass)
why?
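
My attempt at answering this with a FLOP count for a single dense layer y = W x. This is the standard back-of-the-envelope argument, not something taken from the paper; the function name is hypothetical and nonlinearities and biases are ignored.

```python
def dense_layer_flops(d_in, d_out):
    """Per-example FLOPs for one dense layer y = W x, counting a
    multiply-add as 2 FLOPs and ignoring biases and activations."""
    forward = 2 * d_in * d_out      # y = W x
    grad_input = 2 * d_in * d_out   # dL/dx = W^T dL/dy
    grad_weight = 2 * d_in * d_out  # dL/dW += dL/dy x^T, accumulated over the batch
    return forward, grad_input + grad_weight


if __name__ == "__main__":
    fwd, bwd = dense_layer_flops(512, 2048)
    print(fwd, bwd, bwd / fwd)
```

So the backward pass needs two matrix products of the same size as the forward one (one for the gradient with respect to the inputs, one for the gradient with respect to the weights), which gives the factor of roughly two.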

To utilize both training time and compute as effectively as possible, it is best to train with a batch size \(B \approx B_{\text{crit}}\)
because above B_crit you can reduce time, but with increasing compute cost (diminishing returns)

 Jul 2020

arxiv.org arxiv.org

standard deviations
but are these standard deviations of the means?

 Jun 2020

arxiv.org arxiv.org

(x;W1,...,Wl,b1,...,bl)
it should depend on \(W^{l+1}\) and \(b^{l+1}\) too


arxiv.org arxiv.org

Naturally, such an increase in the learning rate also increases the mean steps E[∆w]. However, we found that this effect is negligible since E[∆w] is typically orders of magnitude lower than the standard deviation.
Interesting. This is why the intuition that increasing the learning rate would decrease the number of updates is probably not true, because what seems to determine the number of steps is the amount of noise!

 May 2020

arxiv.org arxiv.org

−
+


openreview.net openreview.net

\(\langle O(\bar{\theta}) \rangle = \langle [[ O[\bar{\theta} - \eta \bar{\nabla} L_B(\theta)] ]]_{\text{m.b.}} \rangle\).
this is missing some time indices?


arxiv.org arxiv.org

We omit the dβexp (−cγ) +bβlog(1δ)n term since it does not change with change in random labels.
how can we be sure it is nonvacuous then? hmm

while ̃Hθ†l,φ[j,j] can change based on α-scaling Dinh et al. [2017], the effective curvature is scale invariant
do you mean because you change \(\sigma\) too? Was that what Dinh et al. were talking about? Or just the fact that there are other \theta (not reparametrizing, just finding new \theta) which have high curvature, but produce same function?

(f) stays valid for the test error rate in (a)
if you take into account the spread in (f) and (a) it would seem that for some runs the upper bound isn't valid?

Then, based on the ‘fast rate’ PAC-Bayes bound as before, we have the following result
the posterior Q is a strange posterior over hypotheses. How do they take the KL divergence with the prior, given that the posterior is defined by two parameters (\(\theta_\rho\) and \(\theta\))?

Further, all the diagonal elements decrease as more samples are used for training.
Really? That sounds surprising!
I would have expected that as more training samples are added the parameters get more constrained (if the number of parameters is kept fixed).

Theorem 1
Derandomization of the margin loss

The bound provides a concrete realization of the notion of ‘flatness’ in deep nets [Smith and Le, 2018, Hochreiter and Schmidhuber, 1997, Keskar et al., 2017] and illustrates a trade-off between curvature and distance from initialization.
is there evidence that distance from initialization anticorrelates with generalization? Even evidence for sharpness <> generalization isn't very strong.

In spite of the dependency on the Hessian diagonal elements, which can be changed based on reparameterization without changing the function [Smith and Le, 2018, Dinh et al., 2017], the bound itself is scale invariant since KL-divergence is invariant to such reparameterizations Kleeman [2011], Li et al. [2019].
i thought Dinh's criticism wasn't so much about reparametrization, but about the fact that there are other minima which are sharper but give the same function. KL wouldn't be invariant to that, as you aren't changing the prior in that case?

 Apr 2020

arxiv.org arxiv.org

∈Ck
this sum was over all points in the training set in the previous step, and now it's over all points ?
Just think of the case where the partition C_i is made up of singletons, one for each possible point. Then, the robustness would be zero, but the generalization error bound doesn't seem right then.
This made me suspect there may be something wrong, and I think it could be at this step. If we kept the sum to be over training sets, now we can't upper bound the result by the max in the next two lines, I think!

 Mar 2020

www.nature.com www.nature.com

because of the softmax operation.
more like because of the Heaviside operation

the signs of \(f\) and \(\tilde{f}\) are the same.
and therefore the classification functions are the same

\(\tilde{f}\) as \(f_V = \rho \tilde{f}\),
this is confusing, is f_V or \tilde{f} the normalized network?

Our main results should also hold for SGD.
Will this be commented on in more detail?

normalized weights Vk as the variables of interest
Can we even reparametrize to the normalized weights? For homogeneous networks, it's obvious that we can. But for ReLU networks with biases it's less obvious. If one multiplies the biases via constants that grow exponentially with weight, the function is left invariant. We can always do this until the parameter vector is left normalized. Therefore we can reparametrize to the normalized vectors even with biases, but dunno if they consider this case here.

This mechanism underlies regularization in deep networks for exponential losses
we cannot say this, until we know more. Is this the reason why they generalize? Is this even sufficient to explain their generalization?



Bahdanau et al. (2019) learn a reward function jointly with the action policy but does so using an external expert dataset whereas our agent uses trajectories collected through its own exploration
Yeah what they do here is similar to IRL, in that we are trying to learn a human NL-conditioned reward function, but we do it via supervision, rather than demonstration. More similar to the work on "learning from human preferences"


arxiv.org arxiv.org

other agents
which share the same policy right? otherwise it would be off-policy experience?

Zero Sum
don't understand this one

specific choice of λ
here, a specific choice of \(\lambda\) can determine which solutions among the many which satisfy the constraint we choose. Similarly to the choice of convex regularizer in the GAIL paper


Local file Local file

z(xi)z(xj)h
RHS depends on h, but LHS doesn't?


www.aaai.org www.aaai.org

The problem with the max entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions, without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!
The question is what would be the maximum entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3 in Ziebart 2012

Z(θ)
Remember the partition function sums over trajectories which are compatible with the MDP dynamics only.
Trajectories incompatible with the dynamics have probability 0 of course


www.cs.cmu.edu www.cs.cmu.edu

Ziebart et al. (2008)
The problem with the max entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions, without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!
The question is what would be the maximum entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3

\(e^{\theta^\top F(X,Y)}\)
this is \(P(Y|X)\), right? but it should be \(P(Y|X, Y_{1:t-1})\)?


arxiv.org arxiv.org

without interaction with the expert
how do things change when you can interact with the expert?

 Feb 2020

www.semanticscholar.org www.semanticscholar.org

Attention: Mask
by this, do they mean the attention weighted aggregation step?

\(n_{\text{layer}} \, d_{\text{model}} \, 3 d_{\text{attn}}\)
are they ignoring the \(W^O\) matrix? from the original Transformer paper?

Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4)
hmm, interesting. Why are larger models more sample efficient?

The performance penalty depends predictably on the ratio \(N^{0.74}/D\)
That is weird, what's the origin of this?

hmm do they look at generalization gap?
is the trend in test loss with parameter count mostly due to its effect on expressivity / training loss (and similarly with compute)?


Local file Local file

Some preliminary numerical simulations show that this approach does predict high robustness and log scaling. However, it only makes any sense if transitions from one phenotype to another phenotype are memoryless.
I thought the whole transition matrix approach itself assumed memorylessness

Let \(P\) be a row vector specifying the probability distribution over phenotypes. We want to find a stochastic transition matrix \(M\) (rows sum to one) such that
why do we want P to be stationary?

\(M\) has 1s on the diagonals, and 0s elsewhere, for example
that is high robustness right?
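
A small sketch of the stationarity condition \(PM = P\), assuming made-up phenotype frequencies and reading robustness as the diagonal entry \(M_{pp}\) (my reading, not necessarily the paper's definition). The identity matrix satisfies the condition trivially with robustness 1, while a rank-one matrix whose rows all equal \(P\) also satisfies it but with robustness equal to the phenotype frequency.

```python
def apply(P, M):
    """Row vector times matrix: (P M)_j = sum_i P_i * M_ij."""
    return [sum(P[i] * M[i][j] for i in range(len(P))) for j in range(len(M[0]))]

def is_stationary(P, M, tol=1e-12):
    """Check P M = P entrywise."""
    return all(abs(a - b) < tol for a, b in zip(apply(P, M), P))

def robustness(M):
    """Assumed definition: probability of staying in the same phenotype."""
    return [M[i][i] for i in range(len(M))]

P = [0.7, 0.2, 0.1]  # hypothetical phenotype frequencies

identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Every row equals P: the next phenotype ignores the current one.
uncorrelated = [P[:] for _ in P]
```

Both matrices leave \(P\) stationary, so stationarity alone does not pin down \(M\); the two differ only in how much weight sits on the diagonal.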

Fano’s inequality)
Doesn't Fano's inequality give \(H(X|Y)\) in the numerator, which is a lower bound on \(H(X)\), and so doesn't imply this?


arxiv.org arxiv.org

Intrinsic motivations f
Basically the idea is that the RL/HER part is intrinsically motivated with LP to solve more and more tasks, while the goal sampling part is intrinsically motivated to get trajectories that give new information for learning the reward function. I suppose they could add a bit of LP to the goal sampling as well, to have some tendency to sample trajectories that may help to solve new tasks.

High-quality trajectories are trajectories where the agent collects descriptions from the social partner for goals that are rarely reached.
why do you want more than one description for a goal? A: Ah, because the goal will be the same but the final state may not be for each of these trajectories, thus giving more data to train the reward function.

 Jan 2020

arxiv.org arxiv.org

f memory-based sample efficient methods
bandit methods, which are suitable for sequences of independent experiments


arxiv.org arxiv.org

We find that the object geometry makes a significant differences in how hard the problem is
apply some goal exploration process like POET?



When it comes to NNs, the regularization mechanism is also well appreciated in the literature, since they traditionally suffer from overparameterization, resulting in overfitting.
No. Overparametrized networks have been shown to generalize even without explicit regularization (Zhang et al. 2017)


arxiv.org arxiv.org

Therefore, we can get the following generalization bound:
as long as the value of L is bounded by at most 1/delta or something right?


arxiv.org arxiv.org

They use on-average stability that does not imply generalization bounds with high probability
Their bounds on expectations can be converted to bounds with high probability, as they claim on page 3, citing "Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010."


arxiv.org arxiv.org

for \(T \le m\) step
one pass SGD

validation error which is used as an empirical estimate for \(R(w_1)\)
so their bound has the disadvantage that it needs an estimate given by the validation error to compute the bound! So it can't be computed from the training data alone!!

our bound corroborates the intuition that whenever we start at a good location of the objective function, the algorithm is more stable and thus generalizes better.
This is a nice intuition for why good initializations can lead to good generalization

\(R(w_1) - R^\star\)
remember that \(R\) is the population risk, so this isn't a priori something that we can know?

 Dec 2019


While it is known having a finite VC-dimension (Vapnik and Chervonenkis, 1991) or equivalently being \(\mathrm{CVEEE}_{loo}\) stable (Mukherjee et al., 2006) is necessary and sufficient for the Empirical Risk Minimization (ERM) to generalize,
it is only necessary to generalize in the worst case over data distributions right?


www.ronjatutorials.com www.ronjatutorials.com

position attribute.
what is an attribute?


arxiv.org arxiv.org

The bounds based on \(\ell_2\)-path norm and spectral norm can be derived directly from those based on \(\ell_1\)-path norm and \(\ell_2\) norm respectively
Hmm. how?
This implies that even though the \(\ell_2\)-path norm bounds seem non-vacuous in Figure 1, they aren't. They appear so because we have dropped the "terms that only depend on depth or number of hidden units", which are large for the \(\ell_2\)-path norm.


arxiv.org arxiv.org

ExperimentsIn
experiments only in 2 dimensional input space. Could results depend on the input dimensionality?

 Nov 2019

arxiv.org arxiv.org

\(\min(\alpha_T - d, 2\alpha_S)\)
the min is because depending on which is larger one or the other of the two limits of the integral, dominates
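
A quick arithmetic check, assuming the learning-curve exponent has the form \(\beta = \frac{1}{d}\min(\alpha_T - d, 2\alpha_S)\) (my reading of the garbled highlight; the function name is made up). It reproduces the two cases quoted further down in these notes: Laplace teacher and student, and Gaussian teacher with Laplace student.

```python
import math

def learning_curve_exponent(alpha_teacher, alpha_student, d):
    """Assumed form: beta = min(alpha_T - d, 2 * alpha_S) / d."""
    return min(alpha_teacher - d, 2 * alpha_student) / d

d = 10
laplace = d + 1      # alpha of a Laplace kernel, per the highlighted text
gaussian = math.inf  # alpha of a Gaussian kernel

beta_laplace_laplace = learning_curve_exponent(laplace, laplace, d)    # 1/d
beta_gaussian_laplace = learning_curve_exponent(gaussian, laplace, d)  # 2(1 + 1/d)
```

In the first case the teacher term \(\alpha_T - d = 1\) is the smaller one; in the second the teacher term is infinite, so the student term \(2\alpha_S\) sets the rate.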

29
Compare this to the analysis of Sollich ( https://pdfs.semanticscholar.org/7294/862e59c8c3a65167260c0156427f4757c67e.pdf ), which is in the well-specified setting. There, there is no dependence on the labels of the training data. Here neither, but at least there is dependence on the distribution of the target labels, so it allows for more general types of assumptions.

\(K(x)\) is an even
which can be seen from its definition as a covariance.

of a Teacher Gaussian process with covariance \(K_T\) and assume that they lie in the RKHS of the Student kernel \(K_S\), namely
Ah yes, being in the RKHS means having a finite norm in the RKHS, which makes sense. But I'm not sure how restrictive this is, just like I'm not sure whether simply being n-times differentiable is a good measure of the complexity of a function. Are there n-times differentiable functions that approximate any less smooth function? Maybe the Lipschitz constants of the derivatives (smoothness constants) could be more quantitatively useful?

If both kernels are Laplace kernels then \(\alpha_T = \alpha_S = d + 1\) and \(E_{MSE} \sim n^{-1/d}\), which scales very slowly with the dataset size in large dimensions. If the Teacher is a Gaussian kernel (\(\alpha_T = \infty\)) and the Student is a Laplace kernel then \(\beta = 2(1 + 1/d)\), leading to \(\beta \to 2\) as \(d \to \infty\)
hm, wait what? But wouldn't the Bayes optimal answer be obtained if the student has the same kernel as the teacher?

as \(n\to\infty\)

We perform kernel classification via the algorithm soft-margin SVM.
which approximates a point estimator of the Gaussian process classifier, but I don't know the exact relation.

man
mean

Importantly (i) Eq. (1) leads to a prediction for \(\beta(d)\) that accurately matches our numerical study for random training data points, leading to the conjecture that Eq. (1) holds in that case as well.
Compare with: https://arxiv.org/pdf/1909.11500.pdf where they find that random inputs give rise to plateaus, hmm at least with epochs, but they cite papers where these are apparently found for training set size (perhaps only for thin networks?)

As a result, various works on kernel regression make the much stronger assumption that the training points are sampled from a target function that belongs to the reproducing kernel Hilbert space (RKHS) of the kernel (see for example [Smola et al., 1998]). With this assumption \(\beta\) does not depend on \(d\) (for instance in [Rudi and Rosasco, 2017] \(\beta = 1/2\) is guaranteed). Yet, RKHS is a very strong assumption which requires the smoothness of the target function to increase with \(d\) [Bach, 2017] (see more on this point below), which may not be realistic in large dimensions.
I think when they say "it belongs to an RKHS", they mean that it does so with a fixed/bounded norm (otherwise almost any function would satisfy this, for universal RKHSs). This is consistent with the next comment saying that this assumption implies smoothness (smoothness <-> small RKHS norm, generally).


openreview.net openreview.netpdf1

Seems like PPO works better than their approach in several of the experiments. Hmm


arxiv.org arxiv.org

irreducible error (e.g., Bayes error)
more commonly model capacity limitations I guess?


arxiv.org arxiv.org

GMM on a dataset of previously sampled parameters concatenated to their respective ALP measure.
the GMM is only fitted to the parameter part or the (parameter, ALP) vector?


www.ki.tuberlin.de www.ki.tuberlin.de

nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher so that perfect generalization is still impossible. For \(\alpha\) slightly above \(\alpha_c\) only the couplings of the teacher survive.
Lenka Zdeborová and Florent Krzakala have found that algorithms tend to have their longest running times at the capacity threshold, i.e. the worst-case examples seem to live at that transition.

For a committee of two students it can be shown that when the number of examples is large, the information gain does not decrease but reaches a positive constant. This results in a much faster decrease of the generalization error. Instead of being inversely proportional to the number of examples, the decrease is now exponentially fast
For the case of the perceptron you can see how the uncertainty region (whose volume approximates the generalization error) approximately halves (or is reduced by about a constant) after every optimal query.
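
A one-dimensional analogue of this halving argument, as a sketch only (a threshold on [0, 1] stands in for the perceptron, and the function names are made up): querying the midpoint of the current uncertainty interval halves it each time, while random examples shrink it much more slowly.

```python
import random

def active_learning_width(true_threshold, n_queries):
    """Bisection: each query halves the interval containing the threshold."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        mid = (lo + hi) / 2
        if mid >= true_threshold:   # the teacher labels the queried point
            hi = mid
        else:
            lo = mid
    return hi - lo  # width of the remaining uncertainty region

def passive_learning_width(true_threshold, n_examples, rng):
    """Random examples: only points near the threshold shrink the region."""
    lo, hi = 0.0, 1.0
    for _ in range(n_examples):
        x = rng.random()
        if x >= true_threshold:
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return hi - lo

active = active_learning_width(0.3, n_queries=10)
passive = passive_learning_width(0.3, n_examples=10, rng=random.Random(0))
```

After n queries the active width is exactly \(2^{-n}\), the exponential decrease mentioned in the highlight; for random examples the expected width decays only like \(1/n\).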


incompleteideas.net incompleteideas.net

In general, the baseline leaves the expected value of the update unchanged, but it can have a large
because baseline depends on S, it can reduce the variance from state to state (not the one from action to action).
WRONG: it can reduce the action-to-action variance of the gradient (not the variance of the advantage!)
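
A minimal exact check of this correction, assuming a hypothetical two-action policy with \(\pi(a{=}1) = \sigma(\theta)\) and made-up rewards. It computes the mean and variance of the one-sample REINFORCE estimate in closed form, with and without a baseline.

```python
import math

def grad_stats(theta, rewards, baseline):
    """Exact mean and variance of g(a) = (r_a - baseline) * d/dtheta log pi(a)
    for a two-action policy with pi(a=1) = sigmoid(theta)."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    probs = (1.0 - p1, p1)
    scores = (-p1, 1.0 - p1)  # d/dtheta log pi(a) for a = 0 and a = 1
    samples = [(rewards[a] - baseline) * scores[a] for a in (0, 1)]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

rewards = (10.0, 11.0)  # hypothetical rewards for actions 0 and 1
mean_plain, var_plain = grad_stats(0.0, rewards, baseline=0.0)
mean_base, var_base = grad_stats(0.0, rewards, baseline=10.5)
```

The two means coincide, since the baseline term has zero expectation under the policy, while the variance across actions drops from about 27.6 to 0 in this toy case. The state here is fixed, so all of the reduction is action-to-action.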

 Oct 2019

arxiv.org arxiv.org

computevar1bbÂj
this is the covariance matrix

This suggests that the effect of \(j(x)\) is to rotate the gradient field and move the critical points, also seen in Fig. 4b.
how does this equation suggest this?

sampling with replacement has better regularization
but you are saying that the temperature (\(\beta^{-1}\)) is lower when you sample with replacement, so the regularization should be less?

conservative
how does this mean that it is conservative?

This implies that SGD implicitly performs variational inference with a uniform prior, albeit of a different loss than the one used to compute backpropagation gradients
The interpretation as variational inference with a uniform prior comes from reading the minimization objective as an ELBO: the second term is like the KL divergence between the approximate posterior and a uniform prior (which just gives the entropy). Nice
If \(\rho\) doesn't have any constraints then this should give the exact posterior with uniform prior, and likelihood given by \(\Phi(x)\)
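
To spell out the unconstrained case, here is a short derivation, assuming the objective has the form \(\mathbb{E}_{x\sim\rho}[\Phi(x)] - \beta^{-1}H(\rho)\) with \(H\) the entropy (my reading of the paper's functional; \(Z\) is a normalizer I introduce here):

```latex
\begin{aligned}
F(\rho) &= \mathbb{E}_{x\sim\rho}[\Phi(x)] - \beta^{-1} H(\rho), \\
\rho^\star(x) &= \frac{e^{-\beta\Phi(x)}}{Z}, \qquad Z = \int e^{-\beta\Phi(x)}\,dx, \\
\mathrm{KL}(\rho \,\|\, \rho^\star)
  &= -H(\rho) + \beta\,\mathbb{E}_{x\sim\rho}[\Phi(x)] + \log Z, \\
F(\rho) &= \beta^{-1}\,\mathrm{KL}(\rho \,\|\, \rho^\star) - \beta^{-1}\log Z .
\end{aligned}
```

Since the KL term is nonnegative and vanishes only at \(\rho = \rho^\star\), the unconstrained minimizer is the Gibbs distribution \(\rho^\star \propto e^{-\beta\Phi}\), i.e. the posterior for a uniform prior and a likelihood proportional to \(e^{-\beta\Phi(x)}\).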


arxiv.org arxiv.org

The second particularity is that since the computation of the reward \(R(p, c, \theta, o)\) is internal to the machine, it can be computed any time after the experiment \((c, \theta, o)\) and for any problem \(p \in P\), not only the particular problem that the agent was trying to solve. Consequently, if the machine experiments a policy \(\theta\) in context \(c\) and observes \(o\) (e.g. trying to solve problem \(p_1\)), and stores the results \((c, \theta, o)\) of this experiment in its memory, then when later on it self-generates problems \(p_2, p_3, \ldots, p_i\) it can compute on the fly (and without new actual actions in the environment) the associated rewards \(R_{p_2}(c, \theta, o), R_{p_3}(c, \theta, o), \ldots, R_{p_i}(c, \theta, o)\) and use this information to improve over these goals \(p_2, p_3, \ldots, p_i\).
like hindsight experience replay
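
A toy sketch of that relabeling idea, with made-up names and dynamics (nothing here is the paper's code): experiments are stored once, and the internal reward function is then evaluated against any later goal without acting again.

```python
def reward(goal, outcome):
    """Hypothetical goal-conditioned reward: negative distance to the goal."""
    return -abs(goal - outcome)

memory = []  # stored experiments: (context, policy parameters, outcome)

def run_experiment(context, theta):
    """Stand-in for acting in the environment (assumed toy dynamics)."""
    outcome = context + theta
    memory.append((context, theta, outcome))
    return outcome

def relabel(goals):
    """Score every stored experiment against every goal, with no new actions."""
    return {g: [reward(g, o) for (_, _, o) in memory] for g in goals}

# The agent acts once while pursuing goal 2.0 ...
run_experiment(context=0.0, theta=1.5)
# ... and later reuses the same outcome for goals it generates afterwards.
later_rewards = relabel([1.0, 1.5])
```

The single stored outcome 1.5 is a miss for the original goal but an exact hit for the later goal 1.5, which is the sense in which this resembles hindsight experience replay.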


arxiv.org arxiv.org

Although methods to learn disentangled representation of the world exist [25,26,27], they do not allow to distinguish features that are controllable by the learner from features describing external phenomena that are outside the control of the agent.
learning controllable features is similar to learning a causal model of the world, I think


arxiv.org arxiv.org

We find that the full NTK has better approximation properties compared to other function classes typically defined for ReLU activations [5, 13, 15], which arise for instance when only training the weights in the last layer, or when considering Gaussian process limits of ReLU networks (e.g., [20, 24, 32]).
NTK has "better approximation properties". What do they mean more precisely?


arxiv.org arxiv.org

and we have left the activation kernel unchanged, \(K_\ell = \frac{1}{M_\ell} A'_\ell A'^{T}_\ell\)
what is the reason to do this?

\((A_\ell | J_\ell)\)
J_l is the covariance for a single column of A_l right?

Second, we modified the inputs by zeroing out all but the first input unit (Fig. 1 right).
how does this work more precisely? The targets are generated by feeding the modified inputs to the "teacher network", but the student network gets the unmodified inputs?

for MAP inference, the learned representations transition from the input to the output kernel, irrespective of the network width.
how is MAP inference implemented?

The representations in learned neural networks slowly transition from being similar to the input kernel (i.e. the inner product of the inputs) to being similar to the output kernel (i.e. the inner product of one-hot vectors representing targets).
this transition, as what? as the layer width is increased?

the covariance in the top-layer kernel induced by randomness in the lower-layer weights.
what does he mean by this?

e.g. compare performance in Garriga-Alonso et al. (2019) and Novak et al. (2019) against He et al. (2016) and Chen et al. (2018)).
but in here the GP networks lack many important features like batchnorm, pooling etc! Not sure if this example is a fair comparison. Also, not clear whether this difference is due to finite width or SGD (a question that Novak also asks)

enabling efficient and exact reasoning about uncertainty
Only in regression... AAaaAaaAh ÒwÓ



significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019]
Interesting, so apparently the NTK works better than the NNGP for this architecture at least


www.jmlr.org www.jmlr.org

Optimally, these parameters are chosen such that the true predictive process \(P(t_* | x_*, S)\) is closest to \(Q(t_* | x_*, S)\) in relative entropy.
in which sense is this optimal?

Bayes classifier
I thought the Bayes classifier would predict sign ( E_w [P(ty)y(xw)]  0.5) ?

our task is then to separate the structure from the noise.
Well, and to find the correct regularity; generalization is not just about separating structure from noise. Unless by "noise" here, you mean also the stochasticity in the training sample (of inputs)..

We know of no interesting real-world learning problem which comes without any sort of prior knowledge
Yep, no free lunch

(the lucky case)
again I wouldn't call it "unlucky", because the whole proof is that the generalization is good, because it's very unlikely to have obtained this training set by luck, so that it's most likely that we obtained it by having chosen a good prior. So I would call it "good prior" case.


arxiv.org arxiv.org

, such as cross entropy loss, encourage a larger output margin
The fact that they also encourage a large SVM-margin is not so trivial, though

the gap between predictions on the true label and next most confident label.
In SVMs, for instance, "margin" refers to the distance between the classification boundary and a point. This can be related to the definition of margin here, but they are not the same?
E.g. if we have a small SVM-margin but a really large weight norm, then we could still have a large output margin.
Ah, that's why they normalize by weight norm I suppose yeah.
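
A small sketch of why the normalization matters, assuming a hypothetical linear multiclass scorer (the paper's networks and exact norm differ; the Frobenius norm is my stand-in): rescaling the weights inflates the output margin arbitrarily, while the normalized margin stays put.

```python
import math

def logits(W, x):
    """Scores of a linear classifier: one row of W per class."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def output_margin(W, x, label):
    """Gap between the true-label score and the best other score."""
    z = logits(W, x)
    return z[label] - max(v for k, v in enumerate(z) if k != label)

def frobenius_norm(W):
    return math.sqrt(sum(w * w for row in W for w in row))

def normalized_margin(W, x, label):
    """Output margin divided by the weight norm (assumed normalization)."""
    return output_margin(W, x, label) / frobenius_norm(W)

W = [[1.0, 0.0], [0.0, 1.0]]
W_scaled = [[10.0 * w for w in row] for row in W]
x, label = [2.0, 1.0], 0
```

With these values the output margin grows from 1 to 10 under the rescaling, although the decision boundary and the geometric distance of `x` to it are unchanged; the normalized margin is the same for `W` and `W_scaled`.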


arxiv.org arxiv.org

This is further consistent with recent experimental work showing that neural networks are often robust to re-initialization but not re-randomization of layers (Zhang et al. [42]).
what does this mean?

Kernels from single hidden layer randomly initialized ReLU network convergence to analytic kernel using Monte Carlo sampling (\(M\) samples). See §I for additional discussion
I think the Monte Carlo estimate of the NTK is a Monte Carlo estimate of the average NTK (averaged over initializations), not of the initialization-dependent NTK which Jacot studied. Jacot showed that in the infinite-width limit both are the same.
But it seems from their results that even at finite width the average NTK is closer to the limit NTK than the single-sample NTK. This makes sense, because the single-sample one has extra fluctuations around the average.

We observe that the empirical kernel \(\hat{\Theta}\) gives more accurate dynamics for finite width networks.
That is a very interesting observation!

\(\eta = \eta_0 / n\)
yeah! so in standard parametrization, the learning rate is indeed O(1/n) !


github.com github.com

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
You didn't expect hypothes.is in here, did you?
Bamboozled again!
