 Last 7 days

arxiv.org


pθ(z)
This should read \(\ln p_\theta(\mathbf{z})\).

while reverse KL divergence is not a viable objective function,
I guess in the above derivation we don't really have access to p_d(x), even if we had the trained discriminator. We only have the softmaxes, which give us the two ratios in expression (23); these are conditional probabilities (p(g|x) and p(d|x)) rather than p_d and p_g separately.

Additionally, it has been suggested that reverse KL divergence, \(D_{KL}(p_g \| p_d)\), is a better measure for training generative models than normal KL divergence, \(D_{KL}(p_d \| p_g)\), since it minimises \(\mathbb{E}_{x\sim p_g}[-\ln p_d(x)]\) [92]
I think the intuition here is that minimizing \(D_{KL}(p_g \| p_d)\) has the property of ensuring that p_g is contained inside the support of p_d. So we are more sure that samples will be realistic, though it may suffer from mode collapse, like GANs, but that may be a less serious problem.
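Writing out the two directions (my notation, using the survey's \(p_d\), \(p_g\)) makes this intuition concrete:

```latex
D_{\mathrm{KL}}(p_d \,\|\, p_g) = \mathbb{E}_{x \sim p_d}\!\left[\ln p_d(x) - \ln p_g(x)\right]
\qquad \text{(forward, mode-covering)}

D_{\mathrm{KL}}(p_g \,\|\, p_d) = \mathbb{E}_{x \sim p_g}\!\left[\ln p_g(x) - \ln p_d(x)\right]
\qquad \text{(reverse, mode-seeking)}
```

The reverse direction blows up wherever \(p_g\) puts mass but \(p_d(x) \approx 0\), so minimising it forces \(p_g\) inside the support of \(p_d\) (realistic samples, possible mode collapse); the forward direction instead penalises \(p_g\) for missing regions where \(p_d\) has mass (mode covering, possibly unrealistic samples).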

2-Stage VAEs: By interpreting VAEs as regularised autoencoders, it is natural to apply less strict regularisation to the latent space during training then subsequently train a density estimator on this space, thus obtaining a more complex prior [53]. Vector Quantized Variational Autoencoders (VQ-VAE) [170], [215] achieve this by training an autoencoder with a discrete latent space, then training a powerful autoregressive model (see Section 5) on the latent space.
So I think what VQ-VAEs are trying to achieve is a VAE with a very flexible learned prior. If we look at the VAE objective (eq 19), we see that if p(z) equals q(z), then the KL divergence (averaged over q(x), which I'm interpreting to be the data distribution) becomes I(z;x), i.e. the objective contains a minus-mutual-information term between z and x. Interesting!
However, we can think of VQ-VAE as basically ignoring this explicit regularization term during its first stage (or maybe it's implicitly approximated via its codebook training objectives! hmm). In the second stage, we just fit the prior to approximate q(z), which is what we were assuming to be true in the analysis.
It seems that some of the other works cited in the previous paragraph try to use the expression in eq 19 (and thus the I(z;x) regularization) directly, though because it is intractable to compute, they approximate it in different ways.
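For reference, my derivation of the claim above (the standard aggregate-posterior decomposition, with \(q(z) = \mathbb{E}_{q(x)}[q(z|x)]\) the aggregate posterior):

```latex
\mathbb{E}_{q(x)}\!\left[D_{\mathrm{KL}}\big(q(z|x) \,\|\, p(z)\big)\right]
  = I_q(x; z) + D_{\mathrm{KL}}\big(q(z) \,\|\, p(z)\big)
```

so when the prior matches the aggregate posterior, \(p(z) = q(z)\), the averaged KL regulariser reduces to exactly the mutual information \(I_q(x;z)\), which the ELBO then penalises with a minus sign.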

2-stage VAEs that first map data to latents of dimension r then use a second VAE to correct the learned density can better capture the data [35]
because then the second VAE does recover the data distribution on the latent, according to that paper. Interesting!

(14)
I think here q(x) is supposed to be q(x|z), and the e^E(x) should be e^E(z)? (Well, e^E(z) is equivalent to e^E(x) if x is a function of z.) But yeah, I think the q part should be q(x|z). For example, if using normalizing flows, it would be the Jacobian of the inverse flow.

implicit energy models
What are implicit energy models? I thought they said previously that EBMs are not implicit generative models?

This is made worse by the finite nature of the sampling process, meaning that samples can be arbitrarily far away from the model's distribution [57].
Does he mean because of the finite step size? Is that a big problem? Hmm, not sure I get this sentence.

the gradient of the negative log-likelihood loss \(\mathcal{L}(\theta) = \mathbb{E}_{x\sim p_d}[-\ln p_\theta(x)]\) has been shown to approximately demonstrate the following property [21], [193]
this contrastive divergence result is very cool!
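The property in question, written out (the standard EBM gradient, with \(p_\theta(x) \propto e^{-E_\theta(x)}\)):

```latex
\nabla_\theta \mathcal{L}(\theta)
  \approx \mathbb{E}_{x \sim p_d}\!\left[\nabla_\theta E_\theta(x)\right]
        - \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta E_\theta(x)\right]
```

i.e. push down the energy of data samples and push up the energy of model samples; the intractable \(\ln Z(\theta)\) term turns into an expectation under the model, which is what sampling approximates.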

TABLE 1
What do the stars in training speed, sample speed and param efficiency correspond to, quantitatively?
Also, it would be nice to know robustness to hyperparameters, as that is often a big part of "training time".

training speed is assessed based on reported total training times
Hmm, ideally we would know if they had trained until convergence, or if they had gone past convergence.

Choosing what to optimise for has implications for sample quality, with direct likelihood optimisation often leading to worse sample quality than alternatives.
Is this in part because of noise in the data, which the likelihood-based models also fit?



constant memory
They say constant memory, but below the memory is said to be \(O(N)\). Which one is true? As far as I can tell, the latter is a typo?

 Apr 2021

arxiv.org

especially in the most difficult long horizon setting.
Actually, the graph shows a larger difference for 1 task than for more tasks?

This process is scalable because pairing happens after the fact, making it straightforward to parallelize via crowdsourcing.
Could you not crowdsource instruction following too?
Maybe this adds extra diversity though.
Probably combining both would be best.

 Mar 2021

Local file

Training an agent for such social interactions most likely requires drastically different methods – e.g. different architectural biases – than classical object-manipulation training
Or a lot of pretraining data, which given current empirical findings, tends to work better.

To enable the design and study of complex social scenarios in reasonable computational time
Alternatively, you could consider more complex environments but with more offline algorithms, like bootstrapping from supervised learning.

rather than language-based social interactions
An important recent counterexample is IIL.


openreview.net

(8)
This whole variational calculation is basically like combining Monte Carlo integration (with importance sampling) and Jensen's inequality (to bring the expectation outside the log). The cool thing is that optimizing over q makes the approximation exact, if our model for q is sufficiently expressive.
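The two steps spelled out (my summary):

```latex
\ln p(x) = \ln \int p(x, z)\, dz
         = \ln \mathbb{E}_{z \sim q}\!\left[\frac{p(x, z)}{q(z)}\right]
         \geq \mathbb{E}_{z \sim q}\!\left[\ln \frac{p(x, z)}{q(z)}\right]
```

The middle equality is importance sampling with proposal \(q\); the inequality is Jensen's, and the gap is exactly \(D_{\mathrm{KL}}(q(z) \,\|\, p(z|x))\), so the bound is tight iff \(q\) equals the true posterior.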

Consequently, maximizing the log-likelihood of the continuous model on uniformly dequantized data cannot lead to the continuous model degenerately collapsing onto the discrete data, because its objective is bounded above by the log-likelihood of a discrete model.
I don't see how this argument works
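My reconstruction of the argument (following the standard uniform-dequantization bound): for discrete data \(x \in \{0, \ldots, 255\}^D\) and noise \(u \sim \mathcal{U}[0,1)^D\), Jensen's inequality gives

```latex
\mathbb{E}_{u}\!\left[\ln p_c(x + u)\right]
  \leq \ln \mathbb{E}_{u}\!\left[p_c(x + u)\right]
  = \ln \int_{[0,1)^D} p_c(x + u)\, du
  =: \ln P(x)
```

and \(P\) is a valid discrete pmf (the unit cubes tile the space), so the dequantized continuous objective is bounded above by a discrete model's log-likelihood, which is at most 0. The continuous density therefore cannot collapse onto point masses, since that would send the left-hand side to infinity and violate the bound.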


arxiv.org

In our datasets Fig. 2c, we find empirically that for the same amount of collection time, play indeed covers 4.2 times more regions of the available interaction space than 18 tasks worth of expert demonstration data, and 14.4 times more regions than random exploration.
That is indeed a cool finding

Play data is cheap: Unlike expert demonstrations (Fig. 5), play requires no task segmenting, labeling, or resetting to an initial state, meaning it can be collected quickly in large quantities.
Well, you still need to have enough people playing for enough time, which may not be cheap. For example, in Imitating Interactive Intelligence they had to spend about 200K pounds to pay people to play with their environment.

We additionally find that play-supervised models, unlike their expert-trained counterparts, are more robust to perturbations and exhibit retrying-till-success behaviors.
I guess because in the play data there were examples of reaching the goal even from suboptimal trajectories.


arxiv.org

Intuitively, this is equivalent to taking the average of demonstrated actions at each specific state.
Unless you model the distribution, e.g. using normalizing flows, which should be better?
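A toy illustration of why the averaging matters (my own hypothetical example, not from the paper): with multimodal demonstrations, the MSE-optimal behavioural-cloning prediction is the mean action, which may never have been demonstrated.

```python
import numpy as np

# Hypothetical state: half the demonstrators steer left (-1.0) and half
# steer right (+1.0) around an obstacle. A mean-squared-error fit
# (standard behavioural cloning) recovers the average action, 0.0,
# which was never demonstrated and drives straight into the obstacle.
demonstrated_actions = np.array([-1.0, -1.0, +1.0, +1.0])

# The MSE-optimal constant prediction is the sample mean.
bc_action = demonstrated_actions.mean()
print(bc_action)  # 0.0

# A density model (e.g. a mixture or a normalizing flow) would instead
# keep both modes; even a two-bin histogram already shows them.
counts, _ = np.histogram(demonstrated_actions, bins=[-1.5, 0.0, 1.5])
print(counts)  # [2 2]
```

Sampling from the fitted density would then return one of the demonstrated modes rather than their average.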


arxiv.org

TextWorld addresses this discrepancy by providing programmatic and aligned linguistic signals during agent exploration.
But isn't this just substituting the human language instructor with a rule-based one that is bound to be of lower quality?


arxiv.org

Informally, all else being equal, discontinuous representations should in many cases be "harder" to approximate by neural networks than continuous ones. Theoretical results suggest that functions that are smoother [34] or have stronger continuity properties such as in the modulus of continuity [33, 10] have lower approximation error for a given number of neurons.
and they probably generalize better, as there are several works showing that DNNs are implicitly biased towards "smooth" functions


flowersteam.github.io

In the skill learning phase, LGB relies on an innate semantic representation that characterizes spatial relations between objects in the scene using predicates known to be used by preverbal infants [Mandler, 2012].
So this is feature-engineered, no?

Although the policy overgeneralizes, the reward function can still identify whether plants have grown or haven’t.
How has the reward function learned the association between "feed" and "the object grows"? I guess that was taught from the language descriptions? It should be able to learn the reward function correctly then.

more aligned data autonomously.
I think this is similar to the idea of self-training.


www.frontiersin.org

This would require a means for representing meaning from experience—a situation model—and a mechanism that allows information to be extracted from sentences and mapped onto the situation model that has been derived from experience, thus enriching that representation
This basically means adding extra information that is inferred, to what is just directly observed no?


www.cc.gatech.edu

What can and should the user be doing while the AI agent is taking its turn to increase engagement?
Maybe the agent's actions themselves should be engaging enough? We should aim for that I think

We employ a turn-based framework because it is a common way of organizing co-creative interactions [3, 12, 13] and because it suits evolutionary and reinforcement-learning approaches that require discrete steps [2, 7, 8, 14].
I think that's a significant limitation. More fluid interactions can only take place in continuous-time settings.


arxiv.org

As can be seen in Figure 3 (left), the training performance was sensitive to the weight scale σ, despite the fact that a weight normalisation scheme was being used.
It would be interesting to explore whether this pitfall can actually have an effect in some scenario where one isn't using an abnormally high initialization.

 Feb 2021

research.fb.com

natural motions
more natural than the baseline*

For quantitative evaluation, we computed the mean squared error between the generated motion and motion capture on a left-out test set, for fingertip positions and joint angles
This is problematic because there could be many motions which are good but quite different, and thus have a big MSE.

In total, we used approximately 120 minutes of data
What? Why didn't you use more data? ... We need to do scaling experiments with this.


arxiv.org

2) autoregression reduces the amount of fast movements, making the velocity histogram more similar to the ground truth
Huh? I see autoregression increasing the amount of fast movements, no?

"In which video...": (Q1) "...are the character's movements most human-like?" (Q2) "...do the character's movements most reflect what the character says?" (Q3) "...do the character's movements most help to understand what the character says?" (Q4) "...are the character's voice and movement more in sync?"
It would also be good to do observational studies where users are simply asked to interact with different characters. And we measure how engaged they are.

Hence, after five epochs of training with autoregression, our model has full teacher forcing: it always receives the ground-truth poses for autoregression. This procedure greatly helps with learning a model that properly integrates non-autoregressive input.
Interesting, I would have guessed that doing it the other way (starting with teacher forcing and decreasing this to fully autoregressive training) would have been the natural curriculum.
What was the idea for doing this? Is it basically to gradually make the information in the autoregressive part of the input more and more predictive, so that the network can anneal from using features in the speech part to using features in both speech and autoregressive motion?

This pretraining helps the network learn to extract useful features from the speech input, an ability which is not lost during further training.
I wonder if self-attention, like in transformers, would be better at learning which features to pick up on.

we pass a sliding window spanning 0.5 s (10 frames) of past speech and 1 s (20 frames) of future speech features over the encoded feature vectors.
So you can't generate gestures from audio/text in real time with this.

feature vector \(V_s\) was made distinct from all other encodings, by setting all elements equal to −15
It may be a good idea to learn these embeddings, no?


arxiv.org

three different domains: U.S. presidents, dog breeds, and U.S. national parks. We use multiple domains to include diversity in our tasks, choosing domains that have a multitude of entities to which a single question could be applied
three domains wow much diversity

We assume each task has an associated metric \(\mu_j(D_j, f_\theta) \in \mathbb{R}\), which is used to compute the model performance for task \(\tau_j\) on \(D_j\) for the model represented by \(f_\theta\).
So this assumes that the reward is definable. In some tasks, it may not be so easy, right? We may need to learn rewards.


arxiv.org

ecological pretraining
What's ecological pretraining?


arxiv.org

BART was pretrained using a denoising objective and a variety of different noising functions. It has obtained state-of-the-art results on a diverse set of generation tasks and outperforms comparably-sized T5 models [32].
Wait, so it was just trained on reconstruction? Hmm, interesting.
I guess the finetuning then really changes the output in this case, even though it still reuses knowledge in the model?


arxiv.org

We believe these properties provide good motivation for continuing to scale larger end-to-end imitation architectures over larger play datasets as a practical strategy for task-agnostic control.
MORE DATA


arxiv.org

Multiple avenues, including understanding more deeply the mechanisms of creative, knowledge-rich thought, or transferring knowledge from large, real-world datasets, may offer a way forward.
ALSO INTERESTING FUTURE DIRECTIONS

To go beyond competence within somewhat stereotyped scenarios toward interactive agents that can actively acquire and creatively recombine knowledge to cope with new challenges may require as yet unknown methods for knowledge representation and credit assignment, or, failing this, larger scales of data.
Probably most reliable approach: Larger scales of data

To record sufficiently diverse behaviour, we have "gamified" human-human interaction via the instrument of language games.
GAMIFICATION DATA GATHERING THROUGH GAMES

Winograd envisioned computers that are not "tyrants," but rather machines that understand and assist us interactively, and it is this view that ultimately led him to advocate convergence between artificial intelligence and human-computer interaction (Winograd, 2006)
And VR is a big part in the next step in humancomputer interaction

Generally, these results give us confidence that we could continue to improve the performance of the agents straightforwardly by increasing the dataset size.
Yeah, if you have lots of money to pay people...
But that is not that scalable.

Although the agents do not yet attain human-level performance, we will soon describe scaling experiments which suggest that this gap could be closed substantially simply by collecting more data.
We need more data

The regularisation schemes presented in the last section can improve the generalisation properties of BC policies to novel inputs, but they cannot train the policy to exert active control in the environment to attain states that are probable in the demonstrator's distribution.
Unless that active control can be learned by generalizing from learned actions in the demonstrations?

The mouse-look action distribution is in turn also defined autoregressively: the first sampled action splits the window bounded by (−1,1)×(−1,1) in width and height into 9 squares. The second action splits the selected square into 9 further squares, and so on. Repeating this process several times allows the agent to express any continuous mouse movement up to a threshold resolution.
Interesting representation of a continuous action space!
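A minimal sketch of the decoding side (my own reconstruction, not the paper's code): each discrete action in {0, ..., 8} picks one cell of a 3×3 grid inside the current window, the window shrinks to that cell, and after T steps the centre of the final window is the continuous mouse movement, with resolution shrinking by a factor of 3 per step.

```python
def decode_mouselook(actions):
    """Map a sequence of 3x3 cell choices to a point in (-1,1) x (-1,1).

    The row/column convention (a // 3, a % 3) is an arbitrary assumption
    made for this illustration.
    """
    x_lo, x_hi = -1.0, 1.0
    y_lo, y_hi = -1.0, 1.0
    for a in actions:
        col, row = a % 3, a // 3          # cell index within the 3x3 grid
        w = (x_hi - x_lo) / 3.0
        h = (y_hi - y_lo) / 3.0
        x_lo, x_hi = x_lo + col * w, x_lo + (col + 1) * w
        y_lo, y_hi = y_lo + row * h, y_lo + (row + 1) * h
    # The centre of the final window is the decoded continuous movement.
    return ((x_lo + x_hi) / 2.0, (y_lo + y_hi) / 2.0)

# Repeatedly choosing the centre cell (index 4) stays near the origin.
print(decode_mouselook([4, 4, 4]))  # approximately (0.0, 0.0)
```

The nice property is that the policy only ever outputs 9-way categorical distributions, yet T steps give a grid of \(9^T\) reachable points, so resolution grows exponentially in sequence length.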


arxiv.org

effective dimensionality of a Bayesian neural network is inversely proportional to the variance of the posterior distribution.
Posterior contraction in parameter space is what you are talking about, I think, no?


www.shortscience.org

Yet this also implies non-i.i.d. samples! Indeed, even if one could directly sample from the state-action distribution (like having its analytical form or an infinite experience replay buffer) and thus draw i.i.d. samples, the dependency will occur across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample and then my next sample depends on my previous sample (since it conditioned my policy update).
But this isn't a problem if the examples come from a fixed expert, no?

 Jan 2021

arxiv.org

Prefix-tuning prepends a sequence of continuous task-specific vectors to the input, which we call a prefix, depicted by red blocks in Figure 1 (bottom). For subsequent tokens, the Transformer can attend to the prefix as if it were a sequence of "virtual tokens", but unlike prompting, the prefix consists entirely of free parameters which do not correspond to real tokens.
and are thus differentiable! yay


openlabflowers.inria.fr

I guess a stepping stone towards this would be to optimize morphological growth processes to generate a body with a particular form in 3D (that would be quite similar to the differentiable CA, except that here the "cells" move in 3D space and have physical interactions that depend on their internal parameters and states)

(and it would also be novel to use population-based IMGEPs using gradient descent for local optimization towards self-generated goals)
similar to SIREN+CLIP (Deep Sleep)


arxiv.org

For this reason, we were unable to collect baselines such as an equivalent amount of high-quality human demonstrations for supervised baselines. See D for more discussion. We leave this ablation to future work.
So one possibility is that the feedback you got was of better quality than the data used for SL. Perhaps if you did SL on higher-quality data you would match the performance of the human feedback model?

it's unclear how much one can optimize against the reward model until it starts giving useless evaluations.
adversarial examples

Previous work on finetuning language models from human feedback [73] reported "a mismatch between the notion of quality we wanted our model to learn, and what the human labelers actually evaluated", leading to model-generated summaries that were high-quality according to the labelers, but fairly low-quality according to the researchers.
That is quite interesting

We rely on detailed procedures to ensure high agreement between labelers and us on the task, which we describe in the next section
Is this necessarily a good thing? Could you not miss other notions of "quality" this way? I guess you want to ensure a consistent notion of quality, rather than asking the question of "what about other notions of quality?"


openai.com

We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams.
poor generalization


proceedings.mlr.press

In fact, without visiting any states at all, since the queries are synthetic.
Grr, what about during the phase of training the generative model?

The x-axis represents the number of queries to the user, where each query elicits a label for a single state transition (s, a, s').
But isn't sampling from the model less expensive than sampling by optimizing AFs? Shouldn't that be taken into account?

having to visit unsafe states during the training process
It may have visited some during the training of the generative model, no?
But I guess not that many, if the generative model has been pretrained and it can generalize well.

As discussed in Section 4.3 and illustrated in the rightmost plot of Figure 5, the baselines learn a reward model that incorrectly extrapolates that continuing up and to the right past the goal region is good behavior.
But if the baselines aren't visiting those high-reward states, then they haven't actually fallen into reward hacking? I guess the idea is that they could in a new environment.
Takeaway is to do more exploration if you expect to be tested in new environments.

\(\tau^{\text{query}} = \arg\max_{z_0, a_0, z_1, \ldots, z_T} J(\tau) + \log p(\tau)\)
It's like a model-based version of DDPG + curiosity/exploration rewards?

Here, the state \(s \in \mathbb{R}^{64\times 64\times 3}\) is an RGB image with a top-down view of the car (Figure 3), and the action \(a \in \mathbb{R}^3\) controls steering, gas, and brake
In my experience, high-dimensional action spaces are even harder, especially when combined with high-dimensional state spaces.

The idea is to elicit labels for examples that the model is least certain how to label, and thus reduce model uncertainty.
What if the user(s) the model is querying are also uncertain? Then the model shouldn't spend too much time on these. This is one thing that learning progress aims to avoid!

To simplify our experiments, we sample trajectories \(\tau\) by following random policies that explore a wide variety of states. We use the observed trajectories to train a likelihood model
Seems like this may be an issue in more complex environments, as the random policies may not explore enough!
We probably want either human demonstrations and/or to iteratively refine the generative model with the later policies.

(4) maximize novelty of trajectories regardless of predicted rewards, to improve the diversity of the training data.
could also do something based on learning progress

In complex domains, the user may not be able to anticipate all possible agent behaviors and specify a reward function that accurately describes user preferences over those behaviors
So is the assumption that the automated way of exploring agent behaviours is better than what a human would consider?

 Dec 2020

www.wikiwand.com

it is far easier to obtain reliability beyond a certain margin by mechanisms in the end hosts of a network rather than in the intermediary nodes,[nb 4] especially when the latter are beyond the control of, and not accountable to, the former
This seems to me to be mostly saying that it's hard to change the standards at the low level, so it's easier to program at the higher level.
This is true not just of networks, but of computers, etc. too. But it may not always be the best approach!
They should have called it a "rule of thumb" rather than a principle, I think.

 Nov 2020

arxiv.org

all causal explanations are necessarily robust in this extreme case
Are they? Can you not have a thing that has a conditional causal effect?
Seems to me that causality should be a more quantitative thing (how robust is this predictor), rather than an either-or thing.


proceedings.neurips.cc

Goldblum et al. [119], which empirically observes that the large-width behavior of Residual Networks does not conform to the infinite-width limit.
Oh interesting!

While CNN-VEC possesses translation equivariance but not invariance (§3.11), we believe it can effectively leverage equivariance to learn invariance from data
How? if it doesn't imply anything about the output?

This is caused by poor conditioning of pooling networks. Xiao et al. [33] (Table 1) show that the conditioning at initialization of a CNN-GAP network is worse than that of FCN or CNN-VEC networks by a factor of the number of pixels (1024 for CIFAR-10). This poor conditioning of the kernel eigenspectrum can be seen in Figure 8. For linearized networks, in addition to slowing training by a factor of 1024, this leads to numerical instability when using float32
Interesting. Do models with a stronger bias, which may be associated with better generalization (see https://arxiv.org/abs/2002.02561 / https://arxiv.org/abs/1905.10843), also lead to poorer conditioning?
Hmm, but this did not affect the non-linearized model. Interesting. How does nonlinear GD avoid the issue?

regularization parameter
What regularization parameter?


arxiv.org

We add the superscript "all" to emphasize that gradient-based training of the networks is always performed on the entire dataset, while NNGP inference is performed on subsampled datasets.
Ah, hm, so the gradient method is given an advantage by being able to "look" at more data than the NNGP method?


arxiv.org

With few exceptions (Carlson et al., 2010), machine learning models have been confined to IID datasets that lack the structure in time from which humans draw correlations about long-range causal dependencies
All of RL studies non-IID data.

how pretraining obfuscates our ability to measure generalization (Linzen, 2020)
How??

but even complex simulation action spaces can be discretized and enumerated.
What's the problem with enumerating and discretizing action spaces?
What about agents that can act via free text, like those in AI Dungeon? Those are in principle not enumerable.

models the listener’s desires and experiences explicitly
what does it mean to model them explicitly versus implicitly?

Collecting data about rich natural situations is often impossible.
NOPE. VR.

Meanwhile, it is precisely humans' ability to draw on past experience and make zero-shot decisions that AI aims to emulate
which is what GPT3 is doing

Second, current cross-entropy training losses actively discourage learning the tail of the distribution properly, as statistically infrequent events are drowned out (Pennington et al., 2014; Holtzman et al., 2020).
That's what scaling is doing, shaving off those tails (as the scaling papers discuss)

it is unlikely that universal function approximators such as neural networks would ever reliably posit that people, events, and causality exist without being biased towards such solutions (Mitchell, 1980)
Why?

(which are usually thrown out before the dataset is released)
They shouldn't be! We should learn to probabilistically model the data

persistent enough to learn the effects of actions.
so we should aim for longer contexts? Yeah memory is important. There is research in extending transformers to have longer contexts

and active experimentation is key to learning that effect
why?

participate in linguistic activity, such as negotiation (Yang et al., 2019a; He et al., 2018; Lewis et al., 2017), collaboration (Chai et al., 2017), visual disambiguation (Anderson et al., 2018; Lazaridou et al., 2017; Liu and Chai, 2015), or providing emotional support (Rashkin et al., 2019).
Do we need the agent itself to participate, or is it not sufficient to feed it data from such types of interactions?

Framing, such as suggesting that a chatbot speaks English as a second language
Tbh, I think that framing can be both misleading and illuminating (about the agent's degree, or lack, of capability).

Robotics and embodiment are not available in the same off-the-shelf manner as computer vision models.
I think VR can solve that

(Liet al., 2019b; Krishna et al., 2017; Yatskar et al.,2016; Perlis, 2016)
Why don't you explain how these papers support the statement, at least?

Models must be able to watch and recognize objects, people, and activities to understand the language describing them
why?

Learned, physical heuristics, such as the fact that a falling cat will land quietly, are generalized and abstracted into language metaphors like "as nimble as a cat" (Lakoff, 1980).
So you just conceded that a prime example of things that need physical interaction to be learnt can be expressed in words?
You should make your point clearer. The point, I think, is that there is a lot of subconscious knowledge, like the example you give, which we can't quite put into words!

Language learning needs perception, because perception forms the basis for many of our semantic axioms
Could we not argue that language is all that we are conscious of? Even though it may be formed by external sensations, what we currently (consciously) know may be almost fully expressible by language, and therefore WS2 may be enough to learn all of conscious knowledge.

As text pretraining schemes seem to be reaching the point of diminishing returns,
Not yet, in long scale IIRC

parked my car in the compact parking space because it looked (big/small) enough
Hmm, I think the answer is "big"? This seems learnable from text statistics?

Continuing to expand hardware, data sizes, and financial compute cost by orders of magnitude will yield further gains, but the slope of the increase is quickly decreasing.
Right, but it's nice that we have a reliable way to improve performance.

Scale in data and modeling has demonstrated that a single representation can discover both rich syntax and semantics without our help (Tenney et al., 2019).
It's not without our help. The data is our help? ^^

You can’t learn language from the radio.
I think the question shouldn't be phrased as a dichotomy, but quantitatively: How much language (semantics) can you and can you not learn from the radio?

The futility of learning language from linguistic signal alone is intuitive, and mirrors the belief that humans lean deeply on non-linguistic knowledge (Chomsky, 1965, 1980).
Something being intuitive isn't a strong argument for it being true.

from their use by people to communicate
Let's gather massive datasets on that through VR ^^

Natural language processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks.
and figuring out how to scale up https://arxiv.org/abs/2001.08361

successful linguistic communication relies on a shared experience of the world. It is this shared experience that makes utterances meaningful
I think this is true, except for the language which communicates about language. I think there is meaning purely within the world of language too.
Though certainly a lot of meaning lies in the grounding of language too


www.ece.uvic.ca

share attention
common context

Any smaller subset of these competencies is not sufficient to develop proper language/communication skills, and further, the development of language clearly bootstraps better motor and affordance learning and/or social learning.
This paper seems to be full of statements like this, where they claim something is "obviously true" but really more justification is needed for these claims.


arxiv.org

Intuition
The way I think about their framework is as follows:
They shift perspective from bounding the error to "bounding" the learning curves.
Learning curves are functions (of n), so there is no clear ordering between them, as there is for the error at a particular n, which is just a number.
So instead of learning curves we look at learning curves up to the equivalence relation of having the same asymptotic behaviour (up to a constant), which we call "rates".
For these there is a natural ordering, and one can provide a rate upper bound, that is uniform over P, for a particular hypothesis class, assuming realizability. This is what they do here, so it is basically uniform convergence, but of a different quantity, one that is more representative of how ML works in practice, so this framework is probably more useful.
However, their description of "PAC learning" is too restrictive, I think; they don't seem to consider data-dependent generalization bounds, which exist, and some of which are based on extensions of the uniform PAC bounds. For example, how does their framework compare to the PAC-Bayes framework?

H is not learnable at rate faster than R
So the concept of universal learnability is characterizing the worst-case learning-curve rate. The constant is allowed to depend on P but not the function R, so it is non-uniform in that way. But really that's not the best way to think of it, I think. The way I think of it is written in my page note titled "Intuition".

For simplicity of exposition, we have stated a definition corresponding to deterministic algorithms, to avoid the notational inconvenience required to formally define randomized algorithms in this context
IKR

erP
nice

That is, every nontrivial class H is either universally learnable at an exponential rate (but not faster), or is universally learnable at a linear rate (but not faster), or is universally learnable but necessarily with arbitrarily slow rates
what do they mean by "nontrivial" here?

for any learning algorithm, there is a realizable distribution P whose learning curve decays no faster than a linear rate (Schuurmans, 1997)
aren't we interested in the statement that for any realizable distribution P there is a learning algorithm whose learning curve decays no faster than a linear rate?


arxiv.org

\(S(\{O_\mu(x)\})\)
what do they mean by this quantity?
The number of states with the same energy as O_\mu(x)?

\(2^{-N} q(h^*)\, e^{N(h^* m - \log\cosh h^*)}\)
Isn't this missing the Hessian factor in Laplace's approximation? where has it gone?

argument [10] converts Eq. (1) with α = 1 into the statement that, for a large system, N → ∞, the energy and entropy are exactly equal (up to a constant) to leading order in N.
I think this is the idea that Zipf's law is related to P(Energy) being constant w.r.t. Energy, hmm.
Though really, if both E and S are extensive in N (meaning linear in N), then they will scale equally with N, obviously? Though is Zipf's law followed for extensive systems? Aren't those ones where the parts are independent, so that we expect to approach a uniform distribution?
Right, I think E and S scaling the same does not imply Zipf, but the other way around it does, apparently. Need to check the argument in [10].
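My guess at the shape of that argument, assuming (as elsewhere in the paper) that the energy of a state is defined as \(E(x) = -\ln p(x)\):

```latex
% Zipf's law: the state of rank r has probability p(r) = 1/(Z r), so
E(r) = -\ln p(r) = \ln r + \ln Z .
% The number of states with energy at most E(r) is just the rank r,
% hence the (microcanonical) entropy is
S(E) = \ln r = E - \ln Z ,
% i.e. entropy equals energy up to a constant, to leading order.
```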


arxiv.org

Because the exponent \(\alpha_N \ll 1\) for language models, we can approximate \(N^{-\alpha_N} \approx 1 - \alpha_N \log(N)\) to obtain equation 4.1.
If \(\alpha_N\log{(N)} \ll 1\), I don't see how Eq. 4 will scale as equation 4.1?
Wouldn't the constant \(L_U\) dominate?
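Setting that worry aside, a quick numerical check of the expansion itself (the values here are illustrative, not from the paper):

```python
import math

# N^(-alpha) = exp(-alpha*ln N) ≈ 1 - alpha*ln(N) when alpha*ln(N) << 1
N = 1e9          # hypothetical model size
alpha = 0.005    # small exponent, so alpha*ln(N) ≈ 0.10
exact = N ** (-alpha)
approx = 1 - alpha * math.log(N)
print(exact, approx)  # both are close to 0.90
```

With \(\alpha \ln N \approx 0.1\) the two agree to about half a percent; the approximation degrades as \(\alpha \ln N\) grows toward 1.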

could be misleading if the models have not all been trained fully to convergence
You mean because perhaps the assumption that, in the limit of large N, they will perfectly model the data may not hold if we don't train until convergence, and so the power law + constant assumption may not be justified? Yeah, that makes sense.

which makes the interpretation ofL(N)difficult.
why?

mattn
what is \(m_{attn}\)?

There we also show trends for the training loss, which do not adhere as well to a power-law form, perhaps because of the implicit curriculum in the frequency distribution of easy and hard problems
why would that affect the training loss scaling??

the poor loss on these modules would dominate the trends
could they show accuracy trends?..

easier problems will naturally appear more often than more difficult problems
interesting. I have some ideas on how this could be related to learning curve exponents

We sample the default mixture of easy, medium, and hard problems, withouta progressive curriculum.
Did they look if curriculum learning had any effect on the learning curves?

context length of 3200 tokens per image/caption pair
isn't that the total length of an example? I thought the context was the part given before the token to be predicted?

We revisit the question "Is a picture worth a thousand words?" by comparing the information content of textual captions to the image/text mutual information
I think an issue with their analysis is that a picture's caption in a standard dataset does not capture all the info derivable from the picture


arxiv.org

but we will only apply it along the time dimension t.
what do you mean? I thought you were applying the normalizing flow at each time step individually, not convolving over time


arxiv.org

The key point of this work is that, based on observing a single sample from a subpopulation, it is impossible to distinguish samples from "borderline" populations from those in the "outlier" ones. Therefore an algorithm can only avoid the risk of missing "borderline" subpopulations by also memorizing examples from the "outlier" subpopulations.
I just find it weird that we have to offer so much justification for fitting to 0 error, when I don't see much reason to believe it isn't a good idea?


arxiv.org

e over parameters and the function-space posterior covariance. Red indicates the underparameterized setting, yellow the critical regime with p ≈ n, and green the overparameterized regime.
isn't it the other way? Red is overparametrized and green is underparametrized?

We see wide but shallow models overfit, providing low train loss, but high test loss and high effective dimensionality.
it seems like it's mostly the number of parameters, not the aspect ratio, which determines the generalization performance? So depth is not intrinsically helping generalization?

subspace and ensembling methods could be improved through the avoidance of expensive computations within degenerate parameter regimes
but how do you make sure you are sampling with the right probabilities?


arxiv.org

w
this should be transposed

Our theory again perfectly fits the experiments.
well you can see some deviations in this NN, probably because of the smaller width

K
i think here it should be \(\kappa_{\text{NTK}}\)

marginal training data point causes greater reduction in relative error for low-frequency modes than for high-frequency modes.
isn't this the opposite of what you said earlier??
"the marginal training data point causes agreater percent reduction in generalization error for modeswith larger RKHS eigenvalues."

 Oct 2020

arxiv.org

Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches.
so the activations for the set of samples which use expert k should be sent to the right device which has expert k, right?
how much communication overhead is this?
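Single-device toy sketch of the dispatch/combine logic (names, shapes, and the trivial "experts" are mine, not the paper's):

```python
import numpy as np

# Each token is routed to one expert; each expert processes one
# combined batch; outputs are scattered back to original positions.
rng = np.random.default_rng(0)
n_tokens, d, n_experts = 8, 4, 2
x = rng.normal(size=(n_tokens, d))
assign = rng.integers(0, n_experts, size=n_tokens)  # gating decisions

# stand-in "experts": just a fixed per-expert scaling, for illustration
experts = [lambda h, k=k: h * (k + 1) for k in range(n_experts)]

y = np.empty_like(x)
for k in range(n_experts):
    idx = np.where(assign == k)[0]   # tokens routed to expert k
    y[idx] = experts[k](x[idx])      # combined batch for expert k

# every token was processed by exactly its assigned expert
assert np.allclose(y, x * (assign + 1)[:, None])
```

In the distributed setting the gather/scatter presumably becomes an all-to-all exchange of activations, once each way per MoE layer, which is where the communication overhead the note asks about would come from.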



A prior over parameters p(w) combines with the functional form of a model f(x;w) to induce a distribution over functions p(f(x;w)). It is this distribution over functions that controls the generalization properties of the model; the prior over parameters, in isolation, has no meaning.
Yep this is what we say in our paper too^^ https://arxiv.org/abs/1805.08522
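A minimal sketch of the point: the same Gaussian prior over parameters induces a different prior over functions depending on the architecture (the architecture, widths, and scalings below are all my choices, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-1, 1, 50)

def sample_functions(width, n_samples=200):
    """Draw functions induced by a Gaussian prior on a 1-hidden-layer net."""
    outs = []
    for _ in range(n_samples):
        W1 = rng.normal(0, 1.0, size=(width, 1))
        b1 = rng.normal(0, 1.0, size=(width, 1))
        W2 = rng.normal(0, 1 / np.sqrt(width), size=(1, width))
        outs.append((W2 @ np.tanh(W1 * xs + b1)).ravel())
    return np.array(outs)

narrow = sample_functions(10)
wide = sample_functions(1000)
# the parameter prior is "the same" in both cases, but the induced
# function priors differ (e.g. in smoothness / non-Gaussianity)
print(narrow.std(), wide.std())
```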

Distance between the true predictive distribution and the approximation
you mean something like minus the distance? because you want this distance to be smaller for better approximations?


academic.oup.com

coherent
"coherent" here just means that it will approach the true distribution eventually?


arxiv.org

As the effective dimensionality increases, so does the dimensionality of parameter space in which the posterior variance has contracted.
can you not have very confident models which are making wrong predictions?


arxiv.org

In the notation of Section 3, points ω ∈ Ω represent possible samples. In our setting, each sample represents a complete record of a machine learning experiment. An environment e specifies a distribution P_e on the space Ω of complete records. In the setting of supervised deep learning, a complete record of an experiment would specify hyperparameters, random seeds, optimizers, training (and held out) data, etc.
so each e represents an "experiment", which is a range/distribution of hyperparameters (or, over what they call complete records of a machine learning experiment)


arxiv.org

We measure a simple empirical statistic, the gradient noise scale (essentially a measure of the signal-to-noise ratio of gradient across training examples), and show that it can approximately predict the largest efficient batch size for a wide range of tasks
how is this related to the difficulty of the task?
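A sketch of the "simple" noise scale statistic as I understand it, \(B_{\text{simple}} = \operatorname{tr}(\Sigma)/|G|^2\), estimated on synthetic per-example gradients (the data and dimensions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, dim = 10_000, 32
true_grad = rng.normal(size=dim)
# per-example gradients = true gradient + per-example noise
grads = true_grad + rng.normal(scale=2.0, size=(n_examples, dim))

G = grads.mean(axis=0)               # estimate of the true gradient
Sigma = np.cov(grads, rowvar=False)  # per-example gradient covariance
B_simple = np.trace(Sigma) / (G @ G)
print(B_simple)  # roughly tr(Sigma) / |true_grad|^2
```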


arxiv.org

nonzero entropy
what about entropy rate?

overfitting
OK, I think they are defining overfitting in the agnostic-learning sense of \(L(f) - \min_{f'\in F} L(f')\). How badly am I doing relative to the best in the class!

we stop training early when the test loss ceases to improve and optimize all models in the same way
didn't they say earlier that they train for a fixed number of steps?

N increases and the model begins to overfit
well, the increased overfitting is only visible at the smallest data size

S
should be N?

We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence
no overfitting^^ even for transfer learning

Although these models have been trained on the WebText2 dataset, their test loss on a variety of other datasets is also a power-law in N with nearly identical power, as shown in Figure 8.
probably significantly different datasets will show different power laws. The different datasets looked at here seem quite similar

(approximately twice the compute as the forwards pass)
why?

To utilize both training time and compute as effectively as possible, it is best to train with a batch size B ≈ B_crit
because above B_crit you can reduce time, but with increasing compute cost (diminishing returns)
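For reference, the tradeoff as I remember it from the critical-batch-size analysis (so treat the exact form as an assumption):

```latex
\frac{S}{S_{\min}} = 1 + \frac{B_{\mathrm{crit}}}{B},
\qquad
\frac{E}{E_{\min}} = 1 + \frac{B}{B_{\mathrm{crit}}} ,
% S = optimization steps, E = examples processed (proportional to compute).
% At B = B_crit, both are only 2x their minima, balancing time and compute.
```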

 Jul 2020

arxiv.org

standard deviations
but are these standard deviations of the means?

 Jun 2020

arxiv.org

\((x; W^1, \ldots, W^l, b^1, \ldots, b^l)\)
it should depend on \(W^{l+1}\) and \(b^{l+1}\) too


arxiv.org

Naturally, such an increase in the learning rate also increases the mean steps E[∆w]. However, we found that this effect is negligible since E[∆w] is typically orders of magnitude lower than the standard deviation.
Interesting. This is why the intuition that increasing the learning rate would decrease the number of updates is probably not true: what seems to determine the number of steps is the amount of noise!
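A toy numerical illustration of this point (all numbers are mine, for illustration):

```python
import numpy as np

# Per-minibatch SGD updates: a small systematic gradient plus noise.
rng = np.random.default_rng(0)
lr = 0.1
true_grad = 0.01                           # small mean component
noise = rng.normal(scale=1.0, size=10_000)
steps = -lr * (true_grad + noise)          # one scalar update per batch

mean_step = abs(steps.mean())
std_step = steps.std()
print(mean_step, std_step)  # |E[dw]| is far smaller than std(dw)
```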

 May 2020

arxiv.org

−
+


openreview.net

\(\langle O(\bar{\theta})\rangle = \langle\, [[\, O[\bar{\theta} - \eta \bar{\nabla} L_B(\theta)]\, ]]_{\mathrm{m.b.}} \,\rangle\).
this is missing some time indices?


arxiv.org

We omit the \(\frac{d\beta \exp(-c\gamma) + b\beta \log(1/\delta)}{n}\) term since it does not change with change in random labels.
how can we be sure it is nonvacuous then? hmm

while \(\tilde{H}_{\theta^{\dagger}_{l},\varphi}[j,j]\) can change based on \(\alpha\)-scaling (Dinh et al. [2017]), the effective curvature is scale invariant
do you mean because you change \(\sigma\) too? Was that what Dinh et al. were talking about? Or just the fact that there are other \(\theta\) (not reparametrizing, just finding new \(\theta\)) which have high curvature but produce the same function?

(f) stays valid for the test error rate in (a)
if you take into account the spread in (f) and (a), it would seem that for some runs the upper bound isn't valid?

Then, based on the 'fast rate' PAC-Bayes bound as before, we have the following result
the posterior Q is a strange posterior over hypotheses. How do they take the KL divergence with the prior? Because the posterior is defined by two parameters (\(\theta_\rho\) and \(\theta\)).

Further, all the diagonal elements decrease as more samples are used for training.
Really? That sounds surprising!
I would have expected that as more training samples are added the parameters get more constrained (if the number of parameters is kept fixed).

Theorem 1
Derandomization of the margin loss

The bound provides a concrete realization of the notion of 'flatness' in deep nets [Smith and Le, 2018, Hochreiter and Schmidhuber, 1997, Keskar et al., 2017] and illustrates a tradeoff between curvature and distance from initialization.
is there evidence that distance from initialization anti-correlates with generalization? Even the evidence for sharpness ↔ generalization isn't very strong.

In spite of the dependency on the Hessian diagonal elements, which can be changed based on reparameterization without changing the function [Smith and Le, 2018, Dinh et al., 2017], the bound itself is scale invariant since KL divergence is invariant to such reparameterizations [Kleeman, 2011, Li et al., 2019].
I thought Dinh's criticism wasn't so much about reparametrization, but about the fact that there are other minima which are sharper but give the same function. KL wouldn't be invariant to that, as you aren't changing the prior in that case?

 Apr 2020

arxiv.org

∈Ck
this sum was over all points in the training set in the previous step, and now it's over all points?
Just think of the case where the partition C_i is made up of singletons, one for each possible point. Then the robustness would be zero, but the generalization error bound doesn't seem right then.
This made me suspect there may be something wrong, and I think it could be at this step. If we kept the sum over training points, we can't upper bound the result by the max in the next two lines, I think!

 Mar 2020

www.nature.com

because of the softmax operation.
more like because of the Heaviside operation

the signs of \(f\) and \(\tilde{f}\) are the same.
and therefore the classification functions are the same

\(\tilde{f}\) as \(f_V = \rho \tilde{f}\),
this is confusing, is f_V or \tilde{f} the normalized network?

Our main results should also hold for SGD.
Will this be commented on in more detail?

normalized weights Vk as the variables of interest
Can we even reparametrize to the normalized weights? For homogeneous networks, it's obvious that we can. But for ReLU networks with biases it's less obvious. If one rescales the biases by constants that grow exponentially with depth, the function is left invariant (up to overall scale). We can always do this until the parameter vector is normalized. Therefore we can reparametrize to the normalized vectors even with biases, but I don't know if they consider this case here.
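For the homogeneous (bias-free) case the reparametrization is easy to check numerically; a sketch (the architecture and sizes are arbitrary choices of mine):

```python
import numpy as np

# Positive homogeneity of a bias-free ReLU net: scaling each layer's
# weights by its norm factors out, so f = rho * f_tilde with
# V_l = W_l / ||W_l|| and rho = prod_l ||W_l||.
rng = np.random.default_rng(0)

def relu_net(x, weights):
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)   # ReLU hidden layers
    return weights[-1] @ h           # linear output layer

x = rng.normal(size=3)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
norms = [np.linalg.norm(W) for W in Ws]
Vs = [W / n for W, n in zip(Ws, norms)]  # normalized weights
rho = np.prod(norms)

f = relu_net(x, Ws)
f_norm = relu_net(x, Vs)
assert np.allclose(f, rho * f_norm)
```

With biases this scaling no longer commutes with the ReLU unless the biases are rescaled layer-by-layer as well, which is exactly the subtlety the note above is about.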

This mechanism underlies regularization in deep networks for exponential losses
we cannot say this until we know more. Is this the reason why they generalize? Is this even sufficient to explain their generalization?



Bahdanau et al. (2019) learn a reward function jointly with the action policy but do so using an external expert dataset, whereas our agent uses trajectories collected through its own exploration
Yeah, what they do here is similar to IRL, in that we are trying to learn a human NL-conditioned reward function, but we do it via supervision rather than demonstration. More similar to the work on "learning from human preferences".


arxiv.org

other agents
which share the same policy, right? Otherwise it would be off-policy experience?

Zero Sum
don't understand this one

specific choice of λ
here, a specific choice of \(\lambda\) can determine which solution, among the many which satisfy the constraint, we choose. Similar to the choice of convex regularizer in the GAIL paper


Local file

z(xi)z(xj)h
RHS depends on h, but LHS doesn't?


www.aaai.org

The problem with the max-entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!
The question is: what would be the maximum-entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3 in Ziebart 2012

Z(θ)
Remember the partition function sums only over trajectories which are compatible with the MDP dynamics.
Trajectories incompatible with the dynamics have probability 0, of course


www.cs.cmu.edu

Ziebart et al. (2008)
The problem with the max-entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!
The question is: what would be the maximum-entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3

\(e^{\theta^\top F(X,Y)}\)
this is \(P(Y\mid X)\), right? But it should be \(P(Y\mid X, Y_{1:t-1})\)?


arxiv.org

without interaction with the expert
how do things change when you can interact with the expert?

 Feb 2020

arxiv.org

Attention: Mask
by this, do they mean the attention weighted aggregation step?

\(n_{\text{layer}}\, d_{\text{model}}\, 3 d_{\text{attn}}\)
are they ignoring the \(W^O\) matrix? from the original Transformer paper?

Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4)
hmm, interesting. Why are larger models more sample-efficient?

The performance penalty depends predictably on the ratio \(N^{0.74}/D\)
That is weird, what's the origin of this?

hmm, do they look at the generalization gap?
Is the trend in test loss due to parameter count mostly down to its effect on expressivity / training loss (similarly with compute)?


Local file

Some preliminary numerical simulations show that this approach does predict high robustness and log scaling. However, it only makes any sense if transitions from one phenotype to another phenotype are memoryless.
I thought the whole transition matrix approach itself assumed memorylessness

Let P be a row vector specifying the probability distribution over phenotypes. We want to find a stochastic transition matrix M (rows sum to one) such that
why do we want P to be stationary?

M has 1s on the diagonals, and 0s elsewhere, for example
that is high robustness right?
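A minimal sketch of the two extreme cases (the distribution P is made up):

```python
import numpy as np

# Target phenotype distribution P; we want stochastic M with P M = P.
P = np.array([0.5, 0.3, 0.2])

# Extreme 1: every row equals P (rank-one, "no robustness":
# the next phenotype is independent of the current one).
M = np.tile(P, (3, 1))
assert np.allclose(M.sum(axis=1), 1.0)   # rows sum to one
assert np.allclose(P @ M, P)             # P is stationary

# Extreme 2: the identity ("maximal robustness": every phenotype
# maps to itself) also leaves any P stationary.
I = np.eye(3)
assert np.allclose(P @ I, P)
```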

Fano’s inequality)
doesn't Fano's inequality give \(H(X\mid Y)\) in the numerator, which is a lower bound on \(H(X)\), and so doesn't imply this?


arxiv.org

Intrinsic motivations f
Basically the idea is that the RL/HER part is intrinsically motivated, with LP, to solve more and more tasks, while the goal-sampling part is intrinsically motivated to get trajectories that give new information to learn the reward function. I suppose they could add a bit of LP to the goal sampling as well, to have some tendency to sample trajectories that may help to solve new tasks.

High-quality trajectories are trajectories where the agent collects descriptions from the social partner for goals that are rarely reached.
why do you want more than one description for a goal? A: Ah, because the goal will be the same but the final state may not be, for each of these trajectories, thus giving more data to train the reward function.

 Jan 2020

arxiv.org

f memory-based sample-efficient methods
bandit methods, which are suitable for sequences of independent experiments


arxiv.org

We find that the object geometry makes a significant difference in how hard the problem is
apply some goal exploration process like POET?



When it comes to NNs, the regularization mechanism is also well appreciated in the literature, since they traditionally suffer from overparameterization, resulting in overfitting.
No. Overparametrized networks have been shown to generalize even without explicit regularization (Zhang et al. 2017)
