681 Matching Annotations
  1. Feb 2021
    1. The mouselook action distribution is in turn also defined autoregressively: the first sampled action splits the window bounded by (−1,1)×(−1,1) in width and height into 9 squares. The second action splits the selected square into 9 further squares, and so on. Repeating this process several times allows the agent to express any continuous mouse movement up to a threshold resolution.

      Interesting representation of a continuous action space!
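
      A minimal sketch of how I read this scheme, to make the "threshold resolution" concrete. The cell-indexing convention (column = a % 3, row = a // 3, counted from the lower-left corner) and the function name are my own assumptions, not taken from the paper.

```python
def decode_mouselook(actions, lo=-1.0, hi=1.0):
    """Map a sequence of 3x3 cell choices (ints 0..8) to a point in the window.

    Assumed convention (not from the paper): index a selects column a % 3
    and row a // 3, counted from the lower-left corner of the current cell.
    """
    x_lo, x_hi = lo, hi
    y_lo, y_hi = lo, hi
    for a in actions:
        if not 0 <= a <= 8:
            raise ValueError("each action must be in 0..8")
        col, row = a % 3, a // 3
        w = (x_hi - x_lo) / 3.0
        h = (y_hi - y_lo) / 3.0
        # Shrink the current cell to the chosen sub-cell.
        x_lo, x_hi = x_lo + col * w, x_lo + (col + 1) * w
        y_lo, y_hi = y_lo + row * h, y_lo + (row + 1) * h
    # Return the centre of the final cell and its side length (the resolution).
    return (x_lo + x_hi) / 2.0, (y_lo + y_hi) / 2.0, x_hi - x_lo
```

      After k sampled actions the side length of the selected cell is 2 / 3^k, so each extra autoregressive step buys a factor of 3 in resolution per axis.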

    1. effective dimensionality of a Bayesian neural network is inversely proportional to the variance of the posterior distribution.

      I think you are talking about posterior contraction in parameter space, no?

    1. Yet this also implies non i.i.d. samples! Indeed, even if one could directly sample from the state-action distribution (like having its analytical form or an infinite experience replay buffer) and thus draw i.i.d. samples, the dependency will occur across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample and then my next sample depends on my previous sample (since it conditioned my policy update).

      But this isn't a problem if the examples come from a fixed expert no?

  2. Jan 2021
    1. Prefix-tuning prepends a sequence of continuous task-specific vectors to the input, which we call a prefix, depicted by red blocks in Figure 1 (bottom). For subsequent tokens, the Transformer can attend to the prefix as if it were a sequence of “virtual tokens”, but unlike prompting, the prefix consists entirely of free parameters which do not correspond to real tokens.

      and are thus differentiable! yay

    1. I guess a stepping-stone towards this would be to optimize morphological growth processes to generate a body with a particular form in 3D (that would be quite similar to the differentiable CA, except that here the “cells” move in 3D space and have physical interaction that depend on their internal parameters and states)
    2. (and that would be also novel to use a population-based IMGEPs using gradient descent for local optimization towards self-generated goals)

      similar to SIREN+CLIP (Deep Sleep)

    1. For this reason, we were unable to collect baselines such as an equivalent amount of high-quality human demonstrations for supervised baselines. See D for more discussion. We leave this ablation to future work.

      so one possibility is that the feedback you got was of better quality than the data used for SL. Perhaps if you did SL on higher quality data you would match the performance of the human feedback model?

    2. it’s unclear how much one can optimize against the reward model until it starts giving useless evaluations.

      adversarial examples

    3. Previous work on fine-tuning language models from human feedback [73] reported “a mismatch between the notion of quality we wanted our model to learn, and what the humans labelers actually evaluated”, leading to model-generated summaries that were high-quality according to the labelers, but fairly low-quality according to the researchers.

      That is quite interesting

    4. We rely on detailed procedures to ensure high agreement between labelers and us on the task, which we describe in the next section

      is this necessarily a good thing? Could you not miss other notions of "quality" this way? I guess you want to ensure a consistent notion of quality, rather than asking the question of "what about other notions of quality?"

    1. We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams.

      poor generalization

    1. In fact, without visiting any states at all, since the queries are synthetic.

      grr, what about during the phase of training the generative model?

    2. The x-axis represents the number of queries to the user, where each query elicits a label for a single state transition \((s, a, s')\).

      but isn't sampling from the model less expensive than sampling by optimizing AFs? shouldn't that be taken into account?

    3. having to visit unsafe states during the training process

      it may have visited some during the training of the generative model no?

      But I guess not that many, if the generative model has been pretrained, and it can generalize well

    4. As discussed in Section 4.3 and illustrated in the right-most plot of Figure 5, the baselines learn a reward model that incorrectly extrapolates that continuing up and to the right past the goal region is good behavior.

      but if the baselines aren't visiting those high reward states, then they haven't actually fallen into reward hacking? I guess the idea is that they could in a new environment.

      Takeaway is to do more exploration if you expect to be tested in new environments

    5. \(\tau_{\text{query}} = \max_{z_0, a_0, z_1, \ldots, z_T} J(\tau) + \log p(\tau)\)

      it's like a model-based version of DDPG + curiosity/exploration rewards?

    6. Here, the state \(s \in \mathbb{R}^{64 \times 64 \times 3}\) is an RGB image with a top-down view of the car (Figure 3), and the action \(a \in \mathbb{R}^3\) controls steering, gas, and brake

      In my experience, high dimensional action spaces are even harder, especially when combined with high dim state spaces

    7. The idea is to elicit labels for examples that the model is least certain how to label, and thus reduce model uncertainty.

      what if the user(s) the model is querying are also uncertain? Then the model shouldn't spend too much time on these. This is one thing that learning progress aims to avoid!

    8. To simplify our experiments, we sample trajectories \(\tau\) by following random policies that explore a wide variety of states. We use the observed trajectories to train a likelihood model

      Seems like this may be an issue in more complex environments, as the random policies may not explore enough!

      We probably want either human demonstrations and/or iterate/refine the generative model with the later policies

    9. (4) maximize novelty of trajectories regardless of predicted rewards, to improve the diversity of the training data.

      could also do something based on learning progress

    10. In complex domains, the user may not be able to anticipate all possible agent behaviors and specify a reward function that accurately describes user preferences over those behaviors

      so is the assumption that the automated way of exploring agent behaviours is better than what a human would consider?

  3. Dec 2020
    1. it is far easier to obtain reliability beyond a certain margin by mechanisms in the end hosts of a network rather than in the intermediary nodes,[nb 4] especially when the latter are beyond the control of, and not accountable to, the former

      this seems to me to be mostly saying that: it's hard to change the standards at the low level, so it's easier to program at the higher level.

      This is true of not just networks, but of computers, etc too. But it may not always be the best approach!

      Should have called it "rule of thumb" more than principle I think

  4. Nov 2020
    1. all causal explanations are necessarily robust in this extreme case

      are they? Can you not have a thing that has a conditional causal effect?

      Seems to me that causality should be a more quantitative thing (how robust is this predictor), rather than an either-or thing

    1. Goldblum et al. [119] which empirically observes that the large width behavior of Residual Networks does not conform to the infinite-width limit.

      Oh interesting!

    2. While CNN-VEC possess translation equivariance but not invariance (§3.11), we believe it can effectively leverage equivariance to learn invariance from data

      How? if it doesn't imply anything about the output?

    3. This is caused by poor conditioning of pooling networks. Xiao et al. [33] (Table 1) show that the conditioning at initialization of a CNN-GAP network is worse than that of FCN or CNN-VEC networks by a factor of the number of pixels (1024 for CIFAR-10). This poor conditioning of the kernel eigenspectrum can be seen in Figure 8. For linearized networks, in addition to slowing training by a factor of 1024, this leads to numerical instability when using float32

      Interesting. Do models with a stronger bias, which may be associated with better generalization (see https://arxiv.org/abs/2002.02561 / https://arxiv.org/abs/1905.10843), also lead to poorer conditioning?

      Hmm, but this did not affect the non-linearized model. Interesting. How does non-linear GD avoid the issue?

    4. regularization parameter

      what regularization parameter?

    1. We add the superscript “all" to emphasize that gradient-based training of the networks is always performed on the entire dataset, while NNGP inference is performed on sub-sampled datasets.

      ah hm, so the gradient method is given an advantage by being able to "look" at more data than the NNGP method?

    1. With few exceptions (Carlson et al., 2010), machine learning models have been confined to IID datasets that lack the structure in time from which humans draw correlations about long-range causal dependencies

      All of RL studies non-IID data

    2. how pretraining obfuscates our ability to measure generalization (Linzen, 2020)


    3. but even complex simulation action spaces can be discretized and enumerated.

      What's the problem of enumerating and discretizing action spaces?

      what about agents that can act via free text? like those in AI dungeon? those are in principle not enumerable

    4. models the listener’s desires and experiences explicitly

      what does it mean to model them explicitly versus implicitly?

    5. Collecting data about rich natural situations is often impossible.

      NOPE. VR.

    6. Meanwhile, it is precisely human’s ability to draw on past experience and make zero-shot decisions that AI aims to emulate

      which is what GPT3 is doing

    7. Second, current cross entropy training losses actively discourage learning the tail of the distribution properly, as statistically infrequent events are drowned out (Pennington et al., 2014; Holtzman et al., 2020).

      That's what scaling is doing, shaving off those tails (as the scaling papers discuss)

    8. it is unlikely that universal function approximators such as neural networks would ever reliably posit that people, events, and causality exist without being biased towards such solutions (Mitchell, 1980)


    9. (which are usually thrown out before the dataset is released)

      They shouldn't be! We should learn to probabilistically model the data

    10. persistent enough to learn the effects of actions.

      so we should aim for longer contexts? Yeah memory is important. There is research in extending transformers to have longer contexts

    11. and active experimentation is key to learning that effect


    12. participate in linguistic activity, such as negotiation (Yang et al., 2019a; He et al., 2018; Lewis et al., 2017), collaboration (Chai et al., 2017), visual disambiguation (Anderson et al., 2018; Lazaridou et al., 2017; Liu and Chai, 2015), or providing emotional support (Rashkin et al., 2019).

      do we need the agent itself to participate, or is it not sufficient to feed it data from such types of interactions?

    13. Framing, such as suggesting that a chatbot speaks English as a second language

      Tbh I think that framing can be both misleading and illuminating (about the degree or lack thereof of capability of the agent)

    14. Robotics and embodiment are not available in the same off-the-shelf manner as computer vision models.

      I think VR can solve that

    15. (Li et al., 2019b; Krishna et al., 2017; Yatskar et al., 2016; Perlis, 2016)

      why don't you explain how these papers support the statement at least?

    16. Models must be able to watch and recognize objects, people, and activities to understand the language describing them


    17. Learned, physical heuristics, such as the fact that a falling cat will land quietly, are generalized and abstracted into language metaphors like as nimble as a cat (Lakoff, 1980).

      So you just conceded that a prime example of things that need physical interaction to be learnt, can be expressed in words?

      You should make your points clearer. The point I think is that there are a lot of subconscious knowledge like the example you give, but which we can't quite put into words!

    18. Language learning needs perception, because perception forms the basis for many of our semantic axioms

      could we not argue that language is all that we are conscious of? Even though it may be formed by external sensations, what we currently (consciously) know may be almost fully expressible by language, and therefore WS2 may be enough to learn all of conscious knowledge

    19. As text pretraining schemes seem to be reaching the point of diminishing returns,

      Not yet, in long scale IIRC

    20. parked my car in the compact parking space because it looked (big/small) enough

      Hmm, I think the answer is "big"? This seems learnable from text statistics?

    21. Continuing to expand hardware, data sizes, and financial compute cost by orders of magnitude will yield further gains, but the slope of the increase is quickly decreasing.

      Right, but it's nice that we have a reliable way to improve performance.

    22. Scale in data and modeling has demonstrated that a single representation can discover both rich syntax and semantics without our help (Tenney et al., 2019).

      It's not without our help. The data is our help?^^

    23. You can’t learn language from the radio.

      I think the question shouldn't be phrased as a dichotomy, but quantitatively: How much language (semantics) can you and can you not learn from the radio?

    24. The futility of learning language from linguistic signal alone is intuitive, and mirrors the belief that humans lean deeply on non-linguistic knowledge (Chomsky, 1965, 1980).

      Something being intuitive isn't a strong argument for it being true.

    25. from their use by people to communicate

      Let's gather massive datasets on that through VR ^^

    26. Natural language processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks.

      and figuring out how to scale up https://arxiv.org/abs/2001.08361

    27. successful linguistic communication relies on a shared experience of the world. It is this shared experience that makes utterances meaningful

      I think this is true, except for the language which communicates about language. I think there is meaning purely within the world of language too.

      Though certainly a lot of meaning lies in the grounding of language too

    1. share attention

      common context

    2. Any smaller subset of these competencies is not sufficient to develop proper language/communication skills, and further, the development of language clearly bootstraps better motor and affordance learning and/or social learning.

      This seems to be full of statements like this where they claim something is "obviously true" but really more justification is needed for these claims.

    1. Intuition

      The way I think about their framework is as follows:

      They shift perspective from bounding the error to "bounding" the learning curves

      Learning curves are functions (of n), so there is no clear ordering between them as there is for the error at a particular n, which is just a number.

      So instead of learning curves we look at {learning curves up to the equivalence relation of having the same asymptotic behaviour (up to a constant)}, which we call "rates".

      For these there is a natural ordering, and one can provide a rate upper bound, that is uniform over P, for a particular hypothesis class, assuming realizability. This is what they do here, so it is basically uniform convergence, but of a different quantity, which is more representative of how ML works in practice, so that this framework is probably more useful.

      However, their description of "PAC learning" is too restrictive I think; they don't seem to consider data-dependent generalization bounds which exist, and some of them are based on extensions to the uniform PAC bounds. For example how does their framework compare to the PAC-Bayes framework?

    2. \(H\) is not learnable at rate faster than \(R\)

      So that the concept of universal learnability is characterizing the worst case learning curve rate. The constant is allowed to depend on P but not the function R. So it is non-uniform in that way. But really that's not the best way to think of it I think. The way I think of it is written in my page note titled "Intuition"

    3. For simplicity of exposition, we have stated a definition corresponding to deterministic algorithms, to avoid the notational inconvenience required to formally define randomized algorithms in this context


    4. erP


    5. That is, every nontrivial class \(H\) is either universally learnable at an exponential rate (but not faster), or is universally learnable at a linear rate (but not faster), or is universally learnable but necessarily with arbitrarily slow rates

      what do they mean by "nontrivial" here?

    6. for any learning algorithm, there is a realizable distribution \(P\) whose learning curve decays no faster than a linear rate (Schuurmans, 1997)

      aren't we interested in the statement that for any realizable distribution P there is a learning algorithm whose learning curve decays no faster than a linear rate?

    1. S({Oμ(x)})

      what do they mean by this quantity?

      The number of states with the same energy as O_\mu(x)?

    2. \(2^{-N} q(h^*) e^{N(h^* m - \log\cosh h^*)}\)

      Isn't this missing the Hessian factor in Laplace's approximation? where has it gone?
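
      For reference, the standard Laplace approximation for an integral of this shape, assuming a single interior maximum \(h^*\) of \(f(h) = h m - \log\cosh h\) with \(f''(h^*) < 0\) (my assumption about the setting, not something stated in the highlight):

```latex
\[
\int q(h)\, e^{N f(h)}\, dh \;\approx\; q(h^*)\, e^{N f(h^*)} \sqrt{\frac{2\pi}{N\,|f''(h^*)|}},
\qquad f(h) = h m - \log\cosh h .
\]
```

      The square-root prefactor only contributes \(O(\log N)\) to the log-probability, so one guess is that it was dropped as subleading in \(N\). That is my reading, not something the passage says.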

    3. argument [10] converts Eq. (1) with α = 1 into the statement that, for a large system, N → ∞, the energy and entropy are exactly equal (up to a constant) to leading order in N.

      I think this is the idea that Zipf law is related to P(Energy) being a constant w.r.t. Energy hmm

      tho really if both E and S are extensive in N (meaning linear in N), then they will scale equally with N, obviously? Tho is Zipf's law followed for extensive systems? aren't those where parts are independent, and we expect to approach a uniform distribution?

      Right I think E and S scaling the same does not imply Zipf, but the other way, it does, apparently. Need to check argument in [10]

    1. Because the exponent \(\alpha_N \ll 1\) for language models, we can approximate \(N^{-\alpha_N} \approx 1 - \alpha_N \log(N)\) to obtain equation 4.1.

      If \(\alpha_N\log{(N)} \ll 1\) I don't see how E.4 will scale as equation 4.1?

      wouldn't the constant \(L_U -1\) dominate?
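
      A quick numerical check of when the first-order expansion in the highlight is reasonable. The values of N and α below are hypothetical picks of mine for illustration, not numbers from the paper.

```python
import math

def power_law_term(n, alpha):
    """Exact value of N^(-alpha)."""
    return n ** (-alpha)

def first_order_term(n, alpha):
    """First-order expansion 1 - alpha * log(N); only sensible if alpha * log(N) << 1."""
    return 1.0 - alpha * math.log(n)

# Hypothetical (N, alpha) pairs: two with small alpha*log(N), one where it is about 1.
for n, alpha in [(1e2, 0.01), (1e6, 0.01), (1e6, 0.076)]:
    x = alpha * math.log(n)
    print(f"alpha*log(N)={x:.3f}  exact={power_law_term(n, alpha):.4f}  "
          f"approx={first_order_term(n, alpha):.4f}")
```

      The expansion tracks the exact power law only while α·log(N) stays well below 1, so smallness of α alone is not enough once N gets large.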

    2. could be misleading if the models have not all been trained fully to convergence

      you mean because perhaps the assumption that {in the limit of large N, they will perfectly model the data} may not hold if we don't train until convergence, and so the power law + constant assumption may not be justified. Yeah that makes sense

    3. which makes the interpretation of \(L(N)\) difficult.


    4. \(m_{\text{attn}}\)

      what is \(m_{attn}\)?

    5. There we also show trends for the training loss, which do not adhere as well to a power-law form, perhaps because of the implicit curriculum in the frequency distribution of easy and hard problems

      why would that affect the training loss scaling??

    6. the poor loss on these modules would dominate the trends

      could they show accuracy trends?..

    7. easier problems will naturally appear more often than more difficult problems

      interesting. I have some ideas on how this could be related to learning curve exponents

    8. We sample the default mixture of easy, medium, and hard problems, without a progressive curriculum.

      Did they look if curriculum learning had any effect on the learning curves?

    9. context length of 3200 tokens per image/caption pair

      isn't that the total length of an example? I thought the context was the part given before the token to be predicted?

    10. We revisit the question “Is a picture worth a thousand words?” by comparing the information-content of textual captions to the image/text mutual information

      I think an issue with their analysis is that a picture's caption in a standard dataset does not capture all the info derivable from a picture

    1. but we will only apply it along the time dimension \(t\).

      what do you mean? I thought you were applying the normalizing flow at each time step individually, not convolving over time

    1. The key point of this work is that based on observing a single sample from a subpopulation, it is impossible to distinguish samples from “borderline” populations from those in the “outlier” ones. Therefore an algorithm can only avoid the risk of missing “borderline” subpopulations by also memorizing examples from the “outlier” subpopulations.

      I just find it weird that we have to offer so much justification for fitting to 0 error, when I don't see much reason to believe it isn't a good idea?

    1. e over parameters and the function-space posterior covariance. Red indicates the under-parameterized setting, yellow the critical regime with p ≈ n, and green the over-parameterized regime.

      isn't it the other way? Red is over-parametrized and green is under-parametrized?

    2. We see wide but shallow models overfit, providing low train loss, but high test loss and high effective dimensionality.

      it seems like it's mostly the number of parameters not the aspect ratio which determines the generalization performance? So that depth is not intrinsically helping generalization?

    3. subspace and ensembling methods could be improved through the avoidance of expensive computations within degenerate parameter regimes

      but how do you make sure you are sampling with the right probabilities?

    1. w

      this should be transposed

    2. Our theory again perfectly fits the experiments.

      well you can see some deviations in this NN, probably because of the smaller width

    3. K

      i think here it should be \(\kappa_{\text{NTK}}\)

    4. marginal training data point causes greater reduction in relative error for low frequency modes than for high frequency modes.

      isn't this the opposite of what you said earlier??

      "the marginal training data point causes a greater percent reduction in generalization error for modes with larger RKHS eigenvalues."

  5. Oct 2020
    1. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches.

      so the activations for the set of samples which use expert k should be sent to the right device which has expert k, right?

      how much communication overhead is this?
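
      A toy sketch of the dispatch step as I understand it from the highlight, plus a crude count of how many examples would have to cross devices. The gating decisions, the device maps and both function names are hypothetical. A real implementation would move activation tensors with collective ops rather than Python lists.

```python
from collections import defaultdict

def dispatch_to_experts(batches, assignments):
    """Group examples from several data-parallel batches by their assigned expert.

    batches:     list of lists of examples, one inner list per data-parallel replica
    assignments: same shape, expert index chosen (by a hypothetical gate) per example
    Returns {expert_id: [(replica_id, position, example), ...]} so outputs can be
    routed back to where they came from.
    """
    per_expert = defaultdict(list)
    for replica_id, (batch, experts) in enumerate(zip(batches, assignments)):
        for position, (example, expert_id) in enumerate(zip(batch, experts)):
            per_expert[expert_id].append((replica_id, position, example))
    return dict(per_expert)

def count_cross_device(per_expert, expert_device, replica_device):
    """Count examples whose expert lives on a different device than their replica."""
    return sum(
        1
        for expert_id, items in per_expert.items()
        for replica_id, _, _ in items
        if expert_device[expert_id] != replica_device[replica_id]
    )
```

      Under this picture the communication volume scales with the number of cross-device examples times the activation size, once on the way to the expert and once on the way back.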

    1. A prior over parameters \(p(w)\) combines with the functional form of a model \(f(x;w)\) to induce a distribution over functions \(p(f(x;w))\). It is this distribution over functions that controls the generalization properties of the model; the prior over parameters, in isolation, has no meaning.

      Yep this is what we say in our paper too^^ https://arxiv.org/abs/1805.08522

    2. Distance between the true predictive distribution and the approximation

      you mean something like minus the distance? because you want this distance to be smaller for better approximations?

    1. coherent

      coherent here just means that it will approach the true distribution eventually?

    1. As the effective dimensionality increases, so does the dimensionality of parameter space in which the posterior variance has contracted.

      can you not have very confident models which are making wrong predictions?

    1. In the notation of Section 3, points \(\omega \in \Omega\) represent possible samples. In our setting, each sample represents a complete record of a machine learning experiment. An environment \(e\) specifies a distribution \(P_e\) on the space \(\Omega\) of complete records. In the setting of supervised deep learning, a complete record of an experiment would specify hyperparameters, random seeds, optimizers, training (and held out) data, etc.

      so each e represents an "experiment" which is a range/distribution of hyperparameters (or what they call a complete record of a machine learning experiment)

    1. We measure a simple empirical statistic, the gradient noise scale (essentially a measure of the signal-to-noise ratio of gradient across training examples), and show that it can approximately predict the largest efficient batch size for a wide range of tasks

      how is this related to the difficulty of the task?
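
      To make the statistic concrete, here is a naive estimator of the "simple" noise scale tr(Σ) / |G|², where G is the mean gradient and Σ the per-example gradient covariance. This is only a sketch from per-example gradients held in plain lists. The function name is mine, and the plug-in estimate of |G|² is biased for small samples.

```python
def simple_noise_scale(per_example_grads):
    """Naive estimate of tr(Sigma) / |G|^2 from a list of per-example gradient vectors."""
    n = len(per_example_grads)
    if n < 2:
        raise ValueError("need at least two examples to estimate a variance")
    dim = len(per_example_grads[0])
    mean = [sum(g[j] for g in per_example_grads) / n for j in range(dim)]
    # Trace of the per-example gradient covariance (unbiased, divides by n - 1).
    trace_sigma = sum(
        sum((g[j] - mean[j]) ** 2 for g in per_example_grads) / (n - 1)
        for j in range(dim)
    )
    grad_sq_norm = sum(m * m for m in mean)
    return trace_sigma / grad_sq_norm
```

      If all per-example gradients agree the estimate is zero, and it grows as the examples disagree more relative to the size of the mean gradient.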

    1. non-zero entropy

      what about entropy rate?

    2. overfitting

      OK, I THINK THEY ARE DEFINING OVERFITTING in the agnostic learning sense of L(f)-min_{f'\in F}L(f'). How badly am I doing relative to the best in the class!

    3. we stop training early when the test loss ceases to improve and optimize all models in the same way

      didn't they say earlier that they train for a fixed number of steps?

    4. \(N\) increases and the model begins to overfit

      well the increased overfitting is only visible in the smallest data size

    5. S

      should be N?

    6. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence

      no overfitting^^ even for transfer learning

    7. Although these models have been trained on the WebText2 dataset, their test loss on a variety of other datasets is also a power-law in \(N\) with nearly identical power, as shown in Figure 8.

      probably significantly different datasets will show different power laws. The different datasets looked at here seem quite similar

    8. (approximately twice the compute as the forwards pass)


    9. To utilize both training time and compute as effectively as possible, it is best to train with a batch size \(B \approx B_{\text{crit}}\)

      because above B_crit you can reduce time, but with increasing compute cost (diminishing returns)

  6. Jul 2020
  7. Jun 2020
    1. \((x; W^1, \ldots, W^l, b^1, \ldots, b^l)\)

      it should depend on \(W^{l+1}\) and \(b^{l+1}\) too

    1. Naturally, such an increase in the learning rate also increases the mean steps \(E[\Delta w]\). However, we found that this effect is negligible since \(E[\Delta w]\) is typically orders of magnitude lower than the standard deviation.

      Interesting. This is why the intuition that increasing the learning rate would decrease the number of updates is probably not true, because what seems to determine the number of steps is the amount of noise!

  8. May 2020
    1. \(\langle O(\bar{\theta}) \rangle = \langle [[ O[\bar{\theta} - \eta \bar{\nabla} L^{B}(\theta)] ]]_{\text{m.b.}} \rangle\).

      this is missing some time indices?

    1. We omit the dβexp (−cγ) +bβlog(1δ)n term since it does not change with change in random labels.

      how can we be sure it is non-vacuous then? hmm

    2. while ̃Hθ†l,φ[j,j] can change based on α-scaling Dinh et al. [2017], the effective curvature is scale invariant

      do you mean because you change \(\sigma\) too? Was that what Dinh et al. were talking about? Or just the fact that there are other \theta (not reparametrizing, just finding new \theta) which have high curvature, but produce same function?

    3. (f) stays valid for the test error rate in (a)

      if you take into account the spread in (f) and (a) it would seem that for some runs the upper bound isn't valid?

    4. Then, based on the ‘fast rate’ PAC-Bayes bound as before, we have the following result

      the posterior Q is a strange posterior over hypotheses. How do they take the KL divergence with the prior, given that the posterior is defined by two parameters (\(\theta_\rho\) and \(\theta\))?

    5. Further, all the diagonal elements decrease as more samples are used for training.

      Really? That sounds surprising!

      I would have expected that as more training samples are added the parameters get more constrained (if the number of parameters is kept fixed).

    6. Theorem 1

      Derandomization of the margin loss

    7. The bound provides a concrete realization of the notion of ‘flatness’ in deep nets [Smith and Le, 2018, Hochreiter and Schmidhuber, 1997, Keskar et al., 2017] and illustrates a trade-off between curvature and distance from initialization.

      is there evidence that distance from initialization anti-correlates with generalization? Even evidence for sharpness <> generalization isn't very strong.

    8. In spite of the dependency on the Hessian diagonal elements, which can be changed based on re-parameterization without changing the function [Smith and Le, 2018, Dinh et al., 2017], the bound itself is scale invariant since KL-divergence is invariant to such re-parameterizations Kleeman [2011], Li et al. [2019].

      i thought Dinh's criticism wasn't so much about reparametrization, but about the fact that there are other minima which are sharper but give the same function. KL wouldn't be invariant to that, as you aren't changing the prior in that case?

  9. Apr 2020
    1. \(\in C_k\)

      this sum was over all points in the training set in the previous step, and now it's over all points?

      Just think of the case where the partition C_i is made up of singletons, one for each possible point. Then, the robustness would be zero, but the generalization error bound doesn't seem right then.

      This made me suspect there may be something wrong, and I think it could be at this step. If we kept the sum to be over training sets, now we can't upper bound the result by the max in the next two lines, I think!

  10. Mar 2020
    1. because of the softmax operation.

      more like because of the Heaviside operation

    2. the signs of \(f\) and \(\tilde{f}\) are the same.

      and therefore the classification functions are the same

    3. \(\tilde{f}\) as \(f_V = \rho \tilde{f}\),

      this is confusing, is f_V or \tilde{f} the normalized network?

    4. Our main results should also hold for SGD.

      Will this be commented on in more detail?

    5. normalized weights \(V_k\) as the variables of interest

      Can we even reparametrize to the normalized weights? For homogeneous networks, it's obvious that we can. But for ReLU networks with biases it's less obvious. If one multiplies the biases via constants that grow exponentially with weight, the function is left invariant. We can always do this until the parameter vector is left normalized. Therefore we can reparametrize to the normalized vectors even with biases, but dunno if they consider this case here.
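
      One concrete version of this rescaling, as a sanity check rather than anything from the paper: scale every weight matrix by c > 0 and the biases of layer l by c^l. The output is then multiplied by c^L (L = number of layers), so its sign, and hence the classification, is unchanged. The layer widths and the random initialisation below are hypothetical.

```python
import random

def relu_net(x, weights, biases):
    """Forward pass of a fully connected ReLU network (no ReLU on the output layer)."""
    h = list(x)
    last = len(weights) - 1
    for layer, (W, b) in enumerate(zip(weights, biases)):
        h = [sum(w * v for w, v in zip(row, h)) + b_i for row, b_i in zip(W, b)]
        if layer < last:
            h = [max(v, 0.0) for v in h]
    return h

def rescale(weights, biases, c):
    """Scale every weight matrix by c > 0 and the biases of layer l (1-indexed) by c**l."""
    new_weights = [[[c * w for w in row] for row in W] for W in weights]
    new_biases = [[c ** (l + 1) * b_i for b_i in b] for l, b in enumerate(biases)]
    return new_weights, new_biases

random.seed(0)
sizes = [3, 4, 4, 1]  # hypothetical layer widths
weights = [[[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [[random.gauss(0, 1) for _ in range(n)] for n in sizes[1:]]
x = [0.5, -1.0, 2.0]
```

      Calling `rescale(weights, biases, c)` and re-running `relu_net` on the same `x` should give c**3 times the original output for this three-layer example.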

    6. This mechanism underlies regularization in deep networks for exponential losses

      we cannot say this, until we know more. Is this the reason why the generalize? Is this even sufficient to explain their generalization?

    1. Bahdanau et al. (2019) learn a reward function jointly with the action policy but does so using an external expert dataset whereas our agent uses trajectories collected through its own exploration

      Yeah what they do here is similar to IRL, in that we are trying to learn a human NL-conditioned reward function, but we do it via supervision, rather than demonstration. More similar to the work on "learning from human preferences"

    1. other agents

      which share the same policy right? otherwise it would be off-policy experience?

    2. Zero Sum

      don't understand this one

    3. specific choice of \(\lambda\)

      here, a specific choice of \(\lambda\) can determine which solutions among the many which satisfy the constraint we choose. Similarly to the choice of convex regularizer in the GAIL paper

    1. The problem with the max entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions, without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!

      The question is what would be the maximum entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3 in Ziebart 2012

    2. Z(θ)

      Remember the partition function sums over trajectories which are compatible with the MDP dynamics only.

      Trajectories incompatible with the dynamics have probability 0 of course

    1. Ziebart et al. (2008)

      The problem with the max entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions, without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!

      The question is what would be the maximum entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3

    2. \(e^{\theta^\top F(X,Y)}\)

      this is P(Y|X), right? but it should be P(Y|X,Y_{1:t-1})?

    1. without interaction with the expert

      how do things change when you can interact with the expert?

  11. Feb 2020
    1. Attention: Mask

      by this, do they mean the attention weighted aggregation step?

    2. \(n_{\text{layer}} d_{\text{model}} 3 d_{\text{attn}}\)

      are they ignoring the \(W^O\) matrix? from the original Transformer paper?

    3. Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4)

      hmm, interesting. why are larger models more sample efficient?

    4. The performance penalty depends predictably on the ratio \(N^{0.74}/D\)

      That is weird, what's the origin of this?

    5. hmm do they look at generalization gap?

      is the trend in test loss with parameter count mostly due to the effect on expressivity / training loss (and similarly with compute)?

    1. Some preliminary numerical simulations show that this approach does predict high robustness and log scaling. However, it only makes any sense if transitions from one phenotype to another phenotype are memoryless.

      I thought the whole transition matrix approach itself assumed memorylessness

    2. Let \(P\) be a row vector specifying the probability distribution over phenotypes. We want to find a stochastic transition matrix \(M\) (rows sum to one) such that

      why do we want P to be stationary?

    3. \(M\) has 1s on the diagonals, and 0s elsewhere, for example

      that is high robustness right?

    4. Fano’s inequality)

      doesn't Fano's inequality give H(X|Y) in the numerator, which is a lower bound on H(X), and so doesn't imply this?


    1. Intrinsic motivations f

      Basically the idea is that the RL/HER part is intrinsically motivated with LP, to solve more and more tasks, while the goal sampling part is intrinsically motivated to get trajectories that give new information to learn the reward function. I suppose they could add a bit of LP to the goal sampling as well to have some tendency to sample trajectories that may help to solve new tasks.

    2. High-quality trajectories are trajectories where the agent collects descriptions from the social partner for goals that are rarely reached.

      why do you want more than one description for a goal? A: Ah, because the goal will be the same but the final state may not be for each of these trajectories, thus giving more data to train the reward function.

  12. Jan 2020
    1. f memory-based sample efficient methods

      bandit methods, which are suitable for sequences of independent experiments

    1. We find that the object geometry makes a significant differences in how hard the problem is

      apply some goal exploration process like POET?

    1. When it comes to NNs, the regularization mechanism is also well appreciated in the literature, since they traditionally suffer from overparameterization, resulting in overfitting.

      No. Overparametrized networks have been shown to generalize even without explicit regularization (Zhang et al. 2017)

    1. Therefore, we can get the following generalization bound:

      as long as the value of L is bounded by at most 1/delta or something right?

    1. They use on-average stability that does not imply generalization bounds with high probability

      Their bounds on expectations can be converted to bounds with high probability, as they claim in page 3, citing "Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010."

    1. for \(T \le m\) step

      one pass SGD

    2. validation error which is used as an empirical estimate for \(R(w_1)\)

      so their bound has the disadvantage that it needs an estimate given by the validation error to compute the bound! So it can't be computed from the training data alone!!

    3. our bound corroborates the intuition that whenever we start at a good location of the objective function, the algorithm is more stable and thus generalizes better.

      This is a nice intuition for why good initializations can lead to good generalization

    4. \(R(w_1) - R^\star\)

      remember that \(R\) is the population risk, so this isn't a priori something that we can know?

  13. Dec 2019
  14. arxiv.org
    1. While it is known having a finite VC-dimension (Vapnik and Chervonenkis, 1991) or equivalently being \(CVEEE_{loo}\) stable (Mukherjee et al., 2006) is necessary and sufficient for the Empirical Risk Minimization (ERM) to generalize,

      it is only necessary to generalize in the worst case over data distributions right?

    1. The bounds based on \(\ell_2\)-path norm and spectral norm can be derived directly from the those based on \(\ell_1\)-path norm and \(\ell_2\) norm respectively

      Hmm. how?

      This implies that even though the l2 path norms seem non-vacuous on Figure 1, they aren't. They appear so, because we have dropped the "terms that only depend on depth or number of hidden units", which are large for l2-path norm

    1. ExperimentsIn

      experiments only in 2 dimensional input space. Could results depend on the input dimensionality?

  15. Nov 2019
    1. \(\min(\alpha_T - d, 2\alpha_S)\)

      the min arises because, depending on which is larger, one or the other of the two limits of the integral dominates

    2. 29

      Compare this to the analysis of Sollich ( https://pdfs.semanticscholar.org/7294/862e59c8c3a65167260c0156427f4757c67e.pdf ) which is in the well-specified setting. There there's no dependence on the labels of the training data. Here neither, but at least there's dependence on the distribution of the target labels, so that it allows for more general types of assumptions.

    3. \(K(x)\) is an even

      which can be seen from its definition as a covariance.

    4. of a Teacher Gaussian process with covariance \(K_T\) and assume that they lie in the RKHS of the Student kernel \(K_S\), namely

      ah yes, being in RKHS means having a finite norm in the RKHS, which makes sense. But not sure how restrictive this is, just like I'm not sure if simply being n-times differentiable is a good measure of complexity of the function. Are there n-times differentiable functions that approximate any less smooth function? Maybe Lipschitz constant of derivatives (smoothness constants) could be more quantitatively useful?

    5. If both kernels are Laplace kernels then \(\alpha_T=\alpha_S=d+1\) and \(E_{MSE}\sim n^{-1/d}\), which scales very slowly with the dataset size in large dimensions. If the Teacher is a Gaussian kernel (\(\alpha_T=\infty\)) and the Student is a Laplace kernel then \(\beta=2(1+1/d)\), leading to \(\beta\to 2\) as \(d\to\infty\)

      hm, wait what? But wouldn't the Bayes optimal answer be obtained if the student has the same kernel as the teacher?


      as \(n\to\infty\)

    7. We perform kernel classification via the algorithm soft-margin SVM.

      which approximates a point estimator of the Gaussian process classifier, but I don't know the exact relation.

    8. man


    9. Importantly (i) Eq. (1) leads to a prediction for \(\beta(d)\) that accurately matches our numerical study for random training data points, leading to the conjecture that Eq. (1) holds in that case as well.

      Compare with: https://arxiv.org/pdf/1909.11500.pdf where they find that random inputs give rise to plateaus, hmm at least with epochs, but they cite papers where these are apparently found for training set size (perhaps only for thin networks?)

    10. As a result, various works on kernel regression make the much stronger assumption that the training points are sampled from a target function that belongs to the reproducing kernel Hilbert space (RKHS) of the kernel (see for example [Smola et al., 1998]). With this assumption \(\beta\) does not depend on \(d\) (for instance in [Rudi and Rosasco, 2017] \(\beta = 1/2\) is guaranteed). Yet, RKHS is a very strong assumption which requires the smoothness of the target function to increase with \(d\) [Bach, 2017] (see more on this point below), which may not be realistic in large dimensions.

      I think when they say "it belongs to an RKHS", they mean that it does so with a fixed/bounded norm (otherwise almost any function would satisfy this, for universal RKHSs). This is consistent with the next comment saying, that this assumption implies smoothness (smoothness<>small RKHS norm generally)

  16. openreview.net
    1. Seems like PPO works better than their approach in several of the experiments. Hmm

    1. irreducible error (e.g., Bayes error)

      more commonly model capacity limitations I guess?

    1. GMM on a dataset of previously sampled parameters concatenated to their respective ALP measure.

      the GMM is only fitted to the parameter part or the (parameter, ALP) vector?

    1. nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher so that perfect generalization is still impossible. For \(\alpha\) slightly above \(\alpha_c\) only the couplings of the teacher survive.

      Lenka Zdeborová, Florent Krzakala have found that at the capacity threshold, algorithms tend to have the longest running times, i.e. the worst-case examples seem to live at that transition

    2. For a committee of two students it can be shown that when the number of examples is large, the information gain does not decrease but reaches a positive constant. This results in a much faster decrease of the generalization error. Instead of being inversely proportional to the number of examples, the decrease is now exponentially fast

      For the case of the perceptron you can see how the uncertainty region (whose volume approximates the generalization error) approximately halves (or is reduced by about a constant) after every optimal query.
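      A toy analogue of the halving argument above, not the perceptron itself: learning a 1D threshold on \([0,1]\), where the uncertainty region is the interval that still contains the threshold. The function names are made up for this sketch. With optimal (bisection) queries the interval halves each time; with random examples it shrinks only roughly like \(1/n\).

```python
import numpy as np

def passive_interval(threshold, n, rng):
    """Width of the uncertainty interval after n uniformly random labelled examples."""
    xs = rng.uniform(0.0, 1.0, size=n)
    below = xs[xs < threshold]
    above = xs[xs >= threshold]
    lo = below.max() if below.size else 0.0
    hi = above.min() if above.size else 1.0
    return hi - lo

def active_interval(threshold, n):
    """Width of the uncertainty interval after n bisection queries."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        mid = 0.5 * (lo + hi)
        if mid < threshold:
            lo = mid   # label says the threshold is to the right of mid
        else:
            hi = mid   # label says the threshold is at or to the left of mid
    return hi - lo

rng = np.random.default_rng(0)
threshold = 0.3
n_queries = 20
passive_width = passive_interval(threshold, n_queries, rng)
active_width = active_interval(threshold, n_queries)
print(passive_width, active_width)
```

      Here the generalization error under a uniform input distribution is bounded by the interval width, which is the sense in which the volume of the uncertainty region tracks the error.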

    1. In general, the baseline leaves the expected value of the update unchanged, but it can have a large

      because baseline depends on S, it can reduce the variance from state to state (not the one from action to action).

      WRONG: it can reduce the action-to-action variance of the gradient (not the variance of the advantage!)
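      A small exact check of the corrected statement, assuming a single state, a softmax policy over three actions and made-up rewards (none of this setup is from the book). The expected REINFORCE gradient is the same with and without a baseline, while the variance across actions of the gradient estimate drops a lot when the baseline is close to the mean reward.

```python
import numpy as np

def grad_stats(logits, rewards, baseline):
    """Exact mean and total variance of the estimator (r_a - b) * grad log pi(a)."""
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    # For a softmax policy, grad_logits log pi(a) = e_a - pi; row a holds that vector.
    scores = np.eye(len(pi)) - pi
    samples = (rewards - baseline)[:, None] * scores
    mean = pi @ samples
    second_moment = pi @ (samples ** 2).sum(axis=1)
    variance = second_moment - mean @ mean  # trace of the covariance matrix
    return mean, variance

logits = np.zeros(3)                     # uniform policy
rewards = np.array([10.0, 11.0, 12.0])   # large common offset in the rewards

mean_no_baseline, var_no_baseline = grad_stats(logits, rewards, baseline=0.0)
mean_baseline, var_baseline = grad_stats(logits, rewards, baseline=rewards.mean())

print(mean_no_baseline, mean_baseline)   # identical expected gradients
print(var_no_baseline, var_baseline)     # variance is much smaller with the baseline
```

      The statistics are computed by enumerating actions rather than sampling, so the comparison has no Monte Carlo noise.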

  17. Oct 2019
    1. computevar1bbÂj

      this is the covariance matrix

    2. This suggests that the effect of \(j(x)\) is to rotate the gradient field and move the critical points, also seen in Fig. 4b.

      how does this equation suggest this?

    3. sampling with replacement has better regularization

      but you are saying that the temperature (\(\beta^{-1}\)) is lower when you sample with replacement, so that the regularization should be less?

    4. conservative

      how does this mean that it is conservative?

    5. This implies that SGD implicitly performs variational inference with a uniform prior, albeit of a different loss than the one used to compute back-propagation gradients

      The interpretation of doing variational inference with a uniform prior is because if we interpret the minimization objective as an ELBO, the second term is like the KL divergence between the approximate posterior and a uniform prior (which just gives the entropy). Nice

      If \(\rho\) doesn't have any constraints then this should give the exact posterior with uniform prior, and likelihood given by \(\Phi(x)\)
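      A short worked version of the two notes above, assuming the objective has the free-energy form \(F(\rho)=\mathbb{E}_{x\sim\rho}[\Phi(x)]-\beta^{-1}H(\rho)\) with \(H(\rho)=-\int\rho\log\rho\), and assuming \(Z=\int e^{-\beta\Phi(x)}\,dx\) is finite (the symbols \(F\) and \(Z\) are my notation here):

      \[
      \beta F(\rho)=\int\rho(x)\,\bigl(\beta\Phi(x)+\log\rho(x)\bigr)\,dx
      =\int\rho(x)\log\frac{\rho(x)}{e^{-\beta\Phi(x)}/Z}\,dx-\log Z
      =\mathrm{KL}\!\left(\rho\,\Big\|\,\tfrac{1}{Z}e^{-\beta\Phi}\right)-\log Z .
      \]

      So with no constraints on \(\rho\) the minimizer is \(\rho^*(x)=e^{-\beta\Phi(x)}/Z\): the exact posterior for a uniform prior, where the likelihood is proportional to \(e^{-\beta\Phi(x)}\) rather than \(\Phi(x)\) itself. On a bounded domain, \(-H(\rho)\) equals the KL divergence to the uniform distribution up to an additive constant, which is the "uniform prior" reading of the entropy term.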

    1. The second particularity is that since the computation of the rewardRpp;c;;oqis internal to themachine, it can be computed any time after the experimentpc;;oqand for any problempPP,not only the particular problem that the agent was trying to solve. Consequently, if the machineexperiments a policyin contextcand observeso(e.g. trying to solve problemp1), and storesthe resultspc;;oqof this experiment in its memory, then when later on it self-generates problemsp2;p3;:::;piit can compute on the fly (and without new actual actions in the environment) theassociated rewardsRp2pc;;oq;Rp3pc;;oq;:::;Rpipc;;oqand use this information to improveover these goalsp2;p3;:::;pi.

      like hindsight experience replay
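      A minimal sketch of the relabelling idea in the quoted passage, assuming outcomes and goals live in the same space and the internal reward is a negative distance (the `Experiment` class, `reward` and `relabel` are hypothetical names, not the paper's code). Stored experiments are re-scored for goals generated later, without acting in the environment again.

```python
from dataclasses import dataclass
import math

@dataclass
class Experiment:
    context: tuple   # context in which the policy was run
    params: tuple    # policy parameters that were tried
    outcome: tuple   # what was observed

def reward(goal, experiment):
    """Hypothetical internal reward: negative distance between outcome and goal."""
    return -math.dist(goal, experiment.outcome)

def relabel(memory, goals):
    """Score every stored experiment against every goal, with no new rollouts."""
    return {goal: [reward(goal, e) for e in memory] for goal in goals}

memory = [
    Experiment(context=(0.0,), params=(0.1, 0.2), outcome=(1.0, 0.0)),
    Experiment(context=(0.0,), params=(0.5, 0.3), outcome=(0.0, 2.0)),
]

# Goals generated after the experiments were run.
new_goals = [(1.0, 0.0), (0.0, 2.0)]
rewards = relabel(memory, new_goals)

# Index of the best stored experiment for each new goal.
best = {g: max(range(len(memory)), key=lambda i: rewards[g][i]) for g in new_goals}
print(best)
```

      The difference from hindsight experience replay is mostly in what gets relabelled: here whole experiments are re-scored against arbitrary goals, whereas HER relabels transitions with goals that were actually achieved.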

    1. Although methods to learn disentangled representation of the world exist [25,26,27], they do not allow to distinguish features that are controllable by the learner from features describing external phenomena that are outside the control of the agent.

      learning controllable features is similar to learning a causal model of the world I think

    1. We find that the full NTK has better approximation properties compared to other function classes typically defined for ReLU activations [5, 13, 15], which arise for instance when only training the weights in the last layer, or when considering Gaussian process limits of ReLU networks (e.g., [20, 24, 32]).

      NTK has "better approximation properties". What do they mean more precisely?

    1. and we have left the activation kernel unchanged,K`=1M`A0`A0T`

      what is the reason to do this?

    2. \((A_\ell \mid J_\ell)\)

      J_l is the covariance for a single column of A_l right?

    3. Second, we modified the inputs by zeroing-out all but the first input unit (Fig. 1 right).

      how does this work more precisely? The targets are generated by feeding the modified inputs to the "teacher network", but the student network gets the unmodified inputs?

    4. for MAP inference, the learned representations transition from the input to the output kernel, irrespective of the network width.

      how is MAP inference implemented?

    5. The representations in learned neural networks slowly transition from being similar to the input kernel (i.e. the inner product of the inputs) to being similar to the output kernel (i.e. the inner product of one-hot vectors representing targets).

      this transition, as what? as the layer width is increased?

    6. the covariance in the top-layer kernel induced by randomness in the lower-layer weights.

      what does he mean by this?

    7. e.g. compare performance in Garriga-Alonso et al. (2019) and Novak et al. (2019) against He et al. (2016) and Chen et al. (2018)).

      but in here the GP networks lack many important features like batch-norm, pooling etc! Not sure if this example is a fair comparison. Also, not clear whether this difference is due to finite width or SGD (a question that Novak also asks)

    8. enabling efficient and exact reasoning about uncertainty

      Only in regression... AAaaAaaAh ÒwÓ

    1. significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019]

      Interesting, so apparently the NTK works better than the NNGP for this architecture at least

    1. Optimally, these parameters are chosen such that the true predictive process \(P(t_*|x_*,S)\) is closest to \(Q(t_*|x_*,S)\) in relative entropy.

      in which sense is this optimal?

    2. Bayes classifier

      I thought the Bayes classifier would predict sign ( E_w [P(t|y)y(x|w)] - 0.5) ?

    3. our task is then to separate the structure from the noise.

      Well, and to find the correct regularity; generalization is not just about separating structure from noise. Unless by "noise" here, you mean also the stochasticity in the training sample (of inputs)..

    4. We know of no interesting real-world learning problem which comes without any sort of prior knowledg

      Yep, no free lunch

    5. (the lucky case)

      again I wouldn't call it "unlucky", because the whole proof is that the generalization is good, because it's very unlikely to have obtained this training set by luck, so that it's most likely that we obtained it by having chosen a good prior. So I would call it "good prior" case.

    1. , such as cross entropy loss, encourage a larger output margin

      The fact that they also encourage a large SVM-margin is not so trivial tho

    2. the gap between predictions on the true label and next most confident label.

      In SVMs, for instance, "margin" refers to the distance between classification boundary and a point. This can be related to the definition of margin here, but they are not the same?

      E.g. if we have a small SVM-margin, but a really large weight norm, then we would still have a small output margin.

      Ah, that's why they normalize by weight norm I suppose yeah.
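      A quick numerical illustration of the point in these notes, assuming a linear multi-class classifier and normalization by the overall parameter norm (both are simplifying assumptions for the sketch; the paper's normalization for deep networks may differ). Scaling the parameters inflates the output margin arbitrarily but leaves the normalized margin unchanged.

```python
import numpy as np

def output_margin(W, b, x, y):
    """Score of the true label minus the highest score among the other labels."""
    scores = W @ x + b
    return scores[y] - np.delete(scores, y).max()

def normalized_margin(W, b, x, y):
    """Output margin divided by the Euclidean norm of all parameters."""
    norm = np.sqrt((W ** 2).sum() + (b ** 2).sum())
    return output_margin(W, b, x, y) / norm

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
b = rng.normal(size=3)
x = rng.normal(size=5)
y = int((W @ x + b).argmax())   # use the predicted class as the label, so the margin is positive

scale = 100.0
m_out = output_margin(W, b, x, y)
m_out_scaled = output_margin(scale * W, scale * b, x, y)
m_norm = normalized_margin(W, b, x, y)
m_norm_scaled = normalized_margin(scale * W, scale * b, x, y)

print(m_out, m_out_scaled)    # output margin grows with the parameter scale
print(m_norm, m_norm_scaled)  # normalized margin is scale invariant
```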

    1. This is further consistent with recent experimental work showing that neural networks are often robust to re-initialization but not re-randomization of layers (Zhang et al. [42]).

      what does this mean?

    2. Kernels from single hidden layer randomly initialized ReLU network convergence to analytic kernel using Monte Carlo sampling (\(M\) samples). See §I for additional discussion

      I think the Monte Carlo estimate of the NTK is a Monte Carlo estimate of the average NTK (as in, averaged over initializations), not of the initialization-dependent NTK which Jacot studied. Jacot showed that in the infinite width limit both are the same.

      But it seems from their results that even for finite width the average NTK is closer to the limit NTK than the single-sample NTK. This makes sense, because the single sample one has extra fluctuation around average.
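      A rough sketch of this comparison, assuming a one-hidden-layer ReLU network in NTK parametrization, \(f(x)=\frac{1}{\sqrt{n}}\sum_i a_i\,\mathrm{relu}(w_i\cdot x)\), with standard normal \(a_i\) and \(w_i\) and no biases. This is my own toy setup, not the paper's code. The analytic kernel uses the standard arc-cosine expressions for Gaussian expectations of ReLU and step functions.

```python
import numpy as np

def analytic_ntk(X):
    """Infinite-width NTK of the one-hidden-layer ReLU network described above."""
    norms = np.linalg.norm(X, axis=1)
    gram = X @ X.T
    cos = np.clip(gram / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(cos)
    # E[relu(w.x) relu(w.x')] and E[1(w.x > 0) 1(w.x' > 0)] for w ~ N(0, I)
    k_relu = np.outer(norms, norms) * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)
    k_step = (np.pi - theta) / (2 * np.pi)
    return k_relu + gram * k_step

def empirical_ntk(X, width, rng):
    """Finite-width NTK at one random initialization (sum of gradient inner products)."""
    W = rng.normal(size=(width, X.shape[1]))
    a = rng.normal(size=width)
    pre = X @ W.T                      # pre-activations, shape (n_points, width)
    act = np.maximum(pre, 0.0)
    gate = (pre > 0).astype(float)
    k_a = act @ act.T / width                           # gradients w.r.t. output weights a
    k_w = ((gate * a ** 2) @ gate.T / width) * (X @ X.T)  # gradients w.r.t. hidden weights W
    return k_a + k_w

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
width, n_samples = 64, 200

exact = analytic_ntk(X)
single = empirical_ntk(X, width, rng)
averaged = np.mean([empirical_ntk(X, width, rng) for _ in range(n_samples)], axis=0)

err_single = np.linalg.norm(single - exact)
err_averaged = np.linalg.norm(averaged - exact)
print(err_single, err_averaged)
```

      For this architecture the finite-width kernel is an unbiased estimate of the limit kernel, so averaging over initializations only removes the fluctuation around the mean; for deeper networks the finite-width average need not equal the limit.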

    3. We observe that the empirical kernel \(\hat\Theta\) gives more accurate dynamics for finite width networks.

      That is a very interesting observation!

    4. =0n

      yeah! so in standard parametrization, the learning rate is indeed O(1/n) !

    1. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

      You didn't expect hypothes.is in here, did you?

      Bamboozled again!