536 Matching Annotations
  1. Last 7 days
    1. uTWv


    2. pθ(z)

      this should read \(-\ln{p_\theta ( \mathbf{z})}\)

    3. while reverse KL-divergence is not a viable objective function,

      I guess that in the above derivation we don't really have access to p_d(x), even if we had the trained discriminator. We only have the softmax outputs, which give us the two ratios in expression (23); these are conditional probabilities (p(g|x) and p(d|x)) rather than p_d and p_g separately.

    4. Additionally, it has been suggested that reverse KL-divergence, \(D_{KL}(p_g \| p_d)\), is a better measure for training generative models than normal KL-divergence, \(D_{KL}(p_d \| p_g)\), since it minimises \(\mathbb{E}_{x \sim p_g}[\ln p_d(x)]\) [92]

      I think the intuition here is that minimizing \(D_{KL}(p_g \| p_d)\) ensures that the support of p_g is contained inside the support of p_d. That way we can be more confident that samples will be realistic, though the model may suffer from mode collapse, like GANs; that may be a less serious problem.
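
      A minimal sketch of this intuition on a toy three-state example. The distributions and the helper `kl` are made up for illustration; they are not from the paper. The reverse KL stays finite for a model that sits on a single data mode and blows up as soon as the model puts mass where the data has none, while the forward KL behaves the other way round.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; infinite if p has mass where q has none."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy data distribution: two modes with an empty region in between (hypothetical numbers).
p_d = np.array([0.5, 0.0, 0.5])
# One model that covers a single mode, and one that spreads mass everywhere.
p_g_one_mode = np.array([1.0, 0.0, 0.0])
p_g_spread = np.array([0.4, 0.2, 0.4])

reverse_one_mode = kl(p_g_one_mode, p_d)  # finite: mode collapse is tolerated
reverse_spread = kl(p_g_spread, p_d)      # infinite: mass outside the support of p_d
forward_one_mode = kl(p_d, p_g_one_mode)  # infinite: a data mode is missed
forward_spread = kl(p_d, p_g_spread)      # finite: covering everything is tolerated
```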

    5. 2-Stage VAEs By interpreting VAEs as regularised autoencoders, it is natural to apply less strict regularisation to the latent space during training then subsequently train a density estimator on this space, thus obtaining a more complex prior [53]. Vector Quantized-Variational Autoencoders (VQ-VAE) [170], [215] achieve this by training an autoencoder with a discrete latent space, then training a powerful autoregressive model (see Section 5) on the latent space.

      So I think what VQ-VAEs are trying to achieve is a VAE with a very flexible learned prior. If we look at the VAE objective (eq 19), we see that if p(z) equals q(z), then the KL divergence (averaged over q(x), which I'm interpreting to be the data distribution) becomes the mutual information I(z;x) between z and x, which the ELBO then penalises as -I(z;x). Interesting!

      However, we can think of VQ-VAE as basically ignoring this explicit regularization term during its first stage (or maybe it's implicitly approximated via its codebook training objectives! hmm). In the second stage, we just fit the prior to approximate q(z), which is what we were assuming to be true in the analysis.

      It seems that some of the other works cited in the previous paragraph try to use the expression in eq 19 (and thus the -I(z;x) regularization) directly. Though because it is intractable to compute, they approximate it in different ways.
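
      A short sketch of why the mutual information appears, writing \(q(z) = \int q(x)\, q(z|x)\, dx\) for the aggregate posterior (my notation, which may differ from eq 19):

      \[
      \mathbb{E}_{q(x)}\big[D_{KL}(q(z|x) \,\|\, p(z))\big] = \underbrace{\mathbb{E}_{q(x)\,q(z|x)}\Big[\ln \frac{q(z|x)}{q(z)}\Big]}_{I(z;x)} + D_{KL}(q(z) \,\|\, p(z))
      \]

      When \(p(z) = q(z)\) the second term vanishes and only \(I(z;x)\) is left. Since the ELBO subtracts the KL term, it shows up as \(-I(z;x)\) in the objective.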

    6. 2 stage VAEs that first map data to latents of dimension \(r\) then use a second VAE to correct the learned density can better capture the data [35]

      because then the second VAE does recover the data distribution on the latent, according to that paper. Interesting!

    7. (14)

      I think here q(x) is supposed to be q(x|z), and the e^E(x) should be e^E(z)? (Well, e^E(z) is equivalent to e^E(x) if x is a function of z.) But yeah, I think the q part should be q(x|z). For example, if using normalizing flows, it would be the Jacobian of the inverse flow.

    8. implicit energy models

      What are implicit energy models? I thought they said previously that EBMs are not implicit generative models?

    9. This is made worse by the finite nature of the sampling process, meaning that samples can be arbitrarily far away from the model’s distribution [57].

      Does he mean because of the finite step size? Is that a big problem? Hmm, not sure I get this sentence.

    10. the gradient of the negative log-likelihood loss \(L(\theta) = \mathbb{E}_{x \sim p_d}[-\ln p_\theta(x)]\) has been shown to approximately demonstrate the following property [21], [193]

      this contrastive divergence result is very cool!
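
      For reference, a sketch of where this comes from, assuming the usual convention \(p_\theta(x) = e^{-E_\theta(x)} / Z(\theta)\) (the sign convention in the paper may differ):

      \[
      \nabla_\theta L(\theta) = \mathbb{E}_{x \sim p_d}\big[\nabla_\theta E_\theta(x)\big] + \nabla_\theta \ln Z(\theta) = \mathbb{E}_{x \sim p_d}\big[\nabla_\theta E_\theta(x)\big] - \mathbb{E}_{x' \sim p_\theta}\big[\nabla_\theta E_\theta(x')\big]
      \]

      using \(\nabla_\theta \ln Z(\theta) = -\mathbb{E}_{x' \sim p_\theta}[\nabla_\theta E_\theta(x')]\). As far as I understand, the "approximately" enters because the second expectation is estimated with samples from a finite MCMC chain rather than exact samples from \(p_\theta\).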

    11. TABLE 1

      What do the stars in training speed, sample speed and param efficiency correspond to, quantitatively?

      Also, it would be nice to know the robustness to hyperparameters, as that is often a big part of "training time".

    12. training speed is assessed based on reported total training times

      hmm ideally we would know if they had trained until convergence, or if they had gone over convergence.

    13. Choosing what to optimise for has implications for sample quality, with direct likelihood optimisation often leading to worse sample quality than alternatives.

      is this in part because of noise in the data, which the likelihood based models also fit?

    1. constant memory

      They say constant memory, but below the memory is said to be \(O(N)\). Which one is true? As far as I can tell, the latter is a typo?

  2. Apr 2021
    1. especially in the most difficult long horizon setting.

      actually the graph shows a larger difference for 1 task than for more tasks?

    2. This process is scalable because pairing happens after-the-fact, making it straightforward to parallelize via crowdsourcing.

      Could you not crowdsource instruction following too?

      Maybe this adds extra diversity though

      Probably combining both would be best

  3. Mar 2021
    1. Training an agent for such social interactions most likely requires drastically different methods – e.g. different architectural biases – than classical object-manipulation training

      Or a lot of pre-training data, which given current empirical findings, tends to work better.

    2. To enable the design and study of complex social scenarios in reasonable computational time

      Alternatively you could consider more complex environments but with more offline algorithms, like bootstrapping from supervised learning

    3. rather than language based social interactions

      An important recent counterexample is IIL.


    1. (8)

      This whole variational calculation is basically like combining Monte Carlo integration (with importance sampling) and Jensen's inequality (to bring the expectation outside the log). The cool thing is that optimizing over q makes the approximation exact if our model for q is sufficiently expressive.
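
      Spelled out for a generic positive integrand \(f(u)\) and a proposal density \(q(u)\) (generic symbols of my own, not necessarily those of Eq. (8)):

      \[
      \ln \int f(u)\, du = \ln \mathbb{E}_{u \sim q}\Big[\frac{f(u)}{q(u)}\Big] \;\ge\; \mathbb{E}_{u \sim q}\Big[\ln \frac{f(u)}{q(u)}\Big]
      \]

      The first step is importance sampling and the second is Jensen's inequality for the concave \(\ln\). Equality holds when \(f(u)/q(u)\) is constant, i.e. when \(q(u) \propto f(u)\), which is why a sufficiently expressive q closes the gap.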

    2. Consequently, maximizing the log-likelihood of the continuous model on uniformly dequantized data cannot lead to the continuous model degenerately collapsing onto the discrete data, because its objective is bounded above by the log-likelihood of a discrete model.

      I don't see how this argument works
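
      My attempt at reconstructing it, as a sketch under the assumption that the discrete model is defined by integrating the continuous density \(p\) over unit bins, \(P(x) := \int_{[0,1)^D} p(x + u)\, du\):

      \[
      \mathbb{E}_{x \sim P_{\text{data}}}\, \mathbb{E}_{u \sim U[0,1)^D}\big[\ln p(x + u)\big] \;\le\; \mathbb{E}_{x \sim P_{\text{data}}}\Big[\ln \int_{[0,1)^D} p(x + u)\, du\Big] = \mathbb{E}_{x \sim P_{\text{data}}}\big[\ln P(x)\big] \;\le\; 0
      \]

      The first inequality is Jensen, and the last one holds because \(P(x)\) is a probability mass, so \(P(x) \le 1\). Fitting a density directly to the discrete points has no such cap: placing ever narrower spikes on the data values sends the log-density to \(+\infty\). With uniform noise added, the objective can never exceed 0, so that degenerate solution is not rewarded.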

    1. In our datasets Fig. 2c, we find empirically that for the same amount of collection time, play indeed covers 4.2 times more regions of the available interaction space than 18 tasks worth of expert demonstration data, and 14.4 times more regions than random exploration.

      That is indeed a cool finding

    2. Play data is cheap: Unlike expert demonstrations (Fig. 5), play requires no task segmenting, labeling, or resetting to an initial state, meaning it can be collected quickly in large quantities.

      Well, you still need to have enough people playing for enough time, which may not be cheap. For example, in Imitating Interactive Intelligence they had to spend about 200K pounds to pay people to play with their environment.

    3. We additionally find that play-supervised models, unlike their expert-trained counterparts, are more robust to perturbations and exhibit retrying-till-success behaviors.

      I guess because in the play data there were examples of reaching the goal even from suboptimal trajectories

    1. Intuitively, this is equivalent to taking the average of demonstrated actions at each specific state.

      Unless you model the distribution, e.g. using normalizing flows, which should be better?
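
      A toy illustration of why averaging can be a problem, assuming the policy is fit with a squared-error loss (my assumption about the setting). The demonstrations below are invented: at one state, half of the experts steer left and half steer right. The best constant prediction under mean squared error is their mean, an action nobody demonstrated.

```python
import numpy as np

# Hypothetical demonstrations at a single state: two experts go left (-1), two go right (+1).
actions = np.array([-1.0, -1.0, 1.0, 1.0])

# Scan candidate constant predictions and pick the one with the lowest mean squared error.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((actions[None, :] - candidates[:, None]) ** 2).mean(axis=1)
best = candidates[np.argmin(mse)]

# The minimiser coincides with the sample mean, which lies between the two modes.
print(best, actions.mean())
```

      A model of the full action distribution (a flow, a mixture, etc.) could instead keep the two modes separate.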

    1. TextWorld addresses this discrepancy by providing programmatic and aligned linguistic signals during agent exploration.

      But isn't this just substituting the human language-instructor with a rule-based one that is bound to be of lower quality?

    1. Informally, all else being equal, discontinuous representations should in many cases be “harder” to approximate by neural networks than continuous ones. Theoretical results suggest that functions that are smoother [34] or have stronger continuity properties such as in the modulus of continuity [33, 10] have lower approximation error for a given number of neurons.

      and they probably generalize better, as there are several works showing that DNNs are implicitly biased towards "smooth" functions

    1. In the skill learning phase, LGB relies on an innate semantic representation that characterizes spatial relations between objects in the scene using predicates known to be used by pre-verbal infants [Mandler, 2012].

      so this is feature-engineered no?

    2. Although the policy over-generalizes, the reward function can still identify whether plants have grown or haven’t.

      how has the reward function learned the association between "feed" and "the object grows"? I guess that was taught from the language descriptions? It should be able to learn the reward function correctly then

    3. more aligned data autonomously.

      I think this is similar to the idea of self-training

    1. This would require a means for representing meaning from experience—a situation model—and a mechanism that allows information to be extracted from sentences and mapped onto the situation model that has been derived from experience, thus enriching that representation

      This basically means adding extra information that is inferred, to what is just directly observed no?

    1. What can and should the user be doing while the AI agent is taking its turn to increase engagement?

      Maybe the agent's actions themselves should be engaging enough? We should aim for that I think

    2. We employ a turn-based framework because it is a common way of organizing co-creative interactions [3,12,13] and because it suits evolutionary and reinforcement-learning approaches that require discrete steps [2, 7, 8, 14].

      I think that's a significant limitation. More fluid interactions can only take place in continuous-time settings

    1. As can be seen in Figure 3 (left), the training performance was sensitive to the weight scale σ, despite the fact that a weight normalisation scheme was being used.

      It would be interesting to explore whether this pitfall can actually have an effect in some scenario where one isn't using an abnormally high initialization

  4. Feb 2021
    1. natural motions

      more natural than the baseline*

    2. For quantitative evaluation, we computed the mean squared error between the generated motion and motion capture on a left-out test set, for fingertip positions and joint angles

      this is problematic because there could be many motions which are good but quite different, and thus having big MSE

    3. In total, we used approximately 120 minutes of data

      What? Why didn't you use more data? ... We need to do scaling experiments with this

    1. 2) autoregression reduces the amount of fast moments, making the velocity histogram more similar to the ground truth

      huh? I see autoregression increasing the amount of fast movements no?

    2. “In which video...”: (Q1) “...are the character’s movements most human-like?” (Q2) “...do the character’s movements most reflect what the character says?” (Q3) “...do the character’s movements most help to understand what the character says?” (Q4) “...are the character’s voice and movement more in sync?”

      It would also be good to do observational studies where users are simply asked to interact with different characters. And we measure how engaged they are.

    3. Hence, after five epochs of training with autoregression, our model has full teacher forcing: it always receives the ground-truth poses for autoregression. This procedure greatly helps with learning a model that properly integrates non-autoregressive input.

      Interesting, I would have guessed that doing it the other way (starting with teacher forcing and decreasing this to fully autoregressive training) would have been the natural curriculum.

      What was the idea for doing this? Is the idea basically to gradually make the information in the autoregressive part of the input more and more predictive, so that the network can anneal from using features in the speech part to using features in both speech and autoregressive motion?

    4. This pretraining helps the network learn to extract useful features from the speech input, an ability which is not lost during further training.

      I wonder if self-attention like in transformers would be better at learning which features to pick up on

    5. we pass a sliding window spanning 0.5 s (10 frames) of past speech and 1 s (20 frames) of future speech features over the encoded feature vectors.

      So we can't generate gestures from audio/text in real time with this
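
      A rough sketch of such a window, to make the lookahead explicit. The function name, the inclusion of the current frame, and the 20 fps rate (inferred from 10 frames = 0.5 s) are my assumptions, not details confirmed by the paper. The output for frame t needs frame t + 20, so the model runs at least 1 s behind the audio.

```python
import numpy as np

def sliding_windows(features, past=10, future=20):
    """Stack past + current + future frames for every time step that has full context."""
    n_frames = len(features)
    return np.stack([features[t - past : t + future + 1]
                     for t in range(past, n_frames - future)])

fps = 20                                              # assumed frame rate
feats = np.arange(40, dtype=float).reshape(40, 1)     # 2 s of dummy 1-D speech features
windows = sliding_windows(feats)                      # shape: (time steps, window length, dims)
lookahead_seconds = 20 / fps                          # minimum latency imposed by future frames
```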

    6. feature vector \(V_s\) was made distinct from all other encodings, by setting all elements equal to −15

      it may be a good idea to learn these embeddings no?

    1. three different domains: U.S. presidents, dog breeds, and U.S. national parks. We use multiple domains to include diversity in our tasks, choosing domains that have a multitude of entities to which a single question could be applied

      three domains wow much diversity

    2. We assume each task has an associated metric \(\mu_j(D_j, f_\theta) \in \mathbb{R}\), which is used to compute the model performance for task \(\tau_j\) on \(D_j\) for the model represented by \(f_\theta\).

      So this assumes that the reward can be defined. In some tasks, it may not be so easy, right? We may need to learn rewards.

    1. ecological pre-training

      What's ecological pre-training?

    1. BART was pre-trained using a denoising objective and a variety of different noising functions. It has obtained state-of-the-art results on a diverse set of generation tasks and outperforms comparably-sized T5 models [32].

      Wait, so it was just trained on reconstruction? Hmm, interesting.

      I guess the fine-tuning then really changes the output in this case, even though it still reuses knowledge in the model?

    1. We believe these properties provide good motivation for continuing to scale larger end-to-end imitation architectures over larger play datasets as a practical strategy for task-agnostic control.


    1. Multiple avenues, including understanding more deeply the mechanisms of creative, knowledge-rich thought, or transferring knowledge from large, real world datasets, may offer a way forward.


    2. To go beyond competence within somewhat stereotyped scenarios toward interactive agents that can actively acquire and creatively recombine knowledge to cope with new challenges may require as yet unknown methods for knowledge representation and credit assignment, or, failing this, larger scales of data.

      Probably most reliable approach: Larger scales of data

    3. To record sufficiently diverse behaviour, we have “gamified” human-human interaction via the instrument of language games.


    4. Winograd envisioned computers that are not “tyrants,” but rather machines that understand and assist us interactively, and it is this view that ultimately led him to advocate convergence between artificial intelligence and human-computer interaction (Winograd, 2006)

      And VR is a big part in the next step in human-computer interaction

    5. Generally, these results give us confidence that we could continue to improve the performance of the agents straightforwardly by increasing the dataset size.

      yeah if you have lots of money to pay people..

      but that is not that scalable

    6. Although the agents do not yet attain human-level performance, we will soon describe scaling experiments which suggest that this gap could be closed substantially simply by collecting more data.

      We need more data

    7. The regularisation schemes presented in the last section can improve the generalisation properties of BC policies to novel inputs, but they cannot train the policy to exert active control in the environment to attain states that are probable in the demonstrator’s distribution.

      Unless that active control can be learned by generalizing from learned actions in the demonstrations?

    8. The mouselook action distribution is in turn also defined autoregressively: the first sampled action splits the window bounded by \((-1,1) \times (-1,1)\) in width and height into 9 squares. The second action splits the selected square into 9 further squares, and so on. Repeating this process several times allows the agent to express any continuous mouse movement up to a threshold resolution.

      Interesting representation of a continuous action space!
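
      A small sketch of how such a sequence of discrete choices could be decoded into a continuous point. The function name and the cell indexing (index = 3 * row + col, with row and col counted upward from -1) are my own assumptions; the paper's convention may differ. After k choices the cell width is 2 / 3^k, which is the "threshold resolution".

```python
def decode_mouselook(choices):
    """Map a sequence of 3x3 cell indices (0..8) to the centre of the final cell.

    Assumed convention (hypothetical): index = 3 * row + col, where row indexes y
    and col indexes x, both counted upward from -1 inside the window (-1, 1) x (-1, 1).
    """
    x_lo, x_hi, y_lo, y_hi = -1.0, 1.0, -1.0, 1.0
    for c in choices:
        if not 0 <= c <= 8:
            raise ValueError("cell index must be in 0..8")
        row, col = divmod(c, 3)
        w = (x_hi - x_lo) / 3.0   # width of each sub-cell at this level
        h = (y_hi - y_lo) / 3.0   # height of each sub-cell at this level
        x_lo, x_hi = x_lo + col * w, x_lo + (col + 1) * w
        y_lo, y_hi = y_lo + row * h, y_lo + (row + 1) * h
    return (x_lo + x_hi) / 2.0, (y_lo + y_hi) / 2.0

# Two refinement steps: top-right cell first, then its bottom-left sub-cell.
point = decode_mouselook([8, 0])
```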

    1. effective dimensionality of a Bayesian neural network is inversely proportional to the variance of the posterior distribution.

      I think you are talking about posterior contraction in parameter space, no?

    1. Yet this also implies non i.i.d. samples! Indeed, even if one could directly sample from the state-action distribution (like having its analytical form or an infinite experience replay buffer) and thus draw i.i.d. samples, the dependency will occur across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample and then my next sample depends on my previous sample (since it conditioned my policy update).

      But this isn't a problem if the examples come from a fixed expert no?

  5. Jan 2021
    1. Prefix-tuning prepends a sequence of continuous task-specific vectors to the input, which we call a prefix, depicted by red blocks in Figure 1 (bottom). For subsequent tokens, the Transformer can attend to the prefix as if it were a sequence of “virtual tokens”, but unlike prompting, the prefix consists entirely of free parameters which do not correspond to real tokens.

      and are thus differentiable! yay
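
      A very rough single-head sketch of the "virtual tokens" idea, in plain NumPy. Everything here is illustrative: the names `attention_with_prefix`, `prefix_k` and `prefix_v` are hypothetical, and there is no causal mask, no multi-head split and no training loop. The point is only that the prefix enters attention as extra key/value rows that are free parameters rather than embeddings of real tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prefix(q, k, v, prefix_k, prefix_v):
    """Scaled dot-product attention where learned prefix rows are prepended to keys and values."""
    k_all = np.concatenate([prefix_k, k], axis=0)
    v_all = np.concatenate([prefix_v, v], axis=0)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_all

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))    # 4 real tokens, dimension 8
prefix_k, prefix_v = rng.normal(size=(2, 8)), rng.normal(size=(2, 8))  # 2 virtual tokens
out = attention_with_prefix(q, k, v, prefix_k, prefix_v)
```

      In an autodiff framework, `prefix_k` and `prefix_v` would be the only trainable arrays while the rest of the model stays frozen.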

    1. I guess a stepping-stone towards this would be to optimize morphological growth processes to generate a body with a particular form in 3D (that would be quite similar to the differentiable CA, except that here the “cells” move in 3D space and have physical interaction that depend on their internal parameters and states)
    2. (and that would be also novel to use a population-based IMGEPs using gradient descent for local optimization towards self-generated goals)

      similar to SIREN+CLIP (Deep Sleep)

    1. For this reason, we were unable to collect baselines such as an equivalent amount of high-quality human demonstrations for supervised baselines. See D for more discussion. We leave this ablation to future work.

      so one possibility is that the feedback you got was of better quality than the data used for SL. Perhaps if you did SL on higher quality data you would match the performance of the human feedback model?

    2. it’s unclear how much one can optimize against the reward model until it starts giving useless evaluations.

      adversarial examples

    3. Previous work on fine-tuning language models from human feedback [73] reported “a mismatch between the notion of quality we wanted our model to learn, and what the humans labelers actually evaluated”, leading to model-generated summaries that were high-quality according to the labelers, but fairly low-quality according to the researchers.

      That is quite interesting

    4. We rely on detailed procedures to ensure high agreement between labelers and us on the task, which we describe in the next section

      Is this necessarily a good thing? Could you not miss other notions of "quality" this way? I guess you want to ensure a consistent notion of quality, rather than asking the question of "what about other notions of quality?"

    1. We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams.

      poor generalization

    1. In fact, without visiting any states at all, since the queries are synthetic.

      grr, what about during the phase of training the generative model?

    2. The x-axis represents the number of queries to the user, where each query elicits a label for a single state transition \((s, a, s')\).

      But isn't sampling from the model less expensive than sampling by optimizing AFs? Shouldn't that be taken into account?

    3. having to visit unsafe states during the training process

      it may have visited some during the training of the generative model no?

      But I guess not that many, if the generative model has been pretrained, and it can generalize well

    4. As discussed in Section 4.3 and illustrated in the right-most plot of Figure 5, the baselines learn a reward model that incorrectly extrapolates that continuing up and to the right past the goal region is good behavior.

      But if the baselines aren't visiting those high reward states, then they haven't actually fallen into reward hacking? I guess the idea is that they could in a new environment.

      Takeaway: do more exploration if you expect to be tested in new environments

    5. \(\tau_{\text{query}} = \max_{z_0, a_0, z_1, \ldots, z_T} J(\tau) + \log p(\tau)\)

      It's like a model-based version of DDPG + curiosity/exploration rewards?

    6. Here, the state \(s \in \mathbb{R}^{64 \times 64 \times 3}\) is an RGB image with a top-down view of the car (Figure 3), and the action \(a \in \mathbb{R}^3\) controls steering, gas, and brake

      In my experience, high dimensional action spaces are even harder, especially when combined with high dimensional state spaces

    7. The idea is to elicit labels for examples that the model is least certain how to label, and thus reduce model uncertainty.

      What if the user(s) the model is querying are also uncertain? Then the model shouldn't spend too much time on these. This is one thing that learning progress aims to avoid!

    8. To simplify our experiments, we sample trajectories \(\tau\) by following random policies that explore a wide variety of states. We use the observed trajectories to train a likelihood model

      Seems like this may be an issue in more complex environments, as the random policies may not explore enough!

      We probably want either human demonstrations and/or to iterate/refine the generative model with the later policies

    9. (4) maximize novelty of trajectories regardless of predicted rewards, to improve the diversity of the training data.

      could also do something based on learning progress

    10. In complex domains, the user may not be able to anticipate all possible agent behaviors and specify a reward function that accurately describes user preferences over those behaviors

      So is the assumption that the automated way of exploring agent behaviours is better than what a human would consider?

  6. Dec 2020
    1. it is far easier to obtain reliability beyond a certain margin by mechanisms in the end hosts of a network rather than in the intermediary nodes,[nb 4] especially when the latter are beyond the control of, and not accountable to, the former

      this seems to me to be mostly saying that: it's hard to change the standards at the low level, so it's easier to program at the higher level.

      This is true of not just networks, but of computers, etc too. But it may not always be the best approach!

      Should have called it "rule of thumb" more than principle I think

  7. Nov 2020
    1. all causal explanations are necessarily robust in this extreme case

      Are they? Can you not have a thing that has a conditional causal effect?

      Seems to me that causality should be a more quantitative thing (how robust is this predictor?), rather than an either-or thing

    1. Goldblum et al. [119] which empirically observes that the large width behavior of Residual Networks does not conform to the infinite-width limit.

      Oh interesting!

    2. While CNN-VEC possess translation equivariance but not invariance (§3.11), we believe it can effectively leverage equivariance to learn invariance from data

      How? if it doesn't imply anything about the output?

    3. This is caused by poor conditioning of pooling networks. Xiao et al. [33] (Table 1) show that the conditioning at initialization of a CNN-GAP network is worse than that of FCN or CNN-VEC networks by a factor of the number of pixels (1024 for CIFAR-10). This poor conditioning of the kernel eigenspectrum can be seen in Figure 8. For linearized networks, in addition to slowing training by a factor of 1024, this leads to numerical instability when using float32

      Interesting. Do models with a stronger bias, which may be associated with better generalization (see https://arxiv.org/abs/2002.02561 / https://arxiv.org/abs/1905.10843), also lead to poorer conditioning?

      Hmm, but this did not affect the non-linearized model. Interesting. How does non-linear GD avoid the issue?

    4. regularization parameter

      what regularization parameter?

    1. We add the superscript “all" to emphasize that gradient-based training of the networks is always performed on the entire dataset, while NNGP inference is performed on sub-sampled datasets.

      ah hm, so the gradient method is given an advantage by being able to "look" at more data than the NNGP method?

    1. With few exceptions (Carlson et al., 2010), machine learning models have been confined to IID datasets that lack the structure in time from which humans draw correlations about long-range causal dependencies

      All of RL studies non-IID data

    2. how pretraining obfuscates our ability to measure generalization (Linzen, 2020)


    3. but even complex simulation action spaces can be discretized and enumerated.

      What's the problem of enumerating and discretizing action spaces?

      what about agents that can act via free text? like those in AI dungeon? those are in principle not enumerable

    4. models the listener’s desires and experiences explicitly

      what does it mean to model them explicitly versus implicitly?

    5. Collecting data about rich natural situations is often impossible.

      NOPE. VR.

    6. Meanwhile, it is precisely human’s ability to draw on past experience and make zero-shot decisions that AI aims to emulate

      which is what GPT3 is doing

    7. Second, current cross entropy training losses actively discourage learning the tail of the distribution properly, as statistically infrequent events are drowned out (Pennington et al., 2014; Holtzman et al., 2020).

      That's what scaling is doing, shaving off those tails (as the scaling papers discuss)

    8. it is unlikely that universal function approximators such as neural networks would ever reliably posit that people, events, and causality exist without being biased towards such solutions (Mitchell, 1980)


    9. (which are usually thrown out before the dataset is released)

      They shouldn't be! We should learn to probabilistically model the data

    10. persistent enough to learn the effects of actions.

      so we should aim for longer contexts? Yeah memory is important. There is research in extending transformers to have longer contexts

    11. and active experimentation is key to learning that effect


    12. participate in linguistic activity, such as negotiation (Yang et al., 2019a; He et al., 2018; Lewis et al., 2017), collaboration (Chai et al., 2017), visual disambiguation (Anderson et al., 2018; Lazaridou et al., 2017; Liu and Chai, 2015), or providing emotional support (Rashkin et al., 2019).

      Do we need the agent itself to participate, or is it not sufficient to feed it data from such types of interactions?

    13. Framing, such as suggesting that a chatbot speaks English as a second language

      Tbh I think that framing can be both misleading and illuminating (about the degree or lack thereof of capability of the agent)

    14. Robotics and embodiment are not available in the same off-the-shelf manner as computer vision models.

      I think VR can solve that

    15. (Li et al., 2019b; Krishna et al., 2017; Yatskar et al., 2016; Perlis, 2016)

      why don't you explain how these papers support the statement at least?

    16. Models must be able to watch and recognize objects, people, and activities to understand the language describing them


    17. Learned, physical heuristics, such as the fact that a falling cat will land quietly, are generalized and abstracted into language metaphors like as nimble as a cat (Lakoff, 1980).

      So you just conceded that a prime example of things that need physical interaction to be learnt can be expressed in words?

      You should make your points clearer. The point, I think, is that there is a lot of subconscious knowledge, like the example you give, which we can't quite put into words!

    18. Language learning needs perception, because perception forms the basis for many of our semantic axioms

      Could we not argue that language is all that we are conscious of? Even though it may be formed by external sensations, what we currently (consciously) know may be almost fully expressible by language, and therefore WS2 may be enough to learn all of conscious knowledge

    19. As text pretraining schemes seem to be reaching the point of diminishing returns,

      Not yet, in long scale IIRC

    20. parked my car in the compact parking space because it looked (big/small) enough

      Hmm, I think the answer is "big"? This seems learnable from text statistics?

    21. Continuing to expand hardware, data sizes, and financial compute cost by orders of magnitude will yield further gains, but the slope of the increase is quickly decreasing.

      Right, but it's nice that we have a reliable way to improve performance.

    22. Scale in data and modeling has demonstrated that a single representation can discover both rich syntax and semantics without our help (Tenney et al., 2019).

      It's not without our help. The data is our help?^^

    23. You can’t learn language from the radio.

      I think the question shouldn't be phrased as a dichotomy, but quantitatively: How much language (semantics) can you and can you not learn from the radio?

    24. The futility of learning language from linguistic signal alone is intuitive, and mirrors the belief that humans lean deeply on non-linguistic knowledge (Chomsky, 1965, 1980).

      Something being intuitive isn't a strong argument for it being true.

    25. from their use by people to communicate

      Let's gather massive datasets on that through VR ^^

    26. Natural language processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks.

      and figuring out how to scale up https://arxiv.org/abs/2001.08361

    27. successful linguistic communication relies on a shared experience of the world. It is this shared experience that makes utterances meaningful

      I think this is true, except for the language which communicates about language. I think there is meaning purely within the world of language too.

      Though certainly a lot of meaning lies in the grounding of language too

    1. share attention

      common context

    2. Any smaller subset of these competencies is not sufficient to develop proper language/communication skills, and further, the development of language clearly bootstraps better motor and affordance learning and/or social learning.

      This seems to be full of statements like this where they claim something is "obviously true" but really more justification is needed for these claims.

    1. Intuition

      The way I think about their framework is as follows:

      They shift perspective from bounding the error to "bounding" the learning curves

      Learning curves are functions (of n), so there is no clear ordering between them as there is for the error at a particular n, which is just a number.

      So instead of learning curves we look at {learning curves up to the equivalence relation of having the same asymptotic behaviour (up to a constant)}, which we call "rates".

      For these there is a natural ordering, and one can provide a rate upper bound, that is uniform over P, for a particular hypothesis class, assuming realizability. This is what they do here, so it is basically uniform convergence, but of a different quantity, which is more representative of how ML works in practice, so that this framework is probably more useful.

      However, their description of "PAC learning" is too restrictive I think; they don't seem to consider data-dependent generalization bounds, which exist, and some of which are based on extensions to the uniform PAC bounds. For example, how does their framework compare to the PAC-Bayes framework?

    2. H is not learnable at rate faster than R

      So that the concept of universal learnability is characterizing the worst case learning curve rate. The constant is allowed to depend on P but not the function R. So it is non-uniform in that way. But really that's not the best way to think of it I think. The way I think of it is written in my page note titled "Intuition"

    3. For simplicity of exposition, we have stated a definition corresponding to deterministic algorithms, to avoid the notational inconvenience required to formally define randomized algorithms in this context


    4. erP


    5. That is, every nontrivial class H is either universally learnable at an exponential rate (but not faster), or is universally learnable at a linear rate (but not faster), or is universally learnable but necessarily with arbitrarily slow rates

      what do they mean by "nontrivial" here?

    6. for any learning algorithm, there is a realizable distribution P whose learning curve decays no faster than a linear rate (Schuurmans, 1997)

      aren't we interested in the statement that for any realizable distribution P there is a learning algorithm whose learning curve decays no faster than a linear rate?

    1. S({Oμ(x)})

      what do they mean by this quantity?

      The number of states with the same energy as O_\mu(x)?

    2. \(2^{-N} q(h^*) e^{N(h^* m - \log\cosh h^*)}\)

      Isn't this missing the Hessian factor in Laplace's approximation? where has it gone?
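
      For reference, a sketch of the one-dimensional Laplace approximation with the curvature (Hessian) factor kept. Here \(\phi(h) = h m - \log\cosh h\) is my own shorthand for the exponent in the highlighted expression, and I am assuming \(q(h)\) varies slowly near \(h^*\):

```latex
\int dh\, q(h)\, e^{N\phi(h)}
\;\approx\; q(h^*)\, e^{N\phi(h^*)} \sqrt{\frac{2\pi}{N\,|\phi''(h^*)|}},
\qquad \phi'(h^*) = m - \tanh h^* = 0,
\qquad \phi''(h^*) = -\operatorname{sech}^2 h^* .
```

      The square-root factor only contributes \(O(\log N)\) to the log-probability, so it may simply have been dropped at leading order in \(N\).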

    3. argument [10] converts Eq. (1) with α = 1 into the statement that, for a large system, N → ∞, the energy and entropy are exactly equal (up to a constant) to leading order in N.

      I think this is the idea that Zipf law is related to P(Energy) being a constant w.r.t. Energy hmm

      Though really, if both E and S are extensive in N (meaning linear in N), then they will scale equally with N, obviously? Though is Zipf's law followed for extensive systems? Aren't those where parts are independent, and we expect to approach a uniform distribution?

      Right I think E and S scaling the same does not imply Zipf, but the other way, it does, apparently. Need to check argument in [10]

    1. Because the exponent \(\alpha_N \ll 1\) for language models, we can approximate \(N^{-\alpha_N} \approx 1 - \alpha_N \log(N)\) to obtain equation 4.1.

      If \(\alpha_N\log{(N)} \ll 1\) I don't see how E.4 will scale as equation 4.1?

      wouldn't the constant \(L_U -1\) dominate?
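
      A quick numerical sanity check of the first-order expansion, as a sketch only. The exponent `alpha = 0.05` and the model sizes are values I made up for illustration, not numbers from the paper. It shows the approximation is only good while \(\alpha_N \log N\) stays small:

```python
import math


def power_law(n, alpha):
    """Exact power-law term N^(-alpha)."""
    return n ** (-alpha)


def first_order(n, alpha):
    """First-order expansion 1 - alpha * log(N), valid when alpha * log(N) << 1."""
    return 1.0 - alpha * math.log(n)


alpha = 0.05  # hypothetical exponent, chosen only for illustration
for n in (1e1, 1e3, 1e6, 1e9):
    exact = power_law(n, alpha)
    approx = first_order(n, alpha)
    # alpha * log(N) is the quantity that has to be small for the expansion to hold
    print(f"N={n:.0e}  alpha*logN={alpha * math.log(n):.3f}  "
          f"exact={exact:.4f}  approx={approx:.4f}")
```

      For the largest size here the linearised form even goes negative while the exact power law does not, which is the regime where the approximation cannot be trusted.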

    2. could be misleading if the models have not all been trained fully to convergence

      you mean because perhaps the assumption that {in the limit of large N, they will perfectly model the data} may not hold if we don't train until convergence, and so the power law + constant assumption may not be justified. Yeah that makes sense

    3. which makes the interpretation of L(N) difficult.


    4. mattn

      what is \(m_{attn}\)?

    5. There we also show trends for the training loss, which do not adhere as well to a power-law form, perhaps because of the implicit curriculum in the frequency distribution of easy and hard problems

      why would that affect the training loss scaling??

    6. the poor loss on these modules would dominate the trends

      could they show accuracy trends?..

    7. easier problems will naturally appear more often than more difficult problems

      interesting. I have some ideas on how this could be related to learning curve exponents

    8. We sample the default mixture of easy, medium, and hard problems, without a progressive curriculum.

      Did they look if curriculum learning had any effect on the learning curves?

    9. context length of 3200 tokens per image/caption pair

      isn't that the total length of an example? I thought the context was the part given before the token to be predicted?

    10. We revisit the question “Is a picture worth a thousand words?” by comparing the information-content of textual captions to the image/text mutual information

      I think an issue with their analysis is that a picture's caption in a standard dataset does not capture all the info derivable from a picture

    1. but we will only apply it along the time dimension t.

      what do you mean? I thought you were applying the normalizing flow at each time step individually, not convolving over time

    1. The key point of this work is that based on observing a single sample from a subpopulation, it is impossible to distinguish samples from “borderline” populations from those in the “outlier” ones. Therefore an algorithm can only avoid the risk of missing “borderline” subpopulations by also memorizing examples from the “outlier” subpopulations.

      I just find it weird that we have to offer so much justification for fitting to 0 error, when I don't see much reason to believe it isn't a good idea?

    1. e over parameters and the function-space posterior covariance. Red indicates the under-parameterized setting, yellow the critical regime with p ≈ n, and green the over-parameterized regime.

      isn't it the other way? Red is over-parametrized and green is under-parametrized?

    2. We see wide but shallow models overfit, providing low train loss, but high test loss and high effective dimensionality.

      it seems like it's mostly the number of parameters not the aspect ratio which determines the generalization performance? So that depth is not intrinsically helping generalization?

    3. subspace and ensembling methods could be improved through the avoidance of expensive computations within degenerate parameter regimes

      but how do you make sure you are sampling with the right probabilities?

    1. w

      this should be transposed

    2. Our theory again perfectly fits the experiments.

      well you can see some deviations in this NN, probably because of the smaller width

    3. K

      i think here it should be \(\kappa_{\text{NTK}}\)

    4. marginal training data point causes greater reduction in relative error for low frequency modes than for high frequency modes.

      isn't this the opposite of what you said earlier??

      "the marginal training data point causes a greater percent reduction in generalization error for modes with larger RKHS eigenvalues."

  8. Oct 2020
    1. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches.

      so the activations for the set of samples which use expert k should be sent to the right device which has expert k, right?

      how much communication overhead is this?

    1. A prior over parameters p(w) combines with the functional form of a model f(x;w) to induce a distribution over functions p(f(x;w)). It is this distribution over functions that controls the generalization properties of the model; the prior over parameters, in isolation, has no meaning.

      Yep this is what we say in our paper too^^ https://arxiv.org/abs/1805.08522

    2. Distance between the true predictive distribution and the approximation

      you mean something like minus the distance? because you want this distance to be smaller for better approximations?

    1. coherent

      coherent here just means that it will approach the true distribution eventually?

    1. As the effective dimensionality increases, so does the dimensionality of parameter space in which the posterior variance has contracted.

      can you not have very confident models which are making wrong predictions?

    1. In the notation of Section 3, points ω ∈ Ω represent possible samples. In our setting, each sample represents a complete record of a machine learning experiment. An environment e specifies a distribution P_e on the space Ω of complete records. In the setting of supervised deep learning, a complete record of an experiment would specify hyperparameters, random seeds, optimizers, training (and held out) data, etc.

      so each e represents an "experiment", which is a range/distribution of hyperparameters (or what they call a complete record of a machine learning experiment)

    1. We measure a simple empirical statistic, the gradient noise scale (essentially a measure of the signal-to-noise ratio of gradient across training examples), and show that it can approximately predict the largest efficient batch size for a wide range of tasks

      how is this related to the difficulty of the task?
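
      To make the statistic concrete for myself: a minimal sketch of the simplified noise scale as I understand it, \(B_{\text{simple}} = \mathrm{tr}(\Sigma)/|G|^2\), where \(G\) is the mean gradient and \(\Sigma\) the per-example gradient covariance. The function name and the toy gradients are hypothetical, and this naive plug-in estimate is an assumption of mine, not necessarily the estimator used in the paper:

```python
def simple_noise_scale(per_example_grads):
    """Naive plug-in estimate of tr(Sigma) / |G|^2 from per-example gradients.

    per_example_grads: list of equal-length lists of floats, one per example.
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    # Mean gradient G, coordinate by coordinate
    mean = [sum(g[i] for g in per_example_grads) / n for i in range(dim)]
    # Trace of the sample covariance = sum of the per-coordinate sample variances
    trace_cov = sum(
        sum((g[i] - mean[i]) ** 2 for g in per_example_grads) / (n - 1)
        for i in range(dim)
    )
    grad_sq_norm = sum(m * m for m in mean)
    return trace_cov / grad_sq_norm


# Toy example: two per-example gradients that agree in direction but not in size
toy_grads = [[1.0, 0.0], [3.0, 0.0]]
print(simple_noise_scale(toy_grads))
```

      Gradients that mostly agree across examples give a small value and noisy, conflicting gradients give a large one, so presumably harder or more heterogeneous tasks push it up.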

    1. non-zero entropy

      what about entropy rate?

    2. overfitting

      OK, I think they are defining overfitting in the agnostic learning sense of \(L(f)-\min_{f'\in F}L(f')\). How badly am I doing relative to the best in the class!

    3. we stop training early when the test loss ceases to improve and optimize all models in the same way

      didn't they say earlier that they train for a fixed number of steps?

    4. N increases and the model begins to overfit

      well the increased overfitting is only visible in the smallest data size

    5. S

      should be N?

    6. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence

      no overfitting^^ even for transfer learning

    7. Although these models have been trained on the WebText2 dataset, their test loss on a variety of other datasets is also a power-law in N with nearly identical power, as shown in Figure 8.

      probably significantly different datasets will show different power laws. The different datasets looked at here seem quite similar

    8. (approximately twice the compute as the forwards pass)


    9. To utilize both training time and compute as effectively as possible, it is best to train with a batch size B ≈ B_crit

      because above B_crit you can reduce time, but with increasing compute cost (diminishing returns)
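
      A small sketch of the trade-off I have in mind, assuming the usual large-batch model \((S/S_{\min} - 1)(E/E_{\min} - 1) = 1\) with \(B_{\text{crit}} = E_{\min}/S_{\min}\). The function names and the numbers `s_min = 1000`, `b_crit = 256` are made up for illustration:

```python
def steps_needed(batch, s_min, b_crit):
    """Optimizer steps S to reach a target loss at batch size `batch`."""
    return s_min * (1.0 + b_crit / batch)


def examples_needed(batch, s_min, b_crit):
    """Training examples E = B * S processed to reach the same loss."""
    return batch * steps_needed(batch, s_min, b_crit)


s_min, b_crit = 1000.0, 256.0  # hypothetical values
e_min = s_min * b_crit         # minimum number of examples (small-batch limit)

for batch in (32, 256, 2048):
    s = steps_needed(batch, s_min, b_crit)
    e = examples_needed(batch, s_min, b_crit)
    # Ratios to the respective minima: time cost vs. compute cost
    print(f"B={batch:5d}  S/S_min={s / s_min:.2f}  E/E_min={e / e_min:.2f}")
```

      Under this model, training at \(B = B_{\text{crit}}\) costs twice the minimum number of steps and twice the minimum number of examples; going well above it barely reduces steps while compute keeps growing.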

  9. Jul 2020
  10. Jun 2020
    1. (x;W1,...,Wl,b1,...,bl)

      it should depend on \(W^{l+1}\) and \(b^{l+1}\) too

    1. Naturally, such an increase in the learning rate also increases the mean steps E[∆w]. However, we found that this effect is negligible since E[∆w] is typically orders of magnitude lower than the standard deviation.

      Interesting. This is why the intuition that increasing the learning rate would decrease the number of updates is probably not true, because what seems to determine the number of steps is the amount of noise!

  11. May 2020
    1. 〈O( ̄θ)〉=〈[[O[ ̄θ−η ̄∇LB(θ)]]]m.b.〉.

      this is missing some time indices?

    1. We omit the dβexp (−cγ) +bβlog(1δ)n term since it does not change with change in random labels.

      how can we be sure it is non-vacuous then? hmm

    2. while ̃Hθ†l,φ[j,j] can change based on α-scaling Dinh et al. [2017], the effective curvature is scale invariant

      do you mean because you change \(\sigma\) too? Was that what Dinh et al. were talking about? Or just the fact that there are other \theta (not reparametrizing, just finding new \theta) which have high curvature, but produce same function?

    3. (f) stays valid for the test error rate in (a)

      if you take into account the spread in (f) and (a) it would seem that for some runs the upper bound isn't valid?

    4. Then, based on the ‘fast rate’ PAC-Bayes bound as before, we have the following result

      the posterior Q is a strange posterior over hypotheses. How do they take the KL divergence with the prior, given that the posterior is defined by two parameters (\(\theta_\rho\) and \(\theta\))?

    5. Further, all the diagonal elements decrease as more samples are used for training.

      Really? That sounds surprising!

      I would have expected that as more training samples are added the parameters get more constrained (if the number of parameters is kept fixed).

    6. Theorem 1

      Derandomization of the margin loss

    7. The bound provides a concrete realization of the notion of ‘flatness’ in deep nets [Smith and Le, 2018, Hochreiter and Schmidhuber, 1997, Keskar et al., 2017] and illustrates a trade-off between curvature and distance from initialization.

      is there evidence that distance from initialization anti-correlates with generalization? Even evidence for sharpness <> generalization isn't very strong.

    8. In spite of the dependency on the Hessian diagonal elements, which can be changed based on re-parameterization without changing the function [Smith and Le, 2018, Dinh et al., 2017], the bound itself is scale invariant since KL-divergence is invariant to such re-parameterizations Kleeman [2011], Li et al. [2019].

      i thought Dinh's criticism wasn't so much about reparametrization, but about the fact that there are other minima which are sharper but give the same function. KL wouldn't be invariant to that, as you aren't changing the prior in that case?

  12. Apr 2020
    1. ∈Ck

      this sum was over all points in the training set in the previous step, and now it's over all points?

      Just think of the case where the partition C_i is made up of singletons, one for each possible point. Then, the robustness would be zero, but the generalization error bound doesn't seem right then.

      This made me suspect there may be something wrong, and I think it could be at this step. If we kept the sum to be over training sets, now we can't upper bound the result by the max in the next two lines, I think!

  13. Mar 2020
    1. because of the softmax operation.

      more like because of the Heaviside operation

    2. the signs of \(f\) and \(\tilde{f}\) are the same.

      and therefore the classification functions are the same

    3. \(\tilde{f}\) as \(f_V=\rho \tilde{f}\),

      this is confusing, is f_V or \tilde{f} the normalized network?

    4. Our main results should also hold for SGD.

      Will this be commented on in more detail?

    5. normalized weights Vk as the variables of interest

      Can we even reparametrize to the normalized weights? For homogeneous networks, it's obvious that we can. But for ReLU networks with biases it's less obvious. If one multiplies the biases by constants that grow exponentially with weight, the function is left invariant. We can always do this until the parameter vector is left normalized. Therefore we can reparametrize to the normalized vectors even with biases, but I don't know if they consider this case here.
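
      A toy check of this rescaling, under my own reading of it: scale every weight matrix by \(c > 0\) and the bias of layer \(k\) by \(c^k\), so that an \(L\)-layer ReLU net's output is multiplied by \(c^L\) and its sign (hence the classification) is unchanged. The helper names and the tiny network are hypothetical:

```python
def relu(v):
    return [max(0.0, x) for x in v]


def affine(weights, bias, v):
    """Compute W v + b for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def forward(layers, v):
    """ReLU network: ReLU after every layer except the last (linear) one."""
    for k, (weights, bias) in enumerate(layers):
        v = affine(weights, bias, v)
        if k < len(layers) - 1:
            v = relu(v)
    return v


def rescale(layers, c):
    """Layer k (1-indexed): W_k -> c * W_k, b_k -> c**k * b_k."""
    return [
        ([[c * w for w in row] for row in weights],
         [c ** (k + 1) * b for b in bias])
        for k, (weights, bias) in enumerate(layers)
    ]


# Hypothetical 2-layer network with biases
layers = [
    ([[1.0, -2.0], [0.5, 1.0]], [0.1, -0.3]),
    ([[1.0, -1.0]], [0.2]),
]
x = [1.0, 2.0]
c = 2.0
original = forward(layers, x)[0]
scaled = forward(rescale(layers, c), x)[0]
# With L = 2 layers the output should be multiplied by c**2
print(original, scaled, scaled / original)
```

      This relies on ReLU being positively homogeneous, so it only works for \(c > 0\).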

    6. This mechanism underlies regularization in deep networks for exponential losses

      we cannot say this until we know more. Is this the reason why they generalize? Is this even sufficient to explain their generalization?

    1. Bahdanau et al. (2019) learn a reward function jointly with the action policy but does so using an external expert dataset whereas our agent uses trajectories collected through its own exploration

      Yeah what they do here is similar to IRL, in that we are trying to learn a human NL-conditioned reward function, but we do it via supervision, rather than demonstration. More similar to the work on "learning from human preferences"

    1. other agents

      which share the same policy right? otherwise it would be off-policy experience?

    2. Zero Sum

      don't understand this one

    3. specific choice ofλ

      here, a specific choice of \(\lambda\) can determine which solutions among the many which satisfy the constraint we choose. Similarly to the choice of convex regularizer in the GAIL paper

    1. The problem with the max entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions, without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!

      The question is what would be the maximum entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3 in Ziebart 2012

    2. Z(θ)

      Remember the partition function sums over trajectories which are compatible with the MDP dynamics only.

      Trajectories incompatible with the dynamics have probability 0 of course

    1. Ziebart et al. (2008)

      The problem with the max entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions, without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!

      The question is what would be the maximum entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3

    2. \(e^{\theta^\top F(X,Y)}\)

      this is P(Y|X), right? but it should be P(Y|X,Y_{1:t-1})?

    1. without interaction with the expert

      how do things change when you can interact with the expert?

  14. Feb 2020
    1. Attention: Mask

      by this, do they mean the attention weighted aggregation step?

    2. \(n_{\text{layer}} d_{\text{model}} 3 d_{\text{attn}}\)

      are they ignoring the \(W^O\) matrix? from the original Transformer paper?

    3. Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4)

      hmm interesting. why are larger models more sample efficient?

    4. The performance penalty depends predictably on the ratio \(N^{0.74}/D\)

      That is weird, what's the origin of this?

    5. hmm do they look at generalization gap?

      is the trend of test loss with parameter count mostly due to the effect on expressivity / training loss (similarly with compute)?

    1. Some preliminary numerical simulations show that this approach does predict high robustness and log scaling. However, it only makes any sense if transitions from one phenotype to another phenotype are memoryless.

      I thought the whole transition matrix approach itself assumed memorylessness

    2. Let P be a row vector specifying the probability distribution over phenotypes. We want to find a stochastic transition matrix M (rows sum to one) such that

      why do we want P to be stationary?

    3. M has 1s on the diagonals, and 0s elsewhere, for example

      that is high robustness right?
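
      To check my reading of the stationarity condition \(P M = P\) (my assumption for what the truncated highlight goes on to say), here is a small sketch. The helper names, probability vectors and the second matrix are made up. With the identity matrix, every \(P\) is trivially stationary:

```python
def step(p, m):
    """One step of the chain: row vector p times row-stochastic matrix m."""
    return [sum(p[i] * m[i][j] for i in range(len(p))) for j in range(len(m[0]))]


def is_stationary(p, m, tol=1e-12):
    """True if p M equals p up to a small tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(step(p, m), p))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_any = [0.7, 0.2, 0.1]                   # arbitrary phenotype distribution
print(is_stationary(p_any, identity))     # identity never changes phenotype

m_mix = [[0.9, 0.1], [0.2, 0.8]]          # hypothetical 2-phenotype matrix
print(is_stationary([2 / 3, 1 / 3], m_mix))
print(is_stationary([0.5, 0.5], m_mix))
```

      In the identity case the diagonal entries (probability of staying in the same phenotype) are all 1, which is what I mean by maximal robustness.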

    4. Fano’s inequality)

      doesn't Fano's inequality give H(X|Y) in the numerator, which is a lower bound on H(X), and so doesn't imply this?


    1. Intrinsic motivations f

      Basically the idea is that the RL/HER part is intrinsically motivated with LP, to solve more and more tasks, while the goal sampling part is intrinsically motivated to get trajectories that give new information to learn the reward function. I suppose they could add a bit of LP to the goal sampling as well to have some tendency to sample trajectories that may help to solve new tasks.

    2. High-quality trajectories are trajectories where the agent collects descriptions from the social partner for goals that are rarely reached.

      why do you want more than one description for a goal? A: Ah, because the goal will be the same but the final state may not be for each of these trajectories, thus giving more data to train the reward function.

  15. Jan 2020
    1. f memory-based sample efficient methods

      bandit methods, which are suitable for sequences of independent experiments

    1. We find that the object geometry makes a significant differences in how hard the problem is

      apply some goal exploration process like POET?

    1. When it comes to NNs, the regularization mechanism is also well appreciated in the literature, since they traditionally suffer from overparameterization, resulting in overfitting.

      No. Overparametrized networks have been shown to generalize even without explicit regularization (Zhang et al. 2017)