683 Matching Annotations

Feb 2021
arxiv.org

2012.05672.pdf

3
1. guillefix 07 Feb 2021
  
  in Public
  
  Although the agents do not yet attain human-level performance, we will soon describe scaling experiments which suggest that this gap could be closed substantially simply by collecting more data.
  
  We need more data
2. guillefix 07 Feb 2021
  
  in Public
  
  The regularisation schemes presented in the last section can improve the generalisation properties of BC policies to novel inputs, but they cannot train the policy to exert active control in the environment to attain states that are probable in the demonstrator’s distribution.
  
  Unless that active control can be learned by generalizing from learned actions in the demonstrations?
3. guillefix 06 Feb 2021
  
  in Public
  
  The mouselook action distribution is in turn also defined autoregressively: the first sampled action splits the window bounded by (−1,1)×(−1,1) in width and height into 9 squares. The second action splits the selected square into 9 further squares, and so on. Repeating this process several times allows the agent to express any continuous mouse movement up to a threshold resolution.
  
  Interesting representation of a continuous action space!
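The coarse-to-fine scheme quoted in the third annotation can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the row-major numbering of the 9 cells, and returning the centre of the final cell are all assumptions made here.

```python
from typing import List, Tuple

def decode_mouselook(choices: List[int]) -> Tuple[float, float]:
    """Map a sequence of 3x3 cell indices (each 0..8) to a point in (-1, 1) x (-1, 1).

    Assumption: cells are numbered row-major and the decoded movement is the
    centre of the last selected cell.
    """
    x_lo, x_hi, y_lo, y_hi = -1.0, 1.0, -1.0, 1.0
    for c in choices:
        if not 0 <= c < 9:
            raise ValueError("each choice must be in 0..8")
        row, col = divmod(c, 3)
        w = (x_hi - x_lo) / 3
        h = (y_hi - y_lo) / 3
        # Shrink the window to the selected cell.
        x_lo, x_hi = x_lo + col * w, x_lo + (col + 1) * w
        y_lo, y_hi = y_lo + row * h, y_lo + (row + 1) * h
    return (x_lo + x_hi) / 2, (y_lo + y_hi) / 2

def encode_mouselook(x: float, y: float, depth: int) -> List[int]:
    """Inverse direction: pick, at each level, the cell containing (x, y)."""
    x_lo, x_hi, y_lo, y_hi = -1.0, 1.0, -1.0, 1.0
    choices = []
    for _ in range(depth):
        w = (x_hi - x_lo) / 3
        h = (y_hi - y_lo) / 3
        col = min(2, max(0, int((x - x_lo) / w)))
        row = min(2, max(0, int((y - y_lo) / h)))
        choices.append(row * 3 + col)
        x_lo, x_hi = x_lo + col * w, x_lo + (col + 1) * w
        y_lo, y_hi = y_lo + row * h, y_lo + (row + 1) * h
    return choices
```

With `depth` refinements the cell width is 2/3^depth, so the resolution improves geometrically while each step remains a 9-way categorical choice.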

Annotators

guillefix

URL

arxiv.org/pdf/2012.05672.pdf
arxiv.org

Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

1
1. guillefix 04 Feb 2021
  
  in Public
  
  effective dimensionality of a Bayesian neural network is inversely proportional to the variance of the posterior distribution.
  
  I think you are talking about posterior contraction in parameter space, no?

Annotators

guillefix

URL

arxiv.org/pdf/2003.02139.pdf
www.shortscience.org

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning - ShortScience.org

1
1. guillefix 02 Feb 2021
  
  in Public
  
  Yet this also implies non i.i.d. samples! Indeed, even if one could directly sample from the state-action distribution (like having its analytical form or an infinite experience replay buffer) and thus draw i.i.d. samples, the dependency will occur across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample and then my next sample depends on my previous sample (since it conditioned my policy update).
  
  But this isn't a problem if the examples come from a fixed expert no?

Annotators

guillefix

URL

shortscience.org/paper
Jan 2021
arxiv.org

2101.00190.pdf

1
1. guillefix 22 Jan 2021
  
  in Public
  
  Prefix-tuning prepends a sequence of continuous task-specific vectors to the input, which we call a prefix, depicted by red blocks in Figure 1 (bottom). For subsequent tokens, the Transformer can attend to the prefix as if it were a sequence of “virtual tokens”, but unlike prompting, the prefix consists entirely of free parameters which do not correspond to real tokens.
  
  and are thus differentiable! yay
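A toy sketch of the "virtual tokens" idea in the quoted passage, under strong simplifying assumptions: a single attention head, identity query/key/value projections, and a prefix applied only at the input. The function names are made up for this illustration. The point is only that the prefix rows are ordinary real-valued vectors that every later position attends to, so the outputs vary smoothly with them.

```python
import math
from typing import List

Vector = List[float]

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_with_prefix(prefix: List[Vector], tokens: List[Vector]) -> List[Vector]:
    """Causal self-attention over [prefix; tokens]; returns outputs for real tokens only.

    Assumption: identity Q/K/V projections, so each vector is its own
    query, key, and value.
    """
    seq = prefix + tokens
    outputs = []
    for t in range(len(prefix), len(seq)):
        query = seq[t]
        visible = seq[: t + 1]  # the whole prefix plus tokens up to position t
        scale = math.sqrt(len(query))
        weights = softmax(
            [sum(q * k for q, k in zip(query, key)) / scale for key in visible]
        )
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(len(query))]
        )
    return outputs
```

In an autodiff framework the `prefix` rows would be the only trainable parameters while the language model's own weights stay frozen.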

Annotators

guillefix

URL

arxiv.org/pdf/2101.00190
openlab-flowers.inria.fr

Morphological Search and Control

2
1. guillefix 20 Jan 2021
  
  in Public
  
  I guess a stepping-stone towards this would be to optimize morphological growth processes to generate a body with a particular form in 3D (that would be quite similar to the differentiable CA, except that here the “cells” move in 3D space and have physical interaction that depend on their internal parameters and states)
  
  See https://twitter.com/ak92501/status/1312288333326942208?s=21
2. guillefix 20 Jan 2021
  
  in Public
  
  (and that would be also novel to use a population-based IMGEPs using gradient descent for local optimization towards self-generated goals)
  
  similar to SIREN+CLIP (Deep Sleep)

Annotators

guillefix

URL

openlab-flowers.inria.fr/t/morphological-search-and-control/797
www.semanticscholar.org

2009.01325.pdf

4
1. guillefix 20 Jan 2021
  
  in Public
  
  For this reason, we were unable to collect baselines such as an equivalent amount of high-quality human demonstrations for supervised baselines. See D for more discussion. We leave this ablation to future work.
  
  so one possibility is that the feedback you got was of better quality than the data used for SL. Perhaps if you did SL on higher quality data you would match the performance of the human feedback model?
2. guillefix 20 Jan 2021
  
  in Public
  
  it’s unclear how much one can optimize against the reward model until it starts giving useless evaluations.
  
  adversarial examples
3. guillefix 20 Jan 2021
  
  in Public
  
  Previous work on fine-tuning language models from human feedback [73] reported “a mismatch between the notion of quality we wanted our model to learn, and what the humans labelers actually evaluated”, leading to model-generated summaries that were high-quality according to the labelers, but fairly low-quality according to the researchers.
  
  That is quite interesting
4. guillefix 20 Jan 2021
  
  in Public
  
  We rely on detailed procedures to ensure high agreement between labelers and us on the task, which we describe in the next section
  
  is this necessarily a good thing? Could you not miss other notions of "quality" this way? I guess you want to ensure a consistent notion of quality, rather than asking the question of "what about other notions of quality?"

Annotators

guillefix

URL

semanticscholar.org/reader/053b1d7b97eb2c91fc3921d589c160b0923c70b1
openai.com

CLIP: Connecting Text and Images

1
1. guillefix 11 Jan 2021
  
  in Public
  
  We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams.
  
  poor generalization

Annotators

guillefix

URL

openai.com/blog/clip/
proceedings.mlr.press

Learning Human Objectives by Evaluating Hypothetical Behavior

10
1. guillefix 11 Jan 2021
  
  in Public
  
  In fact, without visiting any states at all, since the queries are synthetic.
  
  grr, what about during the phase of training the generative model?
2. guillefix 11 Jan 2021
  
  in Public
  
  The x-axis represents the number of queries to the user, where each query elicits a label for a single state transition \((s, a, s')\).
  
  but isn't sampling from the model less expensive than sampling by optimizing AFs? shouldn't that be taken into account?
3. guillefix 11 Jan 2021
  
  in Public
  
  having to visit unsafe states during the training process
  
  it may have visited some during the training of the generative model no?
  
  But I guess not that many, if the generative model has been pretrained, and it can generalize well
4. guillefix 11 Jan 2021
  
  in Public
  
  As discussed in Section 4.3 and illustrated in the right-most plot of Figure 5, the baselines learn a reward model that incorrectly extrapolates that continuing up and to the right past the goal region is good behavior.
  
  but if the baselines aren't visiting those high reward states, then they haven't actually fallen into reward hacking? I guess the idea is that they could in a new environment.
  
  Takeaway is to do more exploration if you expect to be tested in new environments
5. guillefix 11 Jan 2021
  
  in Public
  
  \(\tau_{\text{query}} = \max_{z_0, a_0, z_1, \ldots, z_T} J(\tau) + \log p(\tau)\)
  
  its like a model-based version of DDPG + curiosity/exploration rewards?
6. guillefix 11 Jan 2021
  
  in Public
  
  Here, the state \(s \in \mathbb{R}^{64 \times 64 \times 3}\) is an RGB image with a top-down view of the car (Figure 3), and the action \(a \in \mathbb{R}^3\) controls steering, gas, and brake
  
  In my experience, high dimensional action spaces are even harder, especially when combined with high dim state spaces
7. guillefix 11 Jan 2021
  
  in Public
  
  The idea is to elicit labels for examples that the model is least certain how to label, and thus reduce model uncertainty.
  
  what if the user(s) the model is querying are also uncertain? Then the model shouldn't spend too much time on these. This is one thing that learning progress aims to avoid!
8. guillefix 11 Jan 2021
  
  in Public
  
  To simplify our experiments, we sample trajectories \(\tau\) by following random policies that explore a wide variety of states. We use the observed trajectories to train a likelihood model
  
  Seems like this may be an issue in more complex environments, as the random policies may not explore enough!
  
  We probably want either human demonstrations and/or to iterate/refine the generative model with the later policies
9. guillefix 11 Jan 2021
  
  in Public
  
  (4) maximize novelty of trajectories regardless of predicted rewards, to improve the diversity of the training data.
  
  could also do something based on learning progress
10. guillefix 11 Jan 2021
  
  in Public
  
  In complex domains, the user may not be able to anticipate all possible agent behaviors and specify a reward function that accurately describes user preferences over those behaviors
  
  so is the assumption that the automated way of exploring agent behaviours is better than what a human would consider?

Annotators

guillefix

URL

proceedings.mlr.press/v119/reddy20a/reddy20a.pdf
Dec 2020
www.wikiwand.com

End-to-end principle | Wikiwand

1
1. guillefix 01 Dec 2020
  
  in Public
  
  it is far easier to obtain reliability beyond a certain margin by mechanisms in the end hosts of a network rather than in the intermediary nodes,[nb 4] especially when the latter are beyond the control of, and not accountable to, the former
  
  this seems to me to be mostly saying that: it's hard to change the standards at the low level, so it's easier to program at the higher level.
  
  This is true of not just networks, but of computers, etc too. But it may not always be the best approach!
  
  Should have called it "rule of thumb" more than principle I think

Annotators

guillefix

URL

wikiwand.com/en/End-to-end_principle
Nov 2020
arxiv.org

2010.11924.pdf

1
1. guillefix 30 Nov 2020
  
  in Public
  
  all causal explanations are necessarily robust in this extreme case
  
  are they? Can you not have a thing that has a conditional causal effect?
  
  Seems to me that causality should be a more quantitative thing (how robust is this predictor), rather than an either-or thing

Annotators

guillefix

URL

arxiv.org/pdf/2010.11924.pdf
proceedings.neurips.cc

NeurIPS-2020-finite-versus-infinite-neural-networks-an-empirical-study-Paper.pdf

4
1. guillefix 30 Nov 2020
  
  in Public
  
  Goldblum et al. [119] which empirically observes that the large width behavior of Residual Networks does not conform to the infinite-width limit.
  
  Oh interesting!
2. guillefix 30 Nov 2020
  
  in Public
  
  While CNN-VEC possess translation equivariance but not invariance (§3.11), we believe it can effectively leverage equivariance to learn invariance from data
  
  How? if it doesn't imply anything about the output?
3. guillefix 30 Nov 2020
  
  in Public
  
  This is caused by poor conditioning of pooling networks. Xiao et al. [33] (Table 1) show that the conditioning at initialization of a CNN-GAP network is worse than that of FCN or CNN-VEC networks by a factor of the number of pixels (1024 for CIFAR-10). This poor conditioning of the kernel eigenspectrum can be seen in Figure 8. For linearized networks, in addition to slowing training by a factor of 1024, this leads to numerical instability when using float32
  
  Interesting. Do models with a stronger bias, which may be associated with better generalization (see https://arxiv.org/abs/2002.02561 / https://arxiv.org/abs/1905.10843), also lead to poorer conditioning?
  
  Hmm, but this did not affect the non-linearized model. Interesting. How does non-linear GD avoid the issue?
4. guillefix 30 Nov 2020
  
  in Public
  
  regularization parameter
  
  what regularization parameter?

Annotators

guillefix

URL

proceedings.neurips.cc/paper/2020/file/ad086f59924fffe0773f8d0ca22ea712-Paper.pdf
arxiv.org

2011.06006.pdf

1
1. guillefix 20 Nov 2020
  
  in Public
  
  We add the superscript “all" to emphasize that gradient-based training of the networks is always performed on the entire dataset, while NNGP inference is performed on sub-sampled datasets.
  
  ah hm, so the gradient method is given an advantage by being able to "look" at more data than the NNGP method?

Annotators

guillefix

URL

arxiv.org/pdf/2011.06006.pdf
arxiv.org

2004.10151.pdf

27
1. guillefix 19 Nov 2020
  
  in Public
  
  With few exceptions (Carlson et al., 2010), machine learning models have been confined to IID datasets that lack the structure in time from which humans draw correlations about long-range causal dependencies
  
  All of RL studies non-IID data
2. guillefix 19 Nov 2020
  
  in Public
  
  how pretraining obfuscates our ability to measure generalization (Linzen, 2020)
  
  How??
3. guillefix 19 Nov 2020
  
  in Public
  
  but even complex simulation action spaces can be discretized and enumerated.
  
  What's the problem of enumerating and discretizing action spaces?
  
  what about agents that can act via free text? like those in AI dungeon? those are in principle not enumerable
4. guillefix 19 Nov 2020
  
  in Public
  
  models the listener’s desires and experiences explicitly
  
  what does it mean to model them explicitly versus implicitly?
5. guillefix 19 Nov 2020
  
  in Public
  
  Collecting data about rich natural situations is often impossible.
  
  NOPE. VR.
6. guillefix 19 Nov 2020
  
  in Public
  
  Meanwhile, it is precisely human’s ability to draw on past experience and make zero-shot decisions that AI aims to emulate
  
  which is what GPT3 is doing
7. guillefix 19 Nov 2020
  
  in Public
  
  Second, current cross entropy training losses actively discourage learning the tail of the distribution properly, as statistically infrequent events are drowned out (Pennington et al., 2014; Holtzman et al., 2020).
  
  That's what scaling is doing, shaving off those tails (as the scaling papers discuss)
8. guillefix 19 Nov 2020
  
  in Public
  
  it is unlikely that universal function approximators such as neural networks would ever reliably posit that people, events, and causality exist without being biased towards such solutions (Mitchell, 1980)
  
  Why?
9. guillefix 19 Nov 2020
  
  in Public
  
  (which are usually thrown out before the dataset is released)
  
  They shouldn't be! We should learn to probabilistically model the data
10. guillefix 19 Nov 2020
  
  in Public
  
  persistent enough to learn the effects of actions.
  
  so we should aim for longer contexts? Yeah memory is important. There is research in extending transformers to have longer contexts
11. guillefix 19 Nov 2020
  
  in Public
  
  and active experimentation is key to learning that effect
  
  why?
12. guillefix 19 Nov 2020
  
  in Public
  
  participate in linguistic activity, such as negotiation (Yang et al., 2019a; He et al., 2018; Lewis et al., 2017), collaboration (Chai et al., 2017), visual disambiguation (Anderson et al., 2018; Lazaridou et al., 2017; Liu and Chai, 2015), or providing emotional support (Rashkin et al., 2019).
  
  do we need the agent itself to participate, or is not sufficient to feed it data from such types of interactions?
13. guillefix 19 Nov 2020
  
  in Public
  
  Framing, such as suggesting that a chatbot speaks English as a second language
  
  Tbh I think that framing can be both misleading and illuminating (about the degree or lack thereof of capability of the agent)
14. guillefix 19 Nov 2020
  
  in Public
  
  Robotics and embodiment are not available in the same off-the-shelf manner as computer vision models.
  
  I think VR can solve that
15. guillefix 19 Nov 2020
  
  in Public
  
  (Li et al., 2019b; Krishna et al., 2017; Yatskar et al., 2016; Perlis, 2016)
  
  why don't you explain how these papers support the statement at least?
16. guillefix 19 Nov 2020
  
  in Public
  
  Models must be able to watch and recognize objects, people, and activities to understand the language describing them
  
  why?
17. guillefix 19 Nov 2020
  
  in Public
  
  Learned, physical heuristics, such as the fact that a falling cat will land quietly, are generalized and abstracted into language metaphors like as nimble as a cat (Lakoff, 1980).
  
  So you just conceded that a prime example of things that need physical interaction to be learnt, can be expressed in words?
  
  You should make your points clearer. The point I think is that there are a lot of subconscious knowledge like the example you give, but which we can't quite put into words!
18. guillefix 19 Nov 2020
  
  in Public
  
  Language learning needs perception, because perception forms the basis for many of our semantic axioms
  
  could we not argue that language is all that we are conscious of? Even though it may be formed by external sensations, what we currently (consciously) know may be almost fully expressible by language, and therefore WS2 may be enough to learn all of conscious knowledge
19. guillefix 19 Nov 2020
  
  in Public
  
  As text pretraining schemes seem to be reaching the point of diminishing returns,
  
  Not yet, in long scale IIRC
20. guillefix 19 Nov 2020
  
  in Public
  
  parked my car in the compact parking space because it looked (big/small) enough
  
  Hmm, I think the answer is "big"? This seems learnable from text statistics?
21. guillefix 19 Nov 2020
  
  in Public
  
  Continuing to expand hardware, data sizes, and financial compute cost by orders of magnitude will yield further gains, but the slope of the increase is quickly decreasing.
  
  Right, but it's nice that we have a reliable way to improve performance.
22. guillefix 19 Nov 2020
  
  in Public
  
  Scale in data and modeling has demonstrated that a single representation can discover both rich syntax and semantics without our help (Tenney et al., 2019).
  
  It's not without our help. The data is our help?^^
23. guillefix 19 Nov 2020
  
  in Public
  
  You can’t learn language from the radio.
  
  I think the question shouldn't be phrased as a dichotomy, but quantitatively: How much language (semantics) can you and can you not learn from the radio?
24. guillefix 19 Nov 2020
  
  in Public
  
  The futility of learning language from linguistic signal alone is intuitive, and mirrors the belief that humans lean deeply on non-linguistic knowledge (Chomsky, 1965, 1980).
  
  Something being intuitive isn't a strong argument for it being true.
25. guillefix 19 Nov 2020
  
  in Public
  
  from their use by people to communicate
  
  Let's gather massive datasets on that through VR ^^
26. guillefix 19 Nov 2020
  
  in Public
  
  Natural language processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks.
  
  and figuring out how to scale up https://arxiv.org/abs/2001.08361
27. guillefix 19 Nov 2020
  
  in Public
  
  successful linguistic communication relies on a shared experience of the world. It is this shared experience that makes utterances meaningful
  
  I think this is true, except for the language which communicates about language. I think there is meaning purely within the world of language too.
  
  Though certainly a lot of meaning lies in the grounding of language too

Annotators

guillefix

URL

arxiv.org/pdf/2004.10151.pdf
www.ece.uvic.ca

untitled

2
1. guillefix 18 Nov 2020
  
  in Public
  
  share attention
  
  common context
2. guillefix 18 Nov 2020
  
  in Public
  
  Any smaller subset of these competencies is not sufficient to develop proper language/communication skills, and further, the development of language clearly bootstraps better motor and affordance learning and/or social learning.
  
  This seems to be full of statements like this where they claim something is "obviously true" but really more justification is needed for these claims.

Annotators

guillefix

URL

ece.uvic.ca/~bctill/papers/ememcog/Cangelosi_etal_2010.pdf
arxiv.org

Untitled document

6
1. guillefix 17 Nov 2020
  
  in Public
  
  Intuition
  
  The way I think about their framework is as follows:
  
  They shift perspective from bounding the error to "bounding" the learning curves
  
  Learning curves are functions (of n), so there is no clear ordering between them as there is for the error at a particular n, which is just a number.
  
  So instead of learning curves we look at {learning curves up to the equivalence relation of having the same asymptotic behaviour (up to a constant)}, which we call "rates".
  
  For these there is a natural ordering, and one can provide a rate upper bound, that is uniform over P, for a particular hypothesis class, assuming realizability. This is what they do here, so it is basically uniform convergence, but of a different quantity, which is more representative of how ML works in practice, so that this framework is probably more useful.
  
  However, their description of "PAC learning" is too restrictive I think; they don't seem to consider data-dependent generalization bounds which exist, and some of them are based on extensions to the uniform PAC bounds. For example how does their framework compare to the PAC-Bayes framework?
2. guillefix 16 Nov 2020
  
  in Public
  
  H is not learnable at rate faster than R
  
  So that the concept of universal learnability is characterizing the worst case learning curve rate. The constant is allowed to depend on P but not the function R. So it is non-uniform in that way. But really that's not the best way to think of it I think. The way I think of it is written in my page note titled "Intuition"
3. guillefix 16 Nov 2020
  
  in Public
  
  For simplicity of exposition, we have stated a definition corresponding to deterministic algorithms, to avoid the notational inconvenience required to formally define randomized algorithms in this context
  
  IKR
4. guillefix 16 Nov 2020
  
  in Public
  
  erP
  
  nice
5. guillefix 16 Nov 2020
  
  in Public
  
  That is, every nontrivial class H is either universally learnable at an exponential rate (but not faster), or is universally learnable at a linear rate (but not faster), or is universally learnable but necessarily with arbitrarily slow rates
  
  what do they mean by "nontrivial" here?
6. guillefix 16 Nov 2020
  
  in Public
  
  for any learning algorithm, there is a realizable distribution P whose learning curve decays no faster than a linear rate (Schuurmans, 1997)
  
  aren't we interested in the statement that for any realizable distribution P there is a learning algorithm whose learning curve decays no faster than a linear rate?

Annotators

guillefix

URL

arxiv.org/pdf/2011.04483.pdf
arxiv.org

1310.0448.pdf

3
1. guillefix 12 Nov 2020
  
  in Public
  
  \(S(\{O_\mu(x)\})\)
  
  what do they mean by this quantity?
  
  The number of states with the same energy as O_\mu(x)?
2. guillefix 12 Nov 2020
  
  in Public
  
  \(2^{-N} q(h^*)\, e^{N(h^* m - \log\cosh h^*)}\)
  
  Isn't this missing the Hessian factor in Laplace's approximation? where has it gone?
3. guillefix 12 Nov 2020
  
  in Public
  
  argument [10] converts Eq. (1) with α = 1 into the statement that, for a large system, N → ∞, the energy and entropy are exactly equal (up to a constant) to leading order in N.
  
  I think this is the idea that Zipf law is related to P(Energy) being a constant w.r.t. Energy hmm
  
  tho really if both E and S are extensive in N (meaning linear in N), then they will scale equally with N, obviously? Tho is Zipf's law followed for extensive systems? Aren't those where parts are independent, and we expect to approach a uniform distribution?
  
  Right I think E and S scaling the same does not imply Zipf, but the other way, it does, apparently. Need to check argument in [10]
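On the second annotation above (the missing Hessian factor): a sketch of Laplace's approximation, assuming the quoted expression comes from an integral over a field \(h\) with density \(q(h)\) and an interior maximum at \(h^*\). The exponent \(f(h) = h m - \log\cosh h\) is read off the quoted formula; the rest is the standard textbook result and is not taken from the paper.

\[
\int q(h)\, e^{N f(h)}\, dh \;\approx\; q(h^*)\, e^{N f(h^*)} \sqrt{\frac{2\pi}{N\,|f''(h^*)|}}, \qquad f''(h) = -\operatorname{sech}^2 h .
\]

The square-root prefactor contributes only \(-\tfrac{1}{2}\log N + O(1)\) to the log-probability, which is subleading next to the \(O(N)\) exponent, so it would plausibly be dropped in a statement that only holds to leading order in \(N\).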

Annotators

guillefix

URL

arxiv.org/pdf/1310.0448.pdf
arxiv.org

2010.14701.pdf

10
1. guillefix 10 Nov 2020
  
  in Public
  
  Because the exponent \(\alpha_N \ll 1\) for language models, we can approximate \(N^{-\alpha_N} \approx 1 - \alpha_N \log(N)\) to obtain equation 4.1.
  
  If \(\alpha_N\log{(N)} \ll 1\) i don't see how E.4 will scale as equation 4.1?
  
  wouldn't the constant \(L_U -1\) dominate?
2. guillefix 10 Nov 2020
  
  in Public
  
  could be misleading if the models have not all been trained fully to convergence
  
  you mean because perhaps the assumption that {in the limit of large N, they will perfectly model the data} may not hold if we don't train until convergence, and so the power law + constant assumption may not be justified. Yeah that makes sense
3. guillefix 10 Nov 2020
  
  in Public
  
  which makes the interpretation of L(N) difficult.
  
  why?
4. guillefix 10 Nov 2020
  
  in Public
  
  \(m_{attn}\)
  
  what is \(m_{attn}\)?
5. guillefix 09 Nov 2020
  
  in Public
  
  There we also show trends for the training loss, which do not adhere as well to a power-law form, perhaps because of the implicit curriculum in the frequency distribution of easy and hard problems
  
  why would that affect the training loss scaling??
6. guillefix 09 Nov 2020
  
  in Public
  
  the poor loss on these modules would dominate the trends
  
  could they show accuracy trends?..
7. guillefix 09 Nov 2020
  
  in Public
  
  easier problems will naturally appear more often than more difficult problems
  
  interesting. I have some ideas on how this could be related to learning curve exponents
8. guillefix 09 Nov 2020
  
  in Public
  
  We sample the default mixture of easy, medium, and hard problems, without a progressive curriculum.
  
  Did they look if curriculum learning had any effect on the learning curves?
9. guillefix 09 Nov 2020
  
  in Public
  
  context length of 3200 tokens per image/caption pair
  
  isn't that the total length of an example? I thought the context was the part given before the token to be predicted?
10. guillefix 09 Nov 2020
  
  in Public
  
  We revisit the question “Is a picture worth a thousand words?” by comparing the information-content of textual captions to the image/text mutual information
  
  I think an issue with their analysis is that a picture's caption in a standard dataset does not capture all the info derivable from a picture
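Returning to the first annotation on this paper (the expansion \(N^{-\alpha_N} \approx 1 - \alpha_N \log N\)): a quick numerical check of how accurate the first-order expansion is. The value of `alpha` is illustrative only and is not a fitted exponent from the paper; the function names are made up here.

```python
import math

def power_law_term(n: float, alpha: float) -> float:
    """Exact value of N^(-alpha)."""
    return n ** (-alpha)

def first_order(n: float, alpha: float) -> float:
    """First-order Taylor expansion of exp(-alpha * ln N) around alpha * ln N = 0."""
    return 1.0 - alpha * math.log(n)

alpha = 0.01  # illustrative small exponent (assumption)
for n in (1e3, 1e6, 1e9):
    x = alpha * math.log(n)
    exact = power_law_term(n, alpha)
    approx = first_order(n, alpha)
    # The truncation error is non-negative and bounded by x^2 / 2 for x >= 0.
    print(f"N={n:.0e}  alpha*lnN={x:.3f}  exact={exact:.4f}  approx={approx:.4f}")
```

The expansion is only trustworthy while \(\alpha_N \log N\) stays small. In that same regime the \(N\)-dependent part is a small correction to the constant term, which is the concern the annotation raises.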

Annotators

guillefix

URL

arxiv.org/pdf/2010.14701.pdf
arxiv.org

MoGlow: Probabilistic and controllable motion synthesis using normalising flows

1
1. guillefix 08 Nov 2020
  
  in Public
  
  but we will only apply it along the time dimension t.
  
  what do you mean? I thought you were applying the normalizing flow at each time step individually, not convolving over time

Annotators

guillefix

URL

arxiv.org/pdf/1905.06598.pdf
arxiv.org

1906.05271.pdf

1
1. guillefix 07 Nov 2020
  
  in Public
  
  The key point of this work is that based on observing a single sample from a subpopulation, it is impossible to distinguish samples from “borderline” populations from those in the “outlier” ones. Therefore an algorithm can only avoid the risk of missing “borderline” subpopulations by also memorizing examples from the “outlier” subpopulations.
  
  I just find it weird that we have to offer so much justification for fitting to 0 error, when I don't see much reason to believe it isn't a good idea?

Annotators

guillefix

URL

arxiv.org/pdf/1906.05271
arxiv.org

Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

3
1. guillefix 06 Nov 2020
  
  in Public
  
  e over parameters and the function-space posterior covariance. Red indicates the under-parameterized setting, yellow the critical regime with p≈n, and green the over-parameterized regime.
  
  isn't it the other way? Red is over-parametrized and green is under-parametrized?
2. guillefix 06 Nov 2020
  
  in Public
  
  We see wide but shallow models overfit, providing low train loss, but high test loss and high effective dimensionality.
  
  it seems like it's mostly the number of parameters not the aspect ratio which determines the generalization performance? So that depth is not intrinsically helping generalization?
3. guillefix 06 Nov 2020
  
  in Public
  
  subspace and ensembling methods could be improved through the avoidance of expensive computations within degenerate parameter regimes
  
  but how do you make sure you are sampling with the right probabilities?

Annotators

guillefix

URL

arxiv.org/pdf/2003.02139.pdf
arxiv.org

Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks

4
1. guillefix 04 Nov 2020
  
  in Public
  
  w
  
  this should be transposed
2. guillefix 03 Nov 2020
  
  in Public
  
  Our theory again perfectly fits the experiments.
  
  well you can see some deviations in this NN, probably because of the smaller width
3. guillefix 03 Nov 2020
  
  in Public
  
  K
  
  i think here it should be \(\kappa_{\text{NTK}}\)
4. guillefix 03 Nov 2020
  
  in Public
  
  marginal training data point causes greater reduction in relative error for low frequency modes than for high frequency modes.
  
  isn't this the opposite of what you said earlier??
  
  "the marginal training data point causes a greater percent reduction in generalization error for modes with larger RKHS eigenvalues."

Annotators

guillefix

URL

arxiv.org/pdf/2002.02561.pdf
Oct 2020
arxiv.org

1701.06538.pdf

1
1. guillefix 31 Oct 2020
  
  in Public
  
  Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches.
  
  so the activations for the set of samples which use expert k should be sent to the right device which has expert k, right?
  
  how much communication overhead is this?
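A rough way to reason about the question above, assuming each expert lives on exactly one device and each routed example costs one activation transfer whenever its source device differs from the expert's device. This is a counting sketch with hypothetical names, not the paper's implementation, and it ignores the return trip of the expert outputs.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

def dispatch_to_experts(
    device_batches: List[List[Hashable]],
    assignments: Dict[Hashable, List[int]],
    expert_device: Dict[int, int],
) -> Tuple[Dict[int, List[Hashable]], int]:
    """Group examples by expert and count cross-device activation transfers.

    device_batches[d] lists the example ids held by data-parallel device d.
    assignments[ex] lists the experts chosen by the gate for example ex.
    expert_device[k] is the device hosting expert k (assumed layout).
    """
    expert_batches: Dict[int, List[Hashable]] = defaultdict(list)
    remote_transfers = 0
    for device, batch in enumerate(device_batches):
        for ex in batch:
            for k in assignments[ex]:
                expert_batches[k].append(ex)
                if expert_device[k] != device:
                    remote_transfers += 1
    return dict(expert_batches), remote_transfers
```

Under this model, with top-k gating and experts spread evenly over D devices, roughly a fraction (D - 1)/D of the routed activations would cross devices if routing were uniform; the real cost also depends on the activation size and the interconnect.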

Annotators

guillefix

URL

arxiv.org/pdf/1701.06538
localhost:8000

Bayesian Deep Learning and a Probabilistic Perspective of Generalization

2
1. guillefix 27 Oct 2020
  
  in Public
  
  A prior over parameters p(w) combines with the functional form of a model f(x;w) to induce a distribution over functions p(f(x;w)). It is this distribution over functions that controls the generalization properties of the model; the prior over parameters, in isolation, has no meaning.
  
  Yep this is what we say in our paper too^^ https://arxiv.org/abs/1805.08522
2. guillefix 27 Oct 2020
  
  in Public
  
  Distance between the true predictive distribution and the approximation
  
  you mean something like minus the distance? because you want this distance to be smaller for better approximations?

Annotators

guillefix

URL

localhost:8000/docs/2002.08791.pdf
academic.oup.com

On the marginal likelihood and cross-validation

1
1. guillefix 26 Oct 2020
  
  in Public
  
  coherent
  
  coherent here just means that it will approach the true distribution eventually?

Annotators

guillefix

URL

academic.oup.com/biomet/article/107/2/489/5715611
arxiv.org

Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

1
1. guillefix 26 Oct 2020
  
  in Public
  
  As the effective dimensionality increases, so does the dimensionality of parameter space in which the posterior variance has contracted.
  
  can you not have very confident models which are making wrong predictions?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/2003.02139.pdf
arxiv.org arxiv.org

2010.11924.pdf

1
1. guillefix 24 Oct 2020
  
  in Public
  
  In the notation of Section 3, points ω ∈ Ω represent possible samples. In our setting, each sample represents a complete record of a machine learning experiment. An environment e specifies a distribution P^e on the space Ω of complete records. In the setting of supervised deep learning, a complete record of an experiment would specify hyperparameters, random seeds, optimizers, training (and held out) data, etc.
  
  so each e represents an "experiment", which is a range/distribution of hyperparameters (or what they call a complete record of a machine learning experiment)
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/2010.11924.pdf
arxiv.org arxiv.org

1812.06162.pdf

1
1. guillefix 21 Oct 2020
  
  in Public
  
  We measure a simple empirical statistic, the gradient noise scale (essentially a measure of the signal-to-noise ratio of gradient across training examples), and show that it can approximately predict the largest efficient batch size for a wide range of tasks
  
  how is this related to the difficulty of the task?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1812.06162.pdf
arxiv.org arxiv.org

2001.08361.pdf

9
1. guillefix 21 Oct 2020
  
  in Public
  
  non-zero entropy
  
  what about entropy rate?
2. guillefix 20 Oct 2020
  
  in Public
  
  overfitting
  
  OK, I THINK THEY ARE DEFINING OVERFITTING in the agnostic learning sense of \(L(f)-\min_{f'\in F}L(f')\): how badly am I doing relative to the best in the class!
3. guillefix 20 Oct 2020
  
  in Public
  
  we stop training early when the test loss ceases to improve and optimize all models in the same way
  
  didn't they say earlier that they train for a fixed number of steps?
4. guillefix 20 Oct 2020
  
  in Public
  
  N increases and the model begins to overfit
  
  well the increased overfitting is only visible in the smallest data size
5. guillefix 20 Oct 2020
  
  in Public
  
  S
  
  should be N?
6. guillefix 20 Oct 2020
  
  in Public
  
  We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence
  
  no overfitting^^ even for transfer learning
7. guillefix 20 Oct 2020
  
  in Public
  
  Although these models have been trained on the WebText2 dataset, their test loss on a variety of other datasets is also a power-law in N with nearly identical power, as shown in Figure 8.
  
  probably significantly different datasets will show different power laws. The different datasets looked at here seem quite similar
8. guillefix 20 Oct 2020
  
  in Public
  
  (approximately twice the compute as the forwards pass)
  
  why?
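One common accounting that answers this, sketched here as an assumption rather than the paper's own derivation: for a dense layer Y = XW, the forward pass is one matrix multiply, while the backward pass needs two of the same size (one for the gradient with respect to the inputs, one for the gradient with respect to the weights). The function name `linear_layer_flops` is hypothetical and the count ignores biases, nonlinearities and attention-specific terms.

```python
def linear_layer_flops(batch, d_in, d_out):
    """Rough FLOP count for one dense layer Y = X @ W (illustrative only).

    Assumes a multiply-add costs 2 FLOPs and ignores biases/activations.
    """
    matmul = 2 * batch * d_in * d_out  # cost of one (batch x d_in) @ (d_in x d_out)
    forward = matmul                   # Y  = X @ W
    backward = 2 * matmul              # dX = dY @ W.T  and  dW = X.T @ dY
    return forward, backward


forward, backward = linear_layer_flops(batch=8, d_in=512, d_out=2048)
print(backward / forward)
```

Under these assumptions the ratio is the same for any layer shape, which is why it carries over to the whole network.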
9. guillefix 20 Oct 2020
  
  in Public
  
  To utilize both training time and compute as effectively as possible, it is best to train with a batch size B ≈ B_crit
  
  because above B_crit you can reduce time, but with increasing compute cost (diminishing returns)
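The trade-off in the note above can be made concrete with the relation from the gradient-noise-scale analysis, (S/S_min − 1)(E/E_min − 1) = 1, with B_crit = E_min/S_min. The sketch below is a minimal illustration under that assumed relation; `training_cost` and the numbers used are hypothetical, not values from the paper.

```python
def training_cost(batch_size, b_crit, s_min):
    """Steps and examples needed to reach a fixed loss at a given batch size.

    Assumes S = S_min * (1 + B_crit / B) and E = B * S, so that
    E = E_min * (1 + B / B_crit) with E_min = B_crit * S_min.
    """
    steps = s_min * (1.0 + b_crit / batch_size)  # serial time
    examples = batch_size * steps                # total compute (in examples)
    return steps, examples


for b in (10, 100, 1000):
    steps, examples = training_cost(b, b_crit=100, s_min=1000)
    print(b, steps, examples)
```

At B = B_crit both the step count and the example count are twice their minimum; pushing B far above B_crit barely reduces steps while the example count keeps growing roughly linearly.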
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/2001.08361
Jul 2020
arxiv.org arxiv.org

Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks

1
1. guillefix 22 Jul 2020
  
  in Public
  
  standard deviations
  
  but are these standard deviations of the means?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/2002.02561.pdf
Jun 2020
arxiv.org arxiv.org

1906.05301.pdf

1
1. guillefix 09 Jun 2020
  
  in Public
  
  (x;W1,...,Wl,b1,...,bl)
  
  it should depend on \(W^{l+1}\) and \(b^{l+1}\) too
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1906.05301.pdf
arxiv.org arxiv.org

1705.08741.pdf

1
1. guillefix 03 Jun 2020
  
  in Public
  
  Naturally, such an increase in the learning rate also increases the mean steps E[∆w]. However, we found that this effect is negligible since E[∆w] is typically orders of magnitude lower than the standard deviation.
  
  Interesting. This is why the intuition that increasing the learning rate would decrease the number of updates is probably not true, because what seems to determine the number of steps is the amount of noise!
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1705.08741.pdf
May 2020
arxiv.org arxiv.org

1705.08741.pdf

1
1. guillefix 29 May 2020
  
  in Public
  
  −
  
  +
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1705.08741.pdf
openreview.net openreview.net

how_noise_affects_the_hessian_spectrum_in_overparameterized_neural_networks.pdf

1
1. guillefix 29 May 2020
  
  in Public
  
  \(\langle O(\bar{\theta})\rangle = \langle [[\, O[\bar{\theta} - \eta \bar{\nabla} L_B(\theta)] \,]]_{\mathrm{m.b.}} \rangle\).
  
  this is missing some time indices?
Visit annotations in context

Annotators

guillefix

URL

openreview.net/pdf
arxiv.org arxiv.org

2002.09956.pdf

8
1. guillefix 27 May 2020
  
  in Public
  
  We omit the dβexp (−cγ) +bβlog(1δ)n term since it does not change with change in random labels.
  
  how can we be sure it is non-vacuous then? hmm
2. guillefix 27 May 2020
  
  in Public
  
  while  ̃Hθ†l,φ[j,j] can change based on α-scaling [Dinh et al., 2017], the effective curvature is scale invariant
  
  do you mean because you change \(\sigma\) too? Was that what Dinh et al. were talking about? Or just the fact that there are other \(\theta\) (not reparametrizing, just finding a new \(\theta\)) which have high curvature, but produce the same function?
3. guillefix 27 May 2020
  
  in Public
  
  (f) stays valid for the test error rate in (a)
  
  if you take into account the spread in (f) and (a) it would seem that for some runs the upper bound isn't valid?
4. guillefix 27 May 2020
  
  in Public
  
  Then, based on the ‘fast rate’ PAC-Bayes bound as before, we have the following result
  
  the posterior Q is a strange posterior over hypotheses. How do they take the KL divergence with the prior, given that the posterior is defined by two parameters (\(\theta_\rho\) and \(\theta\))?
5. guillefix 20 May 2020
  
  in Public
  
  Further, all the diagonal elements decrease as more samples are used for training.
  
  Really? That sounds surprising!
  
  I would have expected that as more training samples are added the parameters get more constrained (if the number of parameters is kept fixed).
6. guillefix 20 May 2020
  
  in Public
  
  Theorem 1
  
  Derandomization of the margin loss
7. guillefix 20 May 2020
  
  in Public
  
  The bound provides a concrete realization of the notion of ‘flatness’ in deep nets [Smith and Le, 2018, Hochreiter and Schmidhuber, 1997, Keskar et al., 2017] and illustrates a trade-off between curvature and distance from initialization.
  
  is there evidence that distance from initialization anti-correlates with generalization? Even evidence for sharpness <> generalization isn't very strong.
8. guillefix 20 May 2020
  
  in Public
  
  In spite of the dependency on the Hessian diagonal elements, which can be changed based on re-parameterization without changing the function [Smith and Le, 2018, Dinh et al., 2017], the bound itself is scale invariant since KL-divergence is invariant to such re-parameterizations Kleeman [2011], Li et al. [2019].
  
  I thought Dinh's criticism wasn't so much about reparametrization, but about the fact that there are other minima which are sharper but give the same function. KL wouldn't be invariant to that, as you aren't changing the prior in that case?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/2002.09956.pdf
Apr 2020
arxiv.org arxiv.org

1602.02389.pdf

1
1. guillefix 24 Apr 2020
  
  in Public
  
  ∈Ck
  
  this sum was over all points in the training set in the previous step, and now it's over all points?
  
  Just think of the case where the partition C_i is made up of singletons, one for each possible point. Then, the robustness would be zero, but the generalization error bound doesn't seem right then.
  
  This made me suspect there may be something wrong, and I think it could be at this step. If we kept the sum to be over training sets, we couldn't upper bound the result by the max in the next two lines, I think!
  
  mistake
Visit annotations in context

Tags

mistake

Annotators

guillefix

URL

arxiv.org/pdf/1602.02389.pdf
Mar 2020
www.nature.com www.nature.com

Complexity control by gradient descent in deep networks

6
1. guillefix 30 Mar 2020
  
  in Public
  
  because of the softmax operation.
  
  more like because of the Heaviside operation
2. guillefix 30 Mar 2020
  
  in Public
  
  the signs of \(f\) and \(\tilde{f}\) are the same.
  
  and therefore the classification functions are the same
3. guillefix 30 Mar 2020
  
  in Public
  
  \(\tilde{f}\) as \(f_V=\rho \tilde{f}\),
  
  this is confusing, is \(f_V\) or \(\tilde{f}\) the normalized network?
4. guillefix 30 Mar 2020
  
  in Public
  
  Our main results should also hold for SGD.
  
  Will this be commented on in more detail?
5. guillefix 30 Mar 2020
  
  in Public
  
  normalized weights Vk as the variables of interest
  
  Can we even reparametrize to the normalized weights? For homogeneous networks, it's obvious that we can. But for ReLU networks with biases it's less obvious. If one multiplies the biases by constants that grow exponentially with weight, the function is left invariant. We can always do this until the parameter vector is left normalized. Therefore we can reparametrize to the normalized vectors even with biases, but dunno if they consider this case here.
6. guillefix 30 Mar 2020
  
  in Public
  
  This mechanism underlies regularization in deep networks for exponential losses
  
  we cannot say this until we know more. Is this the reason why they generalize? Is this even sufficient to explain their generalization?
Visit annotations in context

Annotators

guillefix

URL

nature.com/articles/s41467-020-14663-9
arxiv.org arxiv.org

Language as a Cognitive Tool to Imagine Goals in Curiosity Driven Exploration

1
1. guillefix 26 Mar 2020
  
  in Public
  
  Bahdanau et al. (2019) learn a reward function jointly with the action policy but does so using an external expert dataset whereas our agent uses trajectories collected through its own exploration
  
  Yeah what they do here is similar to IRL, in that we are trying to learn a human NL-conditioned reward function, but we do it via supervision, rather than demonstration. More similar to the work on "learning from human preferences"
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/2002.09253.pdf
arxiv.org arxiv.org

1807.09936.pdf

3
1. guillefix 25 Mar 2020
  
  in Public
  
  other agents
  
  which share the same policy right? otherwise it would be off-policy experience?
2. guillefix 25 Mar 2020
  
  in Public
  
  Zero Sum
  
  don't understand this one
3. guillefix 25 Mar 2020
  
  in Public
  
  specific choice of λ
  
  here, a specific choice of \(\lambda\) can determine which solutions among the many which satisfy the constraint we choose. Similarly to the choice of convex regularizer in the GAIL paper
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1807.09936.pdf
Local file Local file

A review of the double descent phenomenon in neural networks and an attempt of interpretation

1
1. guillefix 23 Mar 2020
  
  in Public
  
  z(xi)z(xj)|h
  
  RHS depends on h, but LHS doesn't?
Annotators

guillefix
www.aaai.org www.aaai.org

Maximum Entropy Inverse Reinforcement Learning

2
1. guillefix 09 Mar 2020
  
  in Public
  
  The problem with the max entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions, without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!
  
  The question is what would be the maximum entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3 in Ziebart 2012
2. guillefix 09 Mar 2020
  
  in Public
  
  Z(θ)
  
  Remember the partition function sums over trajectories which are compatible with the MDP dynamics only.
  
  Trajectories incompatible with the dynamics have probability 0 of course
Visit annotations in context

Annotators

guillefix

URL

aaai.org/Papers/AAAI/2008/AAAI08-227.pdf
www.cs.cmu.edu www.cs.cmu.edu

maximum-causal-entropy.pdf

2
1. guillefix 09 Mar 2020
  
  in Public
  
  Ziebart et al. (2008)
  
  The problem with the max entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions, without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!
  
  The question is what would be the maximum entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3
2. guillefix 09 Mar 2020
  
  in Public
  
  \(e^{\theta^\top F(X,Y)}\)
  
  this is P(Y|X), right? but it should be P(Y|X,Y_{1:t-1})?
Visit annotations in context

Annotators

guillefix

URL

cs.cmu.edu/~bziebart/publications/maximum-causal-entropy.pdf
www.semanticscholar.org www.semanticscholar.org

1606.03476.pdf

1
1. guillefix 02 Mar 2020
  
  in Public
  
  without interaction with the expert
  
  how do things change when you can interact with the expert?
Visit annotations in context

Annotators

guillefix

URL

semanticscholar.org/reader/4ab53de69372ec2cd2d90c126b6a100165dc8ed1
Feb 2020
arxiv.org arxiv.org

2001.08361.pdf

5
1. guillefix 14 Feb 2020
  
  in Public
  
  Attention: Mask
  
  by this, do they mean the attention weighted aggregation step?
2. guillefix 14 Feb 2020
  
  in Public
  
  \(n_{\text{layer}}\, d_{\text{model}}\, 3 d_{\text{attn}}\)
  
  are they ignoring the \(W^O\) matrix? from the original Transformer paper?
3. guillefix 11 Feb 2020
  
  in Public
  
  Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4)
  
  hmm interesting. why are larger models more sample efficient?
4. guillefix 11 Feb 2020
  
  in Public
  
  The performance penalty depends predictably on the ratio \(N^{0.74}/D\)
  
  That is weird, what's the origin of this?
5. guillefix 11 Feb 2020
  
  in Public
  
  hmm do they look at generalization gap?
  
  is the trend on test loss due to parameter count mostly due to its effect on expressivity / training loss (similarly with compute)?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/2001.08361
Local file Local file

Robustness.pdf

4
1. guillefix 10 Feb 2020
  
  in Public
  
  Some preliminary numerical simulations show that this approach does predict high robustness and log scaling. However, it only makes any sense if transitions from one phenotype to another phenotype are memoryless.
  
  I thought the whole transition matrix approach itself assumed memorylessness
2. guillefix 10 Feb 2020
  
  in Public
  
  Let P be a row vector specifying the probability distribution over phenotypes. We want to find a stochastic transition matrix M (rows sum to one) such that
  
  why do we want P to be stationary?
3. guillefix 10 Feb 2020
  
  in Public
  
  M has 1s on the diagonals, and 0s elsewhere, for example
  
  that is high robustness right?
4. guillefix 10 Feb 2020
  
  in Public
  
  Fano’s inequality)
  
  doesn't Fano's inequality give H(X|Y) on the numerator, which is a lower bound on H(X), and so doesn't imply this?
Annotators

guillefix
arxiv.org arxiv.org

1911.03219.pdf

2
1. guillefix 01 Feb 2020
  
  in Public
  
  Intrinsic motivations f
  
  Basically the idea is that the RL/HER part is intrinsically motivated with LP to solve more and more tasks, while the goal sampling part is intrinsically motivated to get trajectories that give new information to learn the reward function. I suppose they could add a bit of LP to the goal sampling as well to have some tendency to sample trajectories that may help to solve new tasks.
2. guillefix 01 Feb 2020
  
  in Public
  
  High-quality trajectories are trajectories where the agent collects descriptions from the social partner for goals that are rarely reached.
  
  why do you want more than one description for a goal? A: Ah, because the goal will be the same but the final state may not be for each of these trajectories, thus giving more data to train the reward function.
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1911.03219.pdf
Jan 2020
arxiv.org arxiv.org

1908.06663.pdf

1
1. guillefix 24 Jan 2020
  
  in Public
  
  f memory-based sample efficient methods
  
  bandit methods, which are suitable for sequences of independent experiments
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1908.06663.pdf
arxiv.org arxiv.org

1802.09464.pdf

1
1. guillefix 18 Jan 2020
  
  in Public
  
  We find that the object geometry makes a significant differences in how hard the problem is
  
  apply some goal exploration process like POET?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1802.09464.pdf
arxiv.org arxiv.org

Understanding Priors in Bayesian Neural Networks at the Unit Level

1
1. guillefix 16 Jan 2020
  
  in Public
  
  When it comes to NNs, the regularization mechanism is also well appreciated in the literature, since they traditionally suffer from overparameterization, resulting in overfitting.
  
  No. Overparametrized networks have been shown to generalize even without explicit regularization (Zhang et al. 2017)
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1810.05193.pdf
arxiv.org arxiv.org

1912.02178.pdf

1
1. guillefix 16 Jan 2020
  
  in Public
  
  Therefore, we can get the following generalization bound:
  
  as long as the value of L is bounded by at most 1/delta or something right?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1912.02178.pdf
arxiv.org arxiv.org

Untitled document

1
1. guillefix 15 Jan 2020
  
  in Public
  
  They use on-average stability that does not imply generalization bounds with high probability
  
  Their bounds on expectations can be converted to bounds with high probability, as they claim in page 3, citing "Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010."
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1902.10710.pdf
arxiv.org arxiv.org

1703.01678.pdf

4
1. guillefix 14 Jan 2020
  
  in Public
  
  for \(T \le m\) step
  
  one pass SGD
2. guillefix 13 Jan 2020
  
  in Public
  
  validation error which is used as an empirical estimate for \(R(w_1)\)
  
  so their bound has the disadvantage that it needs an estimate given by the validation error to compute the bound! So it can't be computed from the training data alone!!
3. guillefix 13 Jan 2020
  
  in Public
  
  our bound corroborates the intuition that whenever we start at a good location of the objective function, the algorithm is more stable and thus generalizes better.
  
  This is a nice intuition for why good initializations can lead to good generalization
4. guillefix 13 Jan 2020
  
  in Public
  
  \(R(w_1) - R^\star\)
  
  remember that \(R\) is the population risk, so this isn't a priori something that we can know?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1703.01678.pdf
Dec 2019
arxiv.org arxiv.org

()

1
1. guillefix 28 Dec 2019
  
  in Public
  
  While it is known having a finite VC-dimension (Vapnik and Chervonenkis, 1991) or equivalently being CVEEEloo stable (Mukherjee et al., 2006) is necessary and sufficient for the Empirical Risk Minimization (ERM) to generalize,
  
  it is only necessary to generalize in the worst case over data distributions right?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1005.2243.pdf
www.ronja-tutorials.com www.ronja-tutorials.com

Simple color

1
1. guillefix 27 Dec 2019
  
  in Public
  
  position attribute.
  
  what is an attribute?
Visit annotations in context

Annotators

guillefix

URL

ronja-tutorials.com/2018/03/21/simple-color.html
arxiv.org arxiv.org

1706.08947.pdf

1
1. guillefix 21 Dec 2019
  
  in Public
  
  The bounds based on ℓ2-path norm and spectral norm can be derived directly from those based on ℓ1-path norm and ℓ2 norm respectively
  
  Hmm. how?
  
  This implies that even though the l2 path norms seem non-vacuous on Figure 1, they aren't. They appear so, because we have dropped the "terms that only depend on depth or number of hidden units", which are large for l2-path norm
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1706.08947.pdf
arxiv.org arxiv.org

1910.08720.pdf

1
1. guillefix 16 Dec 2019
  
  in Public
  
  ExperimentsIn
  
  experiments only in 2 dimensional input space. Could results depend on the input dimensionality?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1910.08720.pdf
Nov 2019
arxiv.org arxiv.org

1905.10843.pdf

10
1. guillefix 28 Nov 2019
  
  in Public
  
  \(\min(\alpha_T - d,\, 2\alpha_S)\)
  
  the min is because depending on which is larger one or the other of the two limits of the integral, dominates
2. guillefix 28 Nov 2019
  
  in Public
  
  29
  
  Compare this to the analysis of Sollich ( https://pdfs.semanticscholar.org/7294/862e59c8c3a65167260c0156427f4757c67e.pdf ) which is in the well-specified setting. There there's no dependence on the labels of the training data. Here neither, but at least there's dependence on the distribution of the target labels, so that it allows for more general types of assumptions.
3. guillefix 25 Nov 2019
  
  in Public
  
  K(x)is an even
  
  which can be seen from its definition as a covariance.
4. guillefix 25 Nov 2019
  
  in Public
  
  of a Teacher Gaussian process with covariance \(K_T\) and assume that they lie in the RKHS of the Student kernel \(K_S\), namely
  
  ah yes, being in RKHS means having a finite norm in the RKHS, which makes sense. But not sure how restrictive this is, just like I'm not sure if simply being n-times differentiable is a good measure of complexity of the function. Are there n-times differentiable functions that approximate any less smooth function? Maybe Lipschitz constant of derivatives (smoothness constants) could be more quantitatively useful?
5. guillefix 25 Nov 2019
  
  in Public
  
  If both kernels are Laplace kernels then \(\alpha_T = \alpha_S = d + 1\) and \(E_{\mathrm{MSE}} \sim n^{-1/d}\), which scales very slowly with the dataset size in large dimensions. If the Teacher is a Gaussian kernel (\(\alpha_T = \infty\)) and the Student is a Laplace kernel then \(\beta = 2(1 + 1/d)\), leading to \(\beta \to 2\) as \(d \to \infty\)
  
  hm, wait what? But wouldn't the Bayes optimal answer be obtained if the student has the same kernel as the teacher?
6. guillefix 25 Nov 2019
  
  in Public
  
  as \(n\to\infty\)
7. guillefix 25 Nov 2019
  
  in Public
  
  We perform kernel classification via the algorithm soft-margin SVM.
  
  which approximates a point estimator of the Gaussian process classifier, but I don't know the exact relation.
8. guillefix 25 Nov 2019
  
  in Public
  
  man
  
  mean
  
  typo
9. guillefix 25 Nov 2019
  
  in Public
  
  Importantly (i) Eq. (1) leads to a prediction for \(\beta(d)\) that accurately matches our numerical study for random training data points, leading to the conjecture that Eq. (1) holds in that case as well.
  
  Compare with: https://arxiv.org/pdf/1909.11500.pdf where they find that random inputs give rise to plateaus, hmm at least with epochs, but they cite papers where these are apparently found for training set size (perhaps only for thin networks?)
10. guillefix 22 Nov 2019
  
  in Public
  
  As a result, various works on kernel regression make the much stronger assumption that the training points are sampled from a target function that belongs to the reproducing kernel Hilbert space (RKHS) of the kernel (see for example [Smola et al., 1998]). With this assumption \(\beta\) does not depend on \(d\) (for instance in [Rudi and Rosasco, 2017] \(\beta = 1/2\) is guaranteed). Yet, RKHS is a very strong assumption which requires the smoothness of the target function to increase with \(d\) [Bach, 2017] (see more on this point below), which may not be realistic in large dimensions.
  
  I think when they say "it belongs to an RKHS", they mean that it does so with a fixed/bounded norm (otherwise almost any function would satisfy this, for universal RKHSs). This is consistent with the next comment saying, that this assumption implies smoothness (smoothness<>small RKHS norm generally)
Visit annotations in context

Tags

typo

Annotators

guillefix

URL

arxiv.org/pdf/1905.10843.pdf
openreview.net openreview.net

pdf

1
1. guillefix 25 Nov 2019
  
  in Public
  
  Seems like PPO works better than their approach in several of the experiments. Hmm
Visit annotations in context

Annotators

guillefix

URL

openreview.net/pdf
arxiv.org arxiv.org

1712.00409.pdf

1
1. guillefix 22 Nov 2019
  
  in Public
  
  irreducible error (e.g., Bayes error)
  
  more commonly model capacity limitations I guess?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1712.00409.pdf
arxiv.org arxiv.org

1910.07224.pdf

1
1. guillefix 20 Nov 2019
  
  in Public
  
  GMM on a dataset of previously sampled parameters concatenated to their respective ALP measure.
  
  the GMM is only fitted to the parameter part or the (parameter, ALP) vector?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1910.07224.pdf
www.ki.tu-berlin.de www.ki.tu-berlin.de

262-A1677

2
1. guillefix 19 Nov 2019
  
  in Public
  
  nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher so that perfect generalization is still impossible. For α slightly above α_c only the couplings of the teacher survive.
  
  Lenka Zdeborová, Florent Krzakala have found that at the capacity threshold, algorithms tend to have the longest running times, i.e. the worst-case examples seem to live at that transition
2. guillefix 19 Nov 2019
  
  in Public
  
  For a committee of two students it can be shown that when the number of examples is large, the information gain does not decrease but reaches a positive constant. This results in a much faster decrease of the generalization error. Instead of being inversely proportional to the number of examples, the decrease is now exponentially fast
  
  For the case of the perceptron you can see how the uncertainty region (whose volume approximates the generalization error) approximately halves (or is reduced by about a constant) after every optimal query.
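The halving argument in the note above can be illustrated with a one-dimensional analogue: learning a threshold on [0, 1] by always querying the midpoint of the region still consistent with the answers. This is only a toy sketch under that simplifying assumption, not the high-dimensional perceptron from the paper, and `query_learning` is a hypothetical name.

```python
def query_learning(teacher_threshold, n_queries):
    """Learn a 1-D threshold classifier by bisection queries.

    The teacher labels x as positive iff x >= teacher_threshold.
    Returns the width of the uncertainty interval after each query.
    """
    lo, hi = 0.0, 1.0  # the threshold is known to lie in [lo, hi]
    widths = []
    for _ in range(n_queries):
        x = 0.5 * (lo + hi)             # most informative query: the midpoint
        if x >= teacher_threshold:      # teacher answers "positive"
            hi = x
        else:                           # teacher answers "negative"
            lo = x
        widths.append(hi - lo)
    return widths


print(query_learning(teacher_threshold=0.3, n_queries=5))
```

Each answer removes exactly half of the remaining interval, so the uncertainty shrinks geometrically in the number of queries, in contrast with the roughly inverse-linear shrinkage expected from randomly drawn examples.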
Visit annotations in context

Annotators

guillefix

URL

ki.tu-berlin.de/fileadmin/fg135/publikationen/opper/Op01.pdf
incompleteideas.net incompleteideas.net

RLbook2018.pdf

1
1. guillefix 09 Nov 2019
  
  in Public
  
  In general, the baseline leaves the expected value of the update unchanged, but it can have a large
  
  because baseline depends on S, it can reduce the variance from state to state (not the one from action to action).
  
  WRONG: IT can reduce the action to action variance of the gradient (not the variance of the advantage!)
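A small exact calculation makes the corrected statement concrete. The sketch below assumes a single-state softmax policy over three actions (so a state-dependent baseline is just a constant) and enumerates actions instead of sampling; `reinforce_moments` and the reward values are illustrative choices, not from the book.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_moments(logits, rewards, baseline):
    """Exact mean and total variance of the REINFORCE gradient estimate.

    The single-sample estimate for action a is
    (r(a) - baseline) * grad_logits log pi(a), with
    grad_logits log pi(a) = onehot(a) - pi for a softmax policy.
    """
    probs = softmax(logits)
    n = len(probs)
    grads = []
    for a in range(n):
        score = [(1.0 if i == a else 0.0) - probs[i] for i in range(n)]
        grads.append([(rewards[a] - baseline) * s for s in score])
    # Expectation over actions drawn from the policy.
    mean = [sum(probs[a] * grads[a][i] for a in range(n)) for i in range(n)]
    # Trace of the covariance matrix of the estimate.
    var = sum(
        probs[a] * sum((grads[a][i] - mean[i]) ** 2 for i in range(n))
        for a in range(n)
    )
    return mean, var


rewards = [10.0, 11.0, 12.0]
print(reinforce_moments([0.0, 0.0, 0.0], rewards, baseline=0.0))
print(reinforce_moments([0.0, 0.0, 0.0], rewards, baseline=11.0))
```

The two calls give the same mean gradient, while the action-to-action variance of the estimate is far smaller with the baseline set to the expected reward.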
Visit annotations in context

Annotators

guillefix

URL

incompleteideas.net/sutton/book/RLbook2018.pdf
Oct 2019
arxiv.org arxiv.org

1710.11029.pdf

5
1. guillefix 30 Oct 2019
  
  in Public
  
  computevar1bbÂj
  
  this is the covariance matrix
2. guillefix 30 Oct 2019
  
  in Public
  
  This suggests that the effect of j(x) is to rotate the gradient field and move the critical points, also seen in Fig. 4b.
  
  how does this equation suggest this?
3. guillefix 30 Oct 2019
  
  in Public
  
  sampling with replacement has better regularization
  
  but you are saying that the temperature (\(\beta^{-1}\)) is lower when you sample with replacement, so that the regularization should be less?
4. guillefix 25 Oct 2019
  
  in Public
  
  conservative
  
  how does this mean that it is conservative?
5. guillefix 25 Oct 2019
  
  in Public
  
  This implies that SGD implicitly performs variational inference with a uniform prior, albeit of a different loss than the one used to compute back-propagation gradients
  
  The interpretation of doing variational inference with a uniform prior is because if we interpret the minimization objective as an ELBO, the second term is like the KL divergence between the approximate posterior and a uniform prior (which just gives the entropy). Nice
  
  If \(\rho\) doesn't have any constraints then this should give the exact posterior with uniform prior, and likelihood given by \(\Phi(x)\)
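The two observations above can be written out as a short derivation. It assumes the objective has the free-energy form \(\mathbb{E}_{x\sim\rho}[\Phi(x)] - \beta^{-1}H(\rho)\) and that the uniform prior \(u\) lives on a bounded domain of volume \(V\); both the symbol \(u\) and the bounded-domain assumption are introduced here for illustration.

```latex
\begin{align}
\mathrm{KL}(\rho \,\|\, u)
  &= \int \rho(x)\,\log\frac{\rho(x)}{1/V}\,dx
   = -H(\rho) + \log V, \\
\mathbb{E}_{x\sim\rho}[\Phi(x)] - \beta^{-1} H(\rho)
  &= \mathbb{E}_{x\sim\rho}[\Phi(x)]
   + \beta^{-1}\,\mathrm{KL}(\rho \,\|\, u)
   - \beta^{-1}\log V, \\
\rho^{\star}(x)
  &= \frac{e^{-\beta \Phi(x)}}{\int e^{-\beta \Phi(x')}\,dx'}
  \qquad \text{(minimizer when } \rho \text{ is unconstrained)}.
\end{align}
```

The constant \(\beta^{-1}\log V\) does not affect the minimizer, so minimizing the free energy is the same as minimizing expected loss plus a KL term to the uniform prior, and the unconstrained optimum is the Gibbs posterior with likelihood proportional to \(e^{-\beta\Phi(x)}\).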
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1710.11029.pdf
arxiv.org arxiv.org

1708.02190.pdf

1
1. guillefix 26 Oct 2019
  
  in Public
  
  The second particularity is that since the computation of the reward \(R(p, c, \theta, o_\tau)\) is internal to the machine, it can be computed any time after the experiment \((c, \theta, o_\tau)\) and for any problem \(p \in \mathcal{P}\), not only the particular problem that the agent was trying to solve. Consequently, if the machine experiments a policy \(\theta\) in context \(c\) and observes \(o_\tau\) (e.g. trying to solve problem \(p_1\)), and stores the results \((c, \theta, o_\tau)\) of this experiment in its memory, then when later on it self-generates problems \(p_2, p_3, \ldots, p_i\) it can compute on the fly (and without new actual actions in the environment) the associated rewards \(R_{p_2}(c, \theta, o_\tau), R_{p_3}(c, \theta, o_\tau), \ldots, R_{p_i}(c, \theta, o_\tau)\) and use this information to improve over these goals \(p_2, p_3, \ldots, p_i\).
  
  like hindsight experience replay
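The analogy with hindsight relabeling can be sketched in a few lines: once (context, policy parameters, outcome) records are stored, rewards for any newly generated goal can be computed from memory without acting again. Everything below is a hypothetical toy, with scalar outcomes and a negative-distance reward standing in for the paper's internal reward function.

```python
def reward(goal, outcome):
    """Hypothetical internal reward: negative distance between goal and outcome."""
    return -abs(goal - outcome)


def rewards_for_goals(memory, goals):
    """Score every stored experiment against every goal, using memory only."""
    return {g: [reward(g, outcome) for (_, _, outcome) in memory] for g in goals}


def best_params_for_goal(memory, goal):
    """Pick the stored policy parameters whose outcome best matches the goal."""
    best_record = max(memory, key=lambda record: reward(goal, record[2]))
    return best_record[1]


# Each record: (context, policy parameters, observed outcome).
memory = [("c0", 0.1, 1.0), ("c0", 0.5, 3.0), ("c0", 0.9, 6.0)]

# Goals generated later; no new environment interaction is needed.
print(rewards_for_goals(memory, [3.0, 5.0]))
print(best_params_for_goal(memory, 5.0))
```

The design point is that the reward is a function of stored quantities only, which is what lets a single rollout inform many goals.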
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1708.02190.pdf
arxiv.org arxiv.org

1807.01521.pdf

1
1. guillefix 26 Oct 2019
  
  in Public
  
  Although methods to learn disentangled representation of the world exist [25,26,27], they do not allow to distinguish features that are controllable by the learner from features describing external phenomena that are outside the control of the agent.
  
  learning controllable features is similar to learning a causal model of the world I think
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1807.01521.pdf
arxiv.org arxiv.org

Untitled document

1
1. guillefix 24 Oct 2019
  
  in Public
  
  We find that the full NTK has better approximation properties compared to other function classes typically defined for ReLU activations [5, 13, 15], which arise for instance when only training the weights in the last layer, or when considering Gaussian process limits of ReLU networks (e.g., [20, 24, 32]).
  
  NTK has "better approximation properties". What do they mean more precisely?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1905.12173.pdf
arxiv.org arxiv.org

1910.08013.pdf

8
1. guillefix 23 Oct 2019
  
  in Public
  
  and we have left the activation kernel unchanged, \(K_\ell = \frac{1}{M_\ell} A'_\ell A'^{T}_\ell\)
  
  what is the reason to do this?
2. guillefix 23 Oct 2019
  
  in Public
  
  \((A_\ell \mid J_\ell)\)
  
  J_l is the covariance for a single column of A_l right?
3. guillefix 23 Oct 2019
  
  in Public
  
  Second, we modified the inputs by zeroing-out all but the first input unit (Fig. 1 right).
  
  how does this work more precisely? The targets are generated by feeding the modified inputs to the "teacher network", but the student network gets the unmodified inputs?
4. guillefix 23 Oct 2019
  
  in Public
  
  for MAP inference, the learned representations transition from the input to the output kernel, irrespective of the network width.
  
  how is MAP inference implemented?
5. guillefix 23 Oct 2019
  
  in Public
  
  the representations in learned neural networks slowly transition from being similar to the input kernel (i.e. the inner product of the inputs) to being similar to the output kernel (i.e. the inner product of one-hot vectors representing targets).
  
  this transition, as what? as the layer width is increased?
6. guillefix 23 Oct 2019
  
  in Public
  
  the covariance in the top-layer kernel induced by randomness in the lower-layer weights.
  
  what does he mean by this?
7. guillefix 23 Oct 2019
  
  in Public
  
  e.g. compare performance in Garriga-Alonso et al. (2019) and Novak et al. (2019) against He et al. (2016) and Chen et al. (2018)).
  
  but in here the GP networks lack many important features like batch-norm, pooling etc! Not sure if this example is a fair comparison. Also, not clear whether this difference is due to finite width or SGD (a question that Novak also asks)
8. guillefix 23 Oct 2019
  
  in Public
  
  enabling efficient and exact reasoning about uncertainty
  
  Only in regression... AAaaAaaAh ÒwÓ
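
The two kernels named in the highlighted passage of annotation 5 (the inner product of the inputs, and the inner product of one-hot target vectors) can be written down directly. This is a minimal sketch under stated assumptions: the toy data and the `kernel_alignment` similarity measure are illustrative choices, not necessarily the measure the paper uses.

```python
import numpy as np

def input_kernel(X):
    """Gram matrix of the inputs: K[i, j] = x_i . x_j."""
    return X @ X.T

def output_kernel(labels, num_classes):
    """Gram matrix of one-hot targets: 1 if two points share a class, else 0."""
    Y = np.eye(num_classes)[labels]
    return Y @ Y.T

def kernel_alignment(K1, K2):
    """Cosine similarity between two Gram matrices (hypothetical measure)."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

# Toy data (assumed for illustration): three 2-d inputs, two classes.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = np.array([0, 0, 1])

K_in = input_kernel(X)
K_out = output_kernel(labels, num_classes=2)
similarity = kernel_alignment(K_in, K_out)
```

A layer's representation kernel could be compared against `K_in` and `K_out` in the same way to see which of the two it resembles more.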
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1910.08013.pdf
arxiv.org arxiv.org

On Exact Computation with an Infinitely Wide Neural Net

1
1. guillefix 23 Oct 2019
  
  in Public
  
  significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019]
  
  Interesting, so apparently the NTK works better than the NNGP for this architecture at least
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/abs/1904.11955
www.jmlr.org www.jmlr.org

seeger02a.dvi

5
1. guillefix 21 Oct 2019
  
  in Public
  
  Optimally, these parameters are chosen such that the true predictive process $P(t_* \mid x_*, S)$ is closest to $Q(t_* \mid x_*, S)$ in relative entropy.
  
  in which sense is this optimal?
2. guillefix 18 Oct 2019
  
  in Public
  
  Bayes classifier
  
  I thought the Bayes classifier would predict sign ( E_w [P(t|y)y(x|w)] - 0.5) ?
3. guillefix 18 Oct 2019
  
  in Public
  
  our task is then to separate the structure from the noise.
  
  Well, and to find the correct regularity; generalization is not just about separating structure from noise. Unless by "noise" here, you mean also the stochasticity in the training sample (of inputs)..
4. guillefix 18 Oct 2019
  
  in Public
  
  We know of no interesting real-world learning problem which comes without any sort of prior knowledge
  
  Yep, no free lunch
5. guillefix 18 Oct 2019
  
  in Public
  
  (the lucky case)
  
  again I wouldn't call it "lucky", because the whole proof is that the generalization is good, because it's very unlikely to have obtained this training set by luck, so that it's most likely that we obtained it by having chosen a good prior. So I would call it "good prior" case.
Visit annotations in context

Annotators

guillefix

URL

jmlr.org/papers/volume3/seeger02a/seeger02a.pdf
arxiv.org arxiv.org

Untitled document

2
1. guillefix 16 Oct 2019
  
  in Public
  
  , such as cross entropy loss, encourage a larger output margin
  
  The fact that they also encourage a large SVM-margin is not so trivial tho
2. guillefix 16 Oct 2019
  
  in Public
  
  the gap between predictions on the true label and next most confident label.
  
  In SVMs, for instance, "margin" refers to the distance between classification boundary and a point. This can be related to the definition of margin here, but they are not the same?
  
  E.g. if we have a small SVM-margin, but a really large weight norm, then we would still have a large output margin.
  
  Ah, that's why they normalize by weight norm I suppose yeah.
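
The weight-norm point in the last annotation can be checked numerically. This is a minimal sketch, assuming a linear multi-class scorer `W @ x` and Frobenius-norm normalization; both are illustrative assumptions rather than the paper's exact definitions. Rescaling the weights rescales the output margin, while the normalized margin stays fixed.

```python
import numpy as np

def output_margin(W, x, y):
    """Score of the true class minus the best score among the other classes."""
    scores = W @ x
    others = np.delete(scores, y)
    return scores[y] - others.max()

def normalized_margin(W, x, y):
    """Output margin divided by the weight norm (assumed: Frobenius norm)."""
    return output_margin(W, x, y) / np.linalg.norm(W)

# Toy example (assumed for illustration): two classes, one 2-d input.
W = np.array([[1.0, 0.0], [0.0, 1.0]])
x = np.array([2.0, 1.0])
y = 0

small = output_margin(W, x, y)          # margin with the original weights
large = output_margin(10.0 * W, x, y)   # same classifier, weights scaled by 10
```

The decision boundary is identical in both cases, so only the normalized quantity says anything about the geometry.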
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1910.04284.pdf
arxiv.org arxiv.org

1902.06720.pdf

3
1. guillefix 03 Oct 2019
  
  in Public
  
  This is further consistent with recent experimental work showing that neural networks are often robust to re-initialization but not re-randomization of layers (Zhang et al. [42]).
  
  what does this mean?
2. guillefix 02 Oct 2019
  
  in Public
  
  Kernels from single hidden layer randomly initialized ReLU network convergence to analytic kernel using Monte Carlo sampling (M samples). See §I for additional discussion
  
  I think the Monte Carlo estimate of the NTK is a Monte Carlo estimate of the average NTK (as in average over initializations), not of the initialization-dependent NTK which Jacot studied. Jacot showed that in infinite width limit both are the same.
  
  But it seems from their results that even for finite width the average NTK is closer to the limit NTK than the single-sample NTK. This makes sense, because the single sample one has extra fluctuation around average.
3. guillefix 02 Oct 2019
  
  in Public
  
  We observe that the empirical kernel $\hat{\Theta}$ gives more accurate dynamics for finite width networks.
  
  That is a very interesting observation!
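
The distinction drawn in annotation 2, between a single-initialization NTK and its average over initializations, can be illustrated with a small experiment. This is a sketch under stated assumptions, not the paper's setup: a bias-free one-hidden-layer ReLU network f(x) = a · relu(Wx) / sqrt(width) with standard normal weights, both layers trained, and a width, sample count, and input pair chosen only for illustration.

```python
import numpy as np

def analytic_ntk(x1, x2):
    """Infinite-width NTK of the assumed one-hidden-layer ReLU network."""
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    cos = np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0)
    theta = np.arccos(cos)
    # Contribution of output weights: E[relu(w.x1) relu(w.x2)].
    k_a = n1 * n2 * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)
    # Contribution of hidden weights: (x1.x2) * P(both units active).
    k_w = (x1 @ x2) * (np.pi - theta) / (2 * np.pi)
    return k_a + k_w

def empirical_ntk(x1, x2, width, rng):
    """NTK of one random initialization at finite width."""
    W = rng.standard_normal((width, x1.size))
    a = rng.standard_normal(width)
    h1, h2 = W @ x1, W @ x2
    k_a = np.maximum(h1, 0.0) @ np.maximum(h2, 0.0) / width
    k_w = np.sum(a**2 * (h1 > 0) * (h2 > 0)) * (x1 @ x2) / width
    return k_a + k_w

rng = np.random.default_rng(0)
x1 = np.array([1.0, 0.0])
x2 = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3)])

exact = analytic_ntk(x1, x2)
# M = 500 independent initializations at width 64 (both values assumed).
singles = np.array([empirical_ntk(x1, x2, 64, rng) for _ in range(500)])
avg = singles.mean()

single_err = np.mean(np.abs(singles - exact))  # typical single-sample error
avg_err = abs(avg - exact)                     # error of the averaged kernel
```

Comparing `single_err` with `avg_err` shows the extra fluctuation of a single-sample kernel around the average, which is the effect described in the annotation.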
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1902.06720.pdf

guillefix

Annotations: 683

Joined: February 23, 2015
