683 Matching Annotations

Oct 2019
github.com github.com

damaru2/ntk

1
1. guillefix 02 Oct 2019
  
  in Public
  
  Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
  
  You didn't except hypothes.is in here did you?
  
  Bamboozled again!
Visit annotations in context

Annotators

guillefix

URL

github.com/damaru2/ntk/blob/master/readme.md
arxiv.org arxiv.org

1902.06720.pdf

2
1. guillefix 02 Oct 2019
  
  in Public
  
  =0n
  
  yeah! so in standard parametrization, the learning rate is indeed O(1/n) !
2. guillefix 02 Oct 2019
  
  in Public
  
  < critical
  
  is the condition \(\eta <\eta_{\text{critical}}\) on the learning rate just so that gradient descent and gradient flow give similar results?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1902.06720.pdf
arxiv.org arxiv.org

1904.01399.pdf

1
1. guillefix 02 Oct 2019
  
  in Public
  
  One may argue that a natural question consid-ering convex polygons would be whether they are separatefrom each other. Unfortunately, there is an infeasible compu-tational lower bound for this question.
  
  Yeah the no-miss-inclusion property doesn't imply the hulls don't intersect. Think of two perpependicular rectangles which intersect but where the corners are not inside the other rectangle
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1904.01399.pdf
projects.raspberrypi.org projects.raspberrypi.org

Setting up your Raspberry Pi - Set up your SD card | Raspberry Pi Projects

1
1. guillefix 01 Oct 2019
  
  in Public
  
  most laptop and desktop computers have one.
  
  lol yeah sure
Visit annotations in context

Annotators

guillefix

URL

projects.raspberrypi.org/en/projects/raspberry-pi-setting-up
Sep 2019
papers.nips.cc papers.nips.cc

The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network

1
1. guillefix 30 Sep 2019
  
  in Public
  
  For example, linear networks suffer worse conditioningthan any nonlinear network, and although nonlinear networks may have many small eigenvalues theyare generically non-degenerate.
  
  but doesn't it necessarily have to be degenerate when the number of trainnig points is smaller than the number of parameters
Visit annotations in context

Annotators

guillefix

URL

papers.nips.cc/paper/7786-the-spectrum-of-the-fisher-information-matrix-of-a-single-hidden-layer-neural-network.pdf
Local file Local file

Doucet - Training dynamics of DNNs using SGD via NTK.pdf

4
1. guillefix 29 Sep 2019
  
  in Public
  
  For Softplus and Sigmoid, the training algorithm is stuck at a low test accuracy10%
  
  wow. what's their train accuracy?
2. guillefix 29 Sep 2019
  
  in Public
  
  activation functions that do not have an EOC, such as Softplus and Sigmoid
  
  how does their phase diagram look like?
3. guillefix 29 Sep 2019
  
  in Public
  
  might be explained by thislog(L)factor.
  
  Is that likely given that it's so small that it can't be seen experimentally?
4. guillefix 29 Sep 2019
  
  in Public
  
  2EOC
  
  what exactly is the EOC for ReLU?
Annotators

guillefix
arxiv.org arxiv.org

1902.04942.pdf

5
1. guillefix 23 Sep 2019
  
  in Public
  
  increasesits mean
  
  Right, but it may decrease its mean squared, which is what you are interested in.
2. guillefix 23 Sep 2019
  
  in Public
  
  We again make use ofthe wide network assumption
  
  Isn't this now assuming large dimensionality of inputs?
3. guillefix 23 Sep 2019
  
  in Public
  
  It was shown in He et al. [2015] that for ReLU networks initialized using Equation 2,the total mean is zero and total variance is the same forallpre-activations, regardless of the sampledistribution.
  
  Well it seems to me that they just looked at the average over weights. But their basic result is true if you average over any input distribution.You just get the average squared norm of the input multiplying the variances at each layer, but the variance at each layer are still all the same
4. guillefix 23 Sep 2019
  
  in Public
  
  s
  
  insert comma here
5. guillefix 23 Sep 2019
  
  in Public
  
  For almost all samples, these neurons are either operating as if they werelinear, or do not exist.
  
  Unclear what you mean here
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1902.04942.pdf
arxiv.org arxiv.org

1502.01852v1.pdf

2
1. guillefix 23 Sep 2019
  
  in Public
  
  thenyl1has zero mean and has a symmetricdistribution around zero. This leads toE[x2l] =12Var[yl1]
  
  This is very nice. We don't need the infinite width assumption to calculate how variances propagate through the network. This is unlike for covariances or higher moments
2. guillefix 23 Sep 2019
  
  in Public
  
  wl1have a symmetric distribution around zero
  
  so the fundamental assumptions is that the weights have a distribution which is symmetric around 0, not just of mean 0
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1502.01852.pdf
arxiv.org arxiv.org

()

1
1. guillefix 17 Sep 2019
  
  in Public
  
  However, sincexis affected byW,band the parameters of all the layers below, changesto those parameters during training will likely move manydimensions ofxinto the saturated regime of the nonlin-earity and slow down the convergence. This effect isamplified as the network depth increases.
  
  Why is this?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1502.03167.pdf
Local file Local file

1906.04358 (1).pdf

2
1. guillefix 03 Sep 2019
  
  in Public
  
  We hope this work will encourage further research that facilitatesthe discovery of new architectures that not only possess inductive biases for practical domains, butcan also be trained with algorithms that may not require gradient computation
  
  The WANN is an example of meta learning architectures that can be trained with new algorithms
2. guillefix 03 Sep 2019
  
  in Public
  
  In the discussion they point out a couple of ways the work could be useful or interesting (transfer learning, training techniques beyond gradient-methods), but they don't make many clear points It seems the main point of the paper is to just point out that this is possible, and give food for thought.
Annotators

guillefix
arxiv.org arxiv.org

1907.10599.pdf

1
1. guillefix 03 Sep 2019
  
  in Public
  
  for erf andb= 0, the even degreeks all vanish
  
  Does this mean that erf networks are not fully expressive?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1907.10599.pdf
Local file Local file

Second_draft.pdf

31
1. guillefix 02 Sep 2019
  
  in Public
  
  The correlation vanishes if we train to a xed, nite loss instead of for a specied number ofepochs, and weakens as we increase the number of layers in the network
  
  Do you mean the other way round? I thought the experiments in this section were running SGD until it reaches a small fixed loss value, like you say in Figure 5.
  
  EDIT: Ah I see you repeat this in the end. Make sure you mention that you state that the experiments in Section 3.2 are for a fixed number of epochs.
  
  The fact that you need to train for a fixed number of epochs is interesting. Perhaps, the self-similarity among different minima occurs within a "layer" of weight space corresponding to weights of a given norm, and different layers are related by rescaling also (see section about flatness vs epochs, assuming norm increases with epochs). But if you train to a fixed loss, i guess the norm/layer needed to reach that loss is different for different data/functions, so that's why the correlation vanishes?
2. guillefix 02 Sep 2019
  
  in Public
  
  We restrict our attention to the rectied linear activation(ReLU) function, described by
  
  as in the rest of the paper?
3. guillefix 02 Sep 2019
  
  in Public
  
  atness
  
  I think in the plots where you talk about flatness, the flatness axis should be -log(prod lambda), so that larger values correspond to higher flatness
4. guillefix 02 Sep 2019
  
  in Public
  
  This is consistent with the aforementioned self-similarity properties of the simple Boolean network.
  
  what do you mean? flatness could increase by the same amounts, whether the lambda_max correlated with flatness or not, no? The two phenomena could be related though
5. guillefix 02 Sep 2019
  
  in Public
  
  as gradient 1.687, indicating that atness has increasedupon further training
  
  The gradient here just indicates that larger flatness increases in flatness more right? It's the y-intercept that is showing here that the flatness increases after 100 epochs
6. guillefix 02 Sep 2019
  
  in Public
  
  , the volume does not change (up to some noise limit).
  
  I suppose because the function stops changing? Is this training on part of the inputs, rather than on all the inputs? If the latter is the case, it's trivial that, if the algorithm converges, it will not change the function it finds after enough epochs.
7. guillefix 02 Sep 2019
  
  in Public
  
  7-40-40-2
  
  I thought the network was 7-40-40-1
8. guillefix 02 Sep 2019
  
  in Public
  
  For varying proportions of corruption(up to 50%)
  
  This makes it seem to me like you corrupted a fraction of S_train, rather than appending an S_attack? We don't want that as then we are not fitting the same-sized (uncorrupted) training set.
9. guillefix 02 Sep 2019
  
  in Public
  
  (using the original, uncorrupted train dataset)
  
  You'll have to explain to me (and the reader) what precisely you did. What are each of the points in Figure 15? A different training set, the same training set?
10. guillefix 02 Sep 2019
  
  in Public
  
  We restrict the training set to 500 examples
  
  Previously you said 512 examples. Which one is it?
11. guillefix 02 Sep 2019
  
  in Public
  
  image pixels
  
  pixels in the images of size 28x28
12. guillefix 02 Sep 2019
  
  in Public
  
  We deliberately corrupt a small, variable proportion of the training dataset to producesolutions with diverse generalisation performances.
  
  To be clear, S_train is fixed in size, but S_attack varies in size right? It's not S_train + S_attack that's fixed in size. You should make this clear, because both approaches have been used, but in this case, we want S_train to be fixed.
13. guillefix 02 Sep 2019
  
  in Public
  
  Figure 11
  
  \(\alpha>1\) increases the norm and \(\alpha<1\) decreases it, it seems. I guess this is because in the former case, we are increasing more parameters than we are decreasing it, and viceversa in the latter case.
  
  On the other hand, both increase the sharpness. This shows that sharpness and norm don't necessarily follow each other. However, it may be that for solutions that SGD finds, sharpness and norm do correlate. [ this is in a similar spirit to the alpha scaling rebuttal; while there are sharp regions with large and small norm, perhaps constrained to SGD-typical regions, the two quantities correlate. We could check this ]
14. guillefix 02 Sep 2019
  
  in Public
  
  deteriorating atness as we-scale
  
  what are you referring to here?
15. guillefix 02 Sep 2019
  
  in Public
  
  after a xed number of iterations with random weightand bias initialisation)
  
  and standard optimization algorithms
16. guillefix 02 Sep 2019
  
  in Public
  
  attestpoint
  
  well, we don't know if the flattest, but definitely flatter than alpha-scaled ones
17. guillefix 02 Sep 2019
  
  in Public
  
  how can minima corresponding toidentical functions have arbitrarily dierent atnesses?
  
  the question here is: "if minima corresponding to the same function (which thus generalize identically) can have arbitrarily different flatnesses, how can flatness be used as a proxy for generalization?
18. guillefix 02 Sep 2019
  
  in Public
  
  (Wi;bi;Wi+1)!(Wi;bi;1Wi+1)
  
  where \(\alpha>0\)
19. guillefix 02 Sep 2019
  
  in Public
  
  any form (rectiedor otherwise)
  
  as long as it is twice differentiable, so that the Hessian is defined
20. guillefix 02 Sep 2019
  
  in Public
  
  redundant
  
  rather than "redundant", "have limitations", or "are inappropriate"?
21. guillefix 02 Sep 2019
  
  in Public
  
  most simple output function.
  
  in the sample
22. guillefix 01 Sep 2019
  
  in Public
  
  symmetry
  
  self-similarity is a better term, as it's more commonly used, specially in cases, where the similarity at different isn't exact
23. guillefix 01 Sep 2019
  
  in Public
  
  irrespectiveof complications due to rounding,
  
  what complications?
24. guillefix 01 Sep 2019
  
  in Public
  
  the upper band corresponds to functions dominated by 0 and the lower band corresponds tofunctions dominated by 1.
  
  How is this? I would have expected behavior to be symmetric under changing outputs 0 to outputs 1 (think of changing sign of last layer)
25. guillefix 01 Sep 2019
  
  in Public
  
  a xed numberof epochs
  
  enough to reach global minimum / target function?
26. guillefix 01 Sep 2019
  
  in Public
  
  atness is strongly anticorrelated with the volume
  
  I would define the x axis as "sharpness", or if you want to call it flatness, make it negative. A minimum with larger -log(lambda) is more flat right?
27. guillefix 01 Sep 2019
  
  in Public
  
  is is because
  
  this is expected because
  
  This is the MDL argument right?
28. guillefix 01 Sep 2019
  
  in Public
  
  nction behaviour (and hence loss)
  
  here we are assuming that the loss measures discrepancy on all input points, so that zero loss is only achieved by functions equal to the target function.
29. guillefix 01 Sep 2019
  
  in Public
  
  arguments
  
  and can provide rigorous bounds in some cases via PAC-Bayes theory
30. guillefix 01 Sep 2019
  
  in Public
  
  f there is a bias, it will obey this bound.
  
  the bound is always obeyed, but it is only tight if there is enough bias.
  
  In other words, one only obtains simplicity bias (simple functions being more likely) if there is bias to begin with.
31. guillefix 01 Sep 2019
  
  in Public
  
  But precisely why dierent solutions dier in atness and why optimisation algorithmsused during training converge so surely to 'good', at solutions remains unclear.
  
  also, the MDL principle based on flatness doesn't provide nonvacuous bounds [lecun et al], except on few recent exceptions [dan roy et al]
Annotators

guillefix
Aug 2019
Local file Local file

Doucet - Training dynamics of DNNs using SGD via NTK.pdf

4
1. guillefix 31 Aug 2019
  
  in Public
  
  the limiting kernelsK1carry (almost) no information onx;x0and have therefore little expressive power
  
  Why?
2. guillefix 30 Aug 2019
  
  in Public
  
  1t(X).
  
  how does this gamma term evolve?
3. guillefix 30 Aug 2019
  
  in Public
  
  when using SGD, the gradient update can be seen as a GD with aGaussian noise (Hu et al., 2018; Li et al., 2017).
  
  Think of each step of Brownian motion as integrating many mini steps of SGD.
  
  This is reminiscent of CLT
4. guillefix 30 Aug 2019
  
  in Public
  
  Recent work by Jacot et al. (2018) has showed that training a neural networkof any kind with a full batch gradient descent in parameter space is equivalentto kernel gradient descent in function space with respect to the Neural TangentKernel (NTK).
  
  Only when the learning rate is very small
Annotators

guillefix
arxiv.org arxiv.org

1907.10599.pdf

4
1. guillefix 30 Aug 2019
  
  in Public
  
  Theorem 4.1(Weak Spectral Simplicity Bias).
  
  No bias towards complexity
2. guillefix 30 Aug 2019
  
  in Public
  
  Kernels as Integral Operators
  
  How does this definition of the kernel as an integral operator fit in the rest of the story as a kernel of a Gaussian process? I thought a Gaussian process doesn't define an input over functions?
3. guillefix 30 Aug 2019
  
  in Public
  
  with the larger stepsize2
  
  why does the larger step size cause more stability?
4. guillefix 30 Aug 2019
  
  in Public
  
  whereCdk
  
  you haven't defined \(\Delta\) in the statement of the theorem
  
  typo
Visit annotations in context

Tags

typo

Annotators

guillefix

URL

arxiv.org/pdf/1907.10599.pdf
www.semanticscholar.org www.semanticscholar.org

1806.07572.pdf

5
1. guillefix 18 Aug 2019
  
  in Public
  
  non-constant
  
  because it is not a polynomial
2. guillefix 18 Aug 2019
  
  in Public
  
  w(`)!0
  
  Why?? The operator norm of a random Gaussian matrix goes like \(O(\sqrt{n})\) ? (see https://terrytao.wordpress.com/2010/01/09/254a-notes-3-the-operator-norm-of-a-random-matrix/ e.g.)
3. guillefix 18 Aug 2019
  
  in Public
  
  a(k)(t)a(k)(0) +c~a(k)(t)andw(k)(t)w(k)(0) + ~w(k)(t)to getthe polynomial bound
  
  Do they substitute \(a^{(k)}\) and \(w^{(k)}\) with \(A(t)\)? That is a valid bound, but seems quite loose. But it doesn't seem to be what they are saying. Substituting the LHS by RHS on Q, doesn't guarantee we obtain a polynomial on A(t) ?
4. guillefix 18 Aug 2019
  
  in Public
  
  Idn`
  
  should be \(Id_{n_0}\)
5. guillefix 18 Aug 2019
  
  in Public
  
  (`);d(`+1
  
  This is an extension of the definition of <>_pin to allow for vector-valued functions living in spaces of different dimensionality, in which case we take the outer product I think
Visit annotations in context

Annotators

guillefix

URL

semanticscholar.org/reader/7a84a692327534fd227fa1e07fcb3816b633c591
arxiv.org arxiv.org

Untitled document

14
1. guillefix 09 Aug 2019
  
  in Public
  
  stochastic gradient descen
  
  An in fact SGD seems to work better in terms of both convergence, and generalization, in practice. Can we explain that theoretically?
2. guillefix 09 Aug 2019
  
  in Public
  
  Rqdh−1×p
  
  As this has \(p\) columns, I'm assuming they only consider a stride of \(1\), for the convolutional networks (probably easy to generalize regardless)
3. guillefix 09 Aug 2019
  
  in Public
  
  λ4
  
  \(\lambda_0\) is \(\lambda_\text{min}(H)\) I suppose
4. guillefix 09 Aug 2019
  
  in Public
  
  show
  
  -extra word-
  
  typo
5. guillefix 09 Aug 2019
  
  in Public
  
  depends
  
  depends on
6. guillefix 09 Aug 2019
  
  in Public
  
  at linearrate.
  
  The rate of convergence is going to be very slow though for big \(n\), because the learning rate is very small
7. guillefix 09 Aug 2019
  
  in Public
  
  ηλminK(H)2
  
  Is this guaranteed to be \(<1\)? Otherwise, the inequality couldn't be right.
  
  Seems like this being less than \(1\) depends on what \(\lambda_{\text{min}}(\mathbf{K^{(H)}})\) is
8. guillefix 09 Aug 2019
  
  in Public
  
  A(H−1)
  
  Hmm, should this be \(\mathbf{A}_{ij}^{(H-1)}\)?
  
  I think so. See page 40 for example.
9. guillefix 09 Aug 2019
  
  in Public
  
  K(h)ij
  
  These are the standard Neural Network Gaussian Process kernels I think
10. guillefix 09 Aug 2019
  
  in Public
  
  gradient descentwith a constant positive step size
  
  But, like for NTK paper, step size is effectively very small, because of using NTK parametrization. Except that because \(m\), is at least a finite step size :P
11. guillefix 09 Aug 2019
  
  in Public
  
  ǫ1
  
  Should be \(\mathcal{E}_1\)?
12. guillefix 08 Aug 2019
  
  in Public
  
  BothZou et al.(2018)andAllen-Zhu et al.(2018c) train a subset of the layers
  
  Really? That sounds like a big difference with practice..
13. guillefix 08 Aug 2019
  
  in Public
  
  in a unified fashionin AppendixE.
  
  How does this compare to the unified treatment of Greg Yang?
14. guillefix 08 Aug 2019
  
  in Public
  
  Jacot et al.(2018) do not establish theconvergence of gradient flow to a global minimizer.
  
  They do, for a positive definite kernel right?
Visit annotations in context

Tags

typo

Annotators

guillefix

URL

arxiv.org/pdf/1811.03804.pdf
Jul 2019
www.semanticscholar.org www.semanticscholar.org

1806.07572.pdf

14
1. guillefix 31 Jul 2019
  
  in Public
  
  this concludes theproo
  
  Technically we've shown that the variation in \(\alpha\) is \(O(1/\sqrt{n_L})\), but not the derivative? I think however, one can express the variation in the NTK as a sum over variations of alpha (by using the arguments here, and then integrating in time), giving us the desired result that the variation of the NTK goes to zero as \(O(1/\sqrt{n_L})\)
2. guillefix 31 Jul 2019
  
  in Public
  
  Rn`n`+1
  
  should be \(\mathbb{R}^{n_{l+1} \times n_l}\) I think?
3. guillefix 30 Jul 2019
  
  in Public
  
  recursive bounds
  
  Uses Cauchy-Schwartz for Operator norm of matrices, which can be obtained easily from definition
4. guillefix 30 Jul 2019
  
  in Public
  
  A(t)stays uni-formly bounded on[0;]
  
  Why does it stay uniformly bounded on \([0,\tau]\) and not on \([0,T]\) ?
  
  Which theorem are they applying from reference [6], Thm 4?
5. guillefix 30 Jul 2019
  
  in Public
  
  The summands@W(L)ijf;j0(x)@W(L)ijf;j00(x0)of the NTK hence vary at rate ofn3=
  
  Each of the derivatives has a factor of \(\frac{1}{\sqrt{n_{L}}}\), plus we get an extra factor of \(\frac{1}{\sqrt{n_{L}}}\) from the derivative of \(\alpha_{i}^{(L)}\)
6. guillefix 30 Jul 2019
  
  in Public
  
  @tjjjjjj@tjj
  
  For the 2-norm, at least?
7. guillefix 30 Jul 2019
  
  in Public
  
  theactivations
  
  pre-activations?
8. guillefix 30 Jul 2019
  
  in Public
  
  hence thatk1pnLW(L)(0)kopis bounded
  
  as Frobenius norm bounds spectral norm (which is the same as operator norm for matrices)
9. guillefix 12 Jul 2019
  
  in Public
  
  Theorem 2.Assume thatis a Lipschitz, twice differentiable nonlinearity function, with boundedsecond derivative. For anyTsuch that the integralRT0kdtkpindtstays stochastically bounded, asn1;:::;nL1!1, we have, uniformly fort2[0;T]
  
  Under the NTK parametrization, which they use, this limit implies that the learning rate (for GD on the standard-parametrization) is \(O(1/\sqrt{n}\) (where \(n\) is layer size). So the parameters move less and less for a fixed \(T\), in this limit, which is, intuitively, why the NTK stays constant for this period of time until \(T\)
  
  The interesting thing is that the function \(f\) can change, as all the parameters "conspire" for it to change. Therefore it can potentially fit a function, and find a global minimum, while the parameters have almost not moved at all.
  
  I think the intuition for this "conspiracy" is that the total change in \(f\) is given by a sum over all the parameters' individual gradients. The number of parameters grows like \(n^2\). gradient w.r.t. last hidden layer activations is \(O(1/\sqrt{n})\), w.r.t to second to last hidden layer activations is \(O(\sqrt{n}(1/\sqrt{n})^2) = O(1/\sqrt{n})\), where the \(\sqrt{n}\) comes from variance of summing over all the activations in last hidden layer. This means that the gradient w.r.t. to a weight, in NTK parameterization, is \(O((1/\sqrt{n})^2)=O(1/n)\) In GD, each weight changes by an ammount of the same order as the gradient (assumin \(O(1)\) learning rate, which we assume for NTK-parametrization learning rate), so each weight contributes to change \(f\) by \(O(1/n^2)\). Therefore the total contribution from all the weights is \(O(1)\). Note that the contributions all have the same sign as they are essentially the gradient w.r.t. that weight, squared, so they add linearly, (and not growing like \(\sqrt{n}\) if they were all randomly signed)
10. guillefix 02 Jul 2019
  
  in Public
  
  ~(`)1pn`W(`)
  
  i guess the first product here is element-wise, although it's not explicitly said
11. guillefix 02 Jul 2019
  
  in Public
  
  The connection weightsW(L)ijvary at rate1pnL, inducing a change of the same rate to the whole sum
  
  From chain rule
12. guillefix 01 Jul 2019
  
  in Public
  
  N(0;1)
  
  huh? no \(1/\sqrt{n}\) ?
13. guillefix 01 Jul 2019
  
  in Public
  
  ANNrealization functionF(L):RP! F, mapping parameterstofunctionsfin a spaceF.
  
  :O we studied the same object in our paper! But we called it the parameter-function map!
  
  https://arxiv.org/abs/1805.08522
14. guillefix 01 Jul 2019
  
  in Public
  
  This shows a direct connection to kernel methodsand motivates the use of early stopping to reduce overfitting in the training of ANNs
  
  But early stopping doesn't seem to often help in ANN training
Visit annotations in context

Annotators

guillefix

URL

semanticscholar.org/reader/7a84a692327534fd227fa1e07fcb3816b633c591
arxiv.org arxiv.org

1906.03291.pdf

7
1. guillefix 31 Jul 2019
  
  in Public
  
  Finally and most importantly: how do weconnect all of this back to a rigorous PAC learning framework?
  
  https://arxiv.org/abs/1805.08522
2. guillefix 31 Jul 2019
  
  in Public
  
  high dimensionality crushes bad minima into dust
  
  If high dimensionality is the reason, then why does the high dimensional linear model work so well?
3. guillefix 31 Jul 2019
  
  in Public
  
  rescalings of network parameters are irrelevant
  
  Not always. If you scale all parameters of a ReLU network with biases, it changes the function. If biases are zero, it doesn't change the function.
  
  It's true that batch-norm makes networks independent of parameter scaling, for the layer immediately before a batch norm layer.
4. guillefix 31 Jul 2019
  
  in Public
  
  hen parameters are small (say, 0.1), aperturbation of size 1 might cause a major performance degradation. Conversely, when parameters
  
  This feels like it may be true for nonlinearities like tanh, but not sure if it will be for relu.
  
  For ReLU, larger parameters increase the gradient of the outputs w.r.t. to parameters in lower layers. Exponentially in the number of layers! This is what allows the lazy regime (see literature on NTK / lazy training etc)
5. guillefix 31 Jul 2019
  
  in Public
  
  the linearmodel achieves only 49% test accuracy, whileResNet-18 achieves 92%.
  
  This is interesting in that it shows that "overparametrization" is not enought to get models that generalize well. Neural networks must have special properties beyond being overparametrized
6. guillefix 31 Jul 2019
  
  in Public
  
  Interestingly, all of these methods generalize far better than the linear model. While thereare undeniably differences between the performance of different optimizers, the presence of implicitregularization for virtually any optimizer strongly indicates thatimplicit regularization is caused inlarge part by the geometry of the loss function, rather than the choice of optimizer alone.
  
  YES. That's what we say too: http://guillefix.me/nnbias/
7. guillefix 31 Jul 2019
  
  in Public
  
  bad minima are everywhere
  
  This is a very loose statement. They could be a set of very little measure that is just rather dense in the space. So yeah they could be "everywhere", but be very rare and hard to find, unless you are beeing trained to find them explicitly
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1906.03291.pdf
terrytao.wordpress.com terrytao.wordpress.com

254A, Notes 3: The operator norm of a random matrix

1
1. guillefix 30 Jul 2019
  
  in Public
  
  Markov’s inequality.
  
  I could only get
  
  \(\mathbf{P}(|M x| \geq \sqrt{An}) \leq C^n \exp (-c A n)\)
  
  Applying Markov inequality :P Did I do anything wrong?
Visit annotations in context

Annotators

guillefix

URL

terrytao.wordpress.com/2010/01/09/254a-notes-3-the-operator-norm-of-a-random-matrix/
arxiv.org arxiv.org

1902.06720.pdf

3
1. guillefix 22 Jul 2019
  
  in Public
  
  [35,
  
  citation 35 and 6 are the same citation (one in the conference, and one in arxiv). I think they should be merged.
2. guillefix 21 Jul 2019
  
  in Public
  
  width
  
  square root of width?
3. guillefix 09 Jul 2019
  
  in Public
  
  which forinstance can be seen from the explicit width dependence of the gradients in the NTK parameterization
  
  Yeah but the NTK parametrization makes the gradients much smaller. For normal parametrization, gradient of individual weights is not infinitesimal right?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1902.06720.pdf
arxiv.org arxiv.org

1906.05392.pdf

2
1. guillefix 22 Jul 2019
  
  in Public
  
  n≤d
  
  \(d\) is the input dimension. But also the number of parameters, because we are looking at a linear model here.
2. guillefix 22 Jul 2019
  
  in Public
  
  We assume the feature matrix can be decomposed into the formX=X+ZwhereXis low-rank(i.e. rank(X)=r<<n) with singular value decompositionX=UVTwithU∈Rn×r,∈Rr×r,V∈Rd×r,andZ∈Rn×dis a matrix with i.i.d.N(0;2x~n)entries.
  
  Noise model. Information space is the component of the inputs that live in a low-dim space (low rank component), and the nuisance space is the component that corresponds to i.i.d. noise, which will w.h.p. be of maximum rank
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1906.05392.pdf
arxiv.org arxiv.org

1812.07956.pdf

11
1. guillefix 12 Jul 2019
  
  in Public
  
  r large, we obtain a low limit training accuracy and do not observe overfitting, asurprising fact since this amounts to solving an over-parameterized linear system. This behavioris due to a poorly conditioned linearized model, see Appendix C.
  
  Wait, so it seems that in all the experiments with CNNs you just found that the lazy training didn't converge to a global minimum of training error. So it doesn't mean they aren't generalizing well!
  
  Is your Jacobian degenerate for the first set of experiments (with squared loss), because if not, then your theorem implies that they should converge to a global minimum right?
2. guillefix 12 Jul 2019
  
  in Public
  
  hat manages to interpolate the observations with just a smalldisplacement in parameter space (in both cases, near zero training loss was achieved).
  
  zero training loss is achieved both in the lazy and non-lazy regime, but the non-lazy solution generalizes much better
3. guillefix 12 Jul 2019
  
  in Public
  
  Cover illustration.
  
  I suppose that in both the lazy and non-lazy regime, it has reached a global minimum of training loss?
4. guillefix 12 Jul 2019
  
  in Public
  
  Theorem 2.5(Under-parameterized lazy training).Assume thatFis separable,Ris stronglyconvex,h(w0) = 0andrankDh(w)is constant on a neighborhood ofw0. Then there exists0>0such that for all0the gradient flow(4)converges at a geometric rate (asymptoticallyindependent of) to a local minimum ofF.
  
  Convergence to local minimum, removing assumption about non-degeneracy of Jacobian
5. guillefix 12 Jul 2019
  
  in Public
  
  In terms of convergence results, this paper's main new result is the convergence of gradient flow, and showing that it stays close to the tangent (linearized) gradient flow.
  
  And saying this for general parametrized models. The assumption of nondegenerate Jacobian is related to over-parametrization, as nondegeneracy is more likely when one is overparametrized.
6. guillefix 12 Jul 2019
  
  in Public
  
  he gradient flow needs to be integrated with a step-size of order1=Lip(rF) = 1=Lip(h)2
  
  size of step size for gradient flow to be a good approximation
7. guillefix 12 Jul 2019
  
  in Public
  
  s!1,supt0kw(t)w0k=O(1=)
  
  How come it can find a minimum arbitrarily close to the initialization?
  
  Ah I see by the non-degenerate Jacobian assumption, you can find a local change that will fit \(y^*\), and \(\alpha\) large is just needed to reach the overall size/scale of \(y^*\) with the local change
8. guillefix 12 Jul 2019
  
  in Public
  
  kh(w0)y?kis bounded
  
  How realistic is this?
9. guillefix 12 Jul 2019
  
  in Public
  
  square-integrablefunctions with respect tox
  
  why do we need them to be square-integrable?
10. guillefix 07 Jul 2019
  
  in Public
  
  are bound to reach the lazy regime as the sizes of all layers grow unbounded
  
  and the learning rate tending to zero..
11. guillefix 02 Jul 2019
  
  in Public
  
  r
  
  Remember this nabla is w.r.t. to its argument not parameters \(w\)
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1812.07956.pdf
arxiv.org arxiv.org

1906.05827.pdf

4
1. guillefix 09 Jul 2019
  
  in Public
  
  Consider the class of linear functions overX=Rd, with squared parametrization as follows
  
  Seems quite artificial, but ok
2. guillefix 09 Jul 2019
  
  in Public
  
  duplicating units and negating their signs, the Jacobian of the modelis degenerate at initialization, or in their notationmin= 0
  
  is this if the weights are tied only? Do they assume they are tied?
3. guillefix 09 Jul 2019
  
  in Public
  
  The data are generated by a5-sparse predictoraccording toy(n)N(h;x(n)i;0:01)withd= 1000andN= 100.
  
  perhaps large initialization is like a small L2 norm bias, and small initialization like an L1 norm bias. So the kernel regime is bad for learning sparse networks (I think Lee also says this in his talk)
4. guillefix 09 Jul 2019
  
  in Public
  
  training with gradient descent has the effect of finding the mini-mum RKHS norm solution.
  
  they showed that for GD and logistic regression, but what about SGD, and square loss? I think for square loss you need either early stopping or regularization to get min norm solution?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1906.05827.pdf
arxiv.org arxiv.org

1905.10337.pdf

1
1. guillefix 07 Jul 2019
  
  in Public
  
  Can we moreprovide theoretical justications for this gap?
  
  are all our base belong to us?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1905.10337.pdf
arxiv.org arxiv.org

1905.10843.pdf

1
1. guillefix 07 Jul 2019
  
  in Public
  
  distance
  
  the kernel distance?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1905.10843.pdf
arxiv.org arxiv.org

Untitled document

2
1. guillefix 07 Jul 2019
  
  in Public
  
  the case of a regression loss, the obtained modelbehaves similarly to a minimum norm kernel least squares solution
  
  Only its expected value, see page 7 in Jacot2018, if I understood correctly
2. guillefix 07 Jul 2019
  
  in Public
  
  State-of-the-art neural networks are heavily over-parameterized, making the optimization algorithma crucial ingredient
  
  The fact that most naive learning algorithms work well, makes me question the "crucial" qualifier..
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1905.12173.pdf
www.wikiwand.com www.wikiwand.com

Hamilton–Jacobi equation | Wikiwand

1
1. guillefix 04 Jul 2019
  
  in Public
  
  which is a nonlinear
  
  typo on above equation? this appears to be the same as the Schrodinger equation
Visit annotations in context

Annotators

guillefix

URL

wikiwand.com/en/Hamilton–Jacobi_equation
arxiv.org arxiv.org

Untitled document

2
1. guillefix 03 Jul 2019
  
  in Public
  
  raining just the top layer with anℓ2loss is equivalent to a kernel regression for the following kernel:kerx,x′=Eθ∼W[f(θ,x)·fθ,x′],
  
  This is the expected value of the kernel, not the actual kernel, which would correspond to a random features kernel right?
  
  Hmm I think I remember random features converging when their number grows to infinity, but the product \(f(\theta,x)f(\theta,x')\) doesn't stochastically converge when the width grows to infinity right? Only its expectation converges
2. guillefix 03 Jul 2019
  
  in Public
  
  aGaussian Process (GP)[Neal,1996].This model as well as analogous ones with multiple layers [Lee et al.,2018,Matthews et al.,2018]and convolutional filters [Novak et al.,2019,Garriga-Alonso et al.,2019] make up the GaussianProcess view of deep learning. These correspond to infinitely wide deep nets whose all parametersare chosen randomly (with careful scaling), and only the top(classification) layer is trained.
  
  Maybe, but these kernels also correspond to those of a fully trained ideal Bayesian neural network, with prior over weights given by the iid initialization
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1904.11955.pdf
Jun 2019
www.marxists.org www.marxists.org

Existentialism is a Humanism, Jean-Paul Sarte 1946

1
1. guillefix 30 Jun 2019
  
  in Public
  
  He is not like that on account of a cowardly heart or lungs or cerebrum, he has not become like that through his physiological organism; he is like that because he has made himself into a coward by actions.
  
  philosology does affect you as well though...
Visit annotations in context

Annotators

guillefix

URL

marxists.org/reference/archive/sartre/works/exist/sartre.htm
arxiv.org arxiv.org

1203.3506.pdf

1
1. guillefix 25 Jun 2019
  
  in Public
  
  f= logpd.
  
  If the optimum of (2) is given by this when function is unrestricted, if we consider a family with zero "approximation error" (so that the optimum is in the family), then the optimum on the family is the same as over all functions
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1203.3506.pdf
arxiv.org arxiv.org

1904.00956.pdf

1
1. guillefix 10 Jun 2019
  
  in Public
  
  we can employ efficient off-policy reinforcement learningalgorithms that are faster than current meta-RL methods,which are typically on-policy (Finn et al., 2017a; Duanet al., 2016).
  
  Why are previous meta-RL algorithms typically on-policy?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1904.00956.pdf
May 2019
dspace.mit.edu dspace.mit.edu

MIT-CSAIL-TR-2018-019.pdf

3
1. guillefix 17 May 2019
  
  in Public
  
  D∗[T∗μ,T(Zm0)]
  
  I see this as one of the main innovations of the paper. This term is a discrepancy between the sample, and the true distribution \(\mu\). This would allow Z_m to be sampled from a different distribution for instance, allowing to get bounds that account for distributional drift, for instance.
2. guillefix 17 May 2019
  
  in Public
  
  V[f]
  
  This basically offers a measure of the variance of the loss (in a non-statistical sense) over the instance space, of the learned function.
3. guillefix 17 May 2019
  
  in Public
  
  hus, in classical bounds including data-dependent ones,asHgets larger and more complex, the bounds tend to become more pessimistic for theactual instance ˆyA(Sm)(learned with the actual instanceSm), which is avoided in Theorem1.
  
  Sure, but that is also avoided in some statistial learning approaches, like Structural Risk Minimization, PAC-Bayes, and the luckiness framework, which you cite!
Visit annotations in context

Annotators

guillefix

URL

dspace.mit.edu/bitstream/handle/1721.1/118307/MIT-CSAIL-TR-2018-019.pdf
arxiv.org arxiv.org

1706.03762.pdf

2
1. guillefix 12 May 2019
  
  in Public
  
  the total computational costis similar to that of single-head attention with full dimensionality
  
  smaller?
2. guillefix 12 May 2019
  
  in Public
  
  Multi-head attention allows the model to jointly attend to information from different representationsubspaces at different positions. With a single attention head, averaging inhibits this.
  
  So if I understand correctly, with a single head, different parts of the d_model-dimensional query vector may "want" to attend to different parts of the key, but because the weight of the values is computed by summing over all elements in the dot product, it would just average these local weights. Sepparating into different heads, allows to attend to different value vectors for different "reasons".
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1706.03762
Apr 2019
Local file Local file

LowKLowP-Ard-shorter.pdf

13
1. guillefix 30 Apr 2019
  
  in Public
  
  theprobability
  
  log probability
2. guillefix 30 Apr 2019
  
  in Public
  
  say
  
  ~~
3. guillefix 30 Apr 2019
  
  in Public
  
  too weak
  
  for Kmax
4. guillefix 30 Apr 2019
  
  in Public
  
  tighter
  
  in relative terms
5. guillefix 30 Apr 2019
  
  in Public
  
  o generatex
  
  given the map \(f\)
6. guillefix 30 Apr 2019
  
  in Public
  
  rst e
  
  first give x, and then enumerate ... identifying all inputs p mapping to x, namely \(f^{-1}(x)\)
7. guillefix 30 Apr 2019
  
  in Public
  
  lays a key rolein
  
  is the main component in
8. guillefix 30 Apr 2019
  
  in Public
  
  ,
  
  :
9. guillefix 30 Apr 2019
  
  in Public
  
  function
  
  of
10. guillefix 30 Apr 2019
  
  in Public
  
  NI= 2n.
  
  for binary strings
11. guillefix 30 Apr 2019
  
  in Public
  
  derived
  
  suggested
12. guillefix 30 Apr 2019
  
  in Public
  
  Since manyprocesses in science and engineering can be described asinput-output maps that are not UTMs
  
  Perhaps say "This suggests that, even though many maps are not UTMs, the principle that low K are high P should hold widely"
  
  because it is not because they are not UTMs, but it is in spite of them not being UTMs, I would argue.
13. guillefix 30 Apr 2019
  
  in Public
  
  , a classic categorization of machinesby their computational power,
  
  in parenthesis
Annotators

guillefix
www.cs.toronto.edu www.cs.toronto.edu

Kernel Cookbook

1
1. guillefix 03 Apr 2019
  
  in Public
  
  k(y,x,x′,y′)
  
  should be \(k(y,x,y',x')\) right?
  
  typo
Visit annotations in context

Tags

typo

Annotators

guillefix

URL

cs.toronto.edu/~duvenaud/cookbook/
distill.pub distill.pub

A Visual Exploration of Gaussian Processes

1
1. guillefix 03 Apr 2019
  
  in Public
  
  If we add a periodic and a linear kernel, the global trend of the linear kernel is incorporated into the combined kernel.
  
  Remember that kernel functions with one of its arguments evaluated are members of the reproducing kernel Hilbert space to which all the functions supported by a particular Gaussian process belong.
  
  Therefore adding kernels, amounts to adding the functions on these two spaces. That is why the resulting functions work like this when combining kernels!
Visit annotations in context

Annotators

guillefix

URL

distill.pub/2019/visual-exploration-gaussian-processes
www.jmlr.org www.jmlr.org

langford05a.dvi

1
1. guillefix 02 Apr 2019
  
  in Public
  
  concave in both arguments. Jensen’s inequality (f(x,y)concave⇒E f(x,y)≥f(Ex,Ey))
  
  Actually it's convex
Visit annotations in context

Annotators

guillefix

URL

jmlr.org/papers/volume6/langford05a/langford05a.pdf
Mar 2019
www.jmlr.org www.jmlr.org

langford05a.dvi

14
1. guillefix 29 Mar 2019
  
  in Public
  
  A stochastic error rate,ˆQ(~w,μ)S=E~x,y∼S ̄F(μγ(~x,y)
  
  Remember that the w sampled from the "posterior" isn't necessarely parallel to the original w, so that the stochastic classification rate isn't simply F(sign(margin)) but something more complicated; see the proof.
2. guillefix 29 Mar 2019
  
  in Public
  
  Since the PAC-Bayes bound is (almost) a generalization of the Occam’s Razor bound, the tightnessresult for Occam’s Razor also applies to PAC-Bayes bounds.
  
  Oh, c'mon :PP You are just showing that PAC-Bayes is tight as a statement for all Q and for a particular P. As in you are saying that if we only let it depend on the quantities it can depend (namely KL divergence between Q and P, delta, etc), then it can't be made tighter, because then it would break for the particular choice of D, hypothesis class, Q, and for any value of KL in that case in the Theorem 4.4 above.
  
  --> What I mean is this: that we say the bound is a function f(KL, delta, m, etc). Theorem 4.4 shows that there is a choice of learning problem and algorithm such that these arguments could be anything, and the bound is tight. Therefore, we can't lower this bound without it failing. It is tight in that sense. However, it may not be tight if we allow the bound to depend on other quantities!
3. guillefix 29 Mar 2019
  
  in Public
  
  The lower bound theorem implies that we can not improve an Occam’s Razor like statement.
  
  Yeah, as in if it only depends on \(P(c)\) and the other quantities expressed there, and have it not depend on the algorithm, so it should be a general function that takes \(P(c)\), \(\delta\) etc, but the same function for any algorithm. Then yes. And this is what they mean here.
4. guillefix 29 Mar 2019
  
  in Public
  
  For all P(c), m, k,δthere exists a learning problem D andalgorithm such that
  
  Depends what do you mean by For all \P(c)) are you fixing the hypothesis class or what? Because your proof assumes a particular type of hypothesis class... For P(c) having support over a hypothesis class where the union bound doesn't hold, then it is not tight any more..
5. guillefix 29 Mar 2019
  
  in Public
  
  The distributionDcan be drawn by first selectingYwith a single unbiased coin flip, and thenchoosing theith component of the vectorXindependently, Pr((X1, ...,Xn)|Y) =Πni=1Pr(Xi|Y). Theindividual components are chosen so Pr(Xi=Y|Y) =Bin(m,k,δP(c)).The classifiers we consider just use one feature to make their classification:ci(x) =xi. The trueerror of these classifiers is given by:cD=Bin(m,k,δP(c))
  
  Ok, so this has proven that the Occam bound is tight for this particular \(D\) for this particular hypothesis class, which is quite special, because it has the property that the union bound becomes tight. But that is a very special property of this hypothesis class (or more general, of this choice of support for \(P\) right??)
6. guillefix 29 Mar 2019
  
  in Public
  
  if any classifier has a too-small train error, thenthe classifier with minimal train error must have a too-small train error
  
  this is because having "too-small train error" here means (is equivalent to) having train error smaller than \(k\), so that the classifier with smallest train error also has it smaller than \(k\) and therefore it also has too small train error.
7. guillefix 26 Mar 2019
  
  in Public
  
  The differences between the agnostic and realizable case are fundamentally related to the decrease inthe variance of a binomial as the bias (i.e. true error) approaches 0.
  
  If we observe a zero empirical error, then the probability of observing that decreases very quickly with increasing true error. So I wouldn't really say it's the decrease in variance of the binomial (one could imagine distributions where the variance doesn't decrease as you go to 0, but which still have the property that makes the realizable case error rate smaller in the same way as here)
8. guillefix 26 Mar 2019
  
  in Public
  
  cS
  
  Remember they define training error as error count
9. guillefix 26 Mar 2019
  
  in Public
  
  PrS∼DmcD≤Bin(m,ˆcS,δ)≥1−δ.
  
  Btw, this approach only works because \(c_D\) is a one dimensional random variable, so that {the set of \(k\) such that \(Bin(m,k,p)<\delta\)} equals {the set of \(k\) less than or equal to {the maximum \(k\) such that \(Bin(m,k,p)<\delta\)}}.
  
  This happens because the different ("confidence interval") set of \(k\) defining \(Bin(m,k,p)\) (i.e. all \(k\) smaller than or equal to some \(k\)) are all nested. In more general situations with other confidence intervals defined (like for 2D cumulative distributions) this may not happen.
10. guillefix 26 Mar 2019
  
  in Public
  
  δ∈(0,1]
  
  No,
  
  Because Lemma 3.6 requires \(\frac{k}{m}<p\), then we need \(\delta\) to be small enough such that {{the \(c_D\) that results from solving the Chernoff bound = \(\delta\)} be larger than \(\frac{k}{m}\)}</p>
11. guillefix 26 Mar 2019
  
  in Public
  
  ε
  
  \(c_D\)
12. guillefix 26 Mar 2019
  
  in Public
  
  The test set bound is, essentially, perfectly tight. For any classifier with a sufficiently large trueerror, the bound is violated exactly aδportion of the time
  
  And any tighter bound would be violated a larger portion of the time, at least for some value of the true error (note that for fixed or constrained true error one can have tighter bounds).
  
  The need for a sufficiently large true error is because for instance for true error zero, the bound is never violated.
  
  But still the fact of it being perfectly tight is because of what I said above.
13. guillefix 26 Mar 2019
  
  in Public
  
  more
  
  less
  
  typo
14. guillefix 26 Mar 2019
  
  in Public
  
  All of the results presented here fall in the realm of classical statistics. In particular, all ran-domizations are over draws of the data, and our results have the form of confidence intervals.
  
  So not Bayesian statistics
Visit annotations in context

Tags

typo

Annotators

guillefix

URL

jmlr.org/papers/volume6/langford05a/langford05a.pdf
www.gaussianprocess.org www.gaussianprocess.org

Gaussian Processes for Machine Learning

2
1. guillefix 27 Mar 2019
  
  in Public
  
  he nor-malizing term of eq. (3.53),ZEP=q(y|X), is the EP algorithm’s approximationto the normalizing termZfrom eq. (3.48) and eq. (3.49)
  
  so the EP approximation to the marginal likelihood?
2. guillefix 27 Mar 2019
  
  in Public
  
  in the EP framework we approximate the likelihood by alocal like-lihood approximation13in the form of an un-normalized Gaussian function inthe latent variablefi
  
  how is the EP approximation good when the probit function as likelihood is shaped so differently from the unormalized Gaussian which is used to approximatie it!?
Visit annotations in context

Annotators

guillefix

URL

gaussianprocess.org/gpml/chapters/RW3.pdf
arxiv.org arxiv.org

1611.08408.pdf

1
1. guillefix 27 Mar 2019
  
  in Public
  
  n adversarial network is used to define a loss function which sidestepsthe need to explicitly evaluate or approximatepx(·)
  
  Adversarial training as an alternative to maximum likelihood training!
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1611.08408.pdf
papers.nips.cc papers.nips.cc

Semi-supervised Learning with Deep Generative Models

2
1. guillefix 25 Mar 2019
  
  in Public
  
  Dirichlet prior over these parameterss
  
  You'd need to sum over all \(y\) for that?
2. guillefix 25 Mar 2019
  
  in Public
  
  q(zjx)q(yjx),
  
  Ehm, in the line below you have \(q_\phi(z|y,z)\) not \(q_\phi(z|x)\)
Visit annotations in context

Annotators

guillefix

URL

papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.pdf
arxiv.org arxiv.org

1807.01613.pdf

2
1. guillefix 25 Mar 2019
  
  in Public
  
  Inheriting from the properties of stochasticprocesses, we assume thatQis invariant to permutationsofOandT.
  
  In principle, just like for consistency, they could choose not to enforce permutation invariance (e.g. using just an RNN), and rely on the model learning it from data.
  
  However, the contributions they are making (here and in the Neural Processes paper) is ways of putting the structure that Bayesian inference on stochastic processes should satisfy explicitly on the model, so that it has to learn well.
  
  We are putting prior knowledge about how to use prior knowledge
2. guillefix 25 Mar 2019
  
  in Public
  
  without imposing consistency
  
  I think in the context of CNPs consistency would imply things like
  
  \(\sum_{y_1} p(y_1)p(y_2|y_1) =p_2\)
  
  These things are not automatically guaranteed by the framework used here. The data should constrain the network to satisfy these approximately
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1807.01613.pdf
arxiv.org arxiv.org

1807.01622.pdf

3
1. guillefix 25 Mar 2019
  
  in Public
  
  Since the decodergis non-linear, we can use amortisedvariational inference to learn it.
  
  So, the nice thing about NPs is that if you did the inference of \(z\) exactly it would be a stochastic process exactly (unlike CNPs that don't have a simple interpretation/approach that guarantees being an exact stochastic process). However, because of not doing the inference of \(z\) exactly, consistency of the resulting marginal distributions is not exactly guaranteed.
2. guillefix 25 Mar 2019
  
  in Public
  
  non-trivial ‘kernels’
  
  or even stochastic processes that aren't GPs and may not be describable by kernels
3. guillefix 25 Mar 2019
  
  in Public
  
  Given that this is a neural approximationthe curves will sometimes only approach the observationspoints as opposed to go through them as it is the case forGPs.
  
  ?? For GPs with Gaussian likelihood the functions don't pass exactly through the observation points, just near, as in here?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1807.01622.pdf
arxiv.org arxiv.org

1902.04760.pdf

16
1. guillefix 18 Mar 2019
  
  in Public
  
  h1;:::;hk
  
  I think he meant \(h^{j_1},...,h^{j_k}\)
  
  typo
2. guillefix 18 Mar 2019
  
  in Public
  
  (W0>Wx)i= 0whereW0is an iid copy ofW
  
  So we sample the A transposes independently from the As I see. That is the gradient independence assumption.
  
  Btw is this related to some stuff I saw Hinton spoke about showing that you could do backprop with weights different from the forward prop, and they couldn't understand why. Is this the explanation? Could this, as they suggested, be related to a way the brain could be doing backprop?
3. guillefix 18 Mar 2019
  
  in Public
  
  The sampling ofinput G-vars models the distribution of the first hidden layer across multiple inputs, sampling of the first layer parameters(see Appendix B.1 for an example), and/or sampling of bias vectors.
  
  "upon sampling of the first layer parameters" or "sampling the first layer parameters", I guess he meant?
  
  oh ok , he kinda means sampling of the last layer parameters, but when we do backprop, it's the first layer..
4. guillefix 18 Mar 2019
  
  in Public
  
  In general, the multiplefvigallow for multiple NN outputs.
  
  Ah ok, the \(v^i\) are like the weights of the outputs in the loss function
  
  NO. each \(v_i\) is a vector, the vector of weights which when multiplied by some \(\mathtt{g}\) or \(\mathtt{h}\) give a single real-valued output labelled \(i\). This we call a linear readout layer.
  
  When we back propagate the loss we would multiply each of these vectors by the derivative of the loss w.r.t. to each of these outputs.
  
  Can this be done in this formalism?
5. guillefix 18 Mar 2019
  
  in Public
  
  batchnorm, whose Jacobian has a singularity on a 1-dimensional affine subspace(and in particular, at the origin).
  
  so that the \(\mathtt{f^l}\) are not polynomially bounded
6. guillefix 18 Mar 2019
  
  in Public
  
  am
  
  again should be \(a_{j_i}^l\)
  
  typo
7. guillefix 18 Mar 2019
  
  in Public
  
  m
  
  don't need \(m\) here?
  
  typo
8. guillefix 18 Mar 2019
  
  in Public
  
  has enough power to express thebackpropagation offand compute the gradients with respect to hidden states.
  
  Note that the Nonlin functions in the appended lines are allowed to depend on the previous \(\mathtt{g}\)s, which is necessary to compute the backpropagation of the gradient through the nonlinearity. The odness works for backprop because the Nonlin is just linear w.r.t. the \(v\)s
9. guillefix 18 Mar 2019
  
  in Public
  
  Theorem 4.3.
  
  Generalized law of large numbers extended to a case where the i.i.d. r.v.s have a distribution that changes as we increase the number of samples (although the way it changes is constrained).
  
  EDIT: see comment
  
  The reason this is nontrivial is that the \(\phi(g_i^{\frak{c}t})\) have a distribution that changes with \(t\), although it approaches a limit. So it is different from the standard law of large numbers.
  
  In words, here we are saying that the empirical averages as we increase \(t\) approach (almost surely) the expected value of \(Z\) distributed according to the limit distribution of \(\phi(g_i^{\frak{c}t})\)
  
  explanation
10. guillefix 18 Mar 2019
  
  in Public
  
  theLat which the exponentiating effect ofWLkicks in increases withn
  
  the \(L\) at which exponential effect kicks increases like \(\sqrt{n}\) it seems according to these arguments. Nice
  
  seems related to things in here http://papers.nips.cc/paper/7339-which-neural-net-architectures-give-rise-to-exploding-and-vanishing-gradients
11. guillefix 18 Mar 2019
  
  in Public
  
  zN(c;Kc)
  
  \(\mu^\frak{c}\) is the vector of means of the tuple of \(i\)th components of the vectors in the argument of either \(\mathtt{f}^a\) or \(\mathtt{f}^b\), and \(K^\frac{c}\) is the covariance matrix of this tuple/vector.
  
  Note (also useful for the above definition) that we are defining means and covariances for any individual component of the vectors \(\mathtt{g}\). That is, we are describing the distribution of \(\mathtt{g}^{\frak{c}}_i=(\mathtt{g}^l_i)_{g^l\in\frak{c}}\) for any \(i\). Different tuples of components are independently distributed, as explained in a comment in the beggining of the Setup section above
  
  explanation
12. guillefix 18 Mar 2019
  
  in Public
  
  c(gl)
  
  This is defined for \(g^l \in \frak{c}\)
  
  clarification
13. guillefix 18 Mar 2019
  
  in Public
  
  Kc(gl;gm)
  
  This is defined for \(g^l, g^m \in \frak{c}\)
  
  clarification
14. guillefix 18 Mar 2019
  
  in Public
  
  amji
  
  should be \(a^l_{j_i}\)
  
  typo
15. guillefix 18 Mar 2019
  
  in Public
  
  gcinti N(cint;Kcint)for eachi;j
  
  Note the subscript \(i\), so this is the distribution for the tuple of the \(i\)th components of all the Invecs in \(\frac{c}\).
  
  We therefore allow the \(i\)th component of two different Invecs to be correlated (useful to model the distribution of the first hidden layer, as per the usual NNGP analysis). But we don't allow different components of Invecs, \(g_i^{lt}\) and \(g_j^{mt}\) for \(i\neq j\), to be correlated.
  
  Thre is a typo, it should say for each \(i\).
  
  explanation
16. guillefix 18 Mar 2019
  
  in Public
  
  sequence (int2N)
  
  what does this mean? Ah \(t\) is like a "time", so it is the index of the sequence. \(lt\) just represents two indices (not their product!)
Visit annotations in context

Tags

explanation

clarification

typo

Annotators

guillefix

URL

arxiv.org/pdf/1902.04760.pdf
Feb 2019
arxiv.org arxiv.org

1812.10156.pdf

5
1. guillefix 11 Feb 2019
  
  in Public
  
  that such functions should be simple with respect to all the measures of complexity above,
  
  Why?? Do you show that all functions that have the property of being insensitive to large changes in the inputs have high probability? If so, then say it, if not. Then not all such functions need be simple w.r.t. to all measures of complexity that are found to correlate with probability for a random NN
2. guillefix 11 Feb 2019
  
  in Public
  
  Schmidhuber, 1997, Dingleet al., 2018].
  
  Dingle et al explore bias towards low complexity, but not the relation to generalization.
3. guillefix 11 Feb 2019
  
  in Public
  
  our result implies that the probability of this function is exponentially small inn.
  
  Although I think it is exponentially small in \(n\), why is that implied by your result? All that you know from what we've been told up to this point in the paper, is that it's probability has to be smaller than sqrt(log(n)/n), or using symmetry, we can divide this by 1/n
4. guillefix 11 Feb 2019
  
  in Public
  
  ne,
  
  for \(n>> 1\), the probability of all of the Hamming-distance-1 neighbours giving the same result goes to 0. So the weight of the term corresponding to Hamming distance 1 goes to 1, in the average
5. guillefix 11 Feb 2019
  
  in Public
  
  wo
  
  Approximately a geometric distribution with success probability p=2, for large \(n\), and the mean of a geometric distribution is 1/p = 2
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1812.10156.pdf
Jan 2019
arxiv.org arxiv.org

1608.04493.pdf

2
1. guillefix 29 Jan 2019
  
  in Public
  
  (3)
  
  This is a Schmitt trigger :D, see here: https://youtu.be/Iu21laCEsVs?t=2m31s
2. guillefix 29 Jan 2019
  
  in Public
  
  Inspired by the method of Lagrange Multipliers
  
  How is this inspired by the method of Lagrange Multipliers?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1608.04493.pdf
yann.lecun.com yann.lecun.com

lecun-90b.pdf

1
1. guillefix 28 Jan 2019
  
  in Public
  
  0AB
  
  This is not \(\delta u_k\), why is he using \(u_k\) ? I don't see what's the justification
Visit annotations in context

Annotators

guillefix

URL

yann.lecun.com/exdb/publis/pdf/lecun-90b.pdf

guillefix

Annotations: 683

Joined: February 23, 2015

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

Annotators

URL

Annotators

Annotators

Tags

Annotators

URL

Annotators

URL

Tags

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

Tags

Annotators

URL

Annotators

URL

Annotators

URL

Tags

Annotators

URL

Annotators

URL

Annotators

URL

Annotators