380 Matching Annotations
  1. Jul 2020
  2. Jun 2020
    1. (x;W1,...,Wl,b1,...,bl)

      it should depend on \(W^{l+1}\) and \(b^{l+1}\) too

    1. Naturally, such an increase in the learning rate also increases the mean stepsE[∆w]. However,we found that this effect is negligible sinceE[∆w]is typically orders of magnitude lower than thestandard deviation.

      Interesting. This is why the intuition that increasing the learning rate would decrease the number of updates is probably not true, because what seems to determine the number of steps is the amount of noise!

  3. May 2020
    1. 〈O( ̄θ)〉=〈[[O[ ̄θ−η ̄∇LB(θ)]]]m.b.〉.

      this is missing some time indices?

    1. We omit thedβexp (−cγ) +bβlog(1δ)nterm since it does not change with changein random labels.

      how can we be sure it is non-vacuous then? hmm

    2. while ̃Hθ†l,φ[j,j] can change based onα-scaling Dinh et al. [2017], the effective curvature is scale invariant

      do you mean because you change \(\sigma\) too? Was that what Dinh et al. were talking about? Or just the fact that there are other \theta (not reparametrizing, just finding new \theta) which have high curvature, but produce same function?

    3. (f) stays valid for the test error rate in (a)

      if you take into account the spread in (f) and (a) it would seem that for some runs the upper bound isn't valid?

    4. Then, based on the ‘fast rate’ PAC-Bayes bound as before, we have the following result

      the posterior Q is a strange posterior over hypotheses. How do they take the KL divergence with the prior Because the posterior is defined by two parameters (\(\theta_\rho\) and \(\theta\))

    5. Further, all the diagonal elementsdecrease as more samples are used for training.

      Really? That sounds surprising!

      I would have expected that as more training samples are added the parameters get more constrained (if the number of parameters is kept fixed).

    6. Theorem 1

      Derandomization of the margin loss

    7. The bound provides a concrete realization of the notionof ‘flatness’ in deep nets [Smith and Le, 2018, Hochreiter and Schmidhuber, 1997, Keskar et al., 2017] andillustrates a trade-off between curvature and distance from initialization.

      is there evidence that distance from initialization anti-correlates with generalization? Even evidence for sharpness <> generalization isn't very strong.

    8. In spite of the dependency on the Hessian diagonal elements, which canbe changed based on re-parameterization without changing the function [Smith and Le, 2018, Dinh et al.,2017], the bound itself is scale invariant since KL-divergence is invariant to such re-parameterizations Klee-man [2011], Li et al. [2019].

      i thought Dinh's criticism wasn't so much about reparametrization, but about the fact that there are other minima which are sharper but give the same function. KL wouldn't be invariant to that, as you aren't changing the prior in that case?

  4. Apr 2020
    1. ∈Ck

      this sum was over all points in the training set in the previous step, and now it's over all points ?

      Just think of the case where the partition C_i is made up of singletons, one for each possible point. Then, the robustness would be zero, but the generalizatoin error bound doesn't seem right then.

      This made me suspect there may be something wrong, and I think it could be at this step. If we kept the sum to be over training sets, now we can;t upper bound the result by the max in the next two lines, I think!

  5. Mar 2020
    1. because of the softmax operation.

      more like because of the Heaviside operation

    2. the signs of f and 𝑓̃ f~\tilde{f} are the same.

      and therefore the classification functions are the same

    3. f~\tilde{f} as 𝑓𝑉=𝜌𝑓̃ fV=ρf~f_V=\rho \tilde{f},

      this is confusing, is f_V or \tilde{f} the normalized network?

    4. Our main results should also hold for SGD.

      Will this be commented on in more detail?

    5. normalized weights Vk as the variables of interest

      Can we even reparametrize to the normalized weights? For homogeneous networks, it's obvious that we can. But for ReLU networks with biases it's less obvious. If one multiplies the biases via constants that grow exponentially with weight, the function is left invariant. We can always do this until the paramter vector is left normalized. Therefore we can reparametrize to the normalized vectors even with biases, but dunno if they consider this case here.

    6. This mechanism underlies regularization in deep networks for exponential losses

      we cannot say this, until we know more. Is this the reason why the generalize? Is this even sufficient to explain their generalization?

    1. Bahdanau et al.(2019) learn a reward function jointly with the action policybut does so using an external expert dataset whereas ouragent uses trajectories collected through its own exploration

      Yeah what they do here is similar to IRL, in that we are trying to learn a human NL-conditioned reward function, but we do it via supervision, rather than demonstration. More similar to the work on "learning from human preferences"

    1. other agents

      which share the same policy right? otherwise it woud be off-policy experience?

    2. Zero Sum

      don't understand this one

    3. specific choice ofλ

      here, a specific choice of \(\lambda\) can determine which solutions among the many which satisfy the constraint we choose. Similarly to the choice of convex regularizer in the GAIL paper

    1. The problem with the max entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions, without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!

      The question is what would be the maximum entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3 in Ziebert 2012

    2. Z(θ)

      Remember the partition function sums over trajectories which are compatible with the MDP dynamics only.

      Trajectories incompatible with the dynamics have probability 0 of course

    1. Ziebart et al. (2008)

      The problem with the max entropy approach in Ziebart et al. 2008 is that it maximizes the entropy of trajectory distributions, without the constraint that these distributions must be realizable by causally-acting policies/agents. They then construct a causal policy from this distribution, but following the policy may result in a different trajectory distribution!

      The question is what would be the maximum entropy path distribution that is compatible with a causal policy? Does maximizing causal entropy give that? Not clear. Instead they prove a different property of maximum causal entropy: Theorem 3

    2. eθ>F(X,Y)

      this is P(Y|X), right? but it should be P(Y|X,Y_{1:t-1})?

    1. without interactionwith the expert

      how do things change when you can interact with the expert?

  6. Feb 2020
    1. Attention: Mask

      by this, do they mean the attention weighted aggregation step?

    2. nlayerdmodel3dattn

      are they ignoring the \(W^O\) matrix? from the original Transformer paper?

    3. Large models are more sample-efficient than small models, reaching the same level ofperformance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4)

      hmm in teresting. why are larger models more sample efficient?

    4. Theperformance penalty depends predictably on the ratioN0.74/D

      That is weird, what's the origin of this?

    5. hmm do they look at generalization gap?

      is trend on test loss due to parameter count, mostly due to effect on expressivyt / tranining loss (similarly with compute)?

    1. Some preliminary numerical simulations show that thisapproach does predict high robustness and log scaling.However, it only makes any sense if transitions from onephenotype to another phenotype are memoryless.

      I thought the whole transition matrix approach itself assumed memorylessness

    2. LetPbe a row vectorspecifying the probability distribution over phenotypes. Wewant to find a stochastic transition matrixM(rows sum toone) such that

      why do we want P to be stationary?

    3. Mhas 1s on the diagonals,and 0s elsewhere, for example

      that is high robustness right?

    4. Fano’s inequality)

      doesn't Fano's inequality give H(X|Y) on the numerator which is a lower bound on H(X), and so doesnt imply this?


    1. Intrinsic motivations f

      Basically the idea is that the RL/HER part is intrisnsically motivated with LP, to solve more and more tasks while the goal sampling part is intrinsically motivated to get trajectories that give new information to learn the reward function. I suppose they could add a bit of LP to the goal sampling as well to have some tendency to sample trajectories that may help to solve new tasks.

    2. High-quality trajectories are trajectories where the agent collectsdescriptions from the social partner for goals that are rarely reached.

      why do you want more than one description for a goal? A: Ah, because the goal will be the same but the final state may not be for each of these trajectories, thus giving more data to train the reward function.

  7. Jan 2020
    1. f memory-based sample efficient methods

      bandits methods, which are suitable for sequences of indepenedent experiments

    1. We find that the object geometry makes a significantdifferences in how hard the problem is

      apply some goal exploration process like POET?

    1. When it comes to NNs, the regulariza-tion mechanism is also well appreciated in the literature,since they traditionally suffer from overparameterization,resulting in overfitting.

      No. Overparametrized networks have been shown to generalize even without explicit regularization (Zhang et al. 2017)

    1. Therefore, we can get the following generalization bound:

      as long as the value of L is bounded by at most 1/delta or something right?

    1. They use on-average stability that does not imply generalization bounds with highprobability

      Their bounds on expectations can be converted to bounds with high probability, as they claim in page 3, citing "Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010."

    1. forTďmstep

      one pass SGD

    2. validation error which is used asan empirical estimate forRpw1q

      so their bound has the disadvantage that it needs an estimate given by the validation error to compute the bound! So it can't be computed from the training data alone!!

    3. our bound corroborates the intuition that whenever we start at a good location of the objectivefunction, the algorithm is more stable and thus generalizes better.

      This is a nice intuition for why good initializations can lead to good generalization

    4. Rpw1q ́R‹

      remember that \(R\) is the population risk, so this isn't a priori something that we can know?

  8. Dec 2019
  9. arxiv.org arxiv.org
    1. Whileit is known having a finite VC-dimension (Vapnik and Chervonenkis, 1991) or equivalentlybeing CVEEEloostable (Mukherjee et al., 2006) is necessary and sufficient for the EmpiricalRisk Minimization (ERM) to generalize,

      it is only necessary to generalize in the worst case over data distributions right?

    1. The bounds based on`2-path normand spectral norm can be derived directly from the those based on`1-path norm and`2norm respectively

      Hmm. how?

      This implies that even though the l2 path norms seem non-vacuous on Figure 1, they aren't. They appear so, because we have dropped the "terms that only depend on depth or number of hidden units", which are large for l2-path norm

    1. ExperimentsIn

      experiments only in 2 dimensional input space. Could results depend on the input dimensionality?

  10. Nov 2019
    1. min(Td;2S)

      the min is because depending on which is larger one or the other of the two limits of the integral, dominates

    2. 29

      Compare this to the analysis of Sollich ( https://pdfs.semanticscholar.org/7294/862e59c8c3a65167260c0156427f4757c67e.pdf ) which is in the well-specified setting. There there's no dependence on the labels of the training data. Here neither, but at least there's dependence on the distribution of the target labels, so that it allows for more general types of assumptions.

    3. K(x)is an even

      which can be seen from its definition as a covariance.

    4. of a Teacher Gaussian process with covarianceKTand assume that they lie in theRKHS of the Student kernelKS, namely

      ah yes, being in RKHS means having a finite norm in the RKHS, which makes sense. But not sure how restrictive this is, just like I'm not sure if simply being n-times differentiable is a good measure of complexity of the function. Are there n-times differentiable functions that approximate any less smooth function? Maybe Lipschitz constant of derivatives (smoothness constants) could be more quantitatively useful?

    5. If both kernels are Laplace kernels thenT=S=d+ 1andEMSEn1=d, whichscales very slowly with the dataset size in large dimensions. If the Teacher is a Gaussian kernel(T=1) and the Student is a Laplace kernel then= 2(1 + 1=d), leading to!2asd!1

      hm, wait what? But wouldn't the Bayes optimal answer be obtained if the student has the same kernel as the teacher?


      as \(n\to\infty\)

    7. We perform kernel classification via the algorithmsoft-margin SVM.

      which approximates a point estimator of the Gaussian process classifier, but I don't know the exact relation.

    8. man


    9. Importantly (i) Eq. (1) leads to a prediction for(d)that accurately matches our numerical study forrandom training data points, leading to the conjecture that Eq. (1) holds in that case as well.

      Compare with: https://arxiv.org/pdf/1909.11500.pdf where they find that random inputs give rise to plateaus, hmm at least with epochs, but they cite papers where these are apparently found for training set size (perhaps only for thin networks?)

    10. s a result, various works on kernel regressionmake the much stronger assumption that the training points are sampled from a target function thatbelongs to thereproducing kernel Hilbert space(RKHS) of the kernel (see for example [Smola et al.,1998]). With this assumptiondoes not depend ond(for instance in [Rudi and Rosasco, 2017]= 1=2is guaranteed). Yet, RKHS is a very strong assumption which requires the smoothness ofthe target function to increase withd[Bach, 2017] (see more on this point below), which may not berealistic in large dimensions.

      I think when they say "it belongs to an RKHS", they mean that it does so with a fixed/bounded norm (otherwise almost any function would satisfy this, for universal RKHSs). This is consistent with the next comment saying, that this assumption implies smoothness (smoothness<>small RKHS norm generally)

  11. openreview.net openreview.net
    1. Seems like PPO works better than their approach in several of the experiments. Hmm

    1. irreducible error (e.g.,Bayes error)

      more commonly model capacity limitations I guess?

    1. GMM on a dataset of previously sampled parametersconcatenated to their respective ALP measure.

      the GMM is only fitted to the parameter part or the (parameter, ALP) vector?

    1. nevertheless, the few re-maining ones must still differ in a finite fraction of bits fromeach other and from the teacher so that perfect generaliza-tion is still impossible. For aslightly above aconly the cou-plings of the teacher survive.

      Lenka Zdeborová, Florent Krzakala have found that at the capacity threshold, algorithms tend to have the longest running times, i.e. the worst-case examples seem to live at that transition

    2. For a committeeof two students it can be shown that when the number ofexamples is large, the information gain does not decreasebut reaches a positive constant. This results in a much fasterdecrease of the generalization error. Instead of being in-versely proportional to the number of examples, the de-crease is now exponentially fast

      For the case of the perceptron you can see how the uncertainty region (whose volume approximates the generalization error) approximately halves (or is reduced by about a constant) after every optimal query.

    1. n general, the baseline leaves the expected value of the update unchanged,but it can have a large

      because baseline depends on S, it can reduce the variance from state to state (not the one from action to action).

      WRONG: IT can reduce the action to action variance of the gradient (not the variance of the advantage!)

  12. Oct 2019
    1. computevar1bbÂj

      this is the covariance matrix

    2. This suggests that the effect ofj(x)is to rotate the gradient field and move thecritical points, also seen in Fig. 4b.

      how does this equation suggest this?

    3. sampling with replacement has better regularization

      but you are saying that the temperature (\(\beta^{-1}\) is lower when you sample with replacement, so that the regularization should be less?

    4. conservative

      how does this mean that it is conservatice?

    5. This implies that SGD implicitlyperforms variational inference with a uniform prior, albeit of a different loss than the one used tocompute back-propagation gradients

      The interpreation of doing variational inference with a uniform prior is because if we interpret the minimization objective as an ELBO, the second term is like the KL divergence between the approximate posterior and a uniform prior (whicih just gives the entropy). Nice

      If \(\rho\) doesn't have any constraints then this should give the exact posterior with uniform prior, and likelihood given by \(\Phi(x)\)

    1. The second particularity is that since the computation of the rewardRpp;c;;oqis internal to themachine, it can be computed any time after the experimentpc;;oqand for any problempPP,not only the particular problem that the agent was trying to solve. Consequently, if the machineexperiments a policyin contextcand observeso(e.g. trying to solve problemp1), and storesthe resultspc;;oqof this experiment in its memory, then when later on it self-generates problemsp2;p3;:::;piit can compute on the fly (and without new actual actions in the environment) theassociated rewardsRp2pc;;oq;Rp3pc;;oq;:::;Rpipc;;oqand use this information to improveover these goalsp2;p3;:::;pi.

      like hindsight experience replay

    1. Although methods to learndisentangled representation of the world exist [25,26,27], they do not allow to distinguish featuresthat are controllable by the learner from features describing external phenomena that are outsidethe control of the agent.

      learning controllabe features is similar to learning a causal model of the world I think

    1. We find that the full NTK has better approximation propertiescompared to other function classes typically defined for ReLU activations [5, 13, 15], which arise for instancewhen only training the weights in the last layer, or when considering Gaussian process limits of ReLUnetworks (e.g., [20, 24, 32]).

      NTK has "better approximation properties". What do they mean more precisely?

    1. and we have left the activation kernel unchanged,K`=1M`A0`A0T`

      what is the reason to do this?

    2. (A`jJ`)

      J_l is the covariance for a single column of A_l right?

    3. Second, we modified theinputs by zeroing-out all but the first input unit (Fig. 1 right).

      how does this work more precisely? The targets are generated by feeding the modified inputs to the "teacher network", but the student network gets the unmodified inputs?

    4. for MAP inference, the learned representationstransition from the input to the output kernel, irrespective of the network width.

      how is MAP inference implemented?

    5. he representations in learned neural networks slowly transitionfrom being similar to the input kernel (i.e. the inner product of the inputs) to being similar to theoutput kernel (i.e. the inner product of one-hot vectors representing targets).

      this transition, as what? as the layer width is increased?

    6. the covariance in the top-layer kernel induced by randomnessin the lower-layer weights.

      what does he mean by this?

    7. e.g.compare performance in Garriga-Alonso et al. (2019) and Novak et al. (2019) against He et al.(2016) and Chen et al. (2018)).

      but in here the GP networks lack many important features like batch-norm, pooling etc! Not sure if this example is a fair comparison. Also, not clear whether this difference is due to finite width or SGD (a question that Novak also asks)

    8. enabling efficient and exact reasoning aboutuncertainty

      Only in regression... AAaaAaaAh ÒwÓ

    1. significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019]

      Interesting, so apparently the NTK works better than the NNGP for this architecture at least

    1. Optimally, these parameters are chosen such that the true predictiveprocessP(t§jx§;S) is closest toQ(t§jx§;S) in relative entropy.

      in which sense is this optimal?

    2. Bayes classiØer

      I thought the Bayes classifier would predict sign ( E_w [P(t|y)y(x|w)] - 0.5) ?

    3. our task is then to separate the structure from thenoise.

      Well, and to find the correct regularity; generalization is not just about separating structure from noise. Unless by "noise" here, you mean also the stochasticity in the training sample (of inputs)..

    4. We know of no interesting real-world learningproblem which comes without any sort of prior knowledg

      Yep, no free lunch

    5. (theluckycase)

      again I wouldn't call it "unlucky", because the whole proof is that the generalization is good, because it's very unlikely to have obtained this training set by luck, so that it's most likely that we obtained it by having chosen a good prior. So I would call it "good prior" case.

    1. , such as cross entropy loss, encourage a larger outputmargin

      The fact that they also encourage a large SVM-margin is not so trivial tho

    2. the gap between predictions on the true label and andnext most confident label.

      In SVMs, for instance, "margin" refers to the distance between classification boundary and a point. This can be related to the definition of margin here, but they are not the same?

      E.g. if we have a small SVM-margin, but a really large weight norm, then we would still have a small output margin.

      Ah, that's why they normalize by weight norm I suppose yeah.

    1. This is further consistent with recent experimental work showing that neuralnetworks are often robust to re-initialization but not re-randomization of layers (Zhang et al. [42]).

      what does this mean?

    2. Kernels from single hidden layer randomly initializedReLUnetwork convergence to analytic kernel using Monte Carlo sampling (Msamples). See §I foradditional discussion

      I think the monte carlo estimate of the NTK is a montecarlo estimate of the average NTK (as in average over initializations), not of the initialization-dependent NTK which Jacot studied. Jacot showed that in infinite width limit both are the same.

      But it seems from their results that even for finite width the average NTK is closer to the limit NTK than the single-sample NTK. This makes sense, because the single sample one has extra fluctuation around average.

    3. We observe that the empirical kernel^gives more accurate dynamics for finite width networks.

      That is a very interesting observation!

    4. =0n

      yeah! so in standard parametrization, the learning rate is indeed O(1/n) !

    5.  < critical

      is the condition \(\eta <\eta_{\text{critical}}\) on the learning rate just so that gradient descent and gradient flow give similar results?

    1. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

      You didn't except hypothes.is in here did you?

      Bamboozled again!

    1. One may argue that a natural question consid-ering convex polygons would be whether they are separatefrom each other. Unfortunately, there is an infeasible compu-tational lower bound for this question.

      Yeah the no-miss-inclusion property doesn't imply the hulls don't intersect. Think of two perpependicular rectangles which intersect but where the corners are not inside the other rectangle

  13. Sep 2019
    1. For example, linear networks suffer worse conditioningthan any nonlinear network, and although nonlinear networks may have many small eigenvalues theyare generically non-degenerate.

      but doesn't it necessarily have to be degenerate when the number of trainnig points is smaller than the number of parameters

    1. For Softplus and Sigmoid, the training algorithm is stuck at a low test accuracy10%

      wow. what's their train accuracy?

    2. activation functions that do not have an EOC, such as Softplus and Sigmoid

      how does their phase diagram look like?

    3. might be explained by thislog(L)factor.

      Is that likely given that it's so small that it can't be seen experimentally?

    4. 2EOC

      what exactly is the EOC for ReLU?


    1. increasesits mean

      Right, but it may decrease its mean squared, which is what you are interested in.

    2. We again make use ofthe wide network assumption

      Isn't this now assuming large dimensionality of inputs?

    3. It was shown in He et al. [2015] that for ReLU networks initialized using Equation 2,the total mean is zero and total variance is the same forallpre-activations, regardless of the sampledistribution.

      Well it seems to me that they just looked at the average over weights. But their basic result is true if you average over any input distribution.You just get the average squared norm of the input multiplying the variances at each layer, but the variance at each layer are still all the same

    4. s

      insert comma here

    5. For almost all samples, these neurons are either operating as if they werelinear, or do not exist.

      Unclear what you mean here

    1. thenyl1has zero mean and has a symmetricdistribution around zero. This leads toE[x2l] =12Var[yl1]

      This is very nice. We don't need the infinite width assumption to calculate how variances propagate through the network. This is unlike for covariances or higher moments

    2. wl1have a symmetric distribution around zero

      so the fundamental assumptions is that the weights have a distribution which is symmetric around 0, not just of mean 0

  14. arxiv.org arxiv.org
    1. However, sincexis affected byW,band the parameters of all the layers below, changesto those parameters during training will likely move manydimensions ofxinto the saturated regime of the nonlin-earity and slow down the convergence. This effect isamplified as the network depth increases.

      Why is this?

    1. We hope this work will encourage further research that facilitatesthe discovery of new architectures that not only possess inductive biases for practical domains, butcan also be trained with algorithms that may not require gradient computation

      The WANN is an example of meta learning architectures that can be trained with new algorithms

    2. In the discussion they point out a couple of ways the work could be useful or interesting (transfer learning, training techniques beyond gradient-methods), but they don't make many clear points It seems the main point of the paper is to just point out that this is possible, and give food for thought.


    1. for erf andb= 0, the even degreeks all vanish

      Does this mean that erf networks are not fully expressive?

    1. The correlation vanishes if we train to a xed, nite loss instead of for a speci ed number ofepochs, and weakens as we increase the number of layers in the network

      Do you mean the other way round? I thought the experiments in this section were running SGD until it reaches a small fixed loss value, like you say in Figure 5.

      EDIT: Ah I see you repeat this in the end. Make sure you mention that you state that the experiments in Section 3.2 are for a fixed number of epochs.

      The fact that you need to train for a fixed number of epochs is interesting. Perhaps, the self-similarity among different minima occurs within a "layer" of weight space corresponding to weights of a given norm, and different layers are related by rescaling also (see section about flatness vs epochs, assuming norm increases with epochs). But if you train to a fixed loss, i guess the norm/layer needed to reach that loss is different for different data/functions, so that's why the correlation vanishes?

    2. We restrict our attention to the recti ed linear activation(ReLU) function, described by

      as in the rest of the paper?

    3. atness

      I think in the plots where you talk about flatness, the flatness axis should be -log(prod lambda), so that larger values correspond to higher flatness

    4. This is consistent with the aforementioned self-similarity properties of the simple Boolean network.

      what do you mean? flatness could increase by the same amounts, whether the lambda_max correlated with flatness or not, no? The two phenomena could be related though

    5. as gradient 1.687, indicating that atness has increasedupon further training

      The gradient here just indicates that larger flatness increases in flatness more right? It's the y-intercept that is showing here that the flatness increases after 100 epochs

    6. , the volume does not change (up to some noise limit).

      I suppose because the function stops changing? Is this training on part of the inputs, rather than on all the inputs? If the latter is the case, it's trivial that, if the algorithm converges, it will not change the function it finds after enough epochs.

    7. 7-40-40-2

      I thought the network was 7-40-40-1

    8. For varying proportions of corruption(up to 50%)

      This makes it seem to me like you corrupted a fraction of S_train, rather than appending an S_attack? We don't want that as then we are not fitting the same-sized (uncorrupted) training set.

    9. (using the original, uncorrupted train dataset)

      You'll have to explain to me (and the reader) what precisely you did. What are each of the points in Figure 15? A different training set, the same training set?

    10. We restrict the training set to 500 examples

      Previously you said 512 examples. Which one is it?

    11. image pixels

      pixels in the images of size 28x28

    12. We deliberately corrupt a small, variable proportion of the training dataset to producesolutions with diverse generalisation performances.

      To be clear, S_train is fixed in size, but S_attack varies in size right? It's not S_train + S_attack that's fixed in size. You should make this clear, because both approaches have been used, but in this case, we want S_train to be fixed.

    13. Figure 11

      \(\alpha>1\) increases the norm and \(\alpha<1\) decreases it, it seems. I guess this is because in the former case, we are increasing more parameters than we are decreasing it, and viceversa in the latter case.

      On the other hand, both increase the sharpness. This shows that sharpness and norm don't necessarily follow each other. However, it may be that for solutions that SGD finds, sharpness and norm do correlate. [ this is in a similar spirit to the alpha scaling rebuttal; while there are sharp regions with large and small norm, perhaps constrained to SGD-typical regions, the two quantities correlate. We could check this ]

    14. deteriorating atness as we-scale

      what are you referring to here?

    15. after a xed number of iterations with random weightand bias initialisation)

      and standard optimization algorithms

    16. attestpoint

      well, we don't know if the flattest, but definitely flatter than alpha-scaled ones

    17. how can minima corresponding toidentical functions have arbitrarily di erent atnesses?

      the question here is: "if minima corresponding to the same function (which thus generalize identically) can have arbitrarily different flatnesses, how can flatness be used as a proxy for generalization?

    18. (Wi;bi;Wi+1)!( Wi; bi;1Wi+1)

      where \(\alpha>0\)

    19. any form (recti edor otherwise)

      as long as it is twice differentiable, so that the Hessian is defined

    20. redundant

      rather than "redundant", "have limitations", or "are inappropriate"?

    21. most simple output function.

      in the sample

    22. symmetry

      self-similarity is a better term, as it's more commonly used, specially in cases, where the similarity at different isn't exact

    23. irrespectiveof complications due to rounding,

      what complications?

    24. the upper band corresponds to functions dominated by 0 and the lower band corresponds tofunctions dominated by 1.

      How is this? I would have expected behavior to be symmetric under changing outputs 0 to outputs 1 (think of changing sign of last layer)

    25. a xed numberof epochs

      enough to reach global minimum / target function?

    26. atness is strongly anticorrelated with the volume

      I would define the x axis as "sharpness", or if you want to call it flatness, make it negative. A minimum with larger -log(lambda) is more flat right?

    27. is is because

      this is expected because

      This is the MDL argument right?

    28. nction behaviour (and hence loss)

      here we are assuming that the loss measures discrepancy on all input points, so that zero loss is only achieved by functions equal to the target function.

    29. arguments

      and can provide rigorous bounds in some cases via PAC-Bayes theory

    30. f there is a bias, it will obey this bound.

      the bound is always obeyed, but it is only tight if there is enough bias.

      In other words, one only obtains simplicity bias (simple functions being more likely) if there is bias to begin with.

    31. But precisely why di erent solutions di er in atness and why optimisation algorithmsused during training converge so surely to 'good', at solutions remains unclear.

      also, the MDL principle based on flatness doesn't provide nonvacuous bounds [lecun et al], except on few recent exceptions [dan roy et al]


  15. Aug 2019
    1. the limiting kernelsK1carry (almost) no information onx;x0and have therefore little expressive power


    2. 1t(X).

      how does this gamma term evolve?

    3. when using SGD, the gradient update can be seen as a GD with aGaussian noise (Hu et al., 2018; Li et al., 2017).

      Think of each step of Brownian motion as integrating many mini steps of SGD.

      This is reminiscent of CLT

    4. Recent work by Jacot et al. (2018) has showed that training a neural networkof any kind with a full batch gradient descent in parameter space is equivalentto kernel gradient descent in function space with respect to the Neural TangentKernel (NTK).

      Only when the learning rate is very small


    1. Theorem 4.1(Weak Spectral Simplicity Bias).

      No bias towards complexity

    2. Kernels as Integral Operators

      How does this definition of the kernel as an integral operator fit in the rest of the story as a kernel of a Gaussian process? I thought a Gaussian process doesn't define an input over functions?

    3. with the larger stepsize2

      why does the larger step size cause more stability?

    4. whereCdk

      you haven't defined \(\Delta\) in the statement of the theorem

    1. non-constant

      because it is not a polynomial

    2. w(`)!0

      Why?? The operator norm of a random Gaussian matrix goes like \(O(\sqrt{n})\) ? (see https://terrytao.wordpress.com/2010/01/09/254a-notes-3-the-operator-norm-of-a-random-matrix/ e.g.)

    3. a(k)(t)a(k)(0) +c~a(k)(t)andw(k)(t)w(k)(0) + ~w(k)(t)to getthe polynomial bound

      Do they substitute \(a^{(k)}\) and \(w^{(k)}\) with \(A(t)\)? That is a valid bound, but seems quite loose. But it doesn't seem to be what they are saying. Substituting the LHS by RHS on Q, doesn't guarantee we obtain a polynomial on A(t) ?

    4. Idn`

      should be \(Id_{n_0}\)

    5. (`);d(`+1

      This is an extension of the definition of <>_pin to allow for vector-valued functions living in spaces of different dimensionality, in which case we take the outer product I think

    1. stochastic gradient descen

      An in fact SGD seems to work better in terms of both convergence, and generalization, in practice. Can we explain that theoretically?

    2. Rqdh−1×p

      As this has \(p\) columns, I'm assuming they only consider a stride of \(1\), for the convolutional networks (probably easy to generalize regardless)

    3. λ4

      \(\lambda_0\) is \(\lambda_\text{min}(H)\) I suppose

    4. show

      -extra word-

    5. depends

      depends on

    6. at linearrate.

      The rate of convergence is going to be very slow though for big \(n\), because the learning rate is very small

    7. ηλminK(H)2

      Is this guaranteed to be \(<1\)? Otherwise, the inequality couldn't be right.

      Seems like this being less than \(1\) depends on what \(\lambda_{\text{min}}(\mathbf{K^{(H)}})\) is

    8. A(H−1)

      Hmm, should this be \(\mathbf{A}_{ij}^{(H-1)}\)?

      I think so. See page 40 for example.

    9. K(h)ij

      These are the standard Neural Network Gaussian Process kernels I think

    10. gradient descentwith a constant positive step size

      But, like for NTK paper, step size is effectively very small, because of using NTK parametrization. Except that because \(m\), is at least a finite step size :P

    11. ǫ1

      Should be \(\mathcal{E}_1\)?

    12. BothZou et al.(2018)andAllen-Zhu et al.(2018c) train a subset of the layers

      Really? That sounds like a big difference with practice..

    13. in a unified fashionin AppendixE.

      How does this compare to the unified treatment of Greg Yang?

    14. Jacot et al.(2018) do not establish theconvergence of gradient flow to a global minimizer.

      They do, for a positive definite kernel right?

  16. Jul 2019
    1. this concludes theproo

      Technically we've shown that the variation in \(\alpha\) is \(O(1/\sqrt{n_L})\), but not the derivative? I think however, one can express the variation in the NTK as a sum over variations of alpha (by using the arguments here, and then integrating in time), giving us the desired result that the variation of the NTK goes to zero as \(O(1/\sqrt{n_L})\)

    2. Rn`n`+1

      should be \(\mathbb{R}^{n_{l+1} \times n_l}\) I think?

    3. recursive bounds

      Uses Cauchy-Schwartz for Operator norm of matrices, which can be obtained easily from definition

    4. A(t)stays uni-formly bounded on[0;]

      Why does it stay uniformly bounded on \([0,\tau]\) and not on \([0,T]\) ?

      Which theorem are they applying from reference [6], Thm 4?

    5. The summands@W(L)ijf;j0(x)@W(L)ijf;j00(x0)of the NTK hence vary at rate ofn3=

      Each of the derivatives has a factor of \(\frac{1}{\sqrt{n_{L}}}\), plus we get an extra factor of \(\frac{1}{\sqrt{n_{L}}}\) from the derivative of \(\alpha_{i}^{(L)}\)

    6. @tjjjjjj@tjj

      For the 2-norm, at least?

    7. theactivations


    8. hence thatk1pnLW(L)(0)kopis bounded

      as Frobenius norm bounds spectral norm (which is the same as operator norm for matrices)

    1. Finally and most importantly: how do weconnect all of this back to a rigorous PAC learning framework?
    2. high dimensionality crushes bad minima into dust

      If high dimensionality is the reason, then why does the high dimensional linear model work so well?

    3. rescalings of network parameters are irrelevant

      Not always. If you scale all parameters of a ReLU network with biases, it changes the function. If biases are zero, it doesn't change the function.

      It's true that batch-norm makes networks independent of parameter scaling, for the layer immediately before a batch norm layer.

    4. hen parameters are small (say, 0.1), aperturbation of size 1 might cause a major performance degradation. Conversely, when parameters

      This feels like it may be true for nonlinearities like tanh, but not sure if it will be for relu.

      For ReLU, larger parameters increase the gradient of the outputs w.r.t. to parameters in lower layers. Exponentially in the number of layers! This is what allows the lazy regime (see literature on NTK / lazy training etc)

    5. the linearmodel achieves only 49% test accuracy, whileResNet-18 achieves 92%.

      This is interesting in that it shows that "overparametrization" is not enought to get models that generalize well. Neural networks must have special properties beyond being overparametrized

    6. Interestingly, all of these methods generalize far better than the linear model. While thereare undeniably differences between the performance of different optimizers, the presence of implicitregularization for virtually any optimizer strongly indicates thatimplicit regularization is caused inlarge part by the geometry of the loss function, rather than the choice of optimizer alone.

      YES. That's what we say too: http://guillefix.me/nnbias/

    7. bad minima are everywhere

      This is a very loose statement. They could be a set of very little measure that is just rather dense in the space. So yeah they could be "everywhere", but be very rare and hard to find, unless you are beeing trained to find them explicitly

    1. Markov’s inequality.

      I could only get

      \(\mathbf{P}(|M x| \geq \sqrt{An}) \leq C^n \exp (-c A n)\)

      Applying Markov inequality :P Did I do anything wrong?

    1. [35,

      citation 35 and 6 are the same citation (one in the conference, and one in arxiv). I think they should be merged.

    2. width

      square root of width?

    1. n≤d

      \(d\) is the input dimension. But also the number of parameters, because we are looking at a linear model here.

    2. We assume the feature matrix can be decomposed into the formX=X+ZwhereXis low-rank(i.e. rank(X)=r<<n) with singular value decompositionX=UVTwithU∈Rn×r,∈Rr×r,V∈Rd×r,andZ∈Rn×dis a matrix with i.i.d.N(0;2x~n)entries.

      Noise model. Information space is the component of the inputs that live in a low-dim space (low rank component), and the nuisance space is the component that corresponds to i.i.d. noise, which will w.h.p. be of maximum rank

    1. r large, we obtain a low limit training accuracy and do not observe overfitting, asurprising fact since this amounts to solving an over-parameterized linear system. This behavioris due to a poorly conditioned linearized model, see Appendix C.

      Wait, so it seems that in all the experiments with CNNs you just found that the lazy training didn't converge to a global minimum of training error. So it doesn't mean they aren't generalizing well!

      Is your Jacobian degenerate for the first set of experiments (with squared loss), because if not, then your theorem implies that they should converge to a global minimum right?

    2. hat manages to interpolate the observations with just a smalldisplacement in parameter space (in both cases, near zero training loss was achieved).

      zero training loss is achieved both in the lazy and non-lazy regime, but the non-lazy solution generalizes much better

    3. Cover illustration.

      I suppose that in both the lazy and non-lazy regime, it has reached a global minimum of training loss?

    4. Theorem 2.5(Under-parameterized lazy training).Assume thatFis separable,Ris stronglyconvex,h(w0) = 0andrankDh(w)is constant on a neighborhood ofw0. Then there exists0>0such that for all0the gradient flow(4)converges at a geometric rate (asymptoticallyindependent of) to a local minimum ofF.

      Convergence to local minimum, removing assumption about non-degeneracy of Jacobian