 Nov 2019

arxiv.org

\(\min(\alpha_T - d, 2\alpha_S)\)
The min arises because, depending on which is larger, one or the other of the two limits of the integral dominates.

29
Compare this to the analysis of Sollich ( https://pdfs.semanticscholar.org/7294/862e59c8c3a65167260c0156427f4757c67e.pdf ) which is in the well-specified setting. There, there's no dependence on the labels of the training data. Here neither, but at least there's dependence on the distribution of the target labels, so it allows for more general types of assumptions.

\(K(x)\) is an even
which can be seen from its definition as a covariance.

of a Teacher Gaussian process with covariance \(K_T\) and assume that they lie in the RKHS of the Student kernel \(K_S\), namely
Ah yes, being in the RKHS means having a finite norm in the RKHS, which makes sense. But I'm not sure how restrictive this is, just like I'm not sure if simply being n-times differentiable is a good measure of complexity of the function. Are there n-times differentiable functions that approximate any less smooth function? Maybe Lipschitz constants of derivatives (smoothness constants) could be more quantitatively useful?

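If both kernels are Laplace kernels then \(\alpha_T = \alpha_S = d + 1\) and \(\mathbb{E}_{\mathrm{MSE}} \sim n^{-1/d}\), which scales very slowly with the dataset size in large dimensions. If the Teacher is a Gaussian kernel (\(\alpha_T = \infty\)) and the Student is a Laplace kernel then \(\beta = 2(1 + 1/d)\), leading to \(\beta \to 2\) as \(d \to \infty\)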
hm, wait what? But wouldn't the Bayes optimal answer be obtained if the student has the same kernel as the teacher?

as \(n\to\infty\)

We perform kernel classification via the algorithm soft-margin SVM.
which approximates a point estimator of the Gaussian process classifier, but I don't know the exact relation.

man
mean

Importantly (i) Eq. (1) leads to a prediction for \(\beta(d)\) that accurately matches our numerical study for random training data points, leading to the conjecture that Eq. (1) holds in that case as well.
Compare with: https://arxiv.org/pdf/1909.11500.pdf where they find that random inputs give rise to plateaus, hmm at least with epochs, but they cite papers where these are apparently found for training set size (perhaps only for thin networks?)

As a result, various works on kernel regression make the much stronger assumption that the training points are sampled from a target function that belongs to the reproducing kernel Hilbert space (RKHS) of the kernel (see for example [Smola et al., 1998]). With this assumption \(\beta\) does not depend on \(d\) (for instance in [Rudi and Rosasco, 2017] \(\beta = 1/2\) is guaranteed). Yet, RKHS is a very strong assumption which requires the smoothness of the target function to increase with \(d\) [Bach, 2017] (see more on this point below), which may not be realistic in large dimensions.
I think when they say "it belongs to an RKHS", they mean that it does so with a fixed/bounded norm (otherwise almost any function would satisfy this, for universal RKHSs). This is consistent with the next comment saying that this assumption implies smoothness (smoothness <-> small RKHS norm, generally).


openreview.net

Seems like PPO works better than their approach in several of the experiments. Hmm


arxiv.org

irreducible error (e.g., Bayes error)
more commonly model capacity limitations I guess?


arxiv.org

GMM on a dataset of previously sampled parameters concatenated to their respective ALP measure.
Is the GMM fitted only to the parameter part, or to the (parameter, ALP) vector?


www.ki.tu-berlin.de

nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher so that perfect generalization is still impossible. For \(\alpha\) slightly above \(\alpha_c\) only the couplings of the teacher survive.
Lenka Zdeborová and Florent Krzakala have found that at the capacity threshold, algorithms tend to have the longest running times, i.e. the worst-case examples seem to live at that transition

For a committee of two students it can be shown that when the number of examples is large, the information gain does not decrease but reaches a positive constant. This results in a much faster decrease of the generalization error. Instead of being inversely proportional to the number of examples, the decrease is now exponentially fast
For the case of the perceptron you can see how the uncertainty region (whose volume approximates the generalization error) approximately halves (or is reduced by about a constant) after every optimal query.
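The halving intuition can be checked on a toy case. This is only a sketch under my own assumptions: the "perceptron" is reduced to a one-dimensional threshold classifier on [0, 1], the width of the interval of thresholds still consistent with the data stands in for the generalization error, and the function names are made up for illustration. Random examples shrink that interval roughly like 1/n, while bisecting queries shrink it like 2^-n.

```python
import random


def passive_width(n_examples, threshold, rng):
    """Width of the interval still consistent with n randomly drawn labelled points."""
    lo, hi = 0.0, 1.0
    for _ in range(n_examples):
        x = rng.random()
        if x < threshold:
            lo = max(lo, x)  # x is labelled 0, so the threshold lies to its right
        else:
            hi = min(hi, x)  # x is labelled 1, so the threshold lies at or left of x
    return hi - lo


def active_width(n_queries, threshold):
    """Width of the consistent interval when every query bisects it."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        x = (lo + hi) / 2.0  # query the midpoint of the current uncertainty region
        if x < threshold:
            lo = x
        else:
            hi = x
    return hi - lo


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (5, 10, 20):
        avg_passive = sum(passive_width(n, 0.3, rng) for _ in range(500)) / 500
        print(n, avg_passive, active_width(n, 0.3))
```

In higher dimensions an optimal query cannot in general cut the version space exactly in half, which is why the note says "approximately halves".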


incompleteideas.net

In general, the baseline leaves the expected value of the update unchanged, but it can have a large
Because the baseline depends on S, it can reduce the variance from state to state (not the one from action to action).
WRONG: it can reduce the action-to-action variance of the gradient (not the variance of the advantage!)
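A small exact check of this, as a sketch rather than anything from the book: a single state, a softmax policy over three actions, and the REINFORCE estimator \((r(a) - b)\nabla_\theta \log \pi(a)\). The rewards and the baseline value are made-up numbers; expectations are computed by enumerating the actions, so there is no sampling noise.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(theta, action):
    """Gradient of log softmax(theta)[action] with respect to theta."""
    pi = softmax(theta)
    return [(1.0 if k == action else 0.0) - pi[k] for k in range(len(theta))]


def estimator_stats(theta, rewards, baseline):
    """Exact mean vector and total variance of (r(a) - b) * grad log pi(a)."""
    pi = softmax(theta)
    n = len(theta)
    samples = [
        [(rewards[a] - baseline) * g for g in grad_log_pi(theta, a)]
        for a in range(n)
    ]
    mean = [sum(pi[a] * samples[a][k] for a in range(n)) for k in range(n)]
    second_moment = sum(pi[a] * sum(g * g for g in samples[a]) for a in range(n))
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


if __name__ == "__main__":
    theta = [0.0, 0.0, 0.0]
    rewards = [10.0, 11.0, 12.0]  # hypothetical rewards for one state
    print(estimator_stats(theta, rewards, baseline=0.0))
    print(estimator_stats(theta, rewards, baseline=11.0))
```

The mean is the same for both baselines, because \(\sum_a \pi(a)\nabla \log \pi(a) = 0\), while the variance across actions drops sharply once the common offset of the rewards is subtracted.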

 Oct 2019

arxiv.org

computevar1bbÂj
this is the covariance matrix

This suggests that the effect of \(j(x)\) is to rotate the gradient field and move the critical points, also seen in Fig. 4b.
how does this equation suggest this?

sampling with replacement has better regularization
but you are saying that the temperature (\(\beta^{-1}\)) is lower when you sample with replacement, so that the regularization should be less?

conservative
how does this mean that it is conservative?

This implies that SGD implicitly performs variational inference with a uniform prior, albeit of a different loss than the one used to compute backpropagation gradients
The interpretation as variational inference with a uniform prior comes from reading the minimization objective as an ELBO: the second term is like the KL divergence between the approximate posterior and a uniform prior (which just gives the entropy). Nice
If \(\rho\) doesn't have any constraints then this should give the exact posterior with uniform prior, and likelihood given by \(\Phi(x)\)
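A discrete sanity check of that last remark, as a sketch under my own assumptions: a finite state space, the objective written as \(F(\rho) = \mathbb{E}_\rho[\Phi] - \beta^{-1} H(\rho)\), and made-up values of \(\Phi\). The unconstrained minimizer should be the Gibbs distribution \(\rho^* \propto e^{-\beta\Phi}\), and the gap \(F(\rho) - F(\rho^*)\) should equal \(\beta^{-1}\mathrm{KL}(\rho \,\|\, \rho^*)\).

```python
import math
import random


def gibbs(phi, beta):
    """Distribution proportional to exp(-beta * phi), i.e. uniform prior times likelihood."""
    m = min(phi)
    weights = [math.exp(-beta * (p - m)) for p in phi]
    z = sum(weights)
    return [w / z for w in weights]


def free_energy(rho, phi, beta):
    """E_rho[phi] - H(rho) / beta, the objective being minimized."""
    energy = sum(r * p for r, p in zip(rho, phi))
    entropy = -sum(r * math.log(r) for r in rho if r > 0)
    return energy - entropy / beta


def kl(rho, q):
    return sum(r * math.log(r / qi) for r, qi in zip(rho, q) if r > 0)


def random_distribution(n, rng):
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]


if __name__ == "__main__":
    phi, beta = [0.3, 1.2, 0.1, 2.0], 2.0  # hypothetical potential values
    star = gibbs(phi, beta)
    rng = random.Random(0)
    rho = random_distribution(len(phi), rng)
    print(free_energy(rho, phi, beta) - free_energy(star, phi, beta))
    print(kl(rho, star) / beta)
```

If \(\rho\) is restricted to a family (e.g. Gaussians), the same objective picks the member closest to the Gibbs distribution in KL, which is the usual variational-inference picture.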


arxiv.org

The second particularity is that since the computation of the reward \(R(p, c, \theta, o)\) is internal to the machine, it can be computed any time after the experiment \((c, \theta, o)\) and for any problem \(p \in \mathcal{P}\), not only the particular problem that the agent was trying to solve. Consequently, if the machine experiments a policy \(\pi_\theta\) in context \(c\) and observes \(o\) (e.g. trying to solve problem \(p_1\)), and stores the results \((c, \theta, o)\) of this experiment in its memory, then when later on it self-generates problems \(p_2, p_3, \ldots, p_i\) it can compute on the fly (and without new actual actions in the environment) the associated rewards \(R_{p_2}(c, \theta, o), R_{p_3}(c, \theta, o), \ldots, R_{p_i}(c, \theta, o)\) and use this information to improve over these goals \(p_2, p_3, \ldots, p_i\).
like hindsight experience replay


arxiv.org

Although methods to learn disentangled representation of the world exist [25,26,27], they do not allow to distinguish features that are controllable by the learner from features describing external phenomena that are outside the control of the agent.
Learning controllable features is similar to learning a causal model of the world, I think


arxiv.org

We find that the full NTK has better approximation properties compared to other function classes typically defined for ReLU activations [5, 13, 15], which arise for instance when only training the weights in the last layer, or when considering Gaussian process limits of ReLU networks (e.g., [20, 24, 32]).
NTK has "better approximation properties". What do they mean more precisely?


arxiv.org

and we have left the activation kernel unchanged, \(K_\ell = \frac{1}{M_\ell} A'_\ell A'^{T}_\ell\)
what is the reason to do this?

\((A_\ell \mid J_\ell)\)
J_l is the covariance for a single column of A_l right?

Second, we modified the inputs by zeroing-out all but the first input unit (Fig. 1 right).
how does this work more precisely? The targets are generated by feeding the modified inputs to the "teacher network", but the student network gets the unmodified inputs?

for MAP inference, the learned representations transition from the input to the output kernel, irrespective of the network width.
how is MAP inference implemented?

The representations in learned neural networks slowly transition from being similar to the input kernel (i.e. the inner product of the inputs) to being similar to the output kernel (i.e. the inner product of one-hot vectors representing targets).
this transition, as what? as the layer width is increased?

the covariance in the top-layer kernel induced by randomness in the lower-layer weights.
what does he mean by this?

e.g. compare performance in Garriga-Alonso et al. (2019) and Novak et al. (2019) against He et al. (2016) and Chen et al. (2018)).
but in here the GP networks lack many important features like batchnorm, pooling etc! Not sure if this example is a fair comparison. Also, not clear whether this difference is due to finite width or SGD (a question that Novak also asks)

enabling efficient and exact reasoning about uncertainty
Only in regression... AAaaAaaAh ÒwÓ



significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019]
Interesting, so apparently the NTK works better than the NNGP for this architecture at least


www.jmlr.org

Optimally, these parameters are chosen such that the true predictive process \(P(t_* \mid x_*, S)\) is closest to \(Q(t_* \mid x_*, S)\) in relative entropy.
in which sense is this optimal?

Bayes classifier
I thought the Bayes classifier would predict sign ( E_w [P(ty)y(xw)]  0.5) ?

our task is then to separate the structure from the noise.
Well, and to find the correct regularity; generalization is not just about separating structure from noise. Unless by "noise" here, you mean also the stochasticity in the training sample (of inputs)..

We know of no interesting real-world learning problem which comes without any sort of prior knowledge
Yep, no free lunch

(the lucky case)
Again, I wouldn't call it "lucky", because the whole point of the proof is that the generalization is good: it's very unlikely to have obtained this training set by luck, so it's most likely that we obtained it by having chosen a good prior. So I would call it the "good prior" case.


arxiv.org

, such as cross entropy loss, encourage a larger output margin
The fact that they also encourage a large SVM-margin is not so trivial tho

the gap between predictions on the true label and next most confident label.
In SVMs, for instance, "margin" refers to the distance between the classification boundary and a point. This can be related to the definition of margin here, but they are not the same?
E.g. if we have a small SVM-margin, but a really large weight norm, then we would still have a large output margin.
Ah, that's why they normalize by weight norm I suppose yeah.
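A toy illustration of why the normalization matters, as a sketch only: a linear multiclass classifier with made-up weights, and the Frobenius norm as the normalizer (for a deep network the paper's normalization would involve the layers' norms, which this does not attempt to reproduce). Scaling the weights inflates the output margin arbitrarily, while the normalized margin does not move.

```python
import math


def scores(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def output_margin(weights, x, label):
    """Score of the true label minus the best score among the other labels."""
    s = scores(weights, x)
    runner_up = max(s[k] for k in range(len(s)) if k != label)
    return s[label] - runner_up


def frobenius_norm(weights):
    return math.sqrt(sum(w * w for row in weights for w in row))


def normalized_margin(weights, x, label):
    return output_margin(weights, x, label) / frobenius_norm(weights)


def scaled(weights, factor):
    return [[factor * w for w in row] for row in weights]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical 3-class linear model
    x, y = [2.0, 1.0], 0
    for c in (1.0, 10.0, 100.0):
        Wc = scaled(W, c)
        print(c, output_margin(Wc, x, y), normalized_margin(Wc, x, y))
```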


arxiv.org

This is further consistent with recent experimental work showing that neural networks are often robust to re-initialization but not re-randomization of layers (Zhang et al. [42]).
what does this mean?

Kernels from single hidden layer randomly initialized ReLU network convergence to analytic kernel using Monte Carlo sampling (\(M\) samples). See §I for additional discussion
I think the Monte Carlo estimate of the NTK is a Monte Carlo estimate of the average NTK (as in, averaged over initializations), not of the initialization-dependent NTK which Jacot studied. Jacot showed that in the infinite-width limit both are the same.
But it seems from their results that even for finite width the average NTK is closer to the limit NTK than the single-sample NTK. This makes sense, because the single-sample one has extra fluctuations around the average.

We observe that the empirical kernel \(\hat{\Theta}\) gives more accurate dynamics for finite width networks.
That is a very interesting observation!

\(\eta = \eta_0 / n\)
yeah! so in standard parametrization, the learning rate is indeed O(1/n) !

\(\eta < \eta_{\text{critical}}\)
is the condition \(\eta <\eta_{\text{critical}}\) on the learning rate just so that gradient descent and gradient flow give similar results?


github.com

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
You didn't expect hypothes.is in here did you?
Bamboozled again!


arxiv.org

One may argue that a natural question considering convex polygons would be whether they are separate from each other. Unfortunately, there is an infeasible computational lower bound for this question.
Yeah, the nomissinclusion property doesn't imply the hulls don't intersect. Think of two perpendicular rectangles which intersect but where no corner of either lies inside the other rectangle
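The rectangle counterexample is easy to make concrete. A minimal sketch, assuming axis-aligned rectangles stored as `(xmin, ymin, xmax, ymax)`; the coordinates are arbitrary and the helper names are mine.

```python
def corners(rect):
    xmin, ymin, xmax, ymax = rect
    return [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]


def contains(rect, point):
    xmin, ymin, xmax, ymax = rect
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax


def overlap(a, b):
    """True if the two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


WIDE = (-3.0, -1.0, 3.0, 1.0)  # long horizontal bar
TALL = (-1.0, -3.0, 1.0, 3.0)  # long vertical bar, crossing the first one

if __name__ == "__main__":
    print("overlap:", overlap(WIDE, TALL))
    print("corner of WIDE inside TALL:", any(contains(TALL, c) for c in corners(WIDE)))
    print("corner of TALL inside WIDE:", any(contains(WIDE, c) for c in corners(TALL)))
```

The two bars form a plus sign: they share the central square, yet every corner of each bar sits outside the other, so checking vertices alone cannot certify that two convex hulls are disjoint.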


projects.raspberrypi.org

most laptop and desktop computers have one.
lol yeah sure

 Sep 2019

papers.nips.cc

For example, linear networks suffer worse conditioning than any nonlinear network, and although nonlinear networks may have many small eigenvalues they are generically non-degenerate.
But doesn't it necessarily have to be degenerate when the number of training points is smaller than the number of parameters?



For Softplus and Sigmoid, the training algorithm is stuck at a low test accuracy ~10%
wow. what's their train accuracy?

activation functions that do not have an EOC, such as Softplus and Sigmoid
What does their phase diagram look like?

might be explained by this \(\log(L)\) factor.
Is that likely given that it's so small that it can't be seen experimentally?

2EOC
what exactly is the EOC for ReLU?


arxiv.org

increases its mean
Right, but it may decrease its mean squared, which is what you are interested in.

We again make use of the wide network assumption
Isn't this now assuming large dimensionality of inputs?

It was shown in He et al. [2015] that for ReLU networks initialized using Equation 2, the total mean is zero and total variance is the same for all pre-activations, regardless of the sample distribution.
Well, it seems to me that they just looked at the average over weights. But their basic result is true if you average over any input distribution. You just get the average squared norm of the input multiplying the variances at each layer, but the variances at each layer are still all the same

s
insert comma here

For almost all samples, these neurons are either operating as if they were linear, or do not exist.
Unclear what you mean here


arxiv.org

then \(y_{l-1}\) has zero mean and has a symmetric distribution around zero. This leads to \(E[x_l^2] = \frac{1}{2}\mathrm{Var}[y_{l-1}]\)
This is very nice. We don't need the infinite width assumption to calculate how variances propagate through the network. This is unlike for covariances or higher moments

\(w_{l-1}\) have a symmetric distribution around zero
So the fundamental assumption is that the weights have a distribution which is symmetric around 0, not just of mean 0
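A quick Monte Carlo check of the half-variance identity and of why symmetry (not just zero mean) is what matters. This is a sketch with distributions I picked for illustration: a Gaussian and a uniform (both symmetric), and a shifted exponential, which has zero mean but is skewed.

```python
import random


def relu_ratio(samples):
    """Estimate E[max(0, y)^2] / Var[y] from a list of samples of y."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((y - mean) ** 2 for y in samples) / n
    relu_second_moment = sum(max(0.0, y) ** 2 for y in samples) / n
    return relu_second_moment / variance


def demo(n=100_000, seed=0):
    rng = random.Random(seed)
    gaussian = [rng.gauss(0.0, 1.5) for _ in range(n)]        # symmetric around 0
    uniform = [rng.uniform(-2.0, 2.0) for _ in range(n)]      # symmetric around 0
    skewed = [rng.expovariate(1.0) - 1.0 for _ in range(n)]   # zero mean, not symmetric
    return relu_ratio(gaussian), relu_ratio(uniform), relu_ratio(skewed)


if __name__ == "__main__":
    print(demo())
```

For the two symmetric cases the ratio should come out close to 1/2; for the shifted exponential the exact value is \(2/e \approx 0.74\), so zero mean alone does not give the factor of one half.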



However, since \(x\) is affected by \(W\), \(b\) and the parameters of all the layers below, changes to those parameters during training will likely move many dimensions of \(x\) into the saturated regime of the nonlinearity and slow down the convergence. This effect is amplified as the network depth increases.
Why is this?


Local file

We hope this work will encourage further research that facilitates the discovery of new architectures that not only possess inductive biases for practical domains, but can also be trained with algorithms that may not require gradient computation
The WANN is an example of meta learning architectures that can be trained with new algorithms

In the discussion they point out a couple of ways the work could be useful or interesting (transfer learning, training techniques beyond gradient methods), but they don't make many clear points. It seems the main point of the paper is just to point out that this is possible, and give food for thought.


arxiv.org

for erf and \(\sigma_b = 0\), the even degree \(\mu_k\)s all vanish
Does this mean that erf networks are not fully expressive?


Local file

The correlation vanishes if we train to a fixed, finite loss instead of for a specified number of epochs, and weakens as we increase the number of layers in the network
Do you mean the other way round? I thought the experiments in this section were running SGD until it reaches a small fixed loss value, like you say in Figure 5.
EDIT: Ah I see you repeat this in the end. Make sure you state that the experiments in Section 3.2 are for a fixed number of epochs.
The fact that you need to train for a fixed number of epochs is interesting. Perhaps the self-similarity among different minima occurs within a "layer" of weight space corresponding to weights of a given norm, and different layers are related by rescaling also (see section about flatness vs epochs, assuming norm increases with epochs). But if you train to a fixed loss, I guess the norm/layer needed to reach that loss is different for different data/functions, so that's why the correlation vanishes?

We restrict our attention to the rectified linear activation (ReLU) function, described by
as in the rest of the paper?

flatness
I think in the plots where you talk about flatness, the flatness axis should be log(prod lambda), so that larger values correspond to higher flatness

This is consistent with the aforementioned self-similarity properties of the simple Boolean network.
what do you mean? flatness could increase by the same amounts, whether the lambda_max correlated with flatness or not, no? The two phenomena could be related though

as gradient 1.687, indicating that flatness has increased upon further training
The gradient here just indicates that larger flatness increases in flatness more, right? It's the y-intercept that shows here that the flatness increases after 100 epochs

, the volume does not change (up to some noise limit).
I suppose because the function stops changing? Is this training on part of the inputs, rather than on all the inputs? If the latter is the case, it's trivial that, if the algorithm converges, it will not change the function it finds after enough epochs.

7-40-40-2
I thought the network was 7-40-40-1

For varying proportions of corruption (up to 50%)
This makes it seem to me like you corrupted a fraction of S_train, rather than appending an S_attack? We don't want that as then we are not fitting the samesized (uncorrupted) training set.

(using the original, uncorrupted train dataset)
You'll have to explain to me (and the reader) what precisely you did. What are each of the points in Figure 15? A different training set, the same training set?

We restrict the training set to 500 examples
Previously you said 512 examples. Which one is it?

image pixels
pixels in the images of size 28x28

We deliberately corrupt a small, variable proportion of the training dataset to produce solutions with diverse generalisation performances.
To be clear, S_train is fixed in size, but S_attack varies in size right? It's not S_train + S_attack that's fixed in size. You should make this clear, because both approaches have been used, but in this case, we want S_train to be fixed.

Figure 11
\(\alpha>1\) increases the norm and \(\alpha<1\) decreases it, it seems. I guess this is because in the former case we are increasing more parameters than we are decreasing, and vice versa in the latter case.
On the other hand, both increase the sharpness. This shows that sharpness and norm don't necessarily follow each other. However, it may be that for solutions that SGD finds, sharpness and norm do correlate. [ this is in a similar spirit to the alpha scaling rebuttal; while there are sharp regions with large and small norm, perhaps constrained to SGD-typical regions, the two quantities correlate. We could check this ]

deteriorating flatness as we scale
what are you referring to here?

after a fixed number of iterations with random weight and bias initialisation)
and standard optimization algorithms

flattest point
Well, we don't know if it's the flattest, but definitely flatter than alpha-scaled ones

how can minima corresponding to identical functions have arbitrarily different flatnesses?
the question here is: "if minima corresponding to the same function (which thus generalize identically) can have arbitrarily different flatnesses, how can flatness be used as a proxy for generalization?"

\((W_i, b_i, W_{i+1}) \to (\alpha W_i, \alpha b_i, \alpha^{-1} W_{i+1})\)
where \(\alpha>0\)
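To see the alpha-scaling invariance concretely, here is a sketch with a tiny one-hidden-layer ReLU network whose weights are made up for illustration. Scaling the first layer's weights and biases by \(\alpha > 0\) and dividing the next layer's weights by \(\alpha\) leaves every output unchanged (because \(\mathrm{ReLU}(\alpha z) = \alpha\,\mathrm{ReLU}(z)\)), while the parameter norm, and hence any Hessian-based sharpness measure, can be changed at will.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def affine(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def network(params, x):
    w1, b1, w2, b2 = params
    return affine(w2, b2, relu(affine(w1, b1, x)))


def alpha_scale(params, alpha):
    """Map (W1, b1, W2) to (alpha*W1, alpha*b1, W2/alpha); the output bias is untouched."""
    w1, b1, w2, b2 = params
    return (
        [[alpha * w for w in row] for row in w1],
        [alpha * b for b in b1],
        [[w / alpha for w in row] for row in w2],
        list(b2),
    )


def squared_norm(params):
    w1, b1, w2, b2 = params
    flat = [w for row in w1 for w in row] + b1 + [w for row in w2 for w in row] + b2
    return sum(p * p for p in flat)


# Hypothetical parameters: 2 inputs, 3 hidden units, 1 output.
PARAMS = (
    [[1.0, -2.0], [0.5, 1.5], [-1.0, 0.3]],
    [0.1, -0.2, 0.3],
    [[2.0, -1.0, 0.5]],
    [0.05],
)

if __name__ == "__main__":
    scaled = alpha_scale(PARAMS, 10.0)
    for x in ([1.0, 2.0], [-0.5, 0.7], [3.0, -1.0]):
        print(network(PARAMS, x), network(scaled, x))
    print(squared_norm(PARAMS), squared_norm(scaled))
```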

any form (rectified or otherwise)
as long as it is twice differentiable, so that the Hessian is defined

redundant
rather than "redundant", "have limitations", or "are inappropriate"?

most simple output function.
in the sample

symmetry
Self-similarity is a better term, as it's more commonly used, especially in cases where the similarity at different scales isn't exact

irrespective of complications due to rounding,
what complications?

the upper band corresponds to functions dominated by 0 and the lower band corresponds to functions dominated by 1.
How is this? I would have expected behavior to be symmetric under changing outputs 0 to outputs 1 (think of changing sign of last layer)

a fixed number of epochs
enough to reach global minimum / target function?

flatness is strongly anticorrelated with the volume
I would define the x axis as "sharpness", or if you want to call it flatness, make it negative. A minimum with larger log(lambda) is more flat right?

This is because
this is expected because
This is the MDL argument right?

function behaviour (and hence loss)
here we are assuming that the loss measures discrepancy on all input points, so that zero loss is only achieved by functions equal to the target function.

arguments
and can provide rigorous bounds in some cases via PAC-Bayes theory

If there is a bias, it will obey this bound.
the bound is always obeyed, but it is only tight if there is enough bias.
In other words, one only obtains simplicity bias (simple functions being more likely) if there is bias to begin with.

But precisely why different solutions differ in flatness and why optimisation algorithms used during training converge so surely to 'good', flat solutions remains unclear.
Also, the MDL principle based on flatness doesn't provide non-vacuous bounds [lecun et al], with a few recent exceptions [dan roy et al]

 Aug 2019


the limiting kernels \(K^\infty\) carry (almost) no information on \(x, x'\) and have therefore little expressive power
Why?

1t(X).
how does this gamma term evolve?

when using SGD, the gradient update can be seen as a GD with a Gaussian noise (Hu et al., 2018; Li et al., 2017).
Think of each step of Brownian motion as integrating many mini steps of SGD.
This is reminiscent of CLT

Recent work by Jacot et al. (2018) has showed that training a neural network of any kind with a full batch gradient descent in parameter space is equivalent to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK).
Only when the learning rate is very small


arxiv.org

Theorem 4.1(Weak Spectral Simplicity Bias).
No bias towards complexity

Kernels as Integral Operators
How does this definition of the kernel as an integral operator fit in the rest of the story as a kernel of a Gaussian process? I thought a Gaussian process doesn't define an input over functions?

with the larger stepsize2
why does the larger step size cause more stability?

whereCdk
you haven't defined \(\Delta\) in the statement of the theorem


arxiv.org

non-constant
because it is not a polynomial

\(w^{(\ell)} \to 0\)
Why?? The operator norm of a random Gaussian matrix goes like \(O(\sqrt{n})\)? (see e.g. https://terrytao.wordpress.com/2010/01/09/254a-notes-3-the-operator-norm-of-a-random-matrix/ )
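The \(\sqrt{n}\) growth is easy to see numerically. A minimal sketch, assuming square matrices with i.i.d. \(N(0,1)\) entries and estimating the operator norm by power iteration on \(A^T A\) (plain Python, so the sizes are kept small; the estimate is a lower bound on the true largest singular value that tightens with more iterations).

```python
import math
import random


def gaussian_matrix(n, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def rmatvec(a, v):
    """Compute A^T v without building the transpose."""
    out = [0.0] * len(a[0])
    for row, vi in zip(a, v):
        for j, x in enumerate(row):
            out[j] += x * vi
    return out


def operator_norm(a, rng, iters=80):
    """Estimate the largest singular value of A by power iteration on A^T A."""
    v = [rng.gauss(0.0, 1.0) for _ in a[0]]
    for _ in range(iters):
        w = rmatvec(a, matvec(a, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = matvec(a, v)  # v has unit norm here, so |A v| estimates the operator norm
    return math.sqrt(sum(x * x for x in av))


def norm_over_sqrt_n(n, seed=0):
    rng = random.Random(seed)
    return operator_norm(gaussian_matrix(n, rng), rng) / math.sqrt(n)


if __name__ == "__main__":
    for n in (30, 60, 120):
        print(n, norm_over_sqrt_n(n))
```

The printed ratio should stay roughly constant as \(n\) grows (near 2 for this ensemble), so it is \(\frac{1}{\sqrt{n}}\|W\|_{op}\) that stays bounded, not \(\|W\|_{op}\) itself.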

a(k)(t)a(k)(0) +c~a(k)(t)andw(k)(t)w(k)(0) + ~w(k)(t)to getthe polynomial bound
Do they substitute \(a^{(k)}\) and \(w^{(k)}\) with \(A(t)\)? That is a valid bound, but seems quite loose. But it doesn't seem to be what they are saying. Substituting the LHS by RHS on Q, doesn't guarantee we obtain a polynomial on A(t) ?

\(Id_{n_\ell}\)
should be \(Id_{n_0}\)

(`);d(`+1
This is an extension of the definition of \(\langle \cdot, \cdot \rangle_{p^{in}}\) to allow for vector-valued functions living in spaces of different dimensionality, in which case we take the outer product I think


arxiv.org

stochastic gradient descen
And in fact SGD seems to work better in terms of both convergence and generalization in practice. Can we explain that theoretically?

\(\mathbb{R}^{q d_{h-1} \times p}\)
As this has \(p\) columns, I'm assuming they only consider a stride of \(1\), for the convolutional networks (probably easy to generalize regardless)

λ4
\(\lambda_0\) is \(\lambda_\text{min}(H)\) I suppose

show
extra word

depends
depends on

at linear rate.
The rate of convergence is going to be very slow though for big \(n\), because the learning rate is very small

\(\frac{\eta \lambda_{\min}(\mathbf{K}^{(H)})}{2}\)
Is this guaranteed to be \(<1\)? Otherwise, the inequality couldn't be right.
Seems like this being less than \(1\) depends on what \(\lambda_{\text{min}}(\mathbf{K^{(H)}})\) is

A(H−1)
Hmm, should this be \(\mathbf{A}_{ij}^{(H-1)}\)?
I think so. See page 40 for example.

\(\mathbf{K}^{(h)}_{ij}\)
These are the standard Neural Network Gaussian Process kernels I think

gradient descent with a constant positive step size
But, like for the NTK paper, the step size is effectively very small, because of using NTK parametrization. Except that, because \(m\) is finite, it is at least a finite step size :P

\(\epsilon_1\)
Should be \(\mathcal{E}_1\)?

Both Zou et al. (2018) and Allen-Zhu et al. (2018c) train a subset of the layers
Really? That sounds like a big difference with practice..

in a unified fashion in Appendix E.
How does this compare to the unified treatment of Greg Yang?

Jacot et al. (2018) do not establish the convergence of gradient flow to a global minimizer.
They do, for a positive definite kernel right?

 Jul 2019

arxiv.org

this concludes the proof
Technically we've shown that the variation in \(\alpha\) is \(O(1/\sqrt{n_L})\), but not the derivative? I think however, one can express the variation in the NTK as a sum over variations of alpha (by using the arguments here, and then integrating in time), giving us the desired result that the variation of the NTK goes to zero as \(O(1/\sqrt{n_L})\)

\(\mathbb{R}^{n_\ell \times n_{\ell+1}}\)
should be \(\mathbb{R}^{n_{l+1} \times n_l}\) I think?

recursive bounds
Uses Cauchy-Schwarz for the operator norm of matrices, which can be obtained easily from the definition

\(A(t)\) stays uniformly bounded on \([0, \tau]\)
Why does it stay uniformly bounded on \([0,\tau]\) and not on \([0,T]\) ?
Which theorem are they applying from reference [6], Thm 4?

The summands \(\partial_{W^{(L)}_{ij}} f_{\theta,j'}(x)\, \partial_{W^{(L)}_{ij}} f_{\theta,j''}(x')\) of the NTK hence vary at rate of \(n_L^{-3/2}\)
Each of the derivatives has a factor of \(\frac{1}{\sqrt{n_{L}}}\), plus we get an extra factor of \(\frac{1}{\sqrt{n_{L}}}\) from the derivative of \(\alpha_{i}^{(L)}\)

@tjjjjjj@tjj
For the 2-norm, at least?

the activations
preactivations?

hence that \(\|\frac{1}{\sqrt{n_L}} W^{(L)}(0)\|_{op}\) is bounded
as Frobenius norm bounds spectral norm (which is the same as operator norm for matrices)

Theorem 2. Assume that \(\sigma\) is a Lipschitz, twice differentiable nonlinearity function, with bounded second derivative. For any \(T\) such that the integral \(\int_0^T \|d_t\|_{p^{in}}\, dt\) stays stochastically bounded, as \(n_1, \ldots, n_{L-1} \to \infty\), we have, uniformly for \(t \in [0, T]\)
Under the NTK parametrization, which they use, this limit implies that the learning rate (for GD on the standard parametrization) is \(O(1/\sqrt{n})\) (where \(n\) is layer size). So the parameters move less and less for a fixed \(T\), in this limit, which is, intuitively, why the NTK stays constant for this period of time until \(T\)
The interesting thing is that the function \(f\) can change, as all the parameters "conspire" for it to change. Therefore it can potentially fit a function, and find a global minimum, while the parameters have almost not moved at all.
I think the intuition for this "conspiracy" is that the total change in \(f\) is given by a sum over all the parameters' individual gradients. The number of parameters grows like \(n^2\). The gradient w.r.t. last hidden layer activations is \(O(1/\sqrt{n})\), and w.r.t. second-to-last hidden layer activations it is \(O(\sqrt{n}(1/\sqrt{n})^2) = O(1/\sqrt{n})\), where the \(\sqrt{n}\) comes from the variance of summing over all the activations in the last hidden layer. This means that the gradient w.r.t. a weight, in NTK parametrization, is \(O((1/\sqrt{n})^2)=O(1/n)\). In GD, each weight changes by an amount of the same order as the gradient (assuming \(O(1)\) learning rate, which we assume for the NTK-parametrization learning rate), so each weight contributes to change \(f\) by \(O(1/n^2)\). Therefore the total contribution from all the weights is \(O(1)\). Note that the contributions all have the same sign, as they are essentially the gradient w.r.t. that weight, squared, so they add linearly (and not growing like \(\sqrt{n}\) if they were all randomly signed)

~(`)1pn`W(`)
i guess the first product here is elementwise, although it's not explicitly said

The connection weights \(W^{(L)}_{ij}\) vary at rate \(\frac{1}{\sqrt{n_L}}\), inducing a change of the same rate to the whole sum
From chain rule

\(\mathcal{N}(0, 1)\)
huh? no \(1/\sqrt{n}\) ?

ANN realization function \(F^{(L)} : \mathbb{R}^P \to \mathcal{F}\), mapping parameters \(\theta\) to functions \(f_\theta\) in a space \(\mathcal{F}\).
:O we studied the same object in our paper! But we called it the parameter-function map!

This shows a direct connection to kernel methods and motivates the use of early stopping to reduce overfitting in the training of ANNs
But early stopping doesn't seem to often help in ANN training


arxiv.org

Finally and most importantly: how do we connect all of this back to a rigorous PAC learning framework?

high dimensionality crushes bad minima into dust
If high dimensionality is the reason, then why does the high dimensional linear model work so well?

rescalings of network parameters are irrelevant
Not always. If you scale all parameters of a ReLU network with biases, it changes the function. If biases are zero, it doesn't change the function.
It's true that batchnorm makes networks independent of parameter scaling, for the layer immediately before a batch norm layer.

When parameters are small (say, 0.1), a perturbation of size 1 might cause a major performance degradation. Conversely, when parameters
This feels like it may be true for nonlinearities like tanh, but not sure if it will be for relu.
For ReLU, larger parameters increase the gradient of the outputs w.r.t. parameters in lower layers. Exponentially in the number of layers! This is what allows the lazy regime (see literature on NTK / lazy training etc)

the linear model achieves only 49% test accuracy, while ResNet-18 achieves 92%.
This is interesting in that it shows that "overparametrization" is not enough to get models that generalize well. Neural networks must have special properties beyond being overparametrized

Interestingly, all of these methods generalize far better than the linear model. While there are undeniably differences between the performance of different optimizers, the presence of implicit regularization for virtually any optimizer strongly indicates that implicit regularization is caused in large part by the geometry of the loss function, rather than the choice of optimizer alone.
YES. That's what we say too: http://guillefix.me/nnbias/

bad minima are everywhere
This is a very loose statement. They could be a set of very little measure that is just rather dense in the space. So yeah, they could be "everywhere", but be very rare and hard to find, unless you are being trained to find them explicitly


terrytao.wordpress.com

Markov’s inequality.
I could only get
\(\mathbf{P}(M x \geq \sqrt{An}) \leq C^n \exp (-c A n)\)
Applying Markov inequality :P Did I do anything wrong?


arxiv.org

[35,
citation 35 and 6 are the same citation (one in the conference, and one in arxiv). I think they should be merged.

width
square root of width?

which for instance can be seen from the explicit width dependence of the gradients in the NTK parameterization
Yeah but the NTK parametrization makes the gradients much smaller. For normal parametrization, gradient of individual weights is not infinitesimal right?


arxiv.org

n≤d
\(d\) is the input dimension. But also the number of parameters, because we are looking at a linear model here.

We assume the feature matrix can be decomposed into the formX=X+ZwhereXis lowrank(i.e. rank(X)=r<<n) with singular value decompositionX=UVTwithU∈Rn×r,∈Rr×r,V∈Rd×r,andZ∈Rn×dis a matrix with i.i.d.N(0;2x~n)entries.
Noise model. The information space is the component of the inputs that lives in a low-dimensional space (the low-rank component), and the nuisance space is the component that corresponds to i.i.d. noise, which will w.h.p. be of maximum rank
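To make the two components concrete, here is a minimal NumPy sketch of sampling such a feature matrix. Assumptions of mine: the noise variance is read as \(\sigma_x^2/n\), the singular values of the low-rank part are arbitrary, and the function name `low_rank_plus_noise` is hypothetical.

```python
import numpy as np

def low_rank_plus_noise(n, d, r, sigma_x, seed=0):
    """Return (low-rank part, noise part, their sum) for an n x d feature matrix."""
    rng = np.random.default_rng(seed)
    # Orthonormal factors from reduced QR decompositions of Gaussian matrices.
    U, _ = np.linalg.qr(rng.standard_normal((n, r)))  # n x r
    V, _ = np.linalg.qr(rng.standard_normal((d, r)))  # d x r
    S = np.diag(np.linspace(2.0, 1.0, r))             # arbitrary singular values
    X_bar = U @ S @ V.T                                # "information" component
    # i.i.d. Gaussian entries with variance sigma_x**2 / n (assumed reading).
    Z = rng.normal(0.0, sigma_x / np.sqrt(n), size=(n, d))
    return X_bar, Z, X_bar + Z

X_bar, Z, X = low_rank_plus_noise(n=20, d=50, r=3, sigma_x=0.1)
rank_info = np.linalg.matrix_rank(X_bar)  # rank of the low-rank component
rank_full = np.linalg.matrix_rank(X)      # rank after adding the noise
```

With \(n \le d\), the noisy matrix should generically have rank \(n\), while the information component keeps rank \(r\).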


arxiv.org arxiv.org

r large, we obtain a low limit training accuracy and do not observe overfitting, a surprising fact since this amounts to solving an over-parameterized linear system. This behavior is due to a poorly conditioned linearized model, see Appendix C.
Wait, so it seems that in all the experiments with CNNs you just found that the lazy training didn't converge to a global minimum of training error. So it doesn't mean they aren't generalizing well!
Is your Jacobian degenerate for the first set of experiments (with squared loss), because if not, then your theorem implies that they should converge to a global minimum right?

that manages to interpolate the observations with just a small displacement in parameter space (in both cases, near zero training loss was achieved).
zero training loss is achieved both in the lazy and nonlazy regime, but the nonlazy solution generalizes much better

Cover illustration.
I suppose that in both the lazy and nonlazy regime, it has reached a global minimum of training loss?

Theorem 2.5 (Under-parameterized lazy training). Assume that \(\mathcal{F}\) is separable, \(R\) is strongly convex, \(h(w_0) = 0\) and \(\mathrm{rank}\, Dh(w)\) is constant on a neighborhood of \(w_0\). Then there exists \(\alpha_0 > 0\) such that for all \(\alpha \geq \alpha_0\) the gradient flow (4) converges at a geometric rate (asymptotically independent of \(\alpha\)) to a local minimum of \(F_\alpha\).
Convergence to local minimum, removing assumption about nondegeneracy of Jacobian

In terms of convergence results, this paper's main new result is the convergence of gradient flow, and showing that it stays close to the tangent (linearized) gradient flow.
And saying this for general parametrized models. The assumption of nondegenerate Jacobian is related to overparametrization, as nondegeneracy is more likely when one is overparametrized.

the gradient flow needs to be integrated with a step-size of order \(1/\mathrm{Lip}(\nabla F_\alpha) = 1/\mathrm{Lip}(h)^2\)
The step size required for discrete gradient descent to be a good approximation of the gradient flow.

as \(\alpha \to \infty\), \(\sup_{t \geq 0} \|w(t) - w_0\| = O(1/\alpha)\)
How come it can find a minimum arbitrarily close to the initialization?
Ah I see by the nondegenerate Jacobian assumption, you can find a local change that will fit \(y^*\), and \(\alpha\) large is just needed to reach the overall size/scale of \(y^*\) with the local change
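A rough linearization sketch of that argument, under my own simplifying assumptions: the model is scaled as \(\alpha h\), \(h(w_0) = 0\), and the Jacobian \(Dh(w_0)\) is surjective with smallest nonzero singular value \(\sigma_{\min}\). This is a heuristic, not the paper's proof.

```latex
\begin{aligned}
\alpha\, h(w) &\approx \alpha\, Dh(w_0)\,(w - w_0)
  && \text{(first-order expansion, using } h(w_0) = 0\text{)} \\
\alpha\, Dh(w_0)\,(w - w_0) = y^*
  &\;\Longrightarrow\; w - w_0 = \frac{1}{\alpha}\, Dh(w_0)^{+}\, y^*
  && \text{(minimum-norm solution via the pseudo-inverse)} \\
\|w - w_0\| &\le \frac{\|y^*\|}{\alpha\, \sigma_{\min}} = O(1/\alpha)
\end{aligned}
```

So the larger \(\alpha\) is, the smaller the parameter displacement needed to fit \(y^*\), which is consistent with the \(O(1/\alpha)\) statement in the highlight.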

\(\|h(w_0) - y^*\|\) is bounded
How realistic is this?

square-integrable functions with respect to x
why do we need them to be square-integrable?

are bound to reach the lazy regime as the sizes of all layers grow unbounded
and the learning rate tending to zero..

∇
Remember this nabla is w.r.t. its argument, not the parameters \(w\)


arxiv.org arxiv.org

Consider the class of linear functions over \(\mathcal{X} = \mathbb{R}^d\), with squared parametrization as follows
Seems quite artificial, but ok

duplicating units and negating their signs, the Jacobian of the model is degenerate at initialization, or in their notation \(\sigma_{\min} = 0\)
is this if the weights are tied only? Do they assume they are tied?

The data are generated by a 5-sparse predictor according to \(y^{(n)} \sim \mathcal{N}(\langle \beta^*, x^{(n)} \rangle, 0.01)\) with \(d = 1000\) and \(N = 100\).
perhaps large initialization is like a small L2 norm bias, and small initialization like an L1 norm bias. So the kernel regime is bad for learning sparse networks (I think Lee also says this in his talk)

training with gradient descent has the effect of finding the minimum RKHS norm solution.
they showed that for GD and logistic regression, but what about SGD, and square loss? I think for square loss you need either early stopping or regularization to get min norm solution?
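For the square-loss part of the question, a quick check in the plain linear (finite-dimensional) case: a sketch assuming full-batch gradient descent started at zero on an underdetermined least-squares problem. Since every update is a combination of rows of `X`, the iterate stays in the row space, so if it interpolates it should be the minimum-norm interpolant. Whether this carries over to SGD or to other initializations is not tested here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # fewer samples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                   # zero init: iterates stay in the row space of X
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / largest eigenvalue of X^T X
for _ in range(20000):
    w -= step * X.T @ (X @ w - y)        # gradient of 0.5 * ||Xw - y||^2

w_min_norm = np.linalg.pinv(X) @ y       # minimum-l2-norm interpolating solution
gap = np.linalg.norm(w - w_min_norm)
```

In this setting no early stopping or explicit regularization is used, and the gap to the pseudo-inverse solution should be at numerical-precision level.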


arxiv.org arxiv.org

Can we more provide theoretical justifications for this gap?
are all our base belong to us?


arxiv.org arxiv.org

distance
the kernel distance?


arxiv.org arxiv.org

the case of a regression loss, the obtained model behaves similarly to a minimum norm kernel least squares solution
Only its expected value, see page 7 in Jacot2018, if I understood correctly

State-of-the-art neural networks are heavily overparameterized, making the optimization algorithm a crucial ingredient
The fact that most naive learning algorithms work well, makes me question the "crucial" qualifier..


www.wikiwand.com www.wikiwand.com

which is a nonlinear
typo on above equation? this appears to be the same as the Schrodinger equation


arxiv.org arxiv.org

training just the top layer with an \(\ell_2\) loss is equivalent to a kernel regression for the following kernel: \(\ker(x, x') = \mathbb{E}_{\theta \sim \mathcal{W}}[f(\theta, x) \cdot f(\theta, x')]\),
This is the expected value of the kernel, not the actual kernel, which would correspond to a random features kernel right?
Hmm I think I remember random features converging when their number grows to infinity, but the product \(f(\theta,x)f(\theta,x')\) doesn't stochastically converge when the width grows to infinity right? Only its expectation converges
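A small experiment to separate the two objects, under assumptions of mine: \(f(\theta, x) = \mathrm{relu}(\theta \cdot x)\) with \(\theta \sim \mathcal{N}(0, I)\), and the random-features kernel is the average of products over `width` independent features (i.e. with \(1/\text{width}\) normalization). For this choice the expectation has a known closed form (the degree-1 arc-cosine kernel, up to a factor of 1/2). The function names are hypothetical.

```python
import numpy as np

def arccos_kernel(x, xp):
    """Closed form of E[relu(theta.x) * relu(theta.xp)] for theta ~ N(0, I)."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    phi = np.arccos(np.clip(x @ xp / (nx * nxp), -1.0, 1.0))  # angle between inputs
    return nx * nxp * (np.sin(phi) + (np.pi - phi) * np.cos(phi)) / (2 * np.pi)

def random_feature_kernel(x, xp, width, rng):
    """Average of f(theta_i, x) * f(theta_i, xp) over `width` random features."""
    theta = rng.standard_normal((width, x.shape[0]))
    return np.mean(np.maximum(theta @ x, 0.0) * np.maximum(theta @ xp, 0.0))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 0.0])
xp = np.array([0.6, 0.8, 0.0])

exact = arccos_kernel(x, xp)
one_feature = random_feature_kernel(x, xp, 1, rng)        # a single random product
many_features = random_feature_kernel(x, xp, 200_000, rng)  # averaged over many
```

The single product is just one random draw, while the averaged version should concentrate around `exact` by the law of large numbers; without the averaging there is no reason for convergence.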

a Gaussian Process (GP) [Neal, 1996]. This model as well as analogous ones with multiple layers [Lee et al., 2018, Matthews et al., 2018] and convolutional filters [Novak et al., 2019, Garriga-Alonso et al., 2019] make up the Gaussian Process view of deep learning. These correspond to infinitely wide deep nets whose all parameters are chosen randomly (with careful scaling), and only the top (classification) layer is trained.
Maybe, but these kernels also correspond to those of a fully trained ideal Bayesian neural network, with prior over weights given by the iid initialization

 Jun 2019

www.marxists.org www.marxists.org

He is not like that on account of a cowardly heart or lungs or cerebrum, he has not become like that through his physiological organism; he is like that because he has made himself into a coward by actions.
physiology does affect you as well though...


arxiv.org arxiv.org

\(f = \log p_d\).
If the optimum of (2) is given by this when function is unrestricted, if we consider a family with zero "approximation error" (so that the optimum is in the family), then the optimum on the family is the same as over all functions


arxiv.org arxiv.org

we can employ efficient off-policy reinforcement learning algorithms that are faster than current meta-RL methods, which are typically on-policy (Finn et al., 2017a; Duan et al., 2016).
Why are previous meta-RL algorithms typically on-policy?

 May 2019

dspace.mit.edu dspace.mit.edu

\(D^*[T_*\mu, T(Z_{m'})]\)
I see this as one of the main innovations of the paper. This term is a discrepancy between the sample and the true distribution \(\mu\). This would allow Z_m to be sampled from a different distribution, for instance, which would give bounds that account for distributional drift.

V[f]
This basically offers a measure of the variance of the loss (in a nonstatistical sense) over the instance space, of the learned function.

Thus, in classical bounds including data-dependent ones, as \(\mathcal{H}\) gets larger and more complex, the bounds tend to become more pessimistic for the actual instance \(\hat{y}_{A(S_m)}\) (learned with the actual instance \(S_m\)), which is avoided in Theorem 1.
Sure, but that is also avoided in some statistical learning approaches, like Structural Risk Minimization, PAC-Bayes, and the luckiness framework, which you cite!


arxiv.org arxiv.org

the total computational cost is similar to that of single-head attention with full dimensionality
smaller?

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
So if I understand correctly, with a single head, different parts of the d_model-dimensional query vector may "want" to attend to different parts of the key, but because the weight of the values is computed by summing over all elements in the dot product, it would just average these local weights. Separating into different heads allows attending to different value vectors for different "reasons".
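A toy illustration of that reading, with heavy simplifications that are mine: no learned projections (real multi-head attention applies a separate linear projection per head), heads obtained by simply chopping the feature dimension into chunks, and hand-picked vectors. The function names are hypothetical.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(Q, K):
    """Scaled dot-product attention weights, one row per query."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]))

def multi_head_weights(Q, K, num_heads):
    """Split the feature dimension into num_heads chunks; one weight matrix per head."""
    Qh = np.split(Q, num_heads, axis=-1)
    Kh = np.split(K, num_heads, axis=-1)
    return np.stack([attention_weights(q, k) for q, k in zip(Qh, Kh)])

s = 3.0
Q = np.array([[s, 0.0, 0.0, s]])       # first half "wants" key 0, second half key 1
K = np.array([[s, 0.0, 0.0, 0.0],      # key 0 matches the first half of the query
              [0.0, 0.0, 0.0, s]])     # key 1 matches the second half

single = attention_weights(Q, K)       # shape (1, 2): both keys get the same score
heads = multi_head_weights(Q, K, 2)    # shape (2, 1, 2): each head picks its own key
```

With one head, the two matches are summed into equal scores and the weights come out uniform; with two heads, head 0 concentrates on key 0 and head 1 on key 1.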

 Apr 2019

Local file Local file

the probability
log probability

say
~~

too weak
for Kmax

tighter
in relative terms

to generate x
given the map \(f\)

first e
first give x, and then enumerate ... identifying all inputs p mapping to x, namely \(f^{-1}(x)\)

plays a key role in
is the main component in

,
:

function
of

\(N_I = 2^n\).
for binary strings

derived
suggested

Since many processes in science and engineering can be described as input-output maps that are not UTMs
Perhaps say "This suggests that, even though many maps are not UTMs, the principle that low K are high P should hold widely"
because it is not because they are not UTMs, but it is in spite of them not being UTMs, I would argue.

, a classic categorization of machines by their computational power,
in parentheses


www.cs.toronto.edu www.cs.toronto.edu

k(y,x,x′,y′)
should be \(k(y,x,y',x')\) right?


distill.pub distill.pub

If we add a periodic and a linear kernel, the global trend of the linear kernel is incorporated into the combined kernel.
Remember that kernel functions with one of their arguments evaluated are members of the reproducing kernel Hilbert space to which all the functions supported by a particular Gaussian process belong.
Therefore adding kernels amounts to adding the functions in these two spaces. That is why the resulting functions work like this when combining kernels!
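A minimal sketch of the same idea at the level of samples, using the standard fact that the sum of two independent zero-mean GPs has the sum of their kernels as covariance. The kernel hyperparameters and function names are arbitrary choices of mine, not taken from the article.

```python
import numpy as np

def linear_kernel(x, xp, variance=0.5):
    return variance * np.outer(x, xp)

def periodic_kernel(x, xp, period=1.0, lengthscale=1.0):
    diff = np.subtract.outer(x, xp)
    return np.exp(-2.0 * np.sin(np.pi * diff / period) ** 2 / lengthscale ** 2)

x = np.linspace(0.0, 5.0, 60)
K_lin = linear_kernel(x, x)
K_per = periodic_kernel(x, x)
K_sum = K_lin + K_per                 # kernel of the combined process

jitter = 1e-8 * np.eye(len(x))        # small diagonal term for numerical stability
rng = np.random.default_rng(0)
zero_mean = np.zeros(len(x))
f_lin = rng.multivariate_normal(zero_mean, K_lin + jitter)  # global linear trend
f_per = rng.multivariate_normal(zero_mean, K_per + jitter)  # periodic wiggle
f_sum = f_lin + f_per                 # one draw from a GP with kernel K_sum

min_eig = np.linalg.eigvalsh(K_sum).min()  # should be >= 0 up to round-off
```

Plotting `f_sum` against `x` should show a periodic function riding on a straight-line trend, which is the behaviour described in the highlighted sentence.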


www.jmlr.org www.jmlr.org

concave in both arguments. Jensen’s inequality (\(f(x,y)\) concave \(\Rightarrow \mathbb{E} f(x,y) \geq f(\mathbb{E}x, \mathbb{E}y)\))
Actually it's convex

 Mar 2019

www.jmlr.org www.jmlr.org

A stochastic error rate, \(\hat{Q}(\vec{w}, \mu)_S = \mathbb{E}_{\vec{x}, y \sim S}\, \bar{F}(\mu \gamma(\vec{x}, y))\)
Remember that the w sampled from the "posterior" isn't necessarily parallel to the original w, so that the stochastic classification rate isn't simply F(sign(margin)) but something more complicated; see the proof.

Since the PAC-Bayes bound is (almost) a generalization of the Occam’s Razor bound, the tightness result for Occam’s Razor also applies to PAC-Bayes bounds.
Oh, c'mon :PP You are just showing that PAC-Bayes is tight as a statement for all Q and for a particular P. That is, if we only let the bound depend on the quantities it currently depends on (namely the KL divergence between Q and P, delta, etc.), then it can't be made tighter, because it would then break for the particular choice of D, hypothesis class, and Q, and for any value of KL, in Theorem 4.4 above.
> What I mean is this: that we say the bound is a function f(KL, delta, m, etc). Theorem 4.4 shows that there is a choice of learning problem and algorithm such that these arguments could be anything, and the bound is tight. Therefore, we can't lower this bound without it failing. It is tight in that sense. However, it may not be tight if we allow the bound to depend on other quantities!

The lower bound theorem implies that we can not improve an Occam’s Razor like statement.
Yeah, as in if it only depends on \(P(c)\) and the other quantities expressed there, and have it not depend on the algorithm, so it should be a general function that takes \(P(c)\), \(\delta\) etc, but the same function for any algorithm. Then yes. And this is what they mean here.

For all \(P(c)\), \(m\), \(k\), \(\delta\) there exists a learning problem \(D\) and algorithm such that
Depends on what you mean by "for all \(P(c)\)": are you fixing the hypothesis class or what? Because your proof assumes a particular type of hypothesis class... For \(P(c)\) having support over a hypothesis class where the union bound isn't tight, the bound is not tight any more.

The distribution \(D\) can be drawn by first selecting \(Y\) with a single unbiased coin flip, and then choosing the \(i\)th component of the vector \(X\) independently, \(\Pr((X_1, \ldots, X_n) \mid Y) = \prod_{i=1}^{n} \Pr(X_i \mid Y)\). The individual components are chosen so \(\Pr(X_i = Y \mid Y) = \mathrm{Bin}(m, k, \delta P(c))\). The classifiers we consider just use one feature to make their classification: \(c_i(x) = x_i\). The true error of these classifiers is given by: \(c_D = \mathrm{Bin}(m, k, \delta P(c))\)
Ok, so this has proven that the Occam bound is tight for this particular \(D\) for this particular hypothesis class, which is quite special, because it has the property that the union bound becomes tight. But that is a very special property of this hypothesis class (or more general, of this choice of support for \(P\) right??)
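For reference, a sketch of how I understand the quantity in the bound. I'm assuming the Bin in the highlight stands for the inverse binomial tail, i.e. the largest true error rate \(p\) for which seeing at most \(k\) errors in \(m\) samples still has probability at least \(\delta\), and that the Occam's Razor bound evaluates it at \(\delta P(c)\). The function names are mine.

```python
from math import comb

def binomial_tail(m, k, p):
    """Probability of at most k errors in m i.i.d. draws with error rate p."""
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(k + 1))

def binomial_tail_inverse(m, k, delta, tol=1e-12):
    """Largest p in [0, 1] with binomial_tail(m, k, p) >= delta, found by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binomial_tail(m, k, mid) >= delta:
            lo = mid   # tail still large enough: the answer is at or above mid
        else:
            hi = mid
    return lo

def occam_bound(m, k, delta, prior_c):
    """Assumed form of the bound: confidence delta is split according to the prior P(c)."""
    return binomial_tail_inverse(m, k, delta * prior_c)

high_prior = occam_bound(m=100, k=5, delta=0.05, prior_c=0.5)
low_prior = occam_bound(m=100, k=5, delta=0.05, prior_c=0.01)
```

A classifier with smaller prior mass gets a smaller share of \(\delta\) and hence a looser bound. That is the only way \(P(c)\) enters, which is why the tightness claim is about bounds of this functional form.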
