 Dec 2018

arxiv.org arxiv.org

it does not distinguish between models which fit the training data equally well
Well, if it is regularized (so there is an effective prior), then this isn't true!


openreview.net openreview.netpdf7

on networks and problems at a practical scale
Yeah. But not on the original networks, but on the compressed ones!

L()
This is a 0-1 loss. It can be applied to top-1 or top-5 by defining the loss appropriately!

The pruned model achieves a validation accuracy of 60%.
this is far from SOTA. right?

A direct application of our naïve Occam bound yields non-vacuous bound on the test error of 98.6% (with 95% confidence)
OK, that is non-vacuous, but only just, right?

Our simple Occam bound requires only minimal assumptions, and can be directly applied to existing compressed networks
But remember that, for the bound to be valid, the compression scheme technically shouldn't depend on the data (as the PAC-Bayes prior depends on the compression scheme).

We obtain a bound on the training error of 46%
On the training error, or the test error?

We prune the network using Dynamic Network Surgery (Guo et al., 2016), pruning all but 1.5% of the network weights.
Wow, they are very weight-compressible indeed.


emtiyaz.github.io emtiyaz.github.io

Mean-field underestimates the variance
Independence increases concentration

REq=i(w=i)[logp(D;w)]dw
You are normalizing the log not the distribution here!


www.cs.cmu.edu www.cs.cmu.edu

/
Exp missing

 Nov 2018

emtiyaz.github.io emtiyaz.github.io

\(D_{KL}[q \| p]\)
It is minus this!


openreview.net openreview.netpdf6

\(|S|_c + |C|_c\)
These are the sizes of the codes for encoding the vector of weight indices and discretized weight values, according to coding scheme \(c\)

\(2^{64}\) bytes, which is \(2^{72}\) bits
Isn't \(2^{64}\) bytes \(2^{67}\) bits, as each byte is \(2^3\) bits?
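Quick arithmetic check of the above (my own sketch; the variable names are made up, not from the paper):

```python
# Each byte is 8 = 2**3 bits, so the exponents add: 2**64 * 2**3 = 2**67.
BITS_PER_BYTE = 8

n_bytes = 2 ** 64
n_bits = n_bytes * BITS_PER_BYTE

print(n_bits == 2 ** 67)  # converting bytes to bits adds 3 to the exponent
print(n_bits == 2 ** 72)  # the quoted figure would need 2**8 bits per byte
```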

Theorem 4.1
Bound on KL divergence for Universal prior, and point posterior

As \(c\) is injective on \(\mathcal{H}_c\), we have that \(Z \leq 1\).
By Kraft inequality, and \(m\) being a probability measure
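A small sketch of the Kraft inequality itself, assuming the codewords form a binary prefix-free code (the example code below is hypothetical, not the paper's coding scheme): the sum of \(2^{-\text{length}}\) over the codewords never exceeds 1.

```python
from fractions import Fraction


def kraft_sum(codewords):
    """Exact sum over codewords of 2**(-length)."""
    return sum(Fraction(1, 2 ** len(w)) for w in codewords)


def is_prefix_free(codewords):
    """True if no codeword is a prefix of another (duplicates count as prefixes)."""
    words = sorted(codewords)
    # In sorted order, a prefix is always immediately followed by a word it prefixes.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


full_code = ["0", "10", "110", "111"]  # hypothetical complete prefix-free code
partial_code = ["0", "10", "110"]      # hypothetical code leaving one leaf unused

for code in (full_code, partial_code):
    assert is_prefix_free(code)
    print(code, kraft_sum(code))
```

Weighting each term by a probability measure \(m\) can only shrink the sum further, which is presumably where \(Z \leq 1\) comes from.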

The generalization bound can be evaluated by compressing a trained network, measuring the effective compressed size, and substituting this value into the bound.
An important question is: are the bounds valid for the compressed network, or for the original one as well? (as isn't the case in the related work: https://arxiv.org/abs/1802.05296 )

empirical evidence suggests that they fail to improve performance in practice (Wilson et al., 2017)
Well. If this is the case it means that proving a generalization bound for one of these procedures "implies" a bound for the standard training procedure, as you are claiming that the standard one generalizes not (much) worse than these ones


www.youtube.com www.youtube.com

To understand the phenomenon described by Saxe and in the video at 43:00, we can think of this: low eigenvalues in XX^T correspond to directions with little variation in the input. However, because of the random fluctuation eta, the output could have an O(1) variation even for arbitrarily small input variation, which requires a large weight to fit, and produces large generalization error.
For $\alpha<1$, the probability of these directions in input space with low variation decreases, as we get fewer directions with points overall, I think (directions with no points/variation are ignored by the algorithm, which projects the weights onto the input subspace, and these are the 0-eigenvalue part of the Marchenko-Pastur distribution).


arxiv.org arxiv.org

weak interactions between layers can cause the network to have high sharpness value.
Why does interacting weakly cause high sharpness?
I think the reason is that for layers which interact strongly, random perturbations tend to cause a smaller relative change in the output than for layers that interact weakly. And this smaller relative change in the outputs probably translates into a smaller absolute change in the loss...
think about aligned eigenvectors and stuff

depends only linearly on depth and does not have any exponential dependence, unlike other notions of generalization.
Aren't VC dim bounds for NNs linear in depth also?

forming a confusion set that includes samples with random labels.
This is what's done in Wu et al https://arxiv.org/abs/1706.10239 also

middle and right
For a fixed expected sharpness, the KL (and thus the effective capacity/bound) increases less for true labels than for random labels. Also, in both cases, it increases monotonically.

capacity is proportional to \(\left(\frac{C_M}{\gamma_{margin}}\right)^n \mathrm{diam}_M(\mathcal{X})\).
This is the "curse of dimensionality"!
Probably can prove using covering numbers which bound RadComp (via Massart's lemma)

simply bounding the Lipschitz constant of the network is not enough to get a reasonable capacity control
but are the generalization error bounds for Lipschitz functions tight?

However, the covering number of the input domain can be exponential in the input dimension and the capacity can still grow as
Unless the inputs lie on a low-dimensional manifold, I guess

Instead, to meaningfully compare norms of the network, we should explicitly take into account the scaling of the outputs of the network. One way this can be done, when the training error is indeed zero, is to consider the “margin” of the predictions in addition to the norms of the parameters.
This has been used in several papers on generalization bounds for neural nets.
I think the idea is like that of margin bounds for SVMs. You can't bound the RadComp using the 0-1 loss, as it's not Lipschitz (related to the arguments above). But you can bound it for the hinge loss (which is related to having a margin!), and then use the fact that the hinge loss upper bounds the 0-1 loss.
The intuition, I guess, is that the previous troublesome cases with norm-based capacity can be solved by taking the margin into account, as the extremes in both cases reduce/increase the margin (for weights going to zero/infinity).
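A minimal sketch of the last step of that argument (my own illustration, with the margin defined as \(y f(x)\) and made-up sample margins): the hinge loss is 1-Lipschitz in the margin and sits above the 0-1 loss everywhere.

```python
def zero_one_loss(margin):
    """0-1 loss as a function of the margin y * f(x): 1 if misclassified."""
    return 1.0 if margin <= 0 else 0.0


def hinge_loss(margin):
    """Hinge loss max(0, 1 - margin); 1-Lipschitz in the margin."""
    return max(0.0, 1.0 - margin)


margins = [-2.0, -0.5, 0.0, 0.3, 1.0, 2.5]  # hypothetical margins
for m in margins:
    # The hinge loss upper bounds the 0-1 loss at every margin.
    assert hinge_loss(m) >= zero_one_loss(m)
    print(m, zero_one_loss(m), hinge_loss(m))
```

So a bound on the expected hinge loss is automatically a bound on the expected 0-1 loss, and the hinge loss only reaches zero once the margin is at least 1.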

proportional to \(\prod_{i=1}^{d} \|W_i\|_{1,\infty}^2\), where \(\|W_i\|_{1,\infty}\) is the maximum over hidden units in layer \(i\) of the \(\ell_1\) norm of incoming weights to the hidden unit [4].
This comes from the Rademacher complexity of the neural network. See here: http://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/18/material/lecture3.pdf#page=4 for a derivation

One can then ensure generalization of a learned hypothesis \(h\) in terms of the capacity of \(\mathcal{H}_{\mathcal{M},\mathcal{M}(h)}\)
They just state this, but they should maybe cite some work on non-uniform learnability, SRM, MDL, so that people unfamiliar with it can see why this is!

any measure which is uniform across all functions representable by a given architecture, is not sufficient to explain the generalization ability of neural networks trained in practice. For linear models, norms and margin-based measures, and not the number of parameters, are commonly used for capacity control [4,8,24].
These are all SRM based.
They depend (most often implicitly, via the training data) on the data distribution/target function, and on the algorithm (also implicitly). If you get a good bound with SRM, it's most likely because your algorithm is biased towards the right kind of solution ("kind" basically corresponds to the classes used in SRM). In other words, a bad algorithm (relative to the target function) will be very unlikely to produce a good bound. On the other hand, a good algorithm may produce a bad bound if the prior is chosen badly, so SRM bounds may not be tight in general.


arxiv.org arxiv.org

typo \(\theta\) shouldn't be here

 Oct 2018

www.stats.ox.ac.uk www.stats.ox.ac.uk

www.stats.ox.ac.uk www.stats.ox.ac.uk

+ 1
I don't think the \(+1\) is actually necessary

Proposition 5.1
Remember: a larger metric corresponds to smaller rulers

the
there

5.2
5.3


www.stats.ox.ac.uk www.stats.ox.ac.uk

2
shouldn't there be a \(\lambda\) here too?
And in fact another \(\lambda\) from Rad(A_1)?

=
this should be \(\leq\),
though I would have used \(\mathcal{L} \cup \mathcal{L}\) instead

0
apostrophe shouldn't be there

h
hull

=
missing parenthesis

l
$$j$$


www.stats.ox.ac.uk www.stats.ox.ac.uk

=
\(\leq\)


arxiv.org arxiv.org

(U) =RSP(Ujv)d(v).
? How does this define what \(\rho\) is?


arxiv.org arxiv.org

he largestlast width function seems to converge slightly faster than the largest last width function
typo

more
less


courses.cs.washington.edu courses.cs.washington.edu

distribution

 Aug 2018

arxiv.org arxiv.org

DdCid!dCjd!E=1NdCd!2+F(!)ij
I don't get this formula. What are C_i and C_j supposed to mean. Or rather, where is the randomness over which we average?


arxiv.org arxiv.org

is not surprising and there are suitable options, even without GANs
Well, what is nontrivial about this is generalizing (as usual), how do you generate novel samples, what are samples that "look like" the rest of the data. How do you define that?


openreview.net openreview.netpdf1

The alignment problem is how to make sure that the input image in A is mapped (via image generation) to an analog image in B, where this analogy is not defined by training pairs or in any other explicit way (see Sec. 2).
Well, you are attempting to define it as "the alignment given by the lowest complexity meme".
I think this doesn't address the question, which applies more generally to the whole field of cross-domain mapping. Why aren't the new neural network approaches compared with previous approaches to unlabelled cross-domain mapping, which can be formalized as different forms of alignment between sets of points? I think the main difference is that in those approaches you typically know the whole target domain, while here we don't quite know it, we just have a few samples from it, but I feel this is a surmountable problem.

 Jul 2018

www.wikiwand.com www.wikiwand.com

The position, \(C\), of the camera expressed in world coordinates is \(C = -R^{-1}T = -R^{T}T\) (since \(R\) is a rotation matrix).
The minus sign is there because we translate the world in the opposite direction to the camera's translation, so that everything is positioned correctly relative to a camera placed at the origin. The \(R^{-1}\) is there because we first rotate by \(R\) and then translate by \(T\), so to recover the original translation vector we need to undo the rotation on \(T\).
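A quick numerical check of this (my own sketch; the rotation angle and translation are arbitrary made-up values): if world points map to the camera frame as \(RX + T\), then the world point \(C = -R^{T}T\) should land exactly at the camera-frame origin.

```python
import math


def matvec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(M):
    return [[M[j][i] for j in range(3)] for i in range(3)]


angle = math.radians(30)  # hypothetical rotation about the z axis
R = [[math.cos(angle), -math.sin(angle), 0.0],
     [math.sin(angle), math.cos(angle), 0.0],
     [0.0, 0.0, 1.0]]
T = [1.0, 2.0, 3.0]  # hypothetical translation

# Camera centre in world coordinates: C = -R^T T (R^-1 = R^T for a rotation).
C = [-c for c in matvec(transpose(R), T)]

# Mapping C into the camera frame should give the origin: R C + T = 0.
in_camera = [rc + t for rc, t in zip(matvec(R, C), T)]
print(C, in_camera)
```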

\(u_0\) and \(v_0\) represent the principal point, which would be ideally in the centre of the image.
It is multiplied by the z component of the point because we want to translate by a fixed amount in the uv plane. However, after being acted on by the camera matrix, the points are in homogeneous coordinates, scaled by the z component!
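To see this concretely, here is a sketch assuming a simple intrinsic matrix with a single focal length \(f\) and no skew (the numbers are made up): the offset enters as \(u_0 z\) before the perspective division and comes out as a fixed shift \(u_0\) after it.

```python
def project(point, f, u0, v0):
    """Pinhole projection with K = [[f, 0, u0], [0, f, v0], [0, 0, 1]]."""
    x, y, z = point
    # Homogeneous image coordinates: the principal point is multiplied by z here.
    uh, vh, wh = f * x + u0 * z, f * y + v0 * z, z
    # Perspective division turns u0 * z back into a constant offset u0.
    return uh / wh, vh / wh


f, u0, v0 = 500.0, 320.0, 240.0  # hypothetical intrinsics

near = project((0.5, 0.25, 1.0), f, u0, v0)
far = project((2.0, 1.0, 4.0), f, u0, v0)      # same ray, four times the depth
centre = project((0.0, 0.0, 7.0), f, u0, v0)   # point on the optical axis

print(near, far, centre)
```

Points on the same ray land on the same pixel, and a point on the optical axis lands on \((u_0, v_0)\) whatever its depth.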


arxiv.org arxiv.org

\(\frac{dJ}{d\theta} = \frac{dL}{d\theta}\)
Why??
I see, I think the derivative wrt \(\theta\) here is supposed to be while z satisfies the constraints in (32)...

restricted architectures which partition the hidden units. Our approach does not have these restrictions
But it has whatever restrictions being an ODE imposes. How do the expressivity and learnability of the ODEnet compare with ResNets?

Minibatching
Can you not just compute the gradient for each input, and average them?

Second
typo?

\(\int_{t_{start}}^{t_{end}} \lambda(z(t))\,dt\)
This second term comes from the fact that we want the probability of observations at times \(t_1, ..., t_N\), and at no other time between \(t_{start}\) and \(t_{end}\)
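Spelling this out (a sketch, assuming the event times follow an inhomogeneous Poisson process with rate \(\lambda(z(t))\)): the probability of seeing no events on an interval is the exponential of minus the integrated rate, which gives the second term in the log-likelihood.

```latex
\begin{align*}
P(\text{no events in } [a, b]) &= \exp\!\left(-\int_a^b \lambda(z(t))\,dt\right) \\
p(t_1, \dots, t_N \mid t_{start}, t_{end})
  &= \prod_{i=1}^{N} \lambda(z(t_i)) \,
     \exp\!\left(-\int_{t_{start}}^{t_{end}} \lambda(z(t))\,dt\right) \\
\log p(t_1, \dots, t_N \mid t_{start}, t_{end})
  &= \sum_{i=1}^{N} \log \lambda(z(t_i))
     - \int_{t_{start}}^{t_{end}} \lambda(z(t))\,dt
\end{align*}
```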

layers of only a single hidden unit
corresponding to weight matrices of rank 1

Instantaneous Change of Variables
This is just the differential form of the continuity equation
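To spell it out (a sketch, assuming \(p\) is smooth and \(z(t)\) follows \(dz/dt = f(z, t)\)): evaluating the continuity equation along a trajectory gives the log-density dynamics.

```latex
\begin{align*}
\frac{\partial p}{\partial t} + \nabla \cdot (p f) &= 0
  && \text{(continuity equation)} \\
\frac{d}{dt}\, p(z(t), t)
  &= \frac{\partial p}{\partial t} + f \cdot \nabla p
   = -p \, \nabla \cdot f
  && \text{(total derivative along } \dot{z} = f \text{)} \\
\frac{d}{dt} \log p(z(t), t)
  &= -\nabla \cdot f
   = -\operatorname{tr}\!\left(\frac{\partial f}{\partial z}\right)
\end{align*}
```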

6 residual blocks, which are replaced by an ODESolve
What are the residual blocks precisely? Single layer + ReLU?

the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver and dependent on the initial state or input.
Adaptive computation time

\(\theta(t)\)


www.youtube.com www.youtube.com

1:00:00 scale separation as a way to go beyond Markov (local interactions) assumption

~32:00 What about the domain of the function being effectively lower dimensional, rather than a strong regularity assumption? That would also work, right? Could this be the case for images? (what's the dimensionality of the manifold of natural images?)
Nice. I like the idea of regularity <-> low-dimensional representation. I guess by that general definition, the above is a form of regularity.
He comments on this at 38:30


pdfs.semanticscholar.org pdfs.semanticscholar.org

it should be +? This typo carries through to later parts


arxiv.org arxiv.org

to have identical norm at the input layer,
This works because the marginal variance is independent of the orientation of the input vector. This can be proven by noting that a random Gaussian matrix doesn't change its distribution when rotated by an orthogonal matrix.
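A sketch of that argument, assuming \(W\) has i.i.d. \(\mathcal{N}(0, \sigma_w^2)\) entries (the symbols \(\sigma_w\) and \(Q\) are my notation, not the paper's):

```latex
\begin{align*}
(Wx)_i &= \sum_j W_{ij} x_j \sim \mathcal{N}\!\left(0, \; \sigma_w^2 \|x\|^2\right)
  && \text{(sum of independent Gaussians)} \\
WQ &\overset{d}{=} W \quad \text{for any orthogonal } Q
  && \text{(rotation invariance)} \\
Wx &\overset{d}{=} W(Qx) = \|x\| \, W e_1
  && \text{(choosing } Q \text{ with } Qx = \|x\| e_1 \text{)}
\end{align*}
```

So the distribution of the pre-activations depends on the input only through its norm.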


arxiv.org arxiv.org

(2)
Why is this the variational loss???

 Jun 2018

arxiv.org arxiv.org

Since GPs give exact marginal likelihood estimates, this kernel construction may allow principled hyperparameter selection, or nonlinearity design, e.g. by gradient ascent on the log likelihood w.r.t. the hyperparameters. Although this is not the focus of current work, we hope to return to this topic in follow-up work
Cool idea!

will be a sum of i.i.d. terms
Not true (for finite-width nets), I think. The terms are conditionally independent when conditioned on the value of the final hidden layer. If we consider the joint distribution over all the weights over all layers, then the terms won't be independent. However, in the limit of infinite widths, they do become independent. This is because their covariance is \(0\) and in the infinite width limit, they become Gaussian with that 0 covariance, and therefore independent.
Remember also that this is for a fixed input (and later analysis is for a finite collection of inputs)


web.math.princeton.edu web.math.princeton.edu

concentration
is this concentration phenomenon related to the asymptotic equipartition properties of large collections of independent (or weakly dependent) variables?

 May 2018

www.stats.ox.ac.uk www.stats.ox.ac.uk

−1
this means the functional inverse

φ
this should be the other phi?

 Apr 2018

www.lesswrong.com www.lesswrong.com

all the semantic consequences.
What are semantic consequences? Is there such a thing as syntactic consequences? Is the statement said here different for them? I don't think so right?

all possible subsets of a set?
Subsets are defined as properties in logic. Second-order logic goes beyond first-order logic by being able to quantify over properties (and thus over subsets).

 Jan 2018

static.googleusercontent.com static.googleusercontent.com

The matrix \(Y^l (Y^l)^T\) is the \(l\)th layer data covariance matrix. The distribution of its eigenvalues (or the singular values of \(Y^l\)) determine the extent to which the input signals become distorted or stretched as they propagate through the network
I would say more that it determines the appearance of correlations between the elements of Y.
Also, this isn't quite the covariance matrix, unless the mean of Y is zero?


arxiv.org arxiv.org

While local optima may not be a problem with deep neural networks in supervised learning where the correct answer is always given (Pascanu et al., 2014), the same is not true in reinforcement learning problems with sparse or deceptive rewards.
Reason why evolutionary methods are useful in RL, but not so much in SL. What about UL?

random search
Is this what they mean https://en.wikipedia.org/wiki/Random_search ?

 Dec 2017

www.people.fas.harvard.edu www.people.fas.harvard.edu

First, eigenvalues which are exactly zero (\(\lambda_i = 0\)) correspond to directions with no learning dynamics so that the parameters \(z_i\) will remain at \(z_i(0)\) indefinitely. These directions form a frozen subspace in which no learning occurs. Hence, if there are zero eigenvalues, weight initializations can have a lasting impact on generalization performance even after arbitrarily long training.
No training on singular directions.
However, sloppy directions, those with a small eigenvalue in the input correlation matrix, seem to cause overfitting. This is because in these directions there is little variability in the inputs, and yet there is variability in the output due to noise. To fit that output variability from such small input variability, we learn weights which are too large and don't generalize well.
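A toy 1-D illustration of the sloppy-direction effect (my own sketch with made-up numbers, not from the paper): the true weight is 0, the outputs are pure noise, and the least-squares weight scales like noise divided by input scale.

```python
def least_squares_weight(xs, ys):
    """1-D least squares without intercept: w = sum(x * y) / sum(x * x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


noise = [0.5, -0.5]  # hypothetical O(1) output noise; the true weight is 0

w_wide = least_squares_weight([1.0, -1.0], noise)      # large input variation
w_sloppy = least_squares_weight([0.01, -0.01], noise)  # tiny input variation

print(w_wide, w_sloppy)
```

With inputs of size 0.01 the fitted weight is about 100 times larger than with inputs of size 1, so a test input of ordinary size along that direction gets a wildly wrong prediction.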


arxiv.org arxiv.org

(1)
See here. This is the cross-entropy between the likelihood for $\theta$, $q(X|\theta)$, and $q(X|\theta_0)$. It is the same as the disorder-averaged energy of state $\theta$, that is, averaged over $X$. Note that because the energy is a mean over $x$, and each $x$ is independent, for the disorder-averaged energy we can just calculate the average for one $X$.
If we evaluate this at $\theta=\theta_0$, then we have the Shannon entropy of $q(X|\theta_0)$. $\theta_0$ is the disorder-averaged ground state (it's not hard to see that it is the $\theta$ which minimizes $H(\theta;\theta_0)$; this is just a property of cross-entropy).
The other one is the empirical version of this.
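For reference, the property of cross-entropy used above (a sketch, assuming $H(\theta;\theta_0)$ is defined as the expectation under $q(X|\theta_0)$ of $-\log q(X|\theta)$):

```latex
\begin{align*}
H(\theta; \theta_0)
  &= -\mathbb{E}_{X \sim q(\cdot|\theta_0)} \left[ \log q(X|\theta) \right] \\
  &= H(\theta_0; \theta_0)
     + D_{KL}\!\left( q(\cdot|\theta_0) \,\|\, q(\cdot|\theta) \right) \\
  &\geq H(\theta_0; \theta_0)
\end{align*}
```

Equality holds exactly when $q(\cdot|\theta) = q(\cdot|\theta_0)$, since the KL divergence is non-negative and vanishes only for identical distributions.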

(3)
This is equal to
$$\int_\Theta d\theta \, \bar{\omega}(\theta)\prod_{X\in x^N}q(X|\theta)$$
which is just the probability of observing the samples $x^N$.
But the interesting thing is that it can be written as a partition function where $\theta$ is the physical state vector as said after.

 Dec 2016

journal.frontiersin.org journal.frontiersin.org

node-perturbation-like correlations
What is this?
