681 Matching Annotations
  1. Dec 2018
    1. it does not distinguish between models which fit the training data equally well

      Well, if it is regularized (so there is an effective prior), then this isn't true!

  2. openreview.net
    1. on networks and problems at a practical scale

      Yeah. But not on the original networks, but on the compressed ones!

    2. L()

      This is a 0-1 loss. It can be applied to top-1 or top-5 by defining the loss appropriately!

    3. The pruned model achieves a validation accuracy of 60%.

      This is far from SOTA, right?

    4. A direct application of our naïve Occam bound yields non-vacuous bound on the test error of 98.6% (with 95% confidence)

      Ok, that is non-vacuous, but just, right?

    5. Our simple Occam bound requires only minimal assumptions, and can be directly applied to existing compressed networks

      But remember that the compression scheme technically shouldn't depend on the data, for the bound to be valid (as the PAC-Bayes prior depends on the compression scheme)

    6. We obtain a bound on the training error of 46%

      On the training error, or the test error?

    7. We prune the network using Dynamic Network Surgery (Guo et al., 2016), pruning all but 1.5% of the network weights.

      wow, they are very weight-compressible indeed

    1. Mean-field underestimates the variance

      Independence increases concentration

    2. \(\int \mathbb{E}_{q_{\neq i}(w_{\neq i})}[\log p(D, w)]\, dw\)

      You are normalizing the log not the distribution here!

  3. Nov 2018
  4. openreview.net
    1. \(|S|_c + |C|_c\)

      These are the sizes of the codes for encoding the vector of weight indices and discretized weight values, according to coding scheme \(c\)

    2. \(2^{64}\) bytes, which is \(2^{72}\) bits

      Isn't \(2^{64}\) bytes equal to \(2^{67}\) bits, as each byte is \(2^3\) bits?
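
      A quick sanity check of the conversion in this note, as a minimal sketch. The helper name `bytes_exponent_to_bits_exponent` is made up for illustration; the only fact used is that one byte is 8 bits.

```python
# One byte is 8 = 2**3 bits, so 2**k bytes should be 2**(k + 3) bits.
BITS_PER_BYTE = 8


def bytes_exponent_to_bits_exponent(k: int) -> int:
    """Return e such that 2**k bytes == 2**e bits (hypothetical helper)."""
    total_bits = (2 ** k) * BITS_PER_BYTE
    # For an exact power of two, bit_length() - 1 is its base-2 logarithm.
    return total_bits.bit_length() - 1


print(bytes_exponent_to_bits_exponent(64))
```

      Python integers are arbitrary precision, so nothing overflows at this size.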

    3. Theorem 4.1

      Bound on KL divergence for Universal prior, and point posterior

    4. As \(c\) is injective on \(\mathcal{H}_c\), we have that \(Z \leq 1\).

      By Kraft inequality, and \(m\) being a probability measure
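
      To make the Kraft inequality concrete, here is a small sketch (not the paper's construction): for a prefix-free binary code, the sum of \(2^{-\text{length}}\) over codewords is at most 1. The codewords below are invented examples, and `kraft_sum` / `is_prefix_free` are hypothetical helper names.

```python
from fractions import Fraction


def kraft_sum(codewords):
    """Sum of 2**(-len(w)) over binary codewords, computed exactly."""
    return sum(Fraction(1, 2 ** len(w)) for w in codewords)


def is_prefix_free(codewords):
    """True if no codeword is a prefix of another (or a duplicate of it).

    After lexicographic sorting, any prefix relation shows up between
    some pair of neighbours, so checking adjacent pairs is enough.
    """
    ws = sorted(codewords)
    return all(not b.startswith(a) for a, b in zip(ws, ws[1:]))


complete_code = ["0", "10", "110", "111"]   # made-up example, sum is exactly 1
partial_code = ["0", "10", "110"]           # made-up example, sum is below 1

for code in (complete_code, partial_code):
    print(code, is_prefix_free(code), kraft_sum(code))
```

      Weighting each term by a probability measure \(m\) can only keep such a sum at or below 1, which I think is the kind of argument the note is pointing at.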

    5. The generalization bound can be evaluated by compressing a trained network, measuring the effective compressed size, and substituting this value into the bound.

      An important question is: are the bounds valid for the compressed network only, or for the original one as well? (The latter isn't the case in the related work: https://arxiv.org/abs/1802.05296 )

    6. empirical evidence suggests that they fail to improve performance in practice (Wilson et al., 2017)

      Well. If this is the case it means that proving a generalization bound for one of these procedures "implies" a bound for the standard training procedure, as you are claiming that the standard one generalizes not (much) worse than these ones

    1. To understand the phenomenon described by Saxe and in the video at 43:00, we can think of this: low eigenvalues in XX^T correspond to directions with little variation in the input. However, because of the random fluctuation eta, the output could have an O(1) variation even for arbitrarily small input variation, which requires a large weight to fit and produces large generalization error.

      For $\alpha<1$, the probability of such low-variation directions in input space decreases, I think, as we get fewer directions with points overall (directions with no points/variation are ignored by the algorithm, which projects the weights onto the input subspace; these are the zero-eigenvalue part of the Marchenko-Pastur distribution).

    1. weak interactions between layers can cause the network to have high sharpness value.

      Why does interacting weakly cause high sharpness?

      I think the reason is that for layers which interact strongly, random perturbations tend to cause a smaller relative change in the output than for layers that interact weakly. And this smaller relative change in the outputs probably translates into a smaller absolute change in the loss...

      think about aligned eigenvectors and stuff

    2. depends only linearly on depth and does not have any exponential dependence, unlike other notions of generalization.

      Aren't VC dim bounds for NNs linear in depth also?

    3. forming a confusion set that includes samples with random labels.

      This is what's done in Wu et al https://arxiv.org/abs/1706.10239 also

    4. middle and right

      For a fixed expected sharpness, the KL (and thus the effective capacity/bound) increases less for true labels than for random labels. Also, in both cases, it increases monotonically.

    5. capacity is proportional to \(\left(\frac{C_M}{\text{margin}}\right)^n \mathrm{diam}_M(X)\).

      This is the "curse of dimensionality"!

      Probably can prove using covering numbers which bound RadComp (via Massart's lemma)

    6. simply bounding the Lipschitz constant of the network is not enough to get a reasonable capacity control

      but are the generalization error bounds for Lipschitz functions tight?

    7. However, the covering number of the input domain can be exponential in the input dimension and the capacity can still grow as

      Unless the inputs lie on a low-dimensional manifold, I guess

    8. Instead, to meaningfully compare norms of the network, we should explicitly take into account the scaling of the outputs of the network. One way this can be done, when the training error is indeed zero, is to consider the “margin” of the predictions in addition to the norms of the parameters.

      This has been used in several papers on generalization bounds for neural nets.

      I think the idea is like that for margin bounds for SVMs. You can't bound the RadComp using the 0-1 loss, as it's not Lipschitz (related to the above arguments). But you can bound it for the hinge loss (which is related to having a margin!), and then use the fact that the hinge loss upper bounds the 0-1 loss.

      The intuition I guess is that the previous troublesome cases with norm-based capacity can be solved by taking margin into account, as the extremes in both cases are reducing/increasing the margin (for weights going to zero/infinity).
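
      A small numerical sketch of the two facts used above, under my own conventions (not taken from the paper): the margin is \(y f(x)\) with labels in \(\{-1, +1\}\), the 0-1 loss counts a margin of exactly zero as an error, and the hinge loss is \(\max(0, 1 - \text{margin})\). The function names are hypothetical.

```python
def zero_one_loss(margin):
    """1 if the example is misclassified (margin <= 0), else 0."""
    return 1.0 if margin <= 0 else 0.0


def hinge_loss(margin):
    """Hinge loss as a function of the margin y * f(x)."""
    return max(0.0, 1.0 - margin)


def max_slope(loss, points):
    """Largest |loss(a) - loss(b)| / |a - b| over neighbouring grid points."""
    return max(abs(loss(a) - loss(b)) / abs(a - b)
               for a, b in zip(points, points[1:]))


margins = [m / 10 for m in range(-30, 31)]  # grid on [-3, 3], step 0.1

# The hinge loss sits above the 0-1 loss everywhere on the grid.
print(all(hinge_loss(m) >= zero_one_loss(m) for m in margins))
# The hinge slope stays around 1; the 0-1 slope is 1 / step at the jump.
print(max_slope(hinge_loss, margins), max_slope(zero_one_loss, margins))
```

      Refining the grid leaves the hinge slope at about 1 but makes the 0-1 slope blow up, which is the sense in which the 0-1 loss has no Lipschitz constant to plug into a contraction argument.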

    9. proportional to \(\prod_{i=1}^{d} \|W_i\|_{1,\infty}^2\), where \(\|W_i\|_{1,\infty}\) is the maximum over hidden units in layer \(i\) of the \(\ell_1\) norm of incoming weights to the hidden unit [4].

      This comes from the Rademacher complexity of the neural network. See here: http://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/18/material/lecture3.pdf#page=4 for a derivation
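
      A minimal sketch of how the quantity in the highlight could be computed, assuming each weight matrix is stored as a list of rows with one row per hidden unit (so a row holds that unit's incoming weights). The two-layer weights are invented, and `l1_inf_norm` / `norm_product` are hypothetical names.

```python
def l1_inf_norm(W):
    """Max over hidden units (rows) of the l1 norm of the incoming weights."""
    return max(sum(abs(w) for w in row) for row in W)


def norm_product(layers):
    """Product over layers of the squared (1, inf) norm."""
    result = 1.0
    for W in layers:
        result *= l1_inf_norm(W) ** 2
    return result


# Made-up two-layer network: 2 hidden units, then 1 output unit.
layers = [
    [[0.5, -0.5], [1.0, 0.0]],
    [[2.0, -1.0]],
]

# Rescale layer 1 by 4 and layer 2 by 1/4.
rescaled = [
    [[4 * w for w in row] for row in layers[0]],
    [[w / 4 for w in row] for row in layers[1]],
]

print(norm_product(layers), norm_product(rescaled))
```

      The product is unchanged when one layer is scaled up and the next is scaled down by the same factor, which is the kind of rescaling the passage on output scaling is concerned with.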

    10. One can then ensure generalization of a learned hypothesis \(h\) in terms of the capacity of \(\mathcal{H}_{\mathcal{M},\mathcal{M}(h)}\)

      They just state this, but they should maybe cite some work on nonuniform learnability, SRM, MDL, so that people unfamiliar with it can see why this is!

    11. any measure which is uniform across all functions representable by a given architecture, is not sufficient to explain the generalization ability of neural networks trained in practice. For linear models, norms and margin-based measures, and not the number of parameters, are commonly used for capacity control [4,8,24].

      These are all SRM based.

      They depend (most often implicitly, via the training data) on the data distribution/target function, and (also implicitly) on the algorithm. If you get a good bound with SRM, it's most likely because your algorithm is biased towards the right kind of solution (kind basically corresponds to the classes used in SRM). In other words, a bad algorithm (relative to the target function) will be very unlikely to produce a good bound. On the other hand, a good algorithm may produce a bad bound if the prior is chosen badly, so SRM bounds may not be tight in general.

  5. Oct 2018
    1. 2

      shouldn't there be a \(\lambda\) here too?

      And in fact another \(\lambda\) from Rad(A_1)?

    2. =

      this should be \(\leq\),

      though I would have used \(\mathcal{L} \cup -\mathcal{L}\) instead

    3. 0

      apostrophe shouldn't be there

    4. h


    5. =

      missing parenthesis

    6. l


    1. \(\rho(U) = \int_S P(U \mid v)\, d\rho(v)\).

      ? How does this define what \(\rho\) is?

    1. The largest last width function seems to converge slightly faster than the largest last width function


    2. more


  6. Aug 2018
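    1. \(\left\langle \frac{dC_i}{d\omega} \frac{dC_j}{d\omega} \right\rangle = \frac{1}{N}\left(\frac{dC}{d\omega}\right)^2 + F(\omega)_{ij}\)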

      I don't get this formula. What are C_i and C_j supposed to mean? Or rather, where is the randomness over which we average?

    1. is not surprising and there are suitable options, even without GANs

      Well, what is nontrivial about this is generalizing (as usual), how do you generate novel samples, what are samples that "look like" the rest of the data. How do you define that?

  7. openreview.net
    1. The alignment problem is how to make sure that the input image in A is mapped (via image generation) to an analog image in B, where this analogy is not defined by training pairs or in any other explicit way (see Sec. 2).

      Well, you are attempting to define it as "the alignment given by the lowest complexity meme".

      I think this doesn't address the question, which is more generally addressed to the whole field of cross-domain mapping: why aren't the new neural network approaches compared with previous approaches to unlabelled cross-domain mapping, which can be formalized as different forms of alignment between sets of points? I think the main difference here is that in those approaches you typically know the whole target domain, while here we don't quite know it; we just have a few samples from it. But I feel this is a surmountable problem.

  8. Jul 2018
    1. The position, \(C\), of the camera expressed in world coordinates is \(C = -R^{-1}T = -R^{T}T\) (since \(R\) is a rotation matrix).

      The minus is because we are translating the world in the opposite way that the camera was translated to get it right relative to the camera, while putting the camera at the origin. The R^-1 is because we first rotate by R and then translate by T, so to get the original translation vector we need to undo the rotation to T
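
      A minimal numerical sketch of the formula, assuming the usual convention that a world point \(X\) maps to camera coordinates as \(RX + T\). The extrinsics below are invented, and the helper names are hypothetical. The check is that the camera position \(C\) lands on the origin of the camera frame.

```python
def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def camera_center(R, T):
    """World-frame camera position C = -R^T T (valid when R is a rotation)."""
    return [-c for c in matvec(transpose(R), T)]


# Made-up extrinsics: 90 degree rotation about the z axis, plus a translation.
R = [[0, -1, 0],
     [1,  0, 0],
     [0,  0, 1]]
T = [1, 2, 3]

C = camera_center(R, T)
# Mapping C into the camera frame, R C + T, should give the origin.
in_camera_frame = [a + b for a, b in zip(matvec(R, C), T)]
print(C, in_camera_frame)
```

      Using the transpose instead of a general inverse relies on \(R\) being orthogonal, which is the "since \(R\) is a rotation matrix" part of the quote.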

    2. \(u_0\) and \(v_0\) represent the principal point, which would be ideally in the centre of the image.

      It is multiplied by the z component of the point because we want to translate by a fixed amount in the u-v plane. However, after being acted on by the camera matrix, the points are in homogeneous coordinates, scaled by the z component!
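
      A small sketch of this, assuming a simplified intrinsic matrix with a single focal length `f`, no skew, and a point already expressed in camera coordinates (all of these are my simplifications, and the numbers are invented). In homogeneous coordinates the principal point enters as `u0 * z`, and only after dividing by `z` does it become a fixed pixel offset.

```python
def project(point, f, u0, v0):
    """Apply K = [[f, 0, u0], [0, f, v0], [0, 0, 1]] to a camera-frame point.

    Returns the homogeneous image coordinates and the pixel coordinates
    obtained after dividing by the last component.
    """
    x, y, z = point
    homogeneous = (f * x + u0 * z, f * y + v0 * z, z)
    pixel = (homogeneous[0] / z, homogeneous[1] / z)
    return homogeneous, pixel


# Made-up intrinsics and point.
homogeneous, pixel = project((1.0, 2.0, 4.0), f=100.0, u0=320.0, v0=240.0)
print(homogeneous)  # the u0 and v0 terms are scaled by z here
print(pixel)        # after division they are plain offsets of u0 and v0
```

      The pixel coordinates come out as `f * x / z + u0` and `f * y / z + v0`, so the shift by the principal point is the same for every depth.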

    1. \(\frac{dJ}{d\theta} = \frac{dL}{d\theta}\)


      I see, I think the derivative wrt \(\theta\) here is supposed to be while z satisfies the constraints in (32)...

    2. restricted architectures which partition the hidden units. Our approach does not have these restrictions

      But it has whatever restriction being an ODE imposes. How does the expressivity and learnability of the ODEnet compare with ResNets?

    3. Minibatching

      Can you not just compute the gradient for each input, and average them?

    4. Second


    5. \(\int_{t_{start}}^{t_{end}} \lambda(z(t))\, dt\)

      This second term comes from the fact that we want the probability of observations at times \(t_1, ..., t_N\), and at no other time between \(t_{start}\) and \(t_{end}\)

    6. layers of only a single hidden unit

      corresponding to weight matrices of rank 1

    7. Instantaneous Change of Variables

      This is just the differential form of the continuity equation
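
      Spelling this out as a short derivation sketch, assuming the dynamics are written \(dz/dt = f(z, t)\) and \(p(z, t)\) is the density transported by them (my notation): expanding the continuity equation and following a trajectory gives the log-density formula.

```latex
\begin{align*}
% Continuity equation for a density p(z, t) carried by the flow dz/dt = f(z, t)
\frac{\partial p}{\partial t} + \nabla_z \cdot (p\, f) &= 0 \\
% Total derivative along a trajectory z(t), using the expanded divergence
\frac{d\, p(z(t), t)}{dt}
  = \frac{\partial p}{\partial t} + f \cdot \nabla_z p
  &= -\, p \, \nabla_z \cdot f \\
% Divide by p
\frac{d \log p(z(t), t)}{dt}
  = -\nabla_z \cdot f
  &= -\operatorname{tr}\!\left(\frac{\partial f}{\partial z}\right)
\end{align*}
```

      The last line uses the fact that the divergence of \(f\) is the trace of its Jacobian.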

    8. 6 residual blocks, which are replaced by an ODESolve

      What are the residual blocks precisely? Single layer + ReLU?

    9. the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver and dependent on the initial state or input.

      Adaptive computation time



    1. 1:00:00 scale separation as a way to go beyond Markov (local interactions) assumption

    2. ~32:00 What about the domain of the function being effectively lower dimensional, rather than a strong regularity assumption? That would also work, right? Could this be the case for images? (what's the dimensionality of the manifold of natural images?)

      Nice. I like the idea of regularity <> low dimensional representation. I guess by that general definition, the above is a form of regularity..

      He comments about this on 38:30

    1. to have identical norm at the input layer,

      This works because the marginal variance is independent of the orientation of the input vector. This can be proven by noting that a random Gaussian matrix doesn't change its distribution when rotated by an orthogonal matrix.

  9. Jun 2018
    1. Since GPs give exact marginal likelihood estimates, this kernel construction may allow principled hyperparameter selection, or nonlinearity design, e.g. by gradient ascent on the log likelihood w.r.t. the hyperparameters. Although this is not the focus of current work, we hope to return to this topic in follow-up work

      Cool idea!

    2. will be a sum of i.i.d. terms

      Not true (for finite-width nets), I think. The terms are conditionally independent when conditioned on the value of the final hidden layer. If we consider the joint distribution over all the weights over all layers, then the terms won't be independent. However, in the limit of infinite widths, they do become independent. This is because their covariance is \(0\), and in the infinite-width limit they become Gaussian with that 0 covariance, and are therefore independent.

      Remember also that this is for a fixed input (and later analysis is for a finite collection of inputs)

  10. web.math.princeton.edu
    1. concentration

      is this concentration phenomenon related to the asymptotic equipartition properties of large collections of independent (or weakly dependent) variables?

  11. May 2018
  12. Apr 2018
    1. all the semantic consequences.

      What are semantic consequences? Is there such a thing as syntactic consequences? Is the statement said here different for them? I don't think so right?

    2. all possible subsets of a set?

      subsets are defined as properties in Logic. Second-order logic goes above first-order logic by being able to quantify over properties (and thus over subsets)

  13. Jan 2018
  14. static.googleusercontent.com
    1. The matrix \(Y^l (Y^l)^T\) is the \(l\)th-layer data covariance matrix. The distribution of its eigenvalues (or the singular values of \(Y^l\)) determine the extent to which the input signals become distorted or stretched as they propagate through the network

      I would say more that it determines the appearance of correlations between the elements of Y.

      Also, this isn't quite the covariance matrix, unless the mean of Y is zero?

    1. While local optima may not be a problem with deep neural networks in supervised learning where the correct answer is always given (Pascanu et al., 2014), the same is not true in reinforcement learning problems with sparse or deceptive rewards.

      Reason why evolutionary methods are useful in RL, but not so much in SL. What about UL?

    2. random search
  15. Dec 2017
    1. First, eigenvalues which are exactly zero (\(\lambda_i = 0\)) correspond to directions with no learning dynamics so that the parameters \(z_i\) will remain at \(z_i(0)\) indefinitely. These directions form a frozen subspace in which no learning occurs. Hence, if there are zero eigenvalues, weight initializations can have a lasting impact on generalization performance even after arbitrarily long training.

      No training on singular directions.

      However, it seems that sloppy directions, those with small eigenvalues in the input correlation matrix, cause overfitting. This is because in these directions there is small variability in the inputs, and yet there is variability in the output due to noise. To fit this output variability from such small input variability, we learn large weights, which are in fact too large, and don't generalize well.
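
      A toy illustration of this mechanism, as a sketch under strong simplifications of my own: a single input direction, a true weight of zero, and outputs that are pure noise. The inputs and noise values are invented. Shrinking the input scale leaves the noise unchanged, so the least-squares weight grows like 1/scale.

```python
def least_squares_weight(xs, ys):
    """1-D least-squares fit of y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


unit_inputs = [1.0, -1.0, 0.5]   # made-up inputs along one direction
noise = [0.3, -0.2, 0.1]         # made-up O(1) output noise; true weight is 0


def fitted_weight(scale):
    """Weight fitted when the inputs along this direction are scaled down."""
    xs = [scale * u for u in unit_inputs]
    return least_squares_weight(xs, noise)


for scale in (1.0, 0.1, 0.01):
    print(scale, fitted_weight(scale))
```

      The true weight here is 0, so everything the fit learns along this direction is noise, and it gets amplified as the input variation shrinks.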

    1. (1)

      See here. This is the cross-entropy between the likelihood for $\theta$, $q(X|\theta)$, and $q(X|\theta_0)$. It is the same as the disorder-averaged energy of state $\theta$, that is averaged over $X$. Note that because the energy is a mean over $x$, and each $x$ is independent, then for the disorder-averaged energy, we can just calculate the average for one $X$.

      If we evaluate this at $\theta=\theta_0$, then we have the Shannon entropy of $q(X|\theta_0)$. $\theta_0$ is the disorder-averaged ground state (it's not hard to see that it is the $\theta$ which minimizes $H(\theta;\theta_0)$; this is just a property of cross-entropy).

      The other one is the empirical version of this.

    2. (3)

      This is equal to

      $$\int_\Theta d\theta \bar{\omega}(\theta)\prod_{X\in x^N}q(X|\theta)$$

      which is just the probability of observing the samples $x^N$.

      But the interesting thing is that it can be written as a partition function where $\theta$ is the physical state vector as said after.
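
      A sketch of why it has partition-function form, using notation of my own: $\hat{H}_N(\theta)$ below is my symbol for the empirical cross-entropy of the $N$ samples, and reading $N$ as an inverse temperature is my interpretation, not necessarily the paper's convention.

```latex
\begin{align*}
\prod_{X \in x^N} q(X|\theta)
  &= \exp\Big( \sum_{X \in x^N} \log q(X|\theta) \Big)
   = \exp\big( -N\, \hat{H}_N(\theta) \big), \\
\hat{H}_N(\theta) &:= -\frac{1}{N} \sum_{X \in x^N} \log q(X|\theta), \\
\int_\Theta d\theta\, \bar{\omega}(\theta) \prod_{X \in x^N} q(X|\theta)
  &= \int_\Theta d\theta\, \bar{\omega}(\theta)\, e^{-N \hat{H}_N(\theta)} .
\end{align*}
```

      So $\theta$ plays the role of the state, $\hat{H}_N$ the energy, and $\bar{\omega}$ the measure over states.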

  16. Dec 2016