154 Matching Annotations
  1. Last 7 days
    1. D∗[T∗μ,T(Zm0)]

      I see this as one of the main innovations of the paper. This term is a discrepancy between the sample and the true distribution \(\mu\). It would allow \(Z_m\) to be sampled from a different distribution, for instance, giving bounds that account for distributional drift.

    2. V[f]

      This offers a measure of the variance (in a non-statistical sense) of the loss of the learned function over the instance space.

    3. Thus, in classical bounds including data-dependent ones, as \(\mathcal{H}\) gets larger and more complex, the bounds tend to become more pessimistic for the actual instance \(\hat{y}_{A(S_m)}\) (learned with the actual instance \(S_m\)), which is avoided in Theorem 1.

      Sure, but that is also avoided in some statistical learning approaches, like Structural Risk Minimization, PAC-Bayes, and the luckiness framework, which you cite!

  2. May 2019
    1. the total computational cost is similar to that of single-head attention with full dimensionality

      smaller?

    2. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

      So if I understand correctly, with a single head, different parts of the d_model-dimensional query vector may "want" to attend to different parts of the key, but because the weight of the values is computed by summing over all elements in the dot product, it would just average these local weights. Separating into different heads allows the model to attend to different value vectors for different "reasons".
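
      A minimal numpy sketch of this intuition (the toy sizes, and the naive slicing of channels into heads, are my own assumptions; the paper uses learned per-head projections \(W_i^Q, W_i^K, W_i^V\)):

          import numpy as np

          def softmax(x, axis=-1):
              e = np.exp(x - x.max(axis=axis, keepdims=True))
              return e / e.sum(axis=axis, keepdims=True)

          def attention(Q, K, V):
              # scaled dot-product attention: one weight pattern per query position
              return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

          rng = np.random.default_rng(0)
          n, d_model, h = 5, 8, 2                  # toy sizes (assumed)
          Q, K, V = rng.normal(size=(3, n, d_model))

          # single head: one attention pattern has to serve every part of the query vector
          single = attention(Q, K, V)

          # multiple heads: each d_model/h slice gets its own attention pattern, so
          # different subspaces can attend to different positions for different "reasons"
          heads = [attention(Q[:, i::h], K[:, i::h], V[:, i::h]) for i in range(h)]
          multi = np.concatenate(heads, axis=-1)   # the paper then applies W^O
          print(single.shape, multi.shape)         # both (5, 8)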

  3. Apr 2019
    1. the probability

      log probability

    2. say

      ~~

    3. too weak

      for Kmax

    4. tighter

      in relative terms

    5. o generatex

      given the map \(f\)

    6. rst e

      first give x, and then enumerate ... identifying all inputs p mapping to x, namely \(f^{-1}(x)\)

    7. lays a key rolein

      is the main component in

    8. ,

      :

    9. function

      of

    10. \(N_I = 2^n\).

      for binary strings

    11. derived

      suggested

    12. Since many processes in science and engineering can be described as input-output maps that are not UTMs

      Perhaps say "This suggests that, even though many maps are not UTMs, the principle that low K are high P should hold widely"

      because it is not because they are not UTMs, but it is in spite of them not being UTMs, I would argue.

    13. , a classic categorization of machines by their computational power,

      in parentheses

    1. If we add a periodic and a linear kernel, the global trend of the linear kernel is incorporated into the combined kernel.

      Remember that kernel functions with one of their arguments evaluated are members of the reproducing kernel Hilbert space to which all the functions supported by a particular Gaussian process belong.

      Therefore adding kernels amounts to adding functions from these two spaces. That is why the resulting functions behave like this when combining kernels!
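
      A small numpy sketch of what the sum does to samples (the kernel forms and hyperparameters are my own toy choices):

          import numpy as np

          def linear_kernel(x, y):
              return np.outer(x, y)

          def periodic_kernel(x, y, period=1.0, ell=0.5):
              d = np.abs(x[:, None] - y[None, :])
              return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell ** 2)

          x = np.linspace(0.0, 5.0, 200)
          K = linear_kernel(x, x) + periodic_kernel(x, x)

          # a draw from GP(0, K_lin + K_per) is, in distribution, a draw from GP(0, K_lin)
          # plus an independent draw from GP(0, K_per): a global linear trend plus a periodic wiggle
          rng = np.random.default_rng(0)
          f = rng.multivariate_normal(np.zeros_like(x), K + 1e-8 * np.eye(len(x)), size=3)
          print(f.shape)  # (3, 200): three sampled functions on the grid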

    1. concave in both arguments. Jensen’s inequality (\(f(x,y)\) concave \(\Rightarrow \mathbb{E}f(x,y) \geq f(\mathbb{E}x, \mathbb{E}y)\))

      Actually it's convex

  4. Mar 2019
    1. A stochastic error rate, \(\hat{Q}(\tilde{w},\mu)_S = \mathbb{E}_{\tilde{x},y\sim S}\,\bar{F}(\mu\gamma(\tilde{x},y))\)

      Remember that the w sampled from the "posterior" isn't necessarily parallel to the original w, so the stochastic classification rate isn't simply F(sign(margin)) but something more complicated; see the proof.

    2. Since the PAC-Bayes bound is (almost) a generalization of the Occam’s Razor bound, the tightness result for Occam’s Razor also applies to PAC-Bayes bounds.

      Oh, c'mon :PP You are just showing that PAC-Bayes is tight as a statement for all Q and for a particular P. That is, you are saying that if we only let the bound depend on the quantities it can depend on (namely the KL divergence between Q and P, delta, etc.), then it can't be made tighter, because it would then break for the particular choice of D, hypothesis class, and Q in Theorem 4.4 above, and for any value of the KL in that case.

      --> What I mean is this: we say the bound is a function f(KL, delta, m, etc). Theorem 4.4 shows that there is a choice of learning problem and algorithm such that these arguments could be anything, and the bound is tight. Therefore, we can't lower this bound without it failing. It is tight in that sense. However, it may not be tight if we allow the bound to depend on other quantities!

    3. The lower bound theorem implies that we can not improve an Occam’s Razor like statement.

      Yeah, as in: if it only depends on \(P(c)\) and the other quantities expressed there, and does not depend on the algorithm, so it is a general function that takes \(P(c)\), \(\delta\), etc., but the same function for any algorithm. Then yes. And this is what they mean here.

    4. For all P(c), m, k, δ there exists a learning problem D and algorithm such that

      Depends what you mean by "for all \(P(c)\)": are you fixing the hypothesis class or what? Because your proof assumes a particular type of hypothesis class... For a \(P(c)\) with support over a hypothesis class where the union bound is not tight, the bound is not tight any more.

    5. The distribution \(D\) can be drawn by first selecting \(Y\) with a single unbiased coin flip, and then choosing the \(i\)th component of the vector \(X\) independently, \(\Pr((X_1, ..., X_n)|Y) = \prod_{i=1}^{n}\Pr(X_i|Y)\). The individual components are chosen so \(\Pr(X_i = Y|Y) = \text{Bin}(m, k, \delta P(c))\). The classifiers we consider just use one feature to make their classification: \(c_i(x) = x_i\). The true error of these classifiers is given by: \(c_D = \text{Bin}(m, k, \delta P(c))\)

      Ok, so this has proven that the Occam bound is tight for this particular \(D\) and this particular hypothesis class, which is quite special, because it has the property that the union bound becomes tight. But that is a very special property of this hypothesis class (or more generally, of this choice of support for \(P\), right??)

    6. if any classifier has a too-small train error, thenthe classifier with minimal train error must have a too-small train error

      this is because having "too-small train error" here means (is equivalent to) having train error smaller than \(k\), so that the classifier with smallest train error also has it smaller than \(k\) and therefore it also has too small train error.

    7. The differences between the agnostic and realizable case are fundamentally related to the decrease in the variance of a binomial as the bias (i.e. true error) approaches 0.

      If we observe a zero empirical error, then the probability of observing that decreases very quickly with increasing true error. So I wouldn't really say it's the decrease in variance of the binomial (one could imagine distributions where the variance doesn't decrease as you go to 0, but which still have the property that makes the realizable case error rate smaller in the same way as here)

    8. cS

      Remember they define training error as error count

    9. \(\Pr_{S\sim D^m}\left(c_D \leq \text{Bin}(m, \hat{c}_S, \delta)\right) \geq 1-\delta\).

      Btw, this approach only works because \(c_D\) is a one dimensional random variable, so that {the set of \(k\) such that \(Bin(m,k,p)<\delta\)} equals {the set of \(k\) less than or equal to {the maximum \(k\) such that \(Bin(m,k,p)<\delta\)}}.

      This happens because the different ("confidence interval") sets of \(k\) defining \(Bin(m,k,p)\) (i.e. all \(k\) smaller than or equal to some \(k\)) are all nested. In more general situations with other confidence intervals defined (like for 2D cumulative distributions) this may not happen.
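
      A sketch of the inversion this notation denotes, using scipy (names are mine): \(\text{Bin}(m, k, \delta)\) is the largest true error \(p\) that would still produce \(k\) or fewer mistakes out of \(m\) with probability at least \(\delta\), and the monotonicity of the binomial CDF in \(p\) is exactly what makes the confidence sets nested, as noted above.

          from scipy.stats import binom
          from scipy.optimize import brentq

          def bin_inversion(m, k, delta):
              """Largest p with P(Binomial(m, p) <= k) >= delta: the test-set upper bound on c_D."""
              if k >= m:          # degenerate case: no value of p is excluded
                  return 1.0
              # binom.cdf(k, m, p) decreases in p, so the non-excluded p form an interval [0, p*]
              return brentq(lambda p: binom.cdf(k, m, p) - delta, 0.0, 1.0)

          print(bin_inversion(m=100, k=5, delta=0.05))   # ~0.10: bound on the true error at 95% confidence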

    10. δ∈(0,1]

      No,

      Because Lemma 3.6 requires \(\frac{k}{m}<p\), we need \(\delta\) to be small enough that {the \(c_D\) that results from solving the Chernoff bound = \(\delta\)} is larger than \(\frac{k}{m}\)

    11. ε

      \(c_D\)

    12. The test set bound is, essentially, perfectly tight. For any classifier with a sufficiently large true error, the bound is violated exactly a δ portion of the time

      And any tighter bound would be violated a larger portion of the time, at least for some value of the true error (note that for fixed or constrained true error one can have tighter bounds).

      The need for a sufficiently large true error is because for instance for true error zero, the bound is never violated.

      But still the fact of it being perfectly tight is because of what I said above.

    13. more

      less

    14. All of the results presented here fall in the realm of classical statistics. In particular, all randomizations are over draws of the data, and our results have the form of confidence intervals.

      So not Bayesian statistics

    1. The normalizing term of eq. (3.53), \(Z_{EP} = q(y|X)\), is the EP algorithm’s approximation to the normalizing term \(Z\) from eq. (3.48) and eq. (3.49)

      so the EP approximation to the marginal likelihood?

    2. in the EP framework we approximate the likelihood by a local likelihood approximation in the form of an un-normalized Gaussian function in the latent variable \(f_i\)

      how is the EP approximation good when the probit function as likelihood is shaped so differently from the unnormalized Gaussian which is used to approximate it!?

    1. An adversarial network is used to define a loss function which sidesteps the need to explicitly evaluate or approximate \(p_x(\cdot)\)

      Adversarial training as an alternative to maximum likelihood training!

    1. Dirichlet prior over these parameters

      You'd need to sum over all \(y\) for that?

    2. \(q(z|x)q(y|x)\),

      Ehm, in the line below you have \(q_\phi(z|y,x)\) not \(q_\phi(z|x)\)

    1. Inheriting from the properties of stochastic processes, we assume that \(Q\) is invariant to permutations of \(O\) and \(T\).

      In principle, just like for consistency, they could choose not to enforce permutation invariance (e.g. using just an RNN), and rely on the model learning it from data.

      However, the contribution they are making (here and in the Neural Processes paper) is a way of putting the structure that Bayesian inference on stochastic processes should satisfy explicitly into the model, so that it learns well.

      We are putting prior knowledge about how to use prior knowledge
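
      A minimal sketch of baking that invariance in (a CNP-style encoder with mean aggregation; the sizes and weights are toy assumptions):

          import numpy as np

          rng = np.random.default_rng(0)
          W1 = rng.normal(size=(2, 16))   # toy per-pair encoder weights (assumed)
          W2 = rng.normal(size=(16, 8))

          def encode_context(xs, ys):
              """Embed each (x, y) context pair separately, then aggregate with a mean.
              The mean is symmetric, so the representation is permutation invariant by
              construction, instead of something the model must learn from data (as an
              RNN run over the context pairs would have to)."""
              pairs = np.stack([xs, ys], axis=-1)   # (n_context, 2)
              h = np.tanh(pairs @ W1) @ W2          # per-pair embeddings
              return h.mean(axis=0)                 # order-independent summary

          xs, ys = rng.normal(size=5), rng.normal(size=5)
          perm = rng.permutation(5)
          print(np.allclose(encode_context(xs, ys), encode_context(xs[perm], ys[perm])))  # True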

    2. without imposing consistency

      I think in the context of CNPs consistency would imply things like

      \(\sum_{y_1} p(y_1)\,p(y_2|y_1) = p(y_2)\)

      These things are not automatically guaranteed by the framework used here. The data should constrain the network to satisfy these approximately

    1. Since the decoder \(g\) is non-linear, we can use amortised variational inference to learn it.

      So, the nice thing about NPs is that if you did the inference of \(z\) exactly it would be a stochastic process exactly (unlike CNPs that don't have a simple interpretation/approach that guarantees being an exact stochastic process). However, because of not doing the inference of \(z\) exactly, consistency of the resulting marginal distributions is not exactly guaranteed.

    2. non-trivial ‘kernels’

      or even stochastic processes that aren't GPs and may not be describable by kernels

    3. Given that this is a neural approximation the curves will sometimes only approach the observation points as opposed to going through them as is the case for GPs.

      ?? For GPs with Gaussian likelihood the functions don't pass exactly through the observation points, just near, as in here?

    1. \(h^1, \dots, h^k\)

      I think he meant \(h^{j_1},...,h^{j_k}\)

    2. \((W'^{\top}Wx)_i = 0\) where \(W'\) is an iid copy of \(W\)

      So we sample the A transposes independently from the As, I see. That is the gradient independence assumption.

      Btw, is this related to some stuff I saw Hinton talk about, showing that you could do backprop with weights different from those used in the forward prop, and they couldn't understand why? Is this the explanation? Could this, as they suggested, be related to a way the brain could be doing backprop?

    3. The sampling of input G-vars models the distribution of the first hidden layer across multiple inputs, sampling of the first layer parameters (see Appendix B.1 for an example), and/or sampling of bias vectors.

      "upon sampling of the first layer parameters" or "sampling the first layer parameters", I guess he meant?

      oh ok , he kinda means sampling of the last layer parameters, but when we do backprop, it's the first layer..

    4. In general, the multiple \(\{v^i\}\) allow for multiple NN outputs.

      Ah ok, the \(v^i\) are like the weights of the outputs in the loss function

      NO. each \(v_i\) is a vector, the vector of weights which when multiplied by some \(\mathtt{g}\) or \(\mathtt{h}\) give a single real-valued output labelled \(i\). This we call a linear readout layer.

      When we backpropagate the loss we would multiply each of these vectors by the derivative of the loss w.r.t. each of these outputs.

      Can this be done in this formalism?

    5. batchnorm, whose Jacobian has a singularity on a 1-dimensional affine subspace (and in particular, at the origin).

      so that the \(\mathtt{f^l}\) are not polynomially bounded

    6. am

      again should be \(a_{j_i}^l\)

    7. m

      don't need \(m\) here?

    8. has enough power to express the backpropagation of \(f\) and compute the gradients with respect to hidden states.

      Note that the Nonlin functions in the appended lines are allowed to depend on the previous \(\mathtt{g}\)s, which is necessary to compute the backpropagation of the gradient through the nonlinearity. The oddness works for backprop because the Nonlin is just linear w.r.t. the \(v\)s

    9. Theorem 4.3.

      Generalized law of large numbers extended to a case where the i.i.d. r.v.s have a distribution that changes as we increase the number of samples (although the way it changes is constrained).

      EDIT: see comment

      The reason this is nontrivial is that the \(\phi(g_i^{\frak{c}t})\) have a distribution that changes with \(t\), although it approaches a limit. So it is different from the standard law of large numbers.

      In words, here we are saying that the empirical averages as we increase \(t\) approach (almost surely) the expected value of \(Z\) distributed according to the limit distribution of \(\phi(g_i^{\frak{c}t})\)

    10. the \(L\) at which the exponentiating effect of \(W^L\) kicks in increases with \(n\)

      the \(L\) at which the exponentiating effect kicks in increases like \(\sqrt{n}\) it seems according to these arguments. Nice

      seems related to things in here http://papers.nips.cc/paper/7339-which-neural-net-architectures-give-rise-to-exploding-and-vanishing-gradients

    11. \(z \sim \mathcal{N}(\mu^{\frak{c}}, K^{\frak{c}})\)

      \(\mu^{\frak{c}}\) is the vector of means of the tuple of \(i\)th components of the vectors in the argument of either \(\mathtt{f}^a\) or \(\mathtt{f}^b\), and \(K^{\frak{c}}\) is the covariance matrix of this tuple/vector.

      Note (also useful for the above definition) that we are defining means and covariances for any individual component of the vectors \(\mathtt{g}\). That is, we are describing the distribution of \(\mathtt{g}^{\frak{c}}_i=(\mathtt{g}^l_i)_{g^l\in\frak{c}}\) for any \(i\). Different tuples of components are independently distributed, as explained in a comment in the beginning of the Setup section above

    12. \(\mu^{\frak{c}}(g^l)\)

      This is defined for \(g^l \in \frak{c}\)

    13. \(K^{\frak{c}}(g^l, g^m)\)

      This is defined for \(g^l, g^m \in \frak{c}\)

    14. amji

      should be \(a^l_{j_i}\)

    15. \(g^{\frak{c}_{\mathrm{in}}t}_i \sim \mathcal{N}(\mu^{\frak{c}_{\mathrm{in}}}, K^{\frak{c}_{\mathrm{in}}})\) for each \(i, j\)

      Note the subscript \(i\), so this is the distribution for the tuple of the \(i\)th components of all the Invecs in \(\frak{c}\).

      We therefore allow the \(i\)th component of two different Invecs to be correlated (useful to model the distribution of the first hidden layer, as per the usual NNGP analysis). But we don't allow different components of Invecs, \(g_i^{lt}\) and \(g_j^{mt}\) for \(i\neq j\), to be correlated.

      There is a typo, it should say for each \(i\).

    16. sequence (in \(t \in \mathbb{N}\))

      what does this mean? Ah \(t\) is like a "time", so it is the index of the sequence. \(lt\) just represents two indices (not their product!)

  5. Feb 2019
    1. that such functions should be simple with respect to all the measures of complexity above,

      Why?? Do you show that all functions that have the property of being insensitive to large changes in the inputs have high probability? If so, then say it; if not, then not all such functions need be simple w.r.t. all the measures of complexity that are found to correlate with probability for a random NN

    2. Schmidhuber, 1997, Dingle et al., 2018].

      Dingle et al explore bias towards low complexity, but not the relation to generalization.

    3. our result implies that the probability of this function is exponentially small in \(n\).

      Although I think it is exponentially small in \(n\), why is that implied by your result? All that we know from what we've been told up to this point in the paper is that its probability has to be smaller than \(\sqrt{\log(n)/n}\), or, using symmetry, we can divide this by \(n\)

    4. ne,

      for \(n>> 1\), the probability of all of the Hamming-distance-1 neighbours giving the same result goes to 0. So the weight of the term corresponding to Hamming distance 1 goes to 1, in the average

    5. wo

      Approximately a geometric distribution with success probability \(p=1/2\), for large \(n\), and the mean of a geometric distribution is \(1/p = 2\)

  6. Jan 2019
    1. (3)

      This is a Schmitt trigger :D, see here: https://youtu.be/Iu21laCEsVs?t=2m31s

    2. Inspired by the method of Lagrange Multipliers

      How is this inspired by the method of Lagrange Multipliers?

    1. 0AB

      This is not \(\delta u_k\), so why is he using \(u_k\)? I don't see the justification

    1. other nonquantitative uses of bounds (such as providing indirect motivations for learning algorithms via constant fitting) do exist. We do not focus on those uses here

      What do they mean by "learning algorithms via constant fitting"?

  7. Dec 2018
    1. 

      Hyperparameter I guess

    2. it does not distinguish between models which fit the training data equally well

      Well, if it is regularized (so there is an effective prior), then this isn't true!

  8. openreview.net
    1. on networks and problems at a practical scale

      Yeah. But not on the original networks, but on the compressed ones!

    2. L()

      This is a 0-1 loss. It can be applied to top-1 or top-5 defining the loss appropriately!

    3. The pruned model achieves a validation accuracy of 60%.

      this is far from SOTA, right?

    4. A direct application of our naïve Occam bound yields non-vacuous bound on the test error of 98.6% (with 95% confidence)

      Ok, that is non-vacuous, but just, right?

    5. Our simple Occam bound requires only minimal assumptions, and can be directly applied to existing compressed networks

      But remember that the compression scheme technically shouldn't depend on the data, for the bound to be valid (as the PAC-Bayes prior depends on the compression scheme)

    6. We obtain a bound on the training error of 46%

      On the training error, or the test error?

    7. We prune the network using Dynamic Network Surgery (Guo et al., 2016), pruning all but 1.5% of the network weights.

      wow, they are very weight-compressible indeed

    1. Mean-field underestimates the variance

      Independence increases concentration

    2. \(\int \mathbb{E}_{q_{\neq i}(w_{\neq i})}[\log p(\mathcal{D}, w)]\, dw\)

      You are normalizing the log not the distribution here!

  9. Nov 2018
  10. openreview.net openreview.net
    1. \(|S|_c + |C|_c\)

      These are the sizes of the codes for encoding the vector of weight indices and discretized weight values, according to coding scheme \(c\)

    2. \(2^{64}\) bytes, which is \(2^{72}\) bits

      isn't \(2^{64}\) bytes \(2^{67}\) bits, as each byte is \(2^3\) bits?

    3. Theorem 4.1

      Bound on KL divergence for Universal prior, and point posterior

    4. As \(c\) is injective on \(H_c\), we have that \(Z \leq 1\).

      By Kraft inequality, and \(m\) being a probability measure

    5. The generalization bound can be evaluated by compressing a trained network, measuring the effective compressed size, and substituting this value into the bound.

      An important question is: are the bounds valid for the compressed network, or for the original one as well? (as isn't the case in the related work: https://arxiv.org/abs/1802.05296 )

    6. empirical evidence suggests that they fail to improve performance in practice(Wilson et al., 2017)

      Well, if this is the case, it means that proving a generalization bound for one of these procedures "implies" a bound for the standard training procedure, as you are claiming that the standard one generalizes not (much) worse than these ones

    1. To understand the phenomenon described by Saxe and in the video at 43:00, we can think of this: low eigenvalues in XX^T correspond to directions with little variation in the input. However, by the random fluctuation $\eta$, the output could have an O(1) variation, even for arbitrarily small input variation, which requires a large weight to fit, and produces large generalization error.

      for $\alpha<1$ the probability of these directions in input space with low variation decreases, as we get fewer directions overall with points I think (directions with no points/variation are ignored by the algorithm which projects the weight into the input subspace, and these are the 0 eigenvalue parts of the Marchenko-Pastur distribution)

    1. weak interactions between layers can cause the network to have high sharpness value.

      why does interacting weakly cause high sharpness??

      I think the reason is that for layers which interact strongly, random perturbations tend to cause a smaller relative change on the output than for layers that interact weakly. And this smaller relative change on the outputs probably translates into a smaller absolute change in the Loss...

      think about aligned eigenvectors and stuff

    2. depends only linearly on depth and does not have any exponential dependence, unlike other notions of generalization.

      Aren't VC dim bounds for NNs linear in depth also?

    3. forming a confusion set that includes samples with random labels.

      This is what's done in Wu et al https://arxiv.org/abs/1706.10239 also

    4. middle and right

      For a fixed expected sharpness the KL (and thus the effective capacity/bound) increases less for true labels than for random labels. Also, in both cases, it increases monotonically

    5. capacity is proportional toCMmarginndiamM(X).

      This is the "curse of dimensionality"!

      Probably can prove using covering numbers which bound RadComp (via Massart's lemma)

    6. simply bounding the Lipschitz constant of the network is not enough to get a reasonable capacity control

      but are the generalization error bounds for Lipschitz functions tight?

    7. However, the covering number of the input domain can be exponential in the input dimension and the capacity can still grow as

      Unless inputs lie on a low dimensional manifold I guess

    8. The bounds based on \(\ell_2\)-path norm and spectral norm can be derived directly from those based on \(\ell_1\)-path norm and \(\ell_2\) norm respectively

      Hmm. how?

    9. Instead, to meaningfully compare norms of the network, we should explicitly take into account the scaling of the outputs of the network. One way this can be done, when the training error is indeed zero, is to consider the “margin” of the predictions in addition to the norms of the parameters.

      This has been used in several papers on generalization bounds for neural nets.

      I think the idea is like that for margin bounds for SVMs. You can't bound the RadComp using the 0-1 loss as it's not Lipschitz (related to above arguments). But you can bound the Hinge loss (which is related to having a margin!), and then use the fact that the Hinge loss upper bounds the 0-1 loss.

      The intuition I guess is that the previous troublesome cases with norm-based capacity can be solved by taking margin into account, as the extremes in both cases are reducing/increasing the margin (for weights going to zero/infinity).

    10. proportional to \(\prod_{i=1}^{d}\|W_i\|^2_{1,\infty}\), where \(\|W_i\|_{1,\infty}\) is the maximum over hidden units in layer \(i\) of the \(\ell_1\) norm of incoming weights to the hidden unit [4].

      This comes from the Rademacher complexity of the neural network. See here: http://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/18/material/lecture3.pdf#page=4 for a derivation
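
      For concreteness, a tiny sketch of the quantity being discussed (random toy weights; this only computes the norm product, not the full Rademacher bound):

          import numpy as np

          def l1_inf_norm(W):
              # rows of W are hidden units: max over units of the l1 norm of incoming weights
              return np.abs(W).sum(axis=1).max()

          rng = np.random.default_rng(0)
          weights = [rng.normal(size=(64, 32)),   # layer 1: 32 -> 64
                     rng.normal(size=(64, 64)),   # layer 2: 64 -> 64
                     rng.normal(size=(1, 64))]    # linear output layer

          capacity_term = np.prod([l1_inf_norm(W) ** 2 for W in weights])
          print(capacity_term)   # the product-over-layers term appearing in the bound above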

    11. One can then ensure generalization of a learned hypothesis \(h\) in terms of the capacity of \(\mathcal{H}_{M, M(h)}\)

      They just state this, but they should maybe cite some work on nonuniform learnability, SRM, MDL, so that people unfamiliar with it can see why this is!

    12. any measure which is uniform across all functions representable by a given architecture, is not sufficient to explain the generalization ability of neural networks trained in practice. For linear models, norms and margin-based measures, and not the number of parameters, are commonly used for capacity control [4,8,24].

      These are all SRM based.

      They depend (most often implicitly via the training data) on the data-distribution/target function, and the algorithm (also implicitly). If you get a good bound with SRM it's most likely because your algorithm is biased towards the right kind of solution (kind basically corresponds to the classes used in SRM). In other words, a bad algorithm (relative to the target fun) will be very unlikely to produce a good bound. On the other hand, a good algorithm may produce a bad bound, if the prior is chosen badly. So SRM bounds may not be tight in general

  11. Oct 2018
    1. 2

      shouldn't there be a \(\lambda\) here too?

      And in fact another \(\lambda\) from Rad(A_1)?

    2. =

      this should be \(\leq\),

      though I would have used \(\mathcal{L} \cup -\mathcal{L}\) instead

    3. 0

      apostrophe shouldn't be there

    4. h

      hull

    5. =

      missing parenthesis

    6. l

      $$j$$

    1. (U) =RSP(Ujv)d(v).

      ? How does this define what \(\rho\) is?

    1. The largest last width function seems to converge slightly faster than the largest last width function

      typo

    2. more

      less

  12. Aug 2018
    1. \(\left\langle \frac{dC_i}{d\omega}\frac{dC_j}{d\omega}\right\rangle = \frac{1}{N}\left(\frac{dC}{d\omega}\right)^2 + F(\omega)_{ij}\)

      I don't get this formula. What are C_i and C_j supposed to mean? Or rather, where is the randomness over which we average?

    1. is not surprising and there are suitable options, even without GANs

      Well, what is nontrivial about this is generalizing (as usual): how do you generate novel samples? What are samples that "look like" the rest of the data? How do you define that?

  13. openreview.net
    1. The alignment problem is how to make sure that the input image in A is mapped (via image generation) to an analog image in B, where this analogy is not defined by training pairs or in any other explicit way (see Sec. 2).

      Well, you are attempting to define it as "the alignment given by the lowest complexity meme".

      I think this doesn't address the question, which is more generally addressed to the whole field of cross-domain mapping. Why aren't the new neural network approaches compared with previous approaches to unlabelled cross-domain mapping, which can be formalized as different forms of alignment between sets of points? I think the main difference here is that in those approaches you typically know the whole target domain, while here we don't quite know it, we just have a few samples from it, but I feel this is a surmountable problem

  14. Jul 2018
    1. The position, \(C\), of the camera expressed in world coordinates is \(C = -R^{-1}T = -R^{T}T\) (since \(R\) is a rotation matrix).

      The minus is because we are translating the world in the opposite way that the camera was translated, to get it right relative to the camera while putting the camera at the origin. The \(R^{-1}\) is because we first rotate by R and then translate by T, so to get the original translation vector we need to undo the rotation on T

    2. \(u_0\) and \(v_0\) represent the principal point, which would be ideally in the centre of the image.

      It is multiplied by the z component of the point because we want to translate by a fixed amount in the u-v plane. However, after being acted on by the camera matrix, the coordinates are in homogeneous form, scaled by the z component!
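
      A small numpy sanity check of both points (the rotation, translation, and intrinsics are my own toy values):

          import numpy as np

          # toy extrinsics: rotate 90 degrees about z, then translate (assumed values)
          theta = np.pi / 2
          R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                        [np.sin(theta),  np.cos(theta), 0.0],
                        [0.0, 0.0, 1.0]])
          T = np.array([1.0, 2.0, 3.0])

          # camera centre in world coordinates: undo the rotation applied to T, with a minus sign
          C = -R.T @ T
          print(np.allclose(R @ C + T, 0.0))   # True: the centre maps to the camera-frame origin

          # the principal point (u0, v0) enters multiplied by z (homogeneous coordinates),
          # so after dividing by z it is a fixed shift in the u-v plane
          fx, fy, u0, v0 = 100.0, 100.0, 320.0, 240.0
          K = np.array([[fx, 0.0, u0], [0.0, fy, v0], [0.0, 0.0, 1.0]])
          u, v, w = K @ np.array([0.5, -0.2, 4.0])   # a point already in camera coordinates
          print(u / w, v / w)                        # fx*x/z + u0, fy*y/z + v0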

    1. \(\frac{dJ}{d\theta} = \frac{dL}{d\theta}\)

      Why??

      I see, I think the derivative wrt \(\theta\) here is supposed to be while z satisfies the constraints in (32)...

    2. restricted architectures which partition the hidden units. Our approach does not have these restrictions

      But it has whatever restrictions being an ODE imposes. How does the expressivity and learnability of the ODEnet compare with ResNets?

    3. Minibatching

      Can you not just compute the gradient for each input, and average them?

    4. Second

      typo?

    5. \(\int_{t_{\text{start}}}^{t_{\text{end}}} \lambda(z(t))\,dt\)

      This second term comes from the fact that we want the probability of observations at times \(t_1, ..., t_N\), and at no other time between \(t_{start}\) and \(t_{end}\)

    6. layers of only a single hidden unit

      corresponding to weight matrices of rank 1

    7. Instantaneous Change of Variables

      This is just the differential form of the continuity equation
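
      A minimal statement of that correspondence (my notation: \(\dot{z} = f(z, t)\) is the dynamics and \(p(z, t)\) the density being transported):

      $$\frac{\partial p(z,t)}{\partial t} = -\nabla\cdot\big(p(z,t)\,f(z,t)\big) \;\;\Longrightarrow\;\; \frac{d\log p(z(t),t)}{dt} = -\operatorname{tr}\left(\frac{\partial f}{\partial z}\right),$$

      where the right-hand side follows by evaluating the continuity equation along a trajectory \(z(t)\).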

    8. 6 residual blocks, which are replaced by an ODESolve

      What are the residual blocks precisely? Single layer + ReLU?

    9. the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver and dependent on the initial state or input.

      Adaptive computation time

    10. 

      \(\theta(t)\)

    1. 1:00:00 scale separation as a way to go beyond Markov (local interactions) assumption

    2. ~32:00 What about the domain of the function being effectively lower dimensional, rather than a strong regularity assumption? That would also work, right? Could this be the case for images? (what's the dimensionality of the manifold of natural images?)

      Nice. I like the idea of regularity <> low dimensional representation. I guess by that general definition, the above is a form of regularity..

      He comments about this on 38:30

    1. to have identical norm at the input layer,

      This works because the marginal variance is independent of the orientation of the input vector. This can be proven by noting that a random Gaussian matrix doesn't change its distribution when multiplied by an orthogonal matrix.
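
      A quick numpy check of that claim (toy dimensions; a single row of the weight matrix stands in for the whole thing):

          import numpy as np

          rng = np.random.default_rng(0)
          d = 6
          x = rng.normal(size=d)
          Q, _ = np.linalg.qr(rng.normal(size=(d, d)))     # a random orthogonal matrix

          rows = rng.normal(size=(100_000, d))             # many draws of one row of W
          # both variances are ~ ||x||^2: the marginal variance only depends on the norm of the input
          print(np.var(rows @ x), np.var(rows @ (Q @ x)))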

  15. Jun 2018
    1. Since GPs give exact marginal likelihood estimates, this kernel construction may allow principled hyperparameter selection, or nonlinearity design, e.g. by gradient ascent on the log likelihood w.r.t. the hyperparameters. Although this is not the focus of current work, we hope to return to this topic in follow-up work

      Cool idea!

    2. will be a sum of i.i.d. terms

      Not true (for finite width nets), I think. The terms are conditionally independent, when conditioned on the value of the final hidden layer. If we consider the joint distribution over all the weights over all layers, then the terms won't be independent. However, in the limit of infinite widths, they do become independent. This is because their covariance is \(0\) and in the infinite width limit they become Gaussian with that 0 covariance, and therefore independent.

      Remember also that this is for a fixed input (and later analysis is for a finite collection of inputs)
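
      To spell out which sum is meant (standard NNGP-style notation, assumed rather than quoted from the paper): for a fixed input \(x\) the output is

      $$f(x) = b + \sum_{j=1}^{N} v_j\, \phi(h_j(x)),$$

      and the terms \(v_j\,\phi(h_j(x))\) share the randomness of the earlier layers through the \(h_j(x)\), so they are only independent once we condition on the last hidden layer (or pass to the infinite-width limit), which is the point above.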

    1. concentration

      is this concentration phenomenon related to the asymptotic equipartition properties of large collections of independent (or weakly dependent) variables?

  16. May 2018
  17. Apr 2018
    1. all the semantic consequences.

      What are semantic consequences? Is there such a thing as syntactic consequences? Is the statement said here different for them? I don't think so right?

    2. all possible subsets of a set?

      subsets are defined as properties in Logic. Second-order logic goes above first-order logic by being able to quantify over properties (and thus over subsets)

  18. Jan 2018
  19. static.googleusercontent.com
    1. The matrix \(Y^l (Y^l)^T\) is the \(l\)th-layer data covariance matrix. The distribution of its eigenvalues (or the singular values of \(Y^l\)) determine the extent to which the input signals become distorted or stretched as they propagate through the network

      I would say more that it determines the appearance of correlations between the elements of Y.

      Also, this isn't quite the covariance matrix, unless the mean of Y is zero?

    1. While local optima may not be a problem with deep neural networks in supervised learning where the correct answer is always given (Pascanu et al., 2014), the same is not true in reinforcement learning problems with sparse or deceptive rewards.

      Reason why evolutionary methods are useful in RL, but not so much in SL. What about UL?

    2. random search
  20. Dec 2017
    1. First, eigenvalues which are exactly zero (\(\lambda_i = 0\)) correspond to directions with no learning dynamics so that the parameters \(z_i\) will remain at \(z_i(0)\) indefinitely. These directions form a frozen subspace in which no learning occurs. Hence, if there are zero eigenvalues, weight initializations can have a lasting impact on generalization performance even after arbitrarily long training.

      No training on singular directions.

      However, it seems that sloppy directions, those with small eigenvalue in the input correlation matrix, cause overfitting. This is because in these directions there is small variability in the inputs, and yet there is variability in the output due to noise. To fit this small variability, we learn large weights, which are in fact too large, and don't generalize well.

    1. (1)

      See here. This is the cross-entropy between the likelihood for $\theta$, $q(X|\theta)$, and $q(X|\theta_0)$. It is the same as the disorder-averaged energy of state $\theta$, that is averaged over $X$. Note that because the energy is a mean over $x$, and each $x$ is independent, then for the disorder-averaged energy, we can just calculate the average for one $X$.

      If we evaluate this at $\theta=\theta_0$, then we have the Shannon entropy of $q(X|\theta_0)$. $\theta_0$ is the disorder-averaged ground state (it's not hard to see that it is the $\theta$ which minimizes $H(\theta;\theta_0)$; this is just a property of cross-entropy).


      The other one is the empirical version of this.

    2. (3)

      This is equal to

      $$\int_\Theta d\theta \bar{\omega}(\theta)\prod_{X\in x^N}q(X|\theta)$$

      which is just the probability of observing the samples $x^N$.

      But the interesting thing is that it can be written as a partition function where $\theta$ is the physical state vector as said after.

  21. Dec 2016