683 Matching Annotations

Jan 2019
www.jmlr.org www.jmlr.org

langford05a.dvi

1
1. guillefix 28 Jan 2019
  
  in Public
  
  other nonquantitative uses of bounds (such as providing indirect motivations forlearning algorithms via constant fitting) do exist. We do not focus on those uses here
  
  What do they mean by "learning algorithms via constant fitting"?
Visit annotations in context

Annotators

guillefix

URL

jmlr.org/papers/volume6/langford05a/langford05a.pdf
Dec 2018
arxiv.org arxiv.org

1808.05563.pdf

2
1. guillefix 04 Dec 2018
  
  in Public
  
  Hyperparameter I guess
2. guillefix 04 Dec 2018
  
  in Public
  
  it does not distinguish between models which fit the training data equally well
  
  Well, if it is regularized (so there is an effective prior), then this isn't true!
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1808.05563.pdf
openreview.net openreview.net

pdf

7
1. guillefix 03 Dec 2018
  
  in Public
  
  on networks and problems at a practical scale
  
  Yeah. But not on the original networks, but on the compressed ones!
2. guillefix 03 Dec 2018
  
  in Public
  
  L()
  
  This is a 0-1 loss. It can be applied to top-1 or top-5 defining the loss appropriately!
3. guillefix 03 Dec 2018
  
  in Public
  
  The pruned model achieves a validationaccuracy of60 %.
  
  this is far from SOTA. right?
4. guillefix 03 Dec 2018
  
  in Public
  
  A direct application of our naïve Occam bound yields non-vacuousbound on the test error of98:6%(with95%confidence)
  
  Ok, that is non-vacuous, but just, right?
5. guillefix 03 Dec 2018
  
  in Public
  
  Our simple Occam bound requires only minimal assumptions, and can be directly applied to existingcompressed networks
  
  But remember that the compression scheme technically shouldn't depend on the data, for the bound to be valid (as the PAC-Bayes prior depends on the compression scheme)
6. guillefix 03 Dec 2018
  
  in Public
  
  We obtain a bound on the training error of46%
  
  On the training error, or the test error?
7. guillefix 03 Dec 2018
  
  in Public
  
  We prunethe network using Dynamic Network Surgery (Guo et al., 2016), pruning all but1:5%of the networkweights.
  
  wow, they are very weight-compressible indeed
Visit annotations in context

Annotators

guillefix

URL

openreview.net/forum
emtiyaz.github.io emtiyaz.github.io

approxBayesInference.pdf

2
1. guillefix 01 Dec 2018
  
  in Public
  
  Mean-eld underestimates the vari-ance
  
  Independence increases concentration
2. guillefix 01 Dec 2018
  
  in Public
  
  REq=i(w=i)[logp(D;w)]dw
  
  You are normalizing the log not the distribution here!
Visit annotations in context

Annotators

guillefix

URL

emtiyaz.github.io/teaching/ds3_2018/approxBayesInference.pdf
www.cs.cmu.edu www.cs.cmu.edu

10708-scribe-lecture13.pdf

1
1. guillefix 01 Dec 2018
  
  in Public
  
  /
  
  Exp missing
Visit annotations in context

Annotators

guillefix

URL

cs.cmu.edu/~epxing/Class/10708-17/notes-17/10708-scribe-lecture13.pdf
Nov 2018
emtiyaz.github.io emtiyaz.github.io

approxBayesInference.pdf

1
1. guillefix 30 Nov 2018
  
  in Public
  
  DKL[qkp]
  
  It is minus this!
Visit annotations in context

Annotators

guillefix

URL

emtiyaz.github.io/teaching/ds3_2018/approxBayesInference.pdf
openreview.net openreview.net

pdf

6
1. guillefix 27 Nov 2018
  
  in Public
  
  jSjc+jCjc
  
  These are the sizes of the codes for encoding the vector of weight indices and discretized weight values, according to coding scheme $c$
2. guillefix 27 Nov 2018
  
  in Public
  
  264bytes, which is272bits
  
  isn't $2^{64}\ bytes \(2^{67}$ bits, as each byte is $2^3$ bits?
3. guillefix 27 Nov 2018
  
  in Public
  
  Theorem 4.1
  
  Bound on KL divergence for Universal prior, and point posterior
4. guillefix 27 Nov 2018
  
  in Public
  
  Ascis injective onHc, we have thatZ1.
  
  By Kraft inequality, and $m$ being a probability measure
5. guillefix 19 Nov 2018
  
  in Public
  
  The generalization bound can be evaluated by compressing a trained network, measuring theeffective compressed size, and substituting this value into the bound.
  
  An important question is: are the bounds valid for the compressed network, or for the original one as well? (as isn't the case in the related work: https://arxiv.org/abs/1802.05296 )
6. guillefix 08 Nov 2018
  
  in Public
  
  empirical evidence suggests that they fail to improve performance in practice(Wilson et al., 2017)
  
  Well. If this is the case it means that proving a generalization bound for one of these procedures "implies" a bound for the standard training procedure, as you are claiming that the standard one generalizes not (much) worse than these ones
Visit annotations in context

Annotators

guillefix

URL

openreview.net/forum
www.youtube.com www.youtube.com

Sompolinsky II Beg Rohu 2018

1
1. guillefix 08 Nov 2018
  
  in Public
  
  To understand phenomenon described by Saxe and in the video at 43:00, we can think of this: Low eigenvalues in XX^T correspond to directions with little variations in the input. However, by the random fluctuation eta, the output could have an O(1) variation, even for arbitrarily small input variation, which requires a large weight to fit, and produces large generalization error.
  
  for $\alpha<1$ the probability of this directions in input space with low variation decreases, as we get less directions overall with points I think (directions with no points/variation are ignored for the algorithm which projects weight into input subspace, and these are the 0 eigenvalue parts of the Marchenko-Pastur distr)
Visit annotations in context

Annotators

guillefix

URL

youtube.com/watch
arxiv.org arxiv.org

1706.08947.pdf

11
1. guillefix 07 Nov 2018
  
  in Public
  
  weakinteractions between layers can cause the network to have high sharpness value.
  
  why does interacting weakly cause high sharpnes??
  
  I think the reason is that for layers which interact strongly random perturbations tend to cause smaller relative change on the output, than for layers that interact strongly. And this smaller relative change on outputs probably translates into a smaller absolute change in the Loss...
  
  think about aligned eigenvectors and stuff
2. guillefix 07 Nov 2018
  
  in Public
  
  depends only linearly on depth and does not have any exponential dependence, unlike other notionsof generalization.
  
  Aren't VC dim bounds for NNs linear in depth also?
3. guillefix 07 Nov 2018
  
  in Public
  
  forming a confusion set that includessamples with random labels.
  
  This is what's done in Wu et al https://arxiv.org/abs/1706.10239 also
4. guillefix 07 Nov 2018
  
  in Public
  
  middle and right
  
  For a fixed expected sharpness the KL (and thus the effective capacity/bound) increases less for true labels than for random lables. Also, in both cases, it increases monotonically
5. guillefix 07 Nov 2018
  
  in Public
  
  capacity is proportional toCMmarginndiamM(X).
  
  This is the "curse of dimensionality"!
  
  Probably can prove using covering numbers which bound RadComp (via Massart's lemma)
6. guillefix 07 Nov 2018
  
  in Public
  
  simply bounding the Lipschitz constantof the network is not enough to get a reasonable capacity control
  
  but are the generalization error bounds for Lipschitz functions tight?
7. guillefix 07 Nov 2018
  
  in Public
  
  However, the covering number of the input domaincan be exponential in the input dimension and the capacity can still grow as
  
  Unless inputs lie on a slow dimensional manifold I guess
8. guillefix 07 Nov 2018
  
  in Public
  
  Instead, to meaningfully compare norms of the network, we should explicitly take into account thescaling of the outputs of the network. One way this can be done, when the training error is indeedzero, is to consider the “margin” of the predictions in addition to the norms of the parameters.
  
  This has been used in several papers on generalization bounds for neural nets.
  
  I think the idea is like that for margin bounds for SVMs. You can't bound the RadComp using the 0-1 loss as it's Lipschitz (related to above arguments). But you can bound the Hinge loss (which is related to having a margin!), and then use the fact that the Hinge loss upper bounds the 0-1 loss.
  
  The intuition I guess is that the previous troublesome cases with norm-based capacity can be solved by taking margin into account, as the extremes in both cases are reducing/increasing the margin (for weights going to zero/infinity).
9. guillefix 07 Nov 2018
  
  in Public
  
  proportional toQdi=1kWik21;1,wherekWik1;1is the maximum over hidden units in layeriof the`1norm of incoming weights tothe hidden unit [4].
  
  This comes from the Rademacher complexity of the neural network. See here: http://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/18/material/lecture3.pdf#page=4 for a derivation
10. guillefix 07 Nov 2018
  
  in Public
  
  One can then ensure generalization of a learned hypothesishin terms of the capacity ofHM;M(h)
  
  They just state this, but they should maybe cite some work on nonuniform learnability, SRM, MDL, so that people unfamiliar with it can see why this is!
11. guillefix 07 Nov 2018
  
  in Public
  
  anymeasure which is uniform across all functions representable by a given architecture, is not sufficientto explain the generalization ability of neural networks trained in practice. For linear models, normsand margin-based measures, and not the number of parameters, are commonly used for capacitycontrol [4,8,24].
  
  These are all SRM based.
  
  They depend (most often implicitly via the training data) on the data-distribution/target function, and the algorithm (also implicitely) If you get a good bound with SRM it's most likely because your algorithm is biased towards the right kind of solution (kind basically corresponds to the classes used in SRM). In other words, a bad algorithm (relative to target fun) will be very unlikely to produce a good bound. On the other hand, a good algorithm may produce a bad bound, if the prior is chosen badly. So that SRM bounds may not be tight in general
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1706.08947.pdf
arxiv.org arxiv.org

1810.05148.pdf

1
1. guillefix 05 Nov 2018
  
  in Public
  
  typo $\theta$ shouldn't be here
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1810.05148.pdf
Oct 2018
www.stats.ox.ac.uk www.stats.ox.ac.uk

lecture4.pdf

1
1. guillefix 31 Oct 2018
  
  in Public
  
  f
  
  is
Visit annotations in context

Annotators

guillefix

URL

stats.ox.ac.uk/~rebeschi/teaching/AFoL/18/material/lecture4.pdf
www.stats.ox.ac.uk www.stats.ox.ac.uk

lecture5.pdf

4
1. guillefix 31 Oct 2018
  
  in Public
  
  + 1
  
  I don't think the $+1$ is actually necessary
2. guillefix 31 Oct 2018
  
  in Public
  
  Proposition 5.1
  
  Remember: larger metric correspond to smaller rulers
3. guillefix 31 Oct 2018
  
  in Public
  
  the
  
  there
4. guillefix 31 Oct 2018
  
  in Public
  
  5.2
  
  5.3
Visit annotations in context

Annotators

guillefix

URL

stats.ox.ac.uk/~rebeschi/teaching/AFoL/18/material/lecture5.pdf
www.stats.ox.ac.uk www.stats.ox.ac.uk

lecture3.pdf

6
1. guillefix 30 Oct 2018
  
  in Public
  
  2
  
  shouldn't there be a $\lambda$ here too?
  
  And in fact another $\lambda$ from Rad(A_1)?
2. guillefix 30 Oct 2018
  
  in Public
  
  =
  
  this should be $\leq$,
  
  thought I would have used $\mathcal{L} \cup -\mathcal{L}$ instead
3. guillefix 30 Oct 2018
  
  in Public
  
  0
  
  appostrophe shouldn't be there
4. guillefix 30 Oct 2018
  
  in Public
  
  h
  
  hull
5. guillefix 30 Oct 2018
  
  in Public
  
  =
  
  missing parenthesis
6. guillefix 30 Oct 2018
  
  in Public
  
  l
  
  $$j$$
Visit annotations in context

Annotators

guillefix

URL

stats.ox.ac.uk/~rebeschi/teaching/AFoL/18/material/lecture3.pdf
www.stats.ox.ac.uk www.stats.ox.ac.uk

lecture2.pdf

1
1. guillefix 30 Oct 2018
  
  in Public
  
  =
  
  $\leq$
Visit annotations in context

Annotators

guillefix

URL

stats.ox.ac.uk/~rebeschi/teaching/AFoL/18/material/lecture2.pdf
arxiv.org arxiv.org

1810.04586.pdf

1
1. guillefix 23 Oct 2018
  
  in Public
  
  (U) =RSP(Ujv)d(v).
  
  ? How does this define what $\rho$ is?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1810.04586.pdf
arxiv.org arxiv.org

1804.11271.pdf

2
1. guillefix 17 Oct 2018
  
  in Public
  
  he largestlast width function seems to converge slightly faster than the largest last width function
  
  typo
2. guillefix 17 Oct 2018
  
  in Public
  
  more
  
  less
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1804.11271.pdf
courses.cs.washington.edu courses.cs.washington.edu

lecture13.pdf

1
1. guillefix 02 Oct 2018
  
  in Public
  
  distribution
Visit annotations in context

Annotators

guillefix

URL

courses.cs.washington.edu/courses/cse522/11wi/scribes/lecture13.pdf
Aug 2018
arxiv.org arxiv.org

1710.06451.pdf

1
1. guillefix 21 Aug 2018
  
  in Public
  
  DdCid!dCjd!E=1NdCd!2+F(!)ij
  
  I don't get this formula. What are C_i and C_j supposed to mean. Or rather, where is the randomness over which we average?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1710.06451.pdf
arxiv.org arxiv.org

1807.08501.pdf

1
1. guillefix 10 Aug 2018
  
  in Public
  
  is not surprising and there are suitable options, even without GANs
  
  Well, what is nontrivial about this is generalizing (as usual), how do you generate novel samples, what are samples that "look like" the rest of the data. How do you define that?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1807.08501.pdf
openreview.net openreview.net

pdf

1
1. guillefix 10 Aug 2018
  
  in Public
  
  The alignment problem is how to make sure that the input image in A is mapped (via image generation) to an analog image in B, where this analogy is not defined by training pairs or in any other explicit way (see Sec. 2).
  
  Well, you are attempting to define it as "the alignment given by the lowest complexity meme".
  
  I think this doesn't address the question, which is more generally addressed to the whole field of cross-domain mapping, I think. Why aren't the new neural network approaches compared with previous approaches of unlabelled cross-domain mapping, which can be formalized as different forms of alignment between sets of points. I think the main difference here is that in those approaches you typically know the whole target domain, while here we don't quite know it, we just have a few samples from it, but I feel this is a surmontable problem
  
  GAN
Visit annotations in context

Tags

GAN

Annotators

guillefix

URL

openreview.net/forum
Jul 2018
www.wikiwand.com www.wikiwand.com

Camera resectioning | Wikiwand

2
1. guillefix 22 Jul 2018
  
  in Public
  
  The position, C {\displaystyle C} , of the camera expressed in world coordinates is C = − R − 1 T = − R T T {\displaystyle C=-R^{-1}T=-R^{T}T} (since R {\displaystyle R} is a rotation matrix).
  
  The minus is because we are translating the world in the opposite way that the camera was translated to get it right relative to the camera, while putting the camera at the origin. The R^-1 is because we first rotate by R and then translate by T, so to get the original translation vector we need to undo the rotation to T
2. guillefix 22 Jul 2018
  
  in Public
  
  u 0 {\displaystyle u_{0)) and v 0 {\displaystyle v_{0)) represent the principal point, which would be ideally in the centre of the image.
  
  It is multiplied by the z component of the point because we want to translate by a fixed amount in the u-v plane. However, after being acted by the camera matrix, they are in homogeneous coordinates, scaled by the z component!
Visit annotations in context

Annotators

guillefix

URL

wikiwand.com/en/Camera_resectioning
arxiv.org arxiv.org

1806.07366.pdf

10
1. guillefix 18 Jul 2018
  
  in Public
  
  dJd=dLd
  
  Why??
  
  I see, I think the derivative wrt $\theta$ here is supposed to be while z satisfies the constraints in (32)...
2. guillefix 18 Jul 2018
  
  in Public
  
  restricted architectures which partition the hidden units.Our approach does not have these restrictions
  
  But it has whatever restriction being an ODE imposses. How does the expressivity and learnability of the ODEnet compare with ResNets?
3. guillefix 18 Jul 2018
  
  in Public
  
  Minibatching
  
  Can you not just compute the gradient for each input, and average them?
4. guillefix 18 Jul 2018
  
  in Public
  
  Second
  
  typo?
5. guillefix 18 Jul 2018
  
  in Public
  
  tendtstart(z(t))dt
  
  This second term comes from the fact that we want the probability of observations at times $t_1, ..., t_N$, and at no other time between $t_{start}$ and $t_{end}$
6. guillefix 18 Jul 2018
  
  in Public
  
  layers of only a single hidden unit
  
  corresponding to weight matrices of rank 1
7. guillefix 18 Jul 2018
  
  in Public
  
  Instantaneous Change of Variables
  
  This is just the differential form of the continuity equation
8. guillefix 18 Jul 2018
  
  in Public
  
  6 residual blocks, which are replaced by an ODESolve
  
  What are the residual blocks precisely? Single layer + ReLU?
9. guillefix 18 Jul 2018
  
  in Public
  
  the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solverand dependent on the initial state or input.
  
  Adaptive computation time
10. guillefix 18 Jul 2018
  
  in Public
  
  $\theta(t)$
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1806.07366
www.youtube.com www.youtube.com

Mallat I Beg Rohu 2018 - YouTube

2
1. guillefix 12 Jul 2018
  
  in Public
  
  1:00:00 scale separation as a way to go beyond Markov (local interactions) assumption
2. guillefix 12 Jul 2018
  
  in Public
  
  ~32:00 What about the domain of the function being effectively lower dimensional, rather than a strongly regularity assumption? That would also work, right? Could this be the case for images? (what's the dimensionality of the manifold of natural images?)
  
  Nice. I like the idea of regularity <> low dimensional representation. I guess by that general definition, the above is a form of regularity..
  
  He comments about this on 38:30
  
  high-dimensions machine learning
Visit annotations in context

Tags

machine learning

high-dimensions

Annotators

guillefix

URL

youtube.com/watch
pdfs.semanticscholar.org pdfs.semanticscholar.org

092b962bcb430fdcebf1407d1299adb1a10b.pdf

1
1. guillefix 10 Jul 2018
  
  in Public
  
  it should be +? This typo carries through to later parts
Visit annotations in context

Annotators

guillefix

URL

pdfs.semanticscholar.org/3e38/092b962bcb430fdcebf1407d1299adb1a10b.pdf
arxiv.org arxiv.org

1711.00165.pdf

1
1. guillefix 03 Jul 2018
  
  in Public
  
  to have identical norm at the input layer,
  
  This works because the marginal variance is independent of the orientation of the input vector. This can be proven by noting that a random Gaussian matrix doesn't change its distribution when rotated by an orthonormal similarity matrix..
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1711.00165.pdf
arxiv.org arxiv.org

1806.03335.pdf

1
1. guillefix 03 Jul 2018
  
  in Public
  
  (2)
  
  Why is this the variational loss???
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1806.03335.pdf
Jun 2018
arxiv.org arxiv.org

1711.00165.pdf

2
1. guillefix 20 Jun 2018
  
  in Public
  
  Since GPs give exact marginal likelihood estimates, this kernel construction mayallow principled hyperparameter selection, or nonlinearity design, e.g. by gradient ascent on the loglikelihood w.r.t. the hyperparameters. Although this is not the focus of current work, we hope toreturn to this topic in follow-up work
  
  Cool idea!
2. guillefix 19 Jun 2018
  
  in Public
  
  will be a sum of i.i.d. terms
  
  Not true (for finite width nets) I think. the terms are conditionally independent, when conditioned on the value of the final hidden layer. If we consider the joint distribution over all the weights over all layers, then the terms won't be independent. However, in the limit of infinite widths, they do become independent. This is because their covariance is $0$ and in the infinite width limit, they become Gaussian with that 0 covariance, and therefore independent.
  
  Remember also that this is for a fixed input (and later analysis is for a finite collection of inputs)
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1711.00165.pdf
web.math.princeton.edu web.math.princeton.edu

ORF570.pdf

1
1. guillefix 09 Jun 2018
  
  in Public
  
  concentration
  
  is this concentration phenomenon related to the asymptotic equipartition properties of large collections of independent (or weakly dependent) variables?
Visit annotations in context

Annotators

guillefix

URL

web.math.princeton.edu/~rvan/APC550.pdf
May 2018
www.stats.ox.ac.uk www.stats.ox.ac.uk

The Mirror Descent Algorithm

2
1. guillefix 31 May 2018
  
  in Public
  
  −1
  
  this means the functional inverse
2. guillefix 31 May 2018
  
  in Public
  
  φ
  
  this should be the other phi?
Visit annotations in context

Annotators

guillefix

URL

stats.ox.ac.uk/~lienart/blog_opti_mda.html
Apr 2018
www.lesswrong.com www.lesswrong.com

Second-Order Logic: The Controversy

2
1. guillefix 25 Apr 2018
  
  in Public
  
  all the semantic consequences.
  
  What are semantic consequences? Is there such a thing as syntactic consequences? Is the statement said here different for them? I don't think so right?
2. guillefix 25 Apr 2018
  
  in Public
  
  all possible subsets of a set?
  
  subsets are defined as properties in Logic. Second-order logic goes above first-order logic by being able to quantify over properties (and thus over subsets)
Visit annotations in context

Annotators

guillefix

URL

lesswrong.com/posts/SWn4rqdycu83ikfBa/second-order-logic-the-controversy
Jan 2018
static.googleusercontent.com static.googleusercontent.com

46342.pdf

1
1. guillefix 15 Jan 2018
  
  in Public
  
  he matrixYl(Yl)Tis thelth-layer data covariance matrix. The distribution of its eigenvalues (or thesingular values ofYl) determine the extent to which the input signals become distorted or stretchedas they propagate through the network
  
  I would say more that it determines the appearance of correlations between the elements of Y.
  
  Also, this isn't quite the covariance matrix, unless the mean of Y is zero?
Visit annotations in context

Annotators

guillefix

URL

static.googleusercontent.com/media/research.google.com/en//pubs/archive/46342.pdf
arxiv.org arxiv.org

1712.06567.pdf

2
1. guillefix 10 Jan 2018
  
  in Public
  
  While local optima may not be a problem withdeep neural networks in supervised learning where the cor-rect answer is always given (Pascanu et al., 2014), the sameis not true in reinforcement learning problems with sparseor deceptive rewards.
  
  Reason why evolutionary methods are useful in RL, but not so much in SL. What about UL?
2. guillefix 10 Jan 2018
  
  in Public
  
  random search
  
  Is this what they mean https://en.wikipedia.org/wiki/Random_search ?
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1712.06567.pdf
Dec 2017
www.people.fas.harvard.edu www.people.fas.harvard.edu

Advani, Saxe - 2017 - High-dimensional dynamics of generalization error in neural networks.pdf

1
1. guillefix 15 Dec 2017
  
  in Public
  
  irst, eigenvalueswhich are exactly zero (λi= 0) correspond to directions with no learning dynamics so thatthe parametersziwill remain atzi(0) indefinitely. These directions form afrozen subspacein which no learning occurs. Hence, if there are zero eigenvalues, weight initializations canhave a lasting impact on generalization performance even after arbitrarily long training.
  
  No training on singular directions.
  
  However, it seems like sloppy directions. Those with small eigenvalue in input correlation matrix causes overfitting. This is because in these direction there is small variability on the inputs, and yet there is variability on the output due to noise. To fit the small variability, we learn large weights, which are in fact too large, and don't generalize well..
Visit annotations in context

Annotators

guillefix

URL

people.fas.harvard.edu/~asaxe/papers/Advani, Saxe - 2017 - High-dimensional dynamics of generalization error in neural networks.pdf
arxiv.org arxiv.org

1706.01428.pdf

2
1. guillefix 05 Dec 2017
  
  in Public
  
  (1)
  
  See here. This is the cross-entropy between the likelihood for $\theta$, $q(X|\theta)$, and $q(X|\theta_0)$. It is the same as the disorder-averaged energy of state $\theta$, that is averaged over $X$. Note that because the energy is a mean over $x$, and each $x$ is independent, then for the disorder-averaged energy, we can just calculate the average for one $X$.
  
  If we evaluate this at $\theta=\theta_0$, then we have the Shannon entropy of $q(X|\theta_0$. $\theta_0$ is the disorderd-averaged ground state (it's not hard to see that it is the $\theta$ which minimizes $H(\theta;\theta_0), this is just a property of cross-entropy).
  
  The other one is the empirical version of this.
2. guillefix 05 Dec 2017
  
  in Public
  
  (3)
  
  This is equal to
  
  $$\int_\Theta d\theta \bar{\omega}(\theta)\prod_{X\in x^N}q(X|\theta)$$
  
  which is just the probability of observing the samples $x^N$.
  
  But the interesting thing is that it can be written as a partition function where $\theta$ is the physical state vector as said after.
Visit annotations in context

Annotators

guillefix

URL

arxiv.org/pdf/1706.01428.pdf
Dec 2016
journal.frontiersin.org journal.frontiersin.org

Toward an Integration of Deep Learning and Neuroscience

1
1. guillefix 11 Dec 2016
  
  in Public
  
  node-perturbation-like correlations
  
  What is this?
Visit annotations in context

Annotators

guillefix

URL

journal.frontiersin.org/article/10.3389/fncom.2016.00094/full

guillefix

Annotations: 683

Joined: February 23, 2015

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Tags

Annotators

URL

Annotators

URL

Annotators

URL

Tags

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL