38 Matching Annotations
  1. Mar 2019
    1. heap.sort() maintains the heap invariant

      Sorting may swap the positions of nodes at the same depth, but the resulting sorted array is still a valid min-heap (every parent is <= its children).
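
      A minimal Python sketch (data values are arbitrary) showing that a sorted list still satisfies the heapq min-heap invariant:

        import heapq

        data = [5, 1, 9, 3, 7, 2]
        heapq.heapify(data)   # rearrange in-place into heap order
        data.sort()           # a sorted list is still a valid min-heap

        # invariant: every parent is <= both of its children
        n = len(data)
        assert all(data[i] <= data[c]
                   for i in range(n)
                   for c in (2 * i + 1, 2 * i + 2) if c < n)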

    1. The most common error I see is a subconscious assumption that each word can have at most one synonym

      Use sets as the value.
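
      A hedged sketch of the "sets as the value" idea (the word pairs are made up for illustration):

        from collections import defaultdict

        # map each word to the *set* of its synonyms, never to a single synonym
        synonyms = defaultdict(set)
        for a, b in [("big", "large"), ("big", "huge"), ("happy", "glad")]:
            synonyms[a].add(b)
            synonyms[b].add(a)   # treating synonymy as symmetric here

        assert synonyms["big"] == {"large", "huge"}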

  2. Oct 2018
    1. The perplexity of the model q is defined as \(b^{-\frac{1}{N}\sum_{i=1}^{N}\log_{b}q(x_{i})}\)

      The perplexity formula is missing the probability distribution \(p\)
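
      A small sketch of the definition, assuming the held-out samples \(x_i\) are drawn from the true distribution \(p\) (which is how \(p\) enters the formula implicitly):

        import math

        def perplexity(q, samples, b=2.0):
            # b ** (-(1/N) * sum_i log_b q(x_i)), with the x_i drawn from p
            n = len(samples)
            log_lik = sum(math.log(q[x], b) for x in samples)
            return b ** (-log_lik / n)

        q = {"a": 0.5, "b": 0.25, "c": 0.25}          # toy model
        print(perplexity(q, ["a", "a", "b", "c"]))    # ~2.83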

    1. It has been demonstrated that this formulation is almost equivalent to a SLIM model,[9] which is an item-item model based recommender

      So a pre-trained item model can be used to make such recommendations.

    2. The user's latent factors represent the preference of that user for the corresponding item's latent factors

      The higher the value of the dot product between the two, the higher the preference.

    3. two lower dimensional matrices

      Not necessarily (in fact, often not) square. Typically each user is represented by a vector of dimension strictly less than the number of items and vice versa.
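
      A minimal numpy sketch of the factorization (sizes are hypothetical): both factor matrices share a small latent dimension k, and the predicted preference is the dot product of a user vector with an item vector:

        import numpy as np

        n_users, n_items, k = 100, 500, 10      # k << n_users, k << n_items
        rng = np.random.default_rng(0)
        U = rng.normal(size=(n_users, k))       # user latent factors
        V = rng.normal(size=(n_items, k))       # item latent factors

        R_hat = U @ V.T                         # (n_users, n_items) predicted preferences
        score = U[3] @ V[42]                    # larger dot product => stronger preference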

    1. it will not try to start a failover if the master link was disconnected for more than the specified amount of time

      Why would it exhibit this behavior? Is it because a slave that's disconnected from the master for too long has stale data? Or is it because the slave may be failing as well?

  3. Sep 2018
    1. conditional distribution for individual components can be constructed

      So the conditional distribution is conditioned on other components?
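
      Yes: in Gibbs sampling each component is redrawn from its distribution conditioned on the current values of all the other components. A toy sketch for a standard bivariate Gaussian with correlation rho (value chosen arbitrarily):

        import numpy as np

        rho = 0.8
        rng = np.random.default_rng(0)
        x, y, samples = 0.0, 0.0, []
        for _ in range(10_000):
            # each draw conditions on the current value of the *other* component
            x = rng.normal(rho * y, np.sqrt(1 - rho**2))
            y = rng.normal(rho * x, np.sqrt(1 - rho**2))
            samples.append((x, y))

        print(np.corrcoef(np.array(samples).T))   # empirical correlation close to rho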

    2. p(y∣x)=∫p(y∣f,x)p(f∣x)df

      \(y\) is the data, \(f\) is the model, \(x\) is the input variable

    1. must store an amount of information which increases with the size of the data

      Or you can use MCMC.

    2. some

      *sum

    3. calculation once again involves inverting a NxN matrix as in the kernel space representation of regression

      this is why we use MCMC or some other distribution-sampling technique instead

    4. $f(x_*)$ for a test vector input $x_*$, given a training set $X$ with values $y$, for the GP is once again a gaussian given by equation C with a mean vector $m_*$ and covariance matrix $k_*$:

      ...$f(x)$ for a test vector input $x$, given a training set $X$ with values $y$ for the GP is once again a gaussian given by equation C with a mean vector $m$ and covariance matrix $k$:

    5. corvariance

      *covariance

    6. in equation B for the marginal of a gaussian, only the covariance of the block of the matrix involving the unmarginalized dimensions matters! Thus “if you ask only for the properties of the function (you are fitting to the data) at a finite number of points, then inference in the Gaussian process will give you the same answer if you ignore the infinitely many other points, as if you would have taken them all into account!” (Rasmussen)

      key insight into Gaussian processes
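
      For reference, the marginalization property being invoked (presumably what the source calls equation B): dropping the dimensions you do not ask about simply drops the corresponding blocks of the mean and covariance,

        \begin{pmatrix} f_A \\ f_B \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} K_{AA} & K_{AB} \\ K_{BA} & K_{BB} \end{pmatrix} \right) \;\Longrightarrow\; f_A \sim \mathcal{N}(\mu_A, K_{AA}).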

    7. they

      *the

    8. im

      *in

    9. Notice now that the features only appear in the combination $\kappa(x,x') = x^T \Sigma x'$, thus leading to writing the posterior predictive as $p(f(x_*) \mid x_*, X, y) = N\left(\kappa(x_*,X)\left(\kappa(X^T,X) + \sigma^2 I\right)^{-1}y,\;\; \kappa(x_*,x_*) - \kappa(x_*,X^T)\left(\kappa(X^T,X) + \sigma^2 I\right)^{-1}\kappa(X^T,x_*)\right)$. The function $\kappa$ is called the kernel

      how the kernel came about?
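
      A self-contained numpy sketch of this posterior predictive (the RBF kernel, noise level, and data are hypothetical); the np.linalg.solve call on $\kappa(X,X)+\sigma^2 I$ is the $O(N^3)$ step noted above:

        import numpy as np

        def rbf(A, B, length=1.0):
            # squared-exponential kernel kappa(a, b) = exp(-(a - b)^2 / (2 * length^2))
            return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * length**2))

        rng = np.random.default_rng(0)
        X = np.linspace(0, 5, 20)                     # training inputs
        y = np.sin(X) + 0.1 * rng.normal(size=20)     # noisy training targets
        x_star = np.array([2.5])                      # test input
        sigma2 = 0.1 ** 2                             # observation noise variance

        K = rbf(X, X) + sigma2 * np.eye(len(X))       # kappa(X, X) + sigma^2 I
        K_s = rbf(x_star, X)                          # kappa(x_*, X)
        mean = K_s @ np.linalg.solve(K, y)            # posterior predictive mean
        cov = rbf(x_star, x_star) - K_s @ np.linalg.solve(K, K_s.T)
        print(mean, cov)                              # posterior predictive covariance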

  4. Aug 2018
    1. expected values change during the series

      So no longer identically distributed

    1. To learn a network for Cifar-10, DARTS takes just 4 GPU days, compared to 1800 GPU days for NASNet and 3150 GPU days for AmoebaNet

      What about in comparison to ENAS?

  5. Jul 2018
  6. mp.weixin.qq.com
    1. Max-min game (极大极小博弈)

      Choose D to maximally discriminate the real data from G's samples; choose G to best "confuse" D.
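
      For reference, the min-max objective this describes (standard GAN formulation; \(p_{\text{data}}\) is the real data distribution and \(p_z\) the noise prior):

        \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]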

    2. Cross entropy (交叉熵)

      Cross entropy

    3. Zero-sum game (零和博弈)

      Zero-sum game

  7. Jun 2018
    1. you will have to do a lot of work to select appropriate input data and to code the data as numeric values

      Not really anymore with the advent of convolutional neural networks.


  8. May 2018