 Aug 2019

en.wikipedia.org

As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families, namely the global matrix factorization and local context window methods
What does "log-bilinear regression" mean exactly?

 Jul 2019

en.wikipedia.org

An Oblivious Tree is a rooted tree with the following property: All the leaves are in the same level. All the internal nodes have degree at most 3. Only the nodes along the rightmost path in the tree may have degree of one.
Note this is not the definition of the oblivious decision trees in the CatBoost paper.
There, an oblivious decision tree is a tree in which every internal node at a given level splits on the same feature, and all the leaves are at the same level.
See: https://stats.stackexchange.com/questions/353172/whatisobliviousdecisiontreeandwhy
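
A minimal sketch of the second (CatBoost-style) meaning, assuming each level holds a single (feature index, threshold) pair shared by every node at that level. The names `oblivious_leaf_index`, `levels`, and `leaf_values` are made up for illustration and are not CatBoost's API. Because every level asks the same question, a leaf is addressed by the bit string of answers.

```python
from typing import List, Sequence, Tuple


def oblivious_leaf_index(x: Sequence[float], levels: List[Tuple[int, float]]) -> int:
    """Return the leaf index reached by x in an oblivious decision tree.

    `levels` holds one (feature_index, threshold) pair per tree level.
    Every node on a level applies the same test, so the path is just the
    sequence of test outcomes read as a binary number.
    """
    index = 0
    for feature, threshold in levels:
        bit = 1 if x[feature] > threshold else 0
        index = (index << 1) | bit
    return index


# Hypothetical depth-2 tree: level 0 tests feature 0, level 1 tests feature 2.
levels = [(0, 0.5), (2, 10.0)]
leaf_values = [0.1, 0.4, -0.3, 0.9]  # 2**depth leaves, all at the same level


def predict(x: Sequence[float]) -> float:
    return leaf_values[oblivious_leaf_index(x, levels)]
```

A tree of depth d built this way always has 2^d leaves, which is why evaluation reduces to a table lookup.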

 Jun 2019

arxiv.org

features (sparse)
are these feature values or actual features?

Note that the scalar multiple does not mean x_k is linear with x_0
x_k is not a linear function of x_0

We argue that the CrossNet learns a special type of high-order feature interactions, where each hidden layer in the CrossNet is a scalar multiple of x_0
In that case CrossNet doesn't really learn anything?
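
A small sketch to check both notes numerically. It assumes the usual cross-layer update x_{k+1} = x_0 (x_k^T w_k) + b_k + x_k with all biases set to zero; the update rule, the function names, and the weights below are assumptions, not taken from the quoted text. Each layer output comes out as a scalar times x_0, but the scalar itself depends on x_0, so the map from x_0 to x_k is not linear.

```python
from typing import List, Optional


def cross_layer(x0: List[float], xk: List[float],
                w: List[float], b: List[float]) -> List[float]:
    """One cross layer: x_{k+1} = x0 * (xk . w) + b + xk."""
    s = sum(xi * wi for xi, wi in zip(xk, w))  # scalar x_k^T w
    return [x0i * s + bi + xki for x0i, bi, xki in zip(x0, b, xk)]


def cross_net(x0: List[float], weights: List[List[float]],
              biases: Optional[List[List[float]]] = None) -> List[float]:
    """Stack cross layers; biases default to zero (the assumed setting)."""
    if biases is None:
        biases = [[0.0] * len(x0) for _ in weights]
    xk = list(x0)
    for w, b in zip(weights, biases):
        xk = cross_layer(x0, xk, w, b)
    return xk


# Illustrative weights for a two-layer network on 3 features.
weights = [[1.0, 0.5, 0.5], [0.25, 0.125, 0.0]]
x0 = [1.0, 2.0, -1.0]

out = cross_net(x0, weights)
ratios = [o / v for o, v in zip(out, x0)]  # identical entries: scalar multiple of x0

# Doubling the input does not double the output, so the map is not linear.
out_doubled = cross_net([2.0 * v for v in x0], weights)
```

So the network does learn something (the weights set the scalar), but the output direction is pinned to x_0 when the biases are zero.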

multivalent,
takes on more than one value

univalent,
takes on a unique value

 Mar 2019


heap.sort() maintains the heap invariant
Sorting may rearrange the nodes, but a list sorted in ascending order always satisfies the min-heap invariant.
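
A quick check of this with the standard `heapq` module. The helper `is_min_heap` is written here for illustration and is not part of `heapq`.

```python
import heapq


def is_min_heap(items):
    """True if every element is >= its parent at index (i - 1) // 2."""
    return all(items[(i - 1) // 2] <= items[i] for i in range(1, len(items)))


heap = [5, 1, 4, 1, 3, 9, 2]
heapq.heapify(heap)
assert is_min_heap(heap)

heap.sort()                # in-place ascending sort of the underlying list
assert is_min_heap(heap)   # parent index < child index, so parent <= child

# The sorted list can keep being used as a heap.
heapq.heappush(heap, 0)
smallest = heapq.heappop(heap)
```

The converse does not hold: a valid heap is generally not a sorted list.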



The most common error I see is a subconscious assumption that each word can have at most one synonym
Use sets as the value.
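
A minimal sketch of the set-valued mapping, assuming synonyms arrive as word pairs and that the relation is symmetric but not transitive (both are assumptions; the quoted text does not specify the input format).

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_synonyms(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each word to the set of all its direct synonyms."""
    synonyms: Dict[str, Set[str]] = defaultdict(set)
    for a, b in pairs:
        synonyms[a].add(b)
        synonyms[b].add(a)
    return synonyms


def are_synonyms(synonyms: Dict[str, Set[str]], a: str, b: str) -> bool:
    # .get avoids inserting new keys into the defaultdict on lookup.
    return a == b or b in synonyms.get(a, set())


table = build_synonyms([("big", "large"), ("big", "huge"), ("quick", "fast")])
```

With a plain `dict[str, str]`, the second pair for "big" would silently overwrite the first.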

 Oct 2018

en.wikipedia.org

The perplexity of the model q is defined as \(b^{-\frac{1}{N}\sum_{i=1}^{N}\log_b q(x_i)}\)
The perplexity formula is missing the probability distribution \(p\)
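
A small sketch of the per-sample form, assuming the model's probabilities q(x_i) for the N test samples are already available as a list. The true distribution p enters only implicitly: averaging over samples drawn from p is the same as taking the expectation under the empirical distribution of the test set.

```python
import math
from typing import Sequence


def perplexity(model_probs: Sequence[float], base: float = 2.0) -> float:
    """Perplexity b ** (-(1/N) * sum_i log_b q(x_i)) over test samples.

    `model_probs[i]` is q(x_i), the probability the model assigns to the
    i-th test sample; each must be strictly positive.
    """
    n = len(model_probs)
    avg_log = sum(math.log(q, base) for q in model_probs) / n
    return base ** (-avg_log)


# A model that spreads its mass uniformly over 4 outcomes is as
# "perplexed" as a fair 4-sided die.
uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])
```

The result does not depend on the base b, as long as the same base is used for the logarithm and the exponentiation.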


en.wikipedia.org

It has been demonstrated that this formulation is almost equivalent to a SLIM model,[9] which is an item-item model-based recommender
So a pretrained item model can be used to make such recommendations.

The user's latent factors represent the preference of that user for the corresponding item's latent factors
The higher the value of the dot product between the two, the higher the preference.

two lower dimensional matrices
Not necessarily (in fact, often not) square. Typically each user is represented by a vector of dimension strictly less than the number of items and vice versa.
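
A toy sketch of the shapes involved, with made-up numbers: 3 users, 4 items, and k = 2 latent factors, so the two factor matrices are 3x2 and 4x2 and neither is square. The predicted preference is the dot product of a user's row and an item's row.

```python
from typing import List

# Hypothetical latent factors; in practice these are learned from ratings.
user_factors: List[List[float]] = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 2.0],
]  # shape: (n_users=3, k=2)
item_factors: List[List[float]] = [
    [2.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.5, 0.25],
]  # shape: (n_items=4, k=2)


def predicted_preference(user: int, item: int) -> float:
    """Dot product of the user's and the item's latent factor vectors."""
    return sum(u * v for u, v in zip(user_factors[user], item_factors[item]))


# Reconstructed (n_users x n_items) preference matrix, rank at most k.
predicted = [
    [predicted_preference(u, i) for i in range(len(item_factors))]
    for u in range(len(user_factors))
]
```

Storing the factors takes (3 + 4) * 2 numbers instead of 3 * 4; the saving grows quickly with the number of users and items.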


karpathy.github.io

are are
*are


redis.io

it will not try to start a failover if the master link was disconnected for more than the specified amount of time
Why would it exhibit this behavior? Is it because a slave that has been disconnected from the master for too long has stale data? Or is it because the slave may be failing as well?


mp.weixin.qq.com 机器之心

会话
Session

Python

 Sep 2018


conditional distribution for individual components can be constructed
So the conditional distribution is conditioned on other components?

p(y∣x)=∫p(y∣f,x)p(f∣x)df
\(y\) is the data, \(f\) is the model, \(x\) is the input variable


am207.github.io

marginaly
*marginal

corvariance
*covariance

y=y1,…,yn=m
\(n = m\)


am207.github.io

must store an amount of information which increases with the size of the data
Or you can use MCMC.

some
*sum

calculation once again involves inverting a NxN matrix as in the kernel space representation of regression
This is why we use MCMC or another sampling technique instead.

$f(x_)$ for a test vector input $x_$, given a training set X with values y for the GP is once again a gaussian given by equation C with a mean vector $m_$ and covariance matrix $k_$:
...$f(x_*)$ for a test vector input $x_*$, given a training set $X$ with values $y$ for the GP is once again a gaussian given by equation C with a mean vector $m_*$ and covariance matrix $k_*$:

corvariance
*covariance

in equation B for the marginal of a gaussian, only the covariance of the block of the matrix involving the unmarginalized dimensions matters! Thus “if you ask only for the properties of the function (you are fitting to the data) at a finite number of points, then inference in the Gaussian process will give you the same answer if you ignore the infinitely many other points, as if you would have taken them all into account!” (Rasmussen)
key insight into Gaussian processes

they
*the

im
*in

Notice now that the features only appear in the combination \(\kappa(x,x') = x^T \Sigma x'\), thus leading to writing the posterior predictive as \(p(f(x_*) \mid x_*, X, y) = N\left(\kappa(x_*,X) \left(\kappa(X^T,X) + \sigma^2 I\right)^{-1} y,\ \kappa(x_*,x_*) - \kappa(x_*,X^T) \left(\kappa(X^T,X) + \sigma^2 I\right)^{-1} \kappa(X^T,x_*)\right)\). The function \(\kappa\) is called the kernel
How did the kernel come about?
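
A rough sketch of the posterior predictive quoted above, in plain Python so it runs without dependencies. It assumes Σ = `prior_var` · I (so the kernel is a scaled dot product), a hand-written Gaussian-elimination solver in place of an explicit matrix inverse, and made-up training data; none of these names come from the quoted notes. The point is that the features enter only through kernel evaluations, so `linear_kernel` could be swapped for any other valid kernel without touching `gp_posterior`.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def linear_kernel(a: Vector, b: Vector, prior_var: float = 1.0) -> float:
    """kappa(x, x') = x^T Sigma x' with Sigma = prior_var * I (assumed)."""
    return prior_var * sum(ai * bi for ai, bi in zip(a, b))


def solve(A: List[Vector], b: Vector) -> Vector:
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def gp_posterior(x_star: Vector, X: List[Vector], y: Vector,
                 kernel: Callable[[Vector, Vector], float],
                 noise_var: float) -> Tuple[float, float]:
    """Mean and variance of f(x_star) given training inputs X and targets y."""
    n = len(X)
    # K + sigma^2 I
    K = [[kernel(X[i], X[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [kernel(x_star, xi) for xi in X]
    alpha = solve(K, y)       # (K + sigma^2 I)^{-1} y
    v = solve(K, k_star)      # (K + sigma^2 I)^{-1} kappa(X, x_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_star, x_star) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var


# Made-up 1-D training set.
mean, var = gp_posterior([3.0], [[1.0], [2.0]], [1.0, 2.0],
                         linear_kernel, noise_var=1.0)
```

With the linear kernel this reproduces Bayesian linear regression on the raw features, which is one way to see where the kernel comes from.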


am207.github.io

generate
more like "sample"

f∗
\(f^*\) denotes the model



Bayesian approach to handling observation noise
One core contribution of this work.


mp.weixin.qq.com 腾讯AI实验室

通道剪枝算法
channel pruning algorithm

 Aug 2018

en.wikipedia.org

expected values change during the series
So no longer identically distributed


www.fast.ai

To learn a network for Cifar10, DARTS takes just 4 GPU days, compared to 1800 GPU days for NASNet and 3150 GPU days for AmoebaNet
What about in comparison to ENAS?

 Jul 2018

spark.apache.org

partitioner
How to define a partitioner?


mp.weixin.qq.com SigAI

Minimax (Maxmin) game
Choose D to best discriminate real data from samples generated by G; choose G to best "confuse" D.

交叉熵
Cross entropy

零和博弈
Zero-sum game

 Jun 2018

people.cs.bris.ac.uk

isometrics in ROC space
What does this mean exactly?


Local file

you will have to do a lot of work to select appropriate input data and to code the data as numeric values
Not really anymore with the advent of convolutional neural networks.


ai.intel.com

next state s’
Is the next state s' the state reached by taking the action with the highest reward?

 May 2018

blog.cloudera.com

The number of tasks is the single most important parameter.
