- Nov 2021
-
en.wikipedia.org
-
squared absolute error in \(f(x)\) in typical cases
Because it's unimodal? So a quadratic approximation works well in typical cases?
-
- Sep 2021
-
www.csie.ntu.edu.tw
-
In total, this means that the dot product (i.e. the interaction) of the factor vectors of Alice and Star Trek will be similar to the one of Alice and Star Wars – which also makes intuitively sense
This illustrates the power of collaborative filtering.
-
- Jul 2021
-
dl.acm.org
-
a
$$u$$
-
Targeting user return as the objective does not significantly affect user engagement (e.g., the actual future user return and churning risk) and shows a trend of hurting perception metrics compared with the baseline
Very interesting and somewhat counter-intuitive. Increased return should suggest more user engagement, no?
-
user return
"sessions"
-
- Jun 2021
-
arxiv.org
-
Gramian trick
aka Kernel trick
-
\(m\,|I_c|\,q(j\mid c)\,\tilde{\alpha}(c,j)\)
Why multiply by the sample size and \(I_c\)?
m samples in algorithm 1, but why \(I_c\)...
-
samples \(m\) negative
Sample multiple negatives in order not to underrepresent the negative examples?
-
For example, if the batch contains positive observations \((c_1,i_1),(c_2,i_2),\ldots\), then \(\{i_1,i_2,\ldots\}\) are treated as the negatives for this batch, e.g., for \(c_1\), the negatives would be \((c_1,i_2),(c_1,i_3)\)
Interesting, how can you be sure that i2 is not relevant for c1?
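A minimal numpy sketch of this in-batch (shared) negative scheme, using the embedding matrices \(W\) (contexts) and \(H\) (items) from the paper; the shapes, data, and variable names here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
C, I, d = 100, 500, 16          # number of contexts, items, embedding dimension
W = rng.normal(size=(C, d))     # context embeddings (model parameter)
H = rng.normal(size=(I, d))     # item embeddings (model parameter)

# A batch of positive observations (c_1, i_1), (c_2, i_2), ...
batch_c = np.array([3, 7, 42])
batch_i = np.array([10, 55, 210])

# Score every context in the batch against every positive item in the batch.
scores = W[batch_c] @ H[batch_i].T          # shape (B, B)

# Diagonal entries are the positives (c_k, i_k); off-diagonal entries reuse the
# other contexts' positive items as this context's negatives, e.g. for c_1 the
# negatives are (c_1, i_2), (c_1, i_3), ...
labels = np.arange(len(batch_c))
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
loss = -log_probs[labels, labels].mean()    # in-batch softmax loss
print(loss)
```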
-
a ground truth set of relevant items \(\{i_1,i_2,\ldots\}\)
So we do not consider the order of the items.
-
weak negatives are derived from all the remaining items
Remaining items that the user had a chance to select.
-
These embedding matrices, \(W\in\mathbb{R}^{C\times d}\) and \(H\in\mathbb{R}^{I\times d}\), are the model parameters \(\theta\)
In particular, the number of items and the context dimensions are fixed per model. This makes it difficult to apply the model to new items or in new contexts (e.g. recommending to new users).
-
-
papers.nips.cc
-
This modeling choice is consistent with the idea of ranking the documents with largest scores first; intuitively, the more documents in a permutation are in the decreasing order of score, the bigger the probability of the permutation is.
In particular, scores themselves matter (not just the ranking).
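My reading of this, as a sketch: a Plackett-Luce-style permutation probability built from the scores, where permutations closer to decreasing-score order get higher probability (function and variable names are mine):

```python
import numpy as np

def permutation_prob(scores, perm):
    """Plackett-Luce probability of a permutation given per-document scores."""
    phi = np.exp(np.asarray(scores, dtype=float))  # positive "attractiveness" per document
    remaining = list(perm)
    p = 1.0
    for doc in perm:
        p *= phi[doc] / phi[remaining].sum()   # pick doc among the documents still unranked
        remaining.remove(doc)
    return p

scores = [3.0, 2.0, 1.0]
print(permutation_prob(scores, [0, 1, 2]))  # decreasing-score order: largest probability
print(permutation_prob(scores, [2, 1, 0]))  # increasing-score order: smallest probability
```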
-
Proof.
Jensen's Inequality
Also \(\langle \bullet \rangle_F\) is the expectation with respect to \(\mathrm{Pr}(\pi^k | F, q^k)\).
-
\(Z_k\) is the normalization factor
Is this the "ideal DCG" (https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG), i.e. the DCG of the list ranked by relevance score, truncated at a position \(p\) (a parameter)?
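A small sketch of how I understand \(Z_k\): the ideal DCG, i.e. the DCG of the same relevances sorted in decreasing order, so that NDCG lies in \([0, 1]\) (helper names and the \(2^{rel}-1\) gain are just one common choice):

```python
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum((2 ** relevances - 1) / np.log2(ranks + 1))

def ndcg(relevances_in_predicted_order):
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    z = dcg(ideal)                     # the normalization factor ("ideal DCG")
    return dcg(relevances_in_predicted_order) / z if z > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1]))   # < 1.0: not in ideal order
print(ndcg([3, 3, 2, 1, 0]))   # 1.0: already in ideal order
```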
-
\(F(d, q)\) the ranking function that takes a document-query pair \((d, q)\)
So \(F\) scores just one document \(d\) at a time? I.e. a univariate scoring function.
-
The listwise approaches can be classified into two categories
Two categories of listwise approaches:
- Those that directly optimize the IR evaluation metrics
- Those that define a listwise loss function as an indirect way to optimize the IR evaluation metrics.
-
main difficulty in optimizing these evaluation metrics is that they are dependent on the rank position of documents induced by the ranking function, not the numerical values output by the ranking function.
The point being that the order of the outputs is not what's learned directly.
-
We propose a probabilistic framework that addresses this challenge by optimizing the expectation of NDCG over all the possible permutations of documents
Does the model need to output the probability used to compute the NDCG?
-
-
cacm.acm.org
-
learning to learn, or meta-learning
Interestingly two of the three papers referenced date back to the previous millennium.
-
assign an energy (that is, a badness)
The term "energy" is often used in early deep learning (e.g. "energy based models"), apparently borrowed from statistical physics.
-
Its meaning resides in its relationships to other symbols which can be represented by a set of symbolic expressions or by a relational graph
Something akin to expert systems or Bayesian networks.
-
- Jan 2021
-
netflixtechblog.com
-
how artwork performs in relation to other artwork we select in the same page or session
This is the slate optimization problem.
-
when presenting a specific piece of artwork for a title influenced a member to play (or not to play) a title and when a member would have played a title (or not) regardless of which image we presented
So to establish some causality between the thumbnail shown and the user's viewing of the show.
-
- Nov 2020
-
iwww.corp.linkedin.com
-
even two members characterized by the similar sets of features described above may still have different preferences over jobs, due to the fact that some intrinsic difference between those two members may not be well captured by the features
So this raises the question of how GLMix can capture this difference if the data cannot already capture it.
-
such individual-specific effects among different populations (e.g., members, jobs) are ignored in GLM
Because they are drowned out by the broader pattern?
-
- Apr 2020
-
ai.googleblog.com
-
based on logged experiences of a DQN agent
So a different DQN agent is used to generate the logging data?
-
- Aug 2019
-
en.wikipedia.org
-
As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families, namely the global matrix factorization and local context window methods
What does "log-bilinear regression" mean exactly?
-
- Jul 2019
-
en.wikipedia.org
-
An Oblivious Tree is a rooted tree with the following property: All the leaves are in the same level. All the internal nodes have degree at most 3. Only the nodes along the rightmost path in the tree may have degree of one.
Note this is not the definition of the oblivious decision trees in the CatBoost paper.
There, an oblivious decision tree means a tree where the feature used for splitting is the same across all intermediate nodes within the same level of the tree, and the leaves are all at the same level.
See: https://stats.stackexchange.com/questions/353172/what-is-oblivious-decision-tree-and-why
-
- Jun 2019
-
arxiv.org
-
features (sparse)
are these feature values or actual features?
-
Note that the scalar multiple does not mean \(x_k\) is linear with \(x_0\)
x_k is not a linear function of x_0
-
We argue that the CrossNet learns a special type of high-order feature interactions, where each hidden layer in the CrossNet is a scalar multiple of \(x_0\)
In that case CrossNet doesn't really learn anything?
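A quick numpy check of the scalar-multiple claim, ignoring the bias terms and using random weights (a sketch of my understanding, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 8, 4
x0 = rng.normal(size=d)
ws = [rng.normal(size=d) for _ in range(L)]

x = x0.copy()
for w in ws:
    # cross layer with bias omitted: x_{l+1} = x0 * (x_l . w_l) + x_l
    x = x0 * (x @ w) + x

# x is a scalar multiple of x0; the scalar itself depends on x0,
# so the map x0 -> x_k is still not linear in x0.
alpha = x / x0
print(np.allclose(alpha, alpha[0]))   # True
```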
-
multivalent,
takes on more than one value
-
univalent,
takes on a unique value
-
- Mar 2019
-
-
heap.sort() maintains the heap invariant
Sorting may swap the positions of nodes at the same height, but the sorted array is still a valid min-heap.
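Since Python heaps are plain lists, a quick check that a sorted list satisfies the heap invariant:

```python
import heapq
import random

data = [random.randint(0, 100) for _ in range(20)]

heap = list(data)
heapq.heapify(heap)      # heap order, not fully sorted
heap.sort()              # a sorted list is still a valid min-heap

# Verify the invariant heap[k] <= heap[2k+1] and heap[k] <= heap[2k+2]
ok = all(heap[k] <= heap[c]
         for k in range(len(heap))
         for c in (2 * k + 1, 2 * k + 2)
         if c < len(heap))
print(ok)                                # True
print(heapq.heappop(heap) == min(data))  # still usable as a heap
```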
-
-
-
The most common error I see is a subconscious assumption that each word can have at most one synonym
Use sets as the value.
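A minimal sketch of the sets-as-values idea for a synonym map (the data is made up):

```python
from collections import defaultdict

# Map each word to the *set* of its synonyms, never to a single synonym.
synonyms = defaultdict(set)

def add_synonym(a, b):
    synonyms[a].add(b)
    synonyms[b].add(a)   # keep the relation symmetric

add_synonym("big", "large")
add_synonym("big", "huge")

print(synonyms["big"])   # {'large', 'huge'} -- more than one synonym per word
```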
-
- Oct 2018
-
en.wikipedia.org
-
The perplexity of the model \(q\) is defined as \(b^{-\frac{1}{N}\sum_{i=1}^{N}\log_b q(x_i)}\)
The perplexity formula is missing the probability distribution \(p\)
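A tiny numerical check of the formula as written, with the model \(q\) evaluated on a held-out sample \(x_1,\ldots,x_N\) (made-up numbers, helper name is mine):

```python
import numpy as np

def perplexity(q_probs, base=2.0):
    """b ** (-(1/N) * sum_i log_b q(x_i)) for a held-out sample x_1..x_N."""
    q_probs = np.asarray(q_probs, dtype=float)
    n = len(q_probs)
    return base ** (-np.log(q_probs).sum() / (n * np.log(base)))

# A model that assigns probability 1/4 to every observed token
print(perplexity([0.25] * 10))   # 4.0: as "confused" as choosing among 4 options
```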
-
-
en.wikipedia.org
-
It has been demonstrated that this formulation is almost equivalent to a SLIM model,[9] which is an item-item model based recommender
So a pre-trained item model can be used to make such recommendations.
-
The user's latent factors represent the preference of that user for the corresponding item's latent factors
The higher the value of the dot product between the two, the higher the preference.
-
two lower dimensional matrices
Not necessarily (in fact, often not) square. Typically each user is represented by a vector of dimension strictly less than the number of items, and vice versa.
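A minimal sketch of the two low-rank factor matrices and the dot-product preference; the shapes are made up, with \(d\) much smaller than the number of users or items:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 1000, 5000, 32    # d << n_users, n_items

U = rng.normal(size=(n_users, d))       # user latent factors
V = rng.normal(size=(n_items, d))       # item latent factors

# Predicted preference of user u for item i: the higher the dot product,
# the higher the predicted preference.
u, i = 42, 4096
print(U[u] @ V[i])

# The full (approximate) rating matrix is the product of the two factor
# matrices: R_hat = U @ V.T, shape (n_users, n_items), rank at most d.
```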
-
-
karpathy.github.io
-
are are
*are
-
-
redis.io
-
it will not try to start a failover if the master link was disconnected for more than the specified amount of time
Why would it exhibit this behavior? Is it because a slave that's disconnected from the master for too long has stale data? Or is it because the slave may be failing as well?
-
-
mp.weixin.qq.com (机器之心)
-
会话
Session
-
Python
-
- Sep 2018
-
-
conditional distribution for individual components can be constructed
So the conditional distribution is conditioned on other components?
-
\(p(y\mid x)=\int p(y\mid f,x)\,p(f\mid x)\,df\)
\(y\) is the data, \(f\) is the model, \(x\) is the input variable
-
-
am207.github.io
-
marginaly
*marginal
-
corvariance
*covariance
-
\(y = y_1,\ldots,y_n = m\)
\(n = m\)
-
-
am207.github.io
-
must store an amount of information which increases with the size of the data
Or you can use MCMC.
-
some
*sum
-
calculation once again involves inverting a NxN matrix as in the kernel space representation of regression
This is why we use MCMC or another sampling technique instead.
-
\(f(x_*)\) for a test vector input \(x_*\), given a training set \(X\) with values \(y\), for the GP is once again a gaussian given by equation C with a mean vector \(m_*\) and covariance matrix \(k_*\):
...$f(x)$ for a test vector input $x$, given a training set $X$ with values $y$ for the GP is once again a gaussian given by equation C with a mean vector $m$ and covariance matrix $k$:
-
corvariance
*covariance
-
in equation B for the marginal of a gaussian, only the covariance of the block of the matrix involving the unmarginalized dimensions matters! Thus “if you ask only for the properties of the function (you are fitting to the data) at a finite number of points, then inference in the Gaussian process will give you the same answer if you ignore the infinitely many other points, as if you would have taken them all into account!”(Rasmunnsen)
key insight into Gaussian processes
-
they
*the
-
im
*in
-
Notice now that the features only appear in the combination \(\kappa(x,x') = x^T \Sigma x'\), thus leading to writing the posterior predictive as \(p(f(x_*) \mid x_*, X, y) = N\!\left(\kappa(x_*,X)\left(\kappa(X^T,X) + \sigma^2 I\right)^{-1}y,\;\; \kappa(x_*,x_*) - \kappa(x_*,X^T)\left(\kappa(X^T,X) + \sigma^2 I\right)^{-1}\kappa(X^T,x_*)\right)\). The function \(\kappa\) is called the kernel
how the kernel came about?
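A minimal numpy sketch of this posterior predictive with the linear kernel \(\kappa(x,x') = x^T\Sigma x'\), taking \(\Sigma = I\) and using a plain matrix inverse for clarity (note the \(N\times N\) inversion mentioned above; data and names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(A, B, sigma_prior=1.0):
    # linear kernel kappa(x, x') = x^T Sigma x', here with Sigma = sigma_prior * I
    return sigma_prior * (A @ B.T)

# training data (1-D inputs kept as a feature column)
X = rng.uniform(-3, 3, size=(20, 1))
y = 0.7 * X[:, 0] + 0.1 * rng.normal(size=20)
X_star = np.array([[0.5], [2.0]])           # test inputs
noise = 0.1 ** 2                            # observation noise sigma^2

K = kernel(X, X)                            # N x N -- the matrix we invert
K_star = kernel(X_star, X)                  # test-vs-train covariances
K_ss = kernel(X_star, X_star)

A = np.linalg.inv(K + noise * np.eye(len(X)))   # (kappa(X,X) + sigma^2 I)^{-1}
mean = K_star @ A @ y                            # posterior predictive mean
cov = K_ss - K_star @ A @ K_star.T               # posterior predictive covariance
print(mean, np.diag(cov))
```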
-
-
am207.github.io
-
generate
more like "sample"
-
f∗
\(f^*\) denotes the model
-
-
-
Bayesian approach to handling observation noise
One core contribution of this work.
-
-
mp.weixin.qq.com (腾讯AI实验室)
-
通道剪枝算法
channel pruning algorithm
-
- Aug 2018
-
en.wikipedia.org
-
expected values change during the series
So no longer identically distributed
-
-
www.fast.ai
-
To learn a network for Cifar-10, DARTS takes just 4 GPU days, compared to 1800 GPU days for NASNet and 3150 GPU days for AmoebaNet
What about in comparison to ENAS?
-
- Jul 2018
-
spark.apache.org
-
partitioner
How to define a partitioner?
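In PySpark a partitioner for a pair RDD is just a function from key to int passed to partitionBy (in Scala you would subclass Partitioner instead); a minimal sketch with made-up data:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

pairs = sc.parallelize([("us", 1), ("fr", 2), ("us", 3), ("de", 4)])

# A "partitioner" here is a function mapping a key to an int; records with the
# same key always land in the same partition.
def country_partitioner(key):
    # e.g. keep all "us" records in one partition, spread the rest by hash
    return 0 if key == "us" else (hash(key) % 3) + 1

partitioned = pairs.partitionBy(4, country_partitioner)
print(partitioned.glom().map(len).collect())   # records per partition
```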
-
-
mp.weixin.qq.com (SigAI)
-
极大极小(Max-min)博弈
Choose D to best discriminate real data from G's generations; choose G to best "confuse" D.
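For reference, the standard GAN minimax objective this refers to (Goodfellow et al.):

$$\min_G \max_D \; V(D, G) = \mathbb{E}_{x\sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z\sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$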
-
交叉熵
Cross entropy
-
零和博弈
Zero-sum game
-
- Jun 2018
-
people.cs.bris.ac.uk
-
isometrics in ROC space
What does this mean exactly?
-
-
Local file
-
you will have to do a lot of work to select appropriate input data and to code the data as numeric values
Not really anymore with the advent of convolutional neural networks.
-
-
ai.intel.com
-
next state s’
Is the next state s' the state reached by taking the action with the highest reward?
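For context, the standard Q-learning update: \(s'\) is the state actually reached after taking action \(a\) in \(s\), and the max is over the actions available in \(s'\) (not necessarily the action with the highest immediate reward):

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big]$$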
-
- May 2018
-
blog.cloudera.com
-
The number of tasks is the single most important parameter.
-