7 Matching Annotations
  1. Jun 2017
  2. arxiv.org
    1. The analysis shows that, although they are superficially similar, NCE is a general parameter estimation technique that is asymptotically unbiased, while negative sampling is best understood as a family of binary classification models that are useful for learning word representations but not as a general-purpose estimator

      I think NCE is slightly different from CE. Unfortunately, Chris sort of ignores Noah's work on CE in this explanation. That said, the connection between NCE and NS is nicely explained.
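      To keep the distinction straight for myself (notation is mine, not copied from the note): NCE keeps the noise-correction term, negative sampling drops it.

      ```latex
      % Sketch in my own notation: s_\theta(w,c) is the model score, q a noise
      % distribution, k the number of noise samples per observed pair.
      % NCE posterior that a (context, word) pair came from the data:
      \[
        p(D = 1 \mid c, w) \;=\; \frac{p_\theta(w \mid c)}{p_\theta(w \mid c) + k\, q(w)}
      \]
      % Negative sampling replaces the k q(w) term with 1 and classifies directly:
      \[
        p(D = 1 \mid c, w) \;=\; \sigma\bigl(s_\theta(w, c)\bigr)
        \;=\; \frac{1}{1 + \exp(-s_\theta(w, c))}
      \]
      % The two coincide only when k q(w) = 1, which is why NS is useful for
      % learning representations but not a general-purpose estimator of p_\theta(w | c).
      ```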

    1. We present an extension to Jaynes’ maximum entropy principle that handles latent variables. The principle of latent maximum entropy we propose is different from both Jaynes’ maximum entropy principle and maximum likelihood estimation, but often yields better estimates in the presence of hidden variables and limited training data. We first show that solving for a latent maximum entropy model poses a hard nonlinear constrained optimization problem in general. However, we then show that feasible solutions to this problem can be obtained efficiently for the special case of log-linear models—which forms the basis for an efficient approximation to the latent maximum entropy principle. We derive an algorithm that combines expectation-maximization with iterative scaling to produce feasible log-linear solutions. This algorithm can be interpreted as an alternating minimization algorithm in the information divergence, and reveals an intimate connection between the latent maximum entropy and maximum likelihood principles. To select a final model, we generate a series of feasible candidates, calculate the entropy of each, and choose the model that attains the highest entropy. Our experimental results show that estimation based on the latent maximum entropy principle generally gives better results than maximum likelihood when estimating latent variable models on small observed data samples.

      Towards intelligent negative sampling
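      My rough reading of the principle (notation mine): with observed x, latent y, empirical distribution \tilde{p}(x), and features f_i, latent maximum entropy asks for the maximum-entropy joint model whose feature expectations match the data as completed by the model's own posterior over the latents.

      ```latex
      % Rough formalization, notation mine (not copied from the paper):
      \[
        \max_{p}\; H(p)
        \quad \text{s.t.} \quad
        \sum_{x, y} p(x, y)\, f_i(x, y)
        \;=\;
        \sum_{x} \tilde{p}(x) \sum_{y} p(y \mid x)\, f_i(x, y),
        \qquad i = 1, \dots, N .
      \]
      % Unlike ordinary maximum entropy, the right-hand side depends on p itself
      % through p(y | x), which is what makes the problem nonlinear and motivates
      % the EM-plus-iterative-scaling algorithm described in the abstract above.
      ```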

    1. Wang et al. (2002) discuss the latent maximum entropy principle. They advocate running EM many times and selecting the local maximum that maximizes entropy. One might do the same for the local maxima of any CE objective, though theoretical and experimental support for this idea remain for future work.

      An interesting proposal, quite similar to negative sampling with 'exploration / exploitation'.

      Definitely worth at least a couple of reads!
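      A toy sketch of the "run EM many times, keep the highest-entropy local maximum" recipe (my own example with a Gaussian mixture and a Monte Carlo entropy estimate, not code from the paper):

      ```python
      # Toy illustration (mine, not from the paper): fit a mixture with EM from
      # several random initializations and keep the local maximum whose fitted
      # model has the highest (Monte Carlo estimated) entropy.
      import numpy as np
      from sklearn.mixture import GaussianMixture

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

      best_model, best_entropy = None, -np.inf
      for seed in range(10):                           # multiple EM restarts
          gm = GaussianMixture(n_components=2, random_state=seed).fit(X)
          samples, _ = gm.sample(5000)                 # draw from the fitted model
          entropy = -gm.score_samples(samples).mean()  # H(p) ~ -E_p[log p(x)]
          if entropy > best_entropy:
              best_model, best_entropy = gm, entropy
      ```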

    2. One can envision a mixed objective function that tries to fit the labeled examples while discriminating unlabeled examples from their neighborhoods.

      Interesting - a mixed objective function -> this seems like a multi-task framework!

      --> Re-read and understand
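      One way I could imagine writing such a mixed objective down (my guess at the form, not an equation from the paper): supervised log-likelihood on labeled pairs plus a CE term on unlabeled examples, weighted by some \lambda.

      ```latex
      % My guess at the form (not from the paper): (x_i, y_i) are labeled pairs,
      % x_j unlabeled examples, N(x_j) a neighborhood of implied bad examples.
      \[
        \mathcal{L}(\theta)
        \;=\;
        \sum_{(x_i, y_i) \in \text{labeled}} \log p_\theta(y_i \mid x_i)
        \;+\;
        \lambda \sum_{x_j \in \text{unlabeled}}
          \log \frac{\sum_{y} p_\theta(x_j, y)}
                    {\sum_{x' \in \mathcal{N}(x_j)} \sum_{y} p_\theta(x', y)} .
      \]
      % The second term is the contrastive estimation objective; fitting both at
      % once is what makes this read like a multi-task setup.
      ```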

    3. We have presented contrastive estimation, a new probabilistic estimation criterion that forces a model to explain why the given training data were better than bad data implied by the positive examples.

      This is again an interesting way to see it: "... forces a model to explain why the given training data were better than bad data implied by the positive examples."
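      For reference, the CE objective roughly as the paper formulates it (latent structure y, neighborhood N(x_i) of "bad" variants implied by example x_i):

      ```latex
      % CE: each observed example must beat its own neighborhood, not the whole
      % space of possible examples.
      \[
        \max_\theta \; \sum_i \log p_\theta\bigl(x_i \mid \mathcal{N}(x_i)\bigr)
        \;=\;
        \max_\theta \; \sum_i \log
          \frac{\sum_{y} p_\theta(x_i, y)}
               {\sum_{x' \in \mathcal{N}(x_i)} \sum_{y} p_\theta(x', y)} .
      \]
      ```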

    4. Viewed as a CE method, this approach (though effective when there are few hypotheses) seems misguided; the objective says to move mass to each example at the expense of all other training examples

      A very cool remark, and it makes sense!!

    5. An alternative is to restrict the neighborhood to the set of observed training examples rather than all possible examples (Riezler, 1999; Johnson et al., 1999; Riezler et al., 2000):

      This equation is reminiscent of the one proposed by Nickel et al. (2017), the Poincaré Embeddings paper. In particular, look at their negative sampling formulation.
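      As I read it, this restriction just swaps the neighborhood for the observed training set itself, so the normalizer runs over training examples only (my notation):

      ```latex
      % My notation: the neighborhood of x_i is the training set {x_1, ..., x_n}
      % rather than perturbations of x_i.
      \[
        \sum_i \log
          \frac{\sum_{y} p_\theta(x_i, y)}
               {\sum_{j=1}^{n} \sum_{y} p_\theta(x_j, y)}
      \]
      % Contrasting each example against (a sample of) the other observed examples
      % is where I see the resemblance to the negative sampling used for the
      % Poincaré embeddings objective in Nickel et al. (2017).
      ```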