57 Matching Annotations
  1. Oct 2024
    1. The novelty here over such past works is a theoretical analysis in the method-of-moments tradition

      The method of moments models a random variable via its moments (mean, variance, etc.). In practice we can say very little when the inputs are high-dimensional, since we will see no samples at all for almost every combination of values.

    2. Pr[w emitted at time t | c_t] ∝ exp(⟨c_t, v_w⟩).

      The chance that a given word is emitted is proportional to the exponential of the inner product of its latent vector with the current context (discourse) vector. The proportionality sign indicates that the normalization term is omitted; it is the sum of such exponentials over all possible words.
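
      As a concrete illustration (my own toy sketch, not from the paper; vocabulary size, dimension and all values are made up), the emission probability including its normalization term:

      ```python
      import numpy as np

      # Toy log-linear emission probability: Pr[w | c_t] ∝ exp(<c_t, v_w>).
      # Sizes and values are arbitrary; this only illustrates the normalization.
      rng = np.random.default_rng(0)
      V, d = 5, 3
      word_vectors = rng.normal(size=(V, d))   # latent vectors v_w
      c_t = rng.normal(size=d)                 # current discourse vector

      logits = word_vectors @ c_t              # inner products <c_t, v_w>
      Z = np.exp(logits).sum()                 # partition function (normalizer)
      probs = np.exp(logits) / Z
      print(probs, probs.sum())                # probabilities sum to 1
      ```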

    3. a new generative model, a dynamic version of the log-linear topic model

      But what does this model generate?

    4. However, skip-gram is a discriminative model (due to the use of negative sampling

      this is a great insight.

    5. The old PMI method is a bit mysterious.

      PMI was proposed as an information-theoretic foundation for explaining the success of TF-IDF with one-hot encoded word representations. While how the latter works has been called mysterious, PMI has clear probabilistic underpinnings.
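
      For reference, a minimal sketch of how PMI is computed from a co-occurrence matrix (toy counts and variable names of my own choosing):

      ```python
      import numpy as np

      # PMI(w, w') = log p(w, w') / (p(w) p(w')), on toy co-occurrence counts.
      X = np.array([[10., 2., 0.],
                    [ 2., 8., 3.],
                    [ 0., 3., 6.]])
      total = X.sum()
      p_joint = X / total                        # p(w, w')
      p_word = X.sum(axis=1) / total             # marginal p(w)
      with np.errstate(divide='ignore'):
          pmi = np.log(p_joint / np.outer(p_word, p_word))
      pmi[np.isneginf(pmi)] = 0.0                # common convention for zero counts
      print(pmi)
      ```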

    6. They are constructed by various models whose unifying philosophy is that the meaning of a word is defined by "the company it keeps" (Firth, 1957), namely, co-occurrence statistics.

      "Distributional representation" or "distributional semantics" would be a better term for word embeddings built from co-occurrence, since "vector representations" also covers one-hot encodings and TF-IDF methods.

    7. none of these earlier generative models has been linked to PMI models

      Both deal with semantics, but PMI is localised at the word level while topic models aggregate semantics over collections of words.

    8. The chief methodological contribution is using the model priors to analytically derive a closed-form expression that directly explains (1.1);
    9. GloVe

      Global vectors use the full co-occurrence matrix

    10. Reweighting heuristics are known to improve these methods, as is dimension reduction

      Without reweighting and dimensionality reduction, PMI does not scale to large vocabularies (i.e., to high-dimensional co-occurrence matrices).

    11. Linguistic regularities in sparse and explicit word representations

      consider a brief review

    12. MLE

      MLE is not in the Bayesian tradition.

    13. But if the random walk mixes fairly quickly (the mixing time is related to the logarithm of the vocabulary size), then the distribution of the X_{w,w'}'s is very close to a multinomial distribution Mul(L̃, {p(w, w')}), where L̃ = ∑_{w,w'} X_{w,w'} is the total number of word pairs.

      What exactly is this mixing time? Presumably the number of steps after which the distribution of the discourse vector is close to its stationary distribution.

    14. Furthermore, their argument only applies to very high-dimensional word embeddings, and thus does not address low-dimensional embeddings, which have superior quality in applications

      This suggests the present method should be good for low-dimensional embeddings.

    15. GloVe
    16. PMI matrix is found to be closely approximated by a low rank matrix

      For a large vocabulary, PMI faces a curse-of-dimensionality problem: as the dimensionality increases, data points become sparse and generally far apart, making co-occurrence statistics less reliable. PMI relies heavily on meaningful co-occurrence counts, which become sparse and noisy in high-dimensional spaces, leading to instability in the calculated PMI values.
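
      A minimal sketch of the standard remedy alluded to above: factorizing a PMI-like matrix with a truncated SVD to get dense low-dimensional word vectors. The matrix contents and the dimension k below are made up.

      ```python
      import numpy as np

      # Sketch: rank-k factorization of a symmetric PMI-like matrix via truncated SVD.
      # `M` stands in for a real vocabulary-sized PMI matrix; values are random.
      rng = np.random.default_rng(1)
      M = rng.normal(size=(100, 100))
      M = (M + M.T) / 2                               # PMI matrices are symmetric

      k = 10                                          # target embedding dimension
      U, S, Vt = np.linalg.svd(M)
      approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # best rank-k approximation
      embeddings = U[:, :k] * np.sqrt(S[:k])          # one common choice of word vectors
      print(np.linalg.norm(M - approx) / np.linalg.norm(M))
      ```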

    17. word2vec
    18. They are compatible with all sorts of local structure among word vectors such as existence of clusterings, which would be absent in truly random vectors drawn from our prior

      i.e., the vectors can be locally clustered rather than purely isotropic/random.

    19. Weakening the model assumptions

      This section sheds light on the model - it is like the missing motivation. It helps by considering the model in the context of an experiment.

    20. The concentration of the partition functions

      Not clear where this analysis comes from. The kernel suggests a Gaussian process.

    21. Bayesian tradition

      In the Bayesian tradition one writes out the model's equations, one on top of the other, and explains the parameters, latent variables, etc.

    22. interesting

      Here "interesting" means non-trivial.

    23. Having n vectors be isotropic in d dimensions requires d ≪ n. This isotropy is needed in the calculations (i.e., multidimensional integral) that yield (1.1). It also holds empirically for our word vectors, as shown in Section 5.

      Isotropy is motivated by the needs of the integration, and it also holds empirically (Section 5).

    24. Furthermore, we will assume that in the bulk, the word vectors are distributed uniformly in space, earlier referred to as isotropy

      The isotropy assumption simplifies the integration but also seems to fit the experiments.

    25. The isotropy of low-dimensional word vectors also plays a key role in our explanation of the relations=lines phenomenon (Section 4). The isotropy has a "purification" effect that mitigates the effect of the (rather large) approximation error in the PMI models

      This needs further consideration.

      There are different hypotheses for the origin of the "power law", and some of them may not fit with this isotropy.

    26. suggests word vectors need to have varying lengths

      Since there is an inner product of v_w with c_t in R^d, don't all word vectors have the same dimension d? Yes, but "length" here means the norm ‖v_w‖, which can vary from word to word.

    27. The model treats corpus generation as a dynamic process, where the t-th word is produced at step t. The process is driven by the random walk of a discourse vector c_t ∈ ℝ^d. Its coordinates represent what is being talked about. Each word has a (time-invariant) latent vector v_w ∈ ℝ^d that captures its correlations with the discourse vector.

      The model is a random walk in which t indexes the word position and the random variable is a vector, the discourse vector of dimension d. This vector is a distributed representation of the semantics at position t.

      What is the discourse vector concretely: is it one-hot encoded, orthogonal, sparse, disentangled, compositional? And what would a small change in a single dimension mean?
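
      To make the process concrete, here is a toy sketch of the generative model as I read it: a discourse vector taking small Gaussian steps and emitting words log-linearly. Vocabulary size, dimension, sequence length and step size are all assumptions of mine.

      ```python
      import numpy as np

      # Toy generative random walk: c_t drifts slowly, each word w is emitted
      # with probability proportional to exp(<c_t, v_w>). All sizes are made up.
      rng = np.random.default_rng(0)
      V, d, T = 50, 10, 20
      v = rng.normal(size=(V, d)) / np.sqrt(d)    # latent word vectors v_w
      c = rng.normal(size=d) / np.sqrt(d)         # initial discourse vector c_0

      corpus = []
      for t in range(T):
          logits = v @ c
          p = np.exp(logits) / np.exp(logits).sum()
          corpus.append(int(rng.choice(V, p=p)))  # emit a word index
          c = c + 0.05 * rng.normal(size=d)       # slow random-walk step
      print(corpus)
      ```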

    28. The discourse vector c_t does a slow random walk (meaning that c_{t+1} is obtained from c_t by adding a small random displacement vector), so that nearby words are generated under similar discourses.

      What makes a random walk slow or quick? How many steps are allowed, and in how many dimensions?

      A bigger point: the meanings of words don't seem to follow this model. Adjacent words in a sentence can have dramatically different semantics; i.e., despite Firth, only a few words are strongly related and the majority are not.

      We also need a second equation for the change in c_t: what counts as "small" here?

    29. We are interested in the probabilities that word pairs co-occur near each other, so occasional big jumps in the random walk are allowed because they have negligible effect on these probabilities

      Important point.

      But how often can we have big jumps without affecting the word bigram probabilities? Much of English consists of words like "the" and "it" which carry little semantic content.

      Also, is the correlation short-term or long-term?

    30. This is reminiscent of analysis of similar random walk models in finance

      As I recall, the finance models for options don't have 2000 dimensions; typically there is just one latent variable, the real-valued price of the underlying stock at some future date. Here the semantics are spread over a high-dimensional space and are far from real-valued, so as far as I can see the analogy is being stretched too far.

    31. By contrast our random walk involves a latent discourse vector, which has a clearer semantic interpretation and has proven useful in subsequent work, e.g. understanding structure of word embeddings for polysemous words Arora et al. (2016)

      Doesn't that paper use a modified model rather than the same one? Is this paper a prior or a posterior of that paper? :-)

    32. Assuming a prior on the random walk we analytically integrate out the hidden random variables and compute a simple closed form expression that approximately connects the model parameters to the observable joint probabilities

      What assumptions does the prior encode here?

    33. Belanger and Kakade (2015) have proposed a dynamic model for text using Kalman Filters, where the sequence of words is generated from Gaussian linear dynamical systems, rather than the log-linear model in our case

      How do the two kinds of dynamics compare in what they generate: is Gaussian linear or log-linear word production the more realistic model?

    34. The dynamic topic model of Blei and Lafferty (2006) utilizes topic dynamics, but with a linear word production model.

      Topics are statistical aggregates over a window or some other collection, so it makes sense to model them as slowly changing. However, topics are also likely to change significantly at paragraph and sentence boundaries, and the small changes are probably an artifact of the sampling rather than of the generating process. Also, a text can belong to multiple topics, whereas words, on the other hand, should not be modeled as having multiple meanings.

    35. There appears to be no theoretical explanation for this empirical finding about the approximate low rank of the PMI matrix.

      This comment highlights the gap in the theoretical understanding of why the PMI matrix exhibits an approximately low rank. It raises the question of whether this low-rank property could be formally proven or if it is purely an empirical observation.

    36. Latent Dirichlet Allocation (LDA)
    37. we propose a probabilistic model of text generation that augments the log-linear topic model of Mnih and Hinton (2007) with dynamics, in the form of a random walk over a latent discourse space
    1. Suppose A sees object i and signals, then B will infer object i with probability ∑_{j=1}^{m} p_ij q′_ji

      this is the basic building block of the model

    2. A Linguistic Error Limit.

      This section discusses how communication errors can be injected into the game. The form used creates a tradeoff between the number of signals (expressivity) and accuracy (the expected number of correctly transmitted messages).

    3. The crucial difference between word and sentence formation is that the first consists essentially of memorizing all (relevant) words of a language, whereas the second is based on grammatical rules.

      If stems have a hierarchy via phonemic similarity and there is a rich morphology, learning words may be greatly simplified by learning a few systematic rules. Ideally we would learn a lexicon containing only base forms and apply a group action to derive all the other forms; in that case we only need to memorize a small lexicon.

    4. More realistically, we may assume that correct understanding of a word is based (to some extent) on matching the perceived string of phonemes to known words of the language

      Here we may be looking at a form of compositionality, where semantics are established by preferential selection of languages with related patterns of phonemes. What is missing is that these patterns are not grounded in the structural features of the state being expressed.

    5. This equation assumes that understanding of a word is based on the correct understanding of each individual sound.

      Decoding the state depends on correctly decoding each signal in the sequence.

    6. The passive matrix Q is derived from the active matrix: q_ji = p_ij / ∑_i p_ij.

      This Q is a renormalized transpose of P; it is what receivers learn in Lewis signaling games.

      Having this transpose construction means that each agent effectively starts with a randomly chosen Lewis-style strategy profile, and selection then acts through a fitness function that is maximized by a signaling system. As n increases, the number of partial-pooling equilibria grows much faster than the number of separating equilibria.
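
      A minimal sketch of this construction and of a mutual-understanding payoff of the ∑_j p_ij q′_ji form quoted above (toy sizes; the exact payoff normalization used in the paper may differ):

      ```python
      import numpy as np

      # Sketch: active matrix P (speaker), passive matrix Q as a renormalized
      # transpose (q_ji = p_ij / sum_i p_ij), and a mutual-understanding payoff
      # of the form sum_i sum_j p_ij * q'_ji, averaged over the two speaker roles.
      rng = np.random.default_rng(0)
      n_objects, n_signals = 5, 5

      def random_active_matrix():
          P = rng.random((n_objects, n_signals))
          return P / P.sum(axis=1, keepdims=True)      # each row is a distribution

      def passive_from_active(P):
          return (P / P.sum(axis=0, keepdims=True)).T  # q_ji = p_ij / sum_i p_ij

      P_a, P_b = random_active_matrix(), random_active_matrix()
      Q_a, Q_b = passive_from_active(P_a), passive_from_active(P_b)

      # A speaks and B listens, plus the reverse; average the two directions.
      payoff = 0.5 * (np.sum(P_a * Q_b.T) + np.sum(P_b * Q_a.T))
      print(payoff)
      ```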

    7. For the next round, individuals produce offspring proportional to their payoff.

      The payoffs per interaction are symmetric, but the aggregation over interactions is not!

      This combines a cooperative step (communication payoffs) with a competitive step (reproduction proportional to payoff).

    8. The matrix P contains the entries p_ij, denoting the probability that for a speaker object i is associated with sound j

      This is basically a probabilistic encoding of the signaling system used by speaker and listener, as in a Lewis signaling game.

    9. Hence, we assume that both speaker and listener receive a reward for mutual understanding. If for example only the listener receives a benefit, then the evolution of language requires cooperation

      When agents receive the same rewards they are playing a common-interest (cooperative) game; this is not expressed clearly here.

      Again, this is in line with the Lewis signaling game.

    10. early in the evolution of language, errors in signaling and perception would be common

      This notion of errors driving language evolution is central to the paper, and it is one that has gone underappreciated in later research.

    1. given arbitrarily large vocabularies

      When agent A can transmit the full state in one symbol, the agents have coordinated on a fully separating equilibrium and Q has perfect information. Moreover, RL algorithms will have no reward signal pushing them towards a more refined equilibrium unless some reward shaping is done to incentivize the agents to coordinate on agent Q's state as well.

    2. it does not convey the functional meaning of language, grounding (mapping physical concepts to words), compositionality (combining knowledge of simpler concepts to describe richer concepts), or aspects of planning

      Some problems with neural dialog models.

    3. Task & Talk

      Task & Talk is clearly based on the Lewis signaling game. However, this setup seems to lead to entangled solutions, at least with the minimal number of required signals: A needs to send one of three signals (the attribute it does not need), then B needs to send one of four symbols for the value of one attribute and then the other (the order is fixed by symmetry breaking and learned through reinforcement). In this minimal version the signals are reused. If agent A had 8 signals it could respond fully without needing a second round.

    4. 64

      So there are 64 states, and a state space that is both symmetric and factors into three components (three normal subgroups, in group-theoretic terms), so agents could learn to choose composable representations if one provides the right incentives to pick these over other equilibria.
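
      For concreteness, the 64-object space factorizes as 4 × 4 × 4, and a compositional code would simply name each factor. A toy enumeration (attribute values are illustrative, from memory of the paper's setup):

      ```python
      from itertools import product

      # Task & Talk-style instance space: 3 attributes, 4 values each, 4**3 = 64 states.
      colors = ["red", "green", "blue", "purple"]
      shapes = ["square", "triangle", "circle", "star"]
      styles = ["dotted", "solid", "filled", "dashed"]

      states = list(product(colors, shapes, styles))
      print(len(states))   # 64
      print(states[0])     # a compositional code would emit one token per attribute
      ```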

    5. Specifically, in a sequence of 'negative' results culminating in a 'positive' one, we find that while agents always successfully invent communication protocols and languages to achieve their goals with near-perfect accuracies, the invented languages are decidedly not compositional, interpretable, or 'natural';

      As far as I know this is exactly what we should expect from RL: a quick and efficient solution, not the Oxford English Dictionary. It is like expecting a maze to be solved with style.

    6. What are the conditions that lead to the emergence of human-interpretable or compositional grounded language?

      So they essentially mean compositional language.

    1. iterated learning model

      This suggests a dynamical system that may exhibit different regimes, such as chaotic dynamics or periodicity, unless the authors can demonstrate that it always converges.

    2. stable irregularity in language

      This is best understood through two English examples: 1. the irregular verb "to be" is used in many other constructions (e.g. the future tense), so instability here would have a significant impact on learning; 2. English has many loan words that preserved their foreign morphology, which makes learning them as a group easier (they are irregular but follow a template).

    3. one adult and one learner

      I think one-on-one language games foster strong co-adaptation and eliminate the frictions that pressure the development of more efficient representations. Using multiple learners would probably yield faster evolution.

    4. Rather than appealing to communicative pressures and natural selection, the suggestion is that structure-preserving mappings emerge from the dynamics of iterated learning

      This is the main point of the paper.