57 Matching Annotations
  1. Oct 2024
    1. The novelty here over such past works is a theoretical analysis in the method-of-moments tradition

      The method of moments models a random variable via its moments (mean, variance, etc.). In practice we can say very little when the inputs are high-dimensional, since we will see no samples at all for almost every combination of values.

    2. Pr[w emitted at time t | c_t] ∝ exp(⟨c_t, v_w⟩).

      The chance that a given word is emitted is proportional to the exponential of the inner product of its latent vector with the current context (discourse) vector. The proportionality sign indicates that the normalization term is omitted; it is the sum of such exponentials over all possible words.
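
      As a concrete illustration (my own toy sketch, not from the paper; vocabulary size, dimension and all values are made up), the emission probability including its normalization term:

      ```python
      import numpy as np

      # Toy log-linear emission probability: Pr[w | c_t] ∝ exp(<c_t, v_w>).
      # Sizes and values are arbitrary; this only illustrates the normalization.
      rng = np.random.default_rng(0)
      V, d = 5, 3
      word_vectors = rng.normal(size=(V, d))   # latent vectors v_w
      c_t = rng.normal(size=d)                 # current discourse vector

      logits = word_vectors @ c_t              # inner products <c_t, v_w>
      Z = np.exp(logits).sum()                 # partition function (normalizer)
      probs = np.exp(logits) / Z
      print(probs, probs.sum())                # probabilities sum to 1
      ```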

    3. a new generative model, a dynamic version of the log-linear topic model

      But what does this model generate?

    4. However, skip-gram is a discriminative model (due to the use of negative sampling

      this is a great insight.

    5. The old PMI method is a bit mysterious.

      PMI was proposed as an information-theoretic foundation for explaining the success of TF-IDF with one-hot encoded word representations. While how the latter works has been called mysterious, PMI has clear probabilistic underpinnings.
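
      For reference, a minimal sketch of how PMI is computed from a co-occurrence matrix (toy counts and variable names of my own choosing):

      ```python
      import numpy as np

      # PMI(w, w') = log p(w, w') / (p(w) p(w')), on toy co-occurrence counts.
      X = np.array([[10., 2., 0.],
                    [ 2., 8., 3.],
                    [ 0., 3., 6.]])
      total = X.sum()
      p_joint = X / total                        # p(w, w')
      p_word = X.sum(axis=1) / total             # marginal p(w)
      with np.errstate(divide='ignore'):
          pmi = np.log(p_joint / np.outer(p_word, p_word))
      pmi[np.isneginf(pmi)] = 0.0                # common convention for zero counts
      print(pmi)
      ```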

    6. They are constructed by various models whose unifying philosophy is that the meaning of a word is defined by "the company it keeps" (Firth, 1957), namely, co-occurrence statistics.

      "Distributional representation" or "distributional semantics" would be a better term for word embeddings built from co-occurrence, since "vector representations" also covers one-hot encodings and TF-IDF methods.

    7. none of these earlier generative models has been linked to PMI models

      Both deal with semantics, but PMI is localised at the word level while topic models aggregate semantics over collections of words.

    8. The chief methodological contribution is using the model priors to analytically derive a closed-form expression that directly explains (1.1);
    9. GloVe

      Global vectors use the full co-occurrence matrix

    10. Reweighting heuristics are known to improve these methods, as is dimension reduction

      Without reweighting and dimensionality reduction, PMI does not scale to large vocabularies (i.e., to high-dimensional co-occurrence matrices).

    11. Linguistic regularities in sparse and explicit word representations

      consider a brief review

    12. MLE

      MLE is not in the Bayesian tradition.

    13. But if the random walk mixes fairly quickly (the mixing time is related to the logarithm of the vocabulary size), then the distribution of the X_{w,w'}'s is very close to a multinomial distribution Mul(L̃, {p(w, w')}), where L̃ = ∑_{w,w'} X_{w,w'} is the total number of word pairs.

      What exactly is this mixing time? Presumably the number of steps after which the distribution of the discourse vector is close to its stationary distribution.

    14. Furthermore, their argument only applies to very high-dimensional word embeddings, and thus does not address low-dimensional embeddings, which have superior quality in applications

      This suggests the present method should be good for low-dimensional embeddings.

    15. GloVe
    16. PMI matrix is found to be closely approximated by a low rank matrix

      For a large vocabulary, PMI faces a curse-of-dimensionality problem: as the dimensionality increases, data points become sparse and generally far apart, making co-occurrence statistics less reliable. PMI relies heavily on meaningful co-occurrence counts, which become sparse and noisy in high-dimensional spaces, leading to instability in the calculated PMI values.
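
      A minimal sketch of the standard remedy alluded to above: factorizing a PMI-like matrix with a truncated SVD to get dense low-dimensional word vectors. The matrix contents and the dimension k below are made up.

      ```python
      import numpy as np

      # Sketch: rank-k factorization of a symmetric PMI-like matrix via truncated SVD.
      # `M` stands in for a real vocabulary-sized PMI matrix; values are random.
      rng = np.random.default_rng(1)
      M = rng.normal(size=(100, 100))
      M = (M + M.T) / 2                               # PMI matrices are symmetric

      k = 10                                          # target embedding dimension
      U, S, Vt = np.linalg.svd(M)
      approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # best rank-k approximation
      embeddings = U[:, :k] * np.sqrt(S[:k])          # one common choice of word vectors
      print(np.linalg.norm(M - approx) / np.linalg.norm(M))
      ```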

    17. word2vec
    18. They are compatible with all sorts of local structure among word vectors such as existence of clusterings, which would be absent in truly random vectors drawn from our prior

      i.e., the vectors can be locally clustered rather than purely isotropic/random.

    19. Weakening the model assumptions

      This section sheds light on the model - it is like the missing motivation. It helps by considering the model in the context of an experiment.

    20. The concentration of the partition functions

      Not clear where this analysis comes from. The kernel suggests a Gaussian process.

    21. Bayesian tradition

      In the Bayesian tradition one writes out the model's equations, one on top of the other, and explains the parameters, latent variables, etc.

    22. interesting

      Here "interesting" means non-trivial.

    23. Having n vectors be isotropic in d dimensions requires d ≪ n. This isotropy is needed in the calculations (i.e., multidimensional integral) that yield (1.1). It also holds empirically for our word vectors, as shown in Section 5.

      Isotropy is motivated by the needs of the integration, and it also holds empirically (Section 5).

    24. Furthermore, we will assume that in the bulk, the word vectors are distributed uniformly in space, earlier referred to as isotropy

      The isotropy assumption simplifies the integration but also seems to fit the experiments.

    25. The isotropy of low-dimensional word vectors also plays a key role in our explanation of the relations=lines phenomenon (Section 4). The isotropy has a "purification" effect that mitigates the effect of the (rather large) approximation error in the PMI models

      This needs further consideration.

      There are different hypotheses for the origin of the "power law", and some of them may not fit with this isotropy.

    26. suggests word vectors need to have varying lengths

      Since there is an inner product of v_w with c_t in R^d, don't all word vectors have the same dimension d? Yes, but "length" here means the norm ‖v_w‖, which can vary from word to word.

    27. The model treats corpus generation as a dynamic process, where the t-th word is produced at step t. The process is driven by the random walk of a discourse vector c_t ∈ ℝ^d. Its coordinates represent what is being talked about. Each word has a (time-invariant) latent vector v_w ∈ ℝ^d that captures its correlations with the discourse vector.

      The model is a random walk in which t indexes the word position and the random variable is a vector, the discourse vector of dimension d. This vector is a distributed representation of the semantics at position t.

      What is the discourse vector concretely: is it one-hot encoded, orthogonal, sparse, disentangled, compositional? And what would a small change in a single dimension mean?
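
      To make the process concrete, here is a toy sketch of the generative model as I read it: a discourse vector taking small Gaussian steps and emitting words log-linearly. Vocabulary size, dimension, sequence length and step size are all assumptions of mine.

      ```python
      import numpy as np

      # Toy generative random walk: c_t drifts slowly, each word w is emitted
      # with probability proportional to exp(<c_t, v_w>). All sizes are made up.
      rng = np.random.default_rng(0)
      V, d, T = 50, 10, 20
      v = rng.normal(size=(V, d)) / np.sqrt(d)    # latent word vectors v_w
      c = rng.normal(size=d) / np.sqrt(d)         # initial discourse vector c_0

      corpus = []
      for t in range(T):
          logits = v @ c
          p = np.exp(logits) / np.exp(logits).sum()
          corpus.append(int(rng.choice(V, p=p)))  # emit a word index
          c = c + 0.05 * rng.normal(size=d)       # slow random-walk step
      print(corpus)
      ```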

    28. The discourse vector c_t does a slow random walk (meaning that c_{t+1} is obtained from c_t by adding a small random displacement vector), so that nearby words are generated under similar discourses.

      What makes a random walk slow or quick? How many steps are allowed, and in how many dimensions?

      A bigger point: the meanings of words don't seem to follow this model. Adjacent words in a sentence can have dramatically different semantics; i.e., despite Firth, only a few words are strongly related and the majority are not.

      We also need a second equation for the change in c_t: what counts as "small" here?

    29. We are interested in the probabilities that word pairs co-occur near each other, so occasional big jumps in the random walk are allowed because they have negligible effect on these probabilities

      Important point.

      But how often can we have big jumps without affecting the word bigram probabilities? Much of English consists of words like "the" and "it" which carry little semantic content.

      Also, is the correlation short-term or long-term?

    30. This is reminiscent of analysis of similar random walk models in finance

      As I recall, the finance models for options don't have 2000 dimensions; typically there is just one latent variable, the real-valued price of the underlying stock at some future date. Here the semantics are spread over a high-dimensional space and are far from real-valued, so as far as I can see the analogy is being stretched too far.

    31. By contrast our random walk involves a latent discourse vector, which has a clearer semantic interpretation and has proven useful in subsequent work, e.g. understanding structure of word embeddings for polysemous words Arora et al. (2016)

      Doesn't that paper use a modified model rather than the same one? Is this paper a prior or a posterior of that paper? :-)

    32. Assuming a prior on the random walk we analytically integrate out the hidden random variables and compute a simple closed form expression that approximately connects the model parameters to the observable joint probabilities

      What assumptions does the prior encode here?

    33. Belanger and Kakade (2015) have proposed a dynamic model for text using Kalman Filters, where the sequence of words is generated from Gaussian linear dynamical systems, rather than the log-linear model in our case

      How do the two kinds of dynamics compare in what they generate: is Gaussian linear or log-linear word production the more realistic model?

    34. The dynamic topic model of Blei and Lafferty (2006) utilizes topic dynamics, but with a linear word production model.

      Topics are statistical aggregates over a window or some other collection, so it makes sense to model them as slowly changing. However, topics are also likely to change significantly at paragraph and sentence boundaries, and the small changes are probably an artifact of the sampling rather than of the generating process. Also, a text can belong to multiple topics, whereas words, on the other hand, should not be modeled as having multiple meanings.

    35. There appears to be no theoretical explanation for this empirical finding about the approximate low rank of the PMI matrix.

      This comment highlights the gap in the theoretical understanding of why the PMI matrix exhibits an approximately low rank. It raises the question of whether this low-rank property could be formally proven or if it is purely an empirical observation.

    36. Latent Dirichlet Allocation (LDA)
    37. we propose a probabilistic model of text generation that augments the log-linear topic model of Mnih and Hinton (2007) with dynamics, in the form of a random walk over a latent discourse space
    1. Suppose A sees object i and signals, then B will infer object i with probability ∑_{j=1}^{m} p_ij q′_ji

      this is the basic building block of the model

    2. A Linguistic Error Limit.

      This section discusses how communication errors can be injected into the game. The form used creates a tradeoff between the number of signals (expressivity) and accuracy (the expected number of correctly transmitted messages).

    3. The crucial difference between word and sentence formation is that the first consists essentially of memorizing all (relevant) words of a language, whereas the second is based on grammatical rules.

      If stems have a hierarchy via phonemic similarity and there is a rich morphology, learning words may be greatly simplified by learning a few systematic rules. Ideally we would learn a lexicon containing only base forms and apply a group action to derive all the other forms; in that case we only need to memorize a small lexicon.

    4. More realistically, we may assume that correct understanding of a word is based (to some extent) on matching the perceived string of phonemes to known words of the language

      Here we may be looking at a form of compositionality, where semantics are established by preferential selection of languages with related patterns of phonemes. What is missing is that these patterns are not grounded in the structural features of the state being expressed.

    5. This equation assumes that understanding of a word is based on the correct understanding of each individual sound.

      Decoding the state depends on correctly decoding each signal in the sequence.

    6. The passive matrix Q is derived from the active matrix: q_ji = p_ij / ∑_i p_ij.

      This Q is a renormalized transpose of P; it is what receivers learn in Lewis signaling games.

      Having this transpose construction means that each agent effectively starts with a randomly chosen Lewis-style strategy profile, and selection then acts through a fitness function that is maximized by a signaling system. As n increases, the number of partial-pooling equilibria grows much faster than the number of separating equilibria.
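
      A minimal sketch of this construction and of a mutual-understanding payoff of the ∑_j p_ij q′_ji form quoted above (toy sizes; the exact payoff normalization used in the paper may differ):

      ```python
      import numpy as np

      # Sketch: active matrix P (speaker), passive matrix Q as a renormalized
      # transpose (q_ji = p_ij / sum_i p_ij), and a mutual-understanding payoff
      # of the form sum_i sum_j p_ij * q'_ji, averaged over the two speaker roles.
      rng = np.random.default_rng(0)
      n_objects, n_signals = 5, 5

      def random_active_matrix():
          P = rng.random((n_objects, n_signals))
          return P / P.sum(axis=1, keepdims=True)      # each row is a distribution

      def passive_from_active(P):
          return (P / P.sum(axis=0, keepdims=True)).T  # q_ji = p_ij / sum_i p_ij

      P_a, P_b = random_active_matrix(), random_active_matrix()
      Q_a, Q_b = passive_from_active(P_a), passive_from_active(P_b)

      # A speaks and B listens, plus the reverse; average the two directions.
      payoff = 0.5 * (np.sum(P_a * Q_b.T) + np.sum(P_b * Q_a.T))
      print(payoff)
      ```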

    7. For the next round, individuals produce offspring proportional to their payoff.

      The payoffs per interaction are symmetric, but the aggregation over interactions is not!

      This combines a cooperative step (communication payoffs) with a competitive step (reproduction proportional to payoff).

    8. The matrix P contains the entries p_ij, denoting the probability that for a speaker object i is associated with sound j

      This is basically a probabilistic encoding of the signaling system used by speaker and listener, as in a Lewis signaling game.

    9. Hence, we assume that both speaker and listener receive a reward for mutual understanding. If for example only the listener receives a benefit, then the evolution of language requires cooperation

      When agents receive the same rewards they are playing a common-interest (cooperative) game; this is not expressed clearly here.

      Again, this is in line with the Lewis signaling game.

    10. early in the evolution of language, errors in signaling and perception would be common

      This notion of errors driving language evolution is central to the paper, and it is one that has gone underappreciated in later research.

    1. given arbitrarily large vocabularies

      When agent A can transmit the full state in one symbol, the agents have coordinated on a fully separating equilibrium and Q has perfect information. Moreover, RL algorithms will have no reward signal pushing them towards a more refined equilibrium unless some reward shaping is done to incentivize the agents to coordinate on agent Q's state as well.

    2. it does not convey the functional meaning of language, grounding (mapping physical concepts to words), compositionality (combining knowledge of simpler concepts to describe richer concepts), or aspects of planning

      Some problems with neural dialog models.

    3. Task & Talk

      Task & Talk is clearly based on the Lewis signaling game. However, this setup seems to lead to entangled solutions, at least with the minimal number of required signals: A needs to send one of three signals (the attribute it does not need), then B needs to send one of four symbols for the value of one attribute and then the other (the order is fixed by symmetry breaking and learned through reinforcement). In this minimal version the signals are reused. If agent A had 8 signals it could respond fully without needing a second round.

    4. 64

      So there are 64 states, and a state space that is both symmetric and factors into three components (three normal subgroups, in group-theoretic terms), so agents could learn to choose composable representations if one provides the right incentives to pick these over other equilibria.
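
      For concreteness, the 64-object space factorizes as 4 × 4 × 4, and a compositional code would simply name each factor. A toy enumeration (attribute values are illustrative, from memory of the paper's setup):

      ```python
      from itertools import product

      # Task & Talk-style instance space: 3 attributes, 4 values each, 4**3 = 64 states.
      colors = ["red", "green", "blue", "purple"]
      shapes = ["square", "triangle", "circle", "star"]
      styles = ["dotted", "solid", "filled", "dashed"]

      states = list(product(colors, shapes, styles))
      print(len(states))   # 64
      print(states[0])     # a compositional code would emit one token per attribute
      ```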

    5. Specifically, in a sequence of 'negative' results culminating in a 'positive' one, we find that while agents always successfully invent communication protocols and languages to achieve their goals with near-perfect accuracies, the invented languages are decidedly not compositional, interpretable, or 'natural';

      As far as I know this is exactly what we should expect from RL: a quick and efficient solution, not the Oxford English Dictionary. It is like expecting a maze to be solved with style.

    6. What are the conditions that lead to the emergence of human-interpretable or compositional grounded language?

      So they essentially mean compositional language.

    1. iterated learning model

      This suggests a dynamical system that may exhibit different regimes, such as chaotic dynamics or periodicity, unless the authors can demonstrate that it always converges.

    2. stable irregularity in language

      This is best understood through two English examples: 1. the irregular verb "to be" is used in many other constructions (e.g. the future tense), so instability here would have a significant impact on learning; 2. English has many loan words that preserved their foreign morphology, which makes learning them as a group easier (they are irregular but follow a template).

    3. one adult and one learner

      I think one-on-one language games foster strong co-adaptation and eliminate the frictions that pressure the development of more efficient representations. Using multiple learners would probably yield faster evolution.

    4. Rather than appealing to communicative pressures and natural selection, the suggestion is that structure-preserving mappings emerge from the dynamics of iterated learning

      This is the main point of the paper.