arg maxvw;vcP(w;c)2Dlog11+evcvw
maximise the log probability.
arg maxvw;vcP(w;c)2Dlog11+evcvw
maximise the log probability.
p(D= 1jw;c)the probability that(w;c)came from the data, and byp(D= 0jw;c) =1p(D= 1jw;c)the probability that(w;c)didnot.
probability of word,context present in text or not.
Loosely speaking, we seek parameter values (thatis, vector representations for both words and con-texts) such that the dot productvwvcassociatedwith “good” word-context pairs is maximized.
In the skip-gram model, each wordw2Wisassociated with a vectorvw2Rdand similarlyeach contextc2Cis represented as a vectorvc2Rd, whereWis the words vocabulary,Cis the contexts vocabulary, anddis the embed-ding dimensionality.
Factors involved in the Skip gram model