21 Matching Annotations
  1. Oct 2020
  2. Sep 2020
  3. May 2020
    1. for query, query_embedding in zip(queries, query_embeddings):
           distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

      How to calculate the cosine distance between a query vector and every embedding in the corpus
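The computation in the highlighted snippet can be sketched without SciPy. This is a minimal, dependency-free illustration, assuming toy 2-dimensional vectors; the helper `cosine_distance` and the sample data are hypothetical and not part of the annotated source. It uses the same definition as SciPy's "cosine" metric: one minus the cosine of the angle between the vectors.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v), the definition SciPy's "cosine" metric uses."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy embeddings; real BERT sentence vectors would have 768 dimensions.
corpus_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_embedding = [2.0, 0.0]

# One distance per corpus entry, like the row returned by cdist(...)[0].
distances = [cosine_distance(query_embedding, e) for e in corpus_embeddings]

# Corpus indices ordered from most to least similar to the query.
ranking = sorted(range(len(distances)), key=lambda i: distances[i])
```

Sorting by distance gives a relative ranking of the corpus, which is usually more useful than the raw distance values.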

    1. simple approach is to average the second-to-last hidden layer of each token, producing a single 768-length vector

      A proposal for obtaining a single vector for the whole sentence
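The averaging step might look like the sketch below. It assumes the hidden states are already available as a nested list indexed `[layer][token][feature]`; the toy sizes and the `mean_pool` helper are hypothetical stand-ins for real model output.

```python
# Hypothetical hidden states indexed as [layer][token][feature].
# Toy sizes (3 layers, 2 tokens, 4 features) stand in for BERT's real ones.
hidden_states = [
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],  # layer 0
    [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]],  # layer 1 (second to last)
    [[9.0, 9.0, 9.0, 9.0], [9.0, 9.0, 9.0, 9.0]],  # layer 2 (last)
]

def mean_pool(token_vectors):
    """Average the token vectors feature by feature into one vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

# Pool the second-to-last layer over all tokens to get one sentence vector.
sentence_embedding = mean_pool(hidden_states[-2])
```

With real BERT-base output the result would have 768 entries, one per hidden feature.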

    2. It is worth noting that word-level similarity comparisons are not appropriate with BERT embeddings because these embeddings are contextually dependent, meaning that the word vector changes depending on the sentence it appears in. This allows wonderful things like polysemy so that e.g. your representation encodes river “bank” and not a financial institution “bank”, but makes direct word-to-word similarity comparisons less valuable. However, for sentence embeddings similarity comparison is still valid such that one can query, for example, a single sentence against a dataset of other sentences in order to find the most similar. Depending on the similarity metric used, the resulting similarity values will be less informative than the relative ranking of similarity outputs since many similarity metrics make assumptions about the vector space (equally-weighted dimensions, for example) that do not hold for our 768-dimensional vector space.

      Thoughts on similarity comparison for word and sentence level embeddings.

    3. For out of vocabulary words that are composed of multiple sentence and character-level embeddings, there is a further issue of how best to recover this embedding. Averaging the embeddings is the most straightforward solution (one that is relied upon in similar embedding models with subword vocabularies like fasttext), but summation of subword embeddings and simply taking the last token embedding (remember that the vectors are context sensitive) are acceptable alternative strategies.

      Strategies for obtaining an embedding for an OOV word
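The three strategies named in the highlight (averaging, summation, last token) can be compared side by side. This is a rough sketch: the `combine_subwords` helper and the toy subword vectors are hypothetical.

```python
def combine_subwords(subword_vectors, strategy="mean"):
    """Collapse the vectors of a word's subword pieces into one word vector."""
    if strategy == "last":
        # Keep only the final piece's (context-sensitive) vector.
        return list(subword_vectors[-1])
    summed = [sum(column) for column in zip(*subword_vectors)]
    if strategy == "sum":
        return summed
    if strategy == "mean":
        return [s / len(subword_vectors) for s in summed]
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical 2-d vectors for the four pieces of one out-of-vocabulary word.
pieces = [[1.0, 0.0], [3.0, 2.0], [0.0, 2.0], [4.0, 4.0]]

mean_vector = combine_subwords(pieces, "mean")
sum_vector = combine_subwords(pieces, "sum")
last_vector = combine_subwords(pieces, "last")
```

Mean and sum point in the same direction, so they give identical cosine similarities; they differ only in magnitude.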

    4. It should be noted that although the [CLS] acts as an “aggregate representation” for classification tasks, this is not the best choice for a high-quality sentence embedding vector. According to BERT author Jacob Devlin: “I’m not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations.”

      The [CLS] token is not a good-quality sentence-level embedding :O

    5. In order to get the individual vectors we will need to combine some of the layer vectors…but which layer or combination of layers provides the best representation?

      Strategies for aggregating the information from 12 layers
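A few commonly tried combinations can be sketched for a single token. The choice of candidates here (second-to-last layer, sum of the last four, concatenation of the last four) is an assumption about what "combination of layers" might mean, and the toy data is hypothetical.

```python
# Hypothetical per-layer vectors for ONE token, indexed as [layer][feature].
# Five toy layers with 2 features stand in for BERT's 12 layers x 768 features.
token_layers = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 2.0]]

last_four = token_layers[-4:]

# Option 1: take the second-to-last layer as is.
second_to_last = token_layers[-2]

# Option 2: sum the last four layers feature by feature (vector size unchanged).
summed = [sum(column) for column in zip(*last_four)]

# Option 3: concatenate the last four layers (vector size grows fourfold).
concatenated = [value for layer in last_four for value in layer]
```

Which option works best depends on the downstream task, so it is worth evaluating them empirically.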

    6. This object has four dimensions, in the following order:

      1. The layer number (12 layers)
      2. The batch number (1 sentence)
      3. The word / token number (22 tokens in our sentence)
      4. The hidden unit / feature number (768 features)

      That’s 202,752 unique values just to represent our one sentence!

      Expected dimensionality of the full hidden-state output for a single sentence
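The count quoted in the highlight is just the product of the four dimensions, which is quick to check. The dictionary keys below are illustrative labels, not names from any library.

```python
import math

# The four dimensions listed in the highlight, in order.
shape = {"layers": 12, "batch": 1, "tokens": 22, "features": 768}

# Total number of values in the hidden-state object for this one sentence.
total_values = math.prod(shape.values())

# Values left after keeping a single layer and pooling over the tokens.
pooled_values = shape["features"]

print(total_values)  # 202752
```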

    7. BERT offers an advantage over models like Word2Vec, because while each word has a fixed representation under Word2Vec regardless of the context within which the word appears, BERT produces word representations that are dynamically informed by the words around them.

      Advantage of BERT embedding over word2vec

    1. BERT for feature extraction: The fine-tuning approach isn’t the only way to use BERT. Just like ELMo, you can use the pre-trained BERT to create contextualized word embeddings. Then you can feed these embeddings to your existing model, a process the paper shows yields results not far behind fine-tuning BERT on a task such as named-entity recognition.

      How to extract embeddings from BERT

    1. ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.

      Aggregate sequence representation? Does it mean it is the sentence embedding?

    1. I extracted embeddings from a pytorch model (pytorch_model.bin file). The code to extract is pasted here. It assumes the embeddings are stored with the name bert.embeddings.word_embeddings.weight.

      How to extract raw BERT input embeddings? Those are not context aware.

    1. about 30,000 vectors or embeddings (we can train the model with our own vocabulary if needed, though this has many factors to be considered before doing so, such as the need to pre-train the model from scratch with the new vocabulary). These vectors are referred to as raw vectors/embeddings in this post to distinguish them from their transformed counterparts once they pass through the BERT model. These learned raw vectors are similar to the vector output of a word2vec model: a single vector represents a word regardless of its different meanings or senses. For instance, all the different senses/meanings (cell phone, biological cell, prison cell) of a word like “cell” are combined into a single vector.

      BERT offers two kinds of embeddings:

      1. raw embeddings, similar to word2vec: a single vector represents a word regardless of its different meanings or senses
      2. context-aware embeddings: the transformed vectors produced once the raw ones pass through the model
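The difference between the two kinds can be illustrated with a deliberately crude toy. Everything here is hypothetical: the embedding table is made up, and `toy_contextual_vectors` simply mixes each vector with the sentence mean so that its output depends on context. It is not how BERT computes anything.

```python
# Hypothetical 2-d "raw" embedding table; a real BERT vocabulary has about 30,000 rows.
raw_embeddings = {
    "cell": [1.0, 1.0],
    "phone": [3.0, 1.0],
    "prison": [1.0, 5.0],
}

def raw_vectors(tokens):
    """Context-free lookup: every occurrence of a token gets the same vector."""
    return [raw_embeddings[token] for token in tokens]

def toy_contextual_vectors(tokens):
    """Stand-in for the model (NOT BERT): average each raw vector with the sentence mean."""
    vectors = raw_vectors(tokens)
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    return [[(a + m) / 2 for a, m in zip(vector, mean)] for vector in vectors]

# "cell" in two different contexts.
raw_a = raw_vectors(["cell", "phone"])[0]
raw_b = raw_vectors(["prison", "cell"])[1]
ctx_a = toy_contextual_vectors(["cell", "phone"])[0]
ctx_b = toy_contextual_vectors(["prison", "cell"])[1]
```

The raw vectors for "cell" are identical in both sentences, while the toy contextual vectors differ. That is the property the note distinguishes.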
  4. Oct 2019
    1. MDX is a superset of Markdown. It allows you to write JSX inside markdown. This includes importing and rendering React components!
  5. Sep 2019
    1. Text embedding models convert any input text into an output vector of numbers, and in the process map semantically similar words near each other in the embedding space:

      Figure 2: Text embeddings convert any text into a vector of numbers (left). Semantically similar pieces of text are mapped nearby each other in the embedding space (right).

      Given a trained text embedding model, we can directly measure the associations the model has between words or phrases. Many of these associations are expected and are helpful for natural language tasks. However, some associations may be problematic or hurtful. For example, the ground-breaking paper by Bolukbasi et al. [4] found that the vector-relationship between "man" and "woman" was similar to the relationship between "physician" and "registered nurse" or "shopkeeper" and "housewife".

      love that Big Lebowski reference

  6. Jan 2019
    1. Grid devices can be nested or layered along with other devices and your plug-ins,

      Thanks to training for Cycling ’74 Max, had a kind of micro-epiphany about encapsulation, a year or so ago. Nesting devices in one another sounds like a convenience but there’s a rather deep effect on workflow when you start arranging things in this way: you don’t have to worry about the internals of a box/patcher/module/device if you really know what you can expect out of it. Though some may take this for granted (after all, other modular systems have had it for quite a while), there’s something profound about getting modules that can include other modules. Especially when some of these are third-party plugins.

  7. Aug 2018
    1. Publishers and other sites can include a simple line of javascript to enable annotation by default across their content.

      Publishers and platform hosts who want to embed annotations can learn more about best practices here.

  8. Apr 2017
    1. Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavors, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model (Section 3.1 and 3.2 in Mikolov et al.).
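The two flavors differ mainly in which way the prediction runs, and that shows up in how training examples are built. The sketch below covers only that data-preparation step, with hypothetical helper names and a window of one word on each side. It does not train a model.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=1):
    """CBOW: predict the center word from its surrounding context words."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=1):
    """Skip-gram: predict each context word from the center word."""
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in context_window(tokens, i, window)
    ]

tokens = ["the", "quick", "brown", "fox"]
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
```

CBOW yields one example per position, with the context as input. Skip-gram yields one example per (center, context) pair, so it produces more examples from the same text.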