23 Matching Annotations
  1. Nov 2020
  2. Jun 2020
    1. epsilon: a very small number that prevents division by zero in the implementation (e.g. 10^-8).
       Further, learning rate decay can also be used with Adam. The paper uses a decay rate alpha = alpha/sqrt(t), updated each epoch (t), for the logistic regression demonstration.
       The Adam paper suggests: "Good default settings for the tested machine learning problems are alpha=0.001, beta1=0.9, beta2=0.999 and epsilon=10^-8."
       The TensorFlow documentation suggests some tuning of epsilon: "The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1."
       We can see that the popular deep learning libraries generally use the default parameters recommended by the paper:
       TensorFlow: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
       Keras: lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0
       Blocks: learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-08, decay_factor=1
       Lasagne: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
       Caffe: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
       MxNet: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8
       Torch: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8

      Should we expose EPS as one of the experiment parameters? I think that we shouldn't since it is a rather technical parameter.
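
To make the role of epsilon concrete, here is a minimal sketch of one Adam update for a single scalar parameter, using the defaults quoted above. The function name `adam_step` and the toy objective f(x) = x^2 are assumptions for illustration and do not come from any of the libraries listed.

```python
import math

def adam_step(param, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    # eps keeps the denominator non-zero when v_hat is zero or tiny
    param = param - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy run on f(x) = x^2 with the alpha / sqrt(t) decay mentioned in the quote
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2 * x
    x, m, v = adam_step(x, grad, m, v, t, alpha=0.1 / math.sqrt(t))
```

With a zero gradient on the first step both moment estimates are zero, so the update divides by eps alone and leaves the parameter unchanged. That is the only situation the default value guards against, which supports keeping it out of the experiment parameters.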

  3. May 2020
    1. import scipy.spatial.distance

       for query, query_embedding in zip(queries, query_embeddings):
           # Cosine distance from this query to every corpus embedding
           distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

      How to calculate cosine distance between vector and corpus
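
For reference, the "cosine" metric is one minus the cosine similarity. The sketch below reproduces that computation without SciPy and ranks a toy corpus against one query. The helper names `cosine_distance` and `rank_corpus` and the 2-dimensional vectors are made up for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_corpus(query_embedding, corpus_embeddings):
    """Return (corpus index, distance) pairs, closest first."""
    distances = [cosine_distance(query_embedding, e) for e in corpus_embeddings]
    return sorted(enumerate(distances), key=lambda pair: pair[1])

# Toy 2-d "embeddings"; real BERT vectors would have 768 dimensions
corpus_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = rank_corpus([2.0, 0.0], corpus_embeddings)
```

Because cosine distance ignores vector length, the query `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]`.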

    1. simple approach is to average the second-to-last hidden layer of each token, producing a single 768-length vector

      Proposal for obtaining a single vector for the whole sentence
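
A small sketch of that averaging step, assuming the hidden states are available as nested lists indexed `[layer][token][feature]`. The layout, the name `mean_pool` and the toy sizes are assumptions; a real model would give 768 features per token.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]

# Toy hidden states: 3 layers, 2 tokens, 4 features
hidden_states = [
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]],   # second-to-last layer
    [[9.0, 9.0, 9.0, 9.0], [9.0, 9.0, 9.0, 9.0]],
]

# Index -2 selects the second-to-last layer; its tokens are then averaged
sentence_vector = mean_pool(hidden_states[-2])
```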

    2. It is worth noting that word-level similarity comparisons are not appropriate with BERT embeddings because these embeddings are contextually dependent, meaning that the word vector changes depending on the sentence it appears in. This allows wonderful things like polysemy so that e.g. your representation encodes river “bank” and not a financial institution “bank”, but makes direct word-to-word similarity comparisons less valuable. However, for sentence embeddings similarity comparison is still valid such that one can query, for example, a single sentence against a dataset of other sentences in order to find the most similar. Depending on the similarity metric used, the resulting similarity values will be less informative than the relative ranking of similarity outputs since many similarity metrics make assumptions about the vector space (equally-weighted dimensions, for example) that do not hold for our 768-dimensional vector space.

      Thoughts on similarity comparison for word and sentence level embeddings.

    3. For out of vocabulary words that are composed of multiple subword and character-level embeddings, there is a further issue of how best to recover this embedding. Averaging the embeddings is the most straightforward solution (one that is relied upon in similar embedding models with subword vocabularies like fasttext), but summation of subword embeddings and simply taking the last token embedding (remember that the vectors are context sensitive) are acceptable alternative strategies.

      Strategies for getting an embedding for an OOV word
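
The three strategies named in the quote (average, sum, last token) can be sketched as follows. The function name `combine_subwords` and the toy 2-dimensional vectors are assumptions for illustration.

```python
def combine_subwords(subword_vectors, strategy="mean"):
    """Collapse the vectors of a word's subword pieces into one vector."""
    if strategy == "last":
        return list(subword_vectors[-1])
    summed = [sum(column) for column in zip(*subword_vectors)]
    if strategy == "sum":
        return summed
    if strategy == "mean":
        return [value / len(subword_vectors) for value in summed]
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical vectors for the pieces "em", "##bed", "##ding"
pieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean_vec = combine_subwords(pieces, "mean")
sum_vec = combine_subwords(pieces, "sum")
last_vec = combine_subwords(pieces, "last")
```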

    4. It should be noted that although the [CLS] acts as an “aggregate representation” for classification tasks, this is not the best choice for a high quality sentence embedding vector. According to BERT author Jacob Devlin: “I’m not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations.”

      About [CLS] token not being a good quality sentence level embedding :O

    5. In order to get the individual vectors we will need to combine some of the layer vectors…but which layer or combination of layers provides the best representation?

      Strategies for aggregating the information from 12 layers
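
Two options that are often tried for this are summing or concatenating the last four layers for each token. The quoted passage does not say which is best, so the sketch below only shows the mechanics. The helper names and the toy 2-feature layers are assumptions.

```python
def sum_last_layers(layers_for_token, n=4):
    """Element-wise sum of the last n layer vectors of one token."""
    return [sum(column) for column in zip(*layers_for_token[-n:])]

def concat_last_layers(layers_for_token, n=4):
    """Concatenate the last n layer vectors of one token, oldest first."""
    return [value for layer in layers_for_token[-n:] for value in layer]

# Toy token: 12 layers with 2 features each (layer i holds [i, 10 * i])
layers_for_token = [[float(i), float(10 * i)] for i in range(1, 13)]

summed = sum_last_layers(layers_for_token)        # same length as one layer
concatenated = concat_last_layers(layers_for_token)  # 4x the length of one layer
```

Summation keeps the vector at 768 dimensions for a real model, while concatenating four layers yields 3,072.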

    6. This object has four dimensions, in the following order:
       1. The layer number (12 layers)
       2. The batch number (1 sentence)
       3. The word / token number (22 tokens in our sentence)
       4. The hidden unit / feature number (768 features)
       That’s 202,752 unique values just to represent our one sentence!

      Expected dimensionality for a sentence embedding
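
The count in the quote is simply the product of the four dimensions, as this short check shows. The variable names are mine.

```python
from math import prod

# Shape from the quote: (layers, batch, tokens, hidden units)
shape = (12, 1, 22, 768)
total_values = prod(shape)

# Per token there are 12 layer vectors of 768 features each
values_per_token = shape[0] * shape[3]
print(total_values, values_per_token)
```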

    7. BERT offers an advantage over models like Word2Vec, because while each word has a fixed representation under Word2Vec regardless of the context within which the word appears, BERT produces word representations that are dynamically informed by the words around them.

      Advantage of BERT embedding over word2vec

    1. BERT for feature extraction: The fine-tuning approach isn’t the only way to use BERT. Just like ELMo, you can use the pre-trained BERT to create contextualized word embeddings. Then you can feed these embeddings to your existing model, a process the paper shows yields results not far behind fine-tuning BERT on a task such as named-entity recognition.

      How to extract embeddings from BERT

    1. ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.

      Aggregate sequence representation? Does it mean it is the sentence embedding?

    2. It is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language

      Estimated cost of pre-training the model from scratch

    1. I extracted embeddings from a pytorch model (pytorch_model.bin file). The code to extract is pasted here. It assumes the embeddings are stored with the name bert.embeddings.word_embeddings.weight.

      How to extract raw BERT input embeddings? Those are not context aware.

    1. about 30,000 vectors or embeddings (we can train the model with our own vocabulary if needed, though many factors must be considered before doing so, such as the need to pre-train the model from scratch with the new vocabulary). These vectors are referred to as raw vectors/embeddings in this post to distinguish them from their transformed counterparts once they pass through the BERT model. These learned raw vectors are similar to the vector output of a word2vec model: a single vector represents a word regardless of its different meanings or senses. For instance, all the different senses/meanings (cell phone, biological cell, prison cell) of a word like “cell” are combined into a single vector.

      BERT offers two kinds of embeddings:

      1. similar to word2vec - a single vector represents a word regardless of its different meanings or senses
      2. context aware embedding - after they pass through the model
    1. Given the disjoint vocabularies (Section 2) and the magnitude of improvement over BERT-Base (Section 4), we suspect that while an in-domain vocabulary is helpful, SCIBERT benefits most from the scientific corpus pretraining.

      The specific vocabulary only slightly increases the model accuracy. Most of the benefit comes from domain specific corpus pre-training.

    2. We construct SCIVOCAB, a new WordPiece vocabulary on our scientific corpus using the SentencePiece library. We produce both cased and uncased vocabularies and set the vocabulary size to 30K to match the size of BASEVOCAB. The resulting token overlap between BASEVOCAB and SCIVOCAB is 42%, illustrating a substantial difference in frequently used words between scientific and general domain texts

      For SciBERT they created a new vocabulary of the same size as BERT's. The overlap was 42%. Could we check what the overlap is in our case?
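
A rough way to run that check, assuming both vocabularies are available as lists of tokens. Measuring the overlap relative to the base vocabulary size is my assumption, since the quoted passage does not give the exact formula, and the toy tokens below are made up.

```python
def vocab_overlap(base_vocab, domain_vocab):
    """Fraction of base-vocabulary tokens that also occur in the domain vocabulary."""
    base, domain = set(base_vocab), set(domain_vocab)
    return len(base & domain) / len(base)

# Toy vocabularies; real ones would hold about 30K tokens each
base_vocab = ["the", "cell", "##s", "bank"]
domain_vocab = ["the", "cell", "##ase", "protein"]
overlap = vocab_overlap(base_vocab, domain_vocab)
```

When both vocabularies have the same size, as in the SciBERT setup, the choice of denominator does not change the result.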

    1. Although we could have constructed new WordPiece vocabulary based on biomedical corpora, we used the original vocabulary of BERTBASE for the following reasons: (i) compatibility of BioBERT with BERT, which allows BERT pre-trained on general domain corpora to be re-used, and makes it easier to interchangeably use existing models based on BERT and BioBERT and (ii) any new words may still be represented and fine-tuned for the biomedical domain using the original WordPiece vocabulary of BERT.

      BioBERT does not change the BERT vocabulary.

    1. def _tokenize(self, text):
           split_tokens = []
           if self.do_basic_tokenize:
               for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
                   for sub_token in self.wordpiece_tokenizer.tokenize(token):
                       split_tokens.append(sub_token)
           else:
               split_tokens = self.wordpiece_tokenizer.tokenize(text)
           return split_tokens

      How BERT tokenization works
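
The WordPiece step called above is a greedy longest-match-first split. Below is a simplified, standalone sketch of that idea. Lowercasing plus a whitespace split stands in for the basic tokenizer, and the function names and tiny vocabulary are assumptions, not the library's code.

```python
def wordpiece(token, vocab, unk_token="[UNK]"):
    """Greedily split one token into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        piece = None
        while start < end:
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter substring
        if piece is None:
            return [unk_token]                 # no piece fits: whole token is unknown
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text, vocab):
    # Lowercasing and whitespace splitting approximate the basic tokenizer
    return [p for word in text.lower().split() for p in wordpiece(word, vocab)]

vocab = {"the", "em", "##bed", "##ding", "##s"}
tokens = tokenize("The embeddings", vocab)
```

This is why a word missing from the vocabulary can still be represented: it is broken into known pieces, and only falls back to `[UNK]` when no split exists.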

    1. My initial experiments indicated that adding custom words to the vocab-file had some effects. However, at least on my corpus that can be described as "medical tweets", this effect just disappears after running the domain specific pretraining for a while. After spending quite some time on this, I have ended up dropping the custom vocab-files totally. Bert seems to be able to learn these specialised words by tokenizing them.

      Someone's experience with extending the vocabulary for medical data

    2. Since Bert does an excellent job in tokenising and learning these combinations, do not expect dramatic improvements by adding words to the vocab. In my experience adding very specific terms, like common long medical latin words, has some effect. Adding words like "footballs" will likely just have negative effects since the current vector is already pretty good.

      Expected improvement of extending the BERT vocabulary

    1. As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.

      What is the embedding algorithm for BERT?