Hypothesis

23 Matching Annotations

Nov 2020
www.sciencedirect.com www.sciencedirect.com

A study of active learning methods for named entity recognition in clinical text

1
1. dominik.lewy 16 Nov 2020
  
  in Public
  
  Active Learning for NER SURVEY
  
  ML_CPulse annotation

Tags

ML_CPulse

annotation

Annotators

dominik.lewy

URL

sciencedirect.com/science/article/pii/S1532046415002038
www.slideshare.net www.slideshare.net

MMR-based active machine learning for Bio named entity recognition

1
1. dominik.lewy 16 Nov 2020
  
  in Public
  
  Active Learning strategies
  
  ML_CPulse annotation

Tags

ML_CPulse

annotation

Annotators

dominik.lewy

URL

slideshare.net/seokhwankim7/mmrbased-active-machine-learning-for-bio-named-entity-recognition
Jun 2020
www.machinelearningmastery.com www.machinelearningmastery.com

Gentle Introduction to the Adam Optimization Algorithm for Deep Learning - Machine Learning Mastery

1
1. dominik.lewy 01 Jun 2020
  
  in Public
  
  epsilon is a very small number that prevents division by zero in the implementation (e.g. 1e-8).
  
  Further, learning rate decay can also be used with Adam. The paper uses a decay rate alpha = alpha/sqrt(t), updated each epoch (t), for the logistic regression demonstration.
  
  The Adam paper suggests: "Good default settings for the tested machine learning problems are alpha=0.001, beta1=0.9, beta2=0.999 and epsilon=10^-8."
  
  The TensorFlow documentation suggests some tuning of epsilon: "The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1."
  
  We can see that the popular deep learning libraries generally use the default parameters recommended by the paper:
  
    TensorFlow: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
    Keras: lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0
    Blocks: learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-08, decay_factor=1
    Lasagne: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
    Caffe: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
    MxNet: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8
    Torch: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8
  
  Should we expose EPS as one of the experiment parameters? I think that we shouldn't since it is a rather technical parameter.
  
  ML_CPulse model
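
To make the role of epsilon concrete, here is a minimal NumPy sketch of a single Adam update using the paper defaults quoted above. The function name `adam_step` and its argument layout are assumptions made for illustration, not the API of any of the listed libraries.

```python
import numpy as np

def adam_step(param, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new (param, m, v).

    adam_step is a hypothetical helper for illustration; t is the 1-based step count.
    """
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    # eps keeps the denominator non-zero when v_hat is (close to) zero.
    param = param - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# A zero gradient would give 0/0 without eps; with eps the parameter is simply unchanged.
p_zero, _, _ = adam_step(np.array([1.0]), np.array([0.0]), np.zeros(1), np.zeros(1), t=1)

# On the first step a non-zero gradient moves the parameter by roughly alpha.
p_step, _, _ = adam_step(np.array([1.0]), np.array([2.0]), np.zeros(1), np.zeros(1), t=1)
```

Since eps only matters when the second-moment estimate is tiny, it behaves like a numerical-stability knob rather than a modelling choice, which is consistent with the note above about not exposing it.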

Tags

ML_CPulse

model

Annotators

dominik.lewy

URL

machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
May 2020
github.com github.com

UKPLab/sentence-transformers

1
1. dominik.lewy 22 May 2020
  
  in Public
  
  for query, query_embedding in zip(queries, query_embeddings):
      distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]
  
  How to calculate the cosine distance between a query vector and the corpus embeddings
  
  ML_CPulse embedding
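
As a rough illustration of what the highlighted call computes, the sketch below reproduces cosine distance (one minus cosine similarity) in plain NumPy and ranks a toy corpus against one query. The helper name `cosine_distances` and the two-dimensional vectors are assumptions; real sentence embeddings would come from the model.

```python
import numpy as np

def cosine_distances(query_embedding, corpus_embeddings):
    """Cosine distance (1 - cosine similarity) from one query to every corpus row.

    Hypothetical helper; a NumPy stand-in for the cdist(..., "cosine") call.
    """
    q = np.asarray(query_embedding, dtype=float)
    c = np.asarray(corpus_embeddings, dtype=float)
    sims = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    return 1.0 - sims

# Toy 2-d "embeddings", invented purely for the example.
corpus_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_embedding = np.array([1.0, 0.0])

distances = cosine_distances(query_embedding, corpus_embeddings)
ranking = np.argsort(distances)  # corpus indices, closest first
```

Sorting by distance gives the most similar corpus entries first, which is usually all a retrieval step needs.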

Tags

ML_CPulse

embedding

Annotators

dominik.lewy

URL

github.com/UKPLab/sentence-transformers
mccormickml.com mccormickml.com

BERT Word Embeddings Tutorial · Chris McCormick

7
1. katarzyna.rembelska 19 May 2020
  
  in Public
  
  simple approach is to average the second to last hidden layer of each token producing a single 768 length vector
  
  A suggestion for how to obtain a single vector for the whole sentence
  
  ML_CPulse embedding
2. dominik.lewy 15 May 2020
  
  in Public
  
  It is worth noting that word-level similarity comparisons are not appropriate with BERT embeddings because these embeddings are contextually dependent, meaning that the word vector changes depending on the sentence it appears in. This allows wonderful things like polysemy so that e.g. your representation encodes river “bank” and not a financial institution “bank”, but makes direct word-to-word similarity comparisons less valuable. However, for sentence embeddings similarity comparison is still valid such that one can query, for example, a single sentence against a dataset of other sentences in order to find the most similar. Depending on the similarity metric used, the resulting similarity values will be less informative than the relative ranking of similarity outputs since many similarity metrics make assumptions about the vector space (equally-weighted dimensions, for example) that do not hold for our 768-dimensional vector space.
  
  Thoughts on similarity comparison for word and sentence level embeddings.
  
  ML_CPulse embedding
3. dominik.lewy 15 May 2020
  
  in Public
  
  For out of vocabulary words that are composed of multiple sentence and character-level embeddings, there is a further issue of how best to recover this embedding. Averaging the embeddings is the most straightforward solution (one that is relied upon in similar embedding models with subword vocabularies like fasttext), but summation of subword embeddings and simply taking the last token embedding (remember that the vectors are context sensitive) are acceptable alternative strategies.
  
  Strategies for obtaining an embedding for an OOV word
  
  ML_CPulse embedding
4. dominik.lewy 15 May 2020
  
  in Public
  
  It should be noted that although the [CLS] acts as an “aggregate representation” for classification tasks, this is not the best choice for a high quality sentence embedding vector. According to BERT author Jacob Devlin: “I’m not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations.”
  
  About [CLS] token not being a good quality sentence level embedding :O
  
  ML_CPulse embedding
5. dominik.lewy 15 May 2020
  
  in Public
  
  In order to get the individual vectors we will need to combine some of the layer vectors…but which layer or combination of layers provides the best representation?
  
  Strategies for aggregating the information from 12 layers
  
  ML_CPulse embedding
6. dominik.lewy 15 May 2020
  
  in Public
  
  This object has four dimensions, in the following order:
  
    The layer number (12 layers)
    The batch number (1 sentence)
    The word / token number (22 tokens in our sentence)
    The hidden unit / feature number (768 features)
  
  That’s 202,752 unique values just to represent our one sentence!
  
  Expected dimensionality for a sentence embedding
  
  ML_CPulse embedding
7. dominik.lewy 15 May 2020
  
  in Public
  
  BERT offers an advantage over models like Word2Vec, because while each word has a fixed representation under Word2Vec regardless of the context within which the word appears, BERT produces word representations that are dynamically informed by the words around them.
  
  Advantage of BERT embedding over word2vec
  
  ML_CPulse embedding
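
The shape arithmetic and the pooling idea from the annotations above can be checked with a small NumPy sketch. The array below is random data standing in for BERT's stacked hidden states (no model is loaded), and summing the last four layers is only one illustrative way to combine layers, chosen here as an assumption.

```python
import numpy as np

# Dimensions quoted in the annotation: layers, batch, tokens, hidden units.
n_layers, batch, n_tokens, hidden = 12, 1, 22, 768

rng = np.random.default_rng(0)
# Random stand-in for the stacked hidden states; NOT real BERT output.
hidden_states = rng.normal(size=(n_layers, batch, n_tokens, hidden))

total_values = hidden_states.size  # 12 * 1 * 22 * 768

# Sentence vector: average the second-to-last layer over all tokens.
second_to_last = hidden_states[-2][0]              # shape (22, 768)
sentence_embedding = second_to_last.mean(axis=0)   # shape (768,)

# Per-token vectors: one possible combination, summing the last four layers.
token_vectors = hidden_states[-4:].sum(axis=0)[0]  # shape (22, 768)
```

Whatever combination of layers is chosen, the result per token or per sentence stays 768-dimensional, so downstream code does not need to change when the strategy does.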

Tags

ML_CPulse

embedding

Annotators

dominik.lewy

katarzyna.rembelska

URL

mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
jalammar.github.io jalammar.github.io

The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)

1
1. dominik.lewy 18 May 2020
  
  in Public
  
  BERT for feature extraction: The fine-tuning approach isn’t the only way to use BERT. Just like ELMo, you can use the pre-trained BERT to create contextualized word embeddings. Then you can feed these embeddings to your existing model, a process the paper shows yields results not far behind fine-tuning BERT on a task such as named-entity recognition.
  
  How to extract embeddings from BERT
  
  ML_CPulse embedding

Tags

ML_CPulse

embedding

Annotators

dominik.lewy

URL

jalammar.github.io/illustrated-bert/
blog.usejournal.com blog.usejournal.com

Part1: BERT for Advance NLP with Transformers in Pytorch

2
1. dominik.lewy 15 May 2020
  
  in Public
  
  ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
  
  Aggregate sequence representation? Does it mean it is the sentence embedding?
  
  ML_CPulse embedding
2. dominik.lewy 15 May 2020
  
  in Public
  
  It is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language
  
  Estimate of the cost of pre-training the model from scratch
  
  ML_CPulse model

Tags

ML_CPulse

model

embedding

Annotators

dominik.lewy

URL

blog.usejournal.com/part1-bert-for-advance-nlp-with-transformers-in-pytorch-357579d63512
medium.com medium.com

Not at all. I should have made clear how I extracted in the post.

1
1. dominik.lewy 15 May 2020
  
  in Public
  
  I extracted embeddings from a pytorch model (pytorch_model.bin file). The code to extract is pasted here. It assumes the embeddings are stored with the name bert.embeddings.word_embeddings.weight.
  
  How to extract raw BERT input embeddings? Those are not context aware.
  
  ML_CPulse embedding

Tags

ML_CPulse

embedding

Annotators

dominik.lewy

URL

medium.com/@ajitrajasekharan/i-extracted-embeddings-from-a-pytorch-model-pytorch-model-bin-4686afe26135
towardsdatascience.com towardsdatascience.com

Examining BERT’s raw embeddings

1
1. dominik.lewy 15 May 2020
  
  in Public
  
  about 30,000 vectors or embeddings (we can train the model with our own vocabulary if needed, though this has many factors to be considered before doing so, such as the need to pre-train the model from scratch with the new vocabulary). These vectors are referred to as raw vectors/embeddings in this post to distinguish them from their transformed counterparts once they pass through the BERT model. These learned raw vectors are similar to the vector output of a word2vec model: a single vector represents a word regardless of its different meanings or senses. For instance, all the different senses/meanings (cell phone, biological cell, prison cell) of a word like “cell” are combined into a single vector.
  
  BERT offers two kinds of embeddings:
  
  raw input embeddings, similar to word2vec: a single vector represents a word regardless of its different meanings or senses
  
  context-aware embeddings, produced after the raw vectors pass through the model
  
  ML_CPulse embedding

Tags

ML_CPulse

embedding

Annotators

dominik.lewy

URL

towardsdatascience.com/examining-berts-raw-embeddings-fd905cb22df7
arxiv.org arxiv.org

SciBERT: A Pretrained Language Model for Scientific Text

2
1. dominik.lewy 15 May 2020
  
  in Public
  
  Given the disjoint vocabularies (Section 2) and the magnitude of improvement over BERT-Base (Section 4), we suspect that while an in-domain vocabulary is helpful, SciBERT benefits most from the scientific corpus pretraining.
  
  The specific vocabulary only slightly increases the model accuracy. Most of the benefit comes from domain specific corpus pre-training.
  
  ML_CPulse model
2. dominik.lewy 15 May 2020
  
  in Public
  
  We construct SCIVOCAB, a new WordPiece vocabulary on our scientific corpus using the SentencePiece library. We produce both cased and uncased vocabularies and set the vocabulary size to 30K to match the size of BASEVOCAB. The resulting token overlap between BASEVOCAB and SCIVOCAB is 42%, illustrating a substantial difference in frequently used words between scientific and general domain texts.
  
  For SciBERT they created a new vocabulary of the same size as BERT's. The overlap between the two was 42%. We could check what the overlap is in our case.
  
  ML_CPulse model
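
Checking the overlap for our own case could look roughly like the sketch below. The function name `vocab_overlap` and the toy token lists are assumptions; dividing by the base vocabulary size is also an assumption, chosen because both SciBERT vocabularies have the same size so either denominator gives the same figure there.

```python
def vocab_overlap(base_vocab, domain_vocab):
    """Fraction of base-vocabulary tokens that also occur in the domain vocabulary.

    Hypothetical helper: both arguments are iterables of token strings,
    e.g. the lines of two vocab.txt files with one token per line.
    """
    base, domain = set(base_vocab), set(domain_vocab)
    if not base:
        return 0.0
    return len(base & domain) / len(base)

# Toy vocabularies, invented for the example.
base_vocab = ["the", "of", "cell", "##ing", "bank"]
domain_vocab = ["the", "of", "cell", "protein", "##ase"]

overlap = vocab_overlap(base_vocab, domain_vocab)
```

A low overlap would suggest that many domain terms get split into several word pieces by the general-domain vocabulary.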

Tags

ML_CPulse

model

Annotators

dominik.lewy

URL

arxiv.org/pdf/1903.10676.pdf
academic.oup.com academic.oup.com

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

1
1. dominik.lewy 15 May 2020
  
  in Public
  
  Although we could have constructed new WordPiece vocabulary based on biomedical corpora, we used the original vocabulary of BERTBASE for the following reasons: (i) compatibility of BioBERT with BERT, which allows BERT pre-trained on general domain corpora to be re-used, and makes it easier to interchangeably use existing models based on BERT and BioBERT and (ii) any new words may still be represented and fine-tuned for the biomedical domain using the original WordPiece vocabulary of BERT.
  
  BioBERT does not change the BERT vocabulary.
  
  ML_CPulse model

Tags

ML_CPulse

model

Annotators

dominik.lewy

URL

academic.oup.com/bioinformatics/article/36/4/1234/5566506
huggingface.co huggingface.co

transformers.tokenization_bert — transformers 2.9.1 documentation

1
1. dominik.lewy 15 May 2020
  
  in Public
  
  def _tokenize(self, text):
      split_tokens = []
      if self.do_basic_tokenize:
          for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
              for sub_token in self.wordpiece_tokenizer.tokenize(token):
                  split_tokens.append(sub_token)
      else:
          split_tokens = self.wordpiece_tokenizer.tokenize(text)
      return split_tokens
  
  How BERT tokenization works
  
  ML_CPulse model
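
The two-stage flow in the highlighted method (basic tokenization, then WordPiece on each token) can be imitated with a self-contained toy version. This is a simplified sketch, not the transformers implementation: lowercasing and whitespace splitting stand in for the basic tokenizer, and `toy_vocab`, `wordpiece_tokenize` and `tokenize` are names invented for the example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into word pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        tokens.append(match)
        start = end
    return tokens

def tokenize(text, vocab):
    """Basic split (lowercase + whitespace, a crude stand-in), then WordPiece."""
    split_tokens = []
    for token in text.lower().split():
        split_tokens.extend(wordpiece_tokenize(token, vocab))
    return split_tokens

toy_vocab = {"the", "em", "##bed", "##ding", "##s"}
tokens = tokenize("The embeddings", toy_vocab)
unknown = tokenize("xyz", toy_vocab)
```

With this toy vocabulary, "embeddings" is split into `em`, `##bed`, `##ding`, `##s`, which shows how a word missing from the vocabulary can still be represented by known pieces.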

Tags

ML_CPulse

model

Annotators

dominik.lewy

URL

huggingface.co/transformers/_modules/transformers/tokenization_bert.html
github.com github.com

How to use my own additional vocabulary dictionary? · Issue #396 · google-research/bert

2
1. dominik.lewy 15 May 2020
  
  in Public
  
  My initial experiments indicated that adding custom words to the vocab-file had some effects. However, at least on my corpus that can be described as "medical tweets", this effect just disappears after running the domain specific pretraining for a while. After spending quite some time on this, I have ended up dropping the custom vocab-files totally. Bert seems to be able to learn these specialised words by tokenizing them.
  
  Someone's experience with extending the vocabulary for medical data
  
  ML_CPulse model
2. dominik.lewy 15 May 2020
  
  in Public
  
  Since Bert does an excellent job in tokenising and learning these combinations, do not expect dramatic improvements by adding words to the vocab. In my experience adding very specific terms, like common long medical latin words, has some effect. Adding words like "footballs" will likely just have negative effects since the current vector is already pretty good.
  
  Expected improvement of extending the BERT vocabulary
  
  ML_CPulse model

Tags

ML_CPulse

model

Annotators

dominik.lewy

URL

github.com/google-research/bert/issues/396
jalammar.github.io jalammar.github.io

The Illustrated Transformer

1
1. dominik.lewy 15 May 2020
  
  in Public
  
  As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
  
  What is the embedding algorithm for BERT?
  
  ML_CPulse model
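
As a partial answer to the question above: BERT's input layer does not use a separate embedding algorithm; it looks up learned token, position and segment embeddings, sums them, and applies layer normalization. The NumPy sketch below shows only that lookup-and-sum structure. The table sizes are toy values, the tables are random rather than trained, and the learned scale and bias of the layer normalization as well as dropout are omitted; all names are assumptions for illustration.

```python
import numpy as np

# Toy sizes for illustration; BERT-Base uses about 30,000 tokens and 768 hidden units.
vocab_size, max_positions, n_segments, hidden = 100, 16, 2, 8

rng = np.random.default_rng(0)
# Random stand-ins for the three learned lookup tables.
token_table = rng.normal(size=(vocab_size, hidden))
position_table = rng.normal(size=(max_positions, hidden))
segment_table = rng.normal(size=(n_segments, hidden))

def input_embeddings(token_ids, segment_ids, eps=1e-12):
    """Sum token, position and segment embeddings, then normalize each row."""
    token_ids = np.asarray(token_ids)
    segment_ids = np.asarray(segment_ids)
    positions = np.arange(len(token_ids))
    summed = (token_table[token_ids]
              + position_table[positions]
              + segment_table[segment_ids])
    mean = summed.mean(axis=-1, keepdims=True)
    var = summed.var(axis=-1, keepdims=True)
    return (summed - mean) / np.sqrt(var + eps)

# The same token id (7) appears at positions 1 and 3.
embedded = input_embeddings([5, 7, 9, 7], [0, 0, 0, 0])
```

Even at this input stage the two occurrences of token 7 get different vectors because of the position term; the contextual differences discussed elsewhere in these notes arise later, in the Transformer layers.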

Tags

ML_CPulse

model

Annotators

dominik.lewy

URL

jalammar.github.io/illustrated-transformer/
