47 Matching Annotations
  1. May 2021
    1. The second spatiotemporal variant is a “(2+1)D” convolutional block, which explicitly factorizes 3D convolution into two separate and successive operations, a 2D spatial convolution and a 1D temporal convolution.

      More nonlinearities and an easier optimization task
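
The factorization in the highlight can be made concrete with a parameter count. The sketch below is a minimal illustration, not the paper's code: the function names are hypothetical, and the formula for the intermediate channel count M is recalled from the R(2+1)D paper rather than taken from the excerpt, so treat it as an assumption.

```python
def params_3d(n_in, n_out, t, d):
    """Weights in one full 3D convolution with a t x d x d kernel (no bias)."""
    return n_in * n_out * t * d * d


def mid_channels(n_in, n_out, t, d):
    """Intermediate channels M that keep the (2+1)D block near the 3D budget.

    Assumed formula: M = floor(t*d^2*n_in*n_out / (d^2*n_in + t*n_out)).
    """
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def params_2plus1d(n_in, n_out, t, d):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    m = mid_channels(n_in, n_out, t, d)
    spatial = n_in * m * d * d      # 2D spatial convolution: n_in -> m
    temporal = m * n_out * t        # 1D temporal convolution: m -> n_out
    return spatial + temporal


full_3d = params_3d(64, 64, 3, 3)
factorized = params_2plus1d(64, 64, 3, 3)
```

With a nonlinearity placed between the two convolutions, the factorized block gains one extra nonlinearity at a comparable parameter count, which is what the note refers to.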

  2. Mar 2021
    1. Regardless of the size of output produced by the last convolutional layer, each network applies global spatiotemporal average pooling to the final convolutional tensor, followed by a fully-connected (fc) layer performing the final classification (the output dimension of the fc layer matches the number of classes, e.g., 400 for Kinetics).
    2. The first formulation is named mixed convolution (MC) and consists in employing 3D convolutions only in the early layers of the network, with 2D convolutions in the top layers.
  3. Jan 2021
    1. A typical value of τ we studied is 16; this refreshing speed is roughly 2 frames sampled per second for 30-fps videos.

      At what frequency the data is sampled for the Slow pathway
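
The arithmetic behind the quoted figure can be checked directly. This is a minimal sketch that assumes τ is a temporal stride measured in raw frames; the helper names and the 64-frame clip length are illustrative assumptions, not values from the excerpt.

```python
def sampled_frames_per_second(video_fps, tau):
    """Frames per second seen by a pathway that keeps one frame out of every tau."""
    return video_fps / tau


def strided_indices(clip_len, tau):
    """Indices of the frames kept from a clip of clip_len raw frames."""
    return list(range(0, clip_len, tau))


rate = sampled_frames_per_second(30, 16)   # 1.875, i.e. roughly 2 frames per second
indices = strided_indices(64, 16)          # hypothetical 64-frame clip
```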

    2. This method has been a foundation of many competitive results in the literature [12,13,55].

      Reference to v1b. This method does not use separate preprocessing in the form of optical flow calculation, unlike the network presented in v1b.

    3. One pathway is designed to capture semantic information that can be given by images or a few sparse frames, and it operates at low frame rates and slow refreshing speed. In contrast, the other pathway is responsible for capturing rapidly changing motion, by operating at fast refreshing speed and high temporal resolution. Despite its high temporal rate, this pathway is made very lightweight, e.g., ∼20% of total computation. This is because this pathway is designed to have fewer channels and weaker ability to process spatial information, while such information can be provided by the first pathway in a less redundant manner.

      Difference between pathways and computational complexity as % of total.

    4. a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution.

      Motivation.

    1. For the extraction of optical flow and warped optical flow, we choose the TVL1 optical flow algorithm [35] implemented in OpenCV with CUDA.

      Optical flow algorithm used. This one is required for the temporal CNN.

    2. We use the mini-batch stochastic gradient descent algorithm to learn the network parameters, where the batch size is set to 256 and momentum set to 0.9. We initialize network weights with pre-trained models from ImageNet [33].

      Hyperparameters and weights initialization.
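
For reference, the update rule those hyperparameters plug into can be sketched in a few lines. The batch size (256) and momentum (0.9) come from the highlight; the learning rate of 0.1, the toy one-parameter regression problem, and all function names are assumptions made only for illustration.

```python
def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update; returns the new (weights, velocity)."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def batch_gradient(w, batch):
    """Gradient of the mean of 0.5 * (w*x - y)^2 over one mini-batch."""
    return [sum((w[0] * x - y) * x for x, y in batch) / len(batch)]


BATCH_SIZE = 256
# Toy data following y = 2x, so the optimum is w = 2.
dataset = [((i % 16) / 16, 2 * (i % 16) / 16) for i in range(1024)]
weights, velocity = [0.0], [0.0]
for epoch in range(100):
    for start in range(0, len(dataset), BATCH_SIZE):
        batch = dataset[start:start + BATCH_SIZE]
        grads = batch_gradient(weights, batch)
        weights, velocity = sgd_momentum_step(weights, grads, velocity, lr=0.1)
```

In the paper the weights start from ImageNet pre-trained models rather than from zero, which this toy example does not model.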

    3. Data Augmentation. Data augmentation can generate diverse training samples and prevent severe over-fitting. In the original two-stream ConvNets, random cropping and horizontal flipping are employed to augment training samples. We exploit two new data augmentation techniques: corner cropping and scale-jittering.

      Traditional data augmentation techniques can be used for two-stream architectures.
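
A rough sketch of how corner cropping and scale jittering could be combined is shown below. The candidate crop sizes {256, 224, 192, 168} and the 340x256 frame size are recalled from the TSN paper, not from the highlight, so they are assumptions; the function names are hypothetical and the final resize to the network input size is left out.

```python
import random

CROP_SIZES = (256, 224, 192, 168)   # assumed candidate crop widths/heights


def corner_crop_boxes(img_w, img_h, crop_w, crop_h):
    """Return (left, top, right, bottom) boxes for the four corners and the center."""
    max_x, max_y = img_w - crop_w, img_h - crop_h
    offsets = [(0, 0), (max_x, 0), (0, max_y), (max_x, max_y),
               (max_x // 2, max_y // 2)]
    return [(x, y, x + crop_w, y + crop_h) for x, y in offsets]


def jittered_corner_crop(img_w, img_h, rng=random):
    """Pick a random crop width and height, then one of the five fixed positions."""
    crop_w = rng.choice(CROP_SIZES)   # scale jittering: width and height are
    crop_h = rng.choice(CROP_SIZES)   # drawn independently, so aspect ratio varies
    return rng.choice(corner_crop_boxes(img_w, img_h, crop_w, crop_h))


boxes = corner_crop_boxes(340, 256, 224, 224)
```

Restricting crops to corners and center, rather than arbitrary positions, avoids a random crop that implicitly favors the image center.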

    4. Network Inputs. We are also interested in exploring more input modalities to enhance the discriminative power of temporal segment networks. Originally, the two-stream ConvNets used RGB images for the spatial stream and stacked optical flow fields for the temporal stream.

      The standard input format for a two-stream architecture consists of RGB images and stacked optical flow fields.

    5. Here a class score G_i is inferred from the scores of the same class on all the snippets, using an aggregation function g. We empirically evaluated several different forms of the aggregation function g, including evenly averaging, maximum, and weighted averaging in our experiments. Among them, evenly averaging is used to report our final recognition accuracies.

      How the result is aggregated from the segment level to the video level.
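
The three aggregation functions named in the highlight can be sketched as follows. The function name, the list-of-lists layout, and the normalization of the weighted average are assumptions for illustration; the paper's own formulation may differ in detail.

```python
def aggregate(snippet_scores, how="average", weights=None):
    """Combine per-snippet class scores (K lists of C floats) into C video-level scores."""
    per_class = list(zip(*snippet_scores))   # C tuples, each holding K scores
    if how == "average":
        return [sum(s) / len(s) for s in per_class]
    if how == "max":
        return [max(s) for s in per_class]
    if how == "weighted":
        total = sum(weights)                 # normalizing here is an assumption
        return [sum(w * x for w, x in zip(weights, s)) / total for s in per_class]
    raise ValueError(f"unknown aggregation: {how}")


scores = [[1.0, 3.0], [2.0, 1.0], [3.0, 2.0]]   # K=3 snippets, C=2 classes
video_avg = aggregate(scores)
video_max = aggregate(scores, "max")
video_weighted = aggregate(scores, "weighted", weights=[1.0, 1.0, 2.0])
```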

    6. In experiments, the number of snippets K is set to 3 according to previous works on temporal modeling [16,17].

      The paper suggests 3 segments, while the implementation in GluonCV already uses 7. The question we should ask is how long the video clip should be; this should be an input to the data loader.
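
One way to see how the clip length and the number of segments interact is a small index sampler. This is a sketch under assumptions: the function name is hypothetical, it returns single frame indices rather than multi-frame snippets, and it is not the GluonCV data loader.

```python
import random


def sample_snippet_indices(num_frames, num_segments, train=True, rng=random):
    """Pick one frame index from each of num_segments equal-length segments."""
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))   # exclusive, at least one frame
        if train:
            indices.append(rng.randrange(start, end))  # random position in the segment
        else:
            indices.append((start + end - 1) // 2)     # deterministic segment center
    return indices


test_indices = sample_snippet_indices(90, 3, train=False)
```

Because the segments scale with `num_frames`, the number of sampled snippets stays at K regardless of how long the clip is; only the spacing between them changes.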

    7. Temporal segment network

      Visualization of the 2 stream (Spatial CNN and Temporal CNN) architecture

    8. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling

      Main contribution

    9. However, mainstream ConvNet frameworks [1,13] usually focus on appearances and short-term motions, thus lacking the capacity to incorporate long-range temporal structure. Recently there are a few attempts [19,4,20] to deal with this problem. These methods mostly rely on dense temporal sampling with a pre-defined sampling interval. This approach would incur excessive computational cost when applied to long video sequences, which limits its application in real-world practice and poses a risk of missing important information for videos longer than the maximal sequence length.

      Historical approach using a predefined sequence length

    10. In terms of temporal structure modeling, a key observation is that consecutive frames are highly redundant. Therefore, dense temporal sampling, which usually results in highly similar sampled frames, is unnecessary. Instead a sparse temporal sampling strategy will be more favorable in this case. Motivated by this observation, we develop a video-level framework, called temporal segment network (TSN). This framework extracts short snippets over a long video sequence with a sparse sampling scheme, where the samples distribute uniformly along the temporal dimension.

      Confirms the intuition towards sparse sampling

    11. Limited by computational cost these methods usually process sequences of fixed lengths ranging from 64 to 120 frames

      Number of frames processed by older approaches

  4. Nov 2020
  5. Jun 2020
    1. This is an algorithmic paradigm where w and v are alternatively minimized, one at a time while the other is held fixed. When v is fixed, the weighted loss is typically minimized by stochastic gradient descent.

      ...
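
The alternating scheme in the highlight can be sketched on a toy problem. Everything specific here is an assumption: the binary selection rule (keep a sample when its loss is below a threshold `lam`) is the classic self-paced learning choice as I recall it, the model is a single scalar, and the w-step uses a closed-form weighted mean instead of stochastic gradient descent to keep the example short.

```python
def alternate_minimization(data, lam, w_init, num_rounds=10):
    """Alternately update sample weights v and model parameter w.

    Model: a single scalar w fitted with the squared loss (x - w)^2.
    """
    w = w_init
    v = [1] * len(data)
    for _ in range(num_rounds):
        # Step 1: w fixed, minimize over v. Keep a sample only if its loss is below lam.
        v = [1 if (x - w) ** 2 < lam else 0 for x in data]
        if sum(v) == 0:
            break   # nothing selected; lam is too small for the current w
        # Step 2: v fixed, minimize the weighted loss over w (closed form here).
        w = sum(vi * x for vi, x in zip(v, data)) / sum(v)
    return w, v


data = [1.0, 1.2, 0.9, 1.1, 10.0]   # the last point is an outlier
w, v = alternate_minimization(data, lam=5.0, w_init=sum(data) / len(data))
```

The outlier is excluded in the first v-step and never re-enters, so w settles on the mean of the remaining four points.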

    2. commonly used in CL and self-paced learning (Kumar et al., 2010) requires alternative variable updates, which is difficult for training very deep CNNs via mini-batch stochastic gradient descent.

      OLD TRAINING

    3. Inspired by the recent success of Curriculum Learning (CL), this paper tackles this problem using CL (Bengio et al., 2009), a learning paradigm inspired by the cognitive process of human and animals, in which a model is learned gradually using samples ordered in a meaningful sequence

      STATIC CURRICULUM

    1. epsilon: a very small number to prevent any division by zero in the implementation (e.g. 1e-8). Further, learning rate decay can also be used with Adam. The paper uses a decay rate alpha = alpha/sqrt(t), updated each epoch (t), for the logistic regression demonstration.
       The Adam paper suggests: "Good default settings for the tested machine learning problems are alpha=0.001, beta1=0.9, beta2=0.999 and epsilon=10^-8."
       The TensorFlow documentation suggests some tuning of epsilon: "The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1."
       We can see that the popular deep learning libraries generally use the default parameters recommended by the paper:
       TensorFlow: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
       Keras: lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0
       Blocks: learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-08, decay_factor=1
       Lasagne: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
       Caffe: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
       MxNet: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8
       Torch: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8

      Should we expose EPS as one of the experiment parameters? I think that we shouldn't since it is a rather technical parameter.
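
To see where epsilon enters, here is a minimal scalar sketch of one Adam update using the defaults quoted in the highlight. The function name and the scalar-only formulation are assumptions for illustration; real library implementations operate on tensors and may place epsilon slightly differently.

```python
import math


def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)   # eps keeps the division finite
    return w, m, v


# A zero gradient would give 0/0 without eps; with it the weight is simply unchanged.
w_zero, _, _ = adam_step(w=1.0, grad=0.0, m=0.0, v=0.0, t=1)
# On the first step the update size is close to alpha, whatever the gradient scale.
w_first, _, _ = adam_step(w=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

Since epsilon only matters when the second-moment estimate is near zero, leaving it at the default and not exposing it as an experiment parameter seems reasonable in most setups.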

  6. May 2020
    1. for query, query_embedding in zip(queries, query_embeddings):
           distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

      How to calculate cosine distance between vector and corpus
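
For clarity about what the `"cosine"` metric returns, the same computation can be written without SciPy. This is a dependency-free sketch with hypothetical function names and toy two-dimensional vectors; in practice `cdist` is the faster choice for a large corpus.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, the quantity SciPy's "cosine" metric computes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_corpus(query_embedding, corpus_embeddings):
    """Return (corpus_index, distance) pairs sorted from closest to farthest."""
    distances = [cosine_distance(query_embedding, c) for c in corpus_embeddings]
    return sorted(enumerate(distances), key=lambda pair: pair[1])


corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy corpus embeddings
ranking = rank_corpus([2.0, 0.0], corpus)
```

A distance of 0 means the vectors point in the same direction and 1 means they are orthogonal; vector length does not matter.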

    1. BERT for feature extraction: The fine-tuning approach isn’t the only way to use BERT. Just like ELMo, you can use the pre-trained BERT to create contextualized word embeddings. Then you can feed these embeddings to your existing model – a process the paper shows yields results not far behind fine-tuning BERT on a task such as named-entity recognition.

      How to extract embeddings from BERT

    1. It is worth noting that word-level similarity comparisons are not appropriate with BERT embeddings because these embeddings are contextually dependent, meaning that the word vector changes depending on the sentence it appears in. This allows wonderful things like polysemy so that e.g. your representation encodes river “bank” and not a financial institution “bank”, but makes direct word-to-word similarity comparisons less valuable. However, for sentence embeddings similarity comparison is still valid such that one can query, for example, a single sentence against a dataset of other sentences in order to find the most similar. Depending on the similarity metric used, the resulting similarity values will be less informative than the relative ranking of similarity outputs since many similarity metrics make assumptions about the vector space (equally-weighted dimensions, for example) that do not hold for our 768-dimensional vector space.

      Thoughts on similarity comparison for word and sentence level embeddings.

    2. For out of vocabulary words that are composed of multiple sentence and character-level embeddings, there is a further issue of how best to recover this embedding. Averaging the embeddings is the most straightforward solution (one that is relied upon in similar embedding models with subword vocabularies like fasttext), but summation of subword embeddings and simply taking the last token embedding (remember that the vectors are context sensitive) are acceptable alternative strategies.

      Strategies for how to get an embedding for an OOV word
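
The three strategies from the highlight (average, sum, last token) differ only in how the subword vectors are collapsed. A minimal sketch, with a hypothetical function name and made-up three-dimensional vectors standing in for real 768-dimensional ones:

```python
def pool_subword_embeddings(subword_vectors, strategy="mean"):
    """Collapse the vectors of a word's subword tokens into one word vector."""
    if strategy == "last":
        return list(subword_vectors[-1])
    summed = [sum(dims) for dims in zip(*subword_vectors)]
    if strategy == "sum":
        return summed
    if strategy == "mean":
        return [s / len(subword_vectors) for s in summed]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical vectors for the two pieces of one out-of-vocabulary word.
pieces = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
word_mean = pool_subword_embeddings(pieces, "mean")
word_sum = pool_subword_embeddings(pieces, "sum")
word_last = pool_subword_embeddings(pieces, "last")
```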

    3. It should be noted that although the [CLS] acts as an “aggregate representation” for classification tasks, this is not the best choice for a high quality sentence embedding vector. According to BERT author Jacob Devlin: “I’m not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations.”

      About [CLS] token not being a good quality sentence level embedding :O

    4. In order to get the individual vectors we will need to combine some of the layer vectors…but which layer or combination of layers provides the best representation?

      Strategies for aggregating the information from 12 layers

    5. This object has four dimensions, in the following order:
       1. The layer number (12 layers)
       2. The batch number (1 sentence)
       3. The word / token number (22 tokens in our sentence)
       4. The hidden unit / feature number (768 features)
       That’s 202,752 unique values just to represent our one sentence!

      Expected dimensionality for a sentence embedding
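
The value count and two common ways of combining layers can be sketched on a toy tensor. The [layer][batch][token][feature] ordering follows the highlight; summing or concatenating the last four layers are strategies I am assuming for illustration (they are not prescribed by the excerpt), and the function names are hypothetical.

```python
NUM_LAYERS, BATCH, NUM_TOKENS, HIDDEN = 12, 1, 22, 768
total_values = NUM_LAYERS * BATCH * NUM_TOKENS * HIDDEN


def sum_last_layers(hidden_states, token_index, num_last=4, batch_index=0):
    """Sum one token's vectors over the last num_last layers.

    hidden_states is indexed as [layer][batch][token][feature].
    """
    vectors = [layer[batch_index][token_index] for layer in hidden_states[-num_last:]]
    return [sum(dims) for dims in zip(*vectors)]


def concat_last_layers(hidden_states, token_index, num_last=4, batch_index=0):
    """Concatenate one token's vectors from the last num_last layers."""
    out = []
    for layer in hidden_states[-num_last:]:
        out.extend(layer[batch_index][token_index])
    return out


# Toy tensor: 12 layers, 1 sentence, 2 tokens, 3 features; each value is its layer index.
toy = [[[[float(layer)] * 3 for _ in range(2)]] for layer in range(12)]
summed = sum_last_layers(toy, token_index=0)
concatenated = concat_last_layers(toy, token_index=0)
```

Summing keeps the vector at the hidden size, while concatenating four layers multiplies it by four (3072 values per token for a 768-feature model).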

    6. BERT offers an advantage over models like Word2Vec, because while each word has a fixed representation under Word2Vec regardless of the context within which the word appears, BERT produces word representations that are dynamically informed by the words around them.

      Advantage of BERT embedding over word2vec

    1. ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.

      Aggregate sequence representation? Does it mean it is the sentence embedding?

    2. It is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language

      Estimates on the model pre-training from scratch

    1. I extracted embeddings from a pytorch model (pytorch_model.bin file). The code to extract is pasted here. It assumes the embeddings are stored with the name bert.embeddings.word_embeddings.weight.

      How to extract raw BERT input embeddings? Those are not context aware.

    1. about 30,000 vectors or embeddings (we can train the model with our own vocabulary if needed, though this has many factors to be considered before doing so, such as the need to pre-train model from scratch with the new vocabulary). These vectors are referred to as raw vectors/embeddings in this post to distinguish them from their transformed counterparts once they pass through the BERT model. These learned raw vectors are similar to the vector output of a word2vec model: a single vector represents a word regardless of its different meanings or senses. For instance, all the different senses/meanings (cell phone, biological cell, prison cell) of a word like “cell” are combined into a single vector.

      BERT offers two kinds of embeddings:

      1. similar to word2vec - a single vector represents a word regardless of its different meanings or senses
      2. context aware embedding - after they pass through the model
    1. Given the disjoint vocabularies (Section 2) and the magnitude of improvement over BERT-Base (Section 4), we suspect that while an in-domain vocabulary is helpful, SCIBERT benefits most from the scientific corpus pretraining.

      The specific vocabulary only slightly increases the model accuracy. Most of the benefit comes from domain specific corpus pre-training.

    2. We construct SCIVOCAB, a new WordPiece vocabulary on our scientific corpus using the SentencePiece library. We produce both cased and uncased vocabularies and set the vocabulary size to 30K to match the size of BASEVOCAB. The resulting token overlap between BASEVOCAB and SCIVOCAB is 42%, illustrating a substantial difference in frequently used words between scientific and general domain texts

      For SciBERT they created a new vocabulary of the same size as BERT's. The overlap was 42%. We could check what the overlap is in our case.
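
Checking the overlap for our own vocabulary would only take a few lines. The sketch below assumes vocabulary files with one token per line and defines overlap as the share of the new vocabulary that also appears in the base one; the excerpt does not spell out how the 42% figure was computed, so that definition, like the function names, is an assumption.

```python
def load_vocab(path):
    """Read a vocabulary file with one token per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def vocab_overlap(base_vocab, new_vocab):
    """Fraction of tokens in new_vocab that also appear in base_vocab."""
    base, new = set(base_vocab), set(new_vocab)
    return len(base & new) / len(new)


# Toy vocabularies standing in for the two real files.
base = ["the", "cell", "##s", "bank"]
domain = ["the", "cell", "##ase", "protein"]
overlap = vocab_overlap(base, domain)
```

When both vocabularies have the same size, as in SciBERT, it makes no difference which one is used as the denominator.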

    1. Although we could have constructed new WordPiece vocabulary based on biomedical corpora, we used the original vocabulary of BERTBASE for the following reasons: (i) compatibility of BioBERT with BERT, which allows BERT pre-trained on general domain corpora to be re-used, and makes it easier to interchangeably use existing models based on BERT and BioBERT and (ii) any new words may still be represented and fine-tuned for the biomedical domain using the original WordPiece vocabulary of BERT.

      BioBERT does not change the BERT vocabulary.

    1. def _tokenize(self, text):
           split_tokens = []
           if self.do_basic_tokenize:
               # First split on whitespace/punctuation, then break each token into word pieces.
               for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
                   for sub_token in self.wordpiece_tokenizer.tokenize(token):
                       split_tokens.append(sub_token)
           else:
               # Skip basic tokenization and run WordPiece on the raw text.
               split_tokens = self.wordpiece_tokenizer.tokenize(text)
           return split_tokens

      How BERT tokenization works
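
The WordPiece step that `_tokenize` delegates to is, as far as I understand it, a greedy longest-match-first split. The following is a simplified sketch rather than the library's code: the function name is hypothetical, and it omits basic tokenization and the maximum word-length check.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1                           # try a shorter substring
        if match is None:
            return [unk_token]                 # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


vocab = {"un", "##aff", "##able"}
tokens = wordpiece_tokenize("unaffable", vocab)
unknown = wordpiece_tokenize("xyz", vocab)
```

This is why rare domain words rarely end up as `[UNK]`: as long as their fragments are in the vocabulary, they are represented as a sequence of pieces.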

    1. My initial experiments indicated that adding custom words to the vocab-file had some effects. However, at least on my corpus that can be described as "medical tweets", this effect just disappears after running the domain specific pretraining for a while. After spending quite some time on this, I have ended up dropping the custom vocab-files totally. Bert seems to be able to learn these specialised words by tokenizing them.

      Somebody's experience with extending the vocabulary for medical data

    2. Since Bert does an excellent job in tokenising and learning these combinations, do not expect dramatic improvements by adding words to the vocab. In my experience adding very specific terms, like common long medical latin words, has some effect. Adding words like "footballs" will likely just have negative effects since the current vector is already pretty good.

      Expected improvement of extending the BERT vocabulary

    1. As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.

      What is the embedding algorithm for BERT?

    1. e.g. an LSTM

      It can be the embedding from BERT as well.