55 Matching Annotations
1. Sep 2022
2. transformer-circuits.pub transformer-circuits.pub
1. Consider a toy model where we train an embedding of five features of varying importance (where “importance” is a scalar multiplier on mean squared error loss) in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features.
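
A minimal sketch of the forward pass and loss for such a toy model may help make the setup concrete. It assumes a tied-weight reconstruction of the form x' = ReLU(WᵀWx + b); the importance schedule, the sparsity value, and all function names are illustrative choices, and no training loop is shown.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def forward(W, b, x):
    """Assumed model: x' = ReLU(W^T W x + b), with W of shape (dims, features)."""
    m, n = len(W), len(x)
    h = [sum(W[k][i] * x[i] for i in range(n)) for k in range(m)]            # embed: h = W x
    out = [sum(W[k][j] * h[k] for k in range(m)) + b[j] for j in range(n)]   # read out: W^T h + b
    return relu(out)

def weighted_mse(x, x_hat, importance):
    """Squared error per feature, scaled by that feature's importance."""
    return sum(w * (a - c) ** 2 for w, a, c in zip(importance, x, x_hat))

def sample_sparse(n, sparsity, rng):
    """Each feature is zero with probability `sparsity`, else uniform in [0, 1)."""
    return [0.0 if rng.random() < sparsity else rng.random() for _ in range(n)]

rng = random.Random(0)
n_features, n_dims = 5, 2
importance = [0.7 ** i for i in range(n_features)]   # hypothetical decaying importance
W = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_dims)]
b = [0.0] * n_features

x = sample_sparse(n_features, sparsity=0.9, rng=rng)
x_hat = forward(W, b, x)
loss = weighted_mse(x, x_hat, importance)
```

Training would adjust `W` and `b` to minimise `loss` averaged over many sampled `x`; varying `sparsity` is the knob the excerpt refers to.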

#### URL

3. pyimagesearch.com pyimagesearch.com
1. Now, the progression of NLP, as discussed, tells a story. We begin with tokens and then build representations of these tokens. We use these representations to find similarities between tokens and embed them in a high-dimensional space. The same embeddings are also passed into sequential models that can process sequential data. Those models are used to build context and, in an ingenious way, attend to the parts of the input sentence that are useful to the output sentence in translation.
2. Data, matrix multiplications, repeated and scaled with non-linear switches. Maybe that simplifies things a lot, but even today, most architectures boil down to these principles. Even the most complex systems, ideas, and papers can be boiled down to just that:

#### URL

4. Jun 2022
5. direct.mit.edu direct.mit.edu
1. The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the key at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.
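
The attention step described here can be sketched for a single head as follows. In the standard formulation each query is scored against every key, the scores are passed through a softmax, and the resulting weights average the values. The projection matrices `Wq`, `Wk`, `Wv` and the helper names are illustrative assumptions; masking, multiple heads, and the feed-forward, normalization, and residual parts are left out.

```python
import math

def softmax(scores):
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    return [sum(row[i] * v[i] for i in range(len(v))) for row in M]

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a list of position vectors X."""
    Q = [matvec(Wq, x) for x in X]               # one query per position
    K = [matvec(Wk, x) for x in X]               # one key per position
    V = [matvec(Wv, x) for x in X]               # one value per position
    d = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Compare this query with the key at every position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted average of the values at all positions.
        out = [sum(weights[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]         # three positions, two dimensions each
outputs, weights = self_attention(X, identity, identity, identity)
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors.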

#### URL

6. e2eml.school e2eml.school
1. This trick of using a one-hot vector to pull out a particular row of a matrix is at the core of how transformers work.

Matrix multiplication as table lookup
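
The lookup trick can be checked directly with a small sketch: multiplying a one-hot row vector by a matrix returns exactly the selected row. The table values and helper names are made up for illustration.

```python
def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def vec_times_matrix(v, M):
    """Row vector times matrix: result[c] = sum over r of v[r] * M[r][c]."""
    return [sum(v[r] * M[r][c] for r in range(len(M))) for c in range(len(M[0]))]

# A hypothetical 4-row lookup table (for example, four token embeddings of width 3).
table = [
    [0.2, 0.8, 0.0],
    [0.5, 0.1, 0.4],
    [0.9, 0.0, 0.1],
    [0.3, 0.3, 0.4],
]

row = vec_times_matrix(one_hot(2, 4), table)   # picks out table[2]
```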

#### URL

7. May 2022
8. www.pnas.org www.pnas.org
1. Given the complexities of the brain’s structure and the functions it performs, any one of these models is surely oversimplified and ultimately wrong—at best, an approximation of some aspects of what the brain does. However, some models are less wrong than others, and consistent trends in performance across models can reveal not just which model best fits the brain but also which properties of a model underlie its fit to the brain, thus yielding critical insights that transcend what any single model can tell us.

#### URL

9. www.gwern.net www.gwern.net
1. Such a highly non-linear problem would clearly benefit from the computational power of many layers. Unfortunately, back-propagation learning generally slows down by an order of magnitude every time a layer is added to a network.

The problem in 1988

#### URL

10. Apr 2022
11. arxiv.org arxiv.org
1. Our pre-trained network is nearly identical to the “AlexNet” architecture (Krizhevsky et al., 2012), but with local response normalization layers after pooling layers following (Jia et al., 2014). It was trained with the Caffe framework on the ImageNet 2012 dataset (Deng et al., 2009).

#### URL

12. cs231n.github.io cs231n.github.io
1. Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume. Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in 2D space (e.g. 3x3), but full along the input depth (20).

These two examples are the first two layers of Andrej Karpathy's wonderful working ConvNetJS CIFAR-10 demo here

#### URL

13. cs.stanford.edu cs.stanford.edu
1. input (32x32x3): max activation 0.5, min -0.5; max gradient 1.08696, min -1.53051. conv (32x32x16): filter size 5x5x3, stride 1; max activation 3.75919, min -4.48241; max gradient 0.36571, min -0.33032; parameters: 16x5x5x3+16 = 1216.

The dimensions of these first two layers are explained here
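
The parameter count quoted for the conv layer (16x5x5x3+16 = 1216) follows from a simple formula, sketched below. The sketch assumes one bias per filter and weights shared across spatial positions; the function names are illustrative.

```python
def weights_per_neuron(filter_h, filter_w, in_depth):
    """Each neuron connects to a filter_h x filter_w window across the full input depth."""
    return filter_h * filter_w * in_depth

def conv_layer_params(num_filters, filter_h, filter_w, in_depth):
    """Total learnable parameters: shared weights plus one bias for each filter."""
    return num_filters * (weights_per_neuron(filter_h, filter_w, in_depth) + 1)

per_neuron = weights_per_neuron(5, 5, 3)       # 5*5*3 = 75
total = conv_layer_params(16, 5, 5, 3)         # 16 * (75 + 1) = 1216
```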

#### URL

1. Here the lower-level layers are frozen and are not trained; only the new classification head updates itself, learning from the features provided by the pre-trained, chopped-up model on the left.

#### URL

15. distill.pub distill.pub
1. Starting from random noise, we optimize an image to activate a particular neuron (layer mixed4a, unit 11).

And then we use that image as a kind of variable name to refer to the neuron in a way that is more helpful than the layer number and neuron index within the layer. This explanation is via one of Chris Olah's YouTube videos (https://www.youtube.com/watch?v=gXsKyZ_Y_i8)

#### URL

16. Mar 2022
17. quillette.com quillette.com
1. A special quality of humans, not shared by evolution or, as yet, by machines, is our ability to recognize gaps in our understanding and to take joy in the process of filling them in. It is a beautiful thing to experience the mysterious, and powerful, too.

#### URL

18. Feb 2022
19. neuralnetworksanddeeplearning.com neuralnetworksanddeeplearning.com
1. Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.

#### URL

20. Dec 2021
21. www.nature.com www.nature.com
1. To test whether these distributed representations of meaning are neurally plausible, a number of studies have attempted to learn a mapping between particular semantic dimensions and patterns of brain activation

#### URL

1. the only thing an artificial neuron can do: classify a data point into one of two kinds by examining input values with weights and bias.

How does this relate to "weighted sum shows similarity between the weights and the inputs"?

#### URL

23. medium.com medium.com
1. I’m particularly interested in two questions: First, just how weird is machine learning? Second, what sorts of choices do developers make as they shape a project?

#### URL

24. Nov 2021
25. www.cell.com www.cell.com
1. They use local computations to interpolate over task-relevant manifolds in a high-dimensional parameter space.

#### URL

26. e2eml.school e2eml.school
1. Now that we've made peace with the concepts of projections (matrix multiplications)

Projections are matrix multiplications. Why didn't you say so? Spatial and channel projections in the gated gMLP.

2. Computers are especially good at matrix multiplications. There is an entire industry around building computer hardware specifically for fast matrix multiplications. Any computation that can be expressed as a matrix multiplication can be made shockingly efficient.
3. The selective-second-order-with-skips model is a useful way to think about what transformers do, at least on the decoder side. It captures, to a first approximation, what generative language models like OpenAI's GPT-3 are doing.

#### URL

27. distill.pub distill.pub
1. The following figure presents a simple functional diagram of the neural network we will use throughout the article. The neural network is a sequence of linear (both convolutional and fully-connected), max-pooling, and ReLU layers, culminating in a softmax layer.

    Convolutional: a convolution calculates weighted sums of regions in the input. In neural networks, the learnable weights in convolutional layers are referred to as the kernel. (Image credit: https://towardsdatascience.com/gentle-dive-into-math-behind-convolutional-neural-networks-79a07dd44cf9.) See also Convolution arithmetic.

    Fully-connected: a fully-connected layer computes output neurons as a weighted sum of input neurons. In matrix form, it is a matrix that linearly transforms the input vector into the output vector.

    ReLU: first introduced by Nair and Hinton, ReLU calculates $f(x)=\max(0,x)$ for each entry in a vector input. Graphically, it is a hinge at the origin. (Image credit: https://pytorch.org/docs/stable/nn.html#relu.)

    Softmax: the softmax function calculates $S(y_i)=\frac{e^{y_i}}{\sum_{j=1}^{N} e^{y_j}}$ for each entry ($y_i$) in a vector input ($y$). (Image credit: https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/.)

This is a great visualization of MNIST hidden layers.
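
The two nonlinearities defined in the excerpt, ReLU and softmax, can be written in a few lines. This is a plain-Python sketch for illustration; subtracting the maximum inside `softmax` is a common numerical-stability choice and is not part of the quoted definition.

```python
import math

def relu(v):
    """f(x) = max(0, x), applied to each entry of the input vector."""
    return [max(0, x) for x in v]

def softmax(y):
    """S(y_i) = exp(y_i) / sum_j exp(y_j); shifting by max(y) leaves the result unchanged."""
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

activations = relu([-1.5, 0.0, 2.0])
probabilities = softmax([1.0, 2.0, 3.0])
```

The softmax outputs are positive and sum to 1, which is why the final layer can be read as a distribution over classes.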

#### URL

28. towardsdatascience.com towardsdatascience.com
1. The Query word can be interpreted as the word for which we are calculating Attention. The Key and Value word is the word to which we are paying attention, i.e., how relevant that word is to the Query word.

Finally

#### URL

29. www.lesswrong.com www.lesswrong.com
1. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.

#### URL

30. distill.pub distill.pub
1. The cube of activations that a neural network for computer vision develops at each hidden layer. Different slices of the cube allow us to target the activations of individual neurons, spatial positions, or channels.

This is first explanation of

#### URL

31. towardsdatascience.com towardsdatascience.com
1. The attention layer (W in the diagram) computes three vectors based on the input, termed key, query, and value.

Could you be more specific?

2. Attention is a means of selectively weighting different elements in input data, so that they will have an adjusted impact on the hidden states of downstream layers.

#### URL

32. www.pnas.org www.pnas.org
1. These findings provide strong evidence for a classic hypothesis about the computations underlying human language understanding, that the brain’s language system is optimized for predictive processing in the service of meaning extraction

#### URL

33. towardsdatascience.com towardsdatascience.com
1. To review, the Forget gate decides what is relevant to keep from prior steps. The input gate decides what information is relevant to add from the current step. The output gate determines what the next hidden state should be. Code Demo: For those of you who understand better through seeing the code, here is an example using Python pseudocode.
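
The three gates can be sketched as one step of a scalar LSTM cell. This is an illustrative sketch of the standard LSTM update, not the article's own code; the parameter dictionary `p` and its key names are hypothetical, and real implementations use vectors and weight matrices instead of scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; returns the new hidden and cell states."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])      # forget gate: what to keep from c_prev
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])      # input gate: how much new info to add
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])    # candidate values from the current step
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])      # output gate: what to expose as h
    c = f * c_prev + i * g                                     # updated cell state
    h = o * math.tanh(c)                                       # next hidden state
    return h, c

# Hypothetical parameters, chosen only so the example runs.
params = {k: 0.5 for k in
          ("wf", "uf", "bf", "wi", "ui", "bi", "wg", "ug", "bg", "wo", "uo", "bo")}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
```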

#### URL

34. Oct 2021
35. colah.github.io colah.github.io
1. This approach, visualizing high-dimensional representations using dimensionality reduction, is an extremely broadly applicable technique for inspecting models in deep learning.
2. These layers warp and reshape the data to make it easier to classify.

#### URL

1. Even with this very primitive single neuron, you can achieve 90% accuracy when recognizing a handwritten text image. To recognize all the digits from 0 to 9, you would need just ten neurons to recognize them with 92% accuracy.

And here is a Google Colab notebook that demonstrates that

#### URL

37. aegeorge42.github.io aegeorge42.github.io

#### URL

38. Sep 2021
39. arxiv.org arxiv.org
1. The models are developed in Python, using the Keras and TensorFlow libraries. Details on the code and dependencies to run the experiments are listed in a Readme file available together with the code in the Supplemental Material.

2. These results nonetheless show that it could be feasible to develop recurrent neural network models able to infer input-output behaviours of real biological systems, enabling researchers to advance their understanding of these systems even in the absence of detailed level of connectivity.

Too strong a claim?

3. We show that GRU models with a hidden layer size of 4 units are able to accurately reproduce with high accuracy the system’s response to very different stimuli.

#### URL

40. arxiv.org arxiv.org
1. Humans perform a version of this task when interpreting hard-to-understand speech, such as an accent which is particularly fast or slurred, or a sentence in a language we do not know very well: we do not necessarily hear every single word that is said, but we pick up on salient key words and contextualize the rest to understand the sentence.

Boy, don't they

#### URL

41. www.ccom.ucsd.edu www.ccom.ucsd.edu
1. A neural network will predict your digit in the blue square above. Your image is 784 pixels (= 28 rows by 28 columns with black=1 and white=0). Those 784 features get fed into a 3 layer neural network; Input:784 - AvgPool:196 - Dense:100 - Softmax:10.
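
The layer sizes quoted above (784 → 196 → 100 → 10) can be traced with a short sketch. It assumes the average pooling uses non-overlapping 2x2 windows, which is what reduces 28x28 = 784 inputs to 14x14 = 196, and that each dense layer has one bias per output; neither detail is stated in the excerpt.

```python
def avg_pool_2x2(image):
    """Average each non-overlapping 2x2 block of a 2D list with even dimensions."""
    rows, cols = len(image), len(image[0])
    return [
        [(image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

image = [[0] * 28 for _ in range(28)]            # blank 28x28 image: 784 pixels
pooled = avg_pool_2x2(image)
n_pooled = len(pooled) * len(pooled[0])          # 14 * 14 = 196

total_params = dense_params(n_pooled, 100) + dense_params(100, 10)   # 19700 + 1010
```

Under these assumptions the pooling layer has no learnable parameters, so all 20710 parameters sit in the two dense layers.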

#### URL

42. www.isca-speech.org www.isca-speech.org
1. Personalized ASR models. For each of the 432 participants with disordered speech, we create a personalized ASR model (SI-2) from their own recordings. Our fine-tuning procedure was optimized for our adaptation process, where we only have between ¼ and 2 h of data per speaker. We found that updating only the first five encoder layers (versus the complete model) worked best and successfully prevented overfitting 

#### URL

43. jalammar.github.io jalammar.github.io
1. So whenever you hear of someone “training” a neural network, it just means finding the weights we use to calculate the prediction.

#### URL

44. Aug 2021
45. stats.stackexchange.com stats.stackexchange.com
1. I'm going to try to provide an English text example. The following is based solely on my intuitive understanding of the paper 'Attention is all you need'.

This is also good

2. For the word q that your eyes see in the given sentence, what is the most related word k in the sentence to understand what q is about?
3. So basically: q = the vector representing a word; K and V = your memory, i.e., all the words that have been generated before. Note that K and V can be the same (but don't have to be). So what you do with attention is that you take your current query (a word in most cases) and look in your memory for similar keys. To come up with a distribution of relevant words, the softmax function is then used.

#### URL

46. colah.github.io colah.github.io
1. A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem. It’s true, essentially, because the hidden layer can be used as a lookup table.
2. Recursive Neural Networks

#### URL

47. arxiv.org arxiv.org
1. We show that BigBird is a universal approximator of sequence functions and is Turing complete,

#### URL

48. Jul 2021
49. www.codemotion.com www.codemotion.com
1. hyper-parameters, i.e., parameters external to the model, such as the learning rate, the batch size, the number of epochs.

#### URL

50. colah.github.io colah.github.io
1. Using multiple copies of a neuron in different places is the neural network equivalent of using functions. Because there is less to learn, the model learns more quickly and learns a better model. This technique – the technical name for it is ‘weight tying’ – is essential to the phenomenal results we’ve recently seen from deep learning.

#### URL

51. www.baeldung.com www.baeldung.com
1. Vectors with a small Euclidean distance from one another are located in the same region of a vector space. Vectors with a high cosine similarity are located in the same general direction from the origin.
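
A short sketch makes the contrast concrete: two vectors can point in exactly the same direction while being far apart, and two nearby vectors can point in slightly different directions. The example vectors are arbitrary.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 1.0]
b = [10.0, 10.0]    # same direction as a, but far away
c = [1.0, 2.0]      # close to a, but a slightly different direction

dist_ab, cos_ab = euclidean_distance(a, b), cosine_similarity(a, b)
dist_ac, cos_ac = euclidean_distance(a, c), cosine_similarity(a, c)
```

Here `b` wins on cosine similarity while `c` wins on Euclidean distance, so the two measures rank the neighbours of `a` differently.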

#### URL

1. If you're serious about neural networks, I have one recommendation. Try to rebuild this network from memory.

#### URL

53. mlech26l.github.io mlech26l.github.io
1. In our research, i.e., the wormnet project, we try to build machine learning models motivated by the C. elegans nervous system. By doing so, we have to pay a cost, as we constrain ourselves to such models in contrast to standard artificial neural networks, whose modeling space is purely constrained by memory and compute limitations. However, there are potentially some advantages and benefits we gain. Our objective is to better understand what’s necessary for effective neural information processing to emerge.

#### URL

54. Jun 2021
1. This dataset cannot be classified by a single neuron, as the two groups of data points can't be divided by a single line.
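
The excerpt does not say which dataset is meant, so the sketch below uses XOR as an assumed stand-in for a dataset that no single line can separate. It tries every weight and bias combination on a coarse grid for a single threshold neuron; the grid search is an illustration, not a proof, and all names are hypothetical.

```python
from itertools import product

def neuron(w1, w2, b, x1, x2):
    """Single threshold neuron: fires when the weighted sum plus bias is positive."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def find_separating_neuron(dataset, grid):
    """Return the first (w1, w2, b) on the grid that classifies every point, else None."""
    for w1, w2, b in product(grid, repeat=3):
        if all(neuron(w1, w2, b, x1, x2) == label for (x1, x2), label in dataset):
            return (w1, w2, b)
    return None

grid = [i / 2 for i in range(-8, 9)]             # -4.0 to 4.0 in steps of 0.5

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # linearly separable
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # not linearly separable

and_solution = find_separating_neuron(AND, grid)
xor_solution = find_separating_neuron(XOR, grid)
```

The search finds a neuron for AND but none for XOR, which matches the claim that a dataset like this needs more than one neuron.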

#### URL

56. Jun 2015
57. Local file Local file
1. aren miteinander aber

nn

#### Annotators

58. Jan 2015
59. cs231n.github.io cs231n.github.io
1. k - Nearest Neighbor Classifier

Is there a probabilistic interpretation of k-NN? Say, something like "k-NN is equivalent to [a probabilistic model] under the following conditions on the data and the k."