- Apr 2023
-
towardsdatascience.com towardsdatascience.com
-
Now we are getting somewhere. At this point, we also see that the dimensions of W and b for each layer are specified by the dimensions of the inputs and the number of nodes in each layer. Let’s clean up the above diagram by not labeling every w and b value individually.
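A minimal sketch of the point about dimensions. The layer sizes below are assumptions chosen for illustration, not taken from the article, and the convention W: (nodes, inputs) is one of two common choices.

```python
# Hypothetical network: 3 inputs, then layers with 4, 4, and 1 nodes.
layer_sizes = [3, 4, 4, 1]

def parameter_shapes(sizes):
    """Return (W shape, b shape) per layer, with W stored as (nodes, inputs)."""
    shapes = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        # One row of weights per node, one weight per input; one bias per node.
        shapes.append(((n_out, n_in), (n_out,)))
    return shapes

shapes = parameter_shapes(layer_sizes)
for i, (w_shape, b_shape) in enumerate(shapes, start=1):
    print(f"layer {i}: W {w_shape}, b {b_shape}")
```

Each W is fixed entirely by the size of the layer's input and its own node count, which is why the diagram can drop the individual w and b labels without losing information.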
-
- Feb 2023
-
arxiv.org arxiv.org
-
the Elhage et al. (2021) study showing an information-copying role for self-attention.
It turns out Meng does refer to induction heads, just not by name.
-
- Jan 2023
-
www.cs.toronto.edu www.cs.toronto.edu
-
The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning (Jabri and Flower, 1992).
-
- Dec 2022
-
rewriting.csail.mit.edu rewriting.csail.mit.edu
-
Our method is based on the hypothesis that the weights of a generator act as Optimal Linear Associative Memory (OLAM). OLAM is a classic single-layer neural data structure for memorizing associations that was described by Teuvo Kohonen and James A Anderson (independently) in the 1970s. In our case, we hypothesize that within a large modern multilayer convolutional network, each individual layer plays the role of an OLAM that stores a set of rules that associates keys, which denote meaningful context, with values, which determine output.
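A minimal sketch of a linear associative memory, not the authors' code. It assumes the keys are orthonormal, in which case the least-squares-optimal weight matrix reduces to a sum of outer products of values and keys. The keys, values, and helper names are made up for illustration.

```python
def outer(v, k):
    """Outer product v k^T as a list of rows."""
    return [[vi * kj for kj in k] for vi in v]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_vec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

# Hypothetical orthonormal keys (context) and the values (output) to associate.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
values = [[2.0, -1.0], [0.5, 3.0]]

# Store every key -> value rule in a single weight matrix W.
W = [[0.0] * 3 for _ in range(2)]
for k, v in zip(keys, values):
    W = mat_add(W, outer(v, k))

# Multiplying W by a stored key recalls the associated value.
recalled = [mat_vec(W, k) for k in keys]
print(recalled)
```

With non-orthogonal keys the plain sum of outer products gives crosstalk between rules, and the optimal solution needs a pseudoinverse of the key matrix instead.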
-
- Sep 2022
-
transformer-circuits.pub transformer-circuits.pub
-
Consider a toy model where we train an embedding of five features of varying importance (where “importance” is a scalar multiplier on mean squared error loss) in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features.
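A rough sketch of the forward pass and loss of such a toy model, under assumptions: random untrained weights, a geometric importance decay of 0.7, and uniform feature values. Training is left out, and none of the numbers come from the article.

```python
import random

N_FEATURES, N_DIMS = 5, 2
rng = random.Random(0)

# Assumed setup: W embeds 5 features into 2 dimensions; b is a per-feature bias.
W = [[rng.uniform(-1, 1) for _ in range(N_FEATURES)] for _ in range(N_DIMS)]
b = [0.0] * N_FEATURES
importance = [0.7 ** i for i in range(N_FEATURES)]  # hypothetical decay

def sample_features(sparsity, rng):
    """Each feature is zero with probability `sparsity`, else uniform in [0, 1)."""
    return [0.0 if rng.random() < sparsity else rng.random()
            for _ in range(N_FEATURES)]

def reconstruct(x):
    """Embed into 2 dimensions, map back with the same weights, then apply ReLU."""
    h = [sum(W[d][i] * x[i] for i in range(N_FEATURES)) for d in range(N_DIMS)]
    return [max(0.0, sum(W[d][i] * h[d] for d in range(N_DIMS)) + b[i])
            for i in range(N_FEATURES)]

def weighted_loss(x, x_hat):
    """Squared error with importance as a scalar multiplier per feature."""
    return sum(imp * (xi - xh) ** 2
               for imp, xi, xh in zip(importance, x, x_hat))

x = sample_features(sparsity=0.9, rng=rng)
x_hat = reconstruct(x)
loss = weighted_loss(x, x_hat)
print(x, x_hat, loss)
```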
-
-
pyimagesearch.com pyimagesearch.com
-
Now, the progression of NLP, as discussed, tells a story. We begin with tokens and then build representations of these tokens. We use these representations to find similarities between tokens and embed them in a high-dimensional space. The same embeddings are also passed into sequential models that can process sequential data. Those models are used to build context and, in an ingenious way, attend to parts of the input sentence that are useful to the output sentence in translation.
-
Data, matrix multiplications, repeated and scaled with non-linear switches. Maybe that simplifies things a lot, but even today, most architectures boil down to these principles. Even the most complex systems, ideas, and papers can be boiled down to just that:
-
- Jun 2022
-
direct.mit.edu direct.mit.edu
-
The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the key at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.
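A minimal sketch of one attention step in plain Python, assuming the standard scaled dot-product form: each query is scored against the key at every position, the scores go through a softmax, and the result is a weighted average of the values. The toy vectors are made up, and multiple heads, masking, and the learned projections that produce queries, keys, and values are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, return the attention-weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score the query against the key at every position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the values at all positions.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Hypothetical toy input: three positions, 2-dimensional vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(Q, K, V))
```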
-
-
e2eml.school e2eml.school
-
This trick of using a one-hot vector to pull out a particular row of a matrix is at the core of how transformers work.
Matrix multiplication as table lookup
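A small sketch of the lookup trick, with a made-up 3x2 table: multiplying a one-hot row vector by a matrix returns exactly the row selected by the position of the 1.

```python
# Hypothetical lookup table: three rows, e.g. one embedding per token.
table = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(m)))
            for j in range(len(m[0]))]

# The 1 sits at index 1, so the product is row 1 of the table.
row = vec_mat(one_hot(1, len(table)), table)
print(row)
```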
-
- May 2022
-
www.pnas.org www.pnas.org
-
Given the complexities of the brain’s structure and the functions it performs, any one of these models is surely oversimplified and ultimately wrong—at best, an approximation of some aspects of what the brain does. However, some models are less wrong than others, and consistent trends in performance across models can reveal not just which model best fits the brain but also which properties of a model underlie its fit to the brain, thus yielding critical insights that transcend what any single model can tell us.
-
-
www.gwern.net www.gwern.net
-
Such a highly non-linear problem would clearly benefit from the computational power of many layers. Unfortunately, back-propagation learning generally slows down by an order of magnitude every time a layer is added to a network.
The problem in 1988
-
- Apr 2022
-
-
Our pre-trained network is nearly identical to the “AlexNet” architecture (Krizhevsky et al., 2012), but with local response normalization layers after pooling layers following (Jia et al., 2014). It was trained with the Caffe framework on the ImageNet 2012 dataset (Deng et al., 2009)
-
-
cs231n.github.io cs231n.github.io
-
Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume. Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in 2D space (e.g. 3x3), but full along the input depth (20).
These two examples are the first two layers of Andrej Karpathy's wonderful working ConvNetJS CIFAR-10 demo here
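The arithmetic in the two examples can be checked with a short sketch. The function names are made up for illustration; the 16-filter case matches the parameter count reported for the conv layer in the ConvNetJS demo.

```python
def weights_per_neuron(filter_h, filter_w, input_depth):
    """Connectivity is local in 2D but spans the full input depth."""
    return filter_h * filter_w * input_depth

def conv_layer_parameters(n_filters, filter_h, filter_w, input_depth):
    """Each filter has its own weights plus one bias."""
    return n_filters * (weights_per_neuron(filter_h, filter_w, input_depth) + 1)

print(weights_per_neuron(5, 5, 3))            # Example 1: 5*5*3 = 75
print(weights_per_neuron(3, 3, 20))           # Example 2: 3*3*20 = 180
print(conv_layer_parameters(16, 5, 5, 3))     # 16x5x5x3 + 16 = 1216
```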
-
-
cs.stanford.edu cs.stanford.edu
-
input (32x32x3)
max activation: 0.5, min: -0.5
max gradient: 1.08696, min: -1.53051
[Image panels omitted: Activations, Activation Gradients, Weights, Weight Gradients]
conv (32x32x16)
filter size 5x5x3, stride 1
max activation: 3.75919, min: -4.48241
max gradient: 0.36571, min: -0.33032
parameters: 16x5x5x3+16 = 1216
The dimensions of these first two layers are explained here
-
-
codelabs.developers.google.com codelabs.developers.google.com
-
Here the lower-level layers are frozen and are not trained; only the new classification head will update itself to learn from the features provided by the pre-trained, chopped-up model on the left.
-
-
distill.pub distill.pub
-
Starting from random noise, we optimize an image to activate a particular neuron (layer mixed4a, unit 11).
And then we use that image as a kind of variable name to refer to the neuron in a way that is more helpful than the layer number and neuron index within the layer. This explanation is via one of Chris Olah's YouTube videos (https://www.youtube.com/watch?v=gXsKyZ_Y_i8)
-
- Mar 2022
-
quillette.com quillette.com
-
A special quality of humans, not shared by evolution or, as yet, by machines, is our ability to recognize gaps in our understanding and to take joy in the process of filling them in. It is a beautiful thing to experience the mysterious, and powerful, too.
-
- Feb 2022
-
neuralnetworksanddeeplearning.com neuralnetworksanddeeplearning.com
-
Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.
-
- Dec 2021
-
www.nature.com www.nature.com
-
To test whether these distributed representations of meaning are neurally plausible, a number of studies have attempted to learn a mapping between particular semantic dimensions and patterns of brain activation
-
-
cloud.google.com cloud.google.com
-
the only thing an artificial neuron can do: classify a data point into one of two kinds by examining input values with weights and bias.
How does this relate to "weighted sum shows similarity between the weights and the inputs"?
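A minimal sketch of such a neuron, with hypothetical weights and bias. The weighted sum is a dot product, which grows as the input points in the same direction as the weight vector, so thresholding it splits the plane along a single line.

```python
def neuron(inputs, weights, bias):
    """Classify a data point as 1 or 0 by thresholding the weighted sum."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0

# Hypothetical parameters: the decision boundary is the line x1 + x2 = 1.
weights = [1.0, 1.0]
bias = -1.0

print(neuron([0.2, 0.3], weights, bias))  # below the line
print(neuron([0.9, 0.8], weights, bias))  # above the line
```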
-
-
-
I’m particularly interested in two questions: First, just how weird is machine learning? Second, what sorts of choices do developers make as they shape a project?
-
- Nov 2021
-
www.cell.com www.cell.com
-
They use local computations to interpolate over task-relevant manifolds in a high-dimensional parameter space.
-
-
e2eml.school e2eml.school
-
Now that we've made peace with the concepts of projections (matrix multiplications)
Projections are matrix multiplications. Why didn't you say so? Spatial and channel projections in the gated gMLP.
-
Computers are especially good at matrix multiplications. There is an entire industry around building computer hardware specifically for fast matrix multiplications. Any computation that can be expressed as a matrix multiplication can be made shockingly efficient.
-
The selective-second-order-with-skips model is a useful way to think about what transformers do, at least on the decoder side. It captures, to a first approximation, what generative language models like OpenAI's GPT-3 are doing.
-
-
distill.pub distill.pub
-
The following figure presents a simple functional diagram of the neural network we will use throughout the article. The neural network is a sequence of linear (both convolutional and fully-connected), max-pooling, and ReLU layers, culminating in a softmax layer.
Convolutional: A convolution calculates weighted sums of regions in the input. In neural networks, the learnable weights in convolutional layers are referred to as the kernel. (Image credit: https://towardsdatascience.com/gentle-dive-into-math-behind-convolutional-neural-networks-79a07dd44cf9.) See also Convolution arithmetic.
Fully-connected: A fully-connected layer computes output neurons as a weighted sum of input neurons. In matrix form, it is a matrix that linearly transforms the input vector into the output vector.
ReLU: First introduced by Nair and Hinton, ReLU calculates $f(x) = \max(0, x)$ for each entry in a vector input. Graphically, it is a hinge at the origin. (Image credit: https://pytorch.org/docs/stable/nn.html#relu)
Softmax: The softmax function calculates $S(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{N} e^{y_j}}$ for each entry ($y_i$) in a vector input ($y$). (Image credit: https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/)
This is a great visualization of MNIST hidden layers.
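The two elementwise formulas quoted above, ReLU and softmax, can be sketched in a few lines of plain Python. The input vectors are made up, and subtracting the maximum inside softmax is a common numerical-stability step that leaves the result unchanged.

```python
import math

def relu(xs):
    """f(x) = max(0, x) for each entry of the input vector."""
    return [max(0.0, x) for x in xs]

def softmax(ys):
    """S(y_i) = exp(y_i) / sum_j exp(y_j) for each entry of the input vector."""
    m = max(ys)  # shift by the max to avoid overflow
    exps = [math.exp(y - m) for y in ys]
    total = sum(exps)
    return [e / total for e in exps]

print(relu([-1.0, 0.0, 2.0]))
probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```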
-
-
towardsdatascience.com towardsdatascience.com
-
The Query word can be interpreted as the word for which we are calculating Attention. The Key and Value word is the word to which we are paying attention, i.e., how relevant that word is to the Query word.
Finally
-
-
www.lesswrong.com www.lesswrong.com
-
Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.
-
-
distill.pub distill.pub
-
The cube of activations that a neural network for computer vision develops at each hidden layer. Different slices of the cube allow us to target the activations of individual neurons, spatial positions, or channels.
This is first explanation of
-
-
towardsdatascience.com towardsdatascience.com
-
The attention layer (W in the diagram) computes three vectors based on the input, termed key, query, and value.
Could you be more specific?
-
Attention is a means of selectively weighting different elements in input data, so that they will have an adjusted impact on the hidden states of downstream layers.
-
-
www.pnas.org www.pnas.org
-
These findings provide strong evidence for a classic hypothesis about the computations underlying human language understanding, that the brain’s language system is optimized for predictive processing in the service of meaning extraction
-
-
towardsdatascience.com towardsdatascience.com
-
To review, the Forget gate decides what is relevant to keep from prior steps. The input gate decides what information is relevant to add from the current step. The output gate determines what the next hidden state should be.
Code Demo
For those of you who understand better through seeing the code, here is an example using Python pseudocode.
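The article's own code is not captured in this highlight. As a stand-in, here is a minimal sketch of one step of a single-unit LSTM with scalar weights; the parameter values and the `params` layout are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step; params maps gate name -> (input weight, hidden weight, bias)."""
    def gate(name, activation):
        w_x, w_h, bias = params[name]
        return activation(w_x * x + w_h * h_prev + bias)

    f = gate("forget", sigmoid)        # what to keep from prior steps
    i = gate("input", sigmoid)         # how much new information to add
    g = gate("candidate", math.tanh)   # candidate values from the current step
    o = gate("output", sigmoid)        # what the next hidden state exposes

    c = f * c_prev + i * g             # new cell state
    h = o * math.tanh(c)               # new hidden state
    return h, c

# Hypothetical parameters and input sequence.
params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, -0.2, 0.0),
    "candidate": (1.0, 0.3, 0.0),
    "output": (0.4, 0.2, 0.0),
}
h, c = 0.0, 0.0
for x in [0.5, -0.1, 0.9]:
    h, c = lstm_step(x, h, c, params)
    print(h, c)
```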
-
- Oct 2021
-
colah.github.io colah.github.io
-
This approach, visualizing high-dimensional representations using dimensionality reduction, is an extremely broadly applicable technique for inspecting models in deep learning.
-
These layers warp and reshape the data to make it easier to classify.
-
-
cloud.google.com cloud.google.com
-
Even with this very primitive single neuron, you can achieve 90% accuracy when recognizing a handwritten text image. To recognize all the digits from 0 to 9, you would need just ten neurons to recognize them with 92% accuracy.
And here is a Google Colab notebook that demonstrates that
-
-
aegeorge42.github.io aegeorge42.github.io
-
- Sep 2021
-
arxiv.org arxiv.org
-
The models are developed in Python [46], using the Keras [47] and Tensorflow [48] libraries. Details on the code and dependencies to run the experiments are listed in a Readme file available together with the code in the Supplemental Material.
I have not found the code or Readme file
-
These results nonetheless show that it could be feasible to develop recurrent neural network models able to infer input-output behaviours of real biological systems, enabling researchers to advance their understanding of these systems even in the absence of detailed level of connectivity.
Too strong a claim?
-
We show that GRU models with a hidden layer size of 4 units are able to reproduce with high accuracy the system’s response to very different stimuli.
-
-
-
Humans perform a version of this task when interpreting hard-to-understand speech, such as an accent which is particularly fast or slurred, or a sentence in a language we do not know very well; we do not necessarily hear every single word that is said, but we pick up on salient key words and contextualize the rest to understand the sentence.
Boy, don't they
-
-
www.ccom.ucsd.edu www.ccom.ucsd.edu
-
A neural network will predict your digit in the blue square above. Your image is 784 pixels (= 28 rows by 28 columns with black=1 and white=0). Those 784 features get fed into a 3 layer neural network; Input:784 - AvgPool:196 - Dense:100 - Softmax:10.
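A sketch of how the layer sizes fit together, under two assumptions not stated in the quote: the 784 to 196 reduction comes from 2x2 average pooling of the 28x28 image, and each dense layer has one bias per output unit.

```python
def avg_pool_2x2(image):
    """Average each non-overlapping 2x2 block of a 2D image."""
    rows, cols = len(image), len(image[0])
    return [[(image[r][c] + image[r][c + 1]
              + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
             for c in range(0, cols, 2)]
            for r in range(0, rows, 2)]

# Blank 28x28 image standing in for a drawn digit (black=1, white=0).
image = [[0.0] * 28 for _ in range(28)]
pooled = avg_pool_2x2(image)

n_pooled = len(pooled) * len(pooled[0])      # 14 * 14 = 196 features
dense_params = n_pooled * 100 + 100          # Dense:100 weights plus biases
softmax_params = 100 * 10 + 10               # Softmax:10 weights plus biases
print(n_pooled, dense_params, softmax_params)
```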
-
-
www.isca-speech.org www.isca-speech.org
-
Personalized ASR models. For each of the 432 participants with disordered speech, we create a personalized ASR model (SI-2) from their own recordings. Our fine-tuning procedure was optimized for our adaptation process, where we only have between ¼ and 2 h of data per speaker. We found that updating only the first five encoder layers (versus the complete model) worked best and successfully prevented overfitting [10]
-
-
jalammar.github.io jalammar.github.io
-
So whenever you hear of someone “training” a neural network, it just means finding the weights we use to calculate the prediction.
-
- Aug 2021
-
stats.stackexchange.com stats.stackexchange.com
-
I'm going to try to provide an English text example. The following is based solely on my intuitive understanding of the paper 'Attention is all you need'.
This is also good
-
For the word q that your eyes see in the given sentence, what is the most related word k in the sentence to understand what q is about?
-
So basically: q = the vector representing a word; K and V = your memory, thus all the words that have been generated before. Note that K and V can be the same (but don't have to). So what you do with attention is that you take your current query (word in most cases) and look in your memory for similar keys. To come up with a distribution of relevant words, the softmax function is then used.
-
-
colah.github.io colah.github.io
-
A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem. It’s true, essentially, because the hidden layer can be used as a lookup table.
-
Recursive Neural Networks
-
-
arxiv.org arxiv.org
-
We show that BigBird is a universal approximator of sequence functions and is Turing complete,
-
- Jul 2021
-
www.codemotion.com www.codemotion.com
-
hyper-parameters, i.e., parameters external to the model, such as the learning rate, the batch size, the number of epochs.
-
-
colah.github.io colah.github.io
-
Using multiple copies of a neuron in different places is the neural network equivalent of using functions. Because there is less to learn, the model learns more quickly and learns a better model. This technique – the technical name for it is ‘weight tying’ – is essential to the phenomenal results we’ve recently seen from deep learning.
-
-
www.baeldung.com www.baeldung.com
-
Vectors with a small Euclidean distance from one another are located in the same region of a vector space. Vectors with a high cosine similarity are located in the same general direction from the origin.
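A small sketch contrasting the two measures, using made-up 2D vectors: `a` and `b` point in the same direction but are far apart, while `a` and `c` are close together but point in slightly different directions.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 1.0]
b = [10.0, 10.0]   # same direction as a, much farther from the origin
c = [1.0, 0.9]     # same region as a, slightly different direction

print(euclidean_distance(a, b), cosine_similarity(a, b))
print(euclidean_distance(a, c), cosine_similarity(a, c))
```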
-
-
iamtrask.github.io iamtrask.github.io
-
If you're serious about neural networks, I have one recommendation. Try to rebuild this network from memory.
-
-
mlech26l.github.io mlech26l.github.io
-
In our research, i.e., the wormnet project, we try to build machine learning models motivated by the C. elegans nervous system. By doing so, we have to pay a cost, as we constrain ourselves to such models in contrast to standard artificial neural networks, whose modeling space is purely constrained by memory and compute limitations. However, there are potentially some advantages and benefits we gain. Our objective is to better understand what’s necessary for effective neural information processing to emerge.
-
- Jun 2021
-
cloud.google.com cloud.google.com
-
This dataset cannot be classified by a single neuron, as the two groups of data points can't be divided by a single line.
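A small sketch of the same limitation, using XOR as a stand-in (an assumption; the article's dataset is not reproduced here). A search over a grid of weights and biases finds a single threshold neuron for AND but none for XOR, because no single line separates the XOR classes.

```python
from itertools import product

def neuron(x, w1, w2, b):
    """A single neuron: threshold on the weighted sum of two inputs."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0

def find_separator(labels, grid):
    """Return the first (w1, w2, b) on the grid that classifies every point, else None."""
    for w1, w2, b in product(grid, repeat=3):
        if all(neuron(x, w1, w2, b) == y for x, y in labels.items()):
            return (w1, w2, b)
    return None

# Candidate weights and biases from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_solution = find_separator(AND, grid)
xor_solution = find_separator(XOR, grid)
print(and_solution, xor_solution)
```

The grid search is only a demonstration; the XOR result holds for any real-valued weights, since the four constraints on w1, w2, and b contradict each other.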
-
- Jun 2015
-
Local file Local file
- Jan 2015
-
cs231n.github.io cs231n.github.io
-
k-Nearest Neighbor Classifier
Is there a probabilistic interpretation of k-NN? Say, something like "k-NN is equivalent to [a probabilistic model] under the following conditions on the data and the k."
-