198 Matching Annotations
  1. Apr 2023
    1. Now we are getting somewhere. At this point, we also see that the dimensions of W and b for each layer are specified by the dimensions of the inputs and the number of nodes in each layer. Let’s clean up the above diagram by not labeling every w and b value individually.
    1. While past work has characterized what kinds of functions ICL can learn (Garg et al., 2022; Laskin et al., 2022) and the distributional properties of pretraining that can elicit in-context learning (Xie et al., 2021; Chan et al., 2022), but how ICL learns these functions has remained unclear. What learning algorithms (if any) are implementable by deep network models? Which algorithms are actually discovered in the course of training? This paper takes first steps toward answering these questions, focusing on a widely used model architecture (the transformer) and an extremely well-understood class of learning problems (linear regression).
    1. It seems like the neuron basically adds the embedding of “ an” to the residual stream, which increases the output probability for “ an” since the unembedding step consists of taking the dot product of the final residual with each token.

      This cleared the dust from my eyes in understanding what the MLP layer does
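A minimal sketch of the mechanism in the highlight above, using made-up 3-dimensional vectors. The vocabulary, the token vectors, the residual values, and the activation strength are all hypothetical; the only point is that adding a token's direction to the residual stream raises that token's logit, because the logit is a dot product.

```python
# Toy illustration: every vector and token here is hypothetical.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# One vector per token, standing in for the rows of an unembedding matrix.
unembed = {
    " a": [1.0, 0.0, 0.0],
    " an": [0.0, 1.0, 0.0],
    " the": [0.0, 0.0, 1.0],
}

def logits(residual):
    # Unembedding step: dot product of the final residual with each token vector.
    return {tok: dot(residual, vec) for tok, vec in unembed.items()}

residual = [0.5, 0.2, 0.1]
before = logits(residual)

# The neuron "writes" by adding a multiple of the " an" direction to the stream.
activation = 2.0
residual_after = [r + activation * v for r, v in zip(residual, unembed[" an"])]
after = logits(residual_after)
```

In this toy setup only the logit for “ an” changes; the logits for the other tokens stay the same because their vectors are orthogonal to the added direction.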

  2. Mar 2023
    1. Because then we have a world in which grown men, sipping tea, posit thought experiments about raping talking sex dolls, thinking that maybe you are one too.
    2. Others, like Dennett, the philosopher of mind, are even more blunt. We can’t live in a world with what he calls “counterfeit people.” “Counterfeit money has been seen as vandalism against society ever since money has existed,” he said. “Punishments included the death penalty and being drawn and quartered. Counterfeit people is at least as serious.”
  3. Feb 2023
    1. The simulator widget below contains the entire source code of the game. I’ll explain how it works in the following sections.
    1. The second purpose of skip connections is specific to transformers — preserving the original input sequence.
    2. Skip connections serve two purposes. The first is that they help keep the gradient smooth, which is a big help for backpropagation. Attention is a filter, which means that when it’s working correctly it will block most of what tries to pass through it.
    3. Once we have the result of our attention step, a vector that includes the most recent word and a small collection of the words that have preceded it, we need to translate that into features, each of which is a word pair. Attention masking gets us the raw material that we need, but it doesn’t build those word pair features. To do that, we can use a single layer fully connected neural network.

      Early transformer exploration focused on the attention layer/mechanism. The MLP that follows the attention layer is now being explored, ROME for example.

    1. If, on the other hand, I were to show you a brain scan taken before I believed it was going to rain, and after, there is no one in the world who could have the faintest clue what ideas these pictures were illustrating.

      They're working on it, for example, The neural architecture of language: Integrative modeling converges on predictive processing

    1. the Elhage et al.(2021) study showing an information-copying role for self-attention.

      It turns out Meng does refer to induction heads, just not by name.

  4. Jan 2023
    1. One of the main features of the high level architecture of a transformer is that each layer adds its results into what we call the “residual stream.” Constructing models with a residual stream traces back to early work by the Schmidhuber group, such as highway networks and LSTMs, which have found significant modern success in the more recent residual network architecture. In transformers, the residual stream vectors are often called the “embedding.” We prefer the residual stream terminology, both because it emphasizes the residual nature (which we believe to be important) and also because we believe the residual stream often dedicates subspaces to tokens other than the present token, breaking the intuitions the embedding terminology suggests. The residual stream is simply the sum of the output of all the previous layers and the original embedding. We generally think of the residual stream as a communication channel, since it doesn't do any processing itself and all layers communicate through it.
    2. A transformer starts with a token embedding, followed by a series of “residual blocks”, and finally a token unembedding. Each residual block consists of an attention layer, followed by an MLP layer. Both the attention and MLP layers each “read” their input from the residual stream (by performing a linear projection), and then “write” their result to the residual stream by adding a linear projection back in. Each attention layer consists of multiple heads, which operate in parallel.
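The read/write pattern described in these two highlights can be sketched in a few lines. This is an illustrative toy, not a transformer: `toy_attention` and `toy_mlp` are hypothetical stand-ins that simply return some update vector, so the sketch only shows that the final stream equals the original embedding plus every layer's output.

```python
def add(u, v):
    return [a + b for a, b in zip(u, v)]

def run_residual_stream(embedding, layers):
    stream = list(embedding)
    for layer in layers:
        update = layer(stream)        # the layer reads from the stream
        stream = add(stream, update)  # and writes by adding its result back in
    return stream

# Hypothetical stand-ins for an attention layer and an MLP layer.
def toy_attention(x):
    return [0.1 * xi for xi in x]

def toy_mlp(x):
    return [max(0.0, xi) - 0.5 for xi in x]

embedding = [1.0, -2.0]
final = run_residual_stream(embedding, [toy_attention, toy_mlp])
```

Nothing in the loop transforms the stream itself; it only accumulates, which is the sense in which the stream acts as a communication channel.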
    1. You see the values of the self-attention weights are computed on the fly. They are data-dependent dynamic weights because they change dynamically in response to the data (fast weights).
    1. The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning (Jabri and Flower, 1992).
  5. Dec 2022
    1. The attention distribution is usually generated with content-based attention. The attending RNN generates a query describing what it wants to focus on. Each item is dot-producted with the query to produce a score, describing how well it matches the query. The scores are fed into a softmax to create the attention distribution.

      This is the Key, Value, Query, yes?
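A minimal sketch of the content-based attention steps in the highlight: score each item against the query with a dot product, then softmax the scores into a distribution. As an assumption for illustration, the items are split into separate `keys` (what is matched) and `values` (what is read out), which is how the later query/key/value formulation names these roles; the numbers are made up.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def content_attention(query, keys, values):
    # Dot-product each key with the query to get a match score.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)  # the attention distribution
    dim = len(values[0])
    # Weighted average of the values under the attention distribution.
    read = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, read

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, read = content_attention([3.0, 0.0], keys, values)
```

Here the query points along the first key, so most of the attention weight, and therefore most of the read-out, comes from the first item.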

    1. Our method is based on the hypothesis that the weights of a generator act as Optimal Linear Associative Memory (OLAM). OLAM is a classic single-layer neural data structure for memorizing associations that was described by Teuvo Kohonen and James A Anderson (independently) in the 1970s. In our case, we hypothesize that within a large modern multilayer convolutional network, each individual layer plays the role of an OLAM that stores a set of rules that associates keys, which denote meaningful context, with values, which determine output.
  6. Oct 2022
  7. Sep 2022
    1. Consider a toy model where we train an embedding of five features of varying importance (where “importance” is a scalar multiplier on mean squared error loss) in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features.
    1. To see how this plays out, we can continue looking at matrix shapes. Tracing the matrix shape through the branches and weaves of the multihead attention blocks requires three more numbers. d_k: dimensions in the embedding space used for keys and queries. 64 in the paper. d_v: dimensions in the embedding space used for values. 64 in the paper. h: the number of heads. 8 in the paper.
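The shape bookkeeping can be written down directly. This sketch traces only shapes, not values; `n` (the sequence length) is a hypothetical input, and `d_model = 512` is the model width reported in the same paper the highlight refers to.

```python
def multihead_shapes(n, d_model, h, d_k, d_v):
    """Shapes of the intermediate matrices in one multi-head attention block."""
    return {
        "Q per head": (n, d_k),
        "K per head": (n, d_k),
        "V per head": (n, d_v),
        "scores per head (Q K^T)": (n, n),
        "output per head": (n, d_v),
        "concatenated heads": (n, h * d_v),
        "after output projection": (n, d_model),
    }

shapes = multihead_shapes(n=10, d_model=512, h=8, d_k=64, d_v=64)
```

With the paper's numbers, the eight 64-dimensional head outputs concatenate back to 8 * 64 = 512 columns, the same width as the model dimension.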
    1. Now, the progression of NLP, as discussed, tells a story. We begin with tokens and then build representations of these tokens. We use these representations to find similarities between tokens and embed them in a high-dimensional space. The same embeddings are also passed into sequential models that can process sequential data. Those models are used to build context and, through an ingenious way, attend to parts of the input sentence that are useful to the output sentence in translation.
    2. Data, matrix multiplications, repeated and scaled with non-linear switches. Maybe that simplifies things a lot, but even today, most architectures boil down to these principles. Even the most complex systems, ideas, and papers can be boiled down to just that:
  8. Aug 2022
    1. Neural models more closely resemble movable type: they will change the way culture is transmitted in many social contexts.
  9. andrewbrown.substack.com
    1. But the truths of religion appear in the lives of believers, not in their theologies,
  10. Jun 2022
    1. The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the value at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.
    1. The creator of GraphQL admits this. During his presentation on the library at a Facebook internal conference, an audience member asked him about the difference between GraphQL and SOAP. His response: SOAP requires XML. GraphQL defaults to JSON—though you can use XML.
    2. Conclusion There are decades of history and a broad cast of characters behind the web requests you know and love—as well as the ones that you might have never heard of. Information first traveled across the internet in 1969, followed by a lot of research in the ’70s, then private networks in the ’80s, then public networks in the ’90s. We got CORBA in 1991, followed by SOAP in 1999, followed by REST around 2003. GraphQL reimagined SOAP, but with JSON, around 2015. This all sounds like a history class fact sheet, but it’s valuable context for building our own web apps.
    1. This trick of using a one-hot vector to pull out a particular row of a matrix is at the core of how transformers work.

      Matrix multiplication as table lookup
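A small sketch of the lookup trick, in plain Python with a made-up 3-row table: multiplying a one-hot row vector by a matrix returns exactly the row that the 1 selects.

```python
def vecmat(vec, matrix):
    """Row vector times matrix."""
    n_cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(n_cols)]

# Hypothetical lookup table: one row per vocabulary entry.
table = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]
one_hot = [0, 1, 0]  # selects row index 1
row = vecmat(one_hot, table)
```

Every term multiplied by 0 drops out, so the product is the selected row and nothing else.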

  11. May 2022
    1. Given the complexities of the brain’s structure and the functions it performs, any one of these models is surely oversimplified and ultimately wrong—at best, an approximation of some aspects of what the brain does. However, some models are less wrong than others, and consistent trends in performance across models can reveal not just which model best fits the brain but also which properties of a model underlie its fit to the brain, thus yielding critical insights that transcend what any single model can tell us.
    1. According to a 2017 study, some 4.5 million American women have been threatened by a gun-wielding partner or former partner. Almost 1 million American women have survived after a gun was used by a partner against them.
    2. If there were any merit to the “defensive gun use” argument, you’d expect that one permissive nation to boast much greater safety.
    1. Such a highly non-linear problem would clearly benefit from the computational power of many layers. Unfortunately, back-propagation learning generally slows down by an order of magnitude every time a layer is added to a network.

      The problem in 1988

    1. The source sequence will be passed to the TransformerEncoder, which will produce a new representation of it. This new representation will then be passed to the TransformerDecoder, together with the target sequence so far (target words 0 to N). The TransformerDecoder will then seek to predict the next words in the target sequence (N+1 and beyond).
    1. When chatting with my father about the proton research he summed it up nicely, that two possible responses to hearing that how we measure something seems to change its nature, throwing the reliability of empirical testing into question, are: “Science has been disproved!” or “Great!  Another thing to figure out using the Scientific Method!” The latter reaction is everyday to those who are versed in and comfortable with the fact that science is not a set of doctrines but a process of discovery, hypothesis, disproof and replacement.  Yet the former reaction, “X is wrong therefore the system which yielded X is wrong!” is, in fact, the historical norm.
  12. Apr 2022
    1. Our pre-trained network is nearly identical to the “AlexNet” architecture (Krizhevsky et al., 2012), but with local response normalization layers after pooling layers following (Jia et al., 2014). It was trained with the Caffe framework on the ImageNet 2012 dataset (Deng et al., 2009)
    1. Convolution Demo. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size $W_1 = 5, H_1 = 5, D_1 = 3$, and the CONV layer parameters are $K = 2, F = 3, S = 2, P = 1$. That is, we have two filters of size $3 \times 3$, and they are applied with a stride of 2. Therefore, the output volume size has spatial size (5 - 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of $P = 1$ is applied to the input volume, making the outer border of the input volume zero. The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.

      Best explanation/illustration of a convolution layer and the ways the numbers relate.
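The arithmetic in the highlight follows the formula (W - F + 2P)/S + 1 for each spatial dimension, with the output depth equal to the number of filters K. A short sketch, with hypothetical function names, that reproduces the numbers quoted above:

```python
def conv_output_size(w, f, s, p):
    """Spatial output size along one dimension: (W - F + 2P) / S + 1."""
    span = w - f + 2 * p
    if span % s != 0:
        raise ValueError("filter does not fit the padded input evenly at this stride")
    return span // s + 1

def conv_output_volume(w1, h1, k, f, s, p):
    """Output volume (width, height, depth); depth equals the number of filters K."""
    return (conv_output_size(w1, f, s, p), conv_output_size(h1, f, s, p), k)

volume = conv_output_volume(w1=5, h1=5, k=2, f=3, s=2, p=1)
```

The input depth does not appear in the output shape: each filter spans the full input depth and produces a single output slice.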

    2. Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume. Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in 2D space (e.g. 3x3), but full along the input depth (20).

      These two examples are the first two layers of Andrej Karpathy's wonderful working ConvNetJS CIFAR-10 demo here

    1. input (32x32x3); max activation: 0.5, min: -0.5; max gradient: 1.08696, min: -1.53051. conv (32x32x16); filter size 5x5x3, stride 1; max activation: 3.75919, min: -4.48241; max gradient: 0.36571, min: -0.33032; parameters: 16x5x5x3+16 = 1216. [Image panels: Activations, Activation Gradients, Weights, Weight Gradients]

      The dimensions of these first two layers are explained here
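The weight and parameter counts quoted in the two highlights above come from the same small calculation: each neuron has F * F * depth weights plus one bias, and a layer has one such set per filter. A sketch with hypothetical function names:

```python
def weights_per_neuron(f, depth):
    """Each neuron connects to an F x F region across the full input depth."""
    return f * f * depth

def conv_layer_params(num_filters, f, depth):
    """Weights are shared per filter, plus one bias per filter."""
    return num_filters * (weights_per_neuron(f, depth) + 1)

example_1 = weights_per_neuron(5, 3)        # [32x32x3] input, 5x5 filter
example_2 = weights_per_neuron(3, 20)       # [16x16x20] input, 3x3 filter
first_conv = conv_layer_params(16, 5, 3)    # the demo's first conv layer
```

The last value matches the demo's own readout, 16x5x5x3+16.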

    1. Here the lower level layers are frozen and are not trained, only the new classification head will update itself to learn from the features provided from the pre-trained chopped up model on the left.
    1. Starting from random noise, we optimize an image to activate a particular neuron (layer mixed4a, unit 11).

      And then we use that image as a kind of variable name to refer to the neuron in a way that is more helpful than the layer number and neuron index within the layer. This explanation is via one of Chris Olah's YouTube videos (https://www.youtube.com/watch?v=gXsKyZ_Y_i8)

    1. This just happened to me and it was because I was signed in to my work account at the same time.  I went to "sign out all" and signed in again and they reappeared

      When Android apps disappeared from my Chromebook, it was because I had added a managed account.

  13. Mar 2022
    1. A special quality of humans, not shared by evolution or, as yet, by machines, is our ability to recognize gaps in our understanding and to take joy in the process of filling them in. It is a beautiful thing to experience the mysterious, and powerful, too.
  14. Feb 2022
    1. Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.
  15. Jan 2022
    1. While heat pumps are the most cost effective way to use electricity to heat your home during the cooler months, leaving them running day and night is not economically efficient. According to Energywise, you should switch off your heat pump when you don’t need it. This is to avoid excessive energy waste.
    1. Treatment with single probiotic B. infantis didn't impact on abdominal pain, bloating/distention, or bowel habit satisfaction among IBS patients. However, patients who received composite probiotics containing B. infantis had significantly reduced abdominal pain
  16. Dec 2021
    1. To test whether these distributed representations of meaning are neurally plausible, a number of studies have attempted to learn a mapping between particular semantic dimensions and patterns of brain activation
    1. I grew up in a small town called Surry on the coast of down-east Maine. At Christmas, most everyone in our town bought their trees at Jordan's Tree Farm. $5 per tree, cut at your own risk. Thinking back, it seems funny to me now, since after all, this is rural Maine, the pine tree state. And you'd think everyone could cut their own trees on their own land. And it's not like the trees at the Jordan farm were so special. Pretty much everyone called them Charlie Brown trees. People came because of Robert Jordan. They were loyal to him, and they figured he could use the money.
    1. the only thing an artificial neuron can do: classify a data point into one of two kinds by examining input values with weights and bias.

      How does this relate to "weighted sum shows similarity between the weights and the inputs"?
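One way to connect the two readings, offered as a sketch rather than the source's own explanation: the weighted sum is a dot product, which grows when the input points in a similar direction to the weights, and the bias sets how much of that similarity is needed before the neuron answers "1". The weights and bias below are hypothetical and happen to implement a logical AND.

```python
def weighted_sum(inputs, weights, bias):
    # Dot product of weights and inputs, shifted by the bias.
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def neuron(inputs, weights, bias):
    """Classify a data point into one of two kinds by thresholding the weighted sum."""
    return 1 if weighted_sum(inputs, weights, bias) > 0 else 0

weights = [1.0, 1.0]
bias = -1.5
outputs = [neuron(x, weights, bias) for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
```

Only the input most aligned with the weight vector, `[1, 1]`, produces a weighted sum large enough to clear the threshold.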

    1. The transformer model introduces the idea of instead of adding another complex mechanism (attention) to an already complex Seq2Seq model; we can simplify the solution by forgetting about everything else and just focusing on attention.
    1. I’m particularly interested in two questions: First, just how weird is machine learning? Second, what sorts of choices do developers make as they shape a project?
  17. Nov 2021
    1. Now that we've made peace with the concepts of projections (matrix multiplications)

      Projections are matrix multiplications. Why didn't you say so? Spatial and channel projections in the gated gMLP

    2. Computers are especially good at matrix multiplications. There is an entire industry around building computer hardware specifically for fast matrix multiplications. Any computation that can be expressed as a matrix multiplication can be made shockingly efficient.
    3. The selective-second-order-with-skips model is a useful way to think about what transformers do, at least in the decoder side. It captures, to a first approximation, what generative language models like OpenAI's GPT-3 are doing.
    1. The following figure presents a simple functional diagram of the neural network we will use throughout the article. The neural network is a sequence of linear (both convolutional and fully-connected), max-pooling, and ReLU layers, culminating in a softmax layer. Convolutional: a convolution calculates weighted sums of regions in the input. In neural networks, the learnable weights in convolutional layers are referred to as the kernel. (Image credit to https://towardsdatascience.com/gentle-dive-into-math-behind-convolutional-neural-networks-79a07dd44cf9. See also Convolution arithmetic.) Fully-connected: a fully-connected layer computes output neurons as weighted sum of input neurons. In matrix form, it is a matrix that linearly transforms the input vector into the output vector. ReLU: first introduced by Nair and Hinton, ReLU calculates $f(x)=\max(0,x)$ for each entry in a vector input. Graphically, it is a hinge at the origin. (Image credit to https://pytorch.org/docs/stable/nn.html#relu.) Softmax: the softmax function calculates $S(y_i)=\frac{e^{y_i}}{\sum_{j=1}^{N} e^{y_j}}$ for each entry ($y_i$) in a vector input ($y$). (Image credit to https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/.)

      This is a great visualization of MNIST hidden layers.

    1. The most beautiful and deepest experience a man can have is the sense of the mysterious.
    1. The Query word can be interpreted as the word for which we are calculating Attention. The Key and Value word is the word to which we are paying attention ie. how relevant is that word to the Query word.

      Finally

    1. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.
    1. The cube of activations that a neural network for computer vision develops at each hidden layer. Different slices of the cube allow us to target the activations of individual neurons, spatial positions, or channels.

      This is first explanation of

    1. The attention layer (W in the diagram) computes three vectors based on the input, termed key, query, and value.

      Could you be more specific?

    2. Attention is a means of selectively weighting different elements in input data, so that they will have an adjusted impact on the hidden states of downstream layers.
    1. These findings provide strong evidence for a classic hypothesis about the computations underlying human language understanding, that the brain’s language system is optimized for predictive processing in the service of meaning extraction
    1. On the geopolitical stage, it’s hard to argue with the claim that Twitter is a force of evil. But Twitter is also the infrastructural backbone of much of the digital humanities world.
    1. To review, the Forget gate decides what is relevant to keep from prior steps. The input gate decides what information is relevant to add from the current step. The output gate determines what the next hidden state should be. Code Demo: For those of you who understand better through seeing the code, here is an example using python pseudo code.
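The source's own demo is not part of this highlight. As a separate stand-in, here is a minimal sketch of one LSTM step on scalars, assuming the standard gate equations; the parameter names (`wf`, `uf`, `bf`, and so on) are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step on scalar input and state; p maps parameter names to floats."""
    # Forget gate: what to keep from prior steps.
    forget = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])
    # Input gate: how much of the new candidate to add from the current step.
    inp = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])
    candidate = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])
    # Output gate: what the next hidden state should expose.
    output = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])
    c = forget * c_prev + inp * candidate
    h = output * math.tanh(c)
    return h, c

names = ("wf", "uf", "bf", "wi", "ui", "bi", "wc", "uc", "bc", "wo", "uo", "bo")
zero_params = {name: 0.0 for name in names}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
```

With all parameters at zero every gate sits at 0.5 and the candidate is 0, so the cell state is simply halved; real behavior comes from learned parameters.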
  18. Oct 2021
    1. This approach, visualizing high-dimensional representations using dimensionality reduction, is an extremely broadly applicable technique for inspecting models in deep learning.
    2. These layers warp and reshape the data to make it easier to classify.
    1. Even with this very primitive single neuron, you can achieve 90% accuracy when recognizing a handwritten text image. To recognize all the digits from 0 to 9, you would need just ten neurons to recognize them with 92% accuracy.

      And here is a Google Colab notebook that demonstrates that

    1. Reports of death after COVID-19 vaccination are rare. More than 396 million doses of COVID-19 vaccines were administered in the United States from December 14, 2020, through October 4, 2021. During this time, VAERS received 8,390 reports of death (0.0021%) among people who received a COVID-19 vaccine. FDA requires healthcare providers to report any death after COVID-19 vaccination to VAERS, even if it’s unclear whether the vaccine was the cause. Reports of adverse events to VAERS following vaccination, including deaths, do not necessarily mean that a vaccine caused a health problem. A review of available clinical information, including death certificates, autopsy, and medical records, has not established a causal link to COVID-19 vaccines. However, recent reports indicate a plausible causal relationship between the J&J/Janssen COVID-19 Vaccine and TTS, a rare and serious adverse event—blood clots with low platelets—which has caused deaths.
    1. It is not only the essence of being human but also a vital property of life. Technological advances in communication shape society and make its members more interdependent

  19. Sep 2021
    1. The models are developed in Python [46], using the Keras [47] and Tensorflow [48] libraries. Details on the code and dependencies to run the experiments are listed in a Readme file available together with the code in the Supplemental Material.

      I have not found the code or Readme file

    2. These results nonetheless show that it could be feasible to develop recurrent neural network models able to infer input-output behaviours of real biological systems, enabling researchers to advance their understanding of these systems even in the absence of detailed level of connectivity.

      Too strong a claim?

    3. We show that GRU models with a hidden layer size of 4 units are able to accurately reproduce with high accuracy the system’s response to very different stimuli.
    1. One popular theory among machine learning researchers is the manifold hypothesis: MNIST is a low dimensional manifold, sweeping and curving through its high-dimensional embedding space. Another hypothesis, more associated with topological data analysis, is that data like MNIST consists of blobs with tentacle-like protrusions sticking out into the surrounding space.
    1. This is what I call a leaky abstraction. TCP attempts to provide a complete abstraction of an underlying unreliable network, but sometimes, the network leaks through the abstraction and you feel the things that the abstraction can’t quite protect you from. This is but one example of what I’ve dubbed the Law of Leaky Abstractions:
    1. Humans perform a version of this task when interpreting hard-to-understand speech, such as an accent which is particularly fast or slurred, or a sentence in a language we do not know very well—we do not necessarily hear every single word that is said, but we pick up on salient key words and contextualize the rest to understand the sentence.

      Boy, don't they

    1. A neural network will predict your digit in the blue square above. Your image is 784 pixels (= 28 rows by 28 columns with black=1 and white=0). Those 784 features get fed into a 3 layer neural network; Input:784 - AvgPool:196 - Dense:100 - Softmax:10.
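The layer sizes in the highlight can be checked with a short sketch. Two things are assumptions here, not stated in the source: that 784 shrinks to 196 through 2x2 average pooling (28x28 becomes 14x14), and that each dense layer has one bias per output unit.

```python
def avg_pool(image, size=2):
    """Non-overlapping average pooling over size x size blocks."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(image[r + dr][c + dc] for dr in range(size) for dc in range(size))
            / (size * size)
            for c in range(0, cols, size)
        ]
        for r in range(0, rows, size)
    ]

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit (the bias is an assumption)."""
    return n_in * n_out + n_out

image = [[0] * 28 for _ in range(28)]  # blank 28x28 image, white = 0
image[0][0] = 1                        # a single black pixel
pooled = avg_pool(image)
n_features = sum(len(row) for row in pooled)
total_params = dense_params(n_features, 100) + dense_params(100, 10)
```

The pooling layer has no learned parameters, so under these assumptions all trainable weights sit in the two dense layers.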
    1. If you have always wanted to know what it feels like to get stuck in a nonconsensual, one-way conversation with a libertarian high-school debate captain who’s more in love with his own brain than you will ever be with anyone or anything, Greenwald has just done you a great service. (I can already hear the debate captain shouting “point of personal privilege,” so I’ll try to steer clear of ad hominem from here on out.)
    1. Personalized ASR models. For each of the 432 participants with disordered speech, we create a personalized ASR model (SI-2) from their own recordings. Our fine-tuning procedure was optimized for our adaptation process, where we only have between ¼ and 2 h of data per speaker. We found that updating only the first five encoder layers (versus the complete model) worked best and successfully prevented overfitting [10]
    1. The researchers found that the model, when it is still confused by a given phoneme (that’s an individual speech sound like an “e” or “f”), has two kinds of errors. First, there’s the fact that it doesn’t recognize the phoneme for what was intended, and thus is not recognizing the word. And second, the model has to guess which phoneme the speaker did intend, and might choose the wrong one in cases where two or more words sound roughly similar.
    1. So whenever you hear of someone “training” a neural network, it just means finding the weights we use to calculate the prediction.
  20. Aug 2021
    1. So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.
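A minimal sketch of that projection step in plain Python. The three matrices would be learned during training; the values below are made up, and the names `W_Q`, `W_K`, `W_V` are conventional labels assumed here for illustration.

```python
def vecmat(vec, matrix):
    """Row vector times matrix."""
    n_cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(n_cols)]

# Hypothetical 3 -> 2 projection matrices (learned in a real model).
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
W_V = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

def project(embedding):
    """Create the Query, Key, and Value vectors for one word embedding."""
    return vecmat(embedding, W_Q), vecmat(embedding, W_K), vecmat(embedding, W_V)

query, key, value = project([1.0, 2.0, 3.0])
```

The same embedding yields three different vectors because each matrix projects it into a different space.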
    1. I'm going to try to provide an English text example. The following is based solely on my intuitive understanding of the paper 'Attention is all you need'.

      This is also good

    2. For the word q that your eyes see in the given sentence, what is the most related word k in the sentence to understand what q is about?
    3. So basically: q = the vector representing a word K and V = your memory, thus all the words that have been generated before. Note that K and V can be the same (but don't have to). So what you do with attention is that you take your current query (word in most cases) and look in your memory for similar keys. To come up with a distribution of relevant words, the softmax function is then used.
    1. The Edgerton Essays are named for Norman Rockwell’s famous 1943 painting, “Freedom of Speech.” Rockwell depicted Jim Edgerton, a farmer in their small town, rising to speak and being respectfully listened to by his neighbors. That respectful, democratic spirit is too often missing today, and what we’re hoping to cultivate with this series.
    1. A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem. It’s true, essentially, because the hidden layer can be used as a lookup table.
    2. Recursive Neural Networks
    3. t-SNE visualizations of word embeddings.
  21. Jul 2021
    1. In the language of Interpretable Machine Learning (IML) literature like Molnar et al.[20], input saliency is a method that explains individual predictions.
    1. Using multiple copies of a neuron in different places is the neural network equivalent of using functions. Because there is less to learn, the model learns more quickly and learns a better model. This technique – the technical name for it is ‘weight tying’ – is essential to the phenomenal results we’ve recently seen from deep learning.

    1. Vectors with a small Euclidean distance from one another are located in the same region of a vector space. Vectors with a high cosine similarity are located in the same general direction from the origin.
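The difference between the two measures shows up clearly with three made-up 2-dimensional vectors: `b` points the same way as `a` but is far away, while `c` is close to `a` but points in a slightly different direction.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 1.0]
b = [10.0, 10.0]  # same direction as a, much longer
c = [1.0, 0.5]    # near a, different direction

same_direction = cosine_similarity(a, b)
near_neighbor = euclidean_distance(a, c)
```

By cosine similarity `b` is the better match for `a`; by Euclidean distance `c` is.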
    1. If you're serious about neural networks, I have one recommendation. Try to rebuild this network from memory.
    2. Line 43: uses the "confidence weighted error" from l2 to establish an error for l1. To do this, it simply sends the error across the weights from l2 to l1. This gives what you could call a "contribution weighted error" because we learn how much each node value in l1 "contributed" to the error in l2. This step is called "backpropagating" and is the namesake of the algorithm

      Backpropagating
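The step described for line 43 can be sketched in plain Python as sending the l2 error back across the connecting weights. The source's actual code is not shown in this highlight, so the function and argument names below are hypothetical, as are the numbers.

```python
def backpropagate_error(l2_delta, weights_l1_to_l2):
    """Contribution-weighted error for each l1 node.

    weights_l1_to_l2[j][k] is the weight from l1 node j to l2 node k, so each
    l1 node receives the l2 errors scaled by how strongly it fed each l2 node.
    """
    return [sum(d * w for d, w in zip(l2_delta, row)) for row in weights_l1_to_l2]

l2_delta = [0.5]                   # confidence-weighted error at the single l2 node
weights = [[2.0], [-1.0], [0.0]]   # three l1 nodes feeding one l2 node
l1_error = backpropagate_error(l2_delta, weights)
```

A node with a zero weight receives no blame, and a node with a negative weight receives error of the opposite sign.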

    1. In our research, i.e., the wormnet project, we try to build machine learning models motivated by the C. elegans nervous system. By doing so, we have to pay a cost, as we constrain ourselves to such models in contrast to standard artificial neural networks, whose modeling space is purely constrained by memory and compute limitations. However, there are potentially some advantages and benefits we gain. Our objective is to better understand what’s necessary for effective neural information processing to emerge.
    1. Recommendations: DON'T use shifted PPMI with SVD. DON'T use SVD "correctly", i.e. without eigenvector weighting (performance drops 15 points compared to with eigenvalue weighting with $p = 0.5$). DO use PPMI and SVD with short contexts (window size of 2). DO use many negative samples with SGNS. DO always use context distribution smoothing (raise unigram distribution to the power of $\alpha = 0.75$) for all methods. DO use SGNS as a baseline (robust, fast and cheap to train). DO try adding context vectors in SGNS and GloVe.
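Context distribution smoothing, as described in the recommendations, raises each unigram count to the power 0.75 and renormalizes. A minimal sketch with made-up word counts:

```python
def smoothed_distribution(counts, alpha=0.75):
    """Raise unigram counts to the power alpha, then renormalize to sum to 1."""
    powered = {word: count ** alpha for word, count in counts.items()}
    total = sum(powered.values())
    return {word: value / total for word, value in powered.items()}

counts = {"the": 81, "cat": 16, "zygote": 1}  # hypothetical corpus counts
raw = smoothed_distribution(counts, alpha=1.0)
smoothed = smoothed_distribution(counts)
```

Compared with the raw distribution, smoothing shifts probability mass from frequent words toward rare ones, which is the intended effect.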
  22. Jun 2021
    1. Here is an example run of the QnA model:

      This example doesn't work. The await gets an error. Since it's not inside the promise?

    1. One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning

      This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

    1. TensorFlow.js provides the Layers API, which mirrors the Keras API as closely as possible, including the serialization format.

      Surfing TensorFlow I was orbiting this conclusion. It's good to see it stated clearly.

    1. The Hole Hawg is a drill made by the Milwaukee Tool Company.
    2. primal Jungian fugue
    3. They pay lip service to multiculturalism and diversity and non-judgmentalness, but they don't raise their own children that way.
    4. It comes through as the presumption that all authority figures--teachers, generals, cops, ministers, politicians--are hypocritical buffoons, and that hip jaded coolness is the only way to be.
  23. May 2021
    1. Note that variables cannot appear in the predicate position.
  24. Mar 2021
  25. Feb 2021
    1. There's this wonderful study done by Deborah Estrin at Cornell. If you plan and decide in advance what you’re going to eat and watch, the food you select and the video you watch will be different. Your video is likely to be slightly more intellectual and challenging, and your food is likely to be healthier for you. When you do it in advance it’s your planning self instead of your immediate-gratification self.
    1. There are two directions to look for: first, using the principle of independence between the sources and the knowledge management layer, and second, fine tuning the balance between automatic processing and manual curation.