198 Matching Annotations
  1. Apr 2023
    1. While past work has characterized what kinds of functions ICL can learn (Garg et al., 2022; Laskin et al., 2022) and the distributional properties of pretraining that can elicit in-context learning (Xie et al., 2021; Chan et al., 2022), how ICL learns these functions has remained unclear. What learning algorithms (if any) are implementable by deep network models? Which algorithms are actually discovered in the course of training? This paper takes first steps toward answering these questions, focusing on a widely used model architecture (the transformer) and an extremely well-understood class of learning problems (linear regression).
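
      A minimal sketch of the linear-regression ICL setup the paper studies (my own illustration, not the authors' code): each prompt is a sequence of (x, w·x) pairs drawn from a random weight vector w, and the model must predict the label for a fresh query x.

        import numpy as np

        rng = np.random.default_rng(0)
        d, n_examples = 8, 16            # input dimension, in-context examples per prompt

        w = rng.normal(size=d)           # latent regression weights for this prompt
        xs = rng.normal(size=(n_examples, d))
        ys = xs @ w                      # noiseless labels y_i = <w, x_i>

        # The "prompt" interleaves x_1, y_1, ..., x_n, y_n and ends with a query x;
        # the transformer is trained to output the query's label y = <w, x_query>.
        x_query = rng.normal(size=d)
        y_target = x_query @ w
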
  2. Mar 2023
    1. Others, like Dennett, the philosopher of mind, are even more blunt. We can’t live in a world with what he calls “counterfeit people.” “Counterfeit money has been seen as vandalism against society ever since money has existed,” he said. “Punishments included the death penalty and being drawn and quartered. Counterfeit people is at least as serious.”
  3. Feb 2023
    1. Once we have the result of our attention step, a vector that includes the most recent word and a small collection of the words that have preceded it, we need to translate that into features, each of which is a word pair. Attention masking gets us the raw material that we need, but it doesn’t build those word pair features. To do that, we can use a single layer fully connected neural network.

      Early transformer exploration focused on the attention layer/mechanism. The MLP that follows the attention layer is now being explored; ROME is one example.
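
      A rough sketch of the idea in the quote (my own simplification, not the article's code): the attention output blends the current word with attended context, and a single fully connected layer with a nonlinearity can turn that blend into "word pair"-style features. Sizes and weights are placeholders.

        import numpy as np

        d_model, d_ff = 64, 256
        rng = np.random.default_rng(0)

        attn_output = rng.normal(size=d_model)        # current word + attended context, mixed by attention
        W1 = rng.normal(size=(d_ff, d_model)) * 0.1   # hypothetical learned weights
        b1 = np.zeros(d_ff)

        # Single fully connected layer: each output unit can respond to a
        # particular combination (e.g. a word pair) present in attn_output.
        features = np.maximum(0, W1 @ attn_output + b1)   # ReLU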

    1. If, on the other hand, I were to show you a brain scan taken before I believed it was going to rain, and after, there is no one in the world who could have the faintest clue what ideas these pictures were illustrating.

      They're working on it, for example, The neural architecture of language: Integrative modeling converges on predictive processing

  4. Jan 2023
    1. One of the main features of the high level architecture of a transformer is that each layer adds its results into what we call the “residual stream.” Constructing models with a residual stream traces back to early work by the Schmidhuber group, such as highway networks and LSTMs, which have found significant modern success in the more recent residual network architecture. In transformers, the residual stream vectors are often called the “embedding.” We prefer the residual stream terminology, both because it emphasizes the residual nature (which we believe to be important) and also because we believe the residual stream often dedicates subspaces to tokens other than the present token, breaking the intuitions the embedding terminology suggests. The residual stream is simply the sum of the output of all the previous layers and the original embedding. We generally think of the residual stream as a communication channel, since it doesn't do any processing itself and all layers communicate through it.
    2. A transformer starts with a token embedding, followed by a series of “residual blocks”, and finally a token unembedding. Each residual block consists of an attention layer, followed by an MLP layer. Both the attention and MLP layers each “read” their input from the residual stream (by performing a linear projection), and then “write” their result to the residual stream by adding a linear projection back in. Each attention layer consists of multiple heads, which operate in parallel.
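
      A compact sketch of the read/write pattern described above (shapes and weight initializations are placeholders, not Anthropic's code): each layer reads from the residual stream with a linear map and writes its result back in by addition, so the stream is the sum of the embedding and every layer's output.

        import numpy as np

        n_tokens, d_model = 4, 16
        rng = np.random.default_rng(0)

        residual = rng.normal(size=(n_tokens, d_model))   # token embeddings enter the stream

        def attention_layer(x):
            # placeholder for multi-head attention; returns a write of the same shape
            return x @ rng.normal(size=(d_model, d_model)) * 0.01

        def mlp_layer(x):
            W_in = rng.normal(size=(d_model, 4 * d_model))
            W_out = rng.normal(size=(4 * d_model, d_model))
            return np.maximum(0, x @ W_in) @ W_out * 0.01

        for _ in range(2):                                     # two residual blocks
            residual = residual + attention_layer(residual)    # attention reads, then writes back
            residual = residual + mlp_layer(residual)          # MLP reads, then writes back
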
    1. The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning (Jabri and Flower, 1992).
  5. Dec 2022
    1. The attention distribution is usually generated with content-based attention. The attending RNN generates a query describing what it wants to focus on. Each item is dot-producted with the query to produce a score, describing how well it matches the query. The scores are fed into a softmax to create the attention distribution.

      This is the Key, Value, Query, yes?
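
      Essentially yes. A minimal numpy sketch of that step (illustrative values only): the query is dotted with each key, the scores go through a softmax, and the resulting distribution weights the values.

        import numpy as np

        query = np.array([1.0, 0.0, 1.0])
        keys = np.array([[1.0, 0.0, 1.0],      # memory item 0
                         [0.0, 1.0, 0.0],      # memory item 1
                         [1.0, 1.0, 0.0]])     # memory item 2
        values = keys                          # keys and values can be the same items

        scores = keys @ query                  # how well each item matches the query
        attn = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention distribution
        context = attn @ values                # weighted average of the values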

    1. Our method is based on the hypothesis that the weights of a generator act as Optimal Linear Associative Memory (OLAM). OLAM is a classic single-layer neural data structure for memorizing associations that was described by Teuvo Kohonen and James A Anderson (independently) in the 1970s. In our case, we hypothesize that within a large modern multilayer convolutional network, each individual layer plays the role of an OLAM that stores a set of rules that associates keys, which denote meaningful context, with values, which determine output.
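
      A tiny sketch of the OLAM idea (my own illustration of the hypothesis, not the paper's code): a single linear map W stores associations by approximately solving W such that k_i W ≈ v_i for a set of key/value pairs, e.g. by least squares.

        import numpy as np

        rng = np.random.default_rng(0)
        d_key, d_val, n_pairs = 32, 16, 8

        K = rng.normal(size=(n_pairs, d_key))   # keys: meaningful contexts
        V = rng.normal(size=(n_pairs, d_val))   # values: desired outputs

        # Optimal linear associative memory: W minimizes ||K W - V||^2
        W, *_ = np.linalg.lstsq(K, V, rcond=None)

        recalled = K @ W                             # recalled values for the stored keys
        print(np.allclose(recalled, V, atol=1e-6))   # exact here since n_pairs <= d_key
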
  6. Oct 2022
  7. Sep 2022
    1. To see how this plays out, we can continue looking at matrix shapes. Tracing the matrix shape through the branches and weaves of the multihead attention blocks requires three more numbers. d_k: dimensions in the embedding space used for keys and queries. 64 in the paper. d_v: dimensions in the embedding space used for values. 64 in the paper. h: the number of heads. 8 in the paper.
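
      A quick shape trace using the paper's numbers (d_model = 512, d_k = d_v = 64, h = 8); the projection matrices below are random placeholders, there only to check dimensions.

        import numpy as np

        n_tokens, d_model, d_k, d_v, h = 10, 512, 64, 64, 8
        rng = np.random.default_rng(0)
        x = rng.normal(size=(n_tokens, d_model))

        heads = []
        for _ in range(h):
            W_q, W_k = rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k))
            W_v = rng.normal(size=(d_model, d_v))
            Q, K, V = x @ W_q, x @ W_k, x @ W_v             # (n_tokens, 64) each
            scores = Q @ K.T / np.sqrt(d_k)                 # (n_tokens, n_tokens)
            scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
            attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
            heads.append(attn @ V)                          # (n_tokens, 64)

        concat = np.concatenate(heads, axis=-1)             # (n_tokens, h * d_v) = (10, 512)
        W_o = rng.normal(size=(h * d_v, d_model))
        out = concat @ W_o                                  # back to (n_tokens, 512)
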
    1. Now, the progression of NLP, as discussed, tells a story. We begin with tokens and then build representations of these tokens. We use these representations to find similarities between tokens and embed them in a high-dimensional space. The same embeddings are also passed into sequential models that can process sequential data. Those models are used to build context and, through an ingenious way, attend to parts of the input sentence that are useful to the output sentence in translation.
  8. Aug 2022
  9. andrewbrown.substack.com
  10. Jun 2022
    1. The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the key at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.
    1. Conclusion There are decades of history and a broad cast of characters behind the web requests you know and love—as well as the ones that you might have never heard of. Information first traveled across the internet in 1969, followed by a lot of research in the ’70s, then private networks in the ’80s, then public networks in the ’90s. We got CORBA in 1991, followed by SOAP in 1999, followed by REST around 2003. GraphQL reimagined SOAP, but with JSON, around 2015. This all sounds like a history class fact sheet, but it’s valuable context for building our own web apps.
  11. May 2022
    1. Given the complexities of the brain’s structure and the functions it performs, any one of these models is surely oversimplified and ultimately wrong—at best, an approximation of some aspects of what the brain does. However, some models are less wrong than others, and consistent trends in performance across models can reveal not just which model best fits the brain but also which properties of a model underlie its fit to the brain, thus yielding critical insights that transcend what any single model can tell us.
    1. According to a 2017 study, some 4.5 million American women have been threatened by a gun-wielding partner or former partner. Almost 1 million American women have survived after a gun was used by a partner against them.
    1. When chatting with my father about the proton research he summed it up nicely, that two possible responses to hearing that how we measure something seems to change its nature, throwing the reliability of empirical testing into question, are: “Science has been disproved!” or “Great!  Another thing to figure out using the Scientific Method!” The latter reaction is everyday to those who are versed in and comfortable with the fact that science is not a set of doctrines but a process of discovery, hypothesis, disproof and replacement.  Yet the former reaction, “X is wrong therefore the system which yielded X is wrong!” is, in fact, the historical norm.

      via http://known.kevinmarks.com/2021/sketches-of-a-history-of-skepticism-part-i-classical-eudaimonia

  12. Apr 2022
    1. Our pre-trained network is nearly identical to the “AlexNet” architecture (Krizhevsky et al., 2012), but with local response normalization layers after pooling layers following (Jia et al., 2014). It was trained with the Caffe framework on the ImageNet 2012 dataset (Deng et al., 2009).
    1. Convolution Demo. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size W1 = 5, H1 = 5, D1 = 3, and the CONV layer parameters are K = 2, F = 3, S = 2, P = 1. That is, we have two filters of size 3 × 3, and they are applied with a stride of 2. Therefore, the output volume has spatial size (5 - 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of P = 1 is applied to the input volume, making the outer border of the input volume zero. The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.

      Best explanation/illustration of a convolution layer and the way the numbers relate.
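
      The output-size rule from the demo, written out as a small sketch of my own (the formula is the standard one; the numbers match the demo's W1 = 5, K = 2, F = 3, S = 2, P = 1):

        W1, H1, D1 = 5, 5, 3      # input width, height, depth
        K, F, S, P = 2, 3, 2, 1   # num filters, filter size, stride, zero-padding

        W2 = (W1 - F + 2 * P) // S + 1   # (5 - 3 + 2) / 2 + 1 = 3
        H2 = (H1 - F + 2 * P) // S + 1   # 3
        D2 = K                           # one depth slice per filter
        print(W2, H2, D2)                # 3 3 2

        # Each output activation = elementwise product of a 3x3x3 input window
        # with one filter, summed, plus that filter's bias.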

    2. Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume. Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in 2D space (e.g. 3x3), but full along the input depth (20).

      These two examples are the first two layers of Andrej Karpathy's wonderful working ConvNetJS CIFAR-10 demo here
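
      The weight counts in the two examples, spelled out as a quick check of my own rather than demo code:

        # Example 1: 5x5 filter over a 32x32x3 input
        weights_per_neuron = 5 * 5 * 3               # 75 weights
        params_per_neuron = weights_per_neuron + 1   # +1 bias = 76 parameters

        # Example 2: 3x3 filter over a 16x16x20 input
        connections_per_neuron = 3 * 3 * 20          # 180 connections: local in 2D, full in depth
        print(weights_per_neuron, params_per_neuron, connections_per_neuron)   # 75 76 180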

  13. Mar 2022
  14. Feb 2022
    1. Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.
  15. Jan 2022
    1. While heat pumps are the most cost effective way to use electricity to heat your home during the cooler months, leaving them running day and night is not economically efficient. According to Energywise, you should switch off your heat pump when you don’t need it. This is to avoid excessive energy waste.
  16. Dec 2021
    1. I grew up in a small town called Surry on the coast of down-east Maine. At Christmas, most everyone in our town bought their trees at Jordan's Tree Farm. $5 per tree, cut at your own risk. Thinking back, it seems funny to me now, since after all, this is rural Maine, the pine tree state. And you'd think everyone could cut their own trees on their own land. And it's not like the trees at the Jordan farm were so special. Pretty much everyone called them Charlie Brown trees. People came because of Robert Jordan. They were loyal to him, and they figured he could use the money.
  17. Nov 2021
    1. The following figure presents a simple functional diagram of the neural network we will use throughout the article. The neural network is a sequence of linear (both convolutional and fully-connected), max-pooling, and ReLU layers, culminating in a softmax layer. A convolution calculates weighted sums of regions in the input; in neural networks, the learnable weights in convolutional layers are referred to as the kernel (image credit: https://towardsdatascience.com/gentle-dive-into-math-behind-convolutional-neural-networks-79a07dd44cf9; see also Convolution arithmetic). A fully-connected layer computes output neurons as weighted sums of input neurons; in matrix form, it is a matrix that linearly transforms the input vector into the output vector. First introduced by Nair and Hinton, ReLU calculates f(x) = max(0, x) for each entry in a vector input; graphically, it is a hinge at the origin (image credit: https://pytorch.org/docs/stable/nn.html#relu). The softmax function calculates S(y_i) = e^{y_i} / Σ_{j=1}^{N} e^{y_j} for each entry y_i in a vector input y (image credit: https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/).

      This is a great visualization of MNIST hidden layers.
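
      A minimal PyTorch sketch of a network with that overall shape (conv, ReLU, max-pool, fully-connected, softmax); the layer sizes are my own placeholders for an MNIST-like 28x28 grayscale input, not the article's exact model.

        import torch
        import torch.nn as nn

        model = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution: weighted sums of regions
            nn.ReLU(),                                   # f(x) = max(0, x)
            nn.MaxPool2d(2),                             # 28x28 -> 14x14
            nn.Flatten(),
            nn.Linear(8 * 14 * 14, 10),                  # fully-connected layer
            nn.Softmax(dim=-1),                          # S(y_i) = exp(y_i) / sum_j exp(y_j)
        )

        x = torch.randn(1, 1, 28, 28)                    # one fake grayscale image
        print(model(x).shape)                            # torch.Size([1, 10]); rows sum to 1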

    1. To review, the Forget gate decides what is relevant to keep from prior steps. The input gate decides what information is relevant to add from the current step. The output gate determines what the next hidden state should be. Code Demo: For those of you who understand better through seeing the code, here is an example using python pseudo code.
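
      A hedged Python sketch in the same spirit as the article's pseudo-code (the weight matrices and the sigmoid/tanh helpers are placeholders of my own, not the article's snippet):

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o):
            z = np.concatenate([h_prev, x_t])      # previous hidden state + current input
            f = sigmoid(W_f @ z)                   # forget gate: what to keep from prior steps
            i = sigmoid(W_i @ z)                   # input gate: what to add from this step
            c_tilde = np.tanh(W_c @ z)             # candidate cell contents
            c = f * c_prev + i * c_tilde           # new cell state
            o = sigmoid(W_o @ z)                   # output gate: what the next hidden state shows
            h = o * np.tanh(c)
            return h, c

        d_in, d_hid = 4, 3
        rng = np.random.default_rng(0)
        Ws = [rng.normal(size=(d_hid, d_hid + d_in)) for _ in range(4)]
        h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), *Ws)
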
  18. Oct 2021
    1. Reports of death after COVID-19 vaccination are rare. More than 396 million doses of COVID-19 vaccines were administered in the United States from December 14, 2020, through October 4, 2021. During this time, VAERS received 8,390 reports of death (0.0021%) among people who received a COVID-19 vaccine. FDA requires healthcare providers to report any death after COVID-19 vaccination to VAERS, even if it’s unclear whether the vaccine was the cause. Reports of adverse events to VAERS following vaccination, including deaths, do not necessarily mean that a vaccine caused a health problem. A review of available clinical information, including death certificates, autopsy, and medical records, has not established a causal link to COVID-19 vaccines. However, recent reports indicate a plausible causal relationship between the J&J/Janssen COVID-19 Vaccine and TTS, a rare and serious adverse event—blood clots with low platelets—which has caused deaths [1.4 MB, 40 pages].
  19. Sep 2021
    1. These results nonetheless show that it could be feasible to develop recurrent neural network models able to infer input-output behaviours of real biological systems, enabling researchers to advance their understanding of these systems even in the absence of a detailed level of connectivity.

      Too strong a claim?

    1. One popular theory among machine learning researchers is the manifold hypothesis: MNIST is a low dimensional manifold, sweeping and curving through its high-dimensional embedding space. Another hypothesis, more associated with topological data analysis, is that data like MNIST consists of blobs with tentacle-like protrusions sticking out into the surrounding space.
    1. This is what I call a leaky abstraction. TCP attempts to provide a complete abstraction of an underlying unreliable network, but sometimes, the network leaks through the abstraction and you feel the things that the abstraction can’t quite protect you from. This is but one example of what I’ve dubbed the Law of Leaky Abstractions:
    1. If you have always wanted to know what it feels like to get stuck in a nonconsensual, one-way conversation with a libertarian high-school debate captain who’s more in love with his own brain than you will ever be with anyone or anything, Greenwald has just done you a great service. (I can already hear the debate captain shouting “point of personal privilege,” so I’ll try to steer clear of ad hominem from here on out.)
    1. Personalized ASR models. For each of the 432 participants with disordered speech, we create a personalized ASR model (SI-2) from their own recordings. Our fine-tuning procedure was optimized for our adaptation process, where we only have between ¼ and 2 h of data per speaker. We found that updating only the first five encoder layers (versus the complete model) worked best and successfully prevented overfitting [10]
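
      The "update only the first five encoder layers" idea, sketched generically in PyTorch. This is my own sketch, not Google's ASR code, and the .encoder_layers attribute is a hypothetical name for whatever list of encoder blocks the model exposes.

        import torch.nn as nn

        def freeze_all_but_first_k_encoder_layers(model: nn.Module, k: int = 5):
            # Freeze everything, then re-enable gradients for the first k encoder layers.
            for p in model.parameters():
                p.requires_grad = False
            for layer in model.encoder_layers[:k]:        # hypothetical attribute name
                for p in layer.parameters():
                    p.requires_grad = True

        # With only ~0.25-2 h of speech per speaker, limiting updates to a few early
        # layers acts as a strong regularizer against overfitting.
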
    1. The researchers found that the model, when it is still confused by a given phoneme (that’s an individual speech sound like an “e” or “f”), has two kinds of errors. First, there’s the fact that it doesn’t recognize the phoneme for what was intended, and thus is not recognizing the word. And second, the model has to guess which phoneme the speaker did intend, and might choose the wrong one in cases where two or more words sound roughly similar.
  20. Aug 2021
    1. So basically: q = the vector representing a word; K and V = your memory, thus all the words that have been generated before. Note that K and V can be the same (but don't have to be). So what you do with attention is that you take your current query (word in most cases) and look in your memory for similar keys. To come up with a distribution of relevant words, the softmax function is then used.
    1. The Edgerton Essays are named for Norman Rockwell’s famous 1943 painting, “Freedom of Speech.” Rockwell depicted Jim Edgerton, a farmer in their small town, rising to speak and being respectfully listened to by his neighbors. That respectful, democratic spirit is too often missing today, and what we’re hoping to cultivate with this series.
  21. Jul 2021
    1. Using multiple copies of a neuron in different places is the neural network equivalent of using functions. Because there is less to learn, the model learns more quickly and learns a better model. This technique – the technical name for it is ‘weight tying’ – is essential to the phenomenal results we’ve recently seen from deep learning.
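
      A toy illustration of weight tying (my own, not the article's): the same weight matrix is reused at every position of a sequence, so there is one set of parameters to learn no matter how long the input is.

        import numpy as np

        rng = np.random.default_rng(0)
        d_in, d_out, seq_len = 6, 4, 10

        W = rng.normal(size=(d_out, d_in))        # one shared ("tied") weight matrix
        xs = rng.normal(size=(seq_len, d_in))     # a sequence of inputs

        # Apply the same "neuron" (function) at every position, like a 1x1 convolution.
        hs = np.array([np.maximum(0, W @ x) for x in xs])
        print(hs.shape)                           # (10, 4): 10 positions, one shared W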

    1. Line 43: uses the "confidence weighted error" from l2 to establish an error for l1. To do this, it simply sends the error across the weights from l2 to l1. This gives what you could call a "contribution weighted error" because we learn how much each node value in l1 "contributed" to the error in l2. This step is called "backpropagating" and is the namesake of the algorithm

      Backpropagating
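
      A hedged reconstruction of the step the quote describes (variable names follow the common syn0/syn1 two-layer numpy tutorial; line numbers will differ from the original, and the targets here are toy values):

        import numpy as np

        rng = np.random.default_rng(1)
        X = rng.random((4, 3))                      # toy inputs
        syn0, syn1 = rng.random((3, 4)), rng.random((4, 1))
        sigmoid = lambda x: 1 / (1 + np.exp(-x))

        l1 = sigmoid(X @ syn0)                      # hidden layer
        l2 = sigmoid(l1 @ syn1)                     # output layer
        l2_error = rng.random((4, 1)) - l2          # toy target minus prediction
        l2_delta = l2_error * (l2 * (1 - l2))       # "confidence weighted error" at l2

        # Send l2's error back across the weights: how much did each l1 node
        # "contribute" to the error in l2?  This is the backpropagation step.
        l1_error = l2_delta @ syn1.T
        l1_delta = l1_error * (l1 * (1 - l1))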

    1. In our research, i.e., the wormnet project, we try to build machine learning models motivated by the C. elegans nervous system. By doing so, we have to pay a cost, as we constrain ourselves to such models in contrast to standard artificial neural networks, whose modeling space is purely constrained by memory and compute limitations. However, there are potentially some advantages and benefits we gain. Our objective is to better understand what’s necessary for effective neural information processing to emerge.
    1. Recommendations:
       - DON'T use shifted PPMI with SVD.
       - DON'T use SVD "correctly", i.e. without eigenvector weighting (performance drops 15 points compared to with eigenvalue weighting with p = 0.5).
       - DO use PPMI and SVD with short contexts (window size of 2).
       - DO use many negative samples with SGNS.
       - DO always use context distribution smoothing (raise the unigram distribution to the power of α = 0.75) for all methods.
       - DO use SGNS as a baseline (robust, fast and cheap to train).
       - DO try adding context vectors in SGNS and GloVe.
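
      Reading these recommendations into a gensim call might look roughly like this (SGNS baseline with many negative samples, a short window, and context-distribution smoothing at 0.75). The parameter names assume gensim 4.x's Word2Vec API, and the corpus is a placeholder.

        from gensim.models import Word2Vec

        sentences = [["the", "cat", "sat", "on", "the", "mat"]]   # placeholder corpus

        model = Word2Vec(
            sentences,
            sg=1,              # skip-gram with negative sampling (SGNS)
            window=2,          # short context window
            negative=15,       # many negative samples
            ns_exponent=0.75,  # smooth the unigram (context) distribution
            vector_size=300,
            min_count=1,
        )
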
  22. Jun 2021
    1. One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

      This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

  23. May 2021
  24. Mar 2021
  25. Feb 2021
    1. There's this wonderful study done by Deborah Estrin at Cornell. If you plan and decide in advance what you’re going to eat and watch, the food you select and the video you watch will be different. Your video is likely to be slightly more intellectual and challenging, and your food is likely to be healthier for you. When you do it in advance it’s your planning self instead of your immediate-gratification self.
    1. There are two directions to look for: first, using the principle of independence between the sources and the knowledge management layer, and second, fine tuning the balance between automatic processing and manual curation.
    2. Other approaches have been created to manage information according to topics, such as the Darwin Information Typing Architecture (DITA), an XML architecture used in the industry for technical documentation.
    3. Although XML has become a lingua franca for publishing and data interchange, its usage has decreased among information technology professionals, who now tend to prefer JSON for data interchange, especially in situations where the data structure is straightforward.
  26. Mar 2018
  27. Feb 2018
  28. Jan 2018
    1. (Of course, there were plenty of other things happening between the sixteenth and twenty-first centuries that changed the shape of the world we live in. I've skipped changes in agricultural productivity due to energy economics, which finally broke the Malthusian trap our predecessors lived in. This in turn broke the long term cap on economic growth of around 0.1% per year in the absence of famine, plagues, and wars depopulating territories and making way for colonial invaders. I've skipped the germ theory of diseases, and the development of trade empires in the age of sail and gunpowder that were made possible by advances in accurate time-measurement. I've skipped the rise and—hopefully—decline of the pernicious theory of scientific racism that underpinned western colonialism and the slave trade. I've skipped the rise of feminism, the ideological position that women are human beings rather than property, and the decline of patriarchy. I've skipped the whole of the Enlightenment and the age of revolutions! But this is a technocentric congress, so I want to frame this talk in terms of AI, which we all like to think we understand.)
  29. May 2017
    1. “I and other so-called ‘deniers’ are members of the 97 percent consensus, which refers to the following: Yes, the earth’s climate has been warming overall for more than a century. Yes, humans emit CO2, and CO2 has an overall warming effect on the climate,” Curry said. Where the consensus ends, Curry added, is “whether the dominant cause of the recent warming is humans versus natural causes, how the 21st century climate will evolve, and whether warming is dangerous.”
  30. Apr 2017
  31. Mar 2017
    1. That summer was the first time he rented an inexpensive cottage on Gotts, a remote island off the coast of Maine; it lacked running water and electricity but was covered in pine forests and romantic mists. There, he wrote Levin, he was “reading nothing more frivolous than Plotinus and Husserl,” and Harry was welcome to join him “if Wellfleet becomes too worldly.”

      Paul de Man is buried on Gotts

  32. Feb 2017
    1. The following is a statement of the laws of physics, not just my own personal opinion. "When power is Variable, Power controls airspeed." "When power is fixed, Pitch controls airspeed." In general, airplanes go where you point them, and go as fast as the power dictates. This is the easiest way to fly, and it works in all airplanes.
  33. Jan 2017
  34. Jul 2016
  35. May 2016
    1. Simulation tests indicate that manual control of the capsule attitude during retrograde firing will be a difficult task requiring much practice on the part of the pilot. By changing the command function from acceleration to rate, the task complexity will be greatly reduced and the developmental effort on display and controller characteristics can be reduced accordingly
  36. Apr 2016
    1. By valuing capital gains above all others, we end up extracting the value of our marketplaces and rendering them incapable of generating economic activity. As a Deloitte study showed, corporate profits over net worth have been decreasing for 75 years. Corporations are great at accumulating capital, but terrible at deploying it. They vacuum the money off the playing field altogether, impoverishing the markets and consumers–not to mention the employees–on whom they ultimately depend.
    1. "Using visible wavelengths of light, it is difficult to tell if an asteroid is big and dark, or bright and small, because both combinations reflect the same amount of light," said Carrie Nugent, a NEOWISE scientist at the Infrared Processing and Analysis Center at California Institute of Technology, in Pasadena. "But when you look at an asteroid in the infrared with NEOWISE, the amount of infrared light corresponds with how big the asteroid is, and with some thermal models on a computer, you can figure out how big the asteroids are."
  37. Mar 2016
    1. Since the mid 1960s and the explosion of electronics, telephony, and the computer chip, corporate profit over net worth has been declining. This doesn’t mean that corporations have stopped making money. Profits in many sectors are still going up. But the most apparently successful companies are also sitting on more cash — real and borrowed — than ever before. Corporations have been great at extracting money from all corners of the world, but they don’t really have great ways of spending or investing it. The cash does nothing but collect.
  38. Feb 2016
    1. He expects that the logging project near Quimby’s land will likely generate about $755,250 at the state’s average sale price, $50.35 per cord of wood. The land has about 1,500 harvestable acres that contain about 30 cords of wood per acre, or 45,000 cords, but only about a third of that will be cut because the land is environmentally sensitive, Denico said. The Bureau of Parks and Lands expects to generate about $6.6 million in revenue this year selling about 130,000 cords of wood from its lots, Denico said. Last year, the bureau generated about $7 million harvesting about 139,000 cords of wood. The Legislature allows the cutting of about 160,000 cords of wood on state land annually, although the LePage administration has sought to increase that amount.
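
      For reference, the quoted figure works out: roughly one third of 45,000 cords is 15,000 cords, and 15,000 cords × $50.35 per cord = $755,250.
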
  39. Jan 2016
    1. (the richer tourists at Disney World wear t-shirts printed with the names of famous designers, because designs themselves can be bootlegged easily and with impunity. The only way to make clothing that cannot be legally bootlegged is to print copyrighted and trademarked words on it; once you have taken that step, the clothing itself doesn't really matter, and so a t-shirt is as good as anything else. T-shirts with expensive words on them are now the insignia of the upper class. T-shirts with cheap words, or no words at all, are for the commoners).

      Crane & Co. - Reg. trademark

    2. But even from this remove it was possible to glean certain patterns, and one that recurred as regularly as an urban legend was the one about how someone would move into a commune populated by sandal-wearing, peace-sign flashing flower children, and eventually discover that, underneath this facade, the guys who ran it were actually control freaks; and that, as living in a commune, where much lip service was paid to ideals of peace, love and harmony, had deprived them of normal, socially approved outlets for their control-freakdom, it tended to come out in other, invariably more sinister, ways.
  40. Dec 2015
    1. “Speakin’ o’ creeds,” and here old Mrs. Sargent paused in her work, “Elder Ransom from Acreville stopped with us last night, an’ he tells me they recite the Euthanasian Creed every few Sundays in the Episcopal Church.  I didn’t want him to know how ignorant I was, but I looked up the word in the dictionary.  It means easy death, and I can’t see any sense in that, though it’s a terrible long creed, the Elder says, an’ if it’s any longer ’n ourn, I should think anybody might easy die learnin’ it!” “I think the word is Athanasian,” ventured the minister’s wife.
    1. More venery. More love; more closeness; more sex and romance. Bring it back, no matter what, no matter how old we are. This fervent cry of ours has been certified by Simone de Beauvoir and Alice Munro and Laurence Olivier and any number of remarried or recoupled ancient classmates of ours. Laurence Olivier? I’m thinking of what he says somewhere in an interview: “Inside, we’re all seventeen, with red lips.”
  41. Sep 2015
  42. Jul 2015