198 Matching Annotations
  1. Apr 2023
    1. Now we are getting somewhere. At this point, we also see that the dimensions of W and b for each layer are specified by the dimensions of the inputs and the number of nodes in each layer. Let’s clean up the above diagram by not labeling every w and b value individually.
    1. While past work has characterized what kinds of functions ICL can learn (Garg et al., 2022; Laskin et al., 2022) and the distributional properties of pretraining that can elicit in-context learning (Xie et al., 2021; Chan et al., 2022), how ICL learns these functions has remained unclear. What learning algorithms (if any) are implementable by deep network models? Which algorithms are actually discovered in the course of training? This paper takes first steps toward answering these questions, focusing on a widely used model architecture (the transformer) and an extremely well-understood class of learning problems (linear regression).
    1. It seems like the neuron basically adds the embedding of “ an” to the residual stream, which increases the output probability for “ an” since the unembedding step consists of taking the dot product of the final residual with each token.

      This cleared the dust from my eyes in understanding what the MLP layer does

  2. Mar 2023
    1. Because then we have a world in which grown men, sipping tea, posit thought experiments about raping talking sex dolls, thinking that maybe you are one too.
    2. Others, like Dennett, the philosopher of mind, are even more blunt. We can’t live in a world with what he calls “counterfeit people.” “Counterfeit money has been seen as vandalism against society ever since money has existed,” he said. “Punishments included the death penalty and being drawn and quartered. Counterfeit people is at least as serious.”
  3. Feb 2023
    1. The simulator widget below contains the entire source code of the game. I’ll explain how it works in the following sections.
    1. The second purpose of skip connections is specific to transformers — preserving the original input sequence.
    2. Skip connections serve two purposes. The first is that they help keep the gradient smooth, which is a big help for backpropagation. Attention is a filter, which means that when it’s working correctly it will block most of what tries to pass through it.
    3. Once we have the result of our attention step, a vector that includes the most recent word and a small collection of the words that have preceded it, we need to translate that into features, each of which is a word pair. Attention masking gets us the raw material that we need, but it doesn’t build those word pair features. To do that, we can use a single layer fully connected neural network.

      Early transformer exploration focused on the attention layer/mechanism. The MLP that follows the attention layer is now being explored; ROME, for example.

    1. If, on the other hand, I were to show you a brain scan taken before I believed it was going to rain, and after, there is no one in the world who could have the faintest clue what ideas these pictures were illustrating.

      They're working on it, for example, The neural architecture of language: Integrative modeling converges on predictive processing

    1. the Elhage et al.(2021) study showing an information-copying role for self-attention.

      It turns out Meng does refer to induction heads, just not by name.

  4. Jan 2023
    1. One of the main features of the high level architecture of a transformer is that each layer adds its results into what we call the “residual stream.” Constructing models with a residual stream traces back to early work by the Schmidhuber group, such as highway networks and LSTMs, which have found significant modern success in the more recent residual network architecture. In transformers, the residual stream vectors are often called the “embedding.” We prefer the residual stream terminology, both because it emphasizes the residual nature (which we believe to be important) and also because we believe the residual stream often dedicates subspaces to tokens other than the present token, breaking the intuitions the embedding terminology suggests. The residual stream is simply the sum of the output of all the previous layers and the original embedding. We generally think of the residual stream as a communication channel, since it doesn't do any processing itself and all layers communicate through it.
    2. A transformer starts with a token embedding, followed by a series of “residual blocks”, and finally a token unembedding. Each residual block consists of an attention layer, followed by an MLP layer. Both the attention and MLP layers each “read” their input from the residual stream (by performing a linear projection), and then “write” their result to the residual stream by adding a linear projection back in. Each attention layer consists of multiple heads, which operate in parallel.
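
      To make the read/write picture concrete, here is a minimal numpy sketch (toy dimensions, random matrices standing in for trained layers; nothing here comes from the paper itself): each block reads the residual stream through a linear projection and writes by adding a projection back in.

      ```python
      import numpy as np

      d_model, seq_len = 16, 4
      rng = np.random.default_rng(0)

      residual = rng.normal(size=(seq_len, d_model))   # the residual stream: one vector per token

      def toy_block(x, rng):
          """Stand-in for an attention or MLP layer: read via a projection, return something to write back."""
          W_read = rng.normal(size=(d_model, d_model)) * 0.1
          W_write = rng.normal(size=(d_model, d_model)) * 0.1
          hidden = np.maximum(x @ W_read, 0)           # "read" from the stream (plus a nonlinearity)
          return hidden @ W_write                      # "write": a linear projection to be added back in

      for _ in range(3):                               # three residual blocks
          residual = residual + toy_block(residual, rng)  # each layer adds its result into the stream

      print(residual.shape)  # (4, 16): the stream keeps the same shape through every layer
      ```
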
    1. You see the values of the self-attention weights are computed on the fly. They are data-dependent dynamic weights because they change dynamically in response to the data (fast weights).
    1. The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning (Jabri and Flower, 1992).
  5. Dec 2022
    1. The attention distribution is usually generated with content-based attention. The attending RNN generates a query describing what it wants to focus on. Each item is dot-producted with the query to produce a score, describing how well it matches the query. The scores are fed into a softmax to create the attention distribution.

      This is the Key, Value, Query, yes?
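
      A sketch of that paragraph in numpy, with made-up dimensions. In this early RNN-attention setting the encoder states play the roles of both key and value; the split into separate key/value projections comes with transformers.

      ```python
      import numpy as np

      def softmax(x):
          e = np.exp(x - x.max())
          return e / e.sum()

      d = 8
      rng = np.random.default_rng(1)
      items = rng.normal(size=(5, d))   # encoder states: here they serve as both keys and values
      query = rng.normal(size=(d,))     # what the attending RNN wants to focus on

      scores = items @ query            # dot product: how well each item matches the query
      attention = softmax(scores)       # the attention distribution
      context = attention @ items       # weighted average of the values

      print(attention.round(2), context.shape)
      ```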

    1. Our method is based on the hypothesis that the weights of a generator act as Optimal Linear Associative Memory (OLAM). OLAM is a classic single-layer neural data structure for memorizing associations that was described by Teuvo Kohonen and James A Anderson (independently) in the 1970s. In our case, we hypothesize that within a large modern multilayer convolutional network, each individual layer plays the role of an OLAM that stores a set of rules that associates keys, which denote meaningful context, with values, which determine output.
  6. Oct 2022
  7. Sep 2022
    1. Consider a toy model where we train an embedding of five features of varying importance (where “importance” is a scalar multiplier on mean squared error loss) in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features.
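
      A rough sketch of that toy setup as I read it (my own simplification, not the authors' code, and with training omitted): five features with decaying importance, a 2-dimensional embedding W, reconstruction through ReLU(W^T W x + b), and an importance-weighted squared error.

      ```python
      import numpy as np

      n_features, n_hidden = 5, 2
      importance = 0.9 ** np.arange(n_features)          # scalar weight on each feature's squared error
      rng = np.random.default_rng(0)
      W = rng.normal(size=(n_hidden, n_features)) * 0.3  # the embedding to be learned (random here)
      b = np.zeros(n_features)

      def sample_batch(sparsity, batch=256):
          """Features are uniform in [0, 1], but each is zeroed out with probability `sparsity`."""
          x = rng.uniform(size=(batch, n_features))
          mask = rng.uniform(size=(batch, n_features)) > sparsity
          return x * mask

      def loss(x):
          x_hat = np.maximum(x @ W.T @ W + b, 0)         # embed to 2-d, project back, ReLU filter
          return np.mean(importance * (x - x_hat) ** 2)  # importance-weighted MSE

      print(loss(sample_batch(sparsity=0.7)))
      ```
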
    1. To see how this plays out, we can continue looking at matrix shapes. Tracing the matrix shape through the branches and weaves of the multihead attention blocks requires three more numbers. d_k: dimensions in the embedding space used for keys and queries. 64 in the paper. d_v: dimensions in the embedding space used for values. 64 in the paper. h: the number of heads. 8 in the paper.
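
      A quick shape check with the numbers quoted above (d_model = 512, d_k = d_v = 64, h = 8); the projection matrices are random placeholders, only the shapes matter.

      ```python
      import numpy as np

      def softmax(s):
          e = np.exp(s - s.max(axis=-1, keepdims=True))
          return e / e.sum(axis=-1, keepdims=True)

      d_model, d_k, d_v, h, seq_len = 512, 64, 64, 8, 10
      rng = np.random.default_rng(0)
      x = rng.normal(size=(seq_len, d_model))          # one 512-dim embedding per position

      heads = []
      for _ in range(h):                               # h = 8 heads, run in parallel in practice
          W_q = rng.normal(size=(d_model, d_k))
          W_k = rng.normal(size=(d_model, d_k))
          W_v = rng.normal(size=(d_model, d_v))
          Q, K, V = x @ W_q, x @ W_k, x @ W_v          # each (10, 64)
          scores = Q @ K.T / np.sqrt(d_k)              # (10, 10)
          heads.append(softmax(scores) @ V)            # (10, 64)

      W_o = rng.normal(size=(h * d_v, d_model))        # concatenated heads project back to d_model
      out = np.concatenate(heads, axis=-1) @ W_o
      print(out.shape)                                 # (10, 512)
      ```
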
    1. Now, the progression of NLP, as discussed, tells a story. We begin with tokens and then build representations of these tokens. We use these representations to find similarities between tokens and embed them in a high-dimensional space. The same embeddings are also passed into sequential models that can process sequential data. Those models are used to build context and, through an ingenious way, attend to parts of the input sentence that are useful to the output sentence in translation.
    2. Data, matrix multiplications, repeated and scaled with non-linear switches. Maybe that simplifies things a lot, but even today, most architectures boil down to these principles. Even the most complex systems, ideas, and papers can be boiled down to just that:
  8. Aug 2022
    1. Neural models more closely resemble movable type: they will change the way culture is transmitted in many social contexts.
  9. andrewbrown.substack.com
    1. But the truths of religion appear in the lives of believers, not in their theologies,
  10. Jun 2022
    1. The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the key at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.
    1. The creator of GraphQL admits this. During his presentation on the library at a Facebook internal conference, an audience member asked him about the difference between GraphQL and SOAP. His response: SOAP requires XML. GraphQL defaults to JSON—though you can use XML.
    2. Conclusion There are decades of history and a broad cast of characters behind the web requests you know and love—as well as the ones that you might have never heard of. Information first traveled across the internet in 1969, followed by a lot of research in the ’70s, then private networks in the ’80s, then public networks in the ’90s. We got CORBA in 1991, followed by SOAP in 1999, followed by REST around 2003. GraphQL reimagined SOAP, but with JSON, around 2015. This all sounds like a history class fact sheet, but it’s valuable context for building our own web apps.
    1. This trick of using a one-hot vector to pull out a particular row of a matrix is at the core of how transformers work.

      Matrix multiplication as table lookup
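
      A tiny illustration with a made-up vocabulary and embedding values: multiplying a one-hot row vector by a matrix pulls out the matching row, so the lookup really is just a matrix multiplication.

      ```python
      import numpy as np

      vocab = ["the", "cat", "sat"]
      E = np.array([[0.1, 0.2],     # embedding row for "the"
                    [0.3, 0.4],     # row for "cat"
                    [0.5, 0.6]])    # row for "sat"

      one_hot = np.array([0, 1, 0])             # "cat"
      print(one_hot @ E)                        # [0.3 0.4]: exactly row 1 of E
      print(np.array_equal(one_hot @ E, E[1]))  # True: the multiplication is a table lookup
      ```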

  11. May 2022
    1. Given the complexities of the brain’s structure and the functions it performs, any one of these models is surely oversimplified and ultimately wrong—at best, an approximation of some aspects of what the brain does. However, some models are less wrong than others, and consistent trends in performance across models can reveal not just which model best fits the brain but also which properties of a model underlie its fit to the brain, thus yielding critical insights that transcend what any single model can tell us.
    1. According to a 2017 study, some 4.5 million American women have been threatened by a gun-wielding partner or former partner. Almost 1 million American women have survived after a gun was used by a partner against them.
    2. If there were any merit to the “defensive gun use” argument, you’d expect that one permissive nation to boast much greater safety.
    1. Such a highly non-linear problem would clearly benefit from the computational power of many layers. Unfortunately, back-propagation learning generally slows down by an order of magnitude every time a layer is added to a network.

      The problem in 1988

    1. The source sequence will be passed to the TransformerEncoder, which will produce a new representation of it. This new representation will then be passed to the TransformerDecoder, together with the target sequence so far (target words 0 to N). The TransformerDecoder will then seek to predict the next words in the target sequence (N+1 and beyond).
    1. When chatting with my father about the proton research he summed it up nicely, that two possible responses to hearing that how we measure something seems to change its nature, throwing the reliability of empirical testing into question, are: “Science has been disproved!” or “Great!  Another thing to figure out using the Scientific Method!” The latter reaction is everyday to those who are versed in and comfortable with the fact that science is not a set of doctrines but a process of discovery, hypothesis, disproof and replacement.  Yet the former reaction, “X is wrong therefore the system which yielded X is wrong!” is, in fact, the historical norm.
  12. Apr 2022
    1. Our pre-trained network is nearly identical to the “AlexNet” architecture (Krizhevsky et al., 2012), but with local response normalization layers after pooling layers following (Jia et al., 2014). It was trained with the Caffe framework on the ImageNet 2012 dataset (Deng et al., 2009)
    1. Convolution Demo. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size W_1 = 5, H_1 = 5, D_1 = 3, and the CONV layer parameters are K = 2, F = 3, S = 2, P = 1. That is, we have two filters of size 3×3, and they are applied with a stride of 2. Therefore, the output volume has spatial size (5 - 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of P = 1 is applied to the input volume, making the outer border of the input volume zero. The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.

      Best explanation/illustration of a convolution layer and the way the numbers relate.

    2. Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume. Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in 2D space (e.g. 3x3), but full along the input depth (20).

      These two examples are the first two layers of Andrej Karpathy's wonderful working ConvNetJS CIFAR-10 demo here
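
      The arithmetic from the demo and the two examples, as a small helper (assuming the usual (W - F + 2P)/S + 1 output-size formula quoted above):

      ```python
      def conv_output_size(W, F, S, P):
          """Spatial output size of a conv layer: (W - F + 2P)/S + 1."""
          return (W - F + 2 * P) // S + 1

      # The demo above: 5x5x3 input, F=3, S=2, P=1  ->  spatial size 3
      print(conv_output_size(5, 3, 2, 1))   # 3

      # Example 1: 5x5 filter over a 32x32x3 input -> full depth, local in 2D
      print(5 * 5 * 3)                      # 75 weights per neuron, plus 1 bias

      # Example 2: 3x3 filter over a 16x16x20 input
      print(3 * 3 * 20)                     # 180 connections per neuron
      ```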

    1. input (32x32x3): max activation 0.5, min -0.5; max gradient 1.08696, min -1.53051. conv (32x32x16): filter size 5x5x3, stride 1; max activation 3.75919, min -4.48241; max gradient 0.36571, min -0.33032; parameters: 16x5x5x3+16 = 1216

      The dimensions of these first two layers are explained here

    1. Here the lower-level layers are frozen and are not trained; only the new classification head will update itself to learn from the features provided by the pre-trained, chopped-up model on the left.
    1. Starting from random noise, we optimize an image to activate a particular neuron (layer mixed4a, unit 11).

      And then we use that image as a kind of variable name to refer to the neuron in a way that is more helpful than the layer number and neuron index within the layer. This explanation is via one of Chris Olah's YouTube videos (https://www.youtube.com/watch?v=gXsKyZ_Y_i8)

    1. This just happened to me and it was because I was signed in to my work account at the same time.  I went to "sign out all" and signed in again and they reappeared

      When Android apps disappeared from my Chromebook, it was because I had added a managed account.

  13. Mar 2022
    1. A special quality of humans, not shared by evolution or, as yet, by machines, is our ability to recognize gaps in our understanding and to take joy in the process of filling them in. It is a beautiful thing to experience the mysterious, and powerful, too.
  14. Feb 2022
    1. Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.
  15. Jan 2022
    1. While heat pumps are the most cost effective way to use electricity to heat your home during the cooler months, leaving them running day and night is not economically efficient. According to Energywise, you should switch off your heat pump when you don’t need it. This is to avoid excessive energy waste.
    1. Treatment with single probiotic B. infantis didn't impact on abdominal pain, bloating/distention, or bowel habit satisfaction among IBS patients. However, patients who received composite probiotics containing B. infantis had significantly reduced abdominal pain
  16. Dec 2021
    1. To test whether these distributed representations of meaning are neurally plausible, a number of studies have attempted to learn a mapping between particular semantic dimensions and patterns of brain activation
    1. I grew up in a small town called Surry on the coast of down-east Maine. At Christmas, most everyone in our town bought their trees at Jordan's Tree Farm. $5 per tree, cut at your own risk. Thinking back, it seems funny to me now, since after all, this is rural Maine, the pine tree state. And you'd think everyone could cut their own trees on their own land. And it's not like the trees at the Jordan farm were so special. Pretty much everyone called them Charlie Brown trees. People came because of Robert Jordan. They were loyal to him, and they figured he could use the money.
    1. the only thing an artificial neuron can do: classify a data point into one of two kinds by examining input values with weights and bias.

      How does this relate to "weighted sum shows similarity between the weights and the inputs"?
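
      My attempt at an answer: the weighted sum w·x is largest when the input points in the same direction as the weight vector, so it can be read as a similarity score that the bias then thresholds into one of two classes. A minimal sketch with made-up weights:

      ```python
      import numpy as np

      def neuron(x, w, b):
          """The only thing it does: weighted sum plus bias, then a threshold into one of two classes."""
          return int(np.dot(w, x) + b > 0)

      w = np.array([1.0, -2.0])   # weights
      b = 0.5                     # bias
      print(neuron(np.array([3.0, 1.0]), w, b))   # 1: lands on the positive side of the line
      print(neuron(np.array([0.0, 2.0]), w, b))   # 0: lands on the negative side
      ```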

    1. The transformer model introduces the idea that, instead of adding another complex mechanism (attention) to an already complex Seq2Seq model, we can simplify the solution by forgetting about everything else and just focusing on attention.
    1. I’m particularly interested in two questions: First, just how weird is machine learning? Second, what sorts of choices do developers make as they shape a project?
  17. Nov 2021
    1. Now that we've made peace with the concepts of projections (matrix multiplications)

      Projections are matrix multiplications. Why didn't you say so? Spatial and channel projections in the gated gMLP.

    2. Computers are especially good at matrix multiplications. There is an entire industry around building computer hardware specifically for fast matrix multiplications. Any computation that can be expressed as a matrix multiplication can be made shockingly efficient.
    3. The selective-second-order-with-skips model is a useful way to think about what transformers do, at least in the decoder side. It captures, to a first approximation, what generative language models like OpenAI's GPT-3 are doing.
    1. The following figure presents a simple functional diagram of the neural network we will use throughout the article. The neural network is a sequence of linear (both convolutional and fully-connected), max-pooling, and ReLU layers, culminating in a softmax layer. A convolution calculates weighted sums of regions in the input; in neural networks, the learnable weights in convolutional layers are referred to as the kernel (see also Convolution arithmetic; image credit: https://towardsdatascience.com/gentle-dive-into-math-behind-convolutional-neural-networks-79a07dd44cf9). A fully-connected layer computes output neurons as a weighted sum of input neurons; in matrix form, it is a matrix that linearly transforms the input vector into the output vector. First introduced by Nair and Hinton, ReLU calculates f(x) = max(0, x) for each entry in a vector input; graphically, it is a hinge at the origin (image credit: https://pytorch.org/docs/stable/nn.html#relu). The softmax layer calculates S(y_i) = e^{y_i} / Σ_{j=1}^{N} e^{y_j} for each entry y_i in a vector input y (image credit: https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/).

      This is a great visualization of MNIST hidden layers.
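
      The two formulas from the figure's notes, restated in runnable form (the max-subtraction in softmax is only for numerical stability):

      ```python
      import numpy as np

      def relu(x):
          """f(x) = max(0, x), applied elementwise."""
          return np.maximum(0, x)

      def softmax(y):
          """S(y_i) = exp(y_i) / sum_j exp(y_j)."""
          e = np.exp(y - np.max(y))
          return e / e.sum()

      y = np.array([2.0, 1.0, -1.0])
      print(relu(np.array([-3.0, 0.5])))   # [0.  0.5]
      print(softmax(y), softmax(y).sum())  # probabilities that sum to 1
      ```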

    1. The most beautiful and deepest experience a man can have is the sense of the mysterious.
    1. The Query word can be interpreted as the word for which we are calculating Attention. The Key and Value word is the word to which we are paying attention, i.e., how relevant that word is to the Query word.

      Finally

    1. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.
    1. The cube of activations that a neural network for computer vision develops at each hidden layer. Different slices of the cube allow us to target the activations of individual neurons, spatial positions, or channels.

      This is first explanation of

    1. The attention layer (W in the diagram) computes three vectors based on the input, termed key, query, and value.

      Could you be more specific?

    2. Attention is a means of selectively weighting different elements in input data, so that they will have an adjusted impact on the hidden states of downstream layers.
    1. These findings provide strong evidence for a classic hypothesis about the computations underlying human language understanding, that the brain’s language system is optimized for predictive processing in the service of meaning extraction
    1. On the geopolitical stage, it’s hard to argue with the claim that Twitter is a force of evil. But Twitter is also the infrastructural backbone of much of the digital humanities world.
    1. To review, the Forget gate decides what is relevant to keep from prior steps. The input gate decides what information is relevant to add from the current step. The output gate determines what the next hidden state should be. Code Demo: For those of you who understand better through seeing the code, here is an example using Python pseudo code.
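
      The article's code demo isn't captured in the highlight, so here is a stand-in sketch of the three gates (assumed shapes and parameter names, sigmoid and tanh as usual):

      ```python
      import numpy as np

      def sigmoid(x):
          return 1 / (1 + np.exp(-x))

      def lstm_step(x_t, h_prev, c_prev, W, U, b):
          """One LSTM step. W, U, b each hold the forget/input/output/candidate parameters."""
          f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate: what to keep from c_prev
          i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate: what to add from this step
          o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate: what the next hidden state shows
          c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
          c_t = f * c_prev + i * c_tilde                         # new cell state
          h_t = o * np.tanh(c_t)                                 # new hidden state
          return h_t, c_t

      d_in, d_h = 3, 4
      rng = np.random.default_rng(0)
      W = {k: rng.normal(size=(d_h, d_in)) for k in "fioc"}
      U = {k: rng.normal(size=(d_h, d_h)) for k in "fioc"}
      b = {k: np.zeros(d_h) for k in "fioc"}
      h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
      print(h.shape, c.shape)  # (4,) (4,)
      ```
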
  18. Oct 2021
    1. This approach, visualizing high-dimensional representations using dimensionality reduction, is an extremely broadly applicable technique for inspecting models in deep learning.
    2. These layers warp and reshape the data to make it easier to classify.
    1. Even with this very primitive single neuron, you can achieve 90% accuracy when recognizing a handwritten text image. To recognize all the digits from 0 to 9, you would need just ten neurons to recognize them with 92% accuracy.

      And here is a Google Colab notebook that demonstrates that

    1. Reports of death after COVID-19 vaccination are rare. More than 396 million doses of COVID-19 vaccines were administered in the United States from December 14, 2020, through October 4, 2021. During this time, VAERS received 8,390 reports of death (0.0021%) among people who received a COVID-19 vaccine. FDA requires healthcare providers to report any death after COVID-19 vaccination to VAERS, even if it’s unclear whether the vaccine was the cause. Reports of adverse events to VAERS following vaccination, including deaths, do not necessarily mean that a vaccine caused a health problem. A review of available clinical information, including death certificates, autopsy, and medical records, has not established a causal link to COVID-19 vaccines. However, recent reports indicate a plausible causal relationship between the J&J/Janssen COVID-19 Vaccine and TTS, a rare and serious adverse event—blood clots with low platelets—which has caused deaths (PDF, 1.4 MB, 40 pages).
    1. It is not only the essence of being human but also a vital property of life. Technological advances in communication shape society and make its members more interdependent.

  19. Sep 2021
    1. The models are developed in Python [46], using the Keras [47] and Tensorflow [48] libraries. Details on the code and dependencies to run the experiments are listed in a Readme file available together with the code in the Supplemental Material.

      I have not found the code or Readme file

    2. These results nonetheless show that it could be feasible to develop recurrent neural network models able to infer input-output behaviours of real biological systems, enabling researchers to advance their understanding of these systems even in the absence of detailed level of connectivity.

      Too strong a claim?

    3. We show that GRU models with a hidden layer size of 4 units are able to accurately reproduce the system’s response to very different stimuli.
    1. One popular theory among machine learning researchers is the manifold hypothesis: MNIST is a low dimensional manifold, sweeping and curving through its high-dimensional embedding space. Another hypothesis, more associated with topological data analysis, is that data like MNIST consists of blobs with tentacle-like protrusions sticking out into the surrounding space.
    1. This is what I call a leaky abstraction. TCP attempts to provide a complete abstraction of an underlying unreliable network, but sometimes, the network leaks through the abstraction and you feel the things that the abstraction can’t quite protect you from. This is but one example of what I’ve dubbed the Law of Leaky Abstractions:
    1. Humans perform a version of this task when interpreting hard-to-understand speech, such as an accent which is particularly fast or slurred, or a sentence in a language we do not know very well—we do not necessarily hear every single word that is said, but we pick up on salient key words and contextualize the rest to understand the sentence.

      Boy, don't they

    1. A neural network will predict your digit in the blue square above. Your image is 784 pixels (= 28 rows by 28 columns with black=1 and white=0). Those 784 features get fed into a 3 layer neural network; Input:784 - AvgPool:196 - Dense:100 - Softmax:10.
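
      A guess at that layer stack in Keras (the exact layers are my assumption; 2x2 average pooling takes 28x28 down to 14x14 = 196):

      ```python
      import tensorflow as tf

      model = tf.keras.Sequential([
          tf.keras.Input(shape=(28, 28, 1)),                   # 784 pixels, black=1 and white=0
          tf.keras.layers.AveragePooling2D(pool_size=2),       # AvgPool: 14x14 = 196 values
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(100, activation="relu"),       # Dense:100
          tf.keras.layers.Dense(10, activation="softmax"),     # Softmax:10 digit classes
      ])
      model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
      model.summary()
      ```
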
    1. If you have always wanted to know what it feels like to get stuck in a nonconsensual, one-way conversation with a libertarian high-school debate captain who’s more in love with his own brain than you will ever be with anyone or anything, Greenwald has just done you a great service. (I can already hear the debate captain shouting “point of personal privilege,” so I’ll try to steer clear of ad hominem from here on out.)
    1. Personalized ASR models. For each of the 432 participants with disordered speech, we create a personalized ASR model (SI-2) from their own recordings. Our fine-tuning procedure was optimized for our adaptation process, where we only have between ¼ and 2 h of data per speaker. We found that updating only the first five encoder layers (versus the complete model) worked best and successfully prevented overfitting [10]
    1. The researchers found that the model, when it is still confused by a given phoneme (that’s an individual speech sound like an “e” or “f”), has two kinds of errors. First, there’s the fact that it doesn’t recognize the phoneme for what was intended, and thus is not recognizing the word. And second, the model has to guess which phoneme the speaker did intend, and might choose the wrong one in cases where two or more words sound roughly similar.
    1. So whenever you hear of someone “training” a neural network, it just means finding the weights we use to calculate the prediction.
  20. Aug 2021
    1. So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.
    1. I'm going to try to provide an English text example. The following is based solely on my intuitive understanding of the paper 'Attention is all you need'.

      This is also good

    2. For the word q that your eyes see in the given sentence, what is the most related word k in the sentence to understand what q is about?
    3. So basically: q = the vector representing a word K and V = your memory, thus all the words that have been generated before. Note that K and V can be the same (but don't have to). So what you do with attention is that you take your current query (word in most cases) and look in your memory for similar keys. To come up with a distribution of relevant words, the softmax function is then used.
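
      Putting these notes together as a numpy sketch (toy dimensions, random weights; K and V are computed from the same memory here, which the note says is allowed):

      ```python
      import numpy as np

      d_model, d_k = 8, 4
      rng = np.random.default_rng(0)
      W_q = rng.normal(size=(d_model, d_k))    # the three matrices trained during training
      W_k = rng.normal(size=(d_model, d_k))
      W_v = rng.normal(size=(d_model, d_k))

      memory = rng.normal(size=(6, d_model))   # embeddings of the words generated before
      current = rng.normal(size=(d_model,))    # embedding of the word the eyes see now

      q = current @ W_q                        # query for the current word
      K = memory @ W_k                         # keys for the words in memory
      V = memory @ W_v                         # values for the words in memory

      scores = K @ q / np.sqrt(d_k)            # how similar each key is to the query
      weights = np.exp(scores - scores.max())
      weights /= weights.sum()                 # softmax: a distribution over the remembered words
      attended = weights @ V                   # what gets read out of memory
      print(weights.round(2), attended.shape)
      ```
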
    1. The Edgerton Essays are named for Norman Rockwell’s famous 1943 painting, “Freedom of Speech.” Rockwell depicted Jim Edgerton, a farmer in their small town, rising to speak and being respectfully listened to by his neighbors. That respectful, democratic spirit is too often missing today, and what we’re hoping to cultivate with this series.
    1. A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem. It’s true, essentially, because the hidden layer can be used as a lookup table.
    2. Recursive Neural Networks
    3. t-SNE visualizations of word embeddings.
  21. Jul 2021
    1. In the language of Interpretable Machine Learning (IML) literature like Molnar et al.[20], input saliency is a method that explains individual predictions.
    1. Using multiple copies of a neuron in different places is the neural network equivalent of using functions. Because there is less to learn, the model learns more quickly and learns a better model. This technique – the technical name for it is ‘weight tying’ – is essential to the phenomenal results we’ve recently seen from deep learning.

    1. Vectors with a small Euclidean distance from one another are located in the same region of a vector space. Vectors with a high cosine similarity are located in the same general direction from the origin.
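
      A two-vector example of the distinction (values chosen to make the contrast obvious):

      ```python
      import numpy as np

      def euclidean(a, b):
          return np.linalg.norm(a - b)

      def cosine_similarity(a, b):
          return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

      a = np.array([1.0, 1.0])
      b = np.array([10.0, 10.0])       # same direction as a, but far away

      print(euclidean(a, b))           # ~12.73: far apart in the space
      print(cosine_similarity(a, b))   # ~1.0: same general direction from the origin
      ```
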
    1. If you're serious about neural networks, I have one recommendation. Try to rebuild this network from memory.
    2. Line 43: uses the "confidence weighted error" from l2 to establish an error for l1. To do this, it simply sends the error across the weights from l2 to l1. This gives what you could call a "contribution weighted error" because we learn how much each node value in l1 "contributed" to the error in l2. This step is called "backpropagating" and is the namesake of the algorithm

      Backpropagating

    1. In our research, i.e., the wormnet project, we try to build machine learning models motivated by the C. elegans nervous system. By doing so, we have to pay a cost, as we constrain ourselves to such models in contrast to standard artificial neural networks, whose modeling space is purely constraint by memory and compute limitations. However, there are potentially some advantages and benefits we gain. Our objective is to better understand what’s necessary for effective neural information processing to emerge.
    1. Recommendations: DON'T use shifted PPMI with SVD. DON'T use SVD "correctly", i.e. without eigenvector weighting (performance drops 15 points compared to eigenvalue weighting with p = 0.5). DO use PPMI and SVD with short contexts (window size of 2). DO use many negative samples with SGNS. DO always use context distribution smoothing (raise the unigram distribution to the power of α = 0.75) for all methods. DO use SGNS as a baseline (robust, fast and cheap to train). DO try adding context vectors in SGNS and GloVe.
  22. Jun 2021
    1. Here is an example run of the QnA model:

      This example doesn't work. The await gets an error. Since it's not inside the promise?

    1. One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning

      This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

    1. TensorFlow.js provides the Layers API, which mirrors the Keras API as closely as possible, including the serialization format.

      Surfing TensorFlow, I was orbiting this conclusion. It's good to see it stated clearly.

    1. The Hole Hawg is a drill made by the Milwaukee Tool Company.
    2. primal Jungian fugue
    3. They pay lip service to multiculturalism and diversity and non-judgmentalness, but they don't raise their own children that way.
    4. It comes through as the presumption that all authority figures--teachers, generals, cops, ministers, politicians--are hypocritical buffoons, and that hip jaded coolness is the only way to be.
  23. May 2021
    1. Note that variables cannot appear in the predicate position.
  24. Mar 2021
  25. Feb 2021
    1. There's this wonderful study done by Deborah Estrin at Cornell. If you plan and decide in advance what you’re going to eat and watch, the food you select and the video you watch will be different. Your video is likely to be slightly more intellectual and challenging, and your food is likely to be healthier for you. When you do it in advance it’s your planning self instead of your immediate-gratification self.
    1. There are two directions to look for: first, using the principle of independence between the sources and the knowledge management layer, and second, fine tuning the balance between automatic processing and manual curation.
    2. The "authority terms" comprising it are expected to be used outside the library catalog, as metadata in the sources, enabling links to the taxonomy.
    3. Other approaches have been created to manage information according to topics, such as the Darwin Information Typing Architecture (DITA), an XML architecture used in the industry for technical documentation.
    4. Although XML has become a lingua franca for publishing and data interchange, its usage has decreased among information technology professionals, who now tend to prefer JSON for data interchange, especially in situations where the data structure is straightforward.
    5. We started designing topic maps in an informal working group called Davenport, which turned out to also be at the origin of the SGML/XML-based "Docbook" document architecture.
  26. Mar 2018
    1. It would be fair to characterize Beaker as “a novel application of Bittorrent’s concepts to the Web platform.” If Beaker had been started in 2006, it would be using Bittorrent as its primary protocol. However, as of 2016, new variants have appeared with better properties.
  27. Feb 2018
  28. Jan 2018
    1. Regulatory agencies are our current political systems' tool of choice for preventing paperclip maximizers from running amok.
    2. Dude, you broke the future!
    3. (Of course, there were plenty of other things happening between the sixteenth and twenty-first centuries that changed the shape of the world we live in. I've skipped changes in agricultural productivity due to energy economics, which finally broke the Malthusian trap our predecessors lived in. This in turn broke the long term cap on economic growth of around 0.1% per year in the absence of famine, plagues, and wars depopulating territories and making way for colonial invaders. I've skipped the germ theory of diseases, and the development of trade empires in the age of sail and gunpowder that were made possible by advances in accurate time-measurement. I've skipped the rise and—hopefully—decline of the pernicious theory of scientific racism that underpinned western colonialism and the slave trade. I've skipped the rise of feminism, the ideological position that women are human beings rather than property, and the decline of patriarchy. I've skipped the whole of the Enlightenment and the age of revolutions! But this is a technocentric congress, so I want to frame this talk in terms of AI, which we all like to think we understand.)
    4. the development of Artificial Intelligence, which happened no earlier than 1553 and no later than 1844. I'm talking about the very old, very slow AIs we call corporations,
    5. I think transhumanism is a warmed-over Christian heresy.
  29. May 2017
    1. “I and other so-called ‘deniers’ are members of the 97 percent consensus, which refers to the following: Yes, the earth’s climate has been warming overall for more than a century. Yes, humans emit CO2, and CO2 has an overall warming effect on the climate,” Curry said. Where the consensus ends, Curry added, is “whether the dominant cause of the recent warming is humans versus natural causes, how the 21st century climate will evolve, and whether warming is dangerous.”
  30. Apr 2017
  31. Mar 2017
    1. That summer was the first time he rented an inexpensive cottage on Gotts, a remote island off the coast of Maine; it lacked running water and electricity but was covered in pine forests and romantic mists. There, he wrote Levin, he was “reading nothing more frivolous than Plotinus and Husserl,” and Harry was welcome to join him “if Wellfleet becomes too worldly.”

      Paul de Man is buried on Gotts

    1. Progressive values demand empathy for the poor and this often manifests as hatred for the rich.
    2. I’m realizing more and more how desperately this perspective is needed as I watch researchers and advocates, politicians and everyday people judge others from their vantage point without taking a moment to understand why a particular logic might unfold.
  32. Feb 2017
    1. The following is a statement of the laws of physics, not just my own personal opinion. "When power is Variable, Power controls airspeed." "When power is fixed, Pitch controls airspeed." In general, airplanes go where you point them, and go as fast as the power dictates. This is the easiest way to fly, and it works in all airplanes.
  33. Jan 2017
  34. Jul 2016
  35. May 2016
    1. lesh, the engineering manager, and Dubusker drew on all of McDonnell's experience with shingled-skin structures around jet afterburners for heat protection.
    2. Simulation tests indicate that manual control of the capsule attitude during retrograde firing will be a difficult task requiring much practice on the part of the pilot. By changing the command function from acceleration to rate, the task complexity will be greatly reduced and the developmental effort on display and controller characteristics can be reduced accordingly
    3. There is a natural reluctance to relinquish the mechanical linkage to the solenoid valves but the redundant fly-by-wire systems offer mechanical simplification with regard to plumbing and valving hydrogen peroxide
    1. Three retro rockets fire for 10 seconds each. They are started at 5 second intervals, firing overlaps for a total of 20 s. Delta V of 550 ft/s (168 m/s) is taken off forward velocity.
  36. Apr 2016
    1. In order to obtain an accurate estimate of true completion, and thus population, one must bias-correct the observed re-detection ratio to estimate the true completion as a function of size of asteroid. We do this with a computer model simulating actual surveys.
    1. Here’s the URL of annotations tagged wikipedia: https://hypothes.is/stream?q=tag:%27wikipedia%27 (Actually that doesn’t seem to work yet, but I’d love to see this become a next-gen delicious with all the taggy goodness.)

      I would love to see a worthy successor to delicious. Is hypothesis it?

    2. One thing I held on to during fedwiki was that it wasn’t intended to be wikipedia, and to me that meant it wasn’t intended to produce articles so much as to sustain and connect ideas in formation that might find their way into article-like things on other platforms.
    1. By valuing capital gains above all others, we end up extracting the value of our marketplaces and rendering them incapable of generating economic activity. As a Deloitte study showed, corporate profits over net worth have been decreasing for 75 years. Corporations are great at accumulating capital, but terrible at deploying it. They vacuum the money off the playing field altogether, impoverishing the markets and consumers–not to mention the employees–on whom they ultimately depend.
    1. "Using visible wavelengths of light, it is difficult to tell if an asteroid is big and dark, or bright and small, because both combinations reflect the same amount of light," said Carrie Nugent, a NEOWISE scientist at the Infrared Processing and Analysis Center at California Institute of Technology, in Pasadena. "But when you look at an asteroid in the infrared with NEOWISE, the amount of infrared light corresponds with how big the asteroid is, and with some thermal models on a computer, you can figure out how big the asteroids are."
  37. Mar 2016
    1. Since the mid 1960s and the explosion of electronics, telephony, and the computer chip, corporate profit over net worth has been declining. This doesn’t mean that corporations have stopped making money. Profits in many sectors are still going up. But the most apparently successful companies are also sitting on more cash — real and borrowed — than ever before. Corporations have been great at extracting money from all corners of the world, but they don’t really have great ways of spending or investing it. The cash does nothing but collect.
    1. I'm talking about optimizing the economy for the velocity of money rather than for the conversion of money into capital. It's going from a growth model to a flow model. Why are we, for instance, taxing capital gains at almost nothing but taxing dividends and earnings so high? That's a tax policy that is meant to favor the extraction of capital and to punish the exchange of things.
  38. Feb 2016
    1. He expects that the logging project near Quimby’s land will likely generate about $755,250 at the state’s average sale price, $50.35 per cord of wood. The land has about 1,500 harvestable acres that contain about 30 cords of wood per acre, or 45,000 cords, but only about a third of that will be cut because the land is environmentally sensitive, Denico said. The Bureau of Parks and Lands expects to generate about $6.6 million in revenue this year selling about 130,000 cords of wood from its lots, Denico said. Last year, the bureau generated about $7 million harvesting about 139,000 cords of wood. The Legislature allows the cutting of about 160,000 cords of wood on state land annually, although the LePage administration has sought to increase that amount.
    1. Inchoate

      Kind of a joke. I spent one year at Choate. There was good stuff: PSSC physics. There was bad stuff: bullying. Choate, unlike Bowdoin, does not consider me to be an alumnus.

    2. A few nodes on paper to contemplate

      I update this from time to time and keep a printed copy in my back pocket. It's a bunch of stuff I like to remind myself of.

    3. Generation Chronology

      I propose this as a thing. The idea is to create a list of people who have some significance to me whose lives overlap in time.

    4. Synod of Whitby

      Important to me because it looks like a significant date in the transition of Irish Christianity from the monasticism celebrated in Kenneth Clark's Civilization. For me, Irish monasticism focuses on Skellig Michael, most recently connected to the far, far away galaxy and Luke Skywalker.

      Synod of Whitby at Wikipedia

  39. Jan 2016
    1. confirmation bias
    2. P(B|E) = P(B) X P(E|B) / P(E), with P standing for probability, B for belief and E for evidence. P(B) is the probability that B is true, and P(E) is the probability that E is true. P(B|E) means the probability of B if E is true, and P(E|B) is the probability of E if B is true.
    3. The probability that a belief is true given new evidence equals the probability that the belief is true regardless of that evidence times the probability that the evidence is true given that the belief is true divided by the probability that the evidence is true regardless of whether the belief is true. Got that?
    4. Initial belief plus new evidence = new and improved belief.
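
      Plugging made-up numbers into the formula just to watch the update happen (say P(B) = 0.3, P(E|B) = 0.8, P(E) = 0.5):

      ```python
      p_b = 0.3          # prior: probability the belief is true
      p_e_given_b = 0.8  # probability of seeing the evidence if the belief is true
      p_e = 0.5          # probability of seeing the evidence at all

      p_b_given_e = p_b * p_e_given_b / p_e   # Bayes' rule: P(B|E) = P(B) x P(E|B) / P(E)
      print(p_b_given_e)                      # 0.48: the new and improved belief
      ```
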
    1. was a platinum fob chain simple and chaste in design, properly proclaiming its value by substance alone and not by meretricious ornamentation--as all good things should do.
    1. (the richer tourists at Disney World wear t-shirts printed with the names of famous designers, because designs themselves can be bootlegged easily and with impunity. The only way to make clothing that cannot be legally bootlegged is to print copyrighted and trademarked words on it; once you have taken that step, the clothing itself doesn't really matter, and so a t-shirt is as good as anything else. T-shirts with expensive words on them are now the insignia of the upper class. T-shirts with cheap words, or no words at all, are for the commoners).
    2. The word, in the end, is the only system of encoding

      Word not fungible

    3. Among Hollywood writers, Disney has the reputation

      Writers & Disney

    4. In the part of Disney World called the Animal Kingdom

      Maharajah Jungle Trek

    5. I was in Disney World recently,

      Disney mediated experience

    6. A few years ago I walked into a grocery store

      Dazzled by manufactured images

    7. But even from this remove it was possible to glean certain patterns, and one that recurred as regularly as an urban legend was the one about how someone would move into a commune populated by sandal-wearing, peace-sign flashing flower children, and eventually discover that, underneath this facade, the guys who ran it were actually control freaks; and that, as living in a commune, where much lip service was paid to ideals of peace, love and harmony, had deprived them of normal, socially approved outlets for their control-freakdom, it tended to come out in other, invariably more sinister, ways.
    1. Here’s what the Finns, who don’t begin formal reading instruction until around age 7, have to say about preparing preschoolers to read: “The basis for the beginnings of literacy is that children have heard and listened … They have spoken and been spoken to, people have discussed [things] with them … They have asked questions and received answers.”
  40. Dec 2015
    1. “Speakin’ o’ creeds,” and here old Mrs. Sargent paused in her work, “Elder Ransom from Acreville stopped with us last night, an’ he tells me they recite the Euthanasian Creed every few Sundays in the Episcopal Church.  I didn’t want him to know how ignorant I was, but I looked up the word in the dictionary.  It means easy death, and I can’t see any sense in that, though it’s a terrible long creed, the Elder says, an’ if it’s any longer ’n ourn, I should think anybody might easy die learnin’ it!” “I think the word is Athanasian,” ventured the minister’s wife.
    1. And instead of a nice dish of minnows—they had a roasted grasshopper with lady-bird sauce; which frogs consider a beautiful treat; but I think it must have been nasty!
    1. More venery. More love; more closeness; more sex and romance. Bring it back, no matter what, no matter how old we are. This fervent cry of ours has been certified by Simone de Beauvoir and Alice Munro and Laurence Olivier and any number of remarried or recoupled ancient classmates of ours. Laurence Olivier? I’m thinking of what he says somewhere in an interview: “Inside, we’re all seventeen, with red lips.”
    1. Part of Galileo’s genius was to transfer the spirit of the Italian Renaissance in the plastic arts to the mathematical and observational ones.
    1. from plantations. If that were to increase to 75 percent, the logged area of natural forests could drop in half.” Meanwhile the consumption of all wood has leveled off---for fuel, buildings, and, finally, paper. We are at peak timber.
  41. Sep 2015
    1. THE INTERFACE CULTURE

      "The Interface Culture" section of "In the begining was the command line" stands on it's own as an insightful essay on contemporary global culutre.

  42. Jul 2015
    1. For example, in a 2-MW wind turbine, the weight of the rotor and the tower is typically about 250 tons [10]. As reported below, a kite generator of the same rated power can be obtained using a 500-m² kite and cables 1,000 m long, with a total weight of about 2 tons only.

      Is this a reasonable claim?