 Jan 2023

ar5iv.labs.arxiv.org

This input embedding is the initial value of the residual stream, which all attention layers and MLPs read from and write to.
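A toy sketch of that read/write loop (my own illustration; `attn` and `mlp` here are stand-in functions, not real transformer sublayers):

```python
# Toy residual stream: each sublayer reads the stream and adds its output back.
# attn and mlp are placeholders, not actual attention/MLP computations.
def attn(x):
    return [0.1 * v for v in x]  # stand-in for an attention layer's output

def mlp(x):
    return [0.2 * v for v in x]  # stand-in for an MLP's output

def transformer_block(x):
    x = [a + b for a, b in zip(x, attn(x))]  # attention writes into the stream
    x = [a + b for a, b in zip(x, mlp(x))]   # MLP writes into the stream
    return x

stream = [1.0, 2.0]               # initial value: the input embedding
stream = transformer_block(stream)
```

Each sublayer only ever adds to the stream, which is what lets later layers read what earlier layers wrote.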


www.cs.toronto.edu

The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning (Jabri and Flower, 1992).

 Dec 2022

rewriting.csail.mit.edu

Our method is based on the hypothesis that the weights of a generator act as Optimal Linear Associative Memory (OLAM). OLAM is a classic single-layer neural data structure for memorizing associations that was described by Teuvo Kohonen and James A. Anderson (independently) in the 1970s. In our case, we hypothesize that within a large modern multilayer convolutional network, each individual layer plays the role of an OLAM that stores a set of rules that associates keys, which denote meaningful context, with values, which determine output.
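A minimal sketch of such a linear associative memory (my own toy code, assuming orthonormal keys so recall is exact; not the paper's implementation):

```python
# Store key->value associations in one weight matrix W = sum of outer(value, key).
# With orthonormal keys, multiplying W by a key recovers the stored value.
def outer(v, k):
    return [[vi * kj for kj in k] for vi in v]

def mat_vec(W, k):
    return [sum(wij * kj for wij, kj in zip(row, k)) for row in W]

keys   = [[1.0, 0.0], [0.0, 1.0]]   # orthonormal keys
values = [[3.0, 5.0], [7.0, 2.0]]   # arbitrary stored values

W = [[0.0, 0.0], [0.0, 0.0]]
for k, v in zip(keys, values):
    O = outer(v, k)
    W = [[a + b for a, b in zip(rw, ro)] for rw, ro in zip(W, O)]

recalled = mat_vec(W, keys[0])       # reproduces values[0]
```

With non-orthogonal keys the recall becomes approximate, which is part of why the "rules stored in weights" framing is a hypothesis rather than an exact description.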


www.zhihu.com

What can the OCaml language do?


www.technologyreview.com

AI training data is filled with racist stereotypes, pornography, and explicit images of rape, researchers Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe found after analyzing a data set similar to the one used to build Stable Diffusion.
That is horrifying. You'd think that the authors would attempt to remove or filter this kind of material. There are, after all, models out there that are trained to find it. It makes me wonder what awful stuff is in the GPT-3 dataset too.


arxiv.org

We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher
By using more data with a smaller language model, the authors were able to achieve better performance than with the larger models; this also reduces the cost of using the model for inference.

 Nov 2022

steadyhq.com

curation filters on the recipient side, but then email spam would also be a solved problem, and I don't see that happening just yet.
Are there projects that train models on collected spam emails?


community.interledger.org

🌟 Highlight words as they are spoken (karaoke anybody?). 🌟 Navigate video by clicking on words. 🌟 Share snippets of text (with video attached!). 🌟 Repurpose by remixing using the text as a base and reference.
If I understand it correctly, with Hyperaudio one can also create a transcription for somebody else's video or audio when it is embedded.
In that case, if you add to Hyperaudio the annotation capability of hypothes.is or docdrop, the vision outlined in the article on the Global Knowledge Graph is already a reality.
Tags
 language
 remixing
 ML
 transcript
 creative
 hyperaudio
 lite
 annotation
 timing
 plugin
 wordpress
 monetization
 translation
 sharing
 translate
 speech to text
 learning
 mobile
 docdrop
 video
 speech
 repurposing
 interactive
 speech2text
 commons
 open source
 web monetization
 conference
 roam
 navigation
 captions
 knowledge
 open
 simultaneous
 audio
 graph
 global


www.exponentialview.co

“The metaphor is that the machine understands what I’m saying and so I’m going to interpret the machine’s responses in that context.”
Interesting metaphor for why humans are happy to trust outputs from generative models


postgresml.org

Scaling PostgresML to 1 Million Requests per Second

 Sep 2022

transformercircuits.pub

Consider a toy model where we train an embedding of five features of varying importance (where "importance" is a scalar multiplier on the mean squared error loss) in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features.


moodle.lynchburg.edu

The present generation of Southerners are not responsible for the past
We can't judge or blame people based on their ancestors' actions. In high school, I always hated that everyone knew my older siblings, because it often felt like my future was already written for me even though I had not experienced it myself yet.

Haytian revolt
We briefly touched on this in Traditions/Revolutions, and I know we will learn more about it later on in the course.

his educational programme was unnecessarily narrow.
When I was first annotating "The Education of the Negro," I also found Washington's idea of teaching industrial education singularly focused. However, towards the end of his article he made me come around to the idea because it seemed like a good way to instill a desire in students to work for themselves instead of someone else.

the Free Negroes from 1830 up to wartime had striven to build industrial schools, and the American Missionary Association had from the first taught various trades; and Price and others had sought a way of honorable alliance with the best of the Southerners. But Mr. Washington first indissolubly linked these things; he put enthusiasm, unlimited energy, and perfect faith into his programme, and changed it from a by-path into a veritable Way of Life
ML: He was not the first to come up with the idea, obviously, but he put a face on it. It seems like people, myself included, have a much easier time following something if there is a person in charge of it for them to follow.


moodle.lynchburg.edu

Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life. They do not put into their hands the tools they are best fitted to use, and hence so many failures. Many a mother and sister have worked and slaved, living upon scanty food, in order to give a son and brother a 'liberal education,' and in doing this have built up a barrier between the boy and the work he was fitted to do. Let me say to you that all honest work is honorable work. If the labor is manual, and seems common, you will have all the more chance to be thinking of other things, or of work that is higher and brings better pay, and to work out in your minds better and higher duties and responsibilities for yourselves, and for thinking of ways by which you can help others as well as yourselves, and bring them up to your own higher level.
I still see this in our school systems today, especially in certain classes where you feel like you are never going to use anything that you have learned in the real world.


Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life.
When I was in high school, my mom would always say that they don't teach us some of the most important life skills in class. She was always ranting about how we should have to take a finance class to prepare for adulthood.

Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life.
This is still accurate for schools today. For example, in middle school we had 8 classes a day for 45 minutes each for one semester. Even though we had class every day, it was far too little time to actually learn a full subject. The teacher had to give us just a little bit of information on each topic we were supposed to cover.


moodle.lynchburg.edu

Uncle Bird had a small, rough farm, all woods and hills, miles from the big road; but he was full of tales
My uncles are also full of tales that they like to share with everyone they have the chance to.

willow
I named my Jeep Willow.


pyimagesearch.com

Now, the progression of NLP, as discussed, tells a story. We begin with tokens and then build representations of these tokens. We use these representations to find similarities between tokens and embed them in a high-dimensional space. The same embeddings are also passed into sequential models that can process sequential data. Those models are used to build context and, through an ingenious way, attend to parts of the input sentence that are useful to the output sentence in translation.

Data, matrix multiplications, repeated and scaled with non-linear switches. Maybe that simplifies things a lot, but even today, most architectures boil down to these principles. Even the most complex systems, ideas, and papers can be boiled down to just that.

 Aug 2022


Summarization of Methods for Smart Contract Vulnerabilities Detection
Great reference table for smart contract vulnerability detection methods.


towardsdatascience.com

graphs
Developments in deep learning on graphs.

 Jul 2022

blogs.microsoft.com

Z-code models to improve common language understanding tasks such as named entity recognition, text summarization, custom text classification and key phrase extraction across its Azure AI services. But this is the first time a company has publicly demonstrated that it can use this new class of Mixture of Experts models to power machine translation products.
This is what Z-code actually is and what makes it special.

have developed, called Z-code, which offer the kind of performance and quality benefits that other large-scale language models have but can be run much more efficiently.
can do the same but much faster

 Jun 2022

direct.mit.edu

The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the key at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.


e2eml.school

This trick of using a one-hot vector to pull out a particular row of a matrix is at the core of how transformers work.
Matrix multiplication as table lookup
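The trick can be seen in a few lines (my own toy example):

```python
# Multiplying a one-hot row vector by a matrix selects a single row of it.
M = [[10, 11],
     [20, 21],
     [30, 31]]

one_hot = [0, 1, 0]  # a 1 in position 1 selects row 1

# Row-vector-times-matrix: each output entry is a weighted sum of a column,
# and the one-hot weights zero out everything except row 1.
row = [sum(o * M[i][j] for i, o in enumerate(one_hot))
       for j in range(len(M[0]))]
```

The result equals `M[1]` exactly, which is why "lookup" and "matrix multiply" are interchangeable here.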

 May 2022

www.pnas.org

Given the complexities of the brain’s structure and the functions it performs, any one of these models is surely oversimplified and ultimately wrong—at best, an approximation of some aspects of what the brain does. However, some models are less wrong than others, and consistent trends in performance across models can reveal not just which model best fits the brain but also which properties of a model underlie its fit to the brain, thus yielding critical insights that transcend what any single model can tell us.


www.gwern.net

Such a highly nonlinear problem would clearly benefit from the computational power of many layers. Unfortunately, backpropagation learning generally slows down by an order of magnitude every time a layer is added to a network.
The problem in 1988


colab.research.google.com

The source sequence will be passed to the TransformerEncoder, which will produce a new representation of it. This new representation will then be passed to the TransformerDecoder, together with the target sequence so far (target words 0 to N). The TransformerDecoder will then seek to predict the next words in the target sequence (N+1 and beyond).

 Apr 2022


Our pretrained network is nearly identical to the "AlexNet" architecture (Krizhevsky et al., 2012), but with local response normalization layers after pooling layers following (Jia et al., 2014). It was trained with the Caffe framework on the ImageNet 2012 dataset (Deng et al., 2009)


cs231n.github.io

Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume. Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in 2D space (e.g. 3x3), but full along the input depth (20).
These two examples are the first two layers of Andrej Karpathy's wonderful working ConvNetJS CIFAR10 demo here
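The arithmetic in those two examples can be checked directly (my own toy code, not from the course notes):

```python
# Weight counts for a single conv-layer neuron: filter height x filter width
# x input depth, plus one bias. Connectivity is local in 2D but full in depth.
def conv_neuron_weights(fh, fw, depth):
    return fh * fw * depth

# Example 1: 5x5 filter over a [32x32x3] input volume
w1 = conv_neuron_weights(5, 5, 3)   # 75 weights
params1 = w1 + 1                    # +1 bias parameter

# Example 2: 3x3 filter over a [16x16x20] input volume
w2 = conv_neuron_weights(3, 3, 20)  # 180 connections
```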


cs.stanford.edu

input (32x32x3): max activation 0.5, min 0.5; max gradient 1.08696, min 1.53051. conv (32x32x16): filter size 5x5x3, stride 1; max activation 3.75919, min 4.48241; max gradient 0.36571, min 0.33032; parameters: 16x5x5x3+16 = 1216
The dimensions of these first two layers are explained here


codelabs.developers.google.com

Here the lower-level layers are frozen and are not trained; only the new classification head will update itself to learn from the features provided by the pretrained, chopped-up model on the left.


distill.pub

Starting from random noise, we optimize an image to activate a particular neuron (layer mixed4a, unit 11).
And then we use that image as a kind of variable name, to refer to the neuron in a way that is more helpful than the layer number and neuron index within the layer. This explanation is via one of Chris Olah's YouTube videos (https://www.youtube.com/watch?v=gXsKyZ_Y_i8)

 Mar 2022

quillette.com

A special quality of humans, not shared by evolution or, as yet, by machines, is our ability to recognize gaps in our understanding and to take joy in the process of filling them in. It is a beautiful thing to experience the mysterious, and powerful, too.

 Feb 2022

www.sigsdatacom.de

Relational machine learning methods, which, by exploiting the graph structure, in many cases deliver models of better quality.
The relational machine learning approach

In many applications, however, it is necessary not only to provide high-quality, semantically enriched data, but also to generate new knowledge from existing information. For this we use machine learning.
Combination with ML approaches to generate new knowledge


neuralnetworksanddeeplearning.com

Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.


docs.microsoft.com

Model deployment in Azure ML

 Dec 2021

cloud.google.com

the only thing an artificial neuron can do: classify a data point into one of two kinds by examining input values with weights and bias.
How does this relate to "weighted sum shows similarity between the weights and the inputs"?
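A toy illustration of that one capability (my own example, with hand-picked rather than learned weights): a single neuron that classifies points into two kinds, here implementing logical AND.

```python
# A single artificial neuron: weighted sum of inputs plus bias, thresholded
# into one of two classes. The weighted sum is largest when the input pattern
# lines up with the weights, which is the "similarity" reading of the same op.
def neuron(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

w, b = [1.0, 1.0], -1.5   # hand-picked: fires only when both inputs are 1
out = [neuron([a, c], w, b) for a in (0, 1) for c in (0, 1)]
```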


towardsdatascience.com

The transformer model introduces the idea that, instead of adding another complex mechanism (attention) to an already complex Seq2Seq model, we can simplify the solution by forgetting about everything else and just focusing on attention.



I’m particularly interested in two questions: First, just how weird is machine learning? Second, what sorts of choices do developers make as they shape a project?

 Nov 2021

www.cell.com

They use local computations to interpolate over task-relevant manifolds in a high-dimensional parameter space.


e2eml.school

Now that we've made peace with the concepts of projections (matrix multiplications)
Projections are matrix multiplications. Why didn't you say so? Spatial and channel projections in the gated gMLP.

Computers are especially good at matrix multiplications. There is an entire industry around building computer hardware specifically for fast matrix multiplications. Any computation that can be expressed as a matrix multiplication can be made shockingly efficient.

The selective-second-order-with-skips model is a useful way to think about what transformers do, at least on the decoder side. It captures, to a first approximation, what generative language models like OpenAI's GPT-3 are doing.


www.tensorflow.org

You'll use a (70%, 20%, 10%) split for the training, validation, and test sets. Note the data is not being randomly shuffled before splitting. This is for two reasons: It ensures that chopping the data into windows of consecutive samples is still possible. It ensures that the validation/test results are more realistic, being evaluated on the data collected after the model was trained.
Train, Validation, Test: 0.7, 0.2, 0.1
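The split described above can be sketched as (my own toy code, not the tutorial's exact snippet):

```python
# 70/20/10 split in time order, without shuffling, so windows of consecutive
# samples stay intact and evaluation happens on data from after training.
n = 100
data = list(range(n))            # stand-in for time-ordered samples

train = data[: int(n * 0.7)]
val   = data[int(n * 0.7): int(n * 0.9)]
test  = data[int(n * 0.9):]
```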


distill.pub

The following figure presents a simple functional diagram of the neural network we will use throughout the article. The neural network is a sequence of linear (both convolutional and fully-connected), max-pooling, and ReLU layers, culminating in a softmax layer. A convolution calculates weighted sums of regions in the input; in neural networks, the learnable weights in convolutional layers are referred to as the kernel. A fully-connected layer computes output neurons as weighted sums of input neurons; in matrix form, it is a matrix that linearly transforms the input vector into the output vector. ReLU, first introduced by Nair and Hinton, calculates f(x) = max(0, x) for each entry in a vector input. The softmax function calculates S(y_i) = e^{y_i} / Σ_{j=1}^{N} e^{y_j} for each entry y_i in a vector input y.
This is a great visualization of MNIST hidden layers.
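For reference, the ReLU and softmax definitions from the article's margin notes, as plain Python (my own sketch):

```python
import math

# ReLU: f(x) = max(0, x) applied entrywise.
def relu(v):
    return [max(0.0, x) for x in v]

# Softmax: S(y_i) = e^{y_i} / sum_j e^{y_j}, producing a probability vector.
def softmax(v):
    exps = [math.exp(x) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([1.0, 1.0])   # equal logits -> equal probabilities
```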


towardsdatascience.com

The Query word can be interpreted as the word for which we are calculating attention. The Key and Value word is the word to which we are paying attention, i.e., how relevant that word is to the Query word.
Finally


www.lesswrong.com

Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.


distill.pub

The cube of activations that a neural network for computer vision develops at each hidden layer. Different slices of the cube allow us to target the activations of individual neurons, spatial positions, or channels.
This is first explanation of


towardsdatascience.com

The attention layer (W in the diagram) computes three vectors based on the input, termed key, query, and value.
Could you be more specific?

Attention is a means of selectively weighting different elements in input data, so that they will have an adjusted impact on the hidden states of downstream layers.


www.pnas.org

These findings provide strong evidence for a classic hypothesis about the computations underlying human language understanding, that the brain’s language system is optimized for predictive processing in the service of meaning extraction


towardsdatascience.com

To review: the Forget gate decides what is relevant to keep from prior steps. The input gate decides what information is relevant to add from the current step. The output gate determines what the next hidden state should be. Code demo: for those of you who understand better through seeing the code, here is an example using Python pseudo-code.
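A hedged sketch of those three gates with scalar values and hand-picked weights (my own pseudocode-style Python, not a trained model):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One LSTM step on scalars. Real LSTMs use vectors with separate learned
# weight matrices per gate; the weights here are illustrative stand-ins.
def lstm_step(prev_cell, prev_hidden, x,
              wf=1.0, wi=1.0, wo=1.0, wc=1.0):
    f = sigmoid(wf * (prev_hidden + x))   # forget gate: what to keep
    i = sigmoid(wi * (prev_hidden + x))   # input gate: what to add
    o = sigmoid(wo * (prev_hidden + x))   # output gate: shapes next hidden
    candidate = math.tanh(wc * (prev_hidden + x))
    cell = f * prev_cell + i * candidate  # keep old info, add new info
    hidden = o * math.tanh(cell)          # expose a gated view of the cell
    return cell, hidden

cell, hidden = lstm_step(0.0, 0.0, 1.0)
```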

 Oct 2021

colah.github.io

This approach, visualizing highdimensional representations using dimensionality reduction, is an extremely broadly applicable technique for inspecting models in deep learning.

These layers warp and reshape the data to make it easier to classify.


selectedfirms.co

machine learning libraries in Python
Find out the best python libraries for ML in 2021.


cloud.google.com

Even with this very primitive single neuron, you can achieve 90% accuracy when recognizing a handwritten digit image. To recognize all the digits from 0 to 9, you would need just ten neurons to recognize them with 92% accuracy.
And here is a Google Colab notebook that demonstrates that

 Sep 2021


Humans perform a version of this task when interpreting hard-to-understand speech, such as an accent which is particularly fast or slurred, or a sentence in a language we do not know very well—we do not necessarily hear every single word that is said, but we pick up on salient key words and contextualize the rest to understand the sentence.
Boy, don't they


www.ccom.ucsd.edu

A neural network will predict your digit in the blue square above. Your image is 784 pixels (= 28 rows by 28 columns with black=1 and white=0). Those 784 features get fed into a 3-layer neural network: Input:784 → AvgPool:196 → Dense:100 → Softmax:10.


www.iscaspeech.org

Personalized ASR models. For each of the 432 participants with disordered speech, we create a personalized ASR model (SI-2) from their own recordings. Our fine-tuning procedure was optimized for our adaptation process, where we only have between ¼ and 2 h of data per speaker. We found that updating only the first five encoder layers (versus the complete model) worked best and successfully prevented overfitting [10]


jalammar.github.io

So whenever you hear of someone “training” a neural network, it just means finding the weights we use to calculate the prediction.

 Aug 2021

stats.stackexchange.com

I'm going to try to provide an English text example. The following is based solely on my intuitive understanding of the paper 'Attention Is All You Need'.
This is also good

For the word q that your eyes see in the given sentence, what is the most related word k in the sentence to understand what q is about?

So basically: q = the vector representing a word; K and V = your memory, i.e. all the words that have been generated before. Note that K and V can be the same (but don't have to be). So what you do with attention is take your current query (a word, in most cases) and look in your memory for similar keys. To come up with a distribution of relevant words, the softmax function is then used.
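That lookup-then-average story in a few lines (my own toy dot-product attention over 2-d vectors, without the usual scaling factor):

```python
import math

# Compare the query against each key, softmax the scores into weights,
# then return the weighted average of the values.
def attention(q, K, V):
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))]

q = [1.0, 0.0]                      # query matches the first key best
K = [[1.0, 0.0], [0.0, 1.0]]        # the "memory" keys
V = [[1.0, 2.0], [3.0, 4.0]]        # values associated with each key
out = attention(q, K, V)            # pulled mostly toward V[0]
```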


ericssonlearning.percipio.com

Here is a list of some open data available online. You can find a more complete list and details of the open data available online in Appendix B.
DataHub (http://datahub.io/dataset)
World Health Organization (http://www.who.int/research/en/)
European Union Open Data Portal (http://opendata.europa.eu/en/data/)
Amazon Web Service public datasets (http://aws.amazon.com/datasets)
Facebook Graph (http://developers.facebook.com/docs/graphapi)
Healthdata.gov (http://www.healthdata.gov)
Google Trends (http://www.google.com/trends/explore)
Google Finance (https://www.google.com/finance)
Google Books Ngrams (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
Machine Learning Repository (http://archive.ics.uci.edu/ml/)
As an idea of the open data sources available online, you can look at the LOD cloud diagram (http://lod-cloud.net), which displays the connections of the data links among several open data sources currently available on the network (see Figure 13).


colah.github.io

A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem. It’s true, essentially, because the hidden layer can be used as a lookup table.

Recursive Neural Networks




mccormickml.com

The second-to-last layer is what Han settled on as a reasonable sweet spot.
Pretty arbitrary choice


arxiv.org

We show that BigBird is a universal approximator of sequence functions and is Turing complete,

 Jul 2021

www.codemotion.com

hyperparameters, i.e., parameters external to the model, such as the learning rate, the batch size, and the number of epochs.


jalammar.github.io

In the language of Interpretable Machine Learning (IML) literature like Molnar et al.[20], input saliency is a method that explains individual predictions.


colah.github.io

Using multiple copies of a neuron in different places is the neural network equivalent of using functions. Because there is less to learn, the model learns more quickly and learns a better model. This technique – the technical name for it is ‘weight tying’ – is essential to the phenomenal results we’ve recently seen from deep learning.


www.baeldung.com

Vectors with a small Euclidean distance from one another are located in the same region of a vector space. Vectors with a high cosine similarity are located in the same general direction from the origin.
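A quick check of the contrast (my own example): two vectors pointing in the same direction but far apart have maximal cosine similarity yet a large Euclidean distance.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Same direction from the origin, very different region of the space.
a, b = [1.0, 1.0], [10.0, 10.0]
```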


iamtrask.github.io

If you're serious about neural networks, I have one recommendation. Try to rebuild this network from memory.



mlech26l.github.io

In our research, i.e., the wormnet project, we try to build machine learning models motivated by the C. elegans nervous system. By doing so, we have to pay a cost, as we constrain ourselves to such models in contrast to standard artificial neural networks, whose modeling space is purely constrained by memory and compute limitations. However, there are potentially some advantages and benefits we gain. Our objective is to better understand what's necessary for effective neural information processing to emerge.


aylien.com

Recommendations:
DON'T use shifted PPMI with SVD.
DON'T use SVD "correctly", i.e. without eigenvector weighting (performance drops 15 points compared to with eigenvalue weighting with p = 0.5).
DO use PPMI and SVD with short contexts (window size of 2).
DO use many negative samples with SGNS.
DO always use context distribution smoothing (raise the unigram distribution to the power of α = 0.75) for all methods.
DO use SGNS as a baseline (robust, fast and cheap to train).
DO try adding context vectors in SGNS and GloVe.

 Jun 2021

towardsdatascience.com

2D Vectors in space. Image by Author
A good image for cosine similarity.


www.incompleteideas.net

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning
This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, humancentric approach.


cloud.google.com

"dividing ndimensional space with a hyperplane."

This dataset can not be classified by a single neuron, as the two groups of data points can't be divided by a single line.
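A brute-force illustration of that claim (my own toy code): scan a grid of weights and biases for a single line w1·x + w2·y + b and verify that none of the sampled lines classifies all four XOR-style points correctly. (A grid scan only checks the sampled candidates, but XOR is provably not linearly separable, so every candidate fails.)

```python
# XOR labeling: opposite corners share a class, so no single line divides them.
xor_points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def classify(w1, w2, b, p):
    return 1 if w1 * p[0] + w2 * p[1] + b > 0 else 0

grid = [i / 2.0 for i in range(-8, 9)]   # -4.0 .. 4.0 in steps of 0.5
separable = any(
    all(classify(w1, w2, b, p) == label for p, label in xor_points)
    for w1 in grid for w2 in grid for b in grid
)
```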

 Apr 2021


Machine learning app development has been gaining traction among companies from all over the world. When dealing with this part of machine learning application development, you need to remember that machine learning can recognize only the patterns it has seen before. Therefore, the data is crucial for your objectives. If you’ve ever wondered how to build a machine learning app, this article will answer your question.


towardsdatascience.com

Machine learning is an extension of linear regression in a few ways. Firstly is that modern ML
Machine learning is an extension of the linear model that deals with much more complicated situations, where we take several different inputs and produce outputs.


www.infoq.com

survival prediction of colorectal cancer is formulated as a multiclass classification problem

 Nov 2020

blog.csdn.net blog.csdn.net

One can think of \(\pi_k\) as the weight of each mixture component \(\mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\).
Some books call the resulting posterior the "responsibility."
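A minimal numpy sketch (all mixture parameters made up, 1-D for simplicity) of how the weights \(\pi_k\) enter the responsibility computation:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of N(x | mu, var) for a 1-D Gaussian."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical 1-D mixture: component weights pi_k, means, variances.
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
var = np.array([1.0, 0.5])

x = 1.5
# Weighted component densities pi_k * N(x | mu_k, var_k)
weighted = pi * gaussian_pdf(x, mu, var)
# "Responsibility" of each component: its share of the total density.
gamma = weighted / weighted.sum()

print(gamma.sum())  # 1.0: responsibilities form a distribution over components
```

Here the second component (mean 2.0) takes almost all the responsibility for x = 1.5.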
Tags
Annotators
URL

 Oct 2020


LEGO
作者将深度学习比作乐高



Data Augmentation
Commonly used to increase the amount of training data.
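A minimal sketch of the idea (a made-up 4x4 "image"; real pipelines would use the actual training images and richer transforms such as crops and color jitter):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return simple variants of one image: flips and a 90-degree rotation."""
    return [
        image,
        np.fliplr(image),  # horizontal flip
        np.flipud(image),  # vertical flip
        np.rot90(image),   # 90-degree rotation
    ]

# A tiny fake 4x4 grayscale "image" standing in for a real training sample.
img = rng.random((4, 4))
augmented = augment(img)
print(len(augmented))  # 4 training samples derived from 1 original image
```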

 May 2020

www.javatpoint.com www.javatpoint.com

Machine learning has a limited scope

AI is a bigger concept to create intelligent machines that can simulate human thinking capability and behavior, whereas, machine learning is an application or subset of AI that allows machines to learn from data without being programmed explicitly


expertsystem.com expertsystem.com

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed
Tags
Annotators
URL

 Apr 2020

keras.io keras.io

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. Use Keras if you need a deep learning library that: Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility). Supports both convolutional networks and recurrent networks, as well as combinations of the two. Runs seamlessly on CPU and GPU. Read the documentation at Keras.io. Keras is compatible with: Python 2.7–3.6.

 Jan 2020

pubs.aeaweb.org pubs.aeaweb.org

Suppose the algorithm chooses a tree that splits on education but not on age. Conditional on this tree, the estimated coefficients are consistent. But that does not imply that treatment effects do not also vary by age, as education may well covary with age; on other draws of the data, in fact, the same procedure could have chosen a tree that split on age instead
a caveat

These heterogeneous treatment effects can be used to assign treatments; Misra and Dubé (2016) illustrate this on the problem of price targeting, applying Bayesian regularized methods to a large-scale experiment where prices were randomly assigned
todo: look into the implications for treatment assignment with heterogeneity

Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey (2016) take care of high-dimensional controls in treatment effect estimation by solving two simultaneous prediction problems, one in the outcome and one in the treatment equation.
this seems similar to my idea of regularizing on only a subset of the variables

These same techniques applied here result in splitsample instrumental variables (Angrist and Krueger 1995) and “jackknife” instrumental variables
some classical solutions to IV bias are akin to ML solutions

Understood this way, the finite-sample biases in instrumental variables are a consequence of overfitting.
traditional 'finite sample bias of IV' is really overfitting

Even when we are interested in a parameter β̂, the tool we use to recover that parameter may contain (often implicitly) a prediction component. Take the case of linear instrumental variables understood as a two-stage procedure: first regress x = γ′z + δ on the instrument z, then regress y = β′x + ε on the fitted values x̂. The first stage is typically handled as an estimation step. But this is effectively a prediction task: only the predictions x̂ enter the second stage; the coefficients in the first stage are merely a means to these fitted values.
The first stage of IV is handled as an estimation problem, but really it's a prediction problem!
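A simulated numpy sketch of this two-stage view (the data-generating process and all coefficients are made up): only the fitted values x̂ from the first stage are carried into the second stage, and the second-stage slope recovers the causal effect while naive OLS does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical setup: instrument z, confounder u, endogenous x, outcome y.
z = rng.normal(size=n)
u = rng.normal(size=n)                  # affects both x and y
x = 0.8 * z + u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # true causal effect of x is 2.0

def ols(X, y):
    """Least-squares coefficients for y on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive OLS of y on x is biased upward by the confounder u.
beta_naive = ols(np.column_stack([np.ones(n), x]), y)

# First stage: a pure prediction task -- only x_hat is used downstream.
X1 = np.column_stack([np.ones(n), z])
x_hat = X1 @ ols(X1, x)

# Second stage: regress y on the fitted values x_hat.
beta_iv = ols(np.column_stack([np.ones(n), x_hat]), y)
print(beta_naive[1] > beta_iv[1])  # True: naive slope exceeds the IV slope
```

The IV slope lands near the true effect of 2.0, while the naive slope is pulled well above it by the confounder.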

Prediction in the Service of Estimation
This is especially relevant to economists across the board, even the ML skeptics

New Data
The first application: constructing variables and meaning from high-dimensional data, especially outcome variables.
- satellite images (of energy use, lights, etc.) → economic activity
- cell phone data, Google Street View to measure wealth
- extracting the similarity of firms from 10-K reports
- even traditional data: matching individuals in historical censuses

Zhao and Yu (2006) who establish asymptotic model-selection consistency for the LASSO. Besides assuming that the true model is "sparse"—only a few variables are relevant—they also require the "irrepresentable condition" between observables: loosely put, none of the irrelevant covariates can be even moderately related to the set of relevant ones.
Basically unrealistic for microeconomic applications imho

First, it encourages the choice of less complex, but wrong models. Even if the best model uses interactions of number of bathrooms with number of rooms, regularization may lead to a choice of a simpler (but worse) model that uses only number of fireplaces. Second, it can bring with it a cousin of omitted variable bias, where we are typically concerned with correlations between observed variables and unobserved ones. Here, when regularization excludes some variables, even a correlation between observed variables and other observed (but excluded) ones can create bias in the estimated coefficients.
Is this equally a problem for procedures that do not assume sparsity, such as the Ridge model?

the variables are correlated with each other (say the number of rooms of a house and its square footage), then such variables are substitutes in predicting house prices. Similar predictions can be produced using very different variables. Which variables are actually chosen depends on the specific finite sample.
Lasso-chosen variables are unstable because of what we usually call 'multicollinearity.' This presents a problem for making inferences from estimated coefficients.

Through its regularizer, LASSO produces a sparse prediction function, so that many coefficients are zero and are “not used”—in this example, we find that more than half the variables are unused in each run
This is true, but they fail to mention that LASSO also shrinks the coefficients on the variables it keeps toward zero (relative to OLS). I think this is commonly misunderstood (judging from people I've spoken with).
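A small sklearn sketch (toy data, arbitrary penalty) showing both effects at once: LASSO zeroes out most coefficients, and the ones it keeps are shrunk relative to OLS:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))          # roughly uncorrelated features
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]          # only three variables truly matter
y = X @ beta + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)   # alpha chosen arbitrarily for illustration

kept = np.flatnonzero(lasso.coef_)
print(len(kept))                     # sparse: most coefficients are exactly zero
# The surviving coefficients are also shrunk toward zero relative to OLS:
print(np.all(np.abs(lasso.coef_[kept]) < np.abs(ols.coef_[kept])))
```

With roughly orthogonal features, LASSO acts like soft-thresholding of the OLS coefficients, which makes the shrinkage of the kept coefficients easy to see.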

One obvious problem that arises in making such inferences is the lack of standard errors on the coefficients. Even when machine-learning predictors produce familiar output like linear functions, forming these standard errors can be more complicated than seems at first glance as they would have to account for the model selection itself. In fact, Leeb and Pötscher (2006, 2008) develop conditions under which it is impossible to obtain (uniformly) consistent estimates of the distribution of model parameters after data-driven selection.
This is a very serious limitation for Economics academic work.

First, econometrics can guide design choices, such as the number of folds or the function class.
How would Econometrics guide us in this?

These choices about how to represent the features will interact with the regularizer and function class: A linear model can reproduce the log base area per room from log base area and log room number easily, while a regression tree would require many splits to do so.
The choice of 'how to represent the features' is consequential ... it's not just 'throw it all in' (kitchen sink approach)

Table 2: Some Machine Learning Algorithms
This is a very helpful table!

Picking the prediction function then involves two steps: The first step is, conditional on a level of complexity, to pick the best in-sample loss-minimizing function. The second step is to estimate the optimal level of complexity using empirical tuning (as we saw in cross-validating the depth of the tree).
ML explained while standing on one leg.
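The two steps can be sketched with a regression tree, tuning the depth by cross-validation (toy 1-D data; step 1, the best in-sample fit at a given depth, happens inside each fold's `.fit`):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

# Step 2: estimate the optimal complexity (tree depth) by cross-validation.
scores = {}
for depth in range(1, 11):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores[depth] = cross_val_score(tree, X, y, cv=5).mean()

best_depth = max(scores, key=scores.get)
print(best_depth)  # an intermediate depth: deeper trees start to fit noise
```

The held-out score rises with depth while the tree is underfitting, then falls once extra splits chase the noise; the tuned depth sits at the turning point.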

Regularization combines with the observability of prediction quality to allow us to fit flexible functional forms and still find generalizable structure.
But we can't really make statistical inferences about the structure, can we?

This procedure works because prediction quality is observable: both predictions ŷ and outcomes y are observed. Contrast this with parameter estimation, where typically we must rely on assumptions about the data-generating process to ensure consistency.
I'm not clear what the implication they are making here is. Does it in some sense 'not work' with respect to parameter estimation?

In empirical tuning, we create an out-of-sample experiment inside the original sample.
remember that tuning is done within the training sample

Performance of Different Algorithms in Predicting House Values
Any reason they didn't try a Ridge or an Elastic net model here? My instinct is that these will beat LASSO for most Economic applications.

We consider 10,000 randomly selected owner-occupied units from the 2011 metropolitan sample of the American Housing Survey. In addition to the values of each unit, we also include 150 variables that contain information about the unit and its location, such as the number of rooms, the base area, and the census region within the United States. To compare different prediction techniques, we evaluate how well each approach predicts (log) unit value on a separate holdout set of 41,808 units from the same sample. All details on the sample and our empirical exercise can be found in an online appendix available with this paper at http://ejep.org
Seems a useful example for trying/testing/benchmarking. But the link didn't work for me. Can anyone find it? Is it interactive? (This is why I think papers should be html and not pdfs...)

Making sense of complex data such as images and text often involves a prediction preprocessing step.
In using 'new kinds of data' in Economics we often need to do a 'classification step' first

The fundamental insight behind these breakthroughs is as much statistical as computational. Machine intelligence became possible once researchers stopped approaching intelligence tasks procedurally and began tackling them empirically.
I hadn't thought about how this unites the 'statistics to learn stuff' part of ML and the 'build a tool to do a task' part. Wellphrased.

In another category of applications, the key object of interest is actually a parameter β, but the inference procedures (often implicitly) contain a prediction task. For example, the first stage of a linear instrumental variables regression is effectively prediction. The same is true when estimating heterogeneous treatment effects, testing for effects on multiple outcomes in experiments, and flexibly controlling for observed confounders.
This is the most relevant tool for me. Before I learned about ML I often thought about using 'stepwise selection' for such tasks... to find the best set of 'control variables' etc. But without regularisation this seemed problematic.

Machine Learning: An Applied Econometric Approach
Shall we use Hypothesis to have a discussion?

 Dec 2019

www.ourcommunity.com.au www.ourcommunity.com.au
 Aug 2019

towardsdatascience.com towardsdatascience.com

Machine learning is an approach to making many similar decisions that involves algorithmically finding patterns in your data and using these to react correctly to brand new data

 Jul 2019

www.ohdsieurope.org www.ohdsieurope.org

We translate all patient measurements into statistics that are predictive of unsuccessful discharge
An analytics pipeline, roughly what we would also need to put together by the end.

 Feb 2019

stats.stackexchange.com stats.stackexchange.com

One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable. Usually, this computational advantage is leveraged by performing many more iterations of SGD, making many more steps than conventional batch gradient descent. This usually results in a model that is very close to that which would be found via batch gradient descent, or better.
Good explanation of why SGD is computationally better. I was confused about the benefit of repeatedly performing minibatch GD and why it might beat batch GD. I guess the advantage comes from taking many more (cheap) steps, since vectorizing over the full dataset is less efficient when it doesn't fit in RAM.
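A minimal numpy sketch of mini-batch SGD on linear regression (made-up data): each step touches only 64 rows, yet many cheap steps land essentially on the full-batch least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 3.0, 0.5, -1.5])
y = X @ true_w + rng.normal(scale=0.1, size=n)

# Mini-batch SGD: each update uses only a small slice of the data,
# so the full dataset never has to be processed at once.
w = np.zeros(d)
lr, batch = 0.05, 64
for epoch in range(20):
    idx = rng.permutation(n)
    for start in range(0, n, batch):
        b = idx[start:start + batch]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad

w_exact = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w, w_exact, atol=1e-2))  # True: SGD lands near the batch solution
```

Each epoch makes ~157 updates instead of one, which is where the "many more iterations" advantage in the quote comes from.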


neuralnetworksanddeeplearning.com neuralnetworksanddeeplearning.com

And so it makes most sense to regard epoch 280 as the point beyond which overfitting is dominating learning in our neural network.
I do not get this. Doesn't epoch 15 already indicate that we are overfitting to the training data set? Assuming both training and test set come from the same population that we are trying to learn from.

If we see that the accuracy on the test data is no longer improving, then we should stop training
This contradicts the earlier statement about epoch 280 being the point where there is overtraining.

It might be that accuracy on the test data and the training data both stop improving at the same time
Can this happen? Can the accuracy on the training data set ever increase with the training epoch?

What is the limiting value for the output activations a_j^L
When c is large, small differences in z_j^L are magnified and the function jumps between 0 and 1, depending on the sign of the differences. On the other hand, when c is very small, all activation values will be close to 1/N, where N is the number of neurons in layer L.
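Both limits are easy to check numerically; a sketch with a made-up vector z, where the scaling parameter c plays the role of an inverse temperature:

```python
import numpy as np

def softmax(z, c=1.0):
    """Output activations a_j = exp(c*z_j) / sum_k exp(c*z_k)."""
    e = np.exp(c * z - np.max(c * z))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0, 2.5])

print(np.round(softmax(z, c=100.0), 3))  # ~[0, 0, 1, 0]: winner-take-all
print(np.round(softmax(z, c=0.001), 3))  # ~[0.25, 0.25, 0.25, 0.25]: uniform 1/N
```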


towardsdatascience.com towardsdatascience.com

Top Sources For Machine Learning Datasets

 Jan 2019

www.sciencedirect.com www.sciencedirect.com

By utilizing the Deeplearning4j library for model representation, learning and prediction, KNIME builds upon a well performing open source solution with a thriving community.

It is especially thanks to the work of Yann LeCun and Yoshua Bengio (LeCun et al., 2015) that the application of deep neural networks has boomed in recent years. The technique, which utilizes neural networks with many layers and enhanced backpropagation algorithms for learning, was made possible through both new research and the ever increasing performance of computer chips.

One of KNIME's strengths is its multitude of nodes for data analysis and machine learning. While its base configuration already offers a variety of algorithms for this task, the plugin system is the factor that enables thirdparty developers to easily integrate their tools and make them compatible with the output of each other.

 Dec 2018

jalammar.github.io jalammar.github.io


users.umiacs.umd.edu users.umiacs.umd.edu CMSC 726

 Nov 2018

artemisml.readthedocs.io artemisml.readthedocs.io

github.com github.com


192.168.199.102:5000 192.168.199.102:5000

Method 3: MDS
Multidimensional scaling (MDS) and Principal Coordinate Analysis (PCoA) are very similar to PCA, except that instead of converting correlations into a 2D graph, they convert distances among the samples into a 2D graph.
So, in order to do MDS or PCoA, we have to calculate the distance between Cell 1 and Cell 2, the distance between Cell 1 and Cell 3, and so on:
- (1, 2)
- (1, 3)
- (1, 4)
- (2, 3)
- (2, 4)
- (3, 4)
One very common way to calculate the distance between two things is to calculate the Euclidean distance.
And once we calculated the distance between every pair of cells, MDS and PCoA would reduce them to a 2D graph.
The bad news
is that if we used the Euclidean distance, the graph would be identical to a PCA graph!! In other words, clustering based on minimizing the linear distances is the same as maximizing the linear correlations.
I think this is why Prof. Hung-yi Lee says at the start of the t-SNE lecture that other unsupervised dimensionality-reduction algorithms only focus on making intra-cluster distances small, while t-SNE also considers making inter-cluster distances large.
In other words, the essence of PCA (or another way to interpret it) is just to find a transformation that maps points that were close in the original space even closer together after the transformation; it never considers intra- vs. inter-cluster structure but treats all points alike.
The good news
is that there are tons of other ways to measure distance!!! For example, another way to measure distances between cells is to calculate the average of the absolute values of the log fold changes among genes.
Finally, we get a plot different from the PCA plot
A biologist might choose to use log fold change to calculate distance because they are frequently interested in log fold changes among genes...
But there are lots of distance to choose from...
- Manhattan Distance
- Hamming Distance
- Great Circle Distance
In summary:
- PCA creates plots based on correlations among samples;
- MDS and PCoA create plots based on distances among samples.
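A numpy sketch of classical MDS/PCoA on a toy "cells x genes" matrix: compute pairwise distances, double-center the squared distances, and take the top eigenvectors as 2-D coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))            # 6 "cells", 10 "genes" (made-up data)

# Step 1: distance between every pair of samples (Euclidean here).
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Step 2: classical MDS / PCoA via double centering of the squared distances.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigval, eigvec = np.linalg.eigh(B)
order = np.argsort(eigval)[::-1]        # largest eigenvalues first
coords = eigvec[:, order[:2]] * np.sqrt(eigval[order[:2]])

print(coords.shape)  # (6, 2): a 2-D map built purely from distances
```

Swapping the Euclidean distance for any other distance matrix (log fold changes, Manhattan, etc.) leaves the rest of the procedure unchanged, which is exactly the flexibility described above.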

 Oct 2018

192.168.199.102:5000 192.168.199.102:5000

T-distributed Stochastic Neighbor Embedding (t-SNE)
All the methods introduced so far share the same flaw:
similar data are close, but different data may collapse; that is, points with similar labels do end up close together, but points with dissimilar labels may end up close together as well.
The idea behind t-SNE
\(x \rightarrow z\)
t-SNE is still dimensionality reduction, from a vector x down to z, but it has a distinctive normalization step:
Step 1 of t-SNE: similarity normalization
This step assumes we already have a similarity formula; the formula itself is discussed separately in Step 4, because it is where the magic lies.
This step normalizes the similarity between every pair of points, so that all similarity values lie in [0, 1]. You can view it as standardizing the similarities, or as computing the conditional probability distributions needed for the KL divergence.
Compute the similarity between all pairs of x: \(S(x^i, x^j)\)
Here we use Similarity(A, B) to approximate P(A and B), and \(\sum_{A\neq B}S(A,B)\) to approximate P(B):
\(P(A\mid B) = \frac{P(A\cap B)}{P(B)} = \frac{P(A\cap B)}{\sum_{all\ I\neq B}P(I\cap B)}\)
\(P(x^j\mid x^i)=\frac{S(x^i, x^j)}{\sum_{k\neq i}S(x^i, x^k)}\)
Suppose we have already found a low-dimensional z-space. Then we can also compute the similarities of the transformed samples, and from them the conditional probabilities for \(z^i\) and \(z^j\).
Compute the similarity between all pairs of z: \(S'(z^i, z^j)\)
\(Q(z^j\mid z^i)=\frac{S'(z^i, z^j)}{\sum_{k\neq i}S'(z^i, z^k)}\)
Find a set of z making the two distributions as close as possible:
\(L = \sum_{i}KL\big(P(\star\mid x^i)\,\big\|\,Q(\star\mid z^i)\big)\)
Step 2 of t-SNE: find z
We want to find a set of transformed "samples" such that the distributions before and after the transformation (measured by the KL divergence) are as close as possible.
To measure how close two distributions are, we use the KL divergence (related to information gain): the smaller the KL divergence, the closer the two distributions.
\(L = \sum_{i}KL\big(P(\star\mid x^i)\,\big\|\,Q(\star\mid z^i)\big)\)
Find the \(z^i\) that minimize L.
This part is straightforward: once we have the similarity formulas, we can write the KL divergence as a function of the \(z^i\) and minimize it by gradient descent (GD).
Step 3: drawbacks of t-SNE
- It must compute the similarity of every pair of points.
- When a new point arrives, its similarity to all existing points must be computed.
- Because of that, all the downstream conditional probabilities \(P\) and \(Q\) must be recomputed as well.
Since t-SNE requires the pairwise similarities of the whole dataset, it is computationally very expensive. Moreover, adding a new data point affects the entire procedure, which has to be rerun from scratch; this is quite unfriendly, so t-SNE is generally not used during training but only for visualization, and even for visualization it is used sparingly, again because of its high computational cost.
When visualizing with t-SNE, one usually first uses PCA to reduce the data from thousands of dimensions down to a few dozen, then applies t-SNE to those few dozen dimensions, e.g., reducing to 2 dimensions before plotting.
Step 4: the t-SNE similarity formula
We saw earlier a similarity formula based on the 2-norm (Euclidean) distance between two points \(x^i, x^j\):
\(S(x^i, x^j)=\exp(-\|x^i - x^j\|_2)\)
It is commonly used to compute similarity in graph models. Its advantage is that only very close points give the formula an appreciable value, because the exponential makes the result decay exponentially as the distance between the two points grows.
Before t-SNE there was an algorithm called SNE, which uses this formula in both the z-space and the x-space:
similarity in x-space: \(S(x^i, x^j)=\exp(-\|x^i - x^j\|_2)\); similarity in z-space: \(S'(z^i, z^j)=\exp(-\|z^i - z^j\|_2)\)
The magic of t-SNE is that in the z-space it uses a different similarity formula, one member of the t-distribution family (the t-distribution has a parameter that can be tuned to give many different shapes):
\(S(x^i, x^j)=\exp(-\|x^i - x^j\|_2)\) \(S'(z^i, z^j)=\frac{1}{1+\|z^i - z^j\|_2}\)
Comparing the two function graphs explains why this modification helps, and why it guarantees that points close in x-space stay close in z-space, while points somewhat far apart in x-space get pulled much farther apart in z-space.
That is, if there are gaps (low similarities) between points in the original x-space, those gaps are amplified after the mapping into z-space, becoming larger and larger.
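A minimal numpy sketch of the two similarity formulas and the KL objective (random toy points; real t-SNE additionally uses squared distances and per-point bandwidths chosen via perplexity, which are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))             # 5 points in the original x-space
Z = rng.normal(size=(5, 2))             # candidate low-dimensional points

def cond_probs(S):
    """Row-normalize a similarity matrix: P(j | i) = S_ij / sum_{k != i} S_ik."""
    S = S.copy()
    np.fill_diagonal(S, 0.0)            # exclude k == i from the normalization
    return S / S.sum(axis=1, keepdims=True)

dist_x = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
dist_z = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)

P = cond_probs(np.exp(-dist_x))         # x-space: exponential similarity
Q = cond_probs(1.0 / (1.0 + dist_z))    # z-space: heavy-tailed t-style similarity

# Loss: sum over i of KL(P(.|i) || Q(.|i)); gradient descent on Z minimizes this.
mask = ~np.eye(5, dtype=bool)
L = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(L >= 0)  # True: the KL divergence is non-negative
```

The heavy tail of the z-space formula is what stretches moderate x-space gaps into large z-space gaps.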

Unsupervised Learning: Neighbor Embedding
The famous t-SNE algorithm ("NE" = Neighbor Embedding)
Manifold learning
Manifolds and the failure of Euclidean distance
What is a manifold? A manifold is, for example, a 2-D plane rolled up into a 3-D object. Its key property is that the Euclidean distance between two points in the 3-D space does not reflect how "close" they are on the (unrolled) 2-D surface, especially when the distance is large: Euclidean geometry fails at long range in the 3-D space and can only be used when the distances are small.
What manifold learning does about this long-range failure of Euclidean geometry is flatten the rolled-up surface, so that Euclidean geometry can be used again (after all, many of our algorithms are based on Euclidean distance). This flattening is itself a form of dimensionality reduction.
Manifold learning algorithm 1: LLE
Another "you are defined by your circle of friends" algorithm.
Step 1: compute w
For each point in the dataset, select its K (a hyperparameter, like the K in KNN) neighbors, and define the relation between the point \(x^i\) and its neighbor \(x^j\) as \(w_{ij}\): \(w_{ij}\) represents the relation between \(x^i\) and \(x^j\).
The \(w_{ij}\) are what we are looking for: we want \(x^i\) to be approximated by the \(w_{ij}\)-weighted sum of its K neighbors, with the quality of the approximation measured by Euclidean distance:
Given \(x^i, x^j\), find the set of \(w_{ij}\) minimizing
\(w_{ij} = \arg\min_{w_{ij},\,i\in [1,N],\,j\in [1,K]}\sum_i\left\|x^i - \sum_j w_{ij}x^j\right\|_2\)
Step 2: compute z for the dimensionality reduction. Keep \(w_{ij}\) unchanged and find the \(z^i\) and \(z^j\) that reduce \(x^i, x^j\) to \(z^i, z^j\), under the principle that \(w_{ij}\) stays fixed. Since we are doing dimension reduction, the new \(z^i, z^j\) should have lower dimension than \(x^i, x^j\):
Given \(w_{ij}\), find the set of \(z^i\) minimizing
\(z_{i} = \arg\min_{z_{i},\,i\in [1,N],\,j\in [1,K]}\sum_i\left\|z^i - \sum_j w_{ij}z^j\right\|_2\)
A characteristic of LLE: it is transductive learning; like KNN, there is no explicit function (e.g., \(f(x)=z\)) doing the dimensionality reduction.
One advantage of LLE (see Step 2): even if we do not know the \(x^i\) themselves, as long as we know the pairwise relations \(w_{ij}\) we can still use LLE to find the \(z^i\), because the only role of the \(x^i\) is to determine the \(w_{ij}\).
One burden of LLE: the number of neighbors K must be chosen carefully; only a just-right value gives good results.
If K is too small, there are too few weights w (model parameters) overall, the model is too weak, and the results are poor.
If K is too large, points far from \(x^i\) (in the x-space, i.e., the rolled-up 2-D surface) are also taken into account; as analyzed above, the defining property of a manifold is that Euclidean distance fails at large distances, and our formula for computing w uses exactly the Euclidean distance, so the results are again poor.
This is why K is so critical in LLE.
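A numpy sketch of the two LLE steps on a toy curve (K and the regularization constant are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data on an arc: a 1-D manifold embedded in 2-D.
t = np.sort(rng.uniform(0, 3, 40))
X = np.column_stack([np.cos(t), np.sin(t)])

K = 5
N = len(X)
W = np.zeros((N, N))

# Step 1: for each x_i, find weights w_ij that best reconstruct it from its
# K nearest neighbours (minimizing ||x_i - sum_j w_ij x_j||, weights sum to 1).
for i in range(N):
    d = np.linalg.norm(X - X[i], axis=1)
    nbrs = np.argsort(d)[1:K + 1]                # skip the point itself
    G = (X[nbrs] - X[i]) @ (X[nbrs] - X[i]).T    # local Gram matrix
    G += 1e-3 * np.trace(G) * np.eye(K)          # small regularizer for stability
    w = np.linalg.solve(G, np.ones(K))
    W[i, nbrs] = w / w.sum()

# Step 2: keep W fixed and find low-dimensional z minimizing
# sum_i ||z_i - sum_j w_ij z_j||^2  ->  bottom eigenvectors of (I-W)^T (I-W).
M = (np.eye(N) - W).T @ (np.eye(N) - W)
eigval, eigvec = np.linalg.eigh(M)
z = eigvec[:, 1]                                 # skip the constant eigenvector
print(z.shape)  # (40,): a 1-D embedding of the curve
```

Note that step 2 uses only W, not X, which is exactly the advantage mentioned above: the \(x^i\) are needed only to produce the \(w_{ij}\).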
Manifold learning algorithm 2: Laplacian Eigenmaps
A graph-based approach to the manifold problem.
Compute the pairwise similarities of the points in the dataset, and connect two points whenever their similarity exceeds some threshold, thereby constructing a graph. Once we have the graph, the distance between two points can be replaced by the length of the path connecting them; in other words, Laplacian eigenmaps does not compute the straight-line (Euclidean) distance between two points but the distance along the curve through the graph.
Recall the description of graph-based methods from semi-supervised learning: if x1 and x2 are close within a high-density region, then the two share the same label (class). The formula we used was:
\(L=\sum_{x^r}C(y^r, \hat{y}^r) + \lambda S\)
\(S=\frac{1}{2}\sum_{i,j}w_{i,j}(y^i - y^j)^2=y^TLy\)
\(L = D - W\)
\(w_{i,j} = \text{similarity between } i \text{ and } j \text{ if connected, else } 0\)
- \(x^r\): labeled data
- \(S\): the smoothness of the graph (drawn from the whole dataset)
- \(w\): the similarity between two points, i.e., the edge weights of the graph
- \(y^i\): predicted label
- \(\hat{y}^r\): true label
- \(L\): the graph Laplacian
The same idea can be used in unsupervised learning: if the similarity \(w_{i,j}\) between \(x^i\) and \(x^j\) is large, then after the dimensionality reduction (after the surface is flattened) the Euclidean distance between \(z^i\) and \(z^j\) should be small:
\(S=\frac{1}{2}\sum_{i,j}w_{i,j}(z^i - z^j)^2\)
But merely minimizing this S drives it to its minimum of 0, so we must put some constraints on z. Although we are flattening a rolled-up higher-dimensional surface, we do not want the flattened (reduced) result to be "flattenable" again (surface → flattened, still a surface → flatten again); in other words, this flattening should already be the flattest possible:
if the dimension of z is M, \(\mathrm{Span}\{z^1, z^2, ..., z^N\} = R^M\)
(Stating the conclusion:) it can be shown that this z consists of the eigenvectors of the Laplacian \(L\) corresponding to its smaller eigenvalues. That is why the whole algorithm is called Laplacian eigenmaps: the z it finds are the eigenvectors of the smallest eigenvalues of the Laplacian matrix.
Spectral clustering: clustering on z
Combining this with Laplacian eigenmaps: if we run clustering (e.g., k-means) on the z found by Laplacian eigenmaps, the resulting algorithm is spectral clustering.
spectral clustering = Laplacian eigenmaps reduction + clustering
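A sketch of Laplacian eigenmaps followed by k-means, i.e., spectral clustering, on two toy blobs (the similarity threshold is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs: the similarity graph splits into two components.
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(5, 0.3, size=(10, 2))])

# Similarity graph: w_ij = exp(-||x_i - x_j||^2), thresholded so that only
# nearby points stay connected.
d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
W = np.exp(-d ** 2)
W[W < 1e-3] = 0.0
np.fill_diagonal(W, 0.0)

# Graph Laplacian L = D - W (D is the diagonal degree matrix).
D = np.diag(W.sum(axis=1))
L = D - W

# Laplacian eigenmaps: embed with the eigenvectors of the smallest eigenvalues.
eigval, eigvec = np.linalg.eigh(L)
z = eigvec[:, :2]

# Spectral clustering = clustering (k-means here) on the embedded z.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(z)
print(sorted(np.bincount(labels)))  # [10, 10]: the two blobs are recovered
```

Because the two blobs form disconnected components of the graph, the bottom eigenvectors are piecewise constant on them, so k-means on z separates the blobs exactly.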
T-distributed Stochastic Neighbor Embedding (t-SNE)

Unsupervised Learning: Word Embedding
Why word embedding?
Word embedding is a very good and very widely known application of dimensionality reduction.
1-of-N encoding and its drawbacks
apple = [1 0 0 0 0]
bag = [0 1 0 0 0]
cat = [0 0 1 0 0]
dog = [
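A tiny numpy illustration of the drawback: 1-of-N vectors are mutually orthogonal, so the encoding carries no information about which words are related (the word list is made up):

```python
import numpy as np

words = ["apple", "bag", "cat", "dog", "elephant"]
# 1-of-N encoding: each word is one basis vector of length N.
vectors = {w: np.eye(len(words))[i] for i, w in enumerate(words)}

# Every pair of distinct words has dot product 0: "cat" is exactly as
# dissimilar from "dog" as it is from "bag".
print(vectors["cat"] @ vectors["dog"])  # 0.0
print(vectors["cat"] @ vectors["cat"])  # 1.0
```

Word embeddings fix this by reducing the N-dimensional one-hot space to a dense space where related words end up close together.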
