 May 2017

www.inf.ed.ac.uk

From Section 2.1.2, and (1.45), we know that $E(x^\top x) = \sum_{i=1}^{d}$

$(x^\top w_i)(w_i^\top x)$
These are scalars, which is why they can be swapped.
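A quick numerical check of the swap, as a minimal sketch in plain Python. The vectors `x` and `w` are made-up values, not taken from the notes. Because x^T w and w^T x are the same scalar, their product can be reordered, and it can also be regrouped as w^T (x x^T) w, which is presumably why the swap is useful in this derivation.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

# Hypothetical example vectors (illustrative only).
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]

# (x^T w)(w^T x): a product of two scalars.
original = dot(x, w) * dot(w, x)

# Swapped order (w^T x)(x^T w): the same two scalars, so the same value.
swapped = dot(w, x) * dot(x, w)

# Regrouped as w^T (x x^T) w, using the outer product x x^T.
outer = [[xi * xj for xj in x] for xi in x]
quadratic_form = dot(w, [dot(row, w) for row in outer])

print(original, swapped, quadratic_form)
```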

Figure 1.8
Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.
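The points made in the caption can be checked with a small sketch. The `pearson` helper and the data sets below are illustrative assumptions, not the data behind the figure. The helper returns `None` when either variable has zero variance, matching the "undefined" case in the caption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, or None when it is undefined."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # zero variance: the coefficient is undefined
    return cov / math.sqrt(var_x * var_y)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]

r_steep = pearson(xs, [10.0 * x for x in xs])           # steep rising line
r_shallow = pearson(xs, [0.1 * x for x in xs])          # shallow rising line
r_falling = pearson(xs, [-2.0 * x for x in xs])         # falling line
r_flat = pearson(xs, [5.0 for _ in xs])                 # slope 0, Y constant
r_parabola = pearson(xs, [(x - 2.0) ** 2 for x in xs])  # symmetric nonlinear relation

print(r_steep, r_shallow, r_falling, r_flat, r_parabola)
```

The steep and shallow lines give the same coefficient, the falling line flips the sign, the flat line has no defined coefficient, and the parabola gives a coefficient of zero despite y being fully determined by x.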


d1b10bmlvqabco.cloudfront.net

the data
Not the data themselves, but the variance of the data.

One problem in which Metric MDS could be used would be to represent cities in a two-dimensional space, simulating a map. This is a very intuitive example that makes use of the basic Euclidean distance. If we had a matrix of distances between many cities, we could use Metric MDS to represent them on a plane. If everything works as expected, the arrangement of the cities would be the same as the one found on a map. Another example would be to represent the digits from the previous example. The values of the 784-dimensional vectors representing each digit carry information about the (gray) color intensity of the image; hence, Metric MDS would be suitable for representing the digits in a two-dimensional space, in the hope of getting a different cluster for each of the digits.
Are these correct?
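The city example can be tried on a toy case. The sketch below uses classical (Torgerson) MDS, the simplest metric variant, rather than an iterative stress-minimising Metric MDS, and finds the top eigenvectors by power iteration to stay within the standard library. The four "cities" and their distances are hypothetical (the corners of a 3 by 4 rectangle). The recovered layout matches a map only up to rotation, reflection, and translation, so the check compares pairwise distances rather than coordinates.

```python
import math

def classical_mds(dist, dims=2, iters=500):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double centering turns squared distances into a Gram matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Arbitrary start vector for power iteration.
        v = [math.sin(i + 1.0 + k) for i in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        for i in range(n):
            coords[i][k] = math.sqrt(max(lam, 0.0)) * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Hypothetical distances between four cities A, B, C, D.
dist = [
    [0, 3, 5, 4],
    [3, 0, 4, 5],
    [5, 4, 0, 3],
    [4, 5, 3, 0],
]

coords = classical_mds(dist)
recovered = [[math.dist(p, q) for q in coords] for p in coords]
for name, point in zip("ABCD", coords):
    print(name, [round(c, 3) for c in point])
```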

 Nov 2016

www.inf.ed.ac.uk

Dropout training

 Oct 2016

www.learn.ed.ac.uk

3.
The Gaussian that maximizes the likelihood has exactly the same mean and variance as the data. You could take the derivative of the likelihood with respect to the mean and the variance, set it to zero, and you would find that the maximum is attained at exactly these values.
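This can be checked numerically. The following is a minimal sketch with made-up data: it computes the sample mean and the sample variance (dividing by n, which is the maximum-likelihood form) and confirms that the Gaussian log-likelihood is lower at a handful of nearby parameter values. The function name and the perturbation grid are illustrative choices.

```python
import math

def gaussian_log_likelihood(data, mu, var):
    """Log-likelihood of i.i.d. data under a Gaussian with mean mu and variance var."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)

# Hypothetical data set (illustrative only).
data = [2.0, 4.0, 4.0, 5.0, 7.0]
n = len(data)

mu_hat = sum(data) / n                               # sample mean
var_hat = sum((x - mu_hat) ** 2 for x in data) / n   # sample variance, dividing by n

best = gaussian_log_likelihood(data, mu_hat, var_hat)

# Try shifted means and rescaled variances around the estimate.
nearby = [(mu_hat + dm, var_hat * s)
          for dm in (-0.5, 0.0, 0.5)
          for s in (0.5, 1.0, 2.0)
          if (dm, s) != (0.0, 1.0)]
all_lower = all(gaussian_log_likelihood(data, m, v) < best for m, v in nearby)

print(mu_hat, var_hat, all_lower)
```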


neuralnetworksanddeeplearning.com

Sigmoid neurons simulating perceptrons, part I
$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}$$
Multiplying the weights and bias by a constant c > 0 leaves three cases for w⋅x + b: zero, negative, or positive.
If it is zero, then c × 0 = 0, which is still classified as 0, so nothing changes.
If it is negative, then c × negative = positive × negative = negative, which is still classified as 0, so nothing changes.
If it is positive, then c × positive = positive × positive = positive, which is still classified as 1, so again nothing changes.
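The three cases can be exercised with a short sketch. The weights, bias, and constant `c` below are arbitrary illustrative values; the only requirement is c > 0. The last check covers the boundary case where w⋅x + b is exactly zero.

```python
def perceptron(weights, bias, inputs):
    """Step-function perceptron: 1 if w.x + b > 0, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

# Illustrative values (assumptions, not from the source).
weights, bias = [-2.0, -2.0], 3.0
c = 7.5  # any positive constant

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
scaled_weights = [c * w for w in weights]

original = [perceptron(weights, bias, x) for x in inputs]
scaled = [perceptron(scaled_weights, c * bias, x) for x in inputs]

# Boundary case: w.x + b == 0 stays 0 after scaling, so the output stays 0.
boundary_original = perceptron([1.0, -1.0], 0.0, (1, 1))
boundary_scaled = perceptron([c * 1.0, c * -1.0], c * 0.0, (1, 1))

print(original, scaled, boundary_original, boundary_scaled)
```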

If you don't find this obvious, you should stop and prove to yourself that this is equivalent.
 -2  -2  bias 3      out
  0   0       3  ->  True
  0   1       1  ->  True
  1   0       1  ->  True
  1   1      -1  ->  False

 x1  x2  x1*x2 (-4)  bias 3      out
  0   0           0       3  ->  True
  0   1           0       3  ->  True
  1   0           0       3  ->  True
  1   1           1      -1  ->  False
Proven :)
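The same check can be written as code. This is a sketch assuming the weights of -2 and the bias of 3 from the book's NAND perceptron. In the quoted passage both connections carry the same signal x, so the comparison is between -2x - 2x + 3 and -4x + 3; the function names are illustrative.

```python
BIAS = 3

def step(z):
    """Perceptron threshold: fires when the weighted sum is positive."""
    return z > 0

def nand(x1, x2):
    """Two separate inputs, each with weight -2."""
    return step(-2 * x1 - 2 * x2 + BIAS)

def two_connections(x):
    """The same signal x arriving on two connections of weight -2."""
    return step(-2 * x + -2 * x + BIAS)

def merged(x):
    """A single connection of weight -4 carrying x."""
    return step(-4 * x + BIAS)

nand_table = {(a, b): nand(a, b) for a in (0, 1) for b in (0, 1)}
equivalent = all(two_connections(x) == merged(x) for x in (0, 1))

print(nand_table, equivalent)
```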

In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer
A university professor claims that there are published academic papers showing that this is not always true. Later layers are not always more abstract; sometimes it is the other way around. Sorry, no link was provided.


neuralnetworksanddeeplearning.com

The first assumption we need is that the cost function can be written as an average
How could we have a cost / error function that cannot be written as an average?
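For reference, this is what the assumption looks like for the quadratic cost, as a minimal sketch: the total cost is the mean of per-example costs, each depending on one training example only. The targets and outputs are made-up values and the function names are illustrative.

```python
import math

def example_cost(target, output):
    """Quadratic cost for a single training example: 0.5 * ||y - a||^2."""
    return 0.5 * sum((y - a) ** 2 for y, a in zip(target, output))

def average_cost(targets, outputs):
    """Total cost written as an average over per-example costs."""
    costs = [example_cost(y, a) for y, a in zip(targets, outputs)]
    return sum(costs) / len(costs)

# Hypothetical targets and network outputs for two training examples.
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.5], [0.0, 1.0]]

total = average_cost(targets, outputs)
print(total)
```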


www.learn.ed.ac.uk dt.pdf

Sub‐tree replacement pruning (WF 6.1
Say you have a tree with 5 nodes: n1, n2, and so on. First you prune n1 alone, leaving a tree with n2, n3, n4, and n5, and check on your held-out data whether that increased the accuracy. Then you try it with only n2 pruned and check again. Do that for every tree with exactly one node pruned. You then prune whichever node gives the largest increase in accuracy, and start the process again. So if pruning n5 gave the biggest increase, you continue with the tree containing n1, n2, n3, and n4; n5 is gone for good. Then you try all the possible trees with just one of the remaining nodes pruned. You keep doing this until pruning a node no longer increases accuracy.
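The greedy loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation from WF: the `Node` class, the `pruned` flag, and the toy tree and data are all hypothetical, and pruning a node is modelled as replacing its subtree with a leaf that predicts the node's stored majority label.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: int                      # majority class here, used when acting as a leaf
    feature: Optional[int] = None   # index of the tested feature; None for a true leaf
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    pruned: bool = False            # True once the subtree has been replaced by a leaf

    def is_leaf(self):
        return self.feature is None or self.pruned

def predict(node, x):
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

def accuracy(root, data):
    return sum(predict(root, x) == y for x, y in data) / len(data)

def internal_nodes(node):
    if node.is_leaf():
        return []
    return [node] + internal_nodes(node.left) + internal_nodes(node.right)

def greedy_prune(root, held_out):
    """Repeatedly prune the single node that most improves held-out accuracy."""
    best = accuracy(root, held_out)
    while True:
        trials = []
        for node in internal_nodes(root):
            node.pruned = True                      # try the tree with only this node cut
            trials.append((accuracy(root, held_out), node))
            node.pruned = False                     # undo the trial
        if not trials:
            break
        trial_acc, winner = max(trials, key=lambda t: t[0])
        if trial_acc <= best:                       # no cut helps any more
            break
        winner.pruned = True                        # this cut is permanent
        best = trial_acc
    return best

# Hypothetical tree: the right subtree splits on feature 1 and overfits.
right = Node(label=1, feature=1, threshold=0.5, left=Node(label=1), right=Node(label=0))
root = Node(label=0, feature=0, threshold=0.5, left=Node(label=0), right=right)

held_out = [([0, 0], 0), ([1, 0], 1), ([1, 1], 1), ([0, 1], 0)]

before = accuracy(root, held_out)
after = greedy_prune(root, held_out)
print(before, after, right.pruned, root.pruned)
```

On this toy data, cutting the right subtree raises held-out accuracy, after which cutting the root would lower it, so the loop stops with one node pruned.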
