5 Matching Annotations
1. May 2019
2. towardsdatascience.com
1. we represent that to the model as that feature taking on its expected value over the whole dataset

Kind of similar to occlusion, if you take a simplistic view in terms of pixels as opposed to super-pixels.

2. In the case where the model is linear, or where the features are truly independent, this problem is a trivial one: no matter the values of other features, or the sequence in which features are added to the model, the contribution of a given feature is the same.

I guess this agrees with the equation in the paper. So basically inp*grad (input times gradient, from the paper).

3. finding each

This also takes care of finding the marginal contribution compared to when all the players are absent (the empty set), i.e. subtracting the expectation over the dataset, as in (https://christophm.github.io/interpretable-ml-book/shapley.html).
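The idea in the two notes above can be sketched in code. This is a hypothetical toy implementation (not from the article): it computes exact Shapley values by averaging each feature's marginal contribution over all orderings, and represents an "absent" feature by its mean over the dataset, as in the highlighted quote. The names `model`, `X`, and `x` are assumptions for illustration; the factorial-time loop is only viable for a handful of features.

```python
from itertools import permutations
from math import factorial

import numpy as np

def shapley_values(model, X, x):
    """Exact Shapley values for one instance x (exponential cost: toy only).

    Absent features are replaced by their dataset mean, so the empty
    coalition's value is model(E[X]) -- the "expectation over the dataset"
    baseline mentioned in the note.
    """
    n = X.shape[1]
    means = X.mean(axis=0)              # expected value of each feature
    phi = np.zeros(n)

    def value(present):
        z = means.copy()                # absent features -> dataset mean
        idx = list(present)
        z[idx] = x[idx]                 # present features -> actual values
        return model(z)

    for order in permutations(range(n)):
        present = set()
        for j in order:
            before = value(present)
            present.add(j)
            phi[j] += value(present) - before   # marginal contribution of j
    return phi / factorial(n)
```

For a linear model the result matches the note's observation that order doesn't matter: feature j always contributes w_j * (x_j - mean_j).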

#### URL

3. Apr 2018
4. neuralnetworksanddeeplearning.com
1. Where does the "softmax" name come from

This one's quite interesting. The output of the maximum function would look something like [0, 0, ..., 1, ..., 0] (a 1 at the maximum value). That's why the name softmax: the standard case c = 1 is a softened version of this maximum.

There's also another interesting article explaining why to use softmax rather than simple normalization.

#### URL

5. neuralnetworksanddeeplearning.com
1. There are other insights along these lines which can be obtained from (BP1) $\delta^L_j = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j)$ through (BP4) $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$. Let's start by looking at the output layer. Consider the term $\sigma'(z^L_j)$ in (BP1). Recall from the graph of the sigmoid function in the last chapter that the $\sigma$ function becomes very flat when $\sigma(z^L_j)$ is approximately $0$ or $1$. When this occurs we will have $\sigma'(z^L_j) \approx 0$. And so the lesson is that a weight in the final layer will learn slowly if the output neuron is either low activation ($\approx 0$) or high activation ($\approx 1$). In this case it's common to say the output neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly). Similar remarks hold also for the biases of output neurons.

True for a sigmoid output layer.
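The saturation effect in the quote is easy to see numerically. A quick sketch (mine, not from the book): $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ peaks at $0.25$ when $z = 0$ and collapses toward zero for large $|z|$, which is exactly why weights feeding a saturated sigmoid output barely update under (BP1).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_prime(0.0))    # 0.25, the maximum: learning is fastest here
print(sigmoid_prime(10.0))   # ~4.5e-5: saturated neuron, learning stalls
```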