41 Matching Annotations
1. Jul 2021
2. christophm.github.io
1. $\mathbb{E}_{D_x(z|A)}\left[\mathbb{1}_{f(x)=f(z)}\right]\geq\tau,\quad A(x)=1$

Not clear to me what this means.
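My current reading of this (garbled) condition: an anchor $A$ is a rule such that, among perturbed samples $z$ drawn from $D(z|A)$ (samples consistent with the anchor), the model's prediction agrees with $f(x)$ at least a fraction $\tau$ of the time. A toy Monte-Carlo sketch, with model, sampler, and data all invented for illustration:

```python
import random

def anchor_precision(f, x, sample_conditional, n_samples=1000):
    """Monte-Carlo estimate of E_{z~D(z|A)}[1{f(x) == f(z)}]."""
    agree = sum(f(sample_conditional(x)) == f(x) for _ in range(n_samples))
    return agree / n_samples

# Toy model: predicts 1 iff the first feature exceeds 5. The anchor fixes
# the first feature and perturbs only the second.
f = lambda z: int(z[0] > 5)
sample_conditional = lambda x: (x[0], random.uniform(0, 10))

x = (7.0, 3.0)
tau = 0.95
prec = anchor_precision(f, x, sample_conditional)
# Here the anchored feature fully determines f, so precision is 1.0
# and the condition prec >= tau holds.
```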

#### URL

3. christophm.github.io
1. The x-axis shows the feature effect: The weight times the actual feature value.

I do not understand this.

#### URL

4. christophm.github.io
1. You can subtract the lower-order effects in a partial dependence plot to get the pure main or second-order effects

How do we do this?

2. Well, that sounds stupid. Derivation and integration usually cancel each other out, like first subtracting, then adding the same number. Why does it make sense here? The derivative (or interval difference) isolates the effect of the feature of interest and blocks the effect of correlated features.

Say what? How does it remove the correlation? It removes the offset, but does it really remove the correlation?

3. ALE plots are a faster and unbiased alternative to partial dependence plots (PDPs).

Why are the PDPs biased?

#### URL

5. christophm.github.io
1. For each of the categories, we get a PDP estimate by forcing all data instances to have the same category. For example, if we look at the bike rental dataset and are interested in the partial dependence plot for the season, we get 4 numbers, one for each season. To compute the value for "summer", we replace the season of all data instances with "summer" and average the predictions.

Why would we change the season for all instances? This does not make sense. We simply have to take the average over all instances that actually belong to a particular season.

Update: I got it now. You do replace the season of every instance with that value, run all the modified instances through the ML model, and average across its outputs.
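The recipe, as I now understand it, in a few lines of Python (toy model and data invented for illustration):

```python
import numpy as np

def pdp_categorical(model, X, feature_idx, categories):
    """PDP for a categorical feature: force each category on the whole
    dataset, run every modified instance through the model, average."""
    values = {}
    for cat in categories:
        X_mod = [list(row) for row in X]   # copy the whole dataset
        for row in X_mod:
            row[feature_idx] = cat         # force the category on everyone
        values[cat] = float(np.mean([model(row) for row in X_mod]))
    return values

# Toy model: predicted rentals depend on season (idx 0) and temperature (idx 1).
season_effect = {"winter": 0, "spring": 2, "summer": 5, "fall": 1}
model = lambda row: season_effect[row[0]] + 0.5 * row[1]

X = [["winter", 4.0], ["summer", 30.0], ["fall", 12.0]]
pdp = pdp_categorical(model, X, 0, ["winter", "spring", "summer", "fall"])
# Every PDP value averages over the SAME temperatures, so differences
# between the four numbers reflect only the season effect.
```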

2. An assumption of the PDP is that the features in C are not correlated with the features in S. If this assumption is violated, the averages calculated for the partial dependence plot will include data points that are very unlikely or even impossible (see disadvantages).

#### URL

6. christophm.github.io
1. $f(x)=\hat{f}(x_S,x_C)=g(x_S)+h(x_C)$

How do we know we can express it like this?

#### URL

7. nba.uth.tmc.edu
1. innervation ratio

How is this a ratio?

#### URL

8. christophm.github.io
1. A tree with a depth of three requires a maximum of three features and split points to create the explanation for the prediction of an individual instance.

This means that predicting the value for any single instance requires at most three features, even though the tree as a whole can use up to seven distinct features (one per internal node: $2^3-1=7$).

2. $\sum_{j=1}^{p}\text{feat.contrib}(j,x)$

How do we get the feature contribution?
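One common scheme (which I believe is what the book means, though I am not certain) attributes to feature $j$ the change in node mean along the instance's root-to-leaf path at every split on $j$; a hand-rolled sketch with a made-up tree:

```python
class Node:
    def __init__(self, mean, feature=None, threshold=None, left=None, right=None):
        self.mean = mean          # average training target in this node
        self.feature = feature    # split feature index (None for a leaf)
        self.threshold = threshold
        self.left, self.right = left, right

def feature_contributions(root, x):
    """Walk x's decision path; whenever a split on feature j moves us from
    a node with mean m_parent to a child with mean m_child, attribute
    (m_child - m_parent) to feature j."""
    contribs = {}
    node = root
    while node.feature is not None:
        child = node.left if x[node.feature] <= node.threshold else node.right
        contribs[node.feature] = contribs.get(node.feature, 0.0) + (child.mean - node.mean)
        node = child
    return node.mean, contribs    # prediction = root.mean + sum(contribs)

# Depth-2 toy tree splitting on features 0 and 1.
tree = Node(10.0, feature=0, threshold=5.0,
            left=Node(6.0, feature=1, threshold=1.0,
                      left=Node(4.0), right=Node(8.0)),
            right=Node(14.0))

pred, contribs = feature_contributions(tree, x=[3.0, 2.0])
# pred == 8.0, and root mean 10.0 + contribs[0] + contribs[1] == 8.0
```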

3. Feature importance

#### URL

9. christophm.github.io
1. How will this help with comparison? Are we assuming the other model uses categorization?

#### URL

10. christophm.github.io
1. Logistic regression can suffer from complete separation. If there is a feature that would perfectly separate the two classes, the logistic regression model can no longer be trained. This is because the weight for that feature would not converge, because the optimal weight would be infinite. This is really a bit unfortunate, because such a feature is really useful. But you do not need machine learning if you have a simple rule that separates both classes. The problem of complete separation can be solved by introducing penalization of the weights or defining a prior probability distribution of weights.

Cannot understand this.
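A toy numerical illustration of what I think "complete separation" means: with a 1-D feature that perfectly separates the classes, unpenalized gradient ascent on the log-likelihood keeps growing the weight forever, while an L2 penalty (the "penalization of the weights" the passage mentions) caps it at a finite optimum. Data and step sizes are made up:

```python
import numpy as np

def fit_logistic(X, y, l2=0.0, steps=5000, lr=0.1):
    """Gradient ascent on the logistic log-likelihood (no intercept)."""
    w = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-w * X))
        grad = np.sum((y - p) * X) - l2 * w   # penalized gradient
        w += lr * grad
    return w

X = np.array([-2.0, -1.0, 1.0, 2.0])  # negative -> class 0, positive -> class 1
y = np.array([0.0, 0.0, 1.0, 1.0])    # perfectly separated at x = 0

w_free = fit_logistic(X, y, l2=0.0)
w_pen  = fit_logistic(X, y, l2=1.0)
# With separation the gradient never vanishes, so w_free keeps increasing
# the longer we train (it diverges toward infinity); w_pen converges.
```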

2. But usually you do not deal with the odds and interpret the weights only as the odds ratios. Because for actually calculating the odds you would need to set a value for each feature, which only makes sense if you want to look at one specific instance of your dataset.

#### URL

11. christophm.github.io
1. days_since_2011 4.9 0.2 28.5

It looks like there was a steady increase in the number of bikes rented every single day.

2. $\bar{R}^2=1-(1-R^2)\dfrac{n-1}{n-p-1}$

I do not follow this. Is this always guaranteed to be within 0 and 1?
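Plugging numbers into the formula (assuming $n$ = sample size and $p$ = number of features) shows it is not confined to [0, 1]: with a weak fit and many features it goes negative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Strong fit, few features: barely changes the raw R^2.
good = adjusted_r2(0.9, 100, 5)    # about 0.895
# Weak fit, many features: 1 - 0.9 * 19 / 9, i.e. about -0.9.
bad = adjusted_r2(0.1, 20, 10)
```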

#### URL

12. christophm.github.io
1. 2.4 Evaluation of Interpretability

I do not follow this section.

2. Application level evaluation (real task)

I do not understand this. Where is the interpretability here? The software could perform as well as the radiologist, but we might still have no idea how it is doing it.

#### URL

13. May 2021
14. www.gammon.com.au
1. This article discusses interrupts on the Arduino Uno (Atmega328) and similar processors, using the Arduino IDE. The concepts however are very general. The code examples provided should compile on the Arduino IDE (Integrated Development Environment).

This is such a great resource!

#### URL

15. Apr 2021
16. qz.com
1. It’s very hard to get consciousness out of non-consciousness

If particles can come into and go out of existence from nothing, why can't consciousness?

#### URL

17. Mar 2021
18. www.scholarpedia.org
1. low firing rates,

What is low firing rate?

2. $f_{env_i}=f_{min}+\left(\dfrac{f_{max}-f_{min}}{1-U_{th_{MU_i}}}\right)\cdot U$

I do not follow this. This means the neuron is always firing at $f_{min}$? How does $f_{0.5}$ come into the picture?

3. model is an assembly of phenomenological models,

Not sure how this avoids the issues of over-fitting.

4. All of the muscle fibers in all of the motor units of a given muscle tend to move together, experiencing the same sarcomere lengths and velocities

How do we know this? What about motor units that are not activated? What about motor units that are activated with different time delays and different rates?

#### URL

19. Jul 2020
20. en.wikipedia.org
1. effects of crossover and dropout

I understand dropout, but how crossover?

2. Randomized clinical trials analyzed by the intention-to-treat (ITT) approach provide unbiased comparisons among the treatment groups

What is the proof for this? Is there a statistical proof?

#### URL

21. Jan 2020
22. rdgao.github.io
1. Additionally, one of my all time favorite papers (Fröhlich & McCormick, 2010) showed that an applied external field can entrain spiking in ferret cortical slice, parametrically to the oscillatory field frequency.

What are the sources of the LFP? If it is only the current induced by EPSPs and IPSPs, then it is not clear that entrainment through an external field makes a strong argument. If LFPs can be modified by other factors, as is pointed out (Ca++ currents and other glial cell currents), then it is possible that they are not epiphenomenal. But we still need to show that they actually play a causal role in the observed system output.

2. But spikes do not compute! The cells “compute”, dendrites “compute”, the axon hillock “computes”. In that sense, spikes are epiphenomenal: they are the secondary consequences of dendritic computation, of which you can fully infer by knowing the incoming synaptic inputs and biophysical properties of the neuron.

This is correct. But in that case, LFPs and any other oscillations that we are recording are also epiphenomenal.

#### URL

23. Dec 2019
1. estimators is the prior covariance $\Sigma_{\phi\phi}$

How do we know this covariance?

2. If the sensor noise has an independent identical distribution (IID) across channels, the covariance of the sensor noise in the referenced data will be $\Sigma_{\epsilon_r\epsilon_r}=\sigma^2 T_r T_r^{T}$

I do not understand this.

#### URL

25. Mar 2019
26. docs.microsoft.com
1. When the number of references drops to zero, the object deletes itself. The last part is worth repeating: The object deletes itself

How does that work? How does an object delete itself?

#### URL

27. docs.microsoft.com
1. CALLBACK is the calling convention for the function

What is a calling convention?

#### URL

28. Feb 2019
29. colah.github.io
1. Calculus on Computational Graphs: Backpropagation

Good article on computational graphs and their role in backpropagation.

#### URL

30. stats.stackexchange.com
1. One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable. Usually, this computational advantage is leveraged by performing many more iterations of SGD, making many more steps than conventional batch gradient descent. This usually results in a model that is very close to that which would be found via batch gradient descent, or better.

Good explanation for why SGD is computationally better. I was confused about the benefits of repeatedly performing mini-batch GD, and why it might be better than batch GD. But I guess the advantage comes from being able to get better performance by vectorizing computation.
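A minimal sketch of the trade-off on synthetic least-squares data (all constants made up): each mini-batch step touches only a handful of samples, so it is cheap, and many such steps land very close to the exact batch solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_w = 10_000, 3.0
X = rng.normal(size=n)
y = true_w * X + 0.1 * rng.normal(size=n)   # 1-D linear model plus noise

def sgd(w, lr=0.01, batch=32, steps=2000):
    """Mini-batch SGD on squared error; each step loads only one batch."""
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)
        grad = 2 * np.mean((w * X[idx] - y[idx]) * X[idx])
        w -= lr * grad
    return w

w_sgd = sgd(0.0)
w_exact = np.dot(X, y) / np.dot(X, X)   # closed-form least-squares weight
# w_sgd ends up close to w_exact despite never seeing the full dataset
# in any single step.
```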

#### URL

31. neuralnetworksanddeeplearning.com
1. And so it makes most sense to regard epoch 280 as the point beyond which overfitting is dominating learning in our neural network.

I do not get this. Epoch 15 already indicates that we are over-fitting to the training data set, no? Assuming both training and test sets come from the same population that we are trying to learn from.

2. If we see that the accuracy on the test data is no longer improving, then we should stop training

This contradicts the earlier statement about epoch 280 being the point where there is over-training.

3. It might be that accuracy on the test data and the training data both stop improving at the same time

Can this happen? Does the accuracy on the training data set ever stop increasing with the training epochs?

4. test data

Shouldn't this be "training data"?

5. What is the limiting value for the output activations aLj

When c is large, small differences in $z_j^L$ are magnified and the function jumps between 0 and 1, depending on the sign of the differences. On the other hand, when c is very small, all activation values will be close to 1/N, where N is the number of neurons in layer L.
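A quick numerical check of both limits, assuming the usual softmax with sharpness parameter c, $a_j = e^{c z_j}/\sum_k e^{c z_k}$:

```python
import numpy as np

def softmax(z, c):
    """Softmax with sharpness c; subtract the max for numerical stability."""
    e = np.exp(c * (z - np.max(z)))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
a_hi = softmax(z, 100.0)   # large c: winner-take-all, close to [0, 0, 1]
a_lo = softmax(z, 1e-6)    # tiny c: nearly uniform, close to [1/3, 1/3, 1/3]
```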

6. zLj=lnaLj+C

How can the constant C be independent of j? It will have a e^{z_j^L} term in it. This is not correct.