6 Matching Annotations
  1. Oct 2020
    1. The number of hidden neurons should be between the size of the input layer and the size of the output layer. The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer. The number of hidden neurons should be less than twice the size of the input layer.

      Three rules of thumb for choosing the number of hidden layers and neurons; a quick sketch applying them follows.
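      A minimal sketch (not from the annotated article) that applies the three heuristics; the function name and the 784/10 layer sizes are just illustrative:

      ```python
      # Illustrative helper, not from the annotated article: applies the three
      # rules of thumb to a given input/output layer size.
      def hidden_neuron_heuristics(n_in, n_out):
          return {
              "between input and output size": (min(n_in, n_out), max(n_in, n_out)),
              "2/3 of input size plus output size": round(2 / 3 * n_in + n_out),
              "less than twice the input size": 2 * n_in - 1,
          }

      # Example: 784 inputs (28x28 pixels), 10 output classes.
      print(hidden_neuron_heuristics(784, 10))
      # {'between input and output size': (10, 784),
      #  '2/3 of input size plus output size': 533,
      #  'less than twice the input size': 1567}
      ```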

  2. Feb 2019
    1. One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable. Usually, this computational advantage is leveraged by performing many more iterations of SGD, making many more steps than conventional batch gradient descent. This usually results in a model that is very close to that which would be found via batch gradient descent, or better.

      Good explanation of why SGD is computationally cheaper. I was confused about the benefit of repeatedly performing mini-batch GD, and why it might beat batch GD. The advantage seems to come from each mini-batch being small enough to hold in memory and vectorize efficiently, which lets you take many more gradient steps for the same compute (see the sketch below).
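      A rough NumPy sketch of the contrast on a toy linear-regression problem (all names and numbers are illustrative, not from the article): a full-batch step must touch every sample, while each mini-batch step only needs one small, easily vectorized batch.

      ```python
      import numpy as np

      def batch_gradient_step(X, y, w, lr=0.01):
          # One full-batch step: the gradient touches every sample, so the whole
          # dataset has to be available (in RAM or streamed) for a single update.
          return w - lr * X.T @ (X @ w - y) / len(y)

      def minibatch_sgd(X, y, w, lr=0.01, batch_size=32, epochs=20, seed=0):
          # Many cheap steps: each update only needs one small, vectorizable batch.
          rng = np.random.default_rng(seed)
          n = len(y)
          for _ in range(epochs):
              for idx in np.array_split(rng.permutation(n), max(n // batch_size, 1)):
                  Xb, yb = X[idx], y[idx]
                  w = w - lr * Xb.T @ (Xb @ w - yb) / len(yb)
          return w

      # Toy data: y = X @ true_w + noise.
      rng = np.random.default_rng(42)
      X = rng.normal(size=(1000, 5))
      true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
      y = X @ true_w + 0.1 * rng.normal(size=1000)

      w = minibatch_sgd(X, y, np.zeros(5))
      print(np.round(w, 2))  # should end up close to true_w
      ```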

    1. And so it makes most sense to regard epoch 280 as the point beyond which overfitting is dominating learning in our neural network.

      I do not get this. Epoch 15 already indicates that we are overfitting to the training data set, no? Assuming both the training and test sets come from the same population that we are trying to learn from.

    2. If we see that the accuracy on the test data is no longer improving, then we should stop training

      This seems to contradict the earlier statement about epoch 280 being the point where overfitting starts to dominate. A minimal early-stopping sketch follows.
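      A minimal sketch of the early-stopping strategy the quote describes, assuming hypothetical `train_one_epoch` and `evaluate_accuracy` callbacks (not from any particular library):

      ```python
      def train_with_early_stopping(model, train_data, val_data,
                                    train_one_epoch, evaluate_accuracy,
                                    max_epochs=400, patience=10):
          """Stop once held-out accuracy has not improved for `patience` epochs."""
          best_acc, best_epoch = 0.0, 0
          for epoch in range(max_epochs):
              train_one_epoch(model, train_data)
              acc = evaluate_accuracy(model, val_data)
              if acc > best_acc:
                  best_acc, best_epoch = acc, epoch
              elif epoch - best_epoch >= patience:
                  break  # accuracy has plateaued; further epochs mostly overfit
          return model, best_epoch, best_acc
      ```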

    3. It might be that accuracy on the test data and the training data both stop improving at the same time

      Can this happen? Can the accuracy on the training data set ever stop improving as training epochs increase?

    4. What is the limiting value for the output activations a_j^L

      When c is large, small differences in the z_j^L are magnified and each activation jumps to 0 or 1, depending on the sign of those differences. On the other hand, when c is very small, all activation values will be close to 1/N, where N is the number of neurons in layer L.
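      A quick numerical check of both limits, using a softmax with sharpness parameter c written in NumPy (the function name is just illustrative):

      ```python
      import numpy as np

      def softmax_with_c(z, c):
          # a_j^L = exp(c * z_j) / sum_k exp(c * z_k); shift by the max for stability.
          e = np.exp(c * (z - z.max()))
          return e / e.sum()

      z = np.array([1.0, 2.0, 3.0, 2.5])

      print(softmax_with_c(z, 100.0))  # ~[0, 0, 1, 0]: winner-take-all as c -> infinity
      print(softmax_with_c(z, 0.001))  # ~[0.25, 0.25, 0.25, 0.25]: 1/N as c -> 0
      ```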