Start with a very low learning rate to avoid jumping to an undesirable minimum, and then increase it once we're no longer at risk of getting stuck there?
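A minimal sketch of such a warmup schedule in Python (the step counts and rates are illustrative, not from the notes): ramp the learning rate linearly from a small start value up to the base rate, then hold it.

    # Sketch of linear learning-rate warmup; all constants are illustrative.
    def warmup_lr(step, base_lr=0.1, warmup_steps=500, start_lr=1e-4):
        """Ramp linearly from start_lr to base_lr over warmup_steps, then hold."""
        if step < warmup_steps:
            frac = step / warmup_steps          # fraction of the ramp completed
            return start_lr + frac * (base_lr - start_lr)
        return base_lr

    for step in (0, 250, 500, 1000):
        print(step, round(warmup_lr(step), 5))  # 0.0001, 0.05005, 0.1, 0.1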
Many recent papers use vanilla SGD without momentum together with a simple learning rate annealing schedule.
So does annealing also help with Adam and other adaptive learning-rate methods?
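For reference, a toy sketch of what that baseline looks like: vanilla SGD (no momentum) with step-decay annealing, minimizing a 1-D quadratic. The decay factor and schedule are made up for illustration.

    # Vanilla SGD with step-decay annealing on f(w) = (w - 3)^2; values illustrative.
    def grad(w):
        return 2.0 * (w - 3.0)                 # derivative of (w - 3)^2

    w, base_lr = 0.0, 0.2
    for step in range(30):
        lr = base_lr * (0.5 ** (step // 10))   # halve the learning rate every 10 steps
        w -= lr * grad(w)

    print(w)                                   # close to the minimum at w = 3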
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update.
Could not follow the redundancy claim here (see the sketch below).
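One way to see it, as a toy sketch (the data and learning rate are made up): when the dataset contains many near-duplicate examples, batch gradient descent evaluates essentially the same per-example gradient a thousand times before making a single update, whereas SGD makes progress after every example.

    # Toy 1-D least-squares problem with many identical examples.
    data = [(1.0, 2.0)] * 1000                 # 1000 identical (x, y) pairs

    def grad(w, x, y):
        return 2.0 * (w * x - y) * x           # d/dw of (w*x - y)^2

    # Batch gradient descent: 1000 gradient evaluations, then one update.
    w = 0.0
    g = sum(grad(w, x, y) for x, y in data) / len(data)
    w -= 0.1 * g                               # w is still far from the solution

    # SGD: an update after every example, so the duplicates are not wasted.
    w_sgd = 0.0
    for x, y in data:
        w_sgd -= 0.1 * grad(w_sgd, x, y)

    print(w, w_sgd)                            # 0.4 vs ~2.0 (the true solution)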
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
How?
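A standard textbook answer (not from these notes), assuming the loss $f$ is $L$-smooth and gradient descent uses step size $\eta = 1/L$. In the convex case every local minimum is global, and
\[
  f(x_k) - f(x^\star) \;\le\; \frac{L\,\lVert x_0 - x^\star\rVert^2}{2k},
\]
so the iterates converge to the global minimum. In the non-convex case the same step size still gives
\[
  \min_{t < k}\,\lVert \nabla f(x_t)\rVert^2 \;\le\; \frac{2L\bigl(f(x_0) - f(x^\star)\bigr)}{k},
\]
i.e. the gradient norm goes to zero and the iterates approach a stationary point (a local minimum or a saddle).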
Due to parameter sharing, you can easily run a pretrained network on images of a different spatial size. This is clearly evident for Conv/Pool layers because their forward function is independent of the input volume's spatial size (as long as the strides “fit”). For FC layers this still holds because an FC layer can be converted to a convolutional layer: for example, in an AlexNet, the final pooling volume before the first FC layer is of size [6x6x512], so the FC layer looking at this volume is equivalent to a convolutional layer with receptive field size 6x6, applied with padding of 0.
Not sure how FCs can be made size-agnostic? (see the sketch below)
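A minimal PyTorch sketch of the FC-to-Conv conversion (layer sizes follow the note above and are illustrative): the FC weights are reshaped into a 6x6 kernel, so the "FC" layer now slides over larger inputs instead of failing on a shape mismatch.

    # Rewrite an FC layer that looks at a [6x6x512] volume as a 6x6 convolution.
    import torch
    import torch.nn as nn

    C, H, W, K = 512, 6, 6, 4096                # sizes taken from the note above

    fc = nn.Linear(C * H * W, K)
    conv = nn.Conv2d(C, K, kernel_size=(H, W), padding=0)
    with torch.no_grad():
        conv.weight.copy_(fc.weight.view(K, C, H, W))   # same weights, new shape
        conv.bias.copy_(fc.bias)

    x = torch.randn(1, C, H, W)
    print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4))  # True

    # On a larger input the conv version produces a spatial grid of outputs
    # instead of raising a shape error like the original FC layer would.
    big = torch.randn(1, C, 12, 12)
    print(conv(big).shape)                      # torch.Size([1, 4096, 7, 7])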
Tmux notes:
Steps:
Key points:
Workflow:
More data:
Data set breakdown:
Progress plateaus above human-level performance:
What to do when bias is high:
How do you define human-level performance? Example: the error of a panel of experts.
Size of data: