15 Matching Annotations
  1. Jul 2016
    1. With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of say 10% of the train set. The stacker model then trains on this holdout set only.

      What? Re-read this.
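The quoted blending scheme can be sketched with toy data and toy models (everything below, the data, the two "models", and the averaging stacker, is assumed for illustration, not the article's code): the base models fit on 90% of the train set, and the stacker learns only from their predictions on the 10% holdout.

```python
# Minimal blending sketch: base models train on 90%, the stacker model
# trains ONLY on the 10% holdout, using base-model predictions as features.
import random

random.seed(0)
# toy data: y = 2*x + noise (assumed for illustration)
X = [[random.random()] for _ in range(100)]
y = [2 * x[0] + random.gauss(0, 0.1) for x in X]

split = int(0.9 * len(X))
X_base, y_base = X[:split], y[:split]   # the 90% the base models see
X_hold, y_hold = X[split:], y[split:]   # the 10% holdout for the stacker

def fit_slope(X, y):
    """Toy base model: least-squares slope through the origin."""
    num = sum(x[0] * t for x, t in zip(X, y))
    den = sum(x[0] ** 2 for x in X)
    slope = num / den
    return lambda x: slope * x[0]

model_a = fit_slope(X_base, y_base)
model_b = lambda x: sum(y_base) / len(y_base)   # second toy model: global mean

# Holdout predictions become the stacker's training features.
meta_features = [[model_a(x), model_b(x)] for x in X_hold]
# A real stacker would be another learner; here it just averages the two.
stack = lambda feats: sum(feats) / len(feats)
preds = [stack(f) for f in meta_features]
```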

  2. May 2016
    1. most solutions did not provide a way to rerun pipelines with different inputs, mechanisms to explicitly capture outputs and/or side effects, visualization of outputs, and conditional steps for tasks like parameter sweeps.

      Not clear; read again.

    1. In fact, we'll allow each neuron in this layer to learn from all 20×5×5 input neurons in its local receptive field.

      Better language could be used here.

    2. We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information.

      Didn't get this; can anyone explain?
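One way to see it: max-pooling reports only whether the feature fired somewhere in each region, not where. A toy 2×2 max-pool over a hypothetical feature map makes the discarded positional information explicit:

```python
# Toy 2x2 max-pooling: each output answers "was the feature found anywhere
# in this 2x2 region?" and throws away where exactly inside it.
feature_map = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]

def max_pool_2x2(fm):
    out = []
    for i in range(0, len(fm), 2):
        row = []
        for j in range(0, len(fm[0]), 2):
            row.append(max(fm[i][j], fm[i][j + 1],
                           fm[i + 1][j], fm[i + 1][j + 1]))
        out.append(row)
    return out

# The 1 and the 5 survive, but their exact positions within each
# 2x2 region are gone.
pooled = max_pool_2x2(feature_map)   # [[1, 0], [0, 5]]
```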

    3. This means that all the neurons in the first hidden layer detect exactly the same feature* *I haven't precisely defined the notion of a feature. Informally, think of the feature detected by a hidden neuron as the kind of input pattern that will cause the neuron to activate: it might be an edge in the image, for instance, or maybe some other type of shape. , just at different locations in the input image.

      Because of the same (shared) weights and biases!
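The note's answer is right. A 1-D toy version (hypothetical image and kernel) shows that, because every position reuses the same weights and bias, the same input pattern produces the same activation wherever it appears:

```python
# Shared weights: every hidden neuron uses the SAME kernel and bias, so a
# given pattern triggers the same response at any location in the input.
image = [0, 0, 1, 2, 0, 0, 1, 2, 0]   # the pattern [1, 2] appears twice
kernel = [1.0, 2.0]                    # one shared weight vector
bias = 0.0

def conv1d(img, k, b):
    width = len(k)
    return [sum(k[j] * img[i + j] for j in range(width)) + b
            for i in range(len(img) - width + 1)]

responses = conv1d(image, kernel, bias)
# The neurons at positions 2 and 6 see the same patch [1, 2] and therefore
# produce identical activations: 1*1 + 2*2 = 5.
```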

    1. A related heuristic explanation for dropout is given in one of the earliest papers to use the technique* *ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." In other words, if we think of our network as a model which is making predictions, then we can think of dropout as a way of making sure that the model is robust to the loss of any individual piece of evidence. In this, it's somewhat similar to L1 and L2 regularization, which tend to reduce weights, and thus make the network more robust to losing any individual connection in the network.

      Didn't get it; go over it again.
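A minimal inverted-dropout sketch (drop probability p = 0.5 assumed) may help: each training pass randomly silences neurons, so no downstream neuron can rely on any particular input always being present.

```python
# Inverted dropout: during training, zero each activation with probability p
# and scale survivors by 1/(1-p) so the expected value is unchanged.
import random

random.seed(42)
p = 0.5  # drop probability (assumed)

def dropout(activations, p, training=True):
    if not training:
        return list(activations)      # test time: the full network is used
    out = []
    for a in activations:
        if random.random() < p:
            out.append(0.0)           # this neuron is "dropped" this pass
        else:
            out.append(a / (1 - p))   # scale up to preserve the expectation
    return out

h = [0.2, 0.9, 0.5, 0.7]
train_pass = dropout(h, p)                 # some entries zeroed, rest doubled
test_pass = dropout(h, p, training=False)  # unchanged
```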

    2. The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero.

      Didn't get it. Why?
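The "why": the L1 update subtracts a constant amount η·λ·sgn(w) from every weight, while L2 shrinks each weight by a fraction of its size. A constant step wipes out small weights entirely but barely dents large ones, so the weight concentrates in a few large connections. Toy numbers (η = 0.1, λ = 1 assumed):

```python
# L1 vs L2 shrinkage on one weight per regularization step.
def l1_step(w, eta=0.1, lam=1.0):
    step = eta * lam                # constant shrink, independent of |w|
    if abs(w) <= step:
        return 0.0                  # small weights are driven exactly to zero
    return w - step if w > 0 else w + step

def l2_step(w, eta=0.1, lam=1.0):
    return w * (1 - eta * lam)      # proportional shrink: never reaches zero

small, large = 0.05, 5.0
# L1: the small weight dies, the large one loses only 2% of its size.
a, b = l1_step(small), l1_step(large)   # 0.0 and ~4.9
# L2: both shrink by the same 10% fraction, so neither hits zero.
c, d = l2_step(small), l2_step(large)   # ~0.045 and ~4.5
```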

    3. Now, there are ten points in the graph above, which means we can find a unique 9th-order polynomial y = a_0 x^9 + a_1 x^8 + … + a_9 which fits the data exactly.

      Why do 10 points imply a 9th-order polynomial?!
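It's 10 coefficients, not 9: a 9th-order polynomial has coefficients a_0 through a_9, so 10 points give 10 linear equations in 10 unknowns, which generically has exactly one solution. The same logic at a smaller scale: 3 points determine a unique quadratic (toy points assumed).

```python
# 3 points pin down the 3 coefficients of a quadratic, exactly as 10 points
# pin down the 10 coefficients of a 9th-order polynomial.
from fractions import Fraction

points = [(0, 1), (1, 3), (2, 7)]   # these happen to lie on y = x^2 + x + 1

# Build the 3x3 Vandermonde system: a2*x^2 + a1*x + a0 = y for each point.
A = [[Fraction(x) ** 2, Fraction(x), Fraction(1)] for x, _ in points]
b = [Fraction(y) for _, y in points]

def solve(A, b):
    """Exact Gauss-Jordan elimination over the rationals."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                M[r] = [v - M[r][col] * w for v, w in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

a2, a1, a0 = solve(A, b)   # the unique quadratic through the 3 points
```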

  3. Apr 2016
    1. To continue the recurrence and to chain the gradient, the add gate takes that gradient and multiplies it to all of the local gradients for its inputs (making the gradient on both x and y 1 * -4 = -4). Notice that this has the desired effect: If x,y were to decrease (responding to their negative gradient) then the add gate's output would decrease, which in turn makes the multiply gate's output increase.

      not clear
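This reads like the cs231n circuit f = (x + y) * z with x = -2, y = 5, z = -4 (the values are assumed from the -4 in the quote). Stepping through it makes the "1 * -4 = -4" concrete:

```python
# f = (x + y) * z, backpropagated gate by gate.
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y          # add gate output: 3
f = q * z          # multiply gate output: -12

# backward pass (chain rule)
df_dq = z          # multiply gate: d(q*z)/dq = z = -4
df_dz = q          # multiply gate: d(q*z)/dz = q = 3
dq_dx = 1.0        # the add gate's local gradient on each input is 1
dq_dy = 1.0
df_dx = df_dq * dq_dx   # 1 * -4 = -4
df_dy = df_dq * dq_dy   # 1 * -4 = -4
# The sign makes sense: decreasing x or y decreases q, and since z is
# negative, a smaller q makes f = q*z LARGER.
```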

    1. You may have noticed that evaluating the numerical gradient has complexity linear in the number of parameters. In our example we had 30730 parameters in total and therefore had to perform 30,731 evaluations of the loss function to evaluate the gradient and to perform only a single parameter update

      Didn't understand, can someone help?!!
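The count works out to n + 1: one baseline loss evaluation plus one perturbed evaluation per parameter, so 30,730 parameters cost 30,731 loss calls for a single gradient (and hence a single update). A toy version (quadratic loss and 100 parameters assumed) shows the linear scaling:

```python
# Numerical gradient: evaluate the loss once at w, then once per parameter
# at w + h*e_i, so n parameters need n + 1 loss evaluations total.
calls = 0

def loss(w):
    """Toy quadratic loss; the counter tracks how often it's evaluated."""
    global calls
    calls += 1
    return sum(v * v for v in w)

def numerical_gradient(w, h=1e-5):
    base = loss(w)                   # 1 evaluation
    grad = []
    for i in range(len(w)):          # n more evaluations, one per parameter
        w_step = list(w)
        w_step[i] += h
        grad.append((loss(w_step) - base) / h)
    return grad

w = [0.5] * 100                      # pretend we have 100 parameters
g = numerical_gradient(w)
# calls is now 101 = n + 1: linear in the number of parameters, which is
# why numerical gradients are hopeless for networks with ~30k parameters.
```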