7 Matching Annotations
  1. Feb 2017
    1. start with a very low learning rate to avoid jumping to an undesirable minimum, and then increase once we're no longer at risk of getting stuck there
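A minimal sketch of the "start low, then increase" idea in the quote, assuming a linear ramp. The function name `warmup_lr` and all default values are hypothetical, chosen only for illustration.

```python
def warmup_lr(step, base_lr=0.1, start_lr=0.001, warmup_steps=100):
    """Linearly ramp the learning rate from start_lr to base_lr, then hold it.

    All defaults are illustrative assumptions, not values from the quoted source.
    """
    if step >= warmup_steps:
        return base_lr
    frac = step / warmup_steps          # fraction of the warmup completed
    return start_lr + frac * (base_lr - start_lr)


# Learning rate at the start, midway through, and after warmup.
for step in (0, 50, 100, 500):
    print(step, warmup_lr(step))
```

The schedule only controls the step size; whether it actually avoids a bad minimum depends on the model and data.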


    1. many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule.

      So does annealing help with Adam and other adaptive learning-rate methods?
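For reference, one common form of a "simple learning rate annealing schedule" is step decay. This is a sketch under assumed values: `step_decay_lr`, the halving factor, and the 10-epoch interval are hypothetical, and the quote does not say which schedule the papers use.

```python
def step_decay_lr(epoch, base_lr=0.1, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs.

    Defaults are illustrative assumptions only.
    """
    num_drops = epoch // epochs_per_drop   # how many times the rate has been cut
    return base_lr * drop ** num_drops


for epoch in (0, 9, 10, 25):
    print(epoch, step_decay_lr(epoch))
```

The returned value would be passed to plain SGD as its step size each epoch. Whether the same schedule helps adaptive methods is the open question in the note above; the sketch does not answer it.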

    2. performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update.

      Could not follow the redundancy claim here.
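One way to read the redundancy claim (my interpretation, not spelled out in the quote): if a large dataset contains many near-identical examples, the full-batch gradient costs far more to compute but points in the same direction as the gradient on a small representative subset. A toy sketch, assuming a one-parameter linear model and exact duplicates as a stand-in for "similar examples":

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error for the model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


xs_unique = [1.0, 2.0, 3.0]
ys_unique = [2.0, 4.0, 6.0]

# The same three examples repeated 1000 times: a toy "large dataset".
xs_big = xs_unique * 1000
ys_big = ys_unique * 1000

w = 0.5
grad_small = mse_gradient(w, xs_unique, ys_unique)   # 3 gradient terms
grad_big = mse_gradient(w, xs_big, ys_big)           # 3000 gradient terms

print(grad_small, grad_big)
```

Both gradients come out the same, so batch gradient descent does 1000 times the work for an identical update. Updating after each example or mini-batch avoids paying for the repeated terms.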

    3. Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.


  2. Jan 2017
    1. Due to parameter sharing, you can easily run a pretrained network on images of different spatial size. This is clearly evident in the case of Conv/Pool layers because their forward function is independent of the input volume spatial size (as long as the strides “fit”). In case of FC layers, this still holds true because FC layers can be converted to a Convolutional Layer: For example, in an AlexNet, the final pooling volume before the first FC layer is of size [6x6x512]. Therefore, the FC layer looking at this volume is equivalent to having a Convolutional Layer that has receptive field size 6x6, and is applied with padding of 0.

      Not sure how FC layers can be made size-agnostic.
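A small sketch of the conversion described in the quote, assuming a single channel and a tiny 2x2 "final volume" instead of 6x6x512 to keep it readable. The weights and inputs are made up. The FC unit's weights are reused as a 2x2 filter with padding 0: on an input of the original size the result matches the FC output, and on a larger input the same weights produce a spatial map of outputs instead of failing.

```python
def fc_forward(weights, bias, patch):
    """Fully connected unit: dot product of the flattened patch with the weights."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def conv_forward(weights, bias, image, k):
    """Apply the same weights as a k x k filter (stride 1, padding 0)."""
    height, width = len(image), len(image[0])
    out = []
    for i in range(height - k + 1):
        row = []
        for j in range(width - k + 1):
            patch = [r[j:j + k] for r in image[i:i + k]]
            row.append(fc_forward(weights, bias, patch))
        out.append(row)
    return out


weights, bias = [1.0, 0.0, 0.0, -1.0], 0.5   # hypothetical FC weights for a 2x2 input
small = [[1.0, 2.0],
         [3.0, 4.0]]
large = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]

same_size = conv_forward(weights, bias, small, k=2)   # 1x1 map, equals the FC output
bigger = conv_forward(weights, bias, large, k=2)      # 2x2 map, one FC output per position
print(same_size, bigger)
```

So the layer is not size-agnostic in the sense of producing one output for any input; it becomes a filter that is evaluated at every position where it fits.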

    1. Tmux notes:

      • Terminal Multiplexer
      • Multiple panes
      • Attach and detach
      • Share sessions


      • Install tmux on the server
      • Start tmux: tmux
      • Run tmux commands: Ctrl+b, then <command>
        • c => new window
        • , => rename window
        • n, p => next, previous window
        • w => list windows
        • % => split vertically (panes side by side)
        • " => split horizontally (panes stacked)
        • o => switch between panes
        • space => rearrange panes (gives different layouts)
  3. Dec 2016
    1. Key points:

      1. Large NNs benefit the most from more data
      2. Having a combination of HPC and AI skills is important to have optimal impact (handle scale challenges and bigger/complex NN)
      3. Most of the value right now comes from CNNs, FCs, and RNNs. Unsupervised learning, GANs, and others might be the future, but they are research topics right now.
      4. E2E DL might be relevant for some cases in the future, like speech -> transcript, image -> caption, text -> image
      5. Self-driving cars might also move to E2E, but no one has enough image -> steering data yet


      1. Bias = Training error - Human error. Try: bigger model, train longer, new model architecture.
      2. Variance = Dev error - Training error. Try: more data, regularization, new model architecture.
      3. The conflict between bias and variance is weaker in DL: we can use a bigger model and more data at the same time.
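The two definitions above can be turned into a quick check. This is only a sketch: the function name `diagnose`, the rule of attacking the larger gap first, and the example error rates are my assumptions, not part of the talk notes.

```python
def diagnose(human_err, train_err, dev_err):
    """Compute bias and variance as defined in the notes and pick the larger gap."""
    bias = train_err - human_err        # Bias = Training error - Human error
    variance = dev_err - train_err      # Variance = Dev error - Training error
    if bias >= variance:                # tie-breaking toward bias is an arbitrary choice
        focus = "bias"
        remedies = ["bigger model", "train longer", "new model architecture"]
    else:
        focus = "variance"
        remedies = ["more data", "regularization", "new model architecture"]
    return {"bias": bias, "variance": variance, "focus": focus, "remedies": remedies}


# Hypothetical error rates: 1% human, 5% training, 6% dev.
result = diagnose(human_err=0.01, train_err=0.05, dev_err=0.06)
print(result["focus"], result["remedies"])
```

With these made-up numbers the bias gap (about 4 points) dominates the variance gap (about 1 point), so the bias remedies come first.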

      More data:

      1. Data synthesis/augmentation is becoming useful and popular: OCR (superimpose letters on various images), speech (superimpose various background noises), NLP (?). The drawback is that synthetic data hurts if it is not representative of real data.
      2. A unified data warehouse helps leverage data across the company

      Data set breakdown:

      1. Dev and test sets should come from the same distribution, since we spend a lot of time optimizing for dev accuracy.

      Progress plateaus above human-level performance:

      • But there is a theoretical optimal error rate (Bayes rate)

      What to do when bias is high:

      • Look at the examples the machine got wrong
      • Get labels from humans?
      • Error analysis: segment the training set and identify segments where training error is higher than human error.
      • Estimate bias/variance effect?

      How do you define human-level performance? Example: the error of a panel of experts

      Size of data:

      1. How do you define a NN as small vs medium vs large?
      2. Can a large NN leverage more data because the extra data keeps it from overfitting, whereas a smaller NN cannot make use of it?