31 Matching Annotations

Feb 2021
jobs.intuit.com

Software Developer Intern, TurboTax.ca – Full Stack (Canada - Mississauga, Summer 2021)

1
1. QuantumWasp 21 Feb 2021
  
  in Public
  
  the
  
  this is a note
Visit annotations in context

Annotators

QuantumWasp

URL

jobs.intuit.com/job/mississauga/software-developer-intern-turbotax-ca-full-stack-canada-mississauga-summer-2021/27595/18509155
Jan 2021
dzone.com

REST API: Path vs. Request Body Parameters - DZone Integration

1
1. QuantumWasp 14 Jan 2021
  
  in Public
  
  Path parameters:
  
  Identify a resource uniquely
  
  Request body:
  
  Sends and receives data through the REST API; most often used to upload data.
  
  To be RESTful, the request has to follow REST principles, so a POST/PUT should send the whole resource in the body.
  
  Query:
  
  Mainly used for filtering. If a request would return a large number of resources, use query parameters to narrow the set down to the ones you want. Example: you send a request to api/pictures/category/cat and get back every picture of a cat, which could be millions. Adding more specific parameters to the query clarifies the request: api/pictures/category/cat?color=black&breed=korat now returns only the pictures of cats that are black and of the Korat breed.
  
  summary API
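
The three kinds of parameters in the note above can be sketched with the standard library alone. This is only an illustration: the host `example.com`, the picture id `42`, and the JSON fields are assumptions made up for the example, and nothing is actually sent over the network.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://example.com/api"  # hypothetical host, for illustration only

# Path parameter: identifies one resource (picture 42).
picture_url = f"{BASE}/pictures/42"

# Query parameters: filter a collection of resources.
filters = {"color": "black", "breed": "korat"}
collection_url = f"{BASE}/pictures/category/cat?{urlencode(filters)}"

# Request body: the full resource representation sent with POST/PUT.
body = json.dumps({"id": 42, "category": "cat", "color": "black", "breed": "korat"})

parsed = urlsplit(collection_url)
print(parsed.path)             # path portion of the URL
print(parse_qs(parsed.query))  # decoded filter parameters
```

The path names the thing, the query narrows a collection, and the body carries the data itself.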
Visit annotations in context

Tags

summary

API

Annotators

QuantumWasp

URL

dzone.com/articles/rest-api-path-vs-request-body-parameters
realpython.com

Async IO in Python: A Complete Walkthrough – Real Python

1
1. QuantumWasp 14 Jan 2021
  
  in Public
  
  Threading is a concurrent execution model whereby multiple threads take turns executing tasks. One process can contain multiple threads.
  
  Note: threads take turns executing tasks. In CPython they are not actually running in parallel, just switching between each other very fast within a single CPU core.
Visit annotations in context

Annotators

QuantumWasp

URL

realpython.com/async-io-python/
realpython.com

An Intro to Threading in Python – Real Python

1
1. QuantumWasp 14 Jan 2021
  
  in Public
  
  Tasks that spend much of their time waiting for external events are generally good candidates for threading. Problems that require heavy CPU computation and spend little time waiting for external events might not run faster at all.
  
  In Python use threads for I/O, not for heavy CPU computations
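
A small sketch of why waiting-heavy work suits threads. `fake_io` is a hypothetical stand-in for a network or disk call (it only sleeps), and the delays are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    """Stand-in for a network or disk wait (hypothetical workload)."""
    time.sleep(delay)  # the thread is idle here, so other threads can run
    return delay

def run_sequential(delays):
    start = time.perf_counter()
    results = [fake_io(d) for d in delays]
    return results, time.perf_counter() - start

def run_threaded(delays):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        results = list(pool.map(fake_io, delays))
    return results, time.perf_counter() - start

delays = [0.2] * 4
seq_results, seq_time = run_sequential(delays)
thr_results, thr_time = run_threaded(delays)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

The waits overlap in the threaded version, so it should finish in roughly the time of one wait. Replacing the sleep with a pure-Python computation would not be expected to show the same gain.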
Visit annotations in context

Annotators

QuantumWasp

URL

realpython.com/intro-to-python-threading/
blog.minitab.com

Three Common P-Value Mistakes You'll Never Have to Make

1
1. QuantumWasp 06 Jan 2021
 
 in Public
 
 Instead, it tells you the odds of seeing it.
 
 The p-value is the probability of seeing a result at least as extreme as the one observed, assuming the null hypothesis is true.
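
As a worked illustration of that definition, here is a minimal sketch that computes an exact one-sided p-value for a coin-flip experiment. The scenario (20 flips, 15 heads, null hypothesis of a fair coin) is an assumption chosen for the example, not something taken from the article.

```python
from math import comb

def one_sided_p_value(n_flips, n_heads, p_null=0.5):
    """P(X >= n_heads) when X ~ Binomial(n_flips, p_null)."""
    return sum(
        comb(n_flips, k) * p_null**k * (1 - p_null) ** (n_flips - k)
        for k in range(n_heads, n_flips + 1)
    )

p = one_sided_p_value(n_flips=20, n_heads=15)
print(f"p-value = {p:.4f}")
```

The number answers "how often would a fair coin produce 15 or more heads?", not "how likely is it that the coin is fair?".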
Visit annotations in context

Annotators

QuantumWasp

URL

blog.minitab.com/blog/understanding-statistics/three-common-p-value-mistakes-youll-never-have-to-make
www.investopedia.com

P-test Definition

3
1. QuantumWasp 06 Jan 2021
 
 in Public
 
 A common and simplistic type of statistical testing is a z-test, which tests the statistical significance of a sample mean to the hypothesized population mean but requires that the standard deviation of the population be known, which is often not possible. The t-test is a more realistic type of test in that it requires only the standard deviation of the sample as opposed to the population's standard deviation.
 
 To perform the actual test, we do a z-test or a t-test. A z-test requires the population standard deviation, which is often not known. A t-test only requires the standard deviation of the sample, which is more realistic.
 
 important
2. QuantumWasp 06 Jan 2021
 
 in Public
 
 The smaller the p-value, the stronger the evidence that the null hypothesis should be rejected and that the alternate hypothesis might be more credible.
 
 The p-value is defined as the probability of observing an effect at least this large if the null hypothesis (i.e. the commonly accepted claim about a population) is true.
3. QuantumWasp 06 Jan 2021
 
 in Public
 
 if the P-test fails to reject the null hypothesis then the test is deemed to be inconclusive and is in no way meant to be an affirmation of the null hypothesis.
 
 Why?
 
 important
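
The difference between the z-test and the t-test from the first note in this group comes down to which standard deviation goes into the standard error. A minimal sketch, assuming made-up measurements, a hypothesized mean of 5.0, and (for the z-test only) a population standard deviation that is pretended to be known:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

sample = [5.1, 4.9, 5.6, 5.8, 5.2]  # made-up measurements
mu0 = 5.0          # hypothesized population mean
sigma_known = 0.4  # assumed known population SD (needed only for the z-test)

n = len(sample)
x_bar = mean(sample)

# z-test: the standard error uses the population SD.
z = (x_bar - mu0) / (sigma_known / sqrt(n))
p_z = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

# t-test: the standard error uses the sample SD (n - 1 denominator).
t = (x_bar - mu0) / (stdev(sample) / sqrt(n))

print(f"z = {z:.3f}, two-sided p = {p_z:.3f}, t = {t:.3f} (df = {n - 1})")
```

The t statistic would be compared against a t distribution with n - 1 degrees of freedom; that lookup is left out here because the standard library has no t distribution.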
Visit annotations in context

Tags

important

Annotators

QuantumWasp

URL

investopedia.com/terms/p/p-test.asp
Dec 2020
medium.com

Covariance, Correlation, R Squared

1
1. QuantumWasp 17 Dec 2020
  
  in Public
  
  This explains how much X varies from its mean when Y varies from its own mean.
  
  Covariance: How much does X vary from its mean when Y varies from its mean.
  
  Notice that the deviations of x_i and y_i from their means can be +ve or -ve, so their product is +ve when both deviate in the same direction (both above or both below their means) and -ve when they deviate in opposite directions. We then sum all of the products. A +ve covariance indicates that, on average, the values have a +ve linear relationship; a -ve covariance indicates a -ve linear relationship; and 0 indicates that the +ve and -ve products cancelled out, so there is no overall linear relationship.
  
  important
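
The sign argument above can be checked numerically. This sketch uses the sample covariance (sum of deviation products divided by n - 1); the data points are made up for the illustration.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    products = [(x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

xs = [1, 2, 3, 4]
cov_up = sample_covariance(xs, [2, 4, 6, 8])    # both rise together: positive
cov_down = sample_covariance(xs, [8, 6, 4, 2])  # one rises, one falls: negative
cov_none = sample_covariance(xs, [5, 1, 1, 5])  # products cancel: zero
print(cov_up, cov_down, cov_none)
```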
Visit annotations in context

Tags

important

Annotators

QuantumWasp

URL

medium.com/swlh/covariance-correlation-r-sqaured-5cbefc5cbe1c
www.cs.bham.ac.uk

PII: S0893-6080(99)00073-8

2
1. QuantumWasp 16 Dec 2020
  
  in Public
  
  Ensemble diversity regularizer
2. QuantumWasp 16 Dec 2020
  
  in Public
  
  $$F(n) = \frac{1}{M}\sum_{i=1}^{M} F_i(n)$$
  
  average prediction
Visit annotations in context

Tags

regularizer

diversity

Ensemble

Annotators

QuantumWasp

URL

cs.bham.ac.uk/~pxt/NC/ncl.pdf
neuralnetworksanddeeplearning.com

Neural Networks and Deep Learning

1
1. QuantumWasp 15 Dec 2020
  
  in Public
  
  So saying "learning is slow" is really the same as saying that those partial derivatives are small.
  
  This holds when comparing two examples at the same learning rate, since the learning rate also scales how much the weights change.
  
  $$w \leftarrow w - \text{lr} \cdot \text{grad}$$
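
To make the link between small partial derivatives and slow learning concrete, here is a sketch for a single sigmoid neuron with quadratic cost. The two starting points, the training example, and the learning rate are illustrative assumptions.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def weight_gradient(w, b, x, y):
    """dC/dw for a single sigmoid neuron with quadratic cost C = (a - y)^2 / 2."""
    a = sigmoid(w * x + b)
    return (a - y) * a * (1.0 - a) * x  # sigma'(z) = a * (1 - a)

lr = 0.15        # same learning rate for both starting points
x, y = 1.0, 0.0  # single training example

for w, b in [(0.6, 0.9), (2.0, 2.0)]:
    grad = weight_gradient(w, b, x, y)
    w_new = w - lr * grad  # the update rule from the note
    print(f"w={w}, b={b}: grad={grad:.4f}, step={w - w_new:.4f}")
```

The second starting point is more wrong (its output is closer to 1 while the target is 0), yet its gradient is smaller because the sigmoid is saturated, so with the same learning rate it takes the smaller step.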
Visit annotations in context

Annotators

QuantumWasp

URL

neuralnetworksanddeeplearning.com/chap3.html
towardsdatascience.com

Monte Carlo Dropout

1
1. QuantumWasp 15 Dec 2020
  
  in Public
  
  Monte Carlo Dropout boils down to training a neural network with the regular dropout and keeping it switched on at inference time. This way, we can generate multiple different predictions for each instance.
  
  .
  
  #summary
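
A toy sketch of the idea in pure Python, assuming a single linear layer with hypothetical fixed weights in place of a trained network. Dropout is left on at prediction time, and the spread of repeated predictions serves as a rough uncertainty signal.

```python
import random
from statistics import mean, stdev

WEIGHTS = [0.5, -1.0, 2.0, 1.5]  # hypothetical "trained" weights
DROP_P = 0.5                     # dropout probability

def predict_with_dropout(x, rng):
    """One stochastic forward pass: dropout stays ON, as in MC Dropout."""
    total = 0.0
    for w, xi in zip(WEIGHTS, x):
        keep = rng.random() >= DROP_P
        if keep:
            total += w * xi / (1.0 - DROP_P)  # inverted-dropout rescaling
    return total

def mc_dropout_predict(x, n_samples=2000, seed=0):
    """Repeat the stochastic pass and summarize the predictions."""
    rng = random.Random(seed)
    samples = [predict_with_dropout(x, rng) for _ in range(n_samples)]
    return mean(samples), stdev(samples)

pred_mean, pred_std = mc_dropout_predict([1.0, 2.0, 3.0, 4.0])
print(f"mean={pred_mean:.2f}, spread={pred_std:.2f}")
```

In a real framework the same effect comes from keeping the dropout layers in training mode during inference; the exact way to do that depends on the library.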
Visit annotations in context

Tags

#summary

Annotators

QuantumWasp

URL

towardsdatascience.com/monte-carlo-dropout-7fd52f8b6571
towardsdatascience.com

Understanding RNN and LSTM

2
1. QuantumWasp 12 Dec 2020
  
  in Public
  
  Long Short-Term Memory (LSTM) networks are a modified version of recurrent neural networks, which makes it easier to remember past data in memory. The vanishing gradient problem of RNN is resolved here.
  
  LSTMs address the vanishing gradient problem of RNNs and can be seen as an evolution of the RNN (i.e. better).
  
  Well suited on time series data.
2. QuantumWasp 12 Dec 2020
  
  in Public
  
  RNN is recurrent in nature as it performs the same function for every input of data while the output of the current input depends on the past one computation.
  
  Performs the same operation on every input, except that it also takes the previous output (called the context) as another input.
  
  Description:
  
  First, it takes X(0) from the input sequence and outputs h(0). Then h(0), together with X(1), is the input for the next step. Similarly, h(1) from that step, together with X(2), is the input for the step after, and so on.
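
The recurrence described in the second note can be sketched with a scalar RNN cell. The weights `w_x`, `w_h`, the bias, and the tanh activation are assumptions chosen for illustration; a real RNN uses weight matrices and vectors.

```python
from math import tanh

def rnn_forward(xs, w_x=0.5, w_h=0.8, b=0.0, h_init=0.0):
    """Apply the same cell to every input; each step also sees the previous output."""
    h = h_init
    outputs = []
    for x in xs:
        h = tanh(w_x * x + w_h * h + b)  # h(t) depends on X(t) and h(t-1)
        outputs.append(h)
    return outputs

hs = rnn_forward([1.0, 0.0, 0.0])
print(hs)
```

Only the first input is non-zero, yet the later outputs are non-zero too, because each step receives the previous output as context.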
Visit annotations in context

Annotators

QuantumWasp

URL

towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e
pytorch.org

PyTorch

1
1. QuantumWasp 10 Dec 2020
  
  in Public
  
  SWA uses a modified learning rate schedule so that SGD continues to explore the set of high-performing networks instead of simply converging to a single solution. For example, we can use the standard decaying learning rate strategy for the first 75% of training time, and then set the learning rate to a reasonably high constant value for the remaining 25% of the time (see the Figure 2 below). The second ingredient is to average the weights of the networks traversed by SGD. For example, we can maintain a running average of the weights obtained in the end of every epoch within the last 25% of training time (
  
  We train long enough to reach a good region of the loss surface.
  
  We then use a high (but not too high) learning rate so we can explore the surroundings and stumble upon nearby high-performing solutions. We periodically save the weights (every x epochs).
  
  We average the saved weights. As a result, the averaged weights end up centered in the low-loss region. See the left picture below.
  
  summary
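
The averaging step can be sketched as an equal-weight running average over saved weight vectors. The checkpoints below are hypothetical two-parameter "networks", and the helper name is made up for the example; this is not the PyTorch implementation.

```python
def update_running_average(avg, new_weights, n_averaged):
    """Fold one more weight vector into an equal-weight running average."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new_weights)]

# Hypothetical weights saved at the end of three late-training epochs.
checkpoints = [[1.0, 4.0], [2.0, 2.0], [3.0, 3.0]]

swa_weights = list(checkpoints[0])
for n, weights in enumerate(checkpoints[1:], start=1):
    swa_weights = update_running_average(swa_weights, weights, n)

print(swa_weights)
```

The running form gives the same result as averaging all checkpoints at the end, but only one extra copy of the weights has to be kept in memory.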
Visit annotations in context

Tags

summary

Annotators

QuantumWasp

URL

pytorch.org/blog/stochastic-weight-averaging-in-pytorch/
Nov 2020
machinelearningmastery.com

How to Control the Stability of Training Neural Networks With the Batch Size

1
1. QuantumWasp 29 Nov 2020
  
  in Public
  
  Smaller batch sizes are used for two main reasons: Smaller batch sizes are noisy, offering a regularizing effect and lower generalization error. Smaller batch sizes make it easier to fit one batch worth of training data in memory (i.e. when using a GPU). A third reason is that the batch size is often set at something small, such as 32 examples, and is not tuned by the practitioner. Small batch sizes such as 32 do work well generally. … [batch size] is typically chosen between 1 and a few hundreds, e.g. [batch size] = 32 is a good default value — Practical recommendations for gradient-based training of deep architectures, 2012.
  
  Training with a small batch size has a regularizing effect and, like most regularizers, can lead to very good generalization. Generalization error is often best with a batch size of 1:
  
  Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set. (Deep Learning Book, p276)
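
The "high variance in the estimate of the gradient" can be seen in a small simulation. This is a sketch under stated assumptions: a synthetic one-dimensional dataset, the loss (w - x)^2 / 2 evaluated at w = 0, and batch sizes picked for illustration.

```python
import random
from statistics import mean, stdev

def gradient_estimate_spread(batch_size, n_batches=500, seed=0):
    """Spread of mini-batch gradient estimates for the loss (w - x)^2 / 2 at w = 0."""
    rng = random.Random(seed)
    data = [rng.gauss(1.0, 2.0) for _ in range(10_000)]  # synthetic dataset
    w = 0.0
    estimates = []
    for _ in range(n_batches):
        batch = rng.sample(data, batch_size)
        # The per-example gradient is (w - x); the batch estimate is their mean.
        estimates.append(mean([w - x for x in batch]))
    return stdev(estimates)

for size in (1, 32, 256):
    print(size, round(gradient_estimate_spread(size), 3))
```

Smaller batches give noisier gradient estimates, which is the noise credited with the regularizing effect and also the reason a smaller learning rate may be needed.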
Visit annotations in context

Annotators

QuantumWasp

URL

machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size/
arxiv.org

2002.06715.pdf

1
1. QuantumWasp 22 Nov 2020
 
 in Public
 
 The random noise from sampling mini-batches of data in SGD-like algorithms and random initialization of the deep neural networks, combined with the fact that there is a wide variety of local minima solutions in high dimensional optimization problem (Ge et al., 2015; Kawaguchi, 2016; Wen et al., 2019), results in the following observation: deep neural networks trained with different random seeds can converge to very different local minima although they share similar error rates.
 
 Random initialization
 
 Noise from sampling mini-batches
 
 cause neural networks with the same architecture to converge to different local minima, but with very similar error rates.
Visit annotations in context

Annotators

QuantumWasp

URL

arxiv.org/pdf/2002.06715v2.pdf
towardsdatascience.com

PyTorch Autograd – Towards Data Science

1
1. QuantumWasp 21 Nov 2020
  
  in Public
  
  On calling backward(), gradients are populated only for the nodes which have both requires_grad and is_leaf True. Gradients are of the output node from which .backward() is called, w.r.t other leaf nodes.
  
  All layers declared inside of a neural network's __init__ method or as part of nn.Sequential automatically have their parameters set up with requires_grad=True and is_leaf=True. Autograd will therefore automatically store their gradients during a backward() call
Visit annotations in context

Annotators

QuantumWasp

URL

towardsdatascience.com/pytorch-autograd-understanding-the-heart-of-pytorchs-magic-2686cd94ec95
medium.com

Effect of batch size on training dynamics

1
1. QuantumWasp 20 Nov 2020
  
  in Public
  
  However, it is well known that too large of a batch size will lead to poor generalization (although currently it’s not known why this is so).
  
  the assumption in "Train longer, generalize better" is that it is due to making fewer updates: "we conducted experiments to show empirically that the "generalization gap" stems from the relatively small number of updates rather than the batch size, and can be completely eliminated by adapting" the training regime they use.
Visit annotations in context

Annotators

QuantumWasp

URL

medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e
web.hypothes.is

Annotating the law | Hypothes.is

1
1. QuantumWasp 19 Nov 2020
  
  in Public
  
  or equations (in LaTeX format)
  
  Wrap your equation in two dollar signs on each side: $$Example$$.
Visit annotations in context

Annotators

QuantumWasp

URL

web.hypothes.is/
towardsdatascience.com

10 New Things I Learnt from fast.ai v3 – Towards Data Science

1
1. QuantumWasp 19 Nov 2020
  
  in Public
  
  Loss functions usually have bumpy and flat areas (if you visualise them in 2D or 3D diagrams). Have a look at Fig. 3.2. If you end up in a bumpy area, that solution will tend not to generalise very well. This is because you found a solution that is good in one place, but it’s not very good in other place. But if you found a solution in a flat area, you probably will generalise well. And that’s because you found a solution that is not only good at one spot, but around it as well.
  
  Another key point is that the test distribution and the train distribution are not always identical. If you're in a flat area and the test distribution shifts, you will still be in the flat area of the loss function, just closer to its edges. If you're in a sharp minimum and the distribution shifts, you're kicked out of it and end up somewhere higher up on the loss surface.
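
A one-dimensional sketch of that argument, assuming two made-up quadratic losses (one steep, one shallow) and modelling the distribution shift as the minimum moving by a fixed offset:

```python
def sharp_loss(w):
    return 50.0 * w**2  # narrow, steep minimum at w = 0

def flat_loss(w):
    return 0.5 * w**2   # wide, shallow minimum at w = 0

shift = 0.3  # hypothetical offset between the train and test optimum

# Evaluate the train-time solution w = 0 on a loss whose minimum moved by `shift`.
sharp_penalty = sharp_loss(0.0 - shift) - sharp_loss(0.0)
flat_penalty = flat_loss(0.0 - shift) - flat_loss(0.0)
print(sharp_penalty, flat_penalty)
```

The same shift costs far more loss in the sharp minimum than in the flat one.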
Visit annotations in context

Annotators

QuantumWasp

URL

towardsdatascience.com/10-new-things-i-learnt-from-fast-ai-v3-4d79c1f07e33
medium.com

How we beat the FastAI leaderboard score by +19.77%…a

2
1. QuantumWasp 19 Nov 2020
  
  in Public
  
  As mentioned earlier, I tested a lot of activation functions this year before Mish, and in most cases while things looked awesome in the paper, they would fall down as soon as I put them to use on more realistic datasets like ImageNette/Woof. Many of the papers show results using only MNIST or CIFAR-10, which really has minimal proof of how they will truly fare in my experience.
  
  You should start with CIFAR-10 and MNIST only to get some initial results, but to see if those ideas hold up more broadly, test them on more realistic datasets like ImageWoof, ImageNet.
  
  Tip
2. QuantumWasp 19 Nov 2020
  
  in Public
  
  RAdam achieves this automatically by adding in a rectifier that dynamically tamps down the adaptive learning rate until the variance stabilizes.
  
  RAdam reduces the variance of the adaptive learning rate early in training.
Visit annotations in context

Tags

Tip

Annotators

QuantumWasp

URL

medium.com/@lessw/how-we-beat-the-fastai-leaderboard-score-by-19-77-a-cbb2338fab5c
machinelearningmastery.com

How to Use Weight Decay to Reduce Overfitting of Neural Network in Keras

5
1. QuantumWasp 19 Nov 2020
  
  in Public
  
  It is a good practice to first grid search through some orders of magnitude between 0.0 and 0.1, then once a level is found, to grid search on that level.
  
  Their range is [0.0, 0.1]. They then step through a few orders of magnitude within that range: [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6].
  
  The logic is that once you find the best order of magnitude for your architecture/data distribution combination, you can tune further by grid searching weight decay values within that order of magnitude, which is a much smaller range.
2. QuantumWasp 19 Nov 2020
  
  in Public
  
  Once you can confirm that weight regularization may improve your overfit model, you can test different values of the regularization parameter.
  
  Before tuning a hyperparameter, first test that it adds value to your model. Only then decide on a value for it. If you don't want to use the default values, which you probably shouldn't given that good hyperparameter values depend heavily on model architecture and dataset, use something like grid search or randomized search to find the best value.
3. QuantumWasp 19 Nov 2020
  
  in Public
  
  An overfit model should show accuracy increasing on both train and test and at some point accuracy drops on the test dataset but continues to rise on the training dataset.
  
  can also use loss: If training loss decreases while validation loss increases -> overfitting.
4. QuantumWasp 19 Nov 2020
  
  in Public
  
  A weight regularizer can be added to each layer when the layer is defined in a Keras model.
  
  i.e. you can also set other hyperparameters (weight decay, learning rate, momentum, etc.) per layer instead of for the whole network
5. QuantumWasp 19 Nov 2020
  
  in Public
  
  add a penalty for weight size to the loss function.
  
  L2 regularization can be done either by adding a penalty to the loss function or by shrinking the weights directly through weight decay.
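
For a single plain gradient-descent step, the two routes in the last note give the same updated weight. A minimal numeric check, with arbitrary illustrative values and the penalty written as (lam / 2) * w**2:

```python
from math import isclose

w, grad, lr, lam = 0.8, 0.25, 0.1, 0.01  # arbitrary illustrative values

# Option 1: penalty (lam / 2) * w**2 added to the loss, so its gradient lam * w joins the update.
w_penalty = w - lr * (grad + lam * w)

# Option 2: weight decay applied directly to the weight, then the plain gradient step.
w_decay = w * (1 - lr * lam) - lr * grad

print(w_penalty, w_decay, isclose(w_penalty, w_decay))
```

This equivalence is shown here only for vanilla gradient descent; with adaptive optimizers the two approaches generally do not coincide.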
Visit annotations in context
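
The coarse-then-fine search from the first note in this group can be sketched as two passes over a grid. `validation_loss` is a toy stand-in for "train the model with this weight decay and score it", and its optimum near 3e-3 is an assumption made for the example.

```python
from math import log10

def validation_loss(weight_decay):
    """Toy stand-in for 'train the model and score it'; best value is near 3e-3."""
    return (log10(weight_decay) - log10(3e-3)) ** 2

# Stage 1: coarse search across orders of magnitude.
coarse_grid = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
best_coarse = min(coarse_grid, key=validation_loss)

# Stage 2: fine search within the winning order of magnitude.
fine_grid = [best_coarse * k for k in range(1, 10)]
best_fine = min(fine_grid, key=validation_loss)

print(best_coarse, best_fine)
```

In practice each call to the scoring function is a full training run, which is why the coarse pass comes first: it rules out most of the range with only six runs.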

Annotators

QuantumWasp

URL

machinelearningmastery.com/how-to-reduce-overfitting-in-deep-learning-with-weight-regularization/
pytorch.org

PyTorch

1
1. QuantumWasp 18 Nov 2020
  
  in Public
  
  SWA can be used with any learning rate schedule that encourages exploration of the flat region of solutions. For example, you can use cyclical learning rates in the last 25% of the training time instead of a constant value, and average the weights of the networks corresponding to the lowest values of the learning rate within each cycle (see Figure 3).
  
  This is very similar to what a snapshot ensemble does, except that a snapshot ensemble doesn't average out the weights at the end. Instead, it uses each network it saved as part of an ensemble during inference.
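
The difference between the two approaches is where the averaging happens. A sketch with a toy one-weight model; the model, the snapshot weights, and the input are all hypothetical.

```python
from math import tanh
from statistics import mean

def model(x, w):
    """Toy one-weight nonlinear model (hypothetical)."""
    return tanh(w * x)

snapshots = [0.5, 1.5, 2.5]  # weights saved at the low points of each LR cycle
x = 1.0

# SWA: average the weights first, then run ONE model.
swa_prediction = model(x, mean(snapshots))

# Snapshot ensemble: run EVERY saved model, then average the predictions.
ensemble_prediction = mean([model(x, w) for w in snapshots])

print(swa_prediction, ensemble_prediction)
```

Because the model is nonlinear the two numbers differ, and SWA needs only one forward pass at inference time while the ensemble needs one per snapshot.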
Visit annotations in context

Annotators

QuantumWasp

URL

pytorch.org/blog/stochastic-weight-averaging-in-pytorch/
distill.pub

Why Momentum Really Works

1
1. QuantumWasp 17 Nov 2020
  
  in Public
  
  The problem could be the optimizer’s old nemesis, pathological curvature. Pathological curvature is, simply put, regions of f which aren’t scaled properly. The landscapes are often described as valleys, trenches, canals and ravines. The iterates either jump between valleys, or approach the optimum in small, timid steps. Progress along certain directions grind to a halt. In these unfortunate regions, gradient descent fumbles.
  
  Pathological curvature is a big problem for gradient descent and slows it down significantly: in a ravine, instead of heading straight down along the valley floor, it oscillates from side to side, because the gradient is largest across the ravine, where the surface is steepest.
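
The sideways oscillation can be reproduced on a simple ravine. This sketch assumes the quadratic f(x, y) = 0.5 * (x**2 + 50 * y**2), a starting point, and a learning rate chosen so the steep direction overshoots without diverging.

```python
def gradient(x, y):
    """Gradient of the ravine f(x, y) = 0.5 * (x**2 + 50 * y**2)."""
    return x, 50.0 * y

def descend(x, y, lr=0.035, steps=20):
    """Plain gradient descent; returns every visited point."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path

path = descend(x=10.0, y=1.0)
for px, py in path[:5]:
    print(f"x={px:7.3f}  y={py:7.3f}")
```

The y coordinate flips sign on every step (bouncing between the ravine walls) while x, the direction along the valley floor, shrinks by only a few percent per step.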
Visit annotations in context

Annotators

QuantumWasp

URL

distill.pub/2017/momentum

QuantumWasp

Annotations: 31

Joined: October 29, 2020
