- Feb 2017
Possibly confusing naming conventions
Possibly?! Most definitely! I kept trying to figure out the softmax loss function, but there are no 'loss' calculations taking place when softmax is performed!!
All it does is squash the raw scores (the result of hypothesis function f, which for this course seems to be Wx + b) into a convenient form: normalized estimated probabilities.
Pretty much after we run these function results through softmax, the whole thing is rebranded as q in the actual Cross-Entropy loss function.
The cross-entropy loss function doesn't even seem to make direct use of all the values calculated in the softmax equation. You convert your class scores to normalized probabilities, but the cross-entropy loss only cares about the probability assigned to the correct class (the other scores only matter indirectly, through the softmax normalization). And we only care about minimizing the loss, which can only be done by increasing the correct class probability (due to the negative sign in the cross-entropy formula).
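To check this for myself, here is a minimal NumPy sketch. The scores and the label are made-up values, and `softmax` / `cross_entropy` are my own helper names, not anything from the course code. It computes the full cross-entropy sum and compares it to just taking the negative log of the correct class probability.

```python
import numpy as np

def softmax(scores):
    # Subtract the max score first for numerical stability; the result is unchanged.
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i)
    return -np.sum(p * np.log(q))

scores = np.array([3.2, 5.1, -1.7])  # hypothetical raw class scores f = Wx + b
correct_class = 0                    # hypothetical label

q = softmax(scores)                  # estimated probabilities, sum to 1
p = np.zeros_like(q)
p[correct_class] = 1.0               # one-hot "true" distribution

full_loss = cross_entropy(p, q)              # sum over all classes
shortcut_loss = -np.log(q[correct_class])    # only the correct class term

print(q, full_loss, shortcut_loss)
```

Both numbers come out the same, because every term where p is 0 drops out of the sum.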
So to wrap my head around the cross-entropy formula:
Its intention is to sum the log estimated probabilities of the correct class only, then negate that sum. Since each log probability is zero or negative, the loss ends up zero or positive, and really the closer to zero the better.
As Andrej pointed out with Maximum Likelihood Estimation, minimizing the negative log-likelihood is equivalent to maximizing the log-likelihood itself. Since this function only considers the values assigned to the correct class (due to the multiplication by p, the true class probabilities, which are 100% or 0%), the more mass on the correct class's estimated probability, the better (lower) the result of the cross-entropy function will be.
To this effect, the loss value disregards how the estimated probability is split among the incorrect classes, and only depends on the probability assigned to the correct class.
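A quick sanity check of that claim, as a sketch with made-up probability vectors (`nll` is just my own helper name): two predictions that give the correct class the same probability but split the rest differently should get the same loss, and putting more mass on the correct class should push the loss toward zero.

```python
import numpy as np

def nll(q, correct_class):
    # Negative log of the probability assigned to the correct class.
    return -np.log(q[correct_class])

correct_class = 0  # hypothetical label

# Same 0.6 on the correct class, different splits across the incorrect classes.
q_a = np.array([0.6, 0.3, 0.1])
q_b = np.array([0.6, 0.2, 0.2])
# More mass on the correct class.
q_c = np.array([0.9, 0.05, 0.05])

loss_a = nll(q_a, correct_class)
loss_b = nll(q_b, correct_class)
loss_c = nll(q_c, correct_class)

print(loss_a, loss_b, loss_c)
```

Here loss_a and loss_b are identical and loss_c is smaller. Because the softmax outputs always sum to 1, raising the correct class probability necessarily takes mass away from the other classes, even though the loss never looks at them individually.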
SVM only cares that the difference is at least 10
The margin seems to be set manually by whoever writes the loss function. In the sample code, the margin is 1, so each incorrect class has to be scored lower than the correct class by at least 1 to contribute no loss.
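To make the margin concrete, here is a small sketch of the hinge loss for a single example with the margin exposed as a `delta` argument. The function name, the scores, and the label are my own illustrative choices, not the course's sample code.

```python
import numpy as np

def svm_loss_single(scores, correct_class, delta=1.0):
    # Each incorrect class contributes max(0, s_j - s_correct + delta).
    margins = np.maximum(0.0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0.0  # the correct class does not count against itself
    return np.sum(margins)

scores = np.array([3.2, 5.1, -1.7])  # hypothetical class scores
loss = svm_loss_single(scores, correct_class=0)

# When the correct class already wins by at least delta, the loss is zero.
easy_scores = np.array([10.0, 2.0, 3.0])
easy_loss = svm_loss_single(easy_scores, correct_class=0)

print(loss, easy_loss)
```

In the first case only the class scoring 5.1 violates the margin, so it is the only one adding to the loss. In the second case pushing the correct score even higher changes nothing, which is the "only cares that the difference is at least delta" behavior.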
How is this margin determined? It seems like one would have to know the magnitude of the scores beforehand.
Diving deeper, is the scoring magnitude always the same if the parameters are normalized by their average and scaled to be between 0 and 1? (or -1 and 1... not sure of the correct scaling implementation)
Coming back to the topic -- is this 'minimum margin' or delta a tunable parameter?
What effects do we see on the model by adjusting this parameter?
What are best and worst case scenarios of playing with this parameter?
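One experiment that might help with these questions, as a rough sketch with randomly generated W, b and x (all hypothetical): for a linear score function, multiplying W and b by a constant c multiplies every score by c. If delta is multiplied by the same c, each hinge term scales by c too, so the loss is just c times the original.

```python
import numpy as np

def svm_loss_single(scores, correct_class, delta):
    # Each incorrect class contributes max(0, s_j - s_correct + delta).
    margins = np.maximum(0.0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0.0
    return np.sum(margins)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # hypothetical weights: 3 classes, 4 features
b = rng.normal(size=3)       # hypothetical biases
x = rng.normal(size=4)       # hypothetical input
correct_class = 1
c = 10.0                     # arbitrary scale factor

base = svm_loss_single(W @ x + b, correct_class, delta=1.0)
scaled = svm_loss_single((c * W) @ x + c * b, correct_class, delta=c * 1.0)

print(base, scaled, c * base)
```

If that holds, the absolute value of delta only means something relative to the scale of the scores, and the scale of the scores is set by the magnitude of the weights. So my tentative reading is that you don't need to know the score magnitudes beforehand: the weights can grow or shrink to fit whatever margin was picked, and anything that limits weight size (such as regularization) is what actually interacts with delta.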