- Oct 2019
-
www.cs.cornell.edu
-
H = {x | w^T x + b = 0}
What does this mean?
H = {all points x such that w^T x + b = 0}; that is the equation of the hyperplane.
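A quick numeric check of this definition (the weight vector, bias, and points below are made up for illustration):

```python
import numpy as np

# Hypothetical hyperplane in 2D: w^T x + b = 0
w = np.array([2.0, -1.0])
b = -1.0

x_on = np.array([1.0, 1.0])   # 2*1 - 1*1 - 1 = 0  -> lies on H
x_off = np.array([3.0, 0.0])  # 2*3 - 1*0 - 1 = 5  -> off H, on the positive side

for x in (x_on, x_off):
    score = w @ x + b
    print(x, "on H" if np.isclose(score, 0) else f"off H (w^T x + b = {score})")
```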
-
SVM
An amazing and simple video on SVM: https://www.youtube.com/watch?v=1NxnPkZM9bc
-
The only difference is that we have the hinge-loss instead of the logistic loss.
What are the hinge loss and the logistic loss?
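For reference, a minimal sketch of the two losses as functions of the margin y(w^T x + b) (the margin values below are made up):

```python
import numpy as np

def hinge_loss(y, score):
    # Hinge loss (used by SVMs): exactly zero once the point is correctly
    # classified with margin >= 1, linear penalty otherwise.
    return np.maximum(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    # Logistic loss (used by logistic regression): never exactly zero,
    # but decays smoothly as the margin y * score grows.
    return np.log(1.0 + np.exp(-y * score))

margins = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])   # y * (w^T x + b)
print(hinge_loss(1, margins))     # [3.  1.  0.5 0.  0. ]
print(logistic_loss(1, margins))  # smooth, strictly positive everywhere
```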
-
If the data is low dimensional it is often the case that there is no separating hyperplane between the two classes.
Why ??
-
The slack variable ξ_i allows the input x_i to be closer to the hyperplane
How ?
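The standard soft-margin formulation makes the role of ξ_i explicit (written here in the usual C-parameterized form, which may differ slightly from the notes' notation):

```latex
% Soft-margin SVM: each margin constraint may be violated by xi_i >= 0,
% and the objective pays C * sum_i xi_i for those violations.
\min_{\mathbf{w},\,b,\,\xi}\;\; \|\mathbf{w}\|_2^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.} \quad
y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
```

So a positive ξ_i lets x_i sit inside the margin (or even on the wrong side), at a price paid in the objective.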
-
-
www.cs.cornell.edu
-
Mγ ≤ w · w*
w·w* (new) = w·w* after M updates; w·w* (old) = w·w* before M updates. Each update adds at least γ, so w·w* (new) ≥ w·w* (old) + Mγ; starting from w = 0, this gives Mγ ≤ w·w* (new).
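Writing out the step behind this bound (the standard Perceptron convergence argument, assuming w starts at zero and γ is the margin of w*):

```latex
% Each update adds y_i x_i to w, and by the definition of the margin gamma,
% y_i (x_i . w*) >= gamma for every training point.
\vec{w}_{t+1}\cdot\vec{w}^*
  = (\vec{w}_t + y_i\vec{x}_i)\cdot\vec{w}^*
  = \vec{w}_t\cdot\vec{w}^* + y_i(\vec{x}_i\cdot\vec{w}^*)
  \ge \vec{w}_t\cdot\vec{w}^* + \gamma
% Starting from w_0 = 0, after M updates:
\vec{w}\cdot\vec{w}^* \ge M\gamma .
```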
-
- Sep 2019
-
www.cs.cornell.edu
-
y² = 1
as y ∈ {−1, +1}
-
w* lies on the unit sphere
What does this mean ??
-
w · x_i
If one were to take the dot product of a unit vector A and a second vector B of any non-zero length, the result is the length of vector B projected in the direction of vector A
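A tiny numeric check of this statement (the vectors are made up):

```python
import numpy as np

a = np.array([1.0, 0.0])   # unit vector A
b = np.array([3.0, 4.0])   # arbitrary vector B, length 5

projection_length = a @ b  # 3.0: the length of B along the direction of A
print(projection_length)
print(np.linalg.norm(b) * np.cos(np.arctan2(b[1], b[0])))  # same value, via |B| cos(angle)
```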
-
Quiz#1: Can you draw a visualization of a Perceptron update? Quiz#2: How often can a Perceptron misclassify a point x repeatedly?
-
-
www.cs.cornell.edu
-
D (sequence of heads and tails)
D is the observed sequence (i.e., the outcomes y); θ is P(H), the probability of heads.
-
E
-
θ as a random variable
Let P(H) = θ be treated as a random variable. P(D) is a constant, since the data D has already occurred (it does not depend on θ).
-
derivative and equating it to zero
At maxima and minima, the derivatives are always zero
-
We can now solve for θ by taking the derivative and equating it to zero.
why ?
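Spelling out that step for the coin-flip likelihood (n_H heads and n_T tails observed):

```latex
% Log-likelihood of the sequence D under P(H) = theta
\log P(D\mid\theta) = n_H\log\theta + n_T\log(1-\theta) + \text{const}
% Set the derivative to zero (the log-likelihood is concave in theta,
% so the stationary point is its maximum):
\frac{\partial}{\partial\theta}\log P(D\mid\theta)
  = \frac{n_H}{\theta} - \frac{n_T}{1-\theta} = 0
\;\;\Longrightarrow\;\;
\hat\theta_{\mathrm{MLE}} = \frac{n_H}{n_H + n_T}.
```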
-
Posterior Predictive Distribution
Doubtful about this. Refer video: https://www.youtube.com/watch?v=R9NQY2Hyl14
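For the Beta-Bernoulli coin example, the posterior predictive works out to the posterior mean (a standard result; this assumes a Beta(α, β) prior and observed counts n_H, n_T):

```latex
P(\text{heads}\mid D)
 = \int_0^1 P(\text{heads}\mid\theta)\,P(\theta\mid D)\,d\theta
 = \mathbb{E}[\theta\mid D]
 = \frac{n_H + \alpha}{\,n_H + n_T + \alpha + \beta\,}.
```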
-
Now, we can use the Beta distribution to model P(θ): P(θ) = θ^{α−1}(1−θ)^{β−1} / B(α, β)
Important shit! https://www.youtube.com/watch?v=v1uUgTcInQk
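A small numeric sketch comparing the estimates this setup gives (the counts and hyperparameters below are made up):

```python
n_H, n_T = 7, 3          # observed heads / tails (made-up data)
alpha, beta = 2.0, 2.0   # Beta prior hyperparameters (assumption)

theta_mle = n_H / (n_H + n_T)                                   # 0.700
theta_map = (n_H + alpha - 1) / (n_H + n_T + alpha + beta - 2)  # 0.667: posterior mode
theta_post_mean = (n_H + alpha) / (n_H + n_T + alpha + beta)    # 0.643: posterior predictive P(heads | D)

print(theta_mle, theta_map, theta_post_mean)
```

The prior acts like pseudo-counts: α − 1 extra heads and β − 1 extra tails for the MAP estimate.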
-
\mathcal{H}
H is the hypothesis class (i.e., the set of all possible classifiers h(·)).
-
MLE Principle:
This is very important
-
X
What does P(X,Y) mean ?
-
-
www.cs.cornell.edu
-
= argmin_w (1/n) ∑_{i=1}^{n} (x_i^⊤ w − y_i)² + λ ||w||²_2, with λ = σ² / (n τ²)
This means we minimize the loss together with the magnitude of w, so the weights for the noisy (high-variance) features in x are shrunk toward zero (with the squared L2 penalty they shrink but rarely become exactly zero; exact zeros are the L1/lasso behaviour). https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a
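A minimal closed-form sketch of this objective on synthetic data (λ is fixed by hand here rather than derived from σ and τ):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)   # Gaussian noise

lam = 0.1   # plays the role of sigma^2 / (n * tau^2); chosen arbitrarily here
# Minimizing (1/n) * ||Xw - y||^2 + lam * ||w||^2 has the closed form:
w_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
print(w_ridge)   # shrunk toward zero relative to the unregularized fit, but not sparse
```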
-
P(w)
P(w): w is considered to be a random variable with a Gaussian (prior) distribution.
-
1/√(2πσ²)
This is the normalization constant of the Gaussian distribution.
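More precisely, it is the normalization constant in the Gaussian noise model used for P(y_i | x_i, w):

```latex
P(y_i \mid \mathbf{x}_i, \mathbf{w})
 = \frac{1}{\sqrt{2\pi\sigma^2}}
   \exp\!\left(-\frac{(\mathbf{x}_i^\top\mathbf{w} - y_i)^2}{2\sigma^2}\right).
```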
-
⊤
w = vector of weights [w1, w2, w3, w4, w5]; w^⊤ = transpose of w; w^⊤ x is the dot (inner) product of w and x, so it is a scalar (not a cross product).
-
argmin_w (1/n)
Where did the n in the denominator come from?
-
Linear Regression
Need to revise this again. A lot of doubts.
-
-
www.cs.cornell.edu
-
This gives you a good estimate of the validation error (even with standard deviation)
why ??
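A sketch of why: each of the k folds yields one validation score, so you get a small sample of scores and can report both its mean and its standard deviation (scikit-learn and a placeholder model assumed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 10-fold CV: ten validation scores -> mean and standard deviation of the error.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
errors = 1.0 - scores
print(f"validation error: {errors.mean():.3f} +/- {errors.std():.3f}")
```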
-
-
www.cs.cornell.edu
-
Regression Trees
I don't get these.
-
O(n log n)
How?
-
Decision trees are myopic
Doubtful.
-
Quiz: Why don't we stop if no split can improve impurity? Example: XOR
I don't get this :(
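A tiny sketch of the XOR case: neither single axis-aligned split reduces the impurity, even though two splits separate the data perfectly, so stopping as soon as no split helps would be premature (Gini impurity is used here for concreteness):

```python
import numpy as np

# XOR data: the label is 1 exactly when the two features differ.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

parent = gini(y)
for feature in (0, 1):
    left, right = y[X[:, feature] == 0], y[X[:, feature] == 1]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    print(f"split on feature {feature}: impurity {parent:.2f} -> {weighted:.2f}")  # no improvement
```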
-
−∑_k p_k log(p_k)
This is the formula for entropy.
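A quick computation of this formula (the class distributions below are made up):

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_k p_k log2(p_k); terms with p_k = 0 contribute 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # 1.0 bit: maximally impure two-class node
print(entropy([1.0, 0.0]))   # 0.0: a pure node
print(entropy([0.9, 0.1]))   # ~0.47: mostly pure
```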
-
KL-Divergence
What is KL-Divergence?
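A minimal sketch of the definition in the discrete case (my own illustration, not the notes' derivation):

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) = sum_k p_k * log(p_k / q_k): how much information is lost
    # when q is used to approximate p (in nats, since we use the natural log).
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.9, 0.1]
uniform = [0.5, 0.5]
print(kl_divergence(p, uniform))        # > 0: p is far from uniform
print(kl_divergence(uniform, uniform))  # 0.0: identical distributions
```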
-
-
www.cs.cornell.edu
-
Rescue to the curse:
Dimensionality reduction may yield better-behaved data (distances become meaningful again).
-
ε_NN
Doubtful about this. Shouldn't it be P(y|x_t)(1 − P(y|x_NN)) + (1 − P(y|x_t))P(y|x_NN)?
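For reference, a hedged sketch of the standard 1-NN error argument with binary labels, where p = P(y = 1 | x_t) and, as n → ∞, x_NN → x_t:

```latex
\epsilon_{NN}
 = P(y{=}1\mid x_t)\,P(y{=}{-}1\mid x_{NN})
 + P(y{=}{-}1\mid x_t)\,P(y{=}1\mid x_{NN})
\;\xrightarrow{\,n\to\infty\,}\; 2p(1-p)
\;\le\; 2\min(p,\,1-p) = 2\,\epsilon_{\mathrm{BayesOpt}}.
```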
-
How does k affect the classifier? What happens if k = n? What if k = 1?
As per my project, the accuracy changes with k. As k → n, the accuracy drops. (Refer to the Project 1 report.)
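A quick experiment along those lines (synthetic data and scikit-learn assumed, so the exact numbers are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

for k in (1, 5, 25, 100, len(y_tr)):
    # k = n reduces to predicting the majority label of the training set.
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k = {k:3d}: test accuracy {model.score(X_te, y_te):.3f}")
```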
-
−1
This is for when y is not 1 or 0. This seems like a typo here.
-
-
www.cs.cornell.edu
-
Generalization: ε = E_{(x,y)∼P}[ℓ(x, y | h^*(·))]
What is this? Doubt: what is E here?
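E is the expectation over pairs (x, y) drawn from the data distribution P; since P is unknown, it is typically estimated by averaging the loss over a held-out test set. A minimal sketch (the classifier and data are placeholders):

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    return (y_pred != y_true).astype(float)

# Placeholder classifier h*(.): here it always predicts +1, purely for illustration.
def h_star(X):
    return np.ones(len(X), dtype=int)

# A held-out test set plays the role of fresh samples (x, y) ~ P.
rng = np.random.default_rng(0)
X_test = rng.normal(size=(1000, 3))
y_test = rng.choice([-1, 1], size=1000)

# Monte-Carlo estimate of epsilon = E_{(x,y)~P}[ loss(x, y | h*) ]
epsilon_hat = zero_one_loss(h_star(X_test), y_test).mean()
print(epsilon_hat)   # ~0.5 for this random-label toy setup
```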
-
i.i.d.
independent and identically distributed data points
-
C = ℝ.
What is ℝ here? Do the data set and the label set share the same space?
-