52 Matching Annotations
1. Jun 2022
3. people.eecs.berkeley.edu
1. [highlighted passage not recoverable: garbled in extraction]

epsilon greedy

the crux

3. _shz'm]w±KÅQfO ̄à ̧³o*Õz'm ̄à

so rewards and reward functions can be defined over S x A x S

#### URL

3. May 2022
4. d3c33hcgiwev3.cloudfront.net
1. Will they make exactly the same action selections and weight updates?

no. in Q-learning, the greedy action in the Bellman target is evaluated before the update, but the next step's action is generated from the updated Q.

whereas in SARSA with a greedy policy, the greedy action used in the update equation is chosen before the update, and that same action is then taken on the next step.
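
A minimal sketch of that ordering difference, assuming a hypothetical one-state task (the `step_env` rewards, the initial Q values, and the step sizes are made up for illustration). Both learners act greedily, yet they pick different actions on the second step because SARSA commits to its next action before the update.

```python
def greedy(q_row):
    """Index of the largest action value (ties go to the lowest index)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])

def step_env(action):
    """Hypothetical one-state task: action 0 pays -10, action 1 pays 0."""
    return -10.0 if action == 0 else 0.0

def sarsa_actions(n_steps, alpha=0.5, gamma=0.9):
    q = [1.0, 0.9]                # initial estimates favour action 0
    action = greedy(q)
    taken = []
    for _ in range(n_steps):
        taken.append(action)
        reward = step_env(action)
        next_action = greedy(q)   # chosen BEFORE the update
        q[action] += alpha * (reward + gamma * q[next_action] - q[action])
        action = next_action      # and committed to for the next step
    return taken

def q_learning_actions(n_steps, alpha=0.5, gamma=0.9):
    q = [1.0, 0.9]
    taken = []
    for _ in range(n_steps):
        action = greedy(q)        # chosen from the most recent Q
        taken.append(action)
        reward = step_env(action)
        q[action] += alpha * (reward + gamma * max(q) - q[action])
    return taken

print(sarsa_actions(2), q_learning_actions(2))
```

The first weight update is identical in both; only the second action differs.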

2. Q-learning considered an off-policy control method?

because the policy that Q estimates is the one that is greedy w.r.t. Q, but the policy generating the samples can be anything

3. $\rho_t (R_{t+1} + \gamma G_{t+1:h})$

see 5.9, this is per-decision importance sampling
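
A rough sketch of where this term sits, assuming the highlight comes from the recursive off-policy n-step return with a control variate, G_{t:h} = rho_t (R_{t+1} + gamma G_{t+1:h}) + (1 - rho_t) V(S_t), with G_{h:h} = V(S_h). The function name and argument layout are my own.

```python
def control_variate_return(rewards, rhos, values, gamma):
    """Recursive off-policy n-step return (sketch).

    rewards[k] = R_{t+k+1}, rhos[k] = rho_{t+k}, values[k] = V(S_{t+k});
    values has one more entry than rewards (the bootstrap state S_h).
    """
    g = values[-1]                      # G_{h:h} = V(S_h)
    for k in reversed(range(len(rewards))):
        g = rhos[k] * (rewards[k] + gamma * g) + (1.0 - rhos[k]) * values[k]
    return g
```

With every rho equal to 1 this collapses to the ordinary n-step return; with every rho equal to 0 it falls back to V(S_t), so the estimate does not shrink toward zero.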

#### URL

5. Nov 2021
6. ocw.mit.edu
1. Information Criterion

usually a function of the likelihood function plus some penalty for model complexity
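
For concreteness, a small sketch of the two most common criteria, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L (lower is better for both). The function names are mine, and the lecture may use a differently scaled variant.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: fit term plus a penalty of 2 per parameter."""
    return 2.0 * n_params - 2.0 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: the penalty grows with log(sample size)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood
```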

2. sum of t independent innovations

3. $1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2$

due to having uncorrelated innovations
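
A quick sketch of why the cross terms vanish, assuming an MA(q) process X_t = eta_t + theta_1 eta_{t-1} + ... + theta_q eta_{t-q} with uncorrelated innovations of variance sigma2. The helper name is hypothetical.

```python
def ma_autocovariance(thetas, lag, sigma2=1.0):
    """Autocovariance of an MA(q) process at a given lag.

    Only products of coefficients attached to the SAME innovation survive,
    because distinct innovations are uncorrelated.
    """
    psi = [1.0] + list(thetas)          # coefficient on eta_t is 1
    lag = abs(lag)
    return sigma2 * sum(psi[j] * psi[j + lag] for j in range(len(psi) - lag))
```

At lag 0 this gives sigma2 * (1 + theta_1^2 + ... + theta_q^2), and it is zero for any lag beyond q.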

4. complex number $\lambda$: $|\lambda| > 1$, $(1 - \frac{1}{\lambda}L)^{-1}$

applying the inverse, renders the process an infinite order MA process

5. z| ≤ 1}, i.e., $|\lambda_j| > 1$

geometric series
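
Spelling out the geometric series, as a sketch under the assumption that L is the lag operator and the factor comes from a root lambda of the AR polynomial:

```latex
\[
  \left(1 - \tfrac{1}{\lambda} L\right)^{-1}
  = \sum_{j=0}^{\infty} \lambda^{-j} L^{j},
  \qquad |\lambda| > 1 ,
\]
\[
  \left(1 - \tfrac{1}{\lambda} L\right) X_t = \eta_t
  \;\Longrightarrow\;
  X_t = \sum_{j=0}^{\infty} \lambda^{-j}\, \eta_{t-j} .
\]
```

The sum converges because |1/lambda| < 1, which is why the inverted process is an infinite-order MA.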

6. $L^i(\eta_t)$

function on eta_t

7. ηt−

how does this affect the process over time?

e.g. changing interest rates on the long term economy

8. $p/n \to 0$

more data than parameters

9. St

S_t is a weighted average of white noise

#### URL

7. Jul 2021
8. d3c33hcgiwev3.cloudfront.net
1. $\theta \leftarrow \theta + \alpha^{\theta} I \delta \nabla \ln \pi(A|S,\theta)$

actor-critic with state value baseline update, with discounting!

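-ln pi(A|S, theta) is the cross-entropy loss against a one-hot target on the taken action, so del ln pi(A|S, theta) is the negative of its gradient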
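
A small sketch of that correspondence, assuming a softmax policy over action logits (the parameterisation and function names are my assumption, not something the highlighted text states). The gradient of ln pi with respect to the logits is one_hot(action) - probs, which is minus the usual cross-entropy gradient.

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, action):
    """Cross-entropy against a one-hot target on `action`: -ln pi(action)."""
    return -math.log(softmax(logits)[action])

def grad_log_pi(logits, action):
    """Gradient of ln pi(action) with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
```

In an autodiff framework this is why the actor loss is often written as the TD error (or return) times a cross-entropy term.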

#### URL

9. May 2021
10. jmlr.org
1. $(\delta_t + \theta_t^\top \phi_t - \theta_{t-1}^\top \phi_t)$

modified TD error in terms of regular TD error

2. $\alpha \phi_t \delta'_t$

A_{t+1}^t = I

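3. $A_0^{t-1}\theta_0 + \alpha \sum_{i=0}^{t-1} A_{i+1}^{t-1} \phi_i G_i^{\lambda|t}$

we've achieved something special here - a recursive definition of theta_{t+1} in terms of theta_t

4. $= (I - \alpha\phi_t\phi_t^\top)\theta_t + \alpha \sum_{i=0}^{t-1} A_{i+1}^{t} \phi_i (\gamma\lambda)^{t-i} \delta'_t + \alpha\phi_t (R_{t+1} + \gamma\theta_t^\top \phi_{t+1})$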

this is already computationally quite nice, but these jokers want to incorporate the last term into the modified TD error

5. $G^{\lambda|t+1}$

lambda return = lambda weighted sum of all n-step returns up to time t+1

6. $\gamma\lambda e_{t-1} + \phi_t - \alpha\gamma\lambda (e_{t-1}^\top \phi_t)\phi_t$

the dutch trace update rule
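
The trace rule as a short sketch, e_t = gamma*lambda*e_{t-1} + phi_t - alpha*gamma*lambda*(e_{t-1}^T phi_t)*phi_t, using plain lists for the vectors. The function name is mine, and this covers only the trace, not the full true online TD(lambda) weight update.

```python
def dutch_trace_update(e_prev, phi, alpha, gamma, lam):
    """One step of the dutch eligibility trace for linear features."""
    decay = gamma * lam
    overlap = sum(e * p for e, p in zip(e_prev, phi))   # e_{t-1}^T phi_t
    return [decay * e + p - alpha * decay * overlap * p
            for e, p in zip(e_prev, phi)]
```

Setting alpha to 0 recovers the ordinary accumulating trace, which makes the extra correction term easy to see.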

#### URL

11. d3c33hcgiwev3.cloudfront.net
1. G1:3

weighted sum of all available n-step lambda returns from t=1

2. changing the values of those past states for when they occur again in the future

assuming the special linear case, with a discrete state space and a one-hot feature vector, the TD error multiplied by the eligibility vector is the effect of the current TD error on each state, with the effect amplified (or de-amplified) by each state's recency of occurrence.

3. assign it backward to each prior state according to how much that state contributed to the current eligibility trace at that time.

the current TD error contributes less and less to states that occurred further back in time (the linear case helps here: the eligibility trace is a sum of past, fading state input vectors).

4. $\lambda^{T-t-1} G_t$
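
A toy sketch of the one-hot case, with an accumulating trace z <- gamma*lambda*z + x(S_t) followed by v <- v + alpha*delta*z. The TD errors are fed in by hand here (a simplification of mine) so the fading credit is easy to read off.

```python
def run_td_lambda(events, n_states, alpha, gamma, lam):
    """Apply tabular TD(lambda) updates for (state, td_error) pairs.

    td_error is supplied by the caller for illustration; a real learner
    would compute it from rewards and the current value estimates.
    """
    values = [0.0] * n_states
    trace = [0.0] * n_states
    for state, td_error in events:
        trace = [gamma * lam * z for z in trace]   # fade every past state
        trace[state] += 1.0                        # one-hot feature of the current state
        values = [v + alpha * td_error * z for v, z in zip(values, trace)]
    return values

# states 0, 1, 2 are visited in order; only the last step has a TD error
print(run_td_lambda([(0, 0.0), (1, 0.0), (2, 1.0)], 3, alpha=1.0, gamma=1.0, lam=0.5))
```

The single TD error at the last step is shared out as 0.25, 0.5, 1.0: a factor of gamma*lambda per step of age.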

sum of all remaining n-step returns

5. $Q_{t+n-1}(S_t, A_t)$

should this be inside the bracket??

6. How about the change in left-side outcome from 0 to −1 made in the larger walk? Do you think that made any difference in the best value of n?

yes, because rewards can propagate in from both sides now, making the optimal n shorter?

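7. $G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n})$

G_{t:t+n} needs all rewards from time t+1 through t+n

8. $Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)]$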

at the current time step t+n, we need the states and actions from time = t (i.e. n steps back)
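
A bare-bones sketch of the n-step Sarsa return and update, assuming the rewards R_{t+1}..R_{t+n} have been buffered and Q is a dict keyed by (state, action). Both function names are mine.

```python
def n_step_sarsa_return(rewards, gamma, bootstrap_q):
    """G_{t:t+n}: n discounted rewards plus gamma^n * Q(S_{t+n}, A_{t+n})."""
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return g + (gamma ** len(rewards)) * bootstrap_q

def n_step_sarsa_update(q, state, action, g, alpha):
    """Move Q(S_t, A_t), the pair from n steps back, toward the return."""
    q[(state, action)] += alpha * (g - q[(state, action)])
```

This is why the algorithm has to keep the last n states, actions and rewards around: the update at time t+n touches the pair from time t.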

9. +n−1

use the most up-to-date version

10. expectation of being in a state depends only on the policy and the MDP transition probabilities

hmm, surely starting states can have an impact?

#### URL

12. Jan 2021
13. arxiv.org
1. [21, 39] directly use conventional CNN or deep belief networks (DBN)

#### URL

14. d3c33hcgiwev3.cloudfront.net
1. If $\tau + n < T$, then: $G \leftarrow G + \gamma^{n} V(S_{\tau+n})$

V(terminal state) = 0

#### URL

15. Dec 2020
16. d3c33hcgiwev3.cloudfront.net
1. $\alpha G_t \frac{\nabla \pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}$

notice that the multiplier of the gradient here, G_t / pi(a|s), is positive whenever G_t is positive, meaning we are always going in the same direction as the gradient. using a baseline, G_t - v(S_t), allows us to reverse this direction if G_t is lower than the baseline
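
A tiny sketch of the sign flip, assuming a softmax policy over logits so that del ln pi = one_hot(action) - probs (the parameterisation and names are my assumption). With a positive return and no baseline the taken action always becomes more likely; a baseline above the return makes it less likely.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, ret, baseline, alpha):
    """One REINFORCE update of the logits: alpha * (G - b) * del ln pi."""
    probs = softmax(logits)
    scale = alpha * (ret - baseline)
    return [z + scale * ((1.0 if i == action else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]

no_baseline = reinforce_step([0.0, 0.0], action=0, ret=1.0, baseline=0.0, alpha=0.1)
high_baseline = reinforce_step([0.0, 0.0], action=0, ret=1.0, baseline=2.0, alpha=0.1)
print(softmax(no_baseline)[0], softmax(high_baseline)[0])
```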

2. Actor–Critic with Eligibility Traces (continuing), for estimating $\pi_\theta \approx \pi_*$

actor critic algorithm one step TD:

3. (13.16)

very similar to box 199 but without h(s)

4. $w \leftarrow w + \alpha^{w} \delta \nabla \hat{v}(S, w)$

TD(0) update

5. $G_{t:t+1} - \hat{v}(S_t, w)$

same as REINFORCE MC baseline, but with the sampled G replaced with a bootstrapped G

6. That is, $\mathbf{w}$ is a single component, $w$.

constant baseline?

7. If there is discounting ($\gamma < 1$) it should be treated as a form of termination, which can be done simply by including a factor of $\gamma$ in the second term of

termination, because discounting by gamma is equivalent to the non-discounted case with a per-step termination probability of 1 - gamma (i.e. the episode continues with probability gamma)
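
A quick simulation sketch of that equivalence, assuming a constant reward per step and an episode that survives each step with probability gamma (so it terminates with probability 1 - gamma). The names, the seed and the step cap are illustrative choices of mine.

```python
import random

def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def mean_return_with_random_termination(reward, gamma, episodes, max_steps, seed=0):
    """Average UNdiscounted return when each step ends the episode w.p. 1 - gamma."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        for _ in range(max_steps):
            total += reward                 # the reward for this step is always collected
            if rng.random() >= gamma:       # terminate with probability 1 - gamma
                break
    return total / episodes

simulated = mean_return_with_random_termination(1.0, 0.9, episodes=50000, max_steps=200)
exact = discounted_return([1.0] * 200, 0.9)
print(simulated, exact)
```

The probability of still being alive to collect the k-th reward is gamma^k, which is exactly the discount weight.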

#### URL

17. Oct 2020
18. Local file
1. Cheng et al. [92] design a multi-channel parts-aggregated deep convolutional network by integrating the local body part features and the global full-body features in a triplet training framework

TODO: read this and find out what the philosophy behind the parts-based model is?

what is this?

3. Generation/Augmentation

4. Using the annotated source data in the training process of the target domain is beneficial for cross-dataset learning

What? Clarify

5. Dynamic graph matching (DGM)

super interesting, but hardly applicable. do read though!

6. Sample Rate Learning

what

7. Singular Vector Decomposition (SVDNet)

seems interesting, "iteratively integrate the orthogonality constraint in CNN training"

8. Omni-Scale Network (OSNet)

read paper again to see if any good ideas for architecture

9. bottleneck layer

Bottleneck layers do a 1x1 convolution to reduce the dimensionality, before a 3x3 convolution, to save computation

https://medium.com/@erikgaas/resnet-torchvision-bottlenecks-and-layers-not-as-they-seem-145620f93096
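
A back-of-the-envelope sketch of the saving, counting weights only (no biases or batch norm). The 256 -> 64 -> 64 -> 256 channel sizes and the trailing 1x1 expansion are assumed ResNet-style values for illustration, not taken from the paper being annotated.

```python
def conv_params(in_channels, out_channels, kernel):
    """Number of weights in a square conv layer, ignoring biases."""
    return in_channels * out_channels * kernel * kernel

def bottleneck_params(channels, reduced):
    """1x1 reduce -> 3x3 at the reduced width -> 1x1 expand."""
    return (conv_params(channels, reduced, 1)
            + conv_params(reduced, reduced, 3)
            + conv_params(reduced, channels, 1))

print(bottleneck_params(256, 64), conv_params(256, 256, 3))
```

Under these assumptions the bottleneck uses 69,632 weights against 589,824 for a single full-width 3x3 convolution, and the multiply-add count per output position scales the same way.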

10. Global Feature Representation Learning

something that came up whilst looking through papers on attention: https://arxiv.org/pdf/1709.01507.pdf squeeze-and-excitation

11. [68]

Parts-based paper, interesting approach