75 Matching Annotations
  1. Oct 2016
    1. Mean Squared Projected Bellman Error

      This requires some thought to deconstruct. See page 2 (bottom right) of http://www.machinelearning.org/archive/icml2009/papers/546.pdf for an explanation.

      TQ is Bellman applied to Q, so Q-TQ would be something like: how far is Q from satisfying Bellman (which we know it must)?

      This would be fine for tabular Q, but under function approximation such a Q function might not be realizable: there might not be a setting of the thetas that yields this. So the idea is to project:

      \Pi(v) = V_{theta*}, where theta* = argmin_theta || v - V_theta ||^2_D

      It turns out that under linear function approximation, this has a closed form, which is exactly the \Pi shown in the next paragraph.
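
      For reference, here is my paraphrase of the quantities involved (standard notation: Phi is the feature matrix, D the diagonal matrix of the state distribution; written for V rather than Q):

      ```latex
      % D-weighted projection onto the span of the features:
      \Pi = \Phi (\Phi^\top D \Phi)^{-1} \Phi^\top D
      % Mean squared projected Bellman error:
      \mathrm{MSPBE}(\theta) = \| V_\theta - \Pi T V_\theta \|_D^2
      ```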

    2. We have tested our method on the "star" Baird counter example [1]
    3. norm ||v||^2_D = v^T D v

      Interpretation: if we look at ||Q - Q'||_D, this is measuring how far Q is from Q', but where (s,a) pairs are weighted by their importance according to D. In particular, I think it's the case that ||Q - Q'||^2_D = E_{(s,a) ~ D}[ (Q(s,a) - Q'(s,a))^2 ].
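
      Spelling that out (with D the diagonal matrix holding the distribution weights d(s,a)):

      ```latex
      \| Q - Q' \|_D^2
        = (Q - Q')^\top D (Q - Q')
        = \sum_{(s,a)} d(s,a) \big( Q(s,a) - Q'(s,a) \big)^2
        = \mathbb{E}_{(s,a) \sim D} \big[ (Q(s,a) - Q'(s,a))^2 \big]
      ```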

    4. limit distribution

      Presumably this is conditioned on the policy pi?

    5. Watkins and Dayan's Q-learning algorithm is an off-policy algorithm as the policy that is searched and evaluated is strictly greedy with respect to the current action values, but for control the agent uses an ε-greedy policy. This facilitates exploration as the agent is allowed to make a random move with probability ε to obtain representative samples and facilitate the search for a policy that generates high rewards.

      This is a nice summary.

    1. 4.3. Natural actor-critic algorithms

      we will spend some time in the future talking specifically about actor-critic methods, so it's okay to skip this for now.

    2. 3.3. Compatible function approximation

      ok to skip

    3. Example 1. When using a REINFORCE estimator with a baseline b = 0 in a scenario where there is only a single reward of always the same magnitude, e.g., r(x,u) = c ∈ R for all x, u, then the variance of the gradient estimate will grow at least cubically with the length of the planning horizon H as Var{g_RF} = H² c² Σ_{k=0}^{H} Var{∇_θ log π_θ(u_k | x_k)}, if Var{∇_θ log π_θ(u_k | x_k)} > 0 for all k. Furthermore, it will also grow quadratically with the magnitude of the reward c. The mean gradient remains unaltered by the reward magnitude c in this example as the reward is constant.

      the variance issue is the key problem with policy gradient approaches.

    4. or when a common history of random numbers (Glynn, 1987; Kleinman, Spall, & Naiman, 1999) is being used (the latter trick is known as the PEGASUS method in reinforcement learning,

      this is a useful trick :)

  2. Sep 2016
    1. K_te = 10

      unfortunate that this goes down. maybe because K_tr=6?

    2. moderate

      s/moderate/more-or-less-nil/

    3. 5.11/79.32

      Why does this hurt so much? Is it because K_tr = 6? If K_tr = 1, would it recover?

    4. We use simple 0/1 costs in defining the Δ function

      We've found it helpful to have Delta = 0 when correct, 1 when action is correct but tag is wrong, and 2 when action is wrong. I wonder if this would help here. (See, eg, http://arxiv.org/abs/1406.1837)
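
      Roughly what I mean, as a sketch (the names here are illustrative, not from either paper):

      ```python
      # Hypothetical mistake-specific cost: 0 if fully correct, 1 if the action
      # is right but the tag is wrong, 2 if the action itself is wrong.
      def delta(pred_action, gold_action, pred_tag, gold_tag):
          if pred_action != gold_action:
              return 2
          if pred_tag != gold_tag:
              return 1
          return 0
      ```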

    5. We use simple 0/1 costs in defining the Δ function

      I assume this means "did I predict the t-th word correctly or not", which is kind of a weird loss function, especially if we're evaluating with BLEU.

    6. capabilities of text-generation systems

      See also Soricut & Marcu, 2005, "Towards developing generation algorithms for text-to-text applications" https://www.isi.edu/~marcu/papers/idl-nglm-acl05.pdf

    7. Accordingly, we also found it useful to use a "curriculum beam" strategy in training

      Also disappointing :(

    8. We therefore found it necessary to pre-train the model using a standard, word-level cross-entropy loss as described in Section 3

      This is disappointing and I'm very curious why this happens. Maybe because of the delayed LaSO update? It seems like that shouldn't be sufficient to cause this. Maybe it's just too much exploration? I wonder why this doesn't seem to be an issue for linear models.

    9. When the gold sequence falls off the beam at t = 4, search resumes with S_5 = succ(y_{1:4}), and so all subsequent predicted sequences have y_{1:4} as a prefix and are thus functions of h_4.

      Key idea for getting from T^2 to T.

    10. succ(y_{1:t-1})

      There's sort of a weird thing going on with time-stamps here, where you assume some sort of 1-1 mapping between predicted words and gold words.

    11. If there was a margin violation, S_t is constructed as the K best sequences assuming the gold history y_{1:t-1} through time-step t-1

      Note: this potentially re-introduces overexposure.

    12. BLEU score

      Note: I don't think this is going to allow the learning algorithm to detect compounding error, unless there's something very clever going on that I haven't seen yet.

    13. Above, the Δ(ŷ^(K)_{1:t}) term denotes a mistake-specific cost-function, which allows us to scale the loss depending on the severity of erroneously predicting ŷ^(K)

      This is great! The inability of LaSO to deal with costs was one of the reasons I kind of abandoned it.

    14. We now define a loss function that gives loss each time the score of the gold prefix y_{1:t} does not exceed that of ŷ^(K)_{1:t} by a margin:

      This is a cool way of getting a middle ground between LaSO and structured SVMs that I haven't seen before!

    15. Ideally we would train by comparing the gold sequence to the highest-scoring complete sequence

      This is essentially what a structured SVM would do, modulo slack rescaling (aka loss-augmented decoding).

    16. In practice, a simple beam search procedure that explores K prospective histories at each time-step has proven to be an effective decoding approach

      My understanding, at least in MT-land, is that one can get away with shockingly small beams: like 5-10 elements!
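
      For concreteness, a minimal sketch of keeping K prospective histories per time-step (`score` is a stand-in for the model's score of extending a prefix by one word):

      ```python
      # Minimal beam search: at each step, extend every prefix on the beam by
      # every word and keep only the K highest-scoring prefixes.
      def beam_search(score, vocab, start_token, max_len, K=5):
          beam = [(0.0, [start_token])]
          for _ in range(max_len):
              candidates = [(s + score(prefix, w), prefix + [w])
                            for s, prefix in beam for w in vocab]
              beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:K]
          return beam[0][1]  # highest-scoring sequence found
      ```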

    17. ∇_{h_t} L ← ∇_{h_t} L + BRNN(w_{t+1}, h_t, ∇_{h_{t+1}} L)

      I find this notation really confusing. I know what they're doing because I know how backprop through time works, but this seems to be overloading \nabla_{h_t}L in a way that I have a hard time parsing.

    18. In this work we are agnostic to the form of the encoding model, and simply assume an abstract source representation x.

      Revisit this later: one presumes that x should be backpropable, and there'll be some form of attention.

    19. When it comes to training RNNs, SEARN/DAgger has been applied under the name "scheduled sampling"

      LOL

    20. We run experiments on three very different problems: word ordering, syntactic parsing, and machine translation

      arguably word reordering and machine translation aren't "very different"

    21. Loss-Evaluation Mismatch: training uses a word-level loss, while at test-time we target improving sequence-level evaluation metrics, such as BLEU

      I think there's another issue too: compounding error.

    22. In this work, we introduce a model and training scheme, based on the work of Daumé III and Marcu (2005),

      Naturally I appreciate the credit, but I encourage everyone to also/alternatively cite Xu and Fern, ICML 2007, "On Learning Linear Ranking Functions for Beam Search", which corrects some errors in DM05. See http://web.engr.oregonstate.edu/~afern/papers/beam-icml07.pdf

    1. We call the resulting feature set Basic Pairwise Relative Offsets in Space (B-PROS) because it captures pairwise relative distances between objects in a single screen. More specifically, a B-PROS feature checks if there is a pair of Basic features with colors k1, k2 ∈ {1,...,128} separated by an offset (i,j), where −13 ≤ i ≤ 13 and −15 ≤ j ≤ 15. If so, φ_{k1,k2,(i,j)} is set to 1, meaning that a pixel of color k1 is contained within some block (c,r) and a pixel of color k2 is contained within the block (c+i, r+j). Intuitively, B-PROS features encode information like "there is a yellow pixel three tiles below a red pixel". The computational complexity of generating B-PROS features is similar to that of BASS, though ultimately fewer features are generated. Note that, as described, B-PROS contains redundant features (e.g. φ_{1,2,(4,0)} and φ_{2,1,(−4,0)}), but it is straightforward to eliminate them. The complete feature set is composed of the Basic features and the B-PROS features. After redundancy is eliminated, the B-PROS feature set has 6,885,440 features in total (28,672 + ((31 · 27 · 128² − 128)/2 + 128))

      Slightly more complicated features. I might not have come up with these.

    2. Basic Features: To obtain Basic features we first divide the Atari 2600 screen into 16 × 14 tiles of size 10 × 15 pixels. For every tile (c,r) and color k, where c ∈ {1,...,16}, r ∈ {1,...,14}, and k ∈ {1,...,128}, we check whether color k is present within the tile (c,r), generating the binary feature φ_{c,r,k}. Intuitively, Basic features encode information like "there is a red pixel within this region". There are 16 × 14 × 128 = 28,672 Basic features in total. It can be computationally demanding to work directly with 160 × 210 pixel screens with a 128 color palette. To make the screen more sparse, following previous work (e.g. [2, 3]) we subtract the background from the screen at each frame before processing it. The background is precomputed using 18,000 samples obtained from random trajectories

      These are pretty straightforward features.
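
      A rough sketch of how I read the Basic feature computation (assuming the screen arrives as a 210 × 160 array of palette indices in 0..127, after background subtraction):

      ```python
      import numpy as np

      # One binary feature per (tile column, tile row, color): is that color
      # present anywhere in the corresponding 10x15-pixel tile of the screen?
      def basic_features(screen):                 # screen: (210, 160) palette indices
          feats = np.zeros((16, 14, 128), dtype=bool)
          for c in range(16):                     # 16 tile columns
              for r in range(14):                 # 14 tile rows
                  tile = screen[r * 15:(r + 1) * 15, c * 10:(c + 1) * 10]
                  feats[c, r, np.unique(tile)] = True
          return feats.ravel()                    # 16 * 14 * 128 = 28,672 features
      ```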

    1. seven Atari 2600 games

      for a more extensive evaluation, see the nature paper here: http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

    2. We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning
    3. The games Q*bert, Seaquest, Space Invaders, on which we are far from human performance, are more challenging because they require the network to find a strategy that extends over long time scales.

      Key point.

    4. raw RGB screenshots

      greyscale?

    5. Another, more stable, metric is the policy's estimated action-value function Q, which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state. We collect a fixed set of states by running a random policy before training starts and track the average of the maximum predicted Q for these states. The two rightmost plots in figure 2 show that average predicted Q increases much more smoothly than the average total reward obtained by the agent and plotting the same metrics on the other five games produces similarly smooth curves

      I totally don't get this.

    6. Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude.

      This seems pretty dubious to me.
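
      What they describe is just taking the sign of the reward, i.e. something like:

      ```python
      # Clip rewards to {-1, 0, +1}: positive -> +1, negative -> -1, zero unchanged.
      def clip_reward(r):
          return (r > 0) - (r < 0)
      ```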

    7. The number of valid actions varied between 4 and 18 on the games we considered

      This is game-specific information; it's not clear why you can't just learn which actions do anything for a given game.

    8. the function φ from algorithm 1 applies this preprocessing to the last 4 frames of a history and stacks them

      So, features are: last four frames, in greyscale, shrunk and clipped, stacked together. Four frames at 60Hz is about 67ms, which is basically enough to get instantaneous motion, but not much more. Note that they're NOT encoding past actions (or rewards) explicitly, though this is probably implicit in the four most recent frames.
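
      So the input is roughly the following (a sketch; `preprocess` stands in for the greyscale-and-downsample step and is hypothetical here):

      ```python
      import numpy as np
      from collections import deque

      # Keep the 4 most recent preprocessed frames, stacked into one observation.
      frame_buffer = deque(maxlen=4)

      def phi(raw_frame, preprocess):
          frame_buffer.append(preprocess(raw_frame))
          while len(frame_buffer) < 4:             # pad at the start of an episode
              frame_buffer.append(frame_buffer[-1])
          return np.stack(frame_buffer, axis=0)    # shape: (4, H, W)
      ```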

    9. testbed that presents agents with a high dimensional visual input (210 × 160 RGB video at 60Hz)

      note: they're not going to actually use all this information

    10. Store transition (φ_t, a_t, r_t, φ_{t+1}) in D

      For experience replay

      note: this is slightly different from the ER I talked about last time, in that there's no matching of states

    11. Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.

      good point!

    12. random minibatch

      note: does not necessarily include s_t ... my understanding is that this is slightly unusual and many people always ensure that s_t is among the minibatch

    13. With probability ε select a random action a_t, otherwise select a_t = max_a Q(φ(s_t), a; θ)

      Epsilon-greedy exploration
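
      i.e. something like this (Q and n_actions are placeholders):

      ```python
      import random

      # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
      def select_action(Q, state, n_actions, epsilon):
          if random.random() < epsilon:
              return random.randrange(n_actions)
          return max(range(n_actions), key=lambda a: Q(state, a))
      ```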

    14. deep Q-learning

      unnecessary branding :P

    15. Since using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on a fixed length representation of histories produced by a function φ.

      Standard deep learning sell: "no feature engineering." Truth: "still some feature engineering" :)

    16. Note that this algorithm is model-free: it solves the reinforcement learning task directly using samples from the emulator E, without explicitly constructing an estimate of E. It is also off-policy: it learns about the greedy strategy a = max_a Q(s, a; θ), while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is often selected by an ε-greedy strategy that follows the greedy strategy with probability 1 − ε and selects a random action with probability ε.

      Hopefully this is all clear, but this is a nice concise definition of model-free and off-policy.

    17. Q(s', a'; θ_{i−1})

      in this case, they're treating Q(s') as a constant as far as the gradient is concerned

    1. The Cartpole Regulator Benchmark
    2. Another heuristic that we found helpful, is to add 'artificial' training patterns from the goal region, which have a known target value of 0. This technique 'clamps' the neural value function to zero in the goal region, and we therefore call it the hint-to-goal heuristic

      Works only if you know what "goals" look like ahead of time.

    3. The Mountain Car Benchmark

      Might be useful to look at https://www.youtube.com/watch?v=0_xeAD6ZgVU for a sense of how you have to solve this problem.

    4. Since we are interested in methods that can directly learn on real systems, our preferred measure of learning effort is the number of cycles needed to achieve a certain performance

      This is a good evaluation, but tricky because of randomness.

    5. They can be seen as a special form of the 'experience replay' technique [Lin92], where value iteration is performed on all transition experiences seen so far.

      See section 2.3 of https://webdocs.cs.ualberta.ca/~sutton/lin-92.pdf for more information on experience replay. This will come up again, so I want to spend a bit of time discussing it.

      An experience is a (s,a,r,s') tuple. The idea is that you store in some separate memory a set of past experiences. When you make a normal Q update for a "new experience" you also re-update yourself for past experiences. Importantly these past experiences might have slightly different targets, because the Q for their s' might have changed.

      It's often the case that only past experiences that agree with the current policy are replayed. (Side note: could probably get around this restriction with some form of importance sampling.)
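
      A minimal sketch of the data structure (how sampled transitions are then used is up to the learner; the point is that old transitions get fresh targets when replayed):

      ```python
      import random
      from collections import deque

      # Store (s, a, r, s') tuples and replay random past experiences later on.
      class ReplayMemory:
          def __init__(self, capacity=100000):
              self.memory = deque(maxlen=capacity)

          def store(self, s, a, r, s_next):
              self.memory.append((s, a, r, s_next))

          def sample(self, k):
              # Targets for these old transitions should be recomputed at replay
              # time, since Q(s', .) may have changed since they were stored.
              return random.sample(list(self.memory), min(k, len(self.memory)))
      ```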

    6. Sample Setting of Costs

      This section appears to be a complicated way of saying "do the normal thing."

    7. u_l

      I'm pretty sure "u" is supposed to be "a"

    8. c(s_l, u_l, s_l')

      Key question: what are the costs?

    9. The consideration of the entire training information instead of on-line samples, has an important further consequence: It allows the application of advanced supervised learning methods, that converge faster and more reliably than online gradient descent methods. Here we use Rprop [RB93], a supervised learning method for batch learning, which is known to be very fast and very insensitive with respect to the choice of its learning parameters

      This is less true now than it was 10 years ago. We have very good stochastic learning algorithms, like adagrad, adadelta, adam, rmsprop, etc. I don't think this is a big deal anymore.

    10. For example, a squared-error measure like the following can be used: error = (Q(s, a) − (c(s, a) + γ min_b Q(s', b)))². At this point, common gradient descent techniques (like the 'backpropagation' learning rule) can be applied to adjust the weights of a neural network in order to minimize the error

      This is exactly the update scheme we talked about last time. You have the current predicted value Q(s,a), which is approximated by a neural network, and the desired term, c(s,a)+gamma min_b Q(s',b), which is the target. We minimize squared error between the prediction and the target. NOTE: my understanding is that when gradients are taken on this, they are ONLY taken with respect to the prediction Q(s,a) and NOT with respect to min_b Q(s',b). This is kind of weird, now that I think about it, but I'm pretty sure that's how this is done.
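
      In modern autodiff terms, that update looks something like this (a sketch, assuming a PyTorch-style network `qnet` mapping a state to a vector of action values; the target is wrapped in no_grad so the gradient flows only through the prediction):

      ```python
      import torch

      def q_loss(qnet, s, a, c, s_next, gamma=0.99):
          pred = qnet(s)[a]                             # prediction Q(s, a)
          with torch.no_grad():                         # target treated as a constant
              target = c + gamma * qnet(s_next).min()   # NFQ minimizes costs
          return (pred - target) ** 2
      ```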

    11. Many of these problems arise, since the representation mechanism in a multi-layer perceptron is not local, but global: A weight change induced by an update in a certain part of the state space might influence the values in arbitrary other regions - and therefore destroy the effort done so far in other regions

      This is essentially the key issue, both in theory and in practice (but in particular in theory) for function approximation in RL.

  3. Jul 2016
    1. We replaced the linear pair-wise function in Eq. (6) by a one-layer MLP, while keeping the other settings identical

      take left and right characters, take difference, use that as input to MLP

    2. Counting numbers c_r are employed to weight the marginal entropies.

      counting numbers are used to approximate the entropy. loopy BP has counting numbers that approximate the tree-structured case, but you can do more.

    3. 1st order Markov, Two Layer

      too many #s, but there seems to be an effect that depth can obviate the need for structure. for instance, in the H1=512 setting, we have: d2o1 ~= d1o2. but d2o0 is never really competitive, so some structure is a clear win. but does the NN get to look at neighboring images? i don't think so. that seems like an important comparison.

    4. Joint training helps

      presentational comment: I love how these results are presented :) although it would be nice if they were quantified in the paragraph so I didn't have to flip back and forth to the table, and significance tests would be nice so we really could conclude "X helps". maybe even do an anova or something ;). but i really like that the main hypotheses are being teased out!

    5. by a one-layer MLP, while keeping the other settings identical

      not clear to me what this means. input is onehot for left- and right-labels, then one hidden layer, then score?

    6. define all pairwise interactions via

      no features on edges (just bias)

    7. However, assuming all counting numbers c_r to be positive, we can derive an algorithm that is able to interleave minimization w.r.t. w and maximization of the beliefs b.

      as far as I can tell, this is nearly identical to Meshi et al., ICML 2010, "Learning efficiently with approximate inference via dual losses". the basic idea is to construct the dual of the inner optimization, turning the problem into a min/min problem, and then doing block coordinate descent over the resulting primal ws and dual(b)s.

    8. we require a sub-gradient w.r.t. w

      not sure we really care about subgradients here, especially if F is non-convex, but we still need gradients in the cases where gradients exist, which is what is handled in the remainder.

    9. The approximated non-linear structured prediction task

      This derivation is essentially a standard derivation for CRFs, regardless of whether the score function is linear or non-linear. The bs are local beliefs, f are the region-based scores (i.e., clique scores), and the entropy is approximated with counts c. The polytope over which the bs is optimized is given below and to the right.
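
      Schematically, the approximated inference problem being described is (my paraphrase; r ranges over regions/cliques, f_r are the clique scores, c_r the counting numbers, and b is constrained to the local marginal polytope):

      ```latex
      \max_{b \in \mathcal{L}} \; \sum_r \sum_{y_r} b_r(y_r) \, f_r(y_r; x, w)
        \; + \; \sum_r c_r H(b_r),
      \qquad
      H(b_r) = - \sum_{y_r} b_r(y_r) \log b_r(y_r)
      ```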

    10. since a margin term is easily incorporated

      huh? Is this just saying "we could alternatively use structured hinge loss if we wanted?"

    11. parameter ε ≥ 0 is used to adjust the uniformity of the distribution

      I don't understand the reason for doing annealing.

    12. We assume the space of valid configurations to be a product space, i.e., Y = ∏_{i=1}^{N} Y_i, and the domain of each individual variable y_i to be discrete, i.e., Y_i = {1, ..., |Y_i|}

      this is a pretty limiting assumption, but makes it fit in the MRF setting

    13. Piece-wise training is, however, suboptimal as the deep features are learned while ignoring the dependencies between the variables of interest

      motivation for structure+deep

    14. Towards this goal, we propose a training algorithm that is able to learn structured models jointly with deep features that form the MRF potentials. Our approach is efficient as it blends learning and inference

      main goal