75 Matching Annotations
  1. Oct 2016
    1. Mean Squared Projected Bellman Error

      This requires some thought to deconstruct. See page 2 (bottom right) of http://www.machinelearning.org/archive/icml2009/papers/546.pdf for an explanation.

      TQ is Bellman applied to Q, so Q-TQ would be something like: how far is Q from satisfying Bellman (which we know it must)?

      This would be fine for tabular Q, but under function approximation such a Q function might not be realizable: there might not be a setting of the thetas that yields this. So the idea is to project:

      \Pi(v) = V_{theta*}, where theta* = argmin_theta || v - V_theta ||^2_D

      It turns out that under linear function approximation, this has a closed form, which is exactly the \Pi shown in the next paragraph.
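
      For reference, here is my paraphrase of the quantities involved (standard notation: Phi is the feature matrix, D the diagonal matrix of the state distribution; written for V rather than Q):

      ```latex
      % D-weighted projection onto the span of the features:
      \Pi = \Phi (\Phi^\top D \Phi)^{-1} \Phi^\top D
      % Mean squared projected Bellman error:
      \mathrm{MSPBE}(\theta) = \| V_\theta - \Pi T V_\theta \|_D^2
      ```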

    2. We have tested our method on the "star" Baird counter example [1]
    3. norm ||v||^2_D = v^T D v

      Interpretation: if we look at ||Q - Q'||_D, this is measuring how far Q is from Q', but where (s,a) pairs are weighted by their importance according to D. In particular, I think it's the case that ||Q - Q'||^2_D = E_{(s,a) ~ D}[ (Q(s,a) - Q'(s,a))^2 ].
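
      Spelling that out (with D the diagonal matrix holding the distribution weights d(s,a)):

      ```latex
      \| Q - Q' \|_D^2
        = (Q - Q')^\top D (Q - Q')
        = \sum_{(s,a)} d(s,a) \big( Q(s,a) - Q'(s,a) \big)^2
        = \mathbb{E}_{(s,a) \sim D} \big[ (Q(s,a) - Q'(s,a))^2 \big]
      ```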

    4. limit distribution

      Presumably this is conditioned on the policy pi?

    5. Watkins and Dayan's Q-learning algorithm is an off-policy algorithm as the policy that is searched and evaluated is strictly greedy with respect to the current action values, but for control the agent uses an ε-greedy policy. This facilitates exploration as the agent is allowed to make a random move with probability ε to obtain representative samples and facilitate the search for a policy that generates high rewards.

      This is a nice summary.

    1. 4.3. Natural actor-critic algorithms

      we will spend some time in the future talking specifically about actor-critic methods, so it's okay to skip this for now.

    2. 3.3. Compatible function approximation

      ok to skip

    3. Example 1. When using a REINFORCE estimator with a baseline b = 0 in a scenario where there is only a single reward of always the same magnitude, e.g., r(x,u) = c ∈ R for all x, u, then the variance of the gradient estimate will grow at least cubically with the length of the planning horizon H as Var{g_RF} = H² c² Σ_{k=0}^{H} Var{∇_θ log π_θ(u_k | x_k)}, if Var{∇_θ log π_θ(u_k | x_k)} > 0 for all k. Furthermore, it will also grow quadratically with the magnitude of the reward c. The mean gradient remains unaltered by the reward magnitude c in this example as the reward is constant.

      the variance issue is the key problem with policy gradient approaches.

    4. or when a common history of random numbers (Glynn, 1987; Kleinman, Spall, & Naiman, 1999) is being used (the latter trick is known as the PEGASUS method in reinforcement learning,

      this is a useful trick :)

  2. Sep 2016
    1. K_te = 10

      unfortunate that this goes down. maybe because K_tr=6?

    2. moderate

      s/moderate/more-or-less-nil/

    3. 5.11/79.32

      Why does this hurt so much? Is it because K_tr = 6? If K_tr = 1, would it recover?

    4. We use simple 0/1 costs in defining the Δ function

      We've found it helpful to have Delta = 0 when correct, 1 when action is correct but tag is wrong, and 2 when action is wrong. I wonder if this would help here. (See, eg, http://arxiv.org/abs/1406.1837)
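
      Roughly what I mean, as a sketch (the names here are illustrative, not from either paper):

      ```python
      # Hypothetical mistake-specific cost: 0 if fully correct, 1 if the action
      # is right but the tag is wrong, 2 if the action itself is wrong.
      def delta(pred_action, gold_action, pred_tag, gold_tag):
          if pred_action != gold_action:
              return 2
          if pred_tag != gold_tag:
              return 1
          return 0
      ```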

    5. We use simple 0/1 costs in defining the Δ function

      I assume this means "did I predict the t-th word correctly or not", which is kind of a weird loss function, especially if we're evaluating with BLEU.

    6. capabilities of text-generation systems

      See also Soricut & Marcu, 2005, "Towards developing generation algorithms for text-to-text applications" https://www.isi.edu/~marcu/papers/idl-nglm-acl05.pdf

    7. Accordingly, we also found it useful to use a "curriculum beam" strategy in training

      Also disappointing :(

    8. We therefore found it necessary to pre-train the model using a standard, word-level cross-entropy loss as described in Section 3

      This is disappointing and I'm very curious why this happens. Maybe because of the delayed LaSO update? It seems like that shouldn't be sufficient to cause this. Maybe it's just too much exploration? I wonder why this doesn't seem to be an issue for linear models.

    9. When the gold sequence falls off the beam at t = 4, search resumes with S_5 = succ(y_{1:4}), and so all subsequent predicted sequences have y_{1:4} as a prefix and are thus functions of h_4.

      Key idea for getting from T^2 to T.

    10. succ(y_{1:t-1})

      There's sort of a weird thing going on with time-stamps here, where you assume some sort of 1-1 mapping between predicted words and gold words.

    11. If there was a margin violation, S_t is constructed as the K best sequences assuming the gold history y_{1:t-1} through time-step t-1

      Note: this potentially re-introduces overexposure.

    12. BLEU score

      Note: I don't think this is going to allow the learning algorithm to detect compounding error, unless there's something very clever going on that I haven't seen yet.

    13. Above, the Δ(ŷ^(K)_{1:t}) term denotes a mistake-specific cost-function, which allows us to scale the loss depending on the severity of erroneously predicting ŷ^(K)

      This is great! The inability of LaSO to deal with costs was one of the reasons I kind of abandoned it.

    14. We now define a loss function that gives loss each time the score of the gold prefix y_{1:t} does not exceed that of ŷ^(K)_{1:t} by a margin:

      This is a cool way of getting a middle ground between LaSO and structured SVMs that I haven't seen before!

    15. Ideally we would train by comparing the gold sequence to the highest-scoring complete sequence

      This is essentially what a structured SVM would do, modulo slack rescaling (aka loss-augmented decoding).

    16. In practice, a simple beam search procedure that explores K prospective histories at each time-step has proven to be an effective decoding approach

      My understanding, at least in MT-land, is that one can get away with shockingly small beams: like 5-10 elements!
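
      For concreteness, a minimal sketch of keeping K prospective histories per time-step (`score` is a stand-in for the model's score of extending a prefix by one word):

      ```python
      # Minimal beam search: at each step, extend every prefix on the beam by
      # every word and keep only the K highest-scoring prefixes.
      def beam_search(score, vocab, start_token, max_len, K=5):
          beam = [(0.0, [start_token])]
          for _ in range(max_len):
              candidates = [(s + score(prefix, w), prefix + [w])
                            for s, prefix in beam for w in vocab]
              beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:K]
          return beam[0][1]  # highest-scoring sequence found
      ```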

    17. ∇_{h_t} L ← ∇_{h_t} L + BRNN(w_{t+1}, h_t, ∇_{h_{t+1}} L)

      I find this notation really confusing. I know what they're doing because I know how backprop through time works, but this seems to be overloading \nabla_{h_t}L in a way that I have a hard time parsing.

    18. In this work we are agnostic to the form of the encoding model, and simply assume an abstract source representation x.

      Revisit this later: one presumes that x should be backpropable, and there'll be some form of attention.

    19. When it comes to training RNNs, SEARN/DAgger has been applied under the name "scheduled sampling"

      LOL

    20. We run experiments on three very different problems: word ordering, syntactic parsing, and machine translation

      arguably word reordering and machine translation aren't "very different"

    21. Loss-Evaluation Mismatch: training uses a word-level loss, while at test-time we target improving sequence-level evaluation metrics, such as BLEU

      I think there's another issue too: compounding error.

    22. In this work, we introduce a model and training scheme, based on the work of Daumé III and Marcu (2005),

      Naturally I appreciate the credit, but I encourage everyone to also/alternatively cite Xu and Fern, ICML 2007, "On Learning Linear Ranking Functions for Beam Search", which corrects some errors in DM05. See http://web.engr.oregonstate.edu/~afern/papers/beam-icml07.pdf

    1. We call the resulting feature set Basic Pairwise Relative Offsets in Space (B-PROS) because it captures pairwise relative distances between objects in a single screen. More specifically, a B-PROS feature checks if there is a pair of Basic features with colors k1, k2 ∈ {1,...,128} separated by an offset (i,j), where −13 ≤ i ≤ 13 and −15 ≤ j ≤ 15. If so, φ_{k1,k2,(i,j)} is set to 1, meaning that a pixel of color k1 is contained within some block (c,r) and a pixel of color k2 is contained within the block (c+i, r+j). Intuitively, B-PROS features encode information like "there is a yellow pixel three tiles below a red pixel". The computational complexity of generating B-PROS features is similar to that of BASS, though ultimately fewer features are generated. Note that, as described, B-PROS contains redundant features (e.g. φ_{1,2,(4,0)} and φ_{2,1,(−4,0)}), but it is straightforward to eliminate them. The complete feature set is composed of the Basic features and the B-PROS features. After redundancy is eliminated, the B-PROS feature set has 6,885,440 features in total (28,672 + ((31 · 27 · 128² − 128)/2 + 128))

      Slightly more complicated features. I might not have come up with these.

    2. Basic Features: To obtain Basic features we first divide the Atari 2600 screen into 16 × 14 tiles of size 10 × 15 pixels. For every tile (c,r) and color k, where c ∈ {1,...,16}, r ∈ {1,...,14}, and k ∈ {1,...,128}, we check whether color k is present within the tile (c,r), generating the binary feature φ_{c,r,k}. Intuitively, Basic features encode information like "there is a red pixel within this region". There are 16 × 14 × 128 = 28,672 Basic features in total. It can be computationally demanding to work directly with 160 × 210 pixel screens with a 128 color palette. To make the screen more sparse, following previous work (e.g. [2, 3]) we subtract the background from the screen at each frame before processing it. The background is precomputed using 18,000 samples obtained from random trajectories

      These are pretty straightforward features.
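
      A rough sketch of how I read the Basic feature computation (assuming the screen arrives as a 210 × 160 array of palette indices in 0..127, after background subtraction):

      ```python
      import numpy as np

      # One binary feature per (tile column, tile row, color): is that color
      # present anywhere in the corresponding 10x15-pixel tile of the screen?
      def basic_features(screen):                 # screen: (210, 160) palette indices
          feats = np.zeros((16, 14, 128), dtype=bool)
          for c in range(16):                     # 16 tile columns
              for r in range(14):                 # 14 tile rows
                  tile = screen[r * 15:(r + 1) * 15, c * 10:(c + 1) * 10]
                  feats[c, r, np.unique(tile)] = True
          return feats.ravel()                    # 16 * 14 * 128 = 28,672 features
      ```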

    1. seven Atari 2600 games

      for a more extensive evaluation, see the nature paper here: http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

    2. We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning
    3. The games Q*bert, Seaquest, Space Invaders, on which we are far from human performance, are more challenging because they require the network to find a strategy that extends over long time scales.

      Key point.

    4. raw RGB screenshots

      greyscale?

    5. Another, more stable, metric is the policy's estimated action-value function Q, which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state. We collect a fixed set of states by running a random policy before training starts and track the average of the maximum predicted Q for these states. The two rightmost plots in figure 2 show that average predicted Q increases much more smoothly than the average total reward obtained by the agent and plotting the same metrics on the other five games produces similarly smooth curves

      I totally don't get this.

    6. Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude.

      This seems pretty dubious to me.
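
      What they describe is just taking the sign of the reward, i.e. something like:

      ```python
      # Clip rewards to {-1, 0, +1}: positive -> +1, negative -> -1, zero unchanged.
      def clip_reward(r):
          return (r > 0) - (r < 0)
      ```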

    7. The number of valid actions varied between 4 and 18 on the games we considered

      This is game-specific information; it's not clear why you can't just learn which actions do anything for a given game.

    8. the function φ from algorithm 1 applies this preprocessing to the last 4 frames of a history and stacks them

      So, features are: last four frames, in greyscale, shrunk and clipped, stacked together. Four frames at 60Hz is about 67ms, which is basically enough to get instantaneous motion, but not much more. Note that they're NOT encoding past actions (or rewards) explicitly, though this is probably implicit in the four most recent frames.
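
      So the input is roughly the following (a sketch; `preprocess` stands in for the greyscale-and-downsample step and is hypothetical here):

      ```python
      import numpy as np
      from collections import deque

      # Keep the 4 most recent preprocessed frames, stacked into one observation.
      frame_buffer = deque(maxlen=4)

      def phi(raw_frame, preprocess):
          frame_buffer.append(preprocess(raw_frame))
          while len(frame_buffer) < 4:             # pad at the start of an episode
              frame_buffer.append(frame_buffer[-1])
          return np.stack(frame_buffer, axis=0)    # shape: (4, H, W)
      ```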

    9. testbed that presents agents with a high dimensional visual input (210 × 160 RGB video at 60Hz)

      note: they're not going to actually use all this information

    10. Store transition (φ_t, a_t, r_t, φ_{t+1}) in D

      For experience replay

      note: this is slightly different from the ER I talked about last time, in that there's no matching of states

    11. Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.

      good point!

    12. random minibatch

      note: does not necessarily include s_t ... my understanding is that this is slightly unusual and many people always ensure that s_t is among the minibatch

    13. With probability ε select a random action a_t, otherwise select a_t = max_a Q(φ(s_t), a; θ)

      Epsilon-greedy exploration
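
      i.e. something like this (Q and n_actions are placeholders):

      ```python
      import random

      # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
      def select_action(Q, state, n_actions, epsilon):
          if random.random() < epsilon:
              return random.randrange(n_actions)
          return max(range(n_actions), key=lambda a: Q(state, a))
      ```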

    14. deep Q-learning

      unnecessary branding :P

    15. Since using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on a fixed length representation of histories produced by a function φ.

      Standard deep learning sell: "no feature engineering." Truth: "still some feature engineering" :)

    16. Note that this algorithm is model-free: it solves the reinforcement learning task directly using samples from the emulator E, without explicitly constructing an estimate of E. It is also off-policy: it learns about the greedy strategy a = max_a Q(s, a; θ), while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is often selected by an ε-greedy strategy that follows the greedy strategy with probability 1 − ε and selects a random action with probability ε.

      Hopefully this is all clear, but this is a nice concise definition of model-free and off-policy.

    17. Q(s', a'; θ_{i−1})

      in this case, they're treating Q(s') as a constant as far as the gradient is concerned

    1. The Cartpole Regulator Benchmark
    2. Another heuristic that we found helpful, is to add 'artificial' training patterns from the goal region, which have a known target value of 0. This technique 'clamps' the neural value function to zero in the goal region, and we therefore call it the hint-to-goal heuristic

      Works only if you know what "goals" look like ahead of time.

    3. The Mountain Car Benchmark

      Might be useful to look at https://www.youtube.com/watch?v=0_xeAD6ZgVU for a sense of how you have to solve this problem.

    4. Since we are interested in methods that can directly learn on real systems, our preferred measure of learning effort is the number of cycles needed to achieve a certain performance

      This is a good evaluation, but tricky because of randomness.

    5. They can be seen as a special form of the 'experience replay' technique [Lin92], where value iteration is performed on all transition experiences seen so far.

      See section 2.3 of https://webdocs.cs.ualberta.ca/~sutton/lin-92.pdf for more information on experience replay. This will come up again, so I want to spend a bit of time discussing it.

      An experience is a (s,a,r,s') tuple. The idea is that you store in some separate memory a set of past experiences. When you make a normal Q update for a "new experience" you also re-update yourself for past experiences. Importantly these past experiences might have slightly different targets, because the Q for their s' might have changed.

      It's often the case that only past experiences that agree with the current policy are replayed. (Side note: could probably get around this restriction with some form of importance sampling.)
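
      A minimal sketch of the data structure (how sampled transitions are then used is up to the learner; the point is that old transitions get fresh targets when replayed):

      ```python
      import random
      from collections import deque

      # Store (s, a, r, s') tuples and replay random past experiences later on.
      class ReplayMemory:
          def __init__(self, capacity=100000):
              self.memory = deque(maxlen=capacity)

          def store(self, s, a, r, s_next):
              self.memory.append((s, a, r, s_next))

          def sample(self, k):
              # Targets for these old transitions should be recomputed at replay
              # time, since Q(s', .) may have changed since they were stored.
              return random.sample(list(self.memory), min(k, len(self.memory)))
      ```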

    6. Sample Setting of Costs

      This section appears to be a complicated way of saying "do the normal thing."

    7. u_l

      I'm pretty sure "u" is supposed to be "a"

    8. c(s_l, u_l, s_l')

      Key question: what are the costs?

    9. The consideration of the entire training information instead of on-line samples, has an important further consequence: It allows the application of advanced supervised learning methods, that converge faster and more reliably than online gradient descent methods. Here we use Rprop [RB93], a supervised learning method for batch learning, which is known to be very fast and very insensitive with respect to the choice of its learning parameters

      This is less true now than it was 10 years ago. We have very good stochastic learning algorithms, like adagrad, adadelta, adam, rmsprop, etc. I don't think this is a big deal anymore.

    10. For example, a squared-error measure like the following can be used: error = (Q(s, a) − (c(s, a) + γ min_b Q(s', b)))². At this point, common gradient descent techniques (like the 'backpropagation' learning rule) can be applied to adjust the weights of a neural network in order to minimize the error

      This is exactly the update scheme we talked about last time. You have the current predicted value Q(s,a), which is approximated by a neural network, and the desired term, c(s,a)+gamma min_b Q(s',b), which is the target. We minimize squared error between the prediction and the target. NOTE: my understanding is that when gradients are taken on this, they are ONLY taken with respect to the prediction Q(s,a) and NOT with respect to min_b Q(s',b). This is kind of weird, now that I think about it, but I'm pretty sure that's how this is done.
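
      In modern autodiff terms, that update looks something like this (a sketch, assuming a PyTorch-style network `qnet` mapping a state to a vector of action values; the target is wrapped in no_grad so the gradient flows only through the prediction):

      ```python
      import torch

      def q_loss(qnet, s, a, c, s_next, gamma=0.99):
          pred = qnet(s)[a]                             # prediction Q(s, a)
          with torch.no_grad():                         # target treated as a constant
              target = c + gamma * qnet(s_next).min()   # NFQ minimizes costs
          return (pred - target) ** 2
      ```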

    11. Many of these problems arise, since the representation mechanism in a multi-layer perceptron is not local, but global: A weight change induced by an update in a certain part of the state space might influence the values in arbitrary other regions - and therefore destroy the effort done so far in other regions

      This is essentially the key issue, both in theory and in practice (but in particular in theory) for function approximation in RL.

  3. Jul 2016
    1. We replaced the linear pair-wise function in Eq. (6) by a one-layer MLP, while keeping the other settings identical

      take left and right characters, take difference, use that as input to MLP

    2. Counting numbers c_r are employed to weight the marginal entropies.

      counting numbers are used to approximate the entropy. loopy BP has counting numbers that approximate the tree-structured case, but you can do more.

    3. 1st order Markov, Two Layer

      too many #s, but there seems to be an effect that depth can obviate the need for structure. for instance, in the H1=512 setting, we have: d2o1 ~= d1o2. but d2o0 is never really competitive, so some structure is a clear win. but does the NN get to look at neighboring images? i don't think so. that seems like an important comparison.

    4. Joint training helps

      presentational comment: I love how these results are presented :) although it would be nice if they were quantified in the paragraph so I didn't have to flip back and forth to the table, and significance tests would be nice so we really could conclude "X helps". maybe even do an anova or something ;). but i really like that the main hypotheses are being teased out!

    5. by a one-layer MLP, while keeping the other settings identical

      not clear to me what this means. input is onehot for left- and right-labels, then one hidden layer, then score?

    6. define all pairwise interactions via

      no features on edges (just bias)

    7. However, assuming all counting numbers c_r to be positive, we can derive an algorithm that is able to interleave minimization w.r.t. w and maximization of the beliefs b.

      as far as I can tell, this is nearly identical to Meshi et al., ICML 2010, "Learning efficiently with approximate inference via dual losses". the basic idea is to construct the dual of the inner optimization, turning the problem into a min/min problem, and then doing block coordinate descent over the resulting primal ws and dual(b)s.

    8. we require a sub-gradient w.r.t. w

      not sure we really care about subgradients here, especially if F is non-convex, but we still need gradients in the cases where gradients exist, which is what is handled in the remainder.

    9. The approximated non-linear structured prediction task

      This derivation is essentially a standard derivation for CRFs, regardless of whether the score function is linear or non-linear. The bs are local beliefs, f are the region-based scores (i.e., clique scores), and the entropy is approximated with counts c. The polytope over which the bs is optimized is given below and to the right.
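
      Schematically, the approximated inference problem being described is (my paraphrase; r ranges over regions/cliques, f_r are the clique scores, c_r the counting numbers, and b is constrained to the local marginal polytope):

      ```latex
      \max_{b \in \mathcal{L}} \; \sum_r \sum_{y_r} b_r(y_r) \, f_r(y_r; x, w)
        \; + \; \sum_r c_r H(b_r),
      \qquad
      H(b_r) = - \sum_{y_r} b_r(y_r) \log b_r(y_r)
      ```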

    10. since a margin term is easily incorporated

      huh? Is this just saying "we could alternatively use structured hinge loss if we wanted?"

    11. parameter ε ≥ 0 is used to adjust the uniformity of the distribution

      I don't understand the reason for doing annealing.

    12. We assume the space of valid configurations to be a product space, i.e., Y = ∏_{i=1}^{N} Y_i, and the domain of each individual variable y_i to be discrete, i.e., Y_i = {1, ..., |Y_i|}

      this is a pretty limiting assumption, but makes it fit in the MRF setting

    13. Piece-wise training is, however, suboptimal as the deep features are learned while ignoring the dependencies between the variables of interest

      motivation for structure+deep

    14. Towards this goal, we propose a training algorithm that is able to learn structured models jointly with deep features that form the MRF potentials. Our approach is efficient as it blends learning and inference

      main goal