- Oct 2016
-
www.umiacs.umd.edu
-
Mean Squared Projected Bellman Error
This requires some thought to deconstruct. See page 2 (bottom right) of http://www.machinelearning.org/archive/icml2009/papers/546.pdf for an explanation.
TQ is Bellman applied to Q, so Q-TQ would be something like: how far is Q from satisfying Bellman (which we know it must)?
This would be fine for tabular Q, but under function approximation such a Q function might not be realizable: there might not be a setting of the thetas that yields this. So the idea is to project:
\Pi(v) = V_{theta*}, where theta* = argmin_theta || v - V_theta ||^2_D
It turns out that under linear function approximation, this has a closed form, which is exactly the \Pi shown in the next paragraph.
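For concreteness, here is a minimal numpy sketch (my own, not from the paper) of that closed-form projection under linear value-function approximation, with features Phi and a diagonal weighting built from the state distribution d:

```python
import numpy as np

def project(v, Phi, d):
    """Pi(v) = Phi @ theta*, where theta* = argmin_theta ||v - Phi theta||^2_D."""
    D = np.diag(d)                                            # d: state weights, e.g. the stationary distribution
    theta = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ v)   # closed form under linear function approximation
    return Phi @ theta                                        # the projected value function
```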
-
We have tested our method on the "star" Baird counter example [1]
See page 9 of http://www.leemon.com/papers/1995b.pdf
-
norm ||v||^2_D = v^T D v
Interpretation: if we look at ||Q - Q'||_D, this is measuring how far Q is from Q', but where (s,a) pairs are weighted by their importance according to D. In particular, I think it's the case that ||Q - Q'||^2_D = E_{(s,a) ~ D}[ (Q(s,a) - Q'(s,a))^2 ].
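A quick numpy sanity check of that identity (my own sketch; the Q values and D here are arbitrary toy numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
q, q_prime = rng.normal(size=10), rng.normal(size=10)   # Q and Q' over 10 (s,a) pairs
d = rng.dirichlet(np.ones(10))                          # the distribution D

weighted_norm_sq = (q - q_prime) @ np.diag(d) @ (q - q_prime)  # ||Q - Q'||^2_D = x^T D x
expectation = np.sum(d * (q - q_prime) ** 2)                   # E_{(s,a)~D}[(Q - Q')^2]
assert np.isclose(weighted_norm_sq, expectation)
```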
-
limit distribution
Presumably this is conditioned on the policy pi?
-
Watkins and Dayan's Q-learning algorithm is an off-policy algorithm as the policy that is searched and evaluated is strictly greedy with respect to the current action values, but for control the agent uses an ε-greedy policy. This facilitates exploration as the agent is allowed to make a random move with probability ε to obtain representative samples and facilitate the search for a policy that generates high rewards.
This is a nice summary.
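To make the off-policy point concrete, a tiny tabular sketch (mine, not from the paper): the behaviour policy is ε-greedy, but the update bootstraps from the greedy target max_a' Q[s', a'].

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    if rng.random() < epsilon:                  # random move with probability epsilon
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))                 # otherwise act greedily

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[s_next])      # evaluates the *greedy* policy, hence off-policy
    Q[s, a] += alpha * (target - Q[s, a])
```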
-
-
www.umiacs.umd.edu
-
4.3. Natural actor-critic algorithms
we will spend some time in the future talking specifically about actor-critic methods, so it's okay to skip this for now.
-
3.3. Compatible function approximation
ok to skip
-
Example 1. When using a REINFORCE estimator with a baseline b = 0 in a scenario where there is only a single reward of always the same magnitude, e.g., r(x,u) = c ∈ R for all x, u, then the variance of the gradient estimate will grow at least cubically with the length of the planning horizon H as Var{g^RF} = H^2 c^2 sum_{k=0}^{H} Var{∇_θ log π_θ(u_k|x_k)}, (22) if Var{∇_θ log π_θ(u_k|x_k)} > 0 for all k. Furthermore, it will also grow quadratically with the magnitude of the reward c. The mean gradient remains unaltered by the reward magnitude c in this example as the reward is constant.
the variance issue is the key problem with policy gradient approaches.
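A toy Monte Carlo check of the claim (my own sketch, using a Bernoulli policy with parameter p as a stand-in): with constant per-step reward c and no baseline, the estimator is roughly g ≈ (H c) · Σ_k ∇ log π(u_k|x_k), so the variance picks up H²c² from the return and another factor of H from the sum of score terms.

```python
import numpy as np

def reinforce_grad_samples(H, c=1.0, p=0.6, n=20000, seed=0):
    rng = np.random.default_rng(seed)
    acts = rng.random((n, H)) < p                       # H Bernoulli(p) actions per episode
    score = np.where(acts, 1.0 / p, -1.0 / (1.0 - p))   # d/dp log pi(u_k) for each action
    return (H * c) * score.sum(axis=1)                  # (constant return) times summed score function

for H in (10, 20, 40):
    print(H, np.var(reinforce_grad_samples(H)))         # roughly 8x per doubling of H: cubic growth
```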
-
or when a common history of random numbers (Glynn, 1987; Kleinman, Spall, & Naiman, 1999) is being used (the latter trick is known as the PEGASUS method in reinforcement learning,
this is a useful trick :)
-
- Sep 2016
-
arxiv.org
-
K_te = 10
unfortunate that this goes down. maybe because K_tr=6?
-
moderate
s/moderate/more-or-less-nil/
-
5.11/79.32
Why does this hurt so much? Is it because K_tr = 6? If K_tr = 1, would it recover?
-
We use simple 0/1 costs in defining the Δ function
We've found it helpful to have Delta = 0 when correct, 1 when action is correct but tag is wrong, and 2 when action is wrong. I wonder if this would help here. (See, eg, http://arxiv.org/abs/1406.1837)
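A tiny sketch of the cost I have in mind (names are mine, purely illustrative):

```python
def delta(pred_action, pred_tag, gold_action, gold_tag):
    if pred_action != gold_action:
        return 2            # wrong action: worst case
    if pred_tag != gold_tag:
        return 1            # right action, wrong tag
    return 0                # fully correct
```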
-
We use simple 0/1 costs in defining the Δ function
I assume this means "did I predict the t-th word correctly or not", which is kind of a weird loss function, especially if we're evaluating with BLEU.
-
capabilities of text-generation systems
See also Soricut & Marcu, 2005, "Towards developing generation algorithms for text-to-text applications" https://www.isi.edu/~marcu/papers/idl-nglm-acl05.pdf
-
Accordingly, we also found it useful to use a "curriculum beam" strategy in training
Also disappointing :(
-
We therefore found it necessary to pre-train the model using a standard, word-level cross-entropy loss as described in Section 3
This is disappointing and I'm very curious why this happens. Maybe because of the delayed LaSO update? It seems like that shouldn't be sufficient to cause this. Maybe it's just too much exploration? I wonder why this doesn't seem to be an issue for linear models.
-
When the gold sequence falls off the beam at t = 4, search resumes with S_5 = succ(y_{1:4}), and so all subsequent predicted sequences have y_{1:4} as a prefix and are thus functions of h_4.
Key idea for getting from T^2 to T.
-
succ(y_{1:t-1})
There's sort of a weird thing going on with time-stamps here, where you assume some sort of 1-1 mapping between predicted words and gold words.
-
If there was a margin violation, S_t is constructed as the K best sequences assuming the gold history y_{1:t-1} through time-step t-1
Note: this potentially re-introduces overexposure.
-
BLEU score
Note: I don't think this is going to allow the learning algorithm to detect compounding error, unless there's something very clever going on that I haven't seen yet.
-
Above, the Δ(ŷ^(K)_{1:t}) term denotes a mistake-specific cost-function, which allows us to scale the loss depending on the severity of erroneously predicting ŷ^(K)
This is great! The inability of LaSO to deal with costs was one of the reasons I kind of abandoned it.
-
We now define a loss function that gives loss each time the score of the gold prefix y_{1:t} does not exceed that of ŷ^(K)_{1:t} by a margin:
This is a cool way of getting a middle ground between LaSO and structured SVMs that I haven't seen before!
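My reading of the per-time-step loss, as a sketch (not the authors' code): the gold prefix has to beat the K-th (last) sequence on the beam by a margin, and the violation can be scaled by the mistake-specific cost Δ from above.

```python
def prefix_margin_loss(score_gold_prefix, score_kth_beam, delta_cost=1.0, margin=1.0):
    violation = margin - (score_gold_prefix - score_kth_beam)
    return delta_cost * max(0.0, violation)   # zero loss once gold outscores the K-th beam item by the margin
```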
-
Ideally we would train by comparing the gold sequence to the highest-scoring complete sequence
This is essentially what a structured SVM would do, modulo slack rescaling (aka loss-augmented decoding).
-
In practice, a simple beam search procedure that explores K prospective histories at each time-step has proven to be an effective decoding approach
My understanding, at least in MT-land, is that one can get away with shockingly small beams: like 5-10 elements!
-
∇_{h_t}L ← ∇_{h_t}L + BRNN(w_{t+1}, h_t, ∇_{h_{t+1}}L)
I find this notation really confusing. I know what they're doing because I know how backprop through time works, but this seems to be overloading \nabla_{h_t}L in a way that I have a hard time parsing.
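Here is how I would write the same accumulation for a plain tanh RNN with h_{t+1} = tanh(W x_{t+1} + U h_t); this is my own sketch of what I think the BRNN term computes, namely the gradient passed back through the step-(t+1) recurrence and added to the local gradient at h_t.

```python
import numpy as np

def bptt_hidden_grads(local_grads, hs, U):
    """local_grads[t]: dL_t/dh_t from the loss at step t; hs[t]: hidden state h_t; U: recurrent matrix."""
    T = len(local_grads)
    grad_h = [np.array(g, copy=True) for g in local_grads]
    for t in range(T - 2, -1, -1):
        # backprop through h_{t+1} = tanh(W x_{t+1} + U h_t): multiply by tanh' then by U^T
        back = U.T @ (grad_h[t + 1] * (1.0 - hs[t + 1] ** 2))
        grad_h[t] += back                      # the "+=" is the overloaded update in the quoted equation
    return grad_h
```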
-
In this work we are agnostic to the form of the encoding model, and simply assume an abstract source representation x.
Revisit this later: one presumes that x should be backpropable, and there'll be some form of attention.
-
When it comes to training RNNs, SEARN/DAgger has been applied under the name "scheduled sampling"
LOL
-
We run experiments on three very different problems: word ordering, syntactic parsing, and machine translation
arguably word reordering and machine translation aren't "very different"
-
Loss-Evaluation Mismatch: training uses a word-level loss, while at test-time we target improving sequence-level evaluation metrics, such as BLEU
I think there's another issue too: compounding error.
-
In this work, we introduce a model and training scheme, based on the work of Daumé III and Marcu (2005),
Naturally I appreciate the credit, but I encourage everyone to also/alternatively cite Xu and Fern, ICML 2007, "On Learning Linear Ranking Functions for Beam Search", which corrects some errors in DM05. See http://web.engr.oregonstate.edu/~afern/papers/beam-icml07.pdf
-
-
arxiv.org
-
We call the resulting feature set Basic Pairwise Relative Offsets in Space (B-PROS) because it captures pairwise relative distances between objects in a single screen. More specifically, a B-PROS feature checks if there is a pair of Basic features with colors k1, k2 ∈ {1,...,128} separated by an offset (i,j), where −13 ≤ i ≤ 13 and −15 ≤ j ≤ 15. If so, φ_{k1,k2,(i,j)} is set to 1, meaning that a pixel of color k1 is contained within some block (c,r) and a pixel of color k2 is contained within the block (c+i, r+j). Intuitively, B-PROS features encode information like "there is a yellow pixel three tiles below a red pixel". The computational complexity of generating B-PROS features is similar to that of BASS, though ultimately fewer features are generated. Note that, as described, B-PROS contains redundant features (e.g. φ_{1,2,(4,0)} and φ_{2,1,(−4,0)}), but it is straightforward to eliminate them. The complete feature set is composed of the Basic features and the B-PROS features. After redundancy is eliminated, the B-PROS feature set has 6,885,440 features in total (28,672 + ((31·27·128² − 128)/2 + 128))
Slightly more complicated features. I might not have come up with these.
-
Basic Features: To obtain Basic features we first divide the Atari 2600 screen into 16×14 tiles of size 10×15 pixels. For every tile (c,r) and color k, where c ∈ {1,...,16}, r ∈ {1,...,14}, and k ∈ {1,...,128}, we check whether color k is present within the tile (c,r), generating the binary feature φ_{c,r,k}. Intuitively, Basic features encode information like "there is a red pixel within this region". There are 16·14·128 = 28,672 Basic features in total. It can be computationally demanding to work directly with 160×210 pixel screens with a 128 color palette. To make the screen more sparse, following previous work (e.g. [2, 3]) we subtract the background from the screen at each frame before processing it. The background is precomputed using 18,000 samples obtained from random trajectories
These are pretty straightforward features.
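The description is concrete enough to sketch directly (my own code, untested against the authors'): 14×16 tiles of 15×10 pixels over a 210×160 screen of palette indices, one binary feature per (tile, colour).

```python
import numpy as np

def basic_features(screen):
    """screen: (210, 160) array of Atari palette indices in 0..127 (background already subtracted)."""
    present = np.zeros((14, 16, 128), dtype=bool)
    for r in range(14):                        # 14 rows of 15-pixel-tall tiles
        for c in range(16):                    # 16 columns of 10-pixel-wide tiles
            tile = screen[r * 15:(r + 1) * 15, c * 10:(c + 1) * 10]
            present[r, c, np.unique(tile)] = True
    return present.reshape(-1)                 # 14 * 16 * 128 = 28,672 binary features
```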
-
-
www.semanticscholar.org
-
seven Atari 2600 games
for a more extensive evaluation, see the nature paper here: http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
-
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning
This claim appears to be false. See Schmidhuber's comments here: https://plus.google.com/100849856540000067209/posts/eLQf4KC97Bs?pid=6119853279954822354&oid=100849856540000067209
-
The games Q*bert, Seaquest, Space Invaders, on which we are far from human performance, are more challenging because they require the network to find a strategy that extends over long time scales.
Key point.
-
raw RGB screenshots
greyscale?
-
Another, more stable, metric is the policy's estimated action-value function Q, which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state. We collect a fixed set of states by running a random policy before training starts and track the average of the maximum predicted Q for these states. The two rightmost plots in figure 2 show that average predicted Q increases much more smoothly than the average total reward obtained by the agent and plotting the same metrics on the other five games produces similarly smooth curves
I totally don't get this.
-
Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude.
This seems pretty dubious to me.
-
The number of valid actions varied between 4 and 18 on the games we considered
This is game-specific information; it's not clear why you can't just learn which actions do anything for a given game.
-
the function φ from algorithm 1 applies this preprocessing to the last 4 frames of a history and stacks them
So, features are: the last four frames, in greyscale, shrunk and clipped, stacked together. Four frames at 60Hz is about 67ms, which is basically enough to get instantaneous motion, but not much more. Note that they're NOT encoding past actions (or rewards) explicitly, though this is probably implicit in the four most recent frames.
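A rough sketch of the input pipeline as I read it (my own code; the exact resize/crop and the OpenCV calls are stand-ins, not the paper's):

```python
from collections import deque
import cv2                                      # assuming OpenCV is available for the resize
import numpy as np

def preprocess(frame):                          # frame: (210, 160, 3) RGB screen
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84)).astype(np.float32) / 255.0

history = deque(maxlen=4)                       # the last 4 preprocessed frames

def phi(new_frame):
    history.append(preprocess(new_frame))
    while len(history) < 4:                     # pad at the start of an episode
        history.appendleft(history[0])
    return np.stack(history, axis=0)            # network input, shape (4, 84, 84)
```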
-
testbed that presents agents with a high dimensional visual input (210×160 RGB video at 60Hz)
note: they're not going to actually use all this information
-
Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
For experience replay
note: this is slightly different from the ER I talked about last time, in that there's no matching of states
-
Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.
good point!
-
random minibatch
note: does not necessarily include s_t ... my understanding is that this is slightly unusual and many people always ensure that s_t is among the minibatch
-
With probability ε select a random action a_t, otherwise select a_t = max_a Q(φ(s_t), a; θ)
Epsilon-greedy exploration
-
deep Q-learning
unnecessary branding :P
-
Since using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on fixed length representation of histories produced by a function φ.
Standard deep learning sell: "no feature engineering." Truth: "still some feature engineering" :)
-
Note that this algorithm is model-free: it solves the reinforcement learning task directly using samples from the emulator E, without explicitly constructing an estimate of E. It is also off-policy: it learns about the greedy strategy a = max_a Q(s, a; θ), while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is often selected by an ε-greedy strategy that follows the greedy strategy with probability 1 − ε and selects a random action with probability ε.
Hopefully this is all clear, but this is a nice concise definition of model-free and off-policy.
-
Q(s', a'; θ_{i-1})
in this case, they're treating Q(s') as a constant as far as the gradient is concerned
-
-
www.umiacs.umd.edu
-
The Cartpole Regulator Benchmark
Video here: https://www.youtube.com/watch?v=J7E6_my3CHk
-
Another heuristic that we found helpful, is to add 'artificial' training patterns from the goal region, which have a known target value of 0. This technique 'clamps' the neural value function to zero in the goal region, and we therefore call it the hint-to-goal heuristic
Works only if you know what "goals" look like ahead of time.
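A sketch of how I understand the heuristic (illustrative names, not the paper's code): you need to be able to enumerate, or at least sample, goal-region states up front.

```python
def add_goal_hints(patterns, goal_states, actions):
    """patterns: list of ((state, action), target) pairs used for the fitted-Q regression."""
    for s in goal_states:                    # the goal region must be known a priori
        for a in actions:
            patterns.append(((s, a), 0.0))   # clamp the value function to 0 at the goal
    return patterns
```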
-
The Mountain Car Benchmark
Might be useful to look at https://www.youtube.com/watch?v=0_xeAD6ZgVU for a sense of how you have to solve this problem.
-
Since we are interested in methods that can directly learn on real systems, our preferred measure of learning effort is the number of cycles needed to achieve a certain performance
This is a good evaluation, but tricky because of randomness.
-
They can be seen as a special form of the 'experience replay' technique [Lin92], where value iteration is performed on all transition experiences seen so far.
See section 2.3 of https://webdocs.cs.ualberta.ca/~sutton/lin-92.pdf for more information on experience replay. This will come up again, so I want to spend a bit of time discussing it.
An experience is a (s,a,r,s') tuple. The idea is that you store in some separate memory a set of past experiences. When you make a normal Q update for a "new experience" you also re-update yourself for past experiences. Importantly these past experiences might have slightly different targets, because the Q for their s' might have changed.
It's often the case that only past experiences that agree with the current policy are replayed. (Side note: could probably get around this restriction with some form of importance sampling.)
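A minimal sketch of that mechanism (my own code): stored transitions are replayed through the normal update, and because the update recomputes its target from the current Q, an old experience can get a different target than it did originally.

```python
import random

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = []
        self.capacity = capacity

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))
        self.buffer = self.buffer[-self.capacity:]       # keep only the most recent experiences

    def replay(self, Q, update, n=32):
        for s, a, r, s_next in random.sample(self.buffer, min(n, len(self.buffer))):
            update(Q, s, a, r, s_next)                   # target is recomputed from the *current* Q inside update
```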
-
Sample Setting of Costs
This section appears to be a complicated way of saying "do the normal thing."
-
ul
I'm pretty sure "u" is supposed to be "a"
-
c(s_l, u_l, s_{l+1})
Key question: what are the costs?
-
The consideration of the entire training information instead of on-line samples, has an important further consequence: It allows the application of advanced supervised learning methods, that converge faster and more reliably than online gradient descent methods. Here we use Rprop [RB93], a supervised learning method for batch learning, which is known to be very fast and very insensitive with respect to the choice of its learning parameters
This is less true now than it was 10 years ago. We have very good stochastic learning algorithms, like adagrad, adadelta, adam, rmsprop, etc. I don't think this is a big deal anymore.
-
For example, a squared-error measure like the following can be used: error = (Q(s, a) − (c(s, a) + γ min_b Q(s', b)))^2. At this point, common gradient descent techniques (like the 'backpropagation' learning rule) can be applied to adjust the weights of a neural network in order to minimize the error
This is exactly the update scheme we talked about last time. You have the current predicted value Q(s,a), which is approximated by a neural network, and the desired term, c(s,a)+gamma min_b Q(s',b), which is the target. We minimize squared error between the prediction and the target. NOTE: my understanding is that when gradients are taken on this, they are ONLY taken with respect to the prediction Q(s,a) and NOT with respect to min_b Q(s',b). This is kind of weird, now that I think about it, but I'm pretty sure that's how this is done.
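Here is the point about gradients in a small PyTorch-style sketch (mine, not the paper's; NFQ actually refits in batch with Rprop, but the stop-gradient issue is the same): the target is wrapped so that no gradient flows through min_b Q(s', b).

```python
import torch

def q_update_loss(q_net, s, a, cost, s_next, gamma=0.95):
    pred = q_net(s)[a]                                 # Q(s, a): the only term we differentiate
    with torch.no_grad():                              # target treated as a constant
        target = cost + gamma * q_net(s_next).min()    # c(s, a) + gamma * min_b Q(s', b)
    return (pred - target) ** 2
```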
-
Many of these problems arise, since the representation mechanism in a multi-layer perceptron is not local, but global: A weight change induced by an update in a certain part of the state space might influence the values in arbitrary other regions - and therefore destroy the effort done so far in other regions
This is essentially the key issue, both in theory and in practice (but in particular in theory) for function approximation in RL.
-
- Jul 2016
-
alexander-schwing.de
-
We replaced the linear pairwise function in Eq. (6) by a one-layer MLP, while keeping the other settings identical
take left and right characters, take difference, use that as input to MLP
-
Counting numbers c_r are employed to weight the marginal entropies.
counting numbers are used to approximate the entropy. loopy BP has counting numbers that approximate the tree-structured case, but you can do more.
-
1st order Markov / Two Layer
too many #s, but there seems to be an effect where depth can partially substitute for structure. for instance, in the H1=512 setting, we have: d2o1 ~= d1o2. but d2o0 is never really competitive, so some structure is a clear win. but does the NN get to look at neighboring images? i don't think so. that seems like an important comparison.
-
Joint training helps
presentational comment: I love how these results are presented :) although it would be nice if they were quantified in the paragraph so I didn't have to flip back and forth to the table, and significance tests would be nice so we really could conclude "X helps". maybe even do an anova or something ;). but i really like that the main hypotheses are being teased out!
-
by a one-layer MLP, while keeping the other settings identical
not clear to me what this means. input is onehot for left- and right-labels, then one hidden layer, then score?
-
define all pairwise interactions via
no features on edges (just bias)
-
However, assuming all counting numbers c_r to be positive, we can derive an algorithm that is able to interleave minimization w.r.t. w and maximization of the beliefs b.
as far as I can tell, this is nearly identical to Meshi et al., ICML 2010, "Learning efficiently with approximate inference via dual losses". the basic idea is to construct the dual of the inner optimization, turning the problem into a min/min problem, and then doing block coordinate descent over the resulting primal w's and the dual variables (of the b's).
-
we require a sub-gradient w.r.t. w
not sure we really care about subgradients here, especially if F is non-convex, but we still need gradients in the cases where gradients exist, which is what is handled in the remainder.
-
The approximated non-linear structured prediction task
This derivation is essentially a standard derivation for CRFs, regardless of whether the score function is linear or non-linear. The bs are local beliefs, f are the region-based scores (i.e., clique scores), and the entropy is approximated with counts c. The polytope over which the bs is optimized is given below and to the right.
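Spelled out (my reconstruction from the description here, notation mine), the inner belief optimization looks roughly like:

```latex
% Beliefs b_r over region configurations y_r, region scores f_r(y_r; x, w) from the
% deep net, counting numbers c_r, local entropies H(b_r); L is the local polytope.
\max_{b \in \mathcal{L}} \;\; \sum_{r} \sum_{y_r} b_r(y_r)\, f_r(y_r; x, w)
  \;+\; \sum_{r} c_r\, H(b_r),
\qquad H(b_r) = -\sum_{y_r} b_r(y_r) \log b_r(y_r)
```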
-
since a margin term is easily incorporated
huh? Is this just saying "we could alternatively use structured hinge loss if we wanted?"
-
parameter ε ≥ 0 is used to adjust the uniformity of the distribution
I don't understand the reason for doing annealing.
-
We assume the space of valid configurations to be a product space, i.e., Y = ∏_{i=1}^N Y_i, and the domain of each individual variable y_i to be discrete, i.e., Y_i = {1, ..., |Y_i|}
this is a pretty limiting assumption, but makes it fit in the MRF setting
-
Piece-wise training is, however, suboptimal as the deep features are learned while ignoring the dependencies between the variables of interest
motivation for structure+deep
-
Towards this goal, we propose a training algorithm that is able to learn structured models jointly with deep features that form the MRF potentials. Our approach is efficient as it blends learning and inference
main goal
-