584 Matching Annotations
  1. Oct 2020
    1. may reflect disparate processes in unique cognitivecircumstances, these aforementioned processes have all been spe-cifically associated with ACC function during attention orientationand/or action selection

      This paper does not discredit the idea that different types of N2 and the reward positivity are epistemically different processes

    Annotators

    1. These differences cannot beattributed to greater engagement of the NE system in the ActiveExperiment (rather than greater DA system engagement) becausegreater NE release would have produced alargernegativity toinfrequent reward feedback

      This is how DA and NE are dissociated in terms of their predictions!

    2. in the Active Experiment, the rawN2 to infrequent reward feedback was significantly smaller thanthe raw N2 to infrequent reward feedback in both the Passive andModerate Experiments, suggesting that greater DA system engage-ment resulted in a larger DA-associated positivity that attenuatedthe raw N2.

      So here DA is independent of NE

    3. interaction of reward condition andfrequency condition such that the effect of reward was largerin the infrequent condition than in the frequent condition

      This is interesting! NE modulated DA so that it obscures N2 even more?

    Annotators

  2. Sep 2020
    1. identical task stimuli (colored faces) presented with identicaltask designs

      It is compelling that task set can yield such a difference -> it is definitely somewhat goal-related

    2. variablescalp distribution dependent on relative engagement of the dif-ferent cortical areas giving rise to the ERP

      So then neither N2 nor P3 have a very easy to localize or 'centered' scalp distribution -> but the N2 does have a localizable distribution (acc?)

    Annotators

    1. By taking into account P3b versusP3a effects and latency information, it may be possible to con-sider surprise in the context of other mental states contributing togoal-oriented behavior

      If ACC is very goal-heavy in its processing signature this could be something to look into!

    2. cannotdetermine whether the P300 modulation was purely due to thesurprise conveyed by the visual stimuli, or whether it was relatedto the response selection on each trial

      P300 in passive viewing, or related to the giving of response (motor preparatory process could very well play a role! would not happen in the case of passive viewing)

    3. P300 reflects the arrival of a phasic norepinephrine (NE) signal incortical areas, which serves to increase signal transmission in thecortex

      This could be interesting, especially if it is an LC-NE system link! Pupillometry will add some info in that case, and read the papers about acc-Lc-Ne

    4. compared our model to an alternative mea-sure of surprise based on the Kullback–Leibler divergence.

      KL-divergence is what is used in the O'Reilly study right? So there are some differences here. There, there is a clear behaviourally relevant model update (which might be reflected in some other process, although perhaps not even EEG measurable, since fERN seems more concerned with expectations?) Expectations of errors can of course be part of a task-model...

    5. identity of trials on which participants responded er-roneously, trials that were rejected during the preprocessing of the EEGdata, and a constant term.

      nuisance modeled at the same time!

    6. regressor models variance related to stimulus probabilitywithin a block and does not take into account any learning

      Simply assume structure is known and use that to predict p300

    7. quantified in terms of how much they change posteriorbeliefs.

      So if prediction error is used to update the model, this is also intuitive -> in fact, in an RL setting, the notion of 'surprise' as prediction error, and 'model update', are identical.

    8. P300 has commonly been linked to therevision of a participant’s expectation about the current task con-text

      Context -> do they mean structure or task-set (for reward outcome?)

    Annotators

    1. Thepresent findings suggest that the variance in fERN amplitudeacross conditions results more from the effect of unpredictedpositive feedback than from unpredicted negative feedback

      So negative prediction error does not yield the same attentuating impact on N200 measurement as positive prediction error does for enhancing it?

    2. essentialdifference between conditions is associated with neural activityon correct trials rather than neural activity on error trials

      This is an important distinction to make - N200 as component shows mostly responsivity to infrequency. Original definition of fERN as a difference wave with correct trials was interpreted as different from N200, assumed because of corrective processes taking place after incorrect feedback. However, this study picks N200 vs fRxx apart and shows, difference in difference wave vs N200 is caused by more negativity on infrequent correct feedback trials

    3. error–oddball differ-ence should be smaller than the correct–oddball difference

      Makes sense - If something else is happening on the correct feedbacks that makes them look N200 absent

    4. in-frequent targets in an oddball task and infrequent error feedbackin a time estimation task both elicit a frontal or frontal-centrallydistributed, negative-going component that reaches maximumamplitude at approximately 280–310 ms.

      Suggestive of coming from an identical source with identical function!

    5. first by examining the ERPs directly, and second by per-forming a spatial principal components analysis (PCA)

      This is to get the scalp distribution - That allows to compare possible sources for the N200 and fERN

    6. difference between the ERP oncorrect trials and the N200 should be larger than the differencebetween the fERN and the N200

      If fERN == N200, subtractive difference between them should be very small. The difference with the correct trial ERP (fCRP) would be larger!

    7. it is equally possible that thedifference between the ERPs on correct and incorrect trials arisesfrom a process associated with correct trials rather than witherror trials

      So that would mean the fERN is NOT some component that is elicited by errors specifically - actually it is elicited by the correct feedbacks?? => completely invalidates the idea of behavioural adjustment through ACC inhibitory release?

    8. fERN is elicited by unexpected negativefeedback stimuli, but not by unexpected positive feedback stimuli

      So that is a clear distinction with the N200 - it is modulated by correct v incorrectness? It was also modulated by expectation of the feedback...

    Annotators

    1. resulted from an increase in the amplitude of thefERN, rather than from overlap with a different ERP compo-nent (such as the P300).

      Scalp distribution used to identify source (p300 vs ERN)

    2. Participants made more errors in the hardcondition (76%) than in the easy condition (23%)

      This is of course what is kind of ensured by that moving window approach! smart

    3. correct ERP in the hard conditionfrom the error ERP in the easy condition

      In both cases this is the unexpected condition -> unexpectedness gets removed but what is left?

    4. correct feedback in the hard con-dition

      The ERN would be bigger for correct responses? -> That's funny, opposite of what the name indicates. But it is because it is modulated by what is expected => EXPECTATION error

    Annotators

    1. SR represents the association between the stimulusandfood,andisalsoabletoupdatetherewardfunctionofthefood as a result of devaluation

      But the transitions are initially learned policy-dependent, which means inflexible! This requires SR-dyna style updated, or formed SR through undirected exploration?

    2. A reduced acquisition of conditionedresponding to C and D, compared to F, which was trainedin compound with a novel stimulus

      A was already directly associated with X -> AD and AC to X showed 'blocking' compared to EF which was novel preceded

    3. shifts in value (amount of reward) and identity(reward flavour).

      So it seems the common idea here is: Change in the reward stimulus, not the state-state (prior to reward) transitions. You can change both reward value and identity and see if it has a modulatory effect using this model!

    4. although the RPE correlate has famously been evident insingle units, representation of these more complex or subtleprediction errors may be an ensemble property.

      Perhaps some pattern analysis with fMRI would be able to say things about this...

    5. First, it naturally captures SPEs, as we will illus-trate shortly. Second, it also captures RPEs if reward is oneof the features.

      It can incorporate both SPE of SR and the RPE into one error signal?? -> would allow for cool modeling and dissociation tricks!

    6. expected TD error isthen proportional to the superposition of feature-specific TDerrors,PjdMt(j).

      This is a strong assumption but it is functional - Maybe we should incorporate such an encoding in our model/paper too?

    7. Althoughprediction errors are useful for updating estimates of thereward and transition functions used in model-based algor-ithms, these do not require a TD error.

      TD error is not useful for model-based updates, because these updates can be local and complete, at least as long as there are only local/short-term dependencies (e.g. Markov property in graph)

    8. building on the pioneeringwork of Suri [46], we argue that dopamine transients pre-viously understood to signal RPEs may instead constitutethe SPE signal used to update the SR

      Also a somewhat important reference!

    9. sensitive to movement-related variables such as action initiation and termination

      What do these have to do with value? Modulation based on course of action could be a signal that options are present...! => option-specific value-function

    10. some dopamine neuronsrespond to aversive stimuli

      The exact opposite of what you would expect if response indicates appetite - Seems more like an (unsigned?) prediction error then

    11. natomically segregated projection frommidbrain to striatum

      So DA could have the state-state learning signal, but then it would be segragated from the value-projections which run into PFC / ACC?

    Annotators

    1. Finally, to test if the differences in these measures are sufficiently robust to allowcategorisation, we trained a support vector machine (SVM) classifier to accurately predict trajectories as either model-free, model-based or SR agents (Fig 6A-B). When the decoder was given data from the biological behaviour, the SVM classifier most frequently predicted those trajectories to be an SR agent

      This is a very cool idea! Train classifier on generated RL agent data, and let it then classify a batch of real data -> to what is it most similar?

    2. attributes a role for dopamine in forming sensory predictionerrors (Gardner et al., 2018) - similar to what has been observed experimentally(Menegas et al., 2017; Sharpe et al., 2017; Takahashi et al., 2017).

      This seems like some stuff to check out!

    Annotators

    1. This abs(RPE) signal correlated with anumber of brain regions including an IPS locus anterior to wherewe found SPE correlates at p < 0.001 uncorrected (Figure S3).However, a direct comparison between SPE and abs(RPE) re-vealed a region of both posterior IPS that was significantly betterexplained by the SPE than by the abs(RPE) signal at p < 0.05corrected, as well as a region of latPFC that showed a differenceat p < 0.001 uncorrected

      So there is some degree of overlap: could it be dodgy?

    2. the degree to which pIPS encodes an SPE representationNeuronReward and State Prediction Errors in HumansNeuron66, 585–595, May 27, 2010ª2010 Elsevier Inc.589

      Behavioural relevance is established!

    3. partici-pants had indeed acquired knowledge about the particularsequence of state transitions during the first session: 99.6% ofpermutation samples provided a poorer explanation of choicesthan the original (p = 0.004).

      So earlier latent learning CAN cause for state transition model to emerge -> very useful, do not need to instruct

    4. If participants’ beliefsabout the transition probabilities were updated by error-drivenmodel-based learning (with a fixed learning rate, as assumedin FORWARD), this may have left a bias toward the most recentlyexperienced transitions

      A 'lingering effect' of recent learning, just like the successive contradictory feedback in Holroyd and Coles 2002

    5. volunteers were firstexposed to just the state space in the absence of any rewards,much as in a latent learning design. This provides a pure assess-ment of an SPE

      Is this similar to what Danesh did?

    1. “replay”trajectories that the animal has taken in the past, as well as to follow novel trajectories that the animal has neverbefore taken

      consolidation and integration?

    2. Theta sequences typically occur when an animal is moving

      So if theta synchronizes with the ACC, it can keep updating its environmental model according to task progression?

    3. monkeys use a joystick to control a digital predator pursuing digital prey

      This is in a sense also a foraging task because it is food-seeking? But i dont see effects like depletion and opporunity costs

    4. dlPFC activity correlates with a “state prediction error”

      but the unsigned version - so it is not goal-specific (and could it even be used to update SR in the right direction?)

    5. silencing OFC activity on a particular trialselectively impairs the influence of this expected outcome signal on learning but does not impair choosing.

      So the OFC really does seem to carry a learning signal as well!

    6. “inference”: the ability to combine separately-learned associationsin order to form new associations between items that may never have been encountered together before​

      This is basically association but for unseen combinations - shows a world model is abstracted from earlier knowledge

    7. Multi-stepplanning can be thought of as a process that uses this map to guide an extended sequence of behavior towards adistant goal

      The cognitive map is related to the HPC, and this is why it keeps popping up in ACC research as well -> it is very tightly linked to decision making and model building of also distal relationships!

    Annotators

    1. Notably, these signals were signed based on the subject’sgoal, consistent with a mechanism for determining how much toupdate beliefs and in which direction: toward confirmation (pos-itive) or reconsideration of one’s rewarding goal (negative)

      This seems like something very important to how ACC does model updating as well!

    2. extracted the feedback-locked BOLD response in left lOFC (at 6 s post-feedback onset)at trialtand regressed this against the (signed) change to hippo-campal CSS (i.e., the change in the difference between LC andHC presentations in [ipsilateral] left hippocampus from the pre-ceding blockt0.5 to the subsequent blockt+ 0.5)

      Cool parametric analysis - How did (signed) prediction error influence the difference between LC and HC representations? (do they move further apart or closer together)

    3. the unsignedDKLterm, corresponding to the magni-tude of the belief update, independent of its direction, instead re-cruited a dorsal frontoparietal network, consistent with previousfindings related to unsigned state prediction errors during latentlearning

      So signing the Dkl term is important for identifying the ACC!? This means we have to include some sort of goal perhaps?

    4. stimulus-outcome update effects in lateral OFC/ventro-lateral prefrontal cortex (VLPFC) and also a distributed networkincluding anterior cingulate cortex, inferior temporal cortex, andposterior cingulate cortex

      Relevant: ACC is included in this! Hurray the experiment might be saved :D

    5. reduction in the BOLD response for HC items when comparedto LC items

      Because of accurate prediction -> previous activation? Or because in HC the next stimulus was already 'explained away' in a predictive processing sense!

    6. and the other inferred based on thesubject’s knowledge of the inverse relationship between stimuliand outcomes dictated by the task structure

      There is counterfactual updating possible in this task! Assumes internal model is accurate (but very likely)

    7. In each CSS block, each stimulus-outcome transition was pre-sented once

      So a suppressed representation is yielded for the outcome variable, both for the common and the rare transition

    8. firstselect the more desired gift card goal based on the current po-tential payouts and then reverse-infer the stimulus they believedwould most likely lead to that desired outcome

      So the reward is given, pps have to 'calculate' the optimal path -> separate reward function (given, unlearned) applied over learned model (OR SR!)

    9. little is known about how these different signalsare used in the brain

      There are many types of learning signals - prediction errors for all kinds of domains of cognition, not just reward or sensory. However, only for striatal dopamine (RPE) there is some knowledge of how it affects behaviour / other neural computation

    Annotators

    1. facilitate theupdating of internal models from which future action is gener-ated (55)

      Then could it also perform the function of clustering task sets based on outcomes? We have seen based on contextual cues it doesn't - that seems like a HPC thing

    2. The currentfindings suggest a specific computational functionfor the ACC: It is involved in updating internal models to fa-cilitate future information processing.

      A one-off update to guide future information processing -> it could happen through short term plasticity mechanisms?

    3. “reset”signal for internal models

      So following this theory, the model updating should actually have a positive effect on the pupil diameter because increase of LC-NE activity?

    4. Participants’internal models of the targets’distribution couldbe expected to differ from the true (generative) distribution

      Exactly: So how do you determine the model used by participants?

    Annotators

    1. mPFC may track changes in activity patterns in regions with community-based representational similarity, providing a signal that could underlie parsing decisions.

      mPFC uses specifically the predictive structure to identify boundaries?

    2. three-layer neural network model (Fig. 6a). The network took input representing the current stimulus and was trained to predict which stimulus would occur next.

      predictive RNN -> should lead to successor rep?

    3. within the set of Hamiltonian paths, the probability of transitioning from one cluster boundary node (one of the pale nodes in Fig. 1a) to the adjacent one, if not yet visited, is always exactly

      otherwise the hamiltonian walk cannot be made

    4. set of possible successor items on each step depends only on the current item, this uniformity in transition prob-abilities holds whether one takes into account only the most recent item or the n most recent items

      True uniformity! not some secret non-markov transition probability influence

    5. items will fall close together in representational space when they are preceded and followed by similar distributions of items in familiar sequence

      i.e. have similar successor representations.

    Annotators

    1. prediction errors in the orbitofrontal cortex dur-ing active learning predict later changes in hippocampal representations of the stored model(Boorman et al., 2016)

      So it is OFC and not ACC?

    2. Indeed, we note that neural signals can be recorded in frontal and parietal cortices, reflectingthe ‘state-prediction errors’ that ensue when predicted state relationships are breached duringbehavioural control (Gla ̈scher et al., 2010)

      This seems very important

    3. We hypothesised that implicit knowledge about the graph structurewould influence response times, such that subjects would respond faster if a preceding object in thetest sequence was closer on the graph structure underlying the train sequence. Indeed, we foundthat log-transformed response times were longer the further away the preceding object was on thegraph (Figure 5C, D)

      Cool behavioural validation!

    4. communicabilitysignificantly distorts the graph structure by shortening links that form part of many paths around thegraph structure and lengthening links that would be less frequently visited by a random navigator

      So it is like a random exploration SR representation, but no need to fit gamma

    5. it was the symmetrised version alonethat predicted the fMRI suppression effect

      It is not the actual number of experienced transitions between graph nodes -> this is a good validation for the representation of an abstract relational structure, and that it's not some sort of experiential or temporal effect

    6. can be read out directly from functional magnetic resonance imaging(fMRI) data in the entorhinal cortex

      Readable coding from entorhinal cortex -> how to compare with goal as shown by Yoo et al (the neural basis of predictive pursuit)? Should be the same, but entorhinal vs dACC is different structure...?

    Annotators

    1. rela-tive balance between ‘state prediction errors’ and ‘reward prediction errors’ may be used for arbitration in an MB–MF hybrid learner

      Talk about state prediction errors in MF-MB literature?

    2. we chose to linearly combine the ratings from MB and SR algorithms

      So SR sort of becomes the replacement for MF -> explains why after proper learning / caching, increased alternative task demands don't have detrimental effect on MBness?

    3. response times were slower in the transition revaluation condition compared with both the reward revalu-ation condition (t57= 2.08, P< 0.05) and the control condition (t57= 4.04, P< 0.00

      This could be a signature some model-based planning is grinding the participants mental gears

    4. indicate their preference

      So they had some mild agency / incentive to pay attention to the transitional structure! Maybe we can have specific states in some community (B) pay out a small reward, and later validate this with choices from community A -> to which would you transition? -> choose bottleneck to community B

      Preference check is attentional / structure learning check

    5. Experiment 1 used a passive learning task, which permitted the simplest possible test of the the-ory, removing the need to model action selection.

      Passive learning task of the SR

    6. Specifically, they learn and store a one-step internal representation or model of the short-term environmental dynamics: specifically, a state transition function T and a reward function R

      So learning T is substantially different from learning M. Learning T might not be more difficult, but using it for planning will take more time than using M. This is the benefit of SR -> computational time at decision time

    1. posterior dorsomedial striatum (Oh et al., 5312014; Hintiryan et al., 2016), a region necessary for learning and expression of goal directed action 532as assessed by outcome devaluation

      So learning signals could also come from here?

    2. Likewise, neuroimaging in a saccade task in which subjects constructed and updated a model of the 522location of target appearance observed ACC activation when subjects updated an internal model of 523where saccade targets were likely to appear, (O’Reilly et al., 2013)

      This is the update v no update oddball task

    3. , neuroimaging in the Daw two-step task has identified representation of model-based value 520in the BOLD signal in anterior- and mid-cingulate regions

      This is where ACC relevance in two-step task comes in explicitly!

    4. In 465humans, extensive training renders apparently model-based behaviour resistant to a cognitive load 466manipulation (Economides et al., 2015) which normally disrupts model-based control (Otto et al., 4672013), suggesting that it is possible to develop automatized strategies which closely resemble 468planning.

      This seems like a possible SR influence as well!

    5. ACC inhibition on model-based control because the effects would not survive multiple comparison 428correction for the large number of model parameters.

      So do not conclusively cite this study on anything specific

    6. xed and are known to be so by 302the human subjects

      So with fixed transition probabilities people will learn the interaction effect in the logistic regression. However, when participants also have to learn the transitional structure, the logistic regression will show effects separate for transitions and rewards, and no interaction!

    7. The absence of transition-outcome interaction has been used in the 298context of the traditional Daw two-step task (Daw 2011) to suggest that behaviour is model-free. 299However, we have previously shown (Akam et al. 2015) that this depends on the subjects not 300learning the transition probabilities from the transitions they experience.

      So this might be a crucial theoretical problem for typical 2-step tasks?

    8. Subjects learned the task in 3 weeks

      This is where the difference between human subjects and animals comes into play: Humans can be instructed to ínstantly grasp the task. Will that be a major issue in performing a similar setup with humans?

    9. Reversals in which first-step action (high or low) had higher reward 216probability, could therefore occur either due to the reward probabilities of the second-step states 217reversing, or due to the transition probabilities linking the first-step actions to the second-step states 218reversing.

      That does seem like a convincing dissociation that the brain might want to keep separate!

    10. except on transitions to neutral 214blocks, 50% of which were accompanied by a change in the transition probabilities

      So sometimes here we have a double update required?

    11. At block transitions, either 213the reward probabilities or the transition probabilities changed

      so you have a dissociation between updates on poke transitions and reward outcomes from these pokes, thats clever

    12. introduction of reversals in the 189transition probabilities mapping the first-step actions to the second-step states. This step was taken 190to preclude subjects developing habitual strategies consisting of mappings from second-step states 191in which rewards had recently been obtained to specific actions at the first step (e.g. rewards in 192state X  chose action x, where action x is that which commonly leads to state X). Such strategies 193can, in principle, generate behaviour that looks very similar to model-based control despite not using 194a forward model which predicts the future state given chosen action (

      Is this the Dezfouli and Balleine style decisions they are referring to?

    13. in each second-step state there was a single action rather than a 181choice between two actions available, reducing the number of reward probabilities the subject must 182track from four to two

      This is how the 'burden' on participants is relieved - they can now focus more on learning the state transitions. But doesn't this identify the reached second stage with guaranteed reward, hence allowing direct encoding of reward to be a confound?

    14. model-based mechanisms 171which learn action-state transition probabilities and use these to guide choice.

      slightly different representation of the SR (in the plos paper it would be the H matrix of SR-Dyna?)

    15. optogenetic silencing of ACC neurons on individual trials reduced the influence of the 110experienced state transition on subsequent choice without affecting the influence of the trial 111outcome

      HEAVILY IMPLIES ACC AS LEARNING STATE-TO-STATE TRANSITIONS (signalling the error for SR updates?)

    16. developing a new version in which both the reward probabilities in the leaf states of the decision 103tree and the action-state transition probabilities change over time.

      This seems like a sensible approach - now also transitions have to be learned. My understanding was no other experimentalists really did this because it would make the setup too involved and difficult for participants to grasp.

    17. Firstly, the ACC provides a massive input to posterior dorsomedial 54striatum (Oh et al., 2014; Hintiryan et al., 2016), a region critical for model-based control as assessed 55through outcome-devaluation

      Outcome devaluation is also present in SR however!

    1. failurestoflexiblyupdatedeci-sionpoliciesthatarecausedbycachingofeitherthesuccessorrepresentation(asinSR-TDorSR-Dynawithinsufficientreplay)ora decisionpolicy(asinSR-MB)shouldbeaccompaniedbyneuralmarkersofnon-updatedfuturestateoccupancypredictions

      Good basis for some kind of experiment?

    2. unlikethehippocampus,partsofthePFCappeartobeinvolvedinactionrepresentationinadditiontostaterepresentation

      This could be relevant for the SR-Dyna model where there is H(a,s) matrix

    3. valueweightswouldbeleanedbyneuronsconnectingthehippocampustoventralstriatum,inthesameTDmannerdiscussedinthispaper

      Or perhaps a detectable error with ERP produced in ACC?

    4. [74]demonstratedthata sophisticatedrepresentationthatincludesrewardhistorycanpro-ducemodel-basedlikebehaviorinthetwo-steprewardrevaluationtask

      Could we be able to decode first-stage action at second stage decision time from ACC potentially? Maybe when trained with an RNN?

    5. SR-Dynacansupportrapidactionselectionbyinspectingitslookuptable

      It basically does tree-search beforehand, and caches the values it finds from the tree search, so they can be applied directly at task-time

    6. ThissimulationdemonstratesthatSR-Dynacanthusproducebehavioridenticalto“full”model-basedvalueiterationinthistask

      And it is still computed using DA plausible TD techniques. Only the representational space gets more and more complex, and the Replay mechanic gets added.

    7. SR-TDandSR-MBarethus“on-policy”methods–theirestimatesofVπcanbecomparedtotheestimatesofa traditionalmodel-basedapproach

      So they do not generalize to other policies necessarily! Though M learned under a uniform, fully random policy should consititute the accurate transition model?

    8. SR-MBlearnsa one-steptransitionmodel,Tπandusesit,atdecisiontime,toderivea solutiontoEq9

      With eq.9. being the solution for M. So in essence it is a 'double' SR? It holds a model for the long run expectancies in M, but which is composed from the one-step expectancies learned in T

    9. Crucially,despitethefunctionalsimi-laritybetweenthisruleandtheTDupdateprescribedtodopamine,wedonotsuggestthatdopaminecarriesthisseconderrorsignal

      So no error detectable with ERP technique will probably guide us in this direction of SR-TD? Depends on content of paper in ref 55.

    10. BecauseMπreflectslong-runcumulativestateoccupancies,ratherthantheindividualone-steptransitiondistribution,P(s’|s,a),SR-TDcannotadjustitsvaluationstolocalchangesinthetransitionswithoutfirstupdat-ingMπatdifferentlocations.

      A 'smarter' algorithm could evaluate M completely if it learns a local change, which could even be learned through TD (only relevant vector s -> s').

    11. SR-TDcan,withoutfurtherlearning,producea newpolicyreflectingtheshortestpathtotherewardedlocation

      And then, using learned M(pi) from random policy and rewarded state by direct placement, it can compute shortest path from any random initialization!

    12. firstexploresthegrid-worldran-domly,duringwhichit learnsthesuccessormatrixMπcorrespondingtoa randompolicy

      It can learn M(pi) according to a random policy without any reward entering the system ever!

    13. gMpðs0;:Þ

      So it makes M(s,:) a little bit more similar to M(s',:)? But it will become an aggregate of all possible s' since you could transition to any (with a certain probability) -> weighted exactly by transition structure! Weighting happens slowly over time hence source of inflexibility to transitioning.

    14. Thethreemodels

      So actually, for the learning of M, three different models are proposed in this paper! There is not simply one learning algorithm involved in this SR thing.

    15. approximationwillbecorrectwhentheweightw(s0) foreachsuccessorstatecorrespondstoitsone-stepreward,averagedoveractionsins

      Why average over the actions and not pick the max?

    16. Doya[51]introduceda circuitbywhichprojectionsviathecerebellumper-formonestepofforwardstateprediction,whichactivatesa dopaminergicpredictionerrorfortheanticipatedstate

      State prediction error coming from cerebellum?

    17. thusper-hapsinvolvinganalogous(striatal)computationsoperatingoverdistinct(cortical)inputrepre-sentations[47].

      Would make sense, TD errors and other learning signals could be used by many systems for many different kind of learning or updating

    18. Typically,model-basedmethodsareoff-policy(sincehavinglearnedaone-stepmodelit is possibletouseEq2 todirectlycomputetheoptimalpolicy);whereasdif-ferentTDlearningvariantscanbeeitheron-oroff-policy.

      Model-based is considered off-policy, because we have a pretty complete representation of the environment, which is not biased by taken actions. If the relevant mappings of actions make a difference for the estimation, it would be considered on-policy.

    19. butalsointhemoreflexi-blechoiceadjustmentsthatseemtoreflectmodel-basedlearning

      E.g. error updating from counterfactuals, which suggests model-based inference of reward that was not obtained!

    Annotators

    1. n previousversions, subjects at each stage chose between two symbols insteadof two fixed actions and the symbols moved from side to side ateach trial ensuring there was no consistent mapping between thebutton presses and the symbols

      This could be a major difference for the paradigm! Now the actions are mapped directly and not through an environmental link

    2. decisions that are insensitive to (i) the valuesof the outcomes [8] and (ii) the contingency between specificactions and their outcomes

      Clear statement that HRL sequences can become insensitive to the transition structure of the problem at hand

    3. a first stage action is the best action

      Here they do analyze the difference in expected reward from different first stage actions. Danesh paper dismissed this as intractable for participants

    4. with a small probability (1:7), the rewardingprobability of each key changed randomly to either the high or lowprobability.

      So not the slow drifting as in Daw and in Danesh paper

    Annotators