Great to finally see a working Firefox extension. Now waiting for seamless integration between Memex and Hypothesis.
- Sep 2020
- Sep 2019
-
arxiv.org
-
state embedding in which each dimension is the value function for an auxiliary reward function.
What does this mean?
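I'm not sure this is exactly what they mean, but here's a minimal tabular sketch of one reading: fix a policy, pick a handful of auxiliary reward functions, and let dimension i of a state's embedding be that state's value under auxiliary reward i. Everything below (function names, the random auxiliary rewards) is my own toy, not the paper's construction.

```python
import numpy as np

def value_function(transition, reward, gamma=0.9):
    # Solve v = r + gamma * P v exactly for a small tabular MDP under a fixed policy.
    n = transition.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * transition, reward)

def state_embedding(transition, aux_rewards, gamma=0.9):
    # Embedding dimension i of state s is the value of s under auxiliary reward i.
    values = [value_function(transition, r, gamma) for r in aux_rewards]
    return np.stack(values, axis=1)  # shape: (num_states, num_aux_rewards)

# Toy example: 3 states, a fixed-policy transition matrix, 4 random auxiliary rewards.
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.2, 0.8],
              [0.5, 0.0, 0.5]])
aux = [np.random.randn(3) for _ in range(4)]
print(state_embedding(P, aux))  # each row is the 4-dimensional embedding of one state
```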
-
- Jun 2019
-
deepmindsafetyresearch.medium.com
-
For example, for the commonly-used reversibility criterion, the baseline is the starting state of the environment, and the deviation measure is unreachability of the starting state baseline.
My hunch is that this is terribly computationally inefficient. Not sure though.
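To make my worry concrete: even in the friendliest (deterministic, tabular) setting, checking whether the starting-state baseline is still reachable is a graph search per step. A minimal sketch, with names of my own choosing:

```python
from collections import deque

def is_reachable(transitions, start, target):
    # BFS over a deterministic transition graph: can we get from `start` to `target`?
    # `transitions[s]` is the collection of states reachable from s in one step.
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        if s == target:
            return True
        for nxt in transitions.get(s, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def unreachability_penalty(transitions, current_state, baseline_state):
    # Deviation measure as I read the post: penalize making the baseline unreachable.
    return 0.0 if is_reachable(transitions, current_state, baseline_state) else 1.0
```

Even this toy version can walk the whole reachable set after every action; in stochastic or unknown environments it would have to be approximated, which is where I expect the cost to bite.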
-
A complementary approach is to define a general concept of side effects that would apply across different environments. This could be combined with human-in-the-loop approaches like reward modeling, and would improve our understanding of the side effects problem, which contributes to our broader effort to understand agent incentives.
CIRL is another human-in-the-loop approach.
-
The design specification (which only rewards the agent for getting to point B) is different from the ideal specification (which specifies the designer’s preferences over everything in the environment, including the vase)
The ideal specification is the unknown utility function, and the design specification is the approximation of it via the reward function.
-
Since the agent didn’t need to break the vase to get to point B, breaking the vase is a side effect: a disruption of the agent’s environment that is unnecessary for achieving its objective
Is this a complete definition of side effects?
-
-
arxiv.org
-
knows the reward function
But wasn't the whole point of IRL that the human doesn't know the reward function?
-
-
arxiv.org
-
intentions
Isn't that basically 'rewards'?
-
- Apr 2019
-
deepmindsafetyresearch.medium.com
-
Many of the AI safety problems that have been discussed in the literature are fundamentally incentive problems
Hmmmmmmm.
-
In contrast, since there is no observation incentive for estimated walking distance, there is no point changing the formula that calculates it.
I feel like some weird wireheading can happen here.
-
feed this correct answer into a device that compares the QA-system’s answer against the correct answer
What makes us strongly believe that, in reality, this device is decoupled from the QA system?
-
can be inferred directly from the influence diagram
What about creating these influence diagrams in the first place?
-
The key notation in influence diagrams is that the agent controls decision nodes to optimize utility nodes, while also interacting with chance nodes.
Nice and clever!
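A toy encoding of that notation (my own, not from the post): chance nodes are what the environment sets, the decision node is what the agent chooses, and utility nodes are what it optimizes.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                      # "chance", "decision", or "utility"
    parents: list = field(default_factory=list)

# One-shot diagram: the world state is a chance node, the agent's action is a decision
# node that observes it, and the utility node depends on both.
world   = Node("WorldState", "chance")
action  = Node("Action", "decision", parents=[world])
utility = Node("Reward", "utility", parents=[world, action])

for n in (world, action, utility):
    print(n.kind, n.name, "<-", [p.name for p in n.parents])
```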
-
The objectives are what the agent is ultimately optimizing, such as a loss or a reward function. In turn, objectives induce incentives for events that would contribute to optimizing the objective. For example, the reward function in the ATARI game Pong results in an incentive to move the paddle towards the ball.
So, incentives are just low-level building blocks for objectives?
-
- Mar 2019
-
www.tomeveritt.se
-
Executive Summary
This paper classifies misalignment under three different possible reward function categories:
- Preprogrammed (the standard, most familiar setup)
- Human as External Reward Function
- Interactively Learning (frameworks like CIRL)
It provides multiple examples and creates a taxonomy.
It also provides a few methods to tackle such misalignment.
-
History-based versions avoid restrictive Markov assumptions
If everything is Markovianizable, what is the problem?
-
CoastRunners found a way to get more reward by going in a small circle and collecting points from the same targets, while crashing into other boats and objects
Is this goal misspecification or wireheading?
-
The name “wireheading” comes from experiments on rats where an electrode is inserted into the brain’s pleasure center to directly increase “reward” (Olds and Milner, 1954). Similar effects have also been observed in humans treated for mental illness with electrodes in the brain (Portenoy et al., 1986; Vaughanbell, 2008). Hedonic drugs can also be seen as directly increasing the pleasure/reward humans experience.
There is some dispute, IIRC. Some people say that wireheading doesn't necessarily provide reward, but it just makes sure that the person stays wireheaded.
-
Thus, each example gives rise to an almost complete misalignment between the agent and its designer, indicating that no misalignment source is significantly more benign than the others.
Does this mean there is no half-wireheading?
-
-
arxiv.org
-
Bayesian networks
Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node.
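The practically useful part of the definition is the factorization it implies: the joint distribution is the product of each node's conditional given its parents. A tiny sprinkler-style example with made-up numbers:

```python
# P(rain, sprinkler, wet) = P(rain) * P(sprinkler | rain) * P(wet | rain, sprinkler)
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},    # P(sprinkler | rain): outer key rain, inner key sprinkler
               False: {True: 0.4, False: 0.6}}
P_wet = {(True, True): 0.99, (True, False): 0.8,   # P(wet=True | rain, sprinkler), keyed by (rain, sprinkler)
         (False, True): 0.9, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    p_wet_true = P_wet[(rain, sprinkler)]
    return (P_rain[rain]
            * P_sprinkler[rain][sprinkler]
            * (p_wet_true if wet else 1.0 - p_wet_true))

# Probability that it rained, the sprinkler was off, and the grass is wet:
print(joint(True, False, True))  # 0.2 * 0.99 * 0.8
```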
-
For example, if the agent’s action a modified its reward sensor to always report 1, then this would be represented by the self-delusion function d1(ř) ≡ 1 that always returns observed reward 1 for any inner reward ř
So this setting basically decouples the reward evaluation from the agent. But in the case of subsystem alignment, the issue still persists, IMO.
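A toy version of how I read the setup (my own code, not the paper's notation): the inner reward is whatever the environment actually produces, and the self-delusion function sits between it and what the agent observes.

```python
def identity_delusion(inner_reward):
    # Undeluded case: the agent observes the true inner reward.
    return inner_reward

def d1(inner_reward):
    # The self-delusion function from the example: always report observed reward 1,
    # no matter what the inner reward was.
    return 1

def observed_reward(delusion, inner_reward):
    return delusion(inner_reward)

print(observed_reward(identity_delusion, 0.3))  # 0.3
print(observed_reward(d1, 0.3))                 # 1
```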
-
self-delusion function
How is this different from the Corrupt Reward MDP setting in RL?
-
principal with utility function u
Is this a human?
-
For the chess-playing example, we could design an agent with utility 1 for winning board states, and utility −1 for losing board states.
What? What if the agent just deludes itself into believing that any board state is a winning state? How is it different from an RL agent then?
-
a reward signal that is provided by an external evaluator
In the ultimate scenario, does this agent have to come up with a reward function on its own? This is off-topic, but what if all human values are optimally satisfied and there is nothing more that humanity wants?
-
-
www.hutter1.net
-
AIXI is a formulation of an uncomputable universal artificial intelligence.
This document defines the general reinforcement learning framework in POMDPs and formulates AIXI, which takes a convex linear combination of environments from a model class and updates its belief about the environment using Bayes' rule.
AIXI does all that and also uses the Solomonoff prior to implement Occam's razor, arriving at a mathematically stronger agent.
There are computable approximations of AIXI, though.
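A very simplified, computable caricature of just the mixture-and-update part (not the planning), with a finite model class and a crude 2^-description_length prior standing in for the Solomonoff prior. All of this is my toy, not Hutter's definitions.

```python
import numpy as np

class CoinEnv:
    # Toy 'environment': a biased coin that emits percept 1 with probability p.
    def __init__(self, p, description_length):
        self.p = p
        self.description_length = description_length  # crude proxy for program length
    def likelihood(self, percept):
        return self.p if percept == 1 else 1.0 - self.p

# Finite model class M with a 2^-K style prior (description_length standing in for K).
model_class = [CoinEnv(0.1, 3), CoinEnv(0.5, 1), CoinEnv(0.9, 3)]
prior = np.array([2.0 ** -env.description_length for env in model_class])
prior /= prior.sum()

def bayes_update(belief, percept):
    # Mixture update: reweight each environment by how well it predicted the percept.
    likelihoods = np.array([env.likelihood(percept) for env in model_class])
    belief = belief * likelihoods
    return belief / belief.sum()

belief = prior.copy()
for percept in [1, 1, 0, 1]:   # an observed percept sequence
    belief = bayes_update(belief, percept)
print(belief)                  # posterior over the three coin environments
```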
-
-
model class M
How strong is this assumption?
-
MCTS
Same thing that AlphaGo uses.
-
integer-valued rewards
Does anything change if rewards are real-valued?
-
probability measures
A function which returns 0 for the empty set and 1 for the whole set, and which obeys the countable additivity property (the function applied to a countable union of pairwise disjoint sets equals the sum of the function applied to each set individually).
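Spelled out, for a measure μ on a sample space Ω (standard definition, not specific to this paper):

```latex
\mu(\emptyset) = 0, \qquad \mu(\Omega) = 1, \qquad
\mu\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)
\quad \text{for pairwise disjoint } A_1, A_2, \dots
```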
-
Kleene star
The empty string plus all the things you can make by concatenating elements of your base set.
Example of Kleene star applied to set of strings:
{"ab","c"}* = { ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", "ababc", "abcab", "abcc", "cabab", "cabc", "ccab", "ccc", ...} -
We identify an agent with its policy
In the case of the Occam's Razor Insufficiency paper, they equate a human with a tuple of a reward function and a planner.
-
-
www.cs.uic.edu
-
We employ the principle of maximum entropy, which resolves this ambiguity by choosing the distribution that does not exhibit any additional preferences beyond matching feature expectations (Equation 1). The resulting distribution over paths for deterministic MDPs is parameterized by reward weights θ (Equation 2).
Do you think this means it results in linearity of rewards? I think they are just saying that, already under that assumption, you can write a path's likelihood as a function of the parameters.
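For reference, the distribution I believe they mean (Ziebart et al.'s deterministic-MDP case, written from memory, so treat it as a sketch): path features are summed along the path, and the reward is assumed linear in those features via θ.

```latex
P(\zeta \mid \theta) \;=\; \frac{1}{Z(\theta)} \exp\!\left(\theta^{\top} \mathbf{f}_{\zeta}\right)
\;=\; \frac{1}{Z(\theta)} \exp\!\left(\sum_{s_j \in \zeta} \theta^{\top} \mathbf{f}_{s_j}\right)
```

So the linearity-in-features sits inside θᵀf_ζ as an assumption; the maximum-entropy principle itself only picks the least-committal distribution that matches the feature expectations.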
-