We (the alignment community) think we want corrigibility with respect to some wide set of goals S, but we actually want non-obstruction with respect to S
Prediction
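For reference, a rough formalization of non-obstruction as I understand it from the post (my paraphrase; the exact notation may differ slightly):

$$\forall P \in S:\quad V_P^{\mathrm{pol}(P)}\!\left(s \mid \pi^{AI}\ \text{on}\right) \;\geq\; V_P^{\mathrm{pol}(P)}\!\left(s \mid \pi^{AI}\ \text{off}\right)$$

i.e. for every goal P in the set S, turning the AI on doesn't lower the attainable utility that the human policy pol(P) could otherwise have achieved from state s.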
Trying to implement corrigibility is probably a good instrumental strategy for us to induce non-obstruction in an AI we designed.
Prediction
Can we get negative results, like "without such-and-such assumption on π^AI, the environment, or pol, non-obstruction is impossible for most goals"?
Prediction
Can we prove that some kind of corrigibility or other nice property falls out of non-obstruction across many possible environments?
Prediction
Given an AI policy, could we prove a high probability of non-obstruction under conservative assumptions about how smart pol is?
Prediction
Main idea: we want good things to happen; there may be more ways to do this than previously considered.
Prediction
However, even if we could maximally impact-align the agent with any objective, we couldn't just align it our objective
Random copy edit, but I think this is missing a 'with' and should say: "However, even if we could maximally impact-align the agent with any objective, we couldn't just align it with our objective"
Main idea: We only care about how the agent affects our abilities to pursue different goals (our AU landscape) in the two-player game, and not how that happens. AI alignment subproblems (such as corrigibility, intent alignment, low impact, and mild optimization) are all instrumental avenues for making AIs which affect this AU landscape in specific desirable ways.
Prediction
Main idea: By considering how the AI affects your attainable utility (AU) landscape, you can quantify how helpful and flexible an AI is.
Another claim that could be predicted on!
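To make "quantify" concrete, here is a toy sketch (my own illustration, not code from the post) of measuring how an AI shifts the AU landscape over a goal set S. The function `attainable_utility(goal, ai_on)` is a hypothetical stand-in for V_P^{pol(P)} with the AI on or off:

```python
# Toy sketch: compare the AU landscape with the AI on vs. off across a goal set S.
# `attainable_utility(goal, ai_on)` is a hypothetical oracle for the value the
# human policy for `goal` can attain in each case.
from typing import Callable, Dict, Iterable

def au_landscape_shift(
    goals: Iterable[str],
    attainable_utility: Callable[[str, bool], float],
) -> Dict[str, float]:
    """For each goal P in S, how much does turning the AI on change attainable utility?"""
    return {g: attainable_utility(g, True) - attainable_utility(g, False) for g in goals}

def is_non_obstructive(shift: Dict[str, float]) -> bool:
    """Non-obstruction (roughly): no goal in S is made worse off by activating the AI."""
    return all(delta >= 0 for delta in shift.values())

# Made-up numbers purely for illustration:
toy_values = {
    ("write a novel", True): 0.9, ("write a novel", False): 0.6,
    ("cure a disease", True): 0.7, ("cure a disease", False): 0.7,
}
goals = ["write a novel", "cure a disease"]
shift = au_landscape_shift(goals, lambda g, on: toy_values[(g, on)])
print(shift)                      # per-goal change in attainable utility
print(is_non_obstructive(shift))  # True: nothing in S got worse
```

Averaging the per-goal changes is one crude way to score "helpfulness," and how evenly the gains spread across S gives a rough sense of "flexibility."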
Main claim: corrigibility’s benefits can be mathematically represented as a counterfactual form of alignment.
This could be a prediction for people to express a credence on: "Corrigibility's benefits can be mathematically represented as a counterfactual form of alignment"
Various biosecurity risks
P(AGI comes before biosecurity risks), P(AGI comes before stable dystopias), etc.
there was just no way we were going to have AGI in the next 50 years
P(AGI in the next 50 years)
Various biosecurity risks
Stable dystopias, nuclear war or major power war, whole brain emulation
Climate change
Supervolcanoes, asteroids
Various biosecurity risks are most likely to break the claim
So the reason that we don't have AGI is not that we could make AGI as powerful as the brain, and we just don't because it's too expensive.
Will we get AGI cheaply enough?
Crux 1: AGI would be a big deal if it showed up here
Prediction: AGI would be a big deal if it showed up here