We (the alignment community) think we want corrigibility with respect to some wide set of goals S, but we actually want non-obstruction with respect to S
Prediction
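For reference, a rough formalization of non-obstruction as I understand it from the post (my paraphrase; the exact notation may differ slightly):

$$\forall P \in S:\quad V_P^{\mathrm{pol}(P)}\!\left(s \mid \pi^{AI}\ \text{on}\right) \;\geq\; V_P^{\mathrm{pol}(P)}\!\left(s \mid \pi^{AI}\ \text{off}\right)$$

i.e. for every goal P in the set S, turning the AI on doesn't lower the attainable utility that the human policy pol(P) could otherwise have achieved from state s.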
Trying to implement corrigibility is probably a good instrumental strategy for us to induce non-obstruction in an AI we designed.
Prediction
Can we get negative results, like "without such-and-such assumption on π^AI, the environment, or pol, non-obstruction is impossible for most goals"?
Prediction
Can we prove that some kind of corrigibility or other nice property falls out of non-obstruction across many possible environments?
Prediction
Given an AI policy, could we prove a high probability of non-obstruction under conservative assumptions about how smart pol is?
Prediction
Main idea: we want good things to happen; there may be more ways to do this than previously considered.
Prediction
However, even if we could maximally impact-align the agent with any objective, we couldn't just align it our objective
Random copy edit, but I think this is missing a 'with' and should say: "However, even if we could maximally impact-align the agent with any objective, we couldn't just align it with our objective"
Main idea: We only care about how the agent affects our abilities to pursue different goals (our AU landscape) in the two-player game, and not how that happens. AI alignment subproblems (such as corrigibility, intent alignment, low impact, and mild optimization) are all instrumental avenues for making AIs which affect this AU landscape in specific desirable ways.
Prediction
Main idea: By considering how the AI affects your attainable utility (AU) landscape, you can quantify how helpful and flexible an AI is.
Another claim that could be predicted on!
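To make "quantify" concrete, here is a toy sketch (my own illustration, not code from the post) of measuring how an AI shifts the AU landscape over a goal set S. The function `attainable_utility(goal, ai_on)` is a hypothetical stand-in for V_P^{pol(P)} with the AI on or off:

```python
# Toy sketch: compare the AU landscape with the AI on vs. off across a goal set S.
# `attainable_utility(goal, ai_on)` is a hypothetical oracle for the value the
# human policy for `goal` can attain in each case.
from typing import Callable, Dict, Iterable

def au_landscape_shift(
    goals: Iterable[str],
    attainable_utility: Callable[[str, bool], float],
) -> Dict[str, float]:
    """For each goal P in S, how much does turning the AI on change attainable utility?"""
    return {g: attainable_utility(g, True) - attainable_utility(g, False) for g in goals}

def is_non_obstructive(shift: Dict[str, float]) -> bool:
    """Non-obstruction (roughly): no goal in S is made worse off by activating the AI."""
    return all(delta >= 0 for delta in shift.values())

# Made-up numbers purely for illustration:
toy_values = {
    ("write a novel", True): 0.9, ("write a novel", False): 0.6,
    ("cure a disease", True): 0.7, ("cure a disease", False): 0.7,
}
goals = ["write a novel", "cure a disease"]
shift = au_landscape_shift(goals, lambda g, on: toy_values[(g, on)])
print(shift)                      # per-goal change in attainable utility
print(is_non_obstructive(shift))  # True: nothing in S got worse
```

Averaging the per-goal changes is one crude way to score "helpfulness," and how evenly the gains spread across S gives a rough sense of "flexibility."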
Main claim: corrigibility’s benefits can be mathematically represented as a counterfactual form of alignment.
This could be a prediction for people to express a credence on: "Corrigibility's benefits can be mathematically represented as a counterfactual form of alignment"
Various biosecurity risks
P(AGI comes before biosecurity risks), P(AGI comes before stable dystopias), etc.
there was just no way we were going to have AGI in the next 50 years
P(AGI in the next 50 years)
Various biosecurity risks
Stable dystopias, nuclear war or major power war, whole brain emulation
Climate change
Supervolcanoes, asteroids
Various biosecurity risks are most likely to break the claim
So the reason that we don't have AGI is not that we could make AGI as powerful as the brain, and we just don't because it's too expensive.
Will we get AGI cheaply enough?
Crux 1: AGI would be a big deal if it showed up here
Prediction: AGI would be a big deal if it showed up here