  1. Nov 2020
    1. We (the alignment community) think we want corrigibility with respect to some wide set of goals S, but we actually want non-obstruction with respect to S

      Prediction

    2. Can we get negative results, like "without such-and-such assumption on πAI, the environment, or pol, non-obstruction is impossible for most goals."

      Prediction. A rough formalization of non-obstruction is sketched below.
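      A minimal sketch of what non-obstruction could mean here, using the post's pol and πAI notation as I read it (so treat the exact symbols as my assumption): the AI is non-obstructive with respect to the goal set S iff, for every payoff function P ∈ S,

        V_P^{pol(P)}(s | πAI on) ≥ V_P^{pol(P)}(s | πAI off),

      where pol(P) is the policy we would follow if we pursued P, s is the current state, and V is attainable utility. The negative-results question then becomes: under which assumptions on πAI, the environment, and pol does this inequality fail for most goals P?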

    3. However, even if we could maximally impact-align the agent with any objective, we couldn't just align it our objective

      Random copy edit, but I think this is missing a 'with' and should say: "However, even if we could maximally impact-align the agent with any objective, we couldn't just align it with our objective"

    4. Main idea: We only care about how the agent affects our abilities to pursue different goals (our AU landscape) in the two-player game, and not how that happens. AI alignment subproblems (such as corrigibility, intent alignment, low impact, and mild optimization) are all instrumental avenues for making AIs which affect this AU landscape in specific desirable ways.

      Prediction

    5. Main idea: By considering how the AI affects your attainable utility (AU) landscape, you can quantify how helpful and flexible an AI is.

      Another claim that could be put up for prediction!

    6. Main claim: corrigibility’s benefits can be mathematically represented as a counterfactual form of alignment.

      This could be posed as a prediction for people to express credences on: "Corrigibility's benefits can be mathematically represented as a counterfactual form of alignment." A sketch of the counterfactual comparison follows.
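      One illustrative way to read the counterfactual (my framing, not necessarily the post's exact formalism): treat the AU landscape as a function from payoff functions to attainable utilities,

        AU_on(P) := V_P^{pol(P)}(s | πAI on),   AU_off(P) := V_P^{pol(P)}(s | πAI off),

      and read the claim as saying that corrigibility's benefit is captured by AU_on(P) ≥ AU_off(P) holding across S, i.e. alignment evaluated over the goals we counterfactually could have had, not only the goal we actually deploy the AI with. On this reading, "helpful" roughly tracks how far AU_on sits above AU_off, and "flexible" tracks how wide a set S the inequality holds over.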

    1. Various biosecurity risks; stable dystopias; nuclear war or major power war; whole brain emulation; climate change; supervolcanoes, asteroids

      Of these, various biosecurity risks seem most likely to break the claim.

    2. So the reason that we don't have AGI is not that we could make AGI as powerful as the brain, and we just don't because it's too expensive.

      Will we get AGI cheaply enough?