$1.5 million to change someone's mind W39
Watch or listen to this week's update on YouTube and as a podcast.
Change FTX’s mind for $1.5 million, get a new perspective in interpretability, engage with the speed prior, and join our hackathon.
—
The FTX Future Fund announces a $1.5 million prize for changing their mind about the risks of artificial intelligence. So far they have donated upwards of $31 million to this cause, and changing their mind could completely change where that money goes. If you drastically shift their probability estimates for how dangerous AGI is and when it will arrive, you'll be eligible for a prize.
An early submission is this post arguing that scary AI will arrive soon. It covers why we should expect early general intelligence to be scarier than later AI, since we will have had less time to prepare, and how several key variables governing how early scary AI arrives remain deeply uncertain. A very good submission!
Conjecture releases great research in interpretability: the polytope lens on feature space. They argue that we should understand features not as directions but as geometric structures in activation space, because of nonlinear activation functions and polysemanticity.
Even though one neuron can encode multiple features, they can identify "monosemantic polytopes": regions that respond to a single, distinct type of input, which only become visible once we study features as geometric shapes rather than directions. They also challenge the Circuits interpretability we have talked about before with an experiment where they scale activations and see a difference in what the network understands, which implies that we cannot simply use linear directions as features (figure).
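To make the polytope picture a bit more concrete, here is a minimal NumPy sketch (a toy setup of my own, not Conjecture's code): in a ReLU network, the on/off pattern of the ReLUs carves the space into polytopes, and inside each polytope the network behaves as a single affine map.

```python
import numpy as np

# Toy illustration (assumed setup, not from the paper): the ReLU on/off
# pattern of a network defines which polytope an input falls into. Inside
# one polytope the network is purely affine, so "which polytope?" is a
# geometric notion of a feature rather than a direction.

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)   # arbitrary toy layer sizes
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)

def polytope_code(x):
    """Return the ReLU activation pattern (the polytope ID) for input x."""
    h1 = np.maximum(W1 @ x + b1, 0.0)
    h2 = np.maximum(W2 @ h1 + b2, 0.0)
    return tuple((h1 > 0).astype(int)) + tuple((h2 > 0).astype(int))

x = rng.normal(size=4)
# Scaling the input (analogous to the activation-scaling experiment) can push
# it across ReLU boundaries into a different polytope, so the direction alone
# does not pin down how the network treats it.
print(polytope_code(x) == polytope_code(1.5 * x))
```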
Anthropic, who do a lot of work in Circuits interpretability, have simultaneously published some amazing work on understanding feature superposition: networks representing more features than they have neurons, so that a single neuron can respond to several unrelated features, for example both cars and dogs. This enables the network to represent more things but unfortunately makes it harder for us to understand.
Their work presents an array of interesting experiments studying when feature superposition occurs, such as this figure where yellow indicates stronger superposition (figure), and feature-geometry graphs showing how superposition works by encoding features in directions that are as distinct from each other as possible (figure). There are more experiments, and I recommend reading the paper if you want to know more.
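As a rough illustration of what that geometry means, here is a tiny NumPy sketch in the spirit of Anthropic's toy models (the setup and numbers are my own assumptions, not theirs): six sparse features are squeezed into two dimensions by spreading their directions as far apart as possible, and a ReLU cleans up most of the resulting interference.

```python
import numpy as np

# Toy superposition sketch (assumed setup): store 6 features in only 2 dims.
n_features, n_dims = 6, 2

# Spread the feature directions evenly around the circle, i.e. "as distinct
# as possible", which is the geometry the superposition figures describe.
angles = np.linspace(0, 2 * np.pi, n_features, endpoint=False)
W = np.stack([np.cos(angles), np.sin(angles)])            # shape (2, 6)

def reconstruct(x):
    """Compress sparse features into 2D, then decode with a ReLU."""
    hidden = W @ x                                         # down to 2 dimensions
    return np.maximum(W.T @ hidden, 0.0)                   # ReLU suppresses interference

x = np.zeros(n_features)
x[3] = 1.0                                                 # one active, sparse feature
print(np.round(reconstruct(x), 2))
# Feature 3 is recovered strongest; its neighbours pick up mild interference,
# the price of packing more features than dimensions.
```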
Perez and McKenzie release the winners of the first round of the Inverse Scaling Prize. The challenge looks for tasks where larger language models perform worse than smaller ones, which is important for knowing where much larger models might hit roadblocks in their compatibility with human values.
The winning tasks show that larger models 1) are worse at understanding negation, 2) more often repeat quotes they have seen in their training data, 3) are worse at working with redefined terms, and 4) are worse at judging risky bets, rating them by their outcome rather than their expected value.
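The core test behind the prize is simple enough to sketch. Here is a minimal illustration with made-up accuracy numbers (not the actual prize data or evaluation code):

```python
# Hypothetical data: the "inverse scaling" check is just whether accuracy on
# a task goes down as model size goes up.
accuracy_by_size = {        # model parameters -> accuracy on, say, a negation task
    1e8: 0.71,
    1e9: 0.64,
    1e10: 0.55,
    1e11: 0.48,             # made-up numbers for illustration
}

sizes = sorted(accuracy_by_size)
is_inverse_scaling = all(
    accuracy_by_size[small] >= accuracy_by_size[big]
    for small, big in zip(sizes, sizes[1:])
)
print(is_inverse_scaling)   # True for this made-up curve
```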
Evan Hubinger has released his summer experiments building on his work on so-called "speed priors". We expect future dangerous AI to deceive humans, so we need a way to penalise algorithms that deceive. One approach is to find a "regularizer", a penalty applied to networks, that is biased towards non-deceptive models. The speed prior attempts this by selecting the model that completes a task fastest, on the assumption that deception requires extra steps compared to just doing the task at hand.
His new work presents attempts to forward the speed prior across multiple levels, aiming to address inner misalignment as well. As we explained in the second Safe AI Progress Report, inner misalignment is when a model appears to do the right thing but is deceptive or pursues its own goals under the surface. To alleviate this problem, we want the speed prior to apply on both levels. Most of the approaches he presents do not show much promise, but they warrant future research.
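As a very rough illustration of the intuition (not Hubinger's actual proposal), a speed prior can be thought of as model selection with a penalty on computation steps:

```python
# Illustrative sketch only: pick the candidate with the best task score minus
# a penalty on how many steps it used, assuming deception costs extra steps.

def score(candidate, task_inputs, step_penalty=0.01):
    """Task performance minus a penalty on the computation steps used."""
    performance, steps = candidate(task_inputs)   # candidate reports its own step count
    return performance - step_penalty * steps

def honest(xs):
    return sum(xs), len(xs)                        # just does the task

def deceptive(xs):
    extra_reasoning = 50                           # stand-in for the extra steps deception needs
    return sum(xs), len(xs) + extra_reasoning

xs = list(range(10))
best = max([honest, deceptive], key=lambda c: score(c, xs))
print(best.__name__)   # the speed penalty favours the faster, honest candidate
```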
Leo Gao describes how reinforcement learning policies cannot care about the reward in an embedded setting, yet are still capable of wireheading. He also notes that there is no special mechanism in humans that makes us care about things in the world. Leo's post is a response to a text from Alex Turner making the weaker claim that reinforcement learning agents probably won't optimize for reward.
—
In smaller news, Holden Karnofsky analyses why the deployment of AI is incredibly important and questions the view among theorists that we only need to solve the technical problems of alignment and won't have to worry much about how the world deploys these models.
Akash and Thomas describe seven mistakes new alignment researchers make, such as getting stuck in "upskilling" and not questioning authority figures.
But one place where new researchers might build better fundamentals is the language model hackathon we are running over the weekend, starting today! You're very welcome to join, and you can win up to $1,000. Join us to figure out whether we can get novel research results in a weekend.
Another upcoming event is the AI safety conference by ALTER in Israel, which aims to put more focus on AI safety in the country. Our very own Fazl Barez will be speaking at the event.
And as always, if you want to learn more, go to Apart Research dot com, and if you want to find projects to work on, go to AI Safety Ideas dot com.
This has been the Safe AI Progress Report and we look forward to seeing you next week!
Links
Future Fund world view competition: https://ftxfuturefund.org/
Polytopes lens: https://www.alignmentforum.org/posts/eDicGjD9yte6FLSie/interpreting-neural-networks-through-the-polytope-lens
Inverse scaling prize round 1: https://www.alignmentforum.org/posts/iznohbCPFkeB9kAJL/inverse-scaling-prize-round-1-winners
Speed prior and forwarding speed priors: https://www.alignmentforum.org/posts/bzkCWEHG2tprB3eq2/attempts-at-forwarding-speed-priors
Deconfusing wireheading: https://www.alignmentforum.org/posts/jP9cKxqwqk2qQ6HiM/towards-deconfusing-wireheading-and-reward-maximization
Nearcasting AGI: https://www.alignmentforum.org/posts/vZzg8NS7wBtqcwhoJ/nearcast-based-deployment-problem-analysis
7 traps new alignment researchers fall into: https://www.lesswrong.com/posts/h5CGM5qwivGk2f5T9
Language model hackathon: https://itch.io/jam/llm-hackathon
AI Safety Israel conference: https://aisic2022.net.technion.ac.il/
Apart Research: https://apartresearch.com
AI Safety Ideas: https://aisi.ai