Watch this week's update on YouTube or listen to it on Spotify.
This week, we look at broken scaling laws, surgical fine-tuning, interpretability in the wild, and threat models of AI.
Today is November 4th and this is the ML & AI safety update!
Broken scaling laws & surgical fine-tuning
A range of interesting papers have been making the rounds over the past few weeks, and we have selected a few highlights.
Scaling laws are important for inferring how future AI systems will behave. Existing scaling laws are usually fitted as smooth power laws (linear on a log-log plot) that change monotonically. Caballero, Krueger, and others introduce “broken neural scaling laws” after critiquing how standard scaling-law fits fail to reflect empirical facts of model training. Their functional form can show “breaks” that correspond to the sudden, non-monotonic shifts in capability we see from neural networks, and it extrapolates significantly better than the three other functional forms they compare against.
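For reference, the functional form they propose is a smoothly broken power law. As we read the paper, it is roughly of the form below, where x is the quantity being scaled (e.g. compute, parameters, or data), y is the performance metric, each d_i marks where a break occurs, and each f_i controls how sharp that break is:

$$ y = a + b\,x^{-c_0} \prod_{i=1}^{n} \left( 1 + \left( \frac{x}{d_i} \right)^{1/f_i} \right)^{-c_i f_i} $$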
Robustness of computer vision models is important for a range of tasks. A team from Stanford shows that fine-tuning only a subset of layers can work better than fine-tuning the whole neural network on specific distribution-shift benchmarks. For example, surgically fine-tuning early layers gives better performance on input-level shifts such as image corruptions, while fine-tuning late layers induces robustness to output-level shifts.
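In practice, surgical fine-tuning is simple to set up. Here is a minimal sketch (our own illustration, not the authors' code) that freezes everything except one chosen block of a standard torchvision ResNet before running an ordinary fine-tuning loop:

```python
# Minimal sketch of surgical fine-tuning: freeze every parameter except one
# chosen block, then fine-tune as usual. The block choice is a hypothetical
# example; the paper's finding is that early blocks help for input-level
# shifts (e.g. corruptions) and late blocks for output-level shifts.
import torch
import torchvision

model = torchvision.models.resnet50(weights="IMAGENET1K_V2")

block_to_tune = "layer1"  # an early block, e.g. for an input-level shift

for name, param in model.named_parameters():
    param.requires_grad = name.startswith(block_to_tune)

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3,
    momentum=0.9,
)
# ...then run a standard training loop on the shifted target data.
```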
Debate & interpretability
Parrish, Bowman, and others show that debate does not help humans answer hard reading comprehension questions. Participants are shown arguments for and against both a correct and an incorrect answer to each question, but their accuracy does not improve.
“When Drake and Yoojin went to the store, Yoojin gave a drink to…” A transformer can easily predict that the next word in this sentence is “Drake”, but how does it do it? Redwood Research identifies a circuit of attention heads in GPT-2 small that implements this task.
The attention heads in the circuit have specific functions: some identify duplicated names, some inhibit attention to the repeated name, and three late-stage classes of “name mover” heads positively or negatively move the name “Drake” into the predicted position. This task is called indirect object identification and is clearly an interesting test case for circuits-style interpretability.
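As a rough illustration of the task (our own sketch, not Redwood's code), the model's preference can be measured as the logit difference between the correct name and the repeated name using an off-the-shelf GPT-2 from Hugging Face; if a name spans several sub-tokens, we simply look at the first one:

```python
# Sketch: measure GPT-2's preference for the indirect object ("Drake") over
# the repeated subject ("Yoojin") as a next-token logit difference.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "When Drake and Yoojin went to the store, Yoojin gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

drake_id = tokenizer.encode(" Drake")[0]    # first sub-token of the correct name
yoojin_id = tokenizer.encode(" Yoojin")[0]  # first sub-token of the repeated name
print("logit difference:",
      (next_token_logits[drake_id] - next_token_logits[yoojin_id]).item())
```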
Threat models in ML safety
The DeepMind safety team has created a taxonomy of how the current risks from artificial intelligence look. Their consensus development model is a scaled-up version of today's models, which they think needs little fundamental innovation to become artificial general intelligence: an AI that is better than humans at most relevant tasks.
The risks that arise from such a model are goal misgeneralization, where models fail to generalize what they learned in training to real-world scenarios, and power-seeking as a result of that misalignment. We would not expect to catch this because of deception, and key decision-makers in society may not understand the risks. John Wentworth notes that this multi-stage story is not even necessary, since current training setups already incentivize models to deceive humans.
Michael Cohen argues that the probability of existential catastrophe from AI is above 35%. He takes an optimistic perspective on success scenarios such as well-enforced laws that stop dangerous versions of AI, an entity stopping it in some other way, no one developing advanced AI, or advanced AI being developed safely because it violates a set of assumptions Cohen makes (which he doubts). These assumptions concern the AI's ability to form hypotheses, follow plans under uncertainty, and use those plans to advance some proxy reward.
Additionally, he places little confidence in current AI safety research paradigms and even writes an “anti-review” in which he argues against each contemporary research agenda.
In other news
Scott Garrabrant discusses so-called “frames”, which he describes as creating an agentic first-person perspective on all (third-person) possible worlds, covering things like uncertainty, choices, and plausible worlds. He claims this contrasts with the embedded-agency view and with traditional RL's separation between agent and environment.
Michaud, Liu, and Tegmark show scaling laws for different function approximators and provide a taxonomy for precision machine learning.
Michael Nielsen and Kanjun Qiu release their book “Vision for Metascience” and describe the funders of research as a detector and discriminator in an imaginative research generation process.
The Future of Life Institute has started a new podcast and the latest episode with Ajeya Cotra covers how AI might cause catastrophe.
Opportunities
This week, we have a few very interesting openings available:
Redwood Research is inviting 30-50 researchers to join them in Berkeley for a mechanistic interpretability research programme.
Anthropic is looking for operations managers, recruiters, researchers, engineers, and product managers.
Additionally, you can check out some of the new features on AI Safety Ideas and join the interpretability hackathon from anywhere in the world next weekend.
This has been the ML & AI safety update, see you next week!