Watch or listen to this week's update on YouTube or as a podcast.
The crypto giant FTX crashes, introducing massive uncertainty into the AI safety funding space; humans cooperate better with a lying AI; and interpretability is promising, but also not.
This and other news from the AI safety world will be covered today.
It is the 11th of November, and welcome to the ML & AI safety update!
FTX drops
Since this is a major story, let's dive into what actually happened with the FTX Foundation.
When Sam Bankman-Fried, the CEO of FTX, announced The Future Fund in late February 2022 with the aim of improving humanity's long-term prospects, it seemed like yet another great initiative in support of the AI safety community and its ability to operate outside the incentive systems of for-profit companies.
Three days ago, Sam Bankman-Fried tweeted about FTX's liquidity issues as a crypto exchange, marking the start of a series of revelations about how the company mishandled users’ money, moved funds to its own accounts, and violated its own terms of service. The Department of Justice has initiated an investigation into FTX and its crypto hedge fund, Alameda Research.
Additionally, the recent crash of the Meta stock has seen the second big funder of AI safety research, Open Philanthropy, lose a lot of its money, so the future of AI safety funding looks interesting, to say the least.
Human-AI cooperation
We follow this serious news with research from a team at Stanford, which shows that human-AI cooperation improves when the AI is calibrated to its relationship with the human rather than purely for accuracy.
The authors use an AI to give decision-making advice to participants and find that an AI adjusted to fit the human-AI interaction yields better overall performance for the collaborative human-AI system than a maximally accurate one.
This raises interesting considerations about how AI actually interacts with humans, which matters for several of the ways we might try to safeguard future AI.
U-shaped inverse scaling
And just as we thought we had found some sort of linearity in inverse scaling laws, Google shows that they can become U-shaped; all you need to do is scale your models up to extreme sizes. If this holds, it may undermine inverse scaling laws, and Google even goes so far as to state: "This suggests that the term inverse scaling task is under-specified - a given task may be inverse scaling for one prompt but positive or U-shaped scaling for a different prompt".
However, not everyone is satisfied with their methods: Ethan Perez calls the team out for deviating from the inverse scaling evaluations they describe themselves as replicating in the paper.
Interpretability in the wild
A wonderful piece of contemporary interpretability work in the wild comes from Redwood Research: using GPT-2 Small, they investigate “indirect object identification” end-to-end, mapping out the internal parts of the circuit in the Transformer that implements it and even evaluating the reliability of their explanation.
What is so great about this work is not only that it takes the task of interpretability seriously, but also that it shows how much valuable information proper interpretability research can uncover.
The team manages to identify 26 attention heads, grouped into 7 categories, that together make up the indirect object identification circuit. Along the way, they also identify interesting structures in the model's internals, for example that attention heads communicate by passing pointers to a piece of information rather than copying the information itself.
We really recommend that you check out this interpretability research paper!
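If you want to poke at this behaviour yourself, here is a minimal sketch (our own, not Redwood's code) that loads GPT-2 Small with the TransformerLens library and looks at an IOI-style prompt. The choice of layer 9, head 9 as a candidate "name mover" head is an assumption based on the heads reported in the paper; treat the exact index as illustrative and explore others.

```python
# Minimal sketch (not Redwood's code): explore IOI behaviour in GPT-2 Small
# using the TransformerLens library. Layer 9, head 9 is one candidate
# "name mover" head; the exact index is an assumption, so explore others too.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

prompt = "When Mary and John went to the store, John gave a drink to"
logits, cache = model.run_with_cache(prompt)

# The model should prefer the indirect object " Mary" over the subject " John".
last_logits = logits[0, -1]
for name in [" Mary", " John"]:
    print(f"logit for {name!r}: {last_logits[model.to_single_token(name)].item():.2f}")

# Attention paid by the final position in one candidate name mover head.
layer, head = 9, 9
pattern = cache["pattern", layer][0, head, -1]  # attention weights over all positions
for token, weight in zip(model.to_str_tokens(prompt), pattern.tolist()):
    print(f"{token!r:>12} {weight:.3f}")
```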
Other news
In other news, Eric Drexler and Eliezer Yudkowsky discuss superintelligence on the Alignment Forum: how many superintelligent AIs would actually be the best-case scenario once they start interacting with each other?
Also, the Janus team from Conjecture have found that OpenAI’s human-feedback fine-tuned models produce very confident outputs in quite specific situations, showing clear preferences for particular numbers, answers, and the like.
MadHatter doubts some of the mesa-optimiser thought experiments proposed by researchers in the field and calls for far more empirical research on mesa-optimisers.
David Krueger doubts the true value of interpretability and reverse engineering, suggesting that we should get our engineering right in the first place instead of 'reversing' that engineering with interpretability afterwards.
Nate Soares doubts cognitive interpretability approaches: because we are not building minds but training them, we have very little grasp of their internal thinking, and he doubts our ability to predict whether an AGI system will lead to positive outcomes for humanity.
And finally, Apart Research has released a website for interpretability research. We definitely recommend you check it out and also consider participating in the upcoming interpretability hackathon this very weekend. Check the links below for more info.
Opportunities
This week, we have a few very interesting openings available:
CHAI is offering an AI Research Internship under one of their mentors
Today is the day the interpretability hackathon starts, open to all
AI Impacts is looking for a Senior Research Analyst
This has been the ML & AI safety update. We look forward to seeing you next week.