Watch and listen to this week's update on YouTube or podcast.
Thoughts on the funding situation for AI safety, exciting projects from Apart's interpretability hackathon, interpretability of Meta AI's math transformers, and considerations on what to spend time on in AI safety.
Today is the 18th of November and welcome to the ML & AI safety update!
Thoughts on FTX and AI safety
Last week, like everyone else, we reported on the FTX crash. Now that the initial shock has passed, it seems appropriate to dive a little into what it means for the AI safety community.
The New York Times published an article about the general impact on EA funding and accurately notes that it is a justified cause for turbulence in such a young movement. The article includes commentary from the Center on Nonprofits and Philanthropy that it is too easy for billionaires to gain legitimacy “as long as the money is flowing”, a risk that materialized in this case.
The research community is generally appalled at what FTX has done. The main FTX fund for AI safety research, the Future Fund, saw its whole team resign over the deception they were exposed to. William MacAskill and Evan Hubinger state in clear terms that this fraud is completely incompatible with what effective altruism stands for. Meanwhile, Eliezer Yudkowsky and a lawyer make sure the community knows that it is not to blame for this situation and clarify the legal status of FTX’s donations.
When it comes to funding for AI safety research, one of the two biggest funders has now stopped, and the other, OpenPhil, is taking a month’s break to evaluate the turbulence. Nonlinear has set up an emergency fund for smaller grants below $10,000 to support organizations squeezed by this funding stop.
Holden Karnofsky from OpenPhil recommends that organizations:
Put commitments on hold and wait until there is more clarity about the actual impact
Identify gaps, assess by urgency/importance
Reprioritize and balance portfolios
Interpretability Alignment Jam
The second Alignment Jam, about interpretability research, finished this weekend with a total of 147 participants and 25 submissions of valuable interpretability research.
The first prize was awarded to Alex Foote for his research and algorithm that finds minimally activating examples for neurons in language models using word replacement and sentence pruning. This automatically creates positive and negative examples of what specific neurons activate on, making it a highly interpretable method.
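To make the idea concrete, here is a minimal sketch of the pruning half of such an approach. It is not Alex Foote's actual code, and the layer and neuron indices are arbitrary placeholders: the sketch simply drops words from a prompt as long as a chosen GPT-2 MLP neuron keeps most of its peak activation.

```python
# Hedged sketch, not the prize-winning implementation: greedily prune words
# from a prompt while one GPT-2 MLP neuron keeps most of its peak activation,
# approximating a "minimally activating example" for that neuron.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, NEURON = 5, 1234  # placeholder choices for illustration only


def neuron_activation(text: str) -> float:
    """Max pre-activation of the chosen MLP neuron over all token positions."""
    captured = {}

    def hook(module, inputs, output):
        captured["acts"] = output  # shape (1, seq_len, 3072) in GPT-2 small

    handle = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(hook)
    with torch.no_grad():
        model(**tokenizer(text, return_tensors="pt"))
    handle.remove()
    return captured["acts"][0, :, NEURON].max().item()


def prune(text: str, keep_fraction: float = 0.8) -> str:
    """Drop words one at a time while the neuron keeps >= keep_fraction of its peak activation."""
    baseline = neuron_activation(text)
    words = text.split()
    changed = True
    while changed and len(words) > 1:
        changed = False
        for i in range(len(words)):
            candidate = words[:i] + words[i + 1:]
            if neuron_activation(" ".join(candidate)) >= keep_fraction * baseline:
                words, changed = candidate, True
                break
    return " ".join(words)


print(prune("The quick brown fox jumps over the lazy dog near the river bank."))
```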
The second prize was awarded to three researchers from Stanford who found that when Transformer attention heads are deactivated in different ways, other heads take over their task even though they are not normally active on it. This has been shown before, but the team found that even the backup heads have backup heads, and that all these backup heads are robust to the method of deactivation (or ablation) used on the main heads.
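For readers who want to poke at this themselves, here is a hedged sketch of head ablation. It assumes the TransformerLens library, and the prompt, layer, and head indices are illustrative choices, not the team's.

```python
# Hedged sketch: zero out one attention head's output and check whether the
# model's answer logit survives, i.e. whether other ("backup") heads compensate.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
answer = model.to_single_token(" Mary")

LAYER, HEAD = 9, 9  # illustrative head to ablate


def answer_logit(logits: torch.Tensor) -> float:
    return logits[0, -1, answer].item()


def zero_head(z, hook):
    # z has shape (batch, seq, n_heads, d_head); zero the chosen head's output
    z[:, :, HEAD, :] = 0.0
    return z


clean = answer_logit(model(tokens))
ablated = answer_logit(
    model.run_with_hooks(tokens, fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", zero_head)])
)
print(f"answer logit, clean: {clean:.3f}  head ablated: {ablated:.3f}")
# If the drop is small, other heads are compensating; ablating those backup
# heads too (and their backups) is how a backup hierarchy can be probed.
```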
The third prize was awarded to Team Nero for finding flaws in the way the ROME and MEMIT papers replace factual associations. They show that a factual association replacement also affects any sentence related to the words in that association, indicating that the edit is not constrained to the targeted factual association.
The fourth-place team introduced a way to interpret reinforcement learning agents’ strategies on mathematically solved games. They use Connect Four and find that the way the agent sees the board corresponds to how humans generally model the board.
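As a rough illustration of the underlying probing idea (with synthetic stand-in data, not the team's agent or code): fit a linear probe from an agent's hidden activations to the occupancy of each board cell. With a real agent, high probe accuracy would suggest the network internally tracks the board much like a human does.

```python
# Hedged sketch with random stand-in data (so the probe accuracy here is ~0.5);
# with a real agent's activations, high per-cell accuracy would indicate that
# the network linearly encodes the board state.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_positions, hidden_dim, n_cells = 500, 128, 6 * 7  # 6x7 board

activations = rng.standard_normal((n_positions, hidden_dim))    # stand-in for agent activations
board_labels = rng.integers(0, 2, size=(n_positions, n_cells))  # stand-in: cell occupied or not

train, test = slice(0, 250), slice(250, None)
accuracies = []
for cell in range(n_cells):
    probe = LogisticRegression(max_iter=1000).fit(activations[train], board_labels[train, cell])
    accuracies.append(probe.score(activations[test], board_labels[test, cell]))
print(f"mean per-cell probe accuracy: {np.mean(accuracies):.2f}")
```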
The hackathon sparked a lot of interesting research, which we definitely recommend you check out.
Also, remember to stay tuned for our coming hackathon in December!
Meta AI math transformer interpretability
François Charton from Meta AI has investigated the failure cases and out-of-distribution behavior of transformers trained on matrix inversion and eigenvalue decomposition.
Despite earlier research claiming that mathematical language models fail to understand math, he finds that they do learn the mathematical problems correctly, but that the distribution of those problems determines how well the models generalize. He shows that the training data generators do not produce problems with the right distributions to learn from, leading to the generalization failures seen in math models.
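As a hedged numerical illustration of the data-generator point (this is not Charton's code): symmetric matrices with IID Gaussian entries have eigenvalues concentrated on the Wigner semicircle, so a model trained only on such matrices never sees, for example, the non-negative, spread-out spectra of Wishart-style matrices, and can fail on them even if it handles in-distribution eigendecomposition well.

```python
# Hedged illustration, not Charton's experiments: two random-matrix generators
# produce very different eigenvalue distributions, so a transformer trained on
# only one of them faces out-of-distribution inputs from the other.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 2000


def wigner() -> np.ndarray:
    """Symmetric matrix with IID Gaussian entries (semicircle spectrum)."""
    a = rng.standard_normal((n, n))
    return (a + a.T) / np.sqrt(2 * n)


def wishart() -> np.ndarray:
    """Positive semi-definite matrix A A^T / n (non-negative spectrum)."""
    a = rng.standard_normal((n, n))
    return a @ a.T / n


def spectrum(generator) -> np.ndarray:
    return np.concatenate([np.linalg.eigvalsh(generator()) for _ in range(trials)])


for name, gen in [("Wigner", wigner), ("Wishart", wishart)]:
    eigs = spectrum(gen)
    print(f"{name:8s} eigenvalues: mean {eigs.mean():+.2f}, range [{eigs.min():+.2f}, {eigs.max():+.2f}]")
```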
It remains as it has always been: computers only do what we ask them to; the main failure lies in our expectations and aims.
Thoughts on buying time
Akash, Olivia Jimenez, and Thomas Larsen have posted a long list of interventions that could 'buy us time'. They believe the AI safety community should invest more in buying time than in technical research, because the median researcher's time does more to reduce overall risk through such interventions than through highly technical alignment work.
Their list of proposed interventions includes, among others, demonstrating alignment failures, one-on-one conversations with ML researchers, and defining concepts in AI safety more precisely. We have heard these claims before, and they also get a bit of pushback from Jan Kulveit and habryka.
Other news
Martin Soto criticizes Vanessa Kosoy's PreDCA protocol for betting everything on a specific mathematical formalization of its instructions, which might be problematic
Pablo Villalobos and others have estimated when training data will be exhausted based on current trends. They predict that we will have exhausted the stock of low-quality language data by 2030 to 2050, high-quality language data before 2026, and vision data by 2030 to 2060
Instrumental convergence is proposed as an explanation of why general intelligence is possible
Jessica Mary proposes that model-agnostic interpretability might not be that bad after all, though the commenters indicate otherwise.
Opportunities:
This week, we have a few very interesting openings available:
AI Impacts is still looking for a senior research analyst
Anthropic is still looking for a senior software engineer
Center for AI Safety is looking for a chief of staff
David Krueger’s lab is looking for collaborators
This has been the ML & AI safety update. We look forward to seeing you next week!