In this week's newsletter, we explore the alignment of modern large models and examine criticisms of extreme AI risk arguments. And of course, don't miss the opportunities we've included at the end!
Watch this week's MLAISU on YouTube or listen to it on Spotify.
Understanding large models
An important part of making future machine learning systems safe is understanding how we can measure and monitor the safety of these large models.
This past week brought a couple of interesting examples of work in this direction, adding to last week's wonderful inverse scaling examples.
A paper explores the perspective that large language models (LLMs) are implicitly topic models. By reasoning about the hidden concepts that LLMs learn, the authors find a method that improves performance by 12.5% compared to a random prompt.
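For a bit of intuition about what using a model's hidden concepts for prompting can look like in practice, here is a minimal sketch that scores candidate in-context demonstrations instead of picking them at random. The scorer, prompt format, and function names below are hypothetical stand-ins, not the paper's actual method:

```python
# A rough sketch of the general idea: score candidate demonstrations rather than
# sampling them at random. `loglik(prompt, target)` is a stand-in for any LM-based
# scorer, e.g. the average token log-likelihood of `target` given `prompt`; the
# paper's actual criterion, based on inferred latent concepts, is more involved.
def select_demonstrations(candidates, dev_set, loglik, k=4):
    def score(demo):
        prefix = f"{demo['x']} -> {demo['y']}\n"
        # How much does prepending this demonstration help on a small dev set?
        return sum(loglik(prefix + f"{x} -> ", y) for x, y in dev_set)

    # Keep the k highest-scoring demonstrations as the in-context prompt.
    return sorted(candidates, key=score, reverse=True)[:k]
```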
Adam Scherlis expands on what inner misalignment looks like under the simulator perspective of LLMs. Inner misalignment is when a system seems to be doing the right thing while running a malicious computation in the background. The simulator perspective sees LLMs as simulating different scenarios and characters as you write with them. Scherlis discusses how these models can exhibit a different kind of inner misalignment.
Another paper investigates 491 different computer vision models and finds that alignment with human representations predicts both higher robustness to adversarial attacks and better generalization.
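As a rough illustration of what "aligned with human representations" can mean operationally, the sketch below correlates a model's embedding similarities with human similarity judgments over image pairs. The metric and names here are our own stand-ins, not necessarily the paper's exact measure:

```python
import numpy as np
from scipy.stats import spearmanr

def representational_alignment(embeddings, human_ratings, pairs):
    """Correlate a model's similarity structure with human similarity judgments.

    embeddings:    dict image_id -> feature vector from the vision model
    human_ratings: dict (id_a, id_b) -> averaged human similarity rating
    pairs:         list of (id_a, id_b) image pairs rated by humans
    """
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    model_sims = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
    human_sims = [human_ratings[(a, b)] for a, b in pairs]
    rho, _ = spearmanr(model_sims, human_sims)  # rank correlation as the alignment score
    return rho
```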
These are but a few good examples of work investigating how we can scale our alignment understanding to larger systems. You can join us next weekend for the ScaleOversight hackathon to contribute to this growing field and meet amazing people from around the world who share a passion for ML safety!
Hardcore AGI doom
We now shift our focus slightly from the technical aspects of AI alignment research to a thought-provoking article by Nuño Sempere, which addresses alarmist views regarding the imminent dangers of artificial general intelligence (AGI).
Sempere critiques the notion of a severe short-term risk from AGI, such as an 80% chance of human extinction by 2070, arguing that these claims rest on flawed reasoning and imperfect concepts. He also argues that the cumulative evidence against such extreme risks has not been properly presented.
On the same topic, renowned philosopher Luciano Floridi appeared on this week's ML Street Talk podcast. Floridi recently published an article expressing his distrust of both those who believe in a rapid intelligence explosion and those who dismiss the risks of AI. He stresses the importance of preserving human dignity and argues that the question of whether AI has agency (is “able to think”) is not actually relevant to the conversation about risk.
Of course, there are still many risks from AI, especially in the longer term. We recommend reading Eliezer Yudkowsky’s list of ways AGI can go wrong. In it, he argues that we need 100% safe solutions, that we cannot “just train AI on good actions”, and that current efforts are not attacking the right problems.
Other research
In other research news…
Neel Nanda has released the quickstart guide to mechanistic interpretability that he wrote for our latest hackathon.
Google released a highly capable music-generating language model.
New work investigates the relationship between genuine generalization and the famous double descent phenomenon (a toy illustration follows below).
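To make double descent concrete, here is a toy random-features regression that typically reproduces the characteristic curve: test error peaks near the interpolation threshold (number of features close to the number of training points) and falls again as the model grows. This is the textbook setup with arbitrary numbers, not the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 2000, 20
w_true = rng.normal(size=d)
X, Xt = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y = X @ w_true + 0.5 * rng.normal(size=n_train)
yt = Xt @ w_true + 0.5 * rng.normal(size=n_test)

V = rng.normal(size=(d, 2000)) / np.sqrt(d)        # fixed random projection
for p in [20, 60, 90, 100, 110, 200, 500, 2000]:   # number of random features
    phi, phit = np.tanh(X @ V[:, :p]), np.tanh(Xt @ V[:, :p])
    w = np.linalg.pinv(phi) @ y                    # minimum-norm least squares fit
    print(f"{p:5d} features | test MSE {np.mean((phit @ w - yt) ** 2):.3f}")
```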
Opportunities
In the opportunities area, we have…
Senior roles are open at Ought, which builds amazing language model-driven research software for, among others, ML safety researchers.
A communications role at the Fund for Alignment Research.
You can refer a cool friend to the Redwood Research summer internship for a bounty of $2,000.
Or you can apply for it yourself!
And of course, you can join our hackathon.
Thank you for joining us in this week’s ML and AI safety update!