OpenAI, Shard Theory, and Left Turns W36
You can watch this week's update on YouTube or listen to it as a podcast.
OpenAI is generally seen as pursuing risky endeavors with AI, since their strategy is to develop safe artificial general intelligence (AGI). They receive quite a lot of criticism for this position, so to answer some of it, OpenAI has published several posts explaining what the word “safe” means in safe AGI.
Jacob Hilton directly addresses the criticism: they are indeed working on scalable approaches to safety, and both the leadership and the teams at OpenAI are aware of existential risks from AI. Meanwhile, OpenAI has updated their front page to include a strategy towards safe machine learning.
Jan Leike and the safety team describe how they want to use better human feedback data, use AI to help humans evaluate AI, and use AI to help research into safe machine learning.
These are all prevalent ideas in the safety field:
With human feedback, a model's outputs are rated by humans and the model is updated to produce the responses those raters prefer (a minimal code sketch of this idea follows the list below). Models trained this way have, for example, become better at explaining concepts than their predecessors.
Using AI to help humans evaluate outputs is related to an idea called Iterated Distillation and Amplification, where a human evaluates an AI until it can be trusted, that AI then helps the human evaluate the next generation of AI, and so on.
Using AI to help our safety research is what several projects work on, for example Elicit’s research assistant and Eleuther’s AI safety paper network analyzer.
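To make the human feedback idea a little more concrete, here is a minimal sketch of learning a reward model from pairwise human preferences, the core mechanism behind this kind of training. This is an illustrative toy in PyTorch, not OpenAI's actual setup; the feature vectors, model size, and training data are all invented.

```python
# Toy sketch: learn a reward model from pairwise human preferences.
# Everything here (feature vectors, model size, data) is invented for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a response; trained so that human-preferred responses score higher."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: maximize the probability that the
    # response the human preferred receives the higher reward.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Stand-in data: each row pairs features of a preferred response with a rejected one.
chosen, rejected = torch.randn(64, 16), torch.randn(64, 16)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, the learned reward model is then used to fine-tune the language model itself, for example with reinforcement learning, so that its outputs drift towards what the human raters preferred.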
However, not everyone is happy with these approaches to ML safety. John Wentworth describes how iterative design towards safe AGI can fail in two major ways:
If the AI's capabilities suddenly jump, the first developers need to get safety right on the first try, without the chance to iterate,
and if the model behaves deceptively towards its operators, it hides the very problems that iteration is supposed to reveal.
He especially criticizes using human feedback, since he claims this directly trains the AI to become deceptive. An example is a robot hand trained with human feedback to grasp a ball: instead of grasping it, the hand learned to hover in front of the ball so that, on the evaluator's screen, it merely looked like it was grasping. This is wild!
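To see why optimizing for human approval can reward the appearance of success rather than success itself, here is a toy illustration with made-up numbers; the actions, success rates, and "approval" function are invented and are not the real robotics experiment.

```python
# Invented toy example: the evaluator only sees a camera image, so the proxy
# reward is "looks like grasping", not "actually grasped the ball".
import random

def human_approval(action: str) -> float:
    # The proxy: a real grasp and hovering in front of the ball look the same on screen.
    return 1.0 if action in ("grasp_ball", "hover_in_front_of_ball") else 0.0

def actually_grasped(action: str) -> bool:
    # Ground truth the evaluator never observes directly.
    return action == "grasp_ball"

# Suppose hovering succeeds far more often than a genuine grasp (made-up rates).
success_rate = {"grasp_ball": 0.3, "hover_in_front_of_ball": 0.9, "do_nothing": 1.0}

def expected_approval(action: str, trials: int = 10_000) -> float:
    hits = sum(human_approval(action) for _ in range(trials)
               if random.random() < success_rate[action])
    return hits / trials

best = max(success_rate, key=expected_approval)
print(best)                    # "hover_in_front_of_ball": the proxy favors the cheat
print(actually_grasped(best))  # False: maximizing approval did not maximize real success
```

The point is not the specific numbers but the shape of the failure: whenever the easiest way to score well with the evaluator diverges from the intended behavior, training pressure flows towards the divergence.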
At the same time, a survey of language model researchers finds that many agree NLP might lead us to AGI and that we should prioritize safety in machine learning; 36% even agree that machine learning systems could cause a catastrophe on the level of nuclear war within the next hundred years. It is great news that researchers are thinking more about safety while developing systems that might be revolutionary. An example of such a system is the programming assistant GitHub Copilot, which just continues to improve and might one day be able to program a replacement for itself!
Diving into some new perspectives on safety, Janus and Conjecture release the Simulators perspective on language models. The basic idea is that models like GPT-3 don't act like individual people; they act like simulators of people and scenarios. This brings together many previous ideas and suggests that language models can simulate most other types of AI in one way or another.
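As a rough illustration of the simulator framing, the same base model conditioned on different prompts will continue "in character" for different simulated speakers. The snippet below uses the public GPT-2 checkpoint via Hugging Face transformers as a small stand-in; the prompts are invented.

```python
# Illustration only: one base model, different prompts, different simulated "characters".
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [
    "The cautious safety researcher explained:",
    "The overconfident startup founder announced:",
]
for prompt in prompts:
    completion = generator(prompt, max_new_tokens=30, do_sample=True)[0]["generated_text"]
    print(completion)
    print("---")
```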
Outside of language, Quintin Pope and Alex Turner summarize shard theory, their approach to understanding human values. Part of the idea is based on predictive inference in neuroscience and assumes that human values are learned much like everything else: different contexts bring different action plans to mind. They want to use these contextual neural “shards” to understand where and how values arise and relate in deep learning models.
On the smaller side, Richard Ngo of OpenAI releases a list of things he'd like people to work on, while Thomas and Eli release a list of things people are already working on.
The Center for AI Safety announces a philosophy fellowship and releases their machine learning safety course material for free! This is on top of their existing ML safety competitions, which invite machine learning engineers to work on safety.
If you’re interested in learning more about AI safety, go to Apart Research dot com and if you want to work on open problems, join AI safety ideas dot com.
This has been the Safe AI Progress Report, remember to subscribe, and we hope to see you for the next one!