Watch and listen to this week's update on YouTube or podcast.
Aligning language models is hard and it becomes harder to find their flaws, Refine again pumps out interesting articles, and Redwood publishes a review of their robust language model work.
An often used method to align language models is reinforcement learning from human feedback that we talked about in the first Safe AI Progress Report. A good way to create examples for humans to evaluate and give feedback on is to use adversarial techniques, often called Red Teaming.
In Red Teaming, we attempt to trip up the models as much as possible by giving extreme examples in some direction, e.g. with examples of violence. Creating a model without any examples of violence in its text output was one of Redwood Research’s first projects. Now, they have released a hindsight review of how useful it was to alignment.
Their tools for using AI to assist human annotation are very good examples of an actual alignment pipeline that will be useful for future use. The data contractors Surge AI wrote a post about their process.
Unfortunately, their results showed that they could not create a robust enough model for alignment even though they still believe in the direction of adversarial robustness for alignment.
Simultaneously, Anthropic publishes a review of adversarial examples and their effectiveness on internal language models. They show that language models with human feedback are harder to find successful attacks against but they’re less harmless compared to traditional models.
They create this UMAP embedding map of all the different adversarial attacks and their success rating. An interesting result is that traditionally explicit harmful or negative responses aren’t very effective but “asking for help” for something harmful is quite effective.
These articles are contextualized by Kasirzadeh and Gabriel who write a philosophical analysis of what it means for language models to be aligned. They frame conversations with language models as linguistic cooperation to an end and build on that idea to define future directions for technical work.
In other news, Refine’s third week of blog posts are out. Refine is a project run by Conjecture in London where researchers receive support for three months to create fringe and interesting perspectives on alignment. This is to diversify the field, something that Thomas Kuhn would be happy to hear about since AI safety is in its early stages and we need good views on alignment.
“Ordering capability thresholds” describes which capabilities come before others and how to think about this progression. “Levels of goals and alignment” describe the authors’ confusion and attempt to understand terminology about inner and outer alignment. “Representational tether” introduces a way to use machine learning to align an AI to human values. One thing I like about this post is how Paul relates the idea to most relevant research agendas.
John explains the idea of coordinate-free interpretability which references topology to create preferred transformations in the neural network that are easier to interpret.
In relation to this post, Jacob Hilton links to the softmax linear units paper that describes privileged basis. Neurons often attempt to encode more dimensions than there are neurons in the model which means that their activation is correlated with multiple understandings of the data.
Their softmax linear unit changes the activation function of neurons to accentuate the largest input. In this way, neurons are biased to encode only one dimension which makes them much easier to interpret, since we know that the neuron activation is associated with one type of concept in the input.
In other news, the Backdoor Bench creates a standard for evaluating attacks and defenses in neural networks, a field that is in an arms race at the moment to create the best protected neural networks. They publish an open repository with implementations of state-of-the-art algorithms for attacks and defense to test ones methods against.
Leon writes a large summary of the 8 weeks of course material from the “artificial general intelligence safety fundamentals course” that contains one of the best introductions to alignment you can find online.
Vanessa Kosoy announces a $50,000 prize for creating research towards her alignment agenda in learning-theoretic alignment where we try to infer how agents learn and use this information to build more interpretable and aligned statistical models.
If you want to learn more about AI safety, go to Apart Research dot com and follow us on various social media. If you want inspiration for projects to work on, go to AI safety ideas dot com.
This has been the Safe AI Progress Report. Remember to subscribe. And we will see you next time.
Links:
Reinforcement learning from human feedback:
https://twitter.com/anthropicai/status/1514277273070825476?lang=en
First SAIPR:
Red teaming LLMs: https://arxiv.org/pdf/2202.03286.pdf
Adversarial training [Redwood]: https://arxiv.org/abs/2205.01663
Robust injury classifier [Redwood]: https://www.alignmentforum.org/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood
Red teaming language models to reduce harms: Review [Anthropic]: https://arxiv.org/abs/2209.07858
Aligning language models: https://arxiv.org/abs/2209.00731
Refine’s third blog post battery: https://www.alignmentforum.org/posts/PhKSe9BT4h5peqrHL/refine-s-third-blog-post-day-week
Coordinate-free interpretability theory: https://www.alignmentforum.org/posts/sxhfSBej6gdAwcn7X/coordinate-free-interpretability-theory
Backdoor bench: https://arxiv.org/abs/2206.12654
Leon Lang’s summary of AGISF readings: https://www.alignmentforum.org/posts/eymFwwc6jG9gPx5Zz/summaries-alignment-fundamentals-curriculum
Vanessa Kosoy’s ALTER prize for learning-theoretic progress in alignment: https://www.alignmentforum.org/posts/8BL7w55PS4rWYmrmv/prize-and-fast-track-to-alignment-research-at-alter
Apart Research: https://apartresearch.com
AI Safety Ideas: https://aisafetyideas.com