Deep ML Safety Research W41
Watch or listen to this week's update on YouTube or as a podcast.
This week, we share exciting ML safety papers, describe dynamics in the field of AI safety, and highlight opportunities for you to work on ML safety yourself.
— Law defines human values
A new, lengthy paper describes the many ways law has already solved the problem of technically defining human values, for example through imperfect but generalized value specification (laws) and human oversight (judges). It frames value alignment both as a theoretical ethical problem (a view Joscha Bach is also pushing) and as a multi-agent coordination and cooperation dilemma. Worth a read! (article)
— Out-of-distribution alignment
The alignment problem can be reframed as an out-of-distribution robustness problem: if the training data does not contain examples of every way a human value plays out in the world, how does the AI generalize beyond its training set? We usually try to address this by transforming the training data, for example mirroring and rotating images to get more examples, and by testing our models on data they haven't seen. More advanced methods use neural networks to generate new data, so-called generative adversarial networks (GANs). A new paper trains these GANs to create a more reliable representation of what we consider out-of-distribution, instead of just testing on other datasets (article).
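As a toy illustration of the standard augmentation baseline mentioned above (not the paper's GAN-based method), here is a minimal sketch using torchvision; the specific transforms and parameters are our own choices:

```python
# Minimal sketch of standard data augmentation (mirroring and rotating images),
# i.e. the baseline approach described above -- not the paper's GAN-based method.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),   # mirror the image half the time
    T.RandomRotation(degrees=15),    # rotate by up to +/-15 degrees
    T.ToTensor(),                    # convert the PIL image to a tensor for training
])

# Usage (hypothetical dataset): every sample gets a fresh random transform each epoch,
# so the model sees more varied examples than the raw training set contains.
# train_set = torchvision.datasets.CIFAR10(root="data", train=True, transform=augment)
```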
— Reward hacking defined
Rewards for machine learning models are defined with some true goal in mind, for example building a sustainable business we can profit from. However, we don't have a good metric to track for "sustainable business", so we define the reward as the amount of money it earns for us. When we define an imperfect proxy reward like this, the AI may end up doing what is called "reward hacking". A new paper defines reward hacking as behavior that increases the imperfect proxy reward while reducing performance on our true goal, and calls a proxy reward unhackable if increasing it never leads to reduced performance on the true goal, in any situation (article).
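Written out roughly in our own notation (the paper's exact formalism may differ), the definition looks like this:

```latex
% Rough formalization in our own notation; the paper's exact definitions may differ.
% J_true(pi) and J_proxy(pi) denote a policy pi's expected return under the true
% and proxy rewards, respectively.
\[
  J_{\mathrm{proxy}} \text{ is unhackable w.r.t. } J_{\mathrm{true}}
  \iff
  \forall \pi, \pi':\;
  J_{\mathrm{proxy}}(\pi') > J_{\mathrm{proxy}}(\pi)
  \;\Rightarrow\;
  J_{\mathrm{true}}(\pi') \ge J_{\mathrm{true}}(\pi).
\]
```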
Relatedly, DeepMind describes goal misgeneralization. This is what happens at the edges of reward hacking: the reward is correctly specified, yet the learned behavior breaks when the system is deployed. An example is an agent that receives reward for visiting locations in a specific order and follows a teacher who visits them correctly during training; at deployment the teacher walks in the reverse order and the student keeps following it. Despite a correct reward, the student model has learned the wrong goal, namely "follow the teacher" rather than "visit the locations in order" (post).
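Here is a toy sketch of that example (our own illustration, not DeepMind's environment), showing how a "copy the teacher" policy earns full reward in training but zero at deployment even though the reward function never changes:

```python
# Toy illustration (our own, not DeepMind's environment) of goal misgeneralization:
# the student learned "imitate the teacher", which looks perfect in training but
# fails when the teacher reverses direction at deployment -- the reward itself is fine.
REWARDED_ORDER = ["A", "B", "C"]

def reward(visited):
    """The (correctly specified) reward: 1 only if the locations are visited in order."""
    return 1.0 if visited == REWARDED_ORDER else 0.0

def student_policy(teacher_path):
    """The behavior the student actually learned: copy whatever the teacher does."""
    return list(teacher_path)

print(reward(student_policy(["A", "B", "C"])))  # training: teacher is correct -> 1.0
print(reward(student_policy(["C", "B", "A"])))  # deployment: teacher reversed -> 0.0
```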
— Inductive biases in learning algorithms
Quintin Pope summarizes 16 papers on inductive bias in learning algorithms, which simply means the ways learning algorithms are predisposed towards particular behaviors. Notable research includes using the neural tangent kernel to visualize learned behavior across different network architectures, analyzing stochastic gradient descent's discrete inductive biases with straightforward methods, and showing that stochastic gradient descent is biased towards selecting non-deep neural networks (article).
Larsen and Gillen summarize the neural tangent kernel research mentioned above in a recent post, where they also share a paper on Gaussian processes with in-depth explanations and interactive demos of what they are. Generally, kernels in machine learning measure similarity between data points, re-expressing the input data in a form our models can work with (link).
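As a small, self-contained illustration of what a kernel does (our own toy example, not taken from the post), here is the classic RBF kernel in NumPy:

```python
# Toy illustration of a kernel (our own example, not from the linked post):
# the RBF kernel scores how similar two inputs are, and the resulting kernel matrix
# is what kernel methods and Gaussian processes operate on instead of raw features.
import numpy as np

def rbf_kernel(X, Y, length_scale=1.0):
    """K[i, j] = exp(-||x_i - y_j||^2 / (2 * length_scale^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2 * X @ Y.T
    )
    return np.exp(-sq_dists / (2 * length_scale**2))

# Example: pairwise similarities between five random 3-dimensional points.
X = np.random.randn(5, 3)
K = rbf_kernel(X, X)   # 5x5 matrix; entries close to 1 mean "very similar" inputs
print(K.round(2))
```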
— Warning shots
Warning shots are events that signal we should begin taking a risk seriously, such as an AI becoming able to replace junior programmers or becoming responsible for more than 10% of the world's GDP. Some argue that such warning shots will be enough to push governments to action; however, Soares argues that COVID-19 was a test case for exactly this process and concludes that we cannot rely on governments to ensure the safety of future AI. This puts even more weight on technical alignment research as the best path forward (post).
— State of AI safety
However, new estimates suggest that only around 300 researchers work full-time in the AI safety field (figure). That is less than half a percent of just the LinkedIn job posts listing machine learning as a requirement, which stood at 98,000 at the last count. Growing the field is therefore very important for ensuring the safety of future AI systems. Relatedly, Marcus summarizes his experiences from speaking with over 100 machine learning academics about safety. Encouragingly, people generally seem more and more open to discussing the risks of AI, and researchers are interested in the field's technical discussions (link).
— Announcements
Dan Hendrycks has released the latest monthly ML Safety Newsletter that we recommend you read if you’re interested in learning more.
The Center for AI Safety has released a $500,000 call for ideas to create benchmarks in AI safety (front page), and Redwood Research calls for people to find emergent heuristics in a small GPT-2 model (article).
Our next ML safety hackathon is about interpretability, and you're very welcome to register your interest already now via the link in the description. If you wish to run a local event with support from us, click the link in the description (itch page).
AGI Safety Fundamentals has released its second course on AI safety, the Alignment 201 curriculum. Sign up for the 10-week part-time interactive course via the link in the description! (front page)
Visit our page at apartresearch.com and follow along here for the next update.
This has been the Safe AI Progress Report, see you next week!