Watch and listen to this week's update on YouTube or podcast.
Interpretability can be called “the neuroscience of AI”: we look into the “brain” of an AI system to understand why and how it produces certain outputs. AI safety often focuses on the Circuits paradigm, but a new survey of 300 interpretability papers shows 20 other paradigms within the field with similarly promising results.
A few examples pointed out by the authors are 1) the activation atlas method, 2) updating training data for behaviourally accurate representations, 3) adversarial methods, and 4) manual fine-tuning of weights.
The activation atlas method resembles Circuits-style interpretability research: it uses a semantic map of neural activations to represent each layer of the network. For a specific image of a fireboat, we can then trace its related activations back through the layers. In this case, the fireboat is related to windows, crane-like objects, geysers, and water.
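To make that concrete, here is a minimal sketch of the data-collection step behind an activation atlas, assuming PyTorch, torchvision, and the umap-learn package are available: collect hidden-layer activations for many images and lay them out on a 2D semantic map. The original work used InceptionV1 and added feature visualisation on top of this layout; the ResNet layer, placeholder images, and UMAP settings below are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: collect hidden-layer activations for many images, then lay
# them out on a 2D semantic map with UMAP. Placeholder random tensors stand in
# for real preprocessed images.
import torch
import torchvision.models as models
import umap  # from the umap-learn package

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

activations = []

def grab_activations(module, inputs, output):
    # Average over spatial positions so each image yields one activation vector.
    activations.append(output.mean(dim=(2, 3)).detach())

handle = model.layer3.register_forward_hook(grab_activations)

# Placeholder batches; substitute real, normalised ImageNet-style images.
image_batches = [torch.rand(8, 3, 224, 224) for _ in range(4)]
with torch.no_grad():
    for batch in image_batches:
        model(batch)
handle.remove()

# Nearby points on this 2D map tend to share semantics
# (e.g. water, windows, crane-like objects for a fireboat).
atlas_coords = umap.UMAP(n_components=2).fit_transform(torch.cat(activations).numpy())
```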
Updating the training data to counteract biases lets us, for example, modify images so that shape is accentuated over texture. This counteracts ResNets’ natural tendency to overfit to textures, something humans do not do, and makes the network behave more like a human. That matters for AI safety, since establishing similar frames of reference can help with value alignment between AI and humans.
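As a rough, hedged illustration of this kind of data intervention (the referenced paper actually builds a stylized training set with style transfer, which is not what the snippet below does): one simple stand-in is to swap an image’s Fourier amplitude spectrum, which carries much of its texture statistics, with that of another image while keeping its phase, which carries shape and edge structure, so texture stops being a reliable cue during training.

```python
# Hedged stand-in for the "shape over texture" data intervention: keep one
# image's phase (shape/edges) and take the amplitude spectrum (texture-like
# statistics) from another image.
import torch

def swap_texture(content: torch.Tensor, texture_source: torch.Tensor) -> torch.Tensor:
    """Keep `content`'s phase, take the amplitude spectrum from `texture_source`.

    Both tensors are (C, H, W) float images in [0, 1].
    """
    content_fft = torch.fft.fft2(content)
    texture_fft = torch.fft.fft2(texture_source)
    mixed = torch.polar(torch.abs(texture_fft), torch.angle(content_fft))
    return torch.fft.ifft2(mixed).real.clamp(0.0, 1.0)

# Toy usage with random tensors standing in for real training images.
content_img = torch.rand(3, 224, 224)
texture_img = torch.rand(3, 224, 224)
augmented = swap_texture(content_img, texture_img)
```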
Adversarial examples can also help us understand mistakes and biases in models and make future systems safer, while more direct interventions include locating factual associations in neural networks, which gives us a much better chance of identifying and fixing inconsistent and possibly dangerous behaviours.
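For a concrete sense of the adversarial-examples angle, here is a minimal sketch of the classic fast gradient sign method (FGSM) in PyTorch. The linked paper goes further and uses feature-level adversaries as interpretability tools, so this pixel-space version is only a simplified stand-in, and the model, placeholder input, and epsilon below are assumptions for illustration.

```python
# Minimal FGSM sketch: nudge each pixel in the direction that increases the
# loss, then compare the model's predictions on the clean and perturbed inputs.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

# Placeholder input: substitute a real preprocessed image and its true label.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
label = torch.tensor([208])  # e.g. an ImageNet class index

loss = F.cross_entropy(model(image), label)
loss.backward()

# Step each pixel slightly in the direction that increases the loss.
epsilon = 0.03
adversarial = (image + epsilon * image.grad.sign()).clamp(0.0, 1.0).detach()

# Comparing model(image) and model(adversarial) shows how small, structured
# perturbations flip the prediction, hinting at which features the model relies on.
```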
However, even with our ability to do interpretability, we still face high risk. The forecasting group Samotsvety has published their estimates of AI risk probabilities, and they are roughly an order of magnitude higher than previous estimates from Metaculus.
Samotsvety generally has a great track record, and their piece complements the existing literature on AGI timelines; a good example is Ajeya Cotra’s “Forecasting TAI with biological anchors” report, which Anson Ho wrote a summary of that we’ll link in the description.
So what can we actually do about these risks? Evan Hubinger proposes what he calls a clear win for AI safety coordination: asking DeepMind, OpenAI, and Anthropic to commit to actively monitoring their models for evidence of deceptive alignment, which can help us identify and catch errors earlier.
Deceptive alignment is a problem where a system’s behaviour in deployment differs from the behaviour that was rewarded in training. An analogy: evolution rewarded humans for rearing children, but we have since created many other ways to get enjoyment out of the world.
This can be a big win for coordinating safety in AI development.
In other news, Quintin has started a weekly alignment research paper summary series,
John thinks most people start out in alignment with bad ideas but gets a bit of pushback from Evan,
Beth Barnes starts a capabilities and alignment tracking project at the Alignment Research Center,
language models replicate cognitive biases from humans,
and maybe academia is actually a really good place to work on AI safety, despite the attention on the for-profit AI scene in San Francisco.
If you want to learn more about AI safety, go to Apart Research dot com, and if you’d like to work on research, go to AI safety ideas dot com.
This has been the Safe AI Progress Report, remember to subscribe, and we hope to see you for the next one!
Links:
Circuits: https://distill.pub/2020/circuits/zoom-in/
Interpretability survey: https://arxiv.org/abs/2207.13243 (PDF: https://arxiv.org/pdf/2207.13243.pdf)
Activation atlas: https://distill.pub/2019/activation-atlas/
Changing training data: https://arxiv.org/pdf/1811.12231.pdf
Editing factual associations in GPT: https://arxiv.org/pdf/2202.05262.pdf
Natural language descriptions of deep visual features: https://arxiv.org/pdf/2201.11114.pdf
Robust feature-level adversaries are interpretability tools: https://arxiv.org/pdf/2110.03605.pdf
Samotsvety’s AI risk forecast: https://forum.effectivealtruism.org/posts/EG9xDM8YRz4JN4wMN/samotsvety-s-ai-risk-forecasts
(June) Forecasting TAI with biological anchors summary: https://www.lesswrong.com/s/B9Qc8ifidAtDpsuu8/p/wgio8E758y9XWsi8j
Monitoring for deceptive alignment: https://www.alignmentforum.org/posts/Km9sHjHTsBdbgwKyi/monitoring-for-deceptive-alignment
Deceptive alignment: https://www.alignmentforum.org/posts/zthDPAjh9w6Ytbeks/deceptive-alignment
Quintin’s alignment paper lineup: https://www.lesswrong.com/posts/7cHgjJR2H5e4w4rxT/quintin-s-alignment-papers-roundup-week-1
Most people start with the same few bad ideas: https://www.lesswrong.com/posts/Afdohjyt6gESu4ANf/most-people-start-with-the-same-few-bad-ideas
Beth Barnes starting a risks and development evaluations group at ARC: https://www.alignmentforum.org/posts/svhQMdsefdYFDq5YM/evaluations-project-arc-is-hiring-a-researcher-and-a-webdev-1
Cognitive biases in LLMs: https://arxiv.org/pdf/2206.14576.pdf
Academia vs. industry: https://www.alignmentforum.org/posts/HXxHcRCxR4oHrAsEr/an-update-on-academia-vs-industry-one-year-into-my-faculty