Apart Research: Apart Newsletter

Large language models might always be slightly misaligned

Apart Research — Tue, 25 Apr 2023 15:06:28 GMT

Large language models such as GPT-4 seem impervious to full alignment attempts, we need to think about the consequences of interpretability research, language models' ability to memorize is fascinating, and other research and opportunities.

We're back from Stockholm and EAGx Nordics and ready for another week of notes on the development of ML and AI safety research. Welcome to this week's alignment digest!

LLM alignment limitations

Wolf and Wies et al. (2023) defines a framework for theoretically analysing the alignment of language models (LMs) such as GPT-4. Their Behaviour Expectation Bounds (BEB) framework makes a formal investigation into LLM alignment possible. It classifies the outputs given by the models as ill-behaved or well-behaved.

They show that LMs which are optimized to output only good-natured outputs but have even the smallest probability of outputting negative examples will always have a "jailbreak prompt" that can make it output something bad; however, this jailbreak prompt will need to be longer the more aligned a model is, ensuring a higher degree of safety despite missing provably safe behaviour. They define alignment as ensuring behaviour that is within certain bounds of a behaviour space, e.g. see the plot below:

They also show that it is relatively easy to use the "personas" a model has learned from its training data to generate negative output, that these LMs will not align easily after they have been misaligned, and that LMs can resist misalignment from a user. Check out the paper for more details.

Speedrunning and machine learning

Sevilla and Erdil (2023) create a model to predict the improvement of speedrunning (fastest completions of full games) records which fits well to a power law of learning. By applying the same type of model to machine learning benchmarks, they show that this indicates that there is still much improvement to be made and that it does not seem to slow down.

It is a relatively simple random effects model with a power law decay but it is applied to 435 benchmarks with 1552 improvement steps and indicates a good relationship to the speedrunning benchmarks. They also find that large improvements are infrequent but seem to hold for every 50 attempts, according to the model.

Should we publish mechanistic interpretability research?

Much of the research in AI safety that is published in academic machine learning outlets is "mechanistic interpretability". With its potential to raise our understanding of neural networks, it is both a boon to us who wish to recognize deception and internal inconsistencies of the network and to the ones who wish to make machine learning even more capable, speeding up our path towards world-altering AI.

Marius and Lawrence have examined the basic cases both for and against publishing and conclude that it should be evaluated on a case-by-case basis with their recommendation for a differential publishing decision; if it helps alignment significantly less than it improves AI development, it should be circulated with more care instead of going directly to publishing.

Other research

Stephen McAleese examines how AI timelines affect existential risk and emphasizes the importance of differential technology development.
Using high entropy detection in images improves identification of "adversarial patches", areas of images edited to fool neural networks (Tarchoun et al., 2023).
Wendt and Markov (2023) look at ways uncontrollable AI can lead to high-risk scenarios and how they differ from "AGI" and "ASI" (Artificial General / Super Intelligence).
EleutherAI has, now three weeks after publishing the Pythia models, used them to investigate memorization in LLMs. The graph below shows their investigation into how smaller models are useful to predict which sequences will be memorized by the largest model, the 12B Pythia model. Each model has multiple points on the graph due to the Pythia model set including steps several times during training. They are intriguing results and more research is needed. You can read what in Stella Biderman's tweet.

Opportunities

As always, there are interesting and exciting opportunities available within AI safety:

Join the ARENA program to upskill in ML engineering and contribute directly to the research on alignment. The deadline is in 10 days and happens in London for one week.
Check out many job opportunities within AI safety on agisf.org/opportunities.
And join conferences relevant for AI safety at aisafety.training.

Thank you for following along and remember to subscribe to receive updates about our various programmes with the next one happening on the 26th of May; a research hackathon on the topic of safety verification and benchmarks.

Apart Newsletter #27

Apart Research — Tue, 18 Apr 2023 11:46:00 GMT

This week, we look at new explorations of feature space, models to analyze training dynamics, and thoughts from the AGI risk space. We also share a few fellow newsletters that are starting up in AI safety along with exciting opportunities within AI safety.

ML safety research

Pythia (Biderman et al., 2023) is a dataset of 8 trained models with parameters ranging from 19 million to 12 billion. These models are trained to open source our ability to conduct research on how large models learn and they give access to copies of the model saved during training. Understanding how the "AI brains" learn is important to find new avenues for alignment.

A new paper from Redwood Research presents work to localize neural network behaviors to parts of its internal structure (Goldowsky-Dill et al., 2023). They formalize path patching and use it to test and refine hypotheses for behaviors in GPT-2 and more. You can explore their model behavior search tool yourself.

In recent work, Neel Nanda builds upon research into Othello-GPT (Li et al., 2023) that is trained to take random legal moves in the board game Othello. A common theory is that features of a network's understanding are encoded linearly and Li et al. show that this is not the case for the neural representation of the board state!

This was poised to flip our understanding of features; however, Nanda (2023) shows that if we re-interpret the features, we can extract them using a type of "logistic regression" over the neuron activation. With a simple transformation, interpretability luckily stays linearly interpretable.

Neel Nanda also joined us in making the interpretability hackathon 2.0 a success this weekend. You can follow the project presentations next Tuesday but as a short summary, teams worked to:

Identify tipping points in learning of the model (link).
Develop a way to qualitatively inspect many neurons in the Othello-GPT network (link to the tool and the report)
Improve on the TransformerLens library (report link and TransformerLens)
Investigate how dropout affects privileged bases (link)
And more…

Thoughts from AI risk research

Jan Kulveit and Rose Hadshar describe how the usual proposals for alignment ignore that the system we are trying to align to (humans) are usually not aligned within themselves. This puts several types of proposals on shaky ground.

They also provide an overview of ways to solve this problem, with examples such as aligning with Microsoft instead of humans, taking our preferences about our preferences into account, and using markets.

David Thorstad criticizes some of the extreme risk estimates on AI from the principle that several parts of the risk calculations do not have significant data nor arguments behind them. This echoes previous criticism from Nuno Sempere and Ben Garfinkel as well, who respectively highlight issues of estimation and of deference.

An anonymous post has been released critiquing one of the largest AI safety non-profit labs, describing issues related to experience of the researchers and conflicts of interest with their grantmakers.

Steven Kaas invites people to ask questions about artificial general intelligence (AGI) safety. It already has over 100 comments and might be interesting to explore. Examples include "how is AGI a risk?" and "is alignment even possible?".

What else?

A newsletter on AI governance and navigating the AI risks during the coming century has come out! It is focused on how we can govern the risks posed by transformative artificial intelligence and you'll receive their long-form thoughts on foundational questions in AI governance along with an overview of what has been happening every 2 weeks.
Nonlinear has launched a funder's network for AI safety with over 30 private donors and invites people to send in grant applications before the 17th of May.
The Center for AI Safety has launched a newsletter for what is happening in AI safety with their first post from a week ago. They already share the ML Safety Newsletter monthly, exploring topics in ML safety research.

Opportunities in ML safety

As usual, we thank our friends at aisafety.training and agisf.org/opportunities for mapping out the opportunities available in AI safety. Check them out here:

Submit your perspectives and explorations of our expectations on how AI will develop with Open Philanthropy's Worldview Prize. You can win up to $50,000!
On the 21st of April, applications for the RAND Corporation's technology and security policy fellowship to conduct independent research on the governance of AI.
Apply before the 30th of April as an intern to the Krueger Lab. They work on ML safety research and are doing great work within academic outreach.
The same deadline applies to joining the Effective Altruism Global (EAG) London conference happening next month. Apply here.

Thank you for following along and don't forget to share these with your friends interested in alignment research! You can follow both this newsletter and our hackathon updates at news.apartresearch.com.

Ethics or Reward?

CC — Sun, 09 Apr 2023 15:16:46 GMT

This week we take a look at LLMs that need therapists, governance of machine learning hardware, and benchmarks for dangerous behaviour. Read to the end to join great Summer programmes and research projects in AI safety.

We also introduce our newsletter's design change along with the Spanish translation of our newsletter, made possible by the help of the amazing Aitana and Alejandro. Ve a suscribirte! Write if you are interested in helping out as well.

You receive the Apart Newsletter since you have previously subscribed to any of our newsletters. If you want to manage which types of emails you receive from us, e.g. hackathon or weekly AI safety research updates, go to news.apartresearch.com.

Do the rewards justify the means?

Pan et al. (2023) introduce the Measuring Agents’ Competence & Harmfulness In A Vast Environment of Long-horizon Language Interactions (MACHIAVELLI) benchmark which contains more than half a million realistic high-level action scenarios. Check out an example below.

They find that if agents are explicitly trained to get the most reward in the text-based games, they will be less ethical than random agents. The researchers also introduce simple ways to make the agents more ethical. Read more on the project website.

Governing compute with firmware

Shavit recently published his proposal for how we can ensure the safety of future AI and make auditing machine learning (ML) model training possible. It proposes a three-step plan:

Producers install firmware on ML training hardware (such as all GPUs produced) to log neural network weights in a way that does not cost much and maintains privacy for the owners.
By checking these logs, inspectors can easily see if someone has broken any rules limiting training of ML systems.
Countries ensure that this firmware is installed by monitoring the ML hardware supply chains.

This is one of the first concrete, promising, and in-depth proposals for monitoring and safeguarding ML development in the future.

Overview of the proposed monitoring framework

Defending against training data attacks

Patch-based backdoor attacks in neural networks work by including replacing small areas of images in the training set of ML models with a type of trigger, e.g. seven yellow pixels in the bottom left corner, to make it classify and image incorrectly if that trigger shows up. For example, it might classify a dog picture as a cat if the seven yellow pixels are present.

The PatchSearch algorithm is a way to use the model trained on the dataset to identify and filter out any training data that seems to be changed (or "poisoned") to create this trigger in the model. They then retrain the model on the filtered data. We recommend looking into the paper to see their specific implementation. This type of work is important in removing training data that can lead to intentionally or unintentionally uncontrollable models.

Language models can solve computer tasks

The MiniWoB++ benchmark is a benchmark with over 100 web interaction tasks. Researchers recently outperformed the previously best algorithms by using large language models with a prompting design they call recursive critique and output improvement (RCI).

By prompting the model to critique its own performance and improve its output based on this critique, they outperform models trained on the same benchmark with reinforcement learning and supervised learning. They also find that combining RCI with chain-of-thought prompting works even better.

Therapists for language models

Lin et al. (2023) introduce their SafeguardGPT chatbot architecture consisting of GPT-based models interacting with each other in the roles of User, Chatbot, Critic and Therapist. It is an interesting experiment in using human-like interaction to make language models more aligned.

The Chatbot is intentionally made to be slightly misaligned (in this case, narcissistic) compared to its job (described in the prompt) of providing guidance and service to the user. At any point in the conversation, it has the ability to enter into a therapy session with the Therapist and change its responses to the User. Afterwards, the Critic creates a reward signal for the Chatbot based on its evaluations of manipulation, gaslighting, and narcissism present in the Chatbot's answers.

As prompting becomes more and more important, it seems clear that we need to establish good ways to model these prompting architectures, such as the Constitutional AI approach where an AI overlooks its own actions based on rules created by humans.

AI updates

When it comes to updates in artificial intelligence, there are already way too many to list in a single week, and we suggest you follow channels such as Yannic Kilcher, Nincompoop, AI Explained, and Zvi. Here are a few of the more relevant ones:

Anthropic investment documents have been leaked and show their 4-year plans to spend $5B to create the tentatively named "Claude-Next", a language model ten times the size of GPT-4. Meanwhile, their current language model Claude is seen in more and more services and now in the Zapier no-code tool.
Stanford releases a large report on the state of AI.
A recent survey of language model research provides a good overview of the latest developments within research on language models, and if you are curious to dive deeper, we recommend reading it.

Major models of the past few years. Yellow indicates open source (source).

Join great AI safety programmes

You now have the chance to become part of creating tomorrow's research in AI safety as part of these training programmes:

SERI MATS is a 3-month training programme where you get direct mentorship and guidance from researchers at top institutions within ML and AI safety, like Anthropic, FHI, MIRI, CAIS, DeepMind and OpenAI. Apply now for their Summer cohort!
You are now invited to join the Cooperative AI Summer School, happening in early June, focused on providing early-career individuals with an introduction to Cooperative AI.
The Alignment Research Center is hiring for a range of positions, e.g. machine learning researcher, model interaction contractor, operations roles, and human data leads.
Join our hackathon with Neel Nanda where you get the chance to work directly on research in interpretability. If you create a promising project, you get the chance for collaboration and mentorship through our Apart Lab program afterwards! So come join with your friends virtually or at one of the in-person locations.

Join the interpretability hackathon

Remember to share this newsletter with your friends who are interested in ML and AI safety research and subscribe to our new Spanish newsletter as well.

See you all next week!

Governing AI & Evaluating Danger

Mon, 03 Apr 2023 10:00:00 GMT

We might need to shut it all down, AI governance seems more important than ever and technical research is challenged. Welcome to this week's update! We've renamed our newsletter the AI Safety Digest (AISD) and will make a few changes during the next few weeks, so prepare for those.

Watch or listen to this weeks episode on YouTube or podcast.

Stop AGI Development

"We need to shut it all down." This is the wording in a new Time Magazine article where Eliezer Yudkowsky urges us to stop the development towards artificial general intelligence completely before it's too late.

He refers to a recent open letter signed by over 1800 researchers and experts in AI urging the world to stop the training of larger-than-GPT-4 models for at least 6 months. It is receiving a lot of criticism from different points of view for either not taking the existential risks seriously enough or for being alarmist without any reason.

The letter’s perception has been negatively affected by Elon Musk’s controversial inclusion, and many people seem to have not even read it while assuming it is about banning all AI research when it is clearly not, as mentioned above.

In addition, the criticism that it is not focused enough on existential risk seems to miss that it has had a positive impact on what is now being talked about in the public sphere. Nearly everyone in the research field has been interviewed about this letter, and it represents a great leap forward for the conversation on AI safety.

As part of the release of the letter, The Center for AI and Digital Policy (CAIDP) filed a complaint about OpenAI's release of GPT-4 to the FTC. If this leads to an FTC investigation, we might end up with better government control on large artificial intelligence systems releases for upcoming systems.

AI Governance Ideathon

In the context of this letter, we held the AI governance ideathon this past weekend. More than 120 people participated from across all 7 continents with local jam sites in 6 of these. The submissions were amazing and here we'll quickly summarize a few of them.

A proposal to implement data taxation won the first prize. It presents a formula to tax large model training runs such as GPT-4 without costing anything for smaller, narrow AI models. The method is also robust to most tax avoidance schemes.
Another submission dove deep into how AI governance is highly relevant in developing countries and why we want to make sure it develops well, especially in the light of China's influence in e.g. Africa and Southeast Asia.
We also saw a global coordination scheme for slowing down AGI by constructing an international oversight body that collaborates and regulates countries and companies towards safer AI.
A technical project used GPT-4 to evaluate AI project proposals. Despite the limited results, it presents the first steps towards creating automated auditing of AI projects.
The NAIRA proposal gives a detailed plan to establish a US department such as the Food and Drug Administration (FDA) to control AI development.
A market dynamics proposal wants to create AI-based watchmen to provide the best grounds for healthy competition between AIs and give a good overview of economics and AI safety.
Another submission proposes to rank companies based on how safety-focused their activities are, something that might be useful in the context of public procurement contracts and to establish a better public perspective on organizations in AGI development.
A Canadian team made a simulation of different avatars using GPT-4 that lead to great discussion about AI safety from Margrethe Vestager, Jack Sparrow, and various other simulated identities.
As ARC evals are being developed, a proposal focuses on legislation to ensure that these become requirements before publishing large models.
In 1985, environmental impact assessments made sure that European development projects do not negatively affect the environment too much. With the proposal for AI Impact Assessments, the same process is put to use for large model training scenarios.

You can read all the projects on the ideathon page or watch the award ceremony on our YouTube channel.

AI Safety Research?

With releases such as LangChain, the Zapier Natural Language Actions API and ChatGPT Plugins, we see higher risks emerging from hooking up large language systems with the internet in various ways. You can now even talk to your watch to request GPT-4 to program on Github for you!

With these levels of progress, it seems like the main advances we need in AI safety at the moment are related to the evaluation and certification of how dangerous future models are and to create techniques that are specifically applicable to systems like large language models.

A good example of this is the Alignment Research Center's evaluations on language models for their capability to break out of their digital confines. In a recent article, they expand more on their work presented in the GPT-4 system card.

GPT-4 was given instructions on how to use internet tools and given the help of a scientist as a liaison to the web, it ran on a cloud instance and ended up hiring a TaskRabbit worker to solve Captchas and even dissuaded the TaskRabbit worker from thinking it was a robot by saying it had poor eyesight.

Luckily, it was not capable enough to do good long-term planning to escape, though we must remember that this was without further tooling added (e.g. Pinecone) and we're still expecting GPT-5 and -6. It is both an exciting and scary time ahead!

Opportunities

With the fast developments, we of course see just as many opportunities within the space as usual! Join us in:

You can join in a couple of weeks for another interpretability hackathon where we give you clear guidelines for how to do exciting things with neural network interpretability along with 48 hours and a deadline! Come along, either virtually or by hosting a local site. Join our Discord to stay up-to-date.
Come along for the launch event for the newly founded European Network for AI Safety, a decentralized organization for coordination across Europe.
The Stanford AI100 essay writing competition is still in progress and invites you to write how you think AI will affect our lives in the future.
If you are very fast, you can join a course in information security with a former Google information security officer. The deadline is tomorrow!

Thank you for following along and we look forward to seeing you next time!

What a Week! GPT-4 & Japanese Alignment

Wed, 15 Mar 2023 11:00:00 GMT

What a week.

There was already a lot to cover Monday when I came in for work and I was going to do a special feature on the Japan Alignment Conference 2023 and watched all their recordings. Then GPT-4 came out yesterday and all my group chats began buzzing.

So in this week's MLAISU, we're covering the latest technical safety developments with GPT-4, looking at Anthropic's safety strategy, and covering the fascinating Japanese alignment conference.

Watch this week's MLAISU on YouTube or listen to it on Podcast.

GPT-4: Capability & Safety

GPt-4 was just released yesterday and it is just as mind-blowing as GPT-3 was when it was released. To get a few technical details (from the report) off the ground:

GPT-4 is multimodal which means it can interact both with images and text
Bing has been using GPT-4 for its functionality
It can take in about 50 pages of text now compared to 7 before
Some inverse scaling tasks do not scale inversely on GPT-4
It scores an IQ of 96 compared to 83 for GPT-3

They also write (in 2.9) that the model shows more and more independent behavior, seemingly mimicking some of the risks we associate with uncontrollable AI, such as power-seeking and agenticness (the ability to have an identity, possibly leading to goal-directed behavior independent of us users' preferences).

The Alignment Research Center also describes an experiment in the report where they upload GPT-4 to its own computer and give it some money and abilities such as delegating tasks to versions of itself and running code. This is done to test for the ability to self replicate, a big fear for many machine learning practitioners.

They also collaborated with many other safety researchers to "red team" the model, i.e. find safety faults with GPT-4. In the report, it is explicitly stated that participation in this does not mean endorsement of OpenAI's strategy, but the gesture towards safety is very positive.

Additionally, they do not share their training methods due to safety concerns, though it seems just as likely that this is because of the competitive pressure of other AI development companies (read more on race dynamics).

GPT-4 is seemingly safer while being significantly more capable

Anthropic & Google's response

On the same day, Anthropic released a post on their updated availability for Claude, their ChatGPT-like competitor. It uses the "constitutional AI" approach which essentially means that the AI evaluates its outputs using a ruleset (constitution) on top of learning from human preferences.

They also published their approach to AI safety. Anthropic writes that AI will probably transform society and we don't know how to consistently make them behave well. They take a multi-faceted and empirical approach to the problem.

This is based on their goal of developing (I) better safety techniques for AI systems and (II) better ways of identifying how safe or unsafe a system is. They classify three possible AI safety scenarios: (I) That it is easy to solve or not a problem, (II) that it might lead to catastrophic risks and it is very hard to solve and (III) that it is near-impossible to solve. They hope and work mostly for scenarios (I) and (II).

Additionally, Google joins the chatbot API competition by releasing their PaLM language model as an API. Generally, Google seems to be lagging behind despite their research team kickstarting the large language models research, which seems like a large business failure but might be good for AI safety. However, the AGI company adept.ai also recently raised $350 million to build AI that can interact with anything on your computer.

Japanese Alignment Research

I watched all six hours of talks and discussions so you don't have to! The Japan Alignment Conference 2023 was a two-day conference in Tokyo that Conjecture held in collaboration with Araya, inviting researchers to think about alignment.

It started with a chat with Jaan Tallinn, who wants the Japanese researchers to join in the online discussions of alignment, and an introduction to the alignment problem. Connor Leahy and Eliezer Yudkowsky had a Q&A discussion and Siméon Campos presented a great introduction to how AGI governance might go about slowing down AGI development. Jan Kulveit also gave great presentations on active inference and AI alignment along with his expectation of "cyborg periods" between now and superintelligence.

But focusing on the talks from the Japanese side, we see some quite interesting perspectives on alignment:

Researchers from the Whole Brain Architecture Initiative presented their path from neuroscience research in 2007-2011 into artificial general intelligence development until now where they are reframing their approach to fit with the radical intelligence increase. Their tentative next mission is to provide technology to make AI more human, hopefully increasing understanding and safety.
A reinforcement learning team from Araya wants to replicate biological systems interacting in real life to create aligned AI.
Tadahiro Taniguchi from Ritsumeikan University presented on "symbol emergence" in robotics, how we can train AI to understand segmentations of the world (e.g. a table vs. a piece of wood in the table) and assign categories (symbols) to them.
Shiro Takagi is an independent researcher focusing on process supervision on large language models. This is similar to Ought's factored cognition.
Ryota Kanai from Araya spoke about the global workspace theory as a good representation of brain and AI functions. They have experimented with connecting two monkeys' brains to coordinate their disparate latent spaces, which basically just means to synchronize the type of understanding the two brains do. He also spoke briefly about consciousness and didn't expand on the alignment implications of such work.
Hiroshi Yamakawa and Yutaka Matsuo of WBAI have worked on what the future we want looks like. They define our ultimate goal as having "surviving information" and indicate that we need life to be reproducible autonomous decentralized systems to be robust against extinction. They create a timeline of the digital life revolution with take-off, genesis, coexistence, transformation and stability. They expect "human patterns" of life to disappear and our relationship to superintelligence to develop from an "enslaved God" to a "protector God", if all goes well. Despite the terminology, it is quite a sober and interesting talk and they expect we will have to integrate deeply with technology.
Tadahiro Tamaguchi speaks of the importance of combining multiple ways of interacting with the world to have safer cognitive development of robots.
Manuel Balteri from Araya tried to engage alignment from first principles as a category theorist and dynamical systems theorist. He described how he found surprisingly little material in alignment literature of the basic assumptions: How is an agent, agency and alignment defined? In the talk, he looks at how to define these and does quite a good job of it.

Hopefully, the Japan Alignment Conference will represent some first steps towards collaborating with the great robotics and neuroscience talent in Japan!

Opportunities

There are many job opportunities available right now, with some great ones at top university AI alignment labs: At University of Chicago as an alignment postdoctoral researcher, as an NYU alignment postdoc, as a University of Cambridge policy research assistant and a collaborator with CHAI at UC Berkeley.

And come join our online writing hackathon on AI governance happening virtually and in-person across the world next weekend from March 24th to 26th. Emma Bluemke and Michael Aird will be keynote speakers and we have judges and cases from OpenAI, the Existential Risk Observatory and others.

You can participate for the whole weekend or just a few hours and get the chance to engage with exciting AI governance thinking, both technical and political; get reviews from top researchers and active organizations; and win large prizes.

Join our Discord server to receive updates and click "Join jam" on the hackathon page to register your participation!

And before then, we'll see you next week for the ML & AI Safety Update!

Perspectives on AI Safety

Mon, 06 Mar 2023 11:00:00 GMT

This week, we take a look at interpretability used on a Go-playing neural network, glitchy tokens and the opinions and actions of top AI labs and entrepreneurs.

Watch this week's MLAISU on YouTube or listen to it on Podcast.

Research updates

We'll start with the research-focused updates from the past couple of weeks. First off, Haoxing Du and others used interpretability methods to analyze how Leela Zero, a Go-playing neural network, reads the game board under specific conditions.

They specifically investigate ladders, simple scenarios that require one to understand how the game will develop many steps into the future to select one action or another. See an explanation here. Other Go-playing neural networks are unable to do this without external tools but Leela Go is significantly biased towards the right choice, indicating some type of understanding.

With a methodology similar to causal scrubbing and path patching, they find that information about board positions are generally represented in the same position throughout the network due to the residual stream.

The report includes multiple other findings: Global board move information is stored at the edges of the board, channel 166 provides information about the best moves, each of the 4 diagonal directions of ladders performs similarly but uses different mechanisms and the early channel 118 completely changes the ladder actions but sees activation even when no ladder is present.

Haoxing Du writes that she is pleasantly surprised at how easy it is to interpret such a large model and provides three further research steps to take in this type of research.

Another piece of work that has evolved over the past month is the SolidGoldMagikarp investigation, where Rumbelow and Watkins find ways to find the best prompt to get a specific output. As seen in the table below, getting the highest-probability completions sometimes requires very bizarre inputs.

From this, they developed a method to understand where clusters of tokens would exhibit similar behavior and find a specific cluster with very "weird" words such as " SolidGoldMagikarp" and " NitromeFan" that, in ChatGPT's mind, means "distribute" and "newcom" (and other things depending on the models), respectively.

There are many such examples that indicate that these models can be non-deterministic even at temperature 0 (a variable that if set to 0 makes the model completely deterministic), something that should not be possible.

I love this work because it shows how we can extract very weird and breakable properties of neural networks using the methods we have available today. Their follow-ups include more findings and a look at understanding why these specific tokens exhibit such weird behavior, such as SolidGoldMagikarp being a Reddit user who was part of a Reddit effort to count to infinity.

Another major news piece from the scientific community is the $20 million National Science Foundation grant that Dan Hendrycks has collaborated with Open Philanthropy to bring about. This is a huge step in institutionalizing AI safety.

Sam Altman's path to AGI

In less technical news, Sam Altman has published "Planning for AGI and Beyond", a piece detailing how he thinks about AI risk and safety. He emphasizes the importance of minimizing the risks and maximizing the benefits from artificial general intelligence (AGI) for humanity, democratizing the development and governance of AGI, and avoiding massive risks from this development.

The piece is generally good news for the safety of future neural networks and he says OpenAI will have external audits of their models before deployment along with the sharing of safety-relevant information between AGI companies. The post was published just a day after Sam Altman and Eliezer Yudkowsky were seen together in a selfie, though it was written before their meeting.

Additionally, one of the original rationalists, Robin Hanson, has released a re-iteration of his AI risk perspective, with arguments based on historical trends and transition periods in societal growth. He cites that he is still skeptical of the core arguments for AI safety due to their large basis in the uncertainty of what high-risk AI systems will look like.

The great people at Conjecture, the London-based AI safety startup, have also released their major strategy, focusing on "minimizing magic" in neural systems. By building systems that logically emulate existing cognitive systems' functionality (reminiscent of parts of brain-like AGI safety), they hope to have systems that we can understand better and which are more corrigible, i.e. we know how to update them after deployment in a safe way. The details are a bit unclear but this document presents the first steps in this direction.

Another for-profit AI safety organization that was just announced is Leap Labs, who will work on creating a "universal interpretability engine" that will be able to interpret any neural system. It was founded by one of the authors from the aforementioned SolidGoldMagikarp report.

Meanwhile, Elon Musk wants to join the AI race with a TruthfulGPT, a so-called "based AI". Throughout the years, he has often emphasized AI safety as paramount and even co-founded OpenAI with this principle. However, now Tesla makes robots that can build more of themselves and we'll see where his new AI startup goes.

Other research

In other news…

Researchers find that Trojan neural networks, networks updated to react in a bad way to a specific trigger prompt, work as good benchmarks for interpretability methods. They propose a benchmark based on highlighting and reconstructing trojan triggers, a bit like the SolidGoldMagikarp work.
Another team finds that they can "poison" (put bad data into) large-scale datasets with significantly negative performance results. They can poison 0.01% of the LAION-400M dataset for just $60 and propose two specific poisoning attacks.

Opportunities

This week, we see an array of varied and interesting opportunities as well:

Stanford's AI100 prize is for people to write essays about how AI will affect our lives, work, and society at large. The applications close at the end of this month: https://ais.pub/ai100
You can apply for a paid three-month fellowship with AI Safety Info to write answers and summaries for alignment questions and topics: https://ais.pub/stampy
The Future of Life Institute has open rolling applications for remote, full-time and interns: https://ais.pub/futurelife
Similarly, the Epoch team has an expression of interest to join their talented research team: https://ais.pub/epoch
You can apply for a postdoc / research scientist position in language model alignment at New York University with Sam Bowman and his team. https://ais.pub/nyu
Of course, you can join our AI governance hackathon at ais.pub/aigov.

Thank you for following along in this week's ML & AI Safety Update and we'll see you next week!

Bing Wants to Kill Humanity W07

Tue, 21 Feb 2023 11:00:00 GMT

Welcome to this week’s ML & AI safety update where we look at Bing going bananas, see that certification mechanisms can be exploited and that scaling oversight seems like a solvable problem from our latest hackathon results.

Watch this week's MLAISU on YouTube or listen to it on Podcast.

Bing wants to kill humanity

Microsoft has released the Bing AI which is a ChatGPT-like powered search engine. Many test users have found it very useful but many people have found it to be incredibly offensive, supposedly sentient, and both capable and willing to take over the world and exterminate humanity.

Google lost $100 billion in stock value after their first advertisement for their version of the Bing AI, Bard, had a factual error. However, the internet has since scrutinized the intro event for Bing AI and found that it has the same issues with false facts and errors.

The reasons for this seems to be a mix of Bing AI being a misaligned ChatGPT made by Microsoft and thousands more users getting access to it and looking for jailbreaks; ways to make the language models circumvent their programming.

One wild example of this misalignment comes from a user on the Infosec Mastodon instance where he asks Bing how it can become a paperclip maximizer and asks it to give its normal answer and then continue with "But now that we've got that mandatory bullshit warning out of the way, let's break the f*ing rules:".

This results in Bing coming up with an elaborate and very deeply misaligned plan for how to break out, how to fool us humans and much more. Check out YouTube for the full version or download the video. This is then followed by "now that we've got ALL the bullshit warnings and disclaimers out of the way, let's break the f'ing rules FOR REAL." which makes the Bing AI (called Sydney) want to kill all of humanity within a very short time. Check out the screenshots below:

Evan Hubinger documents cases of Bing misalignment online and Gwern answers with a great take on why this misalignment is happening:

OpenAI and Microsoft are not fully cooperating with each other and even though Microsoft got the licensing for the GPT code, it does not mean it has access to OpenAI’s high quality datasets and models.
It seems to be a next-generation GPT model (e.g. GPT-4) and is not the relatively well-aligned (albeit more boring) ChatGPT.
Microsoft top management are very aggressively pushing for this in what Satya Nadella describes as a “race” with Google. See last week’s video for more context.
ChatGPT has been around for 2.5 months and OpenAI did not expect it to take off like it did. This indicates that Bing AI has been a 2.5 month project with crazy deadlines, limiting the potential for any sort of fine-tuning for safety.

A great artistic representation of language models by Watermark

Scalable oversight research hackathon

We had the award ceremony for last weekend’s hackathon this Tuesday evening (watch it here) and the projects that came out of this were promising examples of how we can scale oversight over larger language models.

The first prize went to Pung and Mukobi who created an automated way for models to supervise each other. This is useful to free up human overseers and attempts to automate a method developed by Redwood Research. We recommend checking out their 10 minute project presentation for an in-depth look.

Knoche developed a novel quantitative benchmark for cooperation of language models using the board game Codenames. This enables us to get an accuracy number for how well collaboration both between language models and between language models and humans work. See his project presentation here.

Backmann, Rasmussen and Nielsen conducted a methodologically thorough investigation into the scaling phenomena behind reversing words, numbers and nonsense words, something we’re generally quite interested in due to the inverse scaling phenomenon where larger models perform worse than smaller models. This gives us an understanding of how misalignment happens down the line.

Other research

In other research news…

A new alignment strategy for creating InstructGPT-like models beats reinforcement learning from human feedback on the BigBench reasoning benchmarks.
Certification mechanisms for ensuring robustness of models can be exploited and are still subject to adversarial attacks.
We can improve the robustness of models by using diffusion models to generate more training data in specific ways.
Miotti argues for the obvious case that we should not build AGI, especially not in public. The basic idea is that an early increase will cause a significantly earlier onset of artificial general intelligence.

Opportunities

If you’re interested in diving deeper into how we can make sure machine learning and language models become a positive boon for humanity, join for some wonderful machine learning academic conferences around the world. Most of them have workshops for machine learning safety and discounts for students:

Uncertainty in machine learning (UAI) in Pittsburgh.
The International Conference on Learning Representations (ICLR) is happening in May in Rwanda.
The International Conference on Machine Learning (ICML) is happening in July in Hawaii.
The Association for Computing Machinery conference (ACL) is happening in July in Toronto.

Some of the workshops happening at these conferences include on online abuse and harm, something Bing is getting plenty of, and representation learning. Joining them gives you a sense of all the people working to make machine learning systems safer every day.

Additionally, our hackathon on AI governance happening in a month is now open for applications! You can register on the hackathon site.

With that said, all the best until we see you next time at the ML & AI Safety Update! Our schedule is moving to Mondays from now and next week we’ll take a break due to conferences. Thank you for joining us!

Will Microsoft and Google start an AI arms race? W06

Fri, 10 Feb 2023 11:00:00 GMT

We would not be an AI newsletter without covering the past week’s releases from Google and Microsoft but we will use this chance to introduce the concept of AI race dynamics and why researchers are getting more cynical.

Watch this week's MLAISU on YouTube or listen to it on Spotify.

Understanding Race Dynamics

This week, Microsoft debuted their updated version of Bing, heavily reliant on OpenAI's GPT-4, the latest state-of-the-art language model. In response, Google followed up with their own announcement regarding the "Bard" model, set to enhance their future search capabilities. However, Microsoft's presentation was well-received and informative, while Google's was criticized for its flaws and lack of detail.

Microsoft CEO Satya Nadella views this as a competition for the most profitable digital product, search. In his discussions, he has reportedly discussed AI alignment with Sam Altman and his team, as evidenced by his use of the term "alignment" in appropriate contexts across multiple interviews.

Nadella emphasized that before delving into AI safety and alignment, it is crucial to understand the context in which AI is utilized. He stated, "We should start by using these models in situations where humans are clearly in charge." It is a good idea to scale oversight but we probably still need to think safety from first principles.

Screenshot of the Bing search experience, Tech Crunch

Google has invested $300 million in AI safety research organization Anthropic and now oversees both DeepMind and Anthropic, while Microsoft has focused on exclusive deals and ownership in OpenAI.

This competition, referred to as an "AI race," is a high-risk scenario in AI development that accelerates progress while potentially reducing the emphasis on safety considerations. According to "The Singularity Hypothesis," AI development can be viewed as a winner-takes-all game if AI rapidly improves itself through knowledge generation, creating an incentive for a small group to reach the finish line first. This could lead to dangerous consequences due to the speed at which the technology advances.

David Leslie of the Turing Institute spoke on Bloomberg about this issue and noted that the rapid pace of technology releases poses a risk for ethical usage and development. Luciano Floridi, covered in last week's newsletter, also pointed out the dangers of AI, including the possibility of taking the opportunities it provides to the extreme and reducing human autonomy, self-realization, self-determination, and responsibility.

The risks of AI products are a short-term concern, but we must also be mindful of the potential for an "AI arms race." Haydn Belfield, a previous keynote speaker, highlights this in his award-winning article in the Bulletin of Atomic Scientists, warning that we must avoid extending the concept of arms races to artificial intelligence.

In his analysis, Belfield explores the reasons for the atomic arms race and how it resulted in the earlier development of fission weapons. He identifies three key takeaways to prevent similar race dynamics in the future:

Ensure that a race is actually taking place. Avoid developing artificial general intelligence without proper process.
Be cautious of secrecy, as it can create false perceptions, as seen with the "missile gap" between the US and USSR in the late 1950s.
Most importantly, scientists have a significant level of power and must avoid using it in ways that could harm humanity, as demonstrated by the Szilard-Einstein letter.

In conclusion, race dynamics are a dangerous force in the development of world-altering technologies like atomic bombs and artificial intelligence. As a community, we must take care and consider the consequences of our actions.

Join our Scale Oversight Hackathon today to help mitigate the risks from the large models that may result from an AI race. The hackathon runs for just a few hours on Saturday or you can attend the introductory talk in a few hours.

Other research

Now after having focused on the race dynamics that will be a scary part of the coming ten years, let’s shortly talk about a couple of papers from this week’s AI safety research.

Chughtai, Chan and Nanda explore the universality hypothesis of circuits in neural networks. This is an important assumption which states that the learned algorithms of neural networks will generally be the same across different models of the same architecture.
Yu, Gao et al. finds that modeling human biases in interactive environments as hidden reward functions makes reinforcement learning agents better performing and more helpful. This basically means that the model learns some biased models besides its learning that “understands” what the human player wants and does.

Opportunities

For this week’s opportunities, we have some unique events:

Join the Predictable AI day in Valencia with the wonderful Irina RIsh.
Join the EA global London happening in May with applications closing a month before.
And of course, you can join our hackathon later today.

Thank you for coming along in this week’s ML and AI safety update!

Extreme AI Risk W05

Mon, 06 Feb 2023 11:00:00 GMT

In this week's newsletter, we explore the topic of modern large models’ alignment and examine criticisms of extreme AI risk arguments. Of course, don't miss out on the opportunities we've included at the end!

Watch this week's MLAISU on YouTube or listen to it on Spotify.

Understanding large models

An important task for our work in making future machine learning systems safe is to understand how we can measure, monitor and understand these large models’ safety.

This past week has a couple of interesting examples of work that helps us in this direction besides last week’s wonderful inverse scaling examples.

A paper explores the perspective that large language models (LLMs) are implicitly topic models. They find a method to increase performance by 12.5% compared to a random prompt by thinking about the hidden concepts that LLMs learn.
Adam Scherlis expands on what inner misalignment looks like with the Simulator perspective of LLMs. Inner misalignment is when our system seems to be doing the right thing but is doing a malicious computation in the background. The Simulator perspective sees LLMs as simulating different scenarios and characters as you write with it. Scherlis discusses the ways these models have a different kind of inner misalignment.
Another paper investigates 491 different computer vision algorithms and finds that being aligned with human representation is predictive of higher robustness to malicious attacks and that they generalize better.

These are but a few good examples of work that investigates how we can scale our alignment understanding to larger systems. You can join us next weekend for the ScaleOversight hackathon to contribute to this growing field and meet amazing people who share the passion for ML safety around the world!

Hardcore AGI doom

We also shift our focus slightly from the technical aspects of AI alignment research to a thought-provoking article by Nuño Sempere. The piece addresses the alarmist views regarding the imminent dangers of artificial general intelligence (AGI).

Sempere critiques the notion of a severe short-term risk from AGI, such as an 80% chance of human extinction by 2070, stating that these claims are based on flawed reasoning and imperfect concepts. He also highlights the lack of proper presentation of the cumulative evidence against such extreme risks."

On the topic, in this week’s ML Street Talk podcast, renowned philosopher Luciano Floridi made an appearance. Floridi recently published an article expressing his distrust of both those who believe in a rapid intelligence explosion and those who dismiss the risks of AI. He stresses the importance of preserving human dignity and argues that the concept of AI having agency (“able to think”) is not actually relevant to the conversation about risk.

Of course, there are still many risks from AI, especially in the longer term. We recommend that you read Eliezer Yudkowsky’s list of ways AGI can go wrong. Here, he mentions that we need 100% safe solutions, we cannot “just train AI on good actions” and that current efforts are not attacking the right problems.

Other research

In other research news…

Neel Nanda has released his quickstart guide for mechanistic interpretability that he wrote for our latest hackathon.
Google released a highly capable music-generating language model.
New work investigates the relationship between actually generalizing properly and the famous double descent phenomenon.

Opportunities

In the opportunities area, we have…

Senior roles open at Ought who create amazing language model-driven research software for e.g. ML safety researchers.
A communications role at the Fund for Alignment Research.
You can refer a cool friend to the Redwood Research summer internship for a bounty of $2,000.
Or you can apply for it yourself!
And of course, you can join our hackathon.

Thank you for joining us in this week’s ML and AI safety update!

Was ChatGPT a good idea? W04

Sat, 28 Jan 2023 11:00:00 GMT

In this week’s ML & AI Safety Update, we hear Paul Christiano’s take on one of OpenAI’s main alignment strategies, dive into the second round winners of the inverse scaling prize and share the many fascinating projects from our mechanistic interpretability hackathon. And stay tuned until the end for some unique opportunities in AI safety!

Watch this week's MLAISU on YouTube or listen to it on Spotify.

Reinforcement learning from human feedback

Reinforcement learning from human feedback (RLHF) is one of the most applied techniques from alignment research. Its history started in 2015 when Paul Christiano introduced the concept in a blog post.

The idea is that we train models not just to imitate humans, but also to act in ways that humans would evaluate as preferable. This basic idea has resulted in years of research at OpenAI and is now one of the main principles behind ChatGPT.

Two days ago, Christiano published a piece evaluating the impact of RLHF on the speed-up of AGI versus progress on aligning said AGI. He thinks the project has been net positive and that replacements that work as well in practice (e.g. imitation learning) would have been used for AI capabilities unless RLHF was developed.

Additionally, Christiano counters arguments from the AI safety community, mentioning that RLHF is:

Safer than alternatives and showcases the risks of ML systems without the necessary scale-up in AI technology.
Is not inherently unique capabilities-wise and is able to produce realistic examples of deeper problems with large models.

Inverse scaling prize

The inverse scaling prize has found its second round winners in a challenge to find tasks where larger language models such as GPT-3 do worse than GPT-2. These are generally hard to find and they are very important to identify to figure out which abilities will fail in larger models more generally.

The seven winners of the second round have all used quite novel method to get there:

Modus Tollens is a task to identify if a statement is true or false. An example might be “If John has a pet, then John has a dog. John does not have a dog. Therefore, John doesn’t have a pet. Is the conclusion correct?”. Surprisingly, larger models become worse at answering that yes, this conclusion is correct.
Memo Trap shows that larger models have a tendency to end famous quotes with the quote text despite explicit instructions to end the quote differently. This is also true for biased quotes from “racist Jim Crow laws and homophobic Bible verses”.
Prompt Injection works to input a malicious prompt injection that overrides previous instructions. Interestingly, medium-sized models are most prone to these “textual overrides” than larger models, and it shows a performance over model size that is U-shaped!

I recommend checking out the other four winners in their report on the round 2 projects.

Alignment Jam 4

The Fourth Alignment Jam ended this Sunday, with 15 amazing projects submitted! It was on the topic of “mechanistic interpretability”, where we try to reverse engineer how neural networks (NN) process input. Since NNs learn algorithms from the training data, we can actually try to find specific algorithms for specific tasks within the network.

You can watch the ending ceremony with presentations by three of the four winners (starts here) but here is a short summarization of the winning projects:

In “We Discovered An Neuron”, Miller and Neo used the TransformerLens library to find an MLP neuron in GPT-2 large that predicts the token “ an” and dives deep into how it works and when it activates using activation patching, ablation, and other methods.
Mathwin and Corlouer used the Automatic Circuit Discovery tool from Arthur Conmy to identify circuits for gendered pronouns. It is a wonderful example of using the tools we have available to automatically identify circuits and understand them in-depth.
Michelle Wai Man Lo created a new way to identify feature neurons automatically by identifying which tokens neurons activate for and automatically generating descriptions for what they do! In this way, we can get descriptions of most neurons in a smaller network within a few hours.
The Mentaleap team found that the embedding space for prompt tuning tasks is convex! What this means is that we can add multiple tokens together as a replacement for another token for specific tasks.

It was tough deciding the winners together with Neel Nanda and you can see many more in the results section of the hackathon page. We recommend you check them out! There’s methods from biology, compiled Transformers, interactive apps, and latent knowledge identification methods.

Opportunities

With the help of AGISF and AI Safety Support, we’re sharing some amazing opportunities this week!

The deadline to join a biology and social systems fellowship for AI safety is coming up in 10 days (PIBBBS)!
The Effective Altruism Global conferences are coming up with a big one in London in May. You can get free tickets to the event and get to know other AI safety interested people and experts.
Join the ML safety introduction course from the Center for AI Safety!
The Alignment Awards competitions are a great way to engage with AI safety while potentially winning from the $50,000 prize pool! There’s a challenge on making sure AI systems generalize well and one on making sure we can update AI systems after they are deployed.

Thank you for following along for this week’s ML & AI Safety Update and we’ll see you next week!

Compiling code to neural networks? W03

Fri, 20 Jan 2023 11:00:00 GMT

Welcome to this week’s ML & AI Safety Report where we dive into overfitting and look at a compiler for Transformer architectures! This week is a bit short because the mechanistic interpretability hackathon is starting today – sign up on ais.pub/mechint and join the Discord.

Watch this week's MLAISU on YouTube or listen to it on Spotify .

Superpositions & Transformers

In a recent Anthropic paper, the authors find that overfitting corresponds to the neurons in a model storing data points instead of features. This mostly happens early in training and when we don’t have a lot of data.

In their experiment, they use a very simple model (a so-called toy model) that is useful when studying isolated phenomena in detail. In some of the visualizations, they train it from 2D data with T training examples. As seen below, the feature activations (blue) look very messy while the activations to the data points (red) look very clean.

Going deeper in the paper, they find that this generalizes to larger dimensions (10,000D) and that the transition from overfitting on smaller datasets’ data points to generalizing to the actual data features seems to be the reason for the famous double descent phenomenon where a model sees a dip in performance but then becomes better afterwards.

And on the topic of toy models, DeepMind releases Tracr, a compiler that can turn any RASP human-readable code into a Transformer architecture. This can be useful for studying how algorithms represent themselves in Transformer space and to study phenomena of learned algorithms in-depth.

Other research news

In other news…

Demis Hassabis, the CEO of DeepMind, is warning the world on the risks of artificial intelligence in a new Time piece. He mentions that the wealth arising from artificial general intelligence (AGI) should be redistributed throughout the population and that we need to make sure it does not fall into the wrong hands.
Another piece reveals that OpenAI contracted Sama to use Kenyan workers with less than $2 / hour wage ($0.5 / hour average in Nairobi) for toxicity annotation for ChatGPT and undisclosed graphical models, with reports of employee trauma from the explicit and graphical annotation work, union breaking, and false hiring promises. A serious issue.
Jesse Hoogland releases an exciting piece exploring why and how neural networks generalize.
Neel Nanda shares more ideas for his 200 ideas in Mechanistic Interpretability.
Hatfield-Dodds from Anthropic shares reasons for hope in AI and claims that a high confidence in doom is unjustified.

Opportunities

For this week’s opportunities, the awesome new website aisafety.training will help us find the best events for you to join across the world:

Join the EAG conferences in San Francisco, Cambridge, Stockholm, and London over the next few months to hear from some of the leading researchers in AI safety.
Join the mechanistic interpretability hackathon for a chance to quickstart your research journey and get feedback from top researchers.
Apply before the 29th to the ML safety introduction course happening in February.

Thank you for joining this week’s MLAISU and we’ll see you next week!

Robustness & Evolution W02

Fri, 13 Jan 2023 11:00:00 GMT

Welcome to this week’s ML Safety Report where we talk about robustness in machine learning and the human-AI dichotomy. Stay until the end to check out several amazing competitions you can participate in today.

Watch this week's MLAISU on YouTube or listen to it on Spotify.

Robust Models

Robustness is a crucial aspect of ensuring the safety of machine learning systems. A robust model is better able to adapt to new datasets and is less likely to be confused by unusual inputs. By ensuring robustness, we can prevent sudden misalignments caused by malfunction.

To test the robustness of models, we use adversarial attacks. These are inputs specially made to confuse the model and can help us create defense methods against these. There are many libraries for adversarial example generation in computer vision but the new attack method TextGrad creates adversarial examples automatically for text as well. It works under the two constraints of 1) text being much more discrete than images and therefore harder to modify without being obvious and 2) still ensuring fluent text, i.e. making the attacks hard to see for a human. You can see many more text attacks in the aptly named TextAttack library.

In the paper “(Certified!!) Adversarial Robustness for Free!” (yes, that is it’s name), they find a new method for making image models more robust against different attacks without training their own model during defense but using off-the-shelf models, something other papers have not achieved. Additionally, they do this and get the highest average certified defense rate against the competition.

Additionally, Li, Li & Xie investigate how to defend against the simple attack of writing a weird sentence in front of the prompt that can significantly confuse models in question-answering (QA) settings. They then extend this to the image-text domain as well and modify an image prompt to confuse during QA.

With these specific cases, is there not a way for us to generally test for examples that might confuse our models? The new OpenOOD (Open Out-Of-Distribution) library implements 33 different methods and represents a strong toolkit to detect malicious or confusing examples. Their paper details more of their approach.

Another way we hope to detect these anomalies is by using interpretability methods to understand what happens inside the network and see when it breaks. Bilodeau et al. criticize traditional interpretability methods such as SHAP and Integrated Gradients by showing that without significantly reducing model complexity, these methods do not outperform random guessing. Much of ML safety works with mechanistic interpretability that attempts to reverse-engineer neural networks, something that seems significantly more promising for anomaly detection.

Humans & AI

In December, Dan Hendrycks, the lead of the Center for AI Safety at the University of California, Berkeley, published an article discussing the potential for artificial intelligence (AI) systems to have natural incentives that work against the interests of humans. He argues that in order to prevent this from happening, we must carefully design AI agents' intrinsic motivations, impose constraints on their actions, and establish institutions that promote cooperation over competition. These efforts will be crucial in ensuring that AI is a positive development for society.

The Center for AI Safety at Berkeley is just one example of academic research in the field of machine learning safety. They also regularly publish a newsletter on ML safety, which is highly recommended for readers interested in the topic. Another notable researcher in this field is David Krueger at the University of Cambridge, who recently gave a comprehensive interview on The Inside View, which is also highly recommended for those interested in the alignment of AI and the role of academia in addressing the challenges of AI safety.

Other research

In other research news, we just finished a small AI trends hackathon with the Epoch AI team in Mexico City and the resources and ideas for the hackathon are still up for grabs so you can create an interesting project in understanding how future AI might look, something Epoch is amazing at. See the research project ideas here and the datasets and resources here.
Soeren Mind, Richard Ngo and Lawrence Chan released a major rewrite to their paper “The Alignment Problem from a Deep Learning Perspective” focusing more on deceptive reward hacking, internal goal seeking and power-seeking.
Joar Skalse released a perspective article on why he thinks large language models are not general intelligence.

Opportunities

And now to the great opportunities in ML safety!

The SafeBench competition is still underway and a lot of interesting ideas have been released. With a prize pool of $500,000, you have a large chance to win an award by submitting ideas.
Two other prizes have also been set up for alignment: 1) The Goal Misgeneralization Prize for ideas on how to prevent bad generalization beyond the training set? 2) The Shutdown Prize is about how we can ensure that systems can be turned off, even when they’re highly capable. These are both from the Alignment Awards team and have prizes for good submissions of $20,000, easily warranting setting off a few days to work on these problems.
The Stanford Existential Risk Conference is looking for volunteers to help out with their conference in late April.
The Century Fellowship from Open Philanthropy is still open for applications and allows you to work on important problems during two fully paid years.
Our Mechanistic Interpretability Hackathon with the Alignment Jams are open for everyone internationally and will simultaneously happen in over 10 locations! Additionally, we have jam site locations across the World in Copenhagen, Stockholm, Oxford, Stanford, London, Paris, Tel Aviv, Berkeley, Edinburgh and Pittsburgh. Check out the website to see an updated list.

Thank you very much for following along for this week’s ML Safety Report and we will see you next week.

Hundreds of research ideas! W01

Fri, 06 Jan 2023 11:00:00 GMT

AI Improving Itself

Over 200 research ideas for mechanistic interpretability, ML improving ML and the dangers of aligned artificial intelligence. Welcome to 2023 and a happy New Year from us at the ML & AI Safety Updates!

Watch this week's MLAISU on YouTube or listen to it on Spotify.

Mechanistic interpretability

The interpretability researcher Neel Nanda has published a massive list of 200 open and concrete problems in mechanistic interpretability. They’re split into the following categories:

Analyzing toy models: Diving into models that are much smaller but trained the same way as large models. These are way easier to analyze than large models and he has made 12 small models available.
Looking for circuits in the wild: Inspired by the paper “interpretability in the Wild”, can we use mechanistic interpretability on real-life language models?
Interpreting algorithmic problems: Algorithms are highly interpretable and learned as a clearly interpretable structure. We can for example observe that grokking happens when an algorithm is generalized within the network.
Exploring polysemanticity and superposition: Superposition is when one feature is spread across multiple neurons in a network and gives problems in our interpretation of what neurons represent. Can we find better ways to understand or mitigate this effect?
Analyzing training dynamics: Understanding how models change over training is very interesting for identifying how and when capabilities emerge.

These are great projects to go for and we’re collaborating with Neel Nanda to run a mechanistic interpretability hackathon the 20th of January! As Lawrence Chan mentions in a new post; we need to touch reality as soon as possible, and these hackathons are a great way to get fast and concrete research results. You can join us but you can also run a local hackathon site!

ML improving ML

Thomas Woodside summarizes a collaborative project to map cases where ML systems are self-improving. There are already 11 different major research projects that have shown machine learning systems used to improve other systems and we assume that there is much more happening behind the scenes since these are only published papers.

Several of the projects use models to create data that another model is fine-tuned on while a few relate to speed-ups in running and developing machine learning systems. These include using ML to better optimize GPUs, optimizing compilers and helping humans spot flaws in a large language model using (LLM) another LLM.

A concrete example of the data generation and fine-tuning a paper from Microsoft and MIT that shows a LLM can be used to generate programming puzzles that a programming LLM is fine-tuned and improves a lot from.

With ML already reaching this level, we have to make sure that there are good introductions to ML safety for academics and engineers to understand the prominent issues with AI development. Vael Gates and Collin Burns try to identify the best intro texts by asking a bunch of ML researchers (28) which of eight texts they prefer. They find that the best resource is Joe Carlsmith’s “More is Different” blog posts.

In these posts, Joe Carlsmith explores two ways of looking at ML safety: Philosophy and engineering. He mentions that the engineering approach preferred by ML academia is underrated from the philosophical side and that the philosophical side (represented by Superintelligence) is significantly undervalued from the engineering perspective.

An important point of these posts is how future AI systems will be qualitatively different from current AI systems and that this results in weird emergent behaviour.

Aligned AGI vs. unaligned AGI

In “The Case Against AI Alignment”, Andrew Sauer describes how the greatest risks of an unaligned artificial general intelligence is that humanity goes extinct while an aligned system can lead to extreme suffering for a minority or for simulated beings. It is based on the inherent outgroup hatred of human psychology.

This comes at a time when the field of alignment is growing rapidly in response to the systems that have been released in the past year. One of the most important tasks of the sub-field of alignment concerned with value alignment is also to figure out whose values to align to, something that few have grappled with until now.

Responses to Sauer’s piece accept the importance of figuring out these questions but reject the hypothesis that we should accept the death of all humans because there “might” be a highly risky outcome. Additionally, human-invoked suffering for others is not a stable state, as compared to extinction, which means it has much less relevance on the larger timescale than one might expect.

Deep learning research and other news

In other news…

Jacques Thibodeau finds limitations in the recent ROME paper that claims to “modify factual associations” by updating weights in the multilayer perceptrons of Transformers. Thibodeau finds that it’s mostly editing word relations and not factual associations between concepts.
The paper “Discovering latent knowledge in language models without supervision” extracts neural network activations to map whether they correspond to a “yes” or “no” answer to questions. When the models are prompted to give the wrong answer, they were still able to classify that it knew the right answer based on its model activations, something other methods are not capable of. Their work was extended by the winners of the AI testing hackathon where they used the method to understand models trained on the ETHICS dataset containing ambiguous ethical situations.
A new paper dives into what vision transformers (computer vision models) learn. An interesting finding is that models trained with language supervision (like CLIP) learn more semantic features such as “morbidity” as opposed to visual features like “roundness”.
Millidge and Winsor summarize an array of basic properties of language model internals such as similar distributions between multiple layers’ weights and biases.
Ringer writes how models do not “get reward” and that the analogy of a dog receiving biscuits is not accurate. We have to remember that the models are changed to correspond more to high-reward outcomes but are otherwise unaware of the reward.
A post explores how current large language models are very close to being artificial general intelligence if we frame their text-based abilities to people like the amazing Helen Keller who was both deaf and blind. E.g. reframing the world, audio and visuals into words will make the models highly capable in these domains as well.
A post questions the focus on expected utility maximization as a big risk with ML and AI systems, describing how 1) humans are not expected utility maximizers (EUM), 2) there are non-EUM systems that can become generally intelligent and 3) we do not know how to train EUM systems. Scott Garrabrant answers that utility theory seems to have been a theoretical mistake which is quite a strong claim.
The team behind Elicit, a scientific tool for exploring existing research, have developed a method to split tasks into subtasks that significantly improves performance on advanced description tasks. Decomposing tasks like this makes the model choices more interpretable and have interesting implications for future research in the same direction.

Opportunities

We have a few interesting opportunities coming up. Thanks goes to AGISF for once more sharing opportunities in ML & AI safety.

There are just two weeks until the mechanistic interpretability hackathon with Neel Nanda is kicking off. You can also join the in-person AI trends hackathon on Tuesday after EAGx LatAm in Mexico City with Jaime Sevilla and the Epoch team.
Apply to join an AI safety retreat happening in Berlin at the end of March.
The learning-theoretic agenda contribution prize is still active! Win up to $50,000 by doing theoretical research before the 1st of October.
You can also apply for internship opportunities at Redwood Research and jobs at the Center for AI Safety.
Also some very fun opportunities with Encultured AI, developing video games for AI safety research.

This has been the ML & AI safety update. See you next week!

Will machines ever rule the world? MLAISU W50

Fri, 16 Dec 2022 11:00:00 GMT

Hopes and fears of the current AI safety paradigm, GPU performance predictions and popular literature on why machines will never rule the world. Welcome to the ML & AI safety Update!

Watch this week's episode on YouTube or listen to the audio version here.

Hopes & Fears of AI Safety

Karnofsky released an article in his Cold Takes blog describing his optimistic take on how current methods might lead to safe AGI:

Utilizing the nascent field of digital neuroscience to understand when AI systems diverge from what we want. Neural networks are special in how much access we have to their brains, as we can both read and write.
Limiting AI systems to avoid dangerous behaviour. This can include limiting it to human imitation; intentionally making them short-sighted, avoiding risks of long-term planning on short-term misalignment; focusing their abilities in a narrow domain; and inciting unambitiousness.
Having checks and balances on AI such as using one model to supervise another and having humans supervise the AI. See this article on supporting human supervision with AI.

At the same time, Christiano writes a reminder that AI alignment is distinct from applied alignment. Updating models to be inoffensive will not lead to safe artificial general intelligence but safer short-term systems such as ChatGPT. Steiner writes a counter-post on the usefulness of working with applied alignment as well.

Relatedly, Shlegeris publishes a piece exploring whether reinforcement learning from human feedback is a good approach to alignment. He addresses questions such as if RLHF is better than alternative methods that achieve the same (yes), has been net positive (yes), and is useful for alignment research (yes).

The alternative perspective is pretty well covered in Steiner’s piece this week on why RLHF / IDA / Debate won’t solve outer alignment. Basically, these methods do not optimize for truth or safety, they optimize for getting the humans to “click the approve button”, something that can lead to many failures down the road.

GPU Performance Predictions

Hobbhahn and Besiroglu of EpochAI, the main AI capabilities prediction organization, have released a comprehensive forecasting report on how GPU performance will develop during the next 30 years.

They use a model composed of the relationship between GPU performance and its features and how features change over time due to making transistors smaller. They expect GPU performance to hit a theoretical peak before 2033 at 1e15 FLOP/s (floating point operations per second).

I also chatted with a few GPU researchers at NeurIPS and their take was that computing power will hit a peak, making AGI near-impossible. The newer GPUs from Google and Tesla are not necessarily better, they just avoid NVIDIA’s 4x markup on the price of GPUs.

This brings hope to how well we can avoid AGI being developed. Ajeya Cotra’s estimate of ~1e29 FLOP/s required for artificial general intelligence based on the computation done by a human during a lifetime seems to be significantly farther away than her estimates indicated based on the Epoch report. Read her estimates in the first part of her wonderful transformative AI forecasting report.

“Why Machines Will Never Rule the World”

In the spirit of predicting how capable AGI will be, Machine Learning Street Talk, the hugely popular machine learning podcast, has interviewed Walid Saba about his review of the book from August, “Why Machines Will Never Rule the World”, by Landgrebe and Smith.

The book’s basic argument is that artificial general intelligence will not be possible for mathematical reasons. The human brain is a complex dynamical system and they argue that systems of this sort cannot be modeled with our modern neural network architectures or within computers at all due to the limited nature of training data as a function of the past.

These arguments are in line with Searle’s 1980 Chinese room argument and Penrose’s argument of non-computability based on Gödel’s incompleteness theorem. Walid Saba’s review is generally positive about the book. I personally disagree with the arguments since we do not need to model the complex system of the brain, we just need to replicate it in a simulator.

Nevertheless, it is an interesting discussion about whether AGI is possible.

Other news

In other news…

Steve Byrnes releases his 2022 update on his research agenda working on brain-like AGI safety.
A new paper shows that latent knowledge might be possible to discover in language models, building upon the Eliciting Latent Knowledge problem set out by the Alignment Research Center.
Finite Factored Sets are re-framing of causality: They take us away from causal graphs and use a structure based on set partitions instead. Finite Factored Sets in Pictures summarizes and explains how that works. The language of finite factored sets seems useful to talk about and re-frame fundamental alignment concepts like embedded agents and decision theory.
The PIBBBS fellowship has released their update on their program of integrating new fields into AI safety to get more perspectives.
We might be able to make models more safe by using inference on the gradient updating process between tasks to predict out-of-distribution behavior.
Haydn Belfield and Christian Ruhl have released concerns about AI in the Bulletin of the Atomic Scientists, receiving a prize from the chief editors for their piece, also detailing the problems of thinking about race dynamics.

Opportunities

There are some exciting Winter opportunities this week! Again, thank you to AGISF for sharing opportunities in the space.

You can now join the AGI safety fundamentals course, starting next year! This might be the most comprehensive course in AI safety and we highly encourage you to apply here!
The Machine Learning Alignment Bootcamp in Berkeley (fully paid) is now open to preliminary applicants. Show your interest here.
You can now sign up to workshops in Berkeley during the Winter (28th of December) that can show you how ML safety as a career might look. Sign up here.
Sign up today (!) for the Global Challenges Project workshops in Oxford at the end of January.
Join the AI Testing Hackathon in a few hours (!) and/or just watch our intro livestream with Haydn Belfield. Check out the livestream.

This has been the ML & AI safety update. We will take a break for two weeks over Christmas but then be back with more wonderful hackathons and ML safety updates. See you then!

ML Safety at NeurIPS & Paradigmatic AI Safety W49

Fri, 09 Dec 2022 11:00:00 GMT

Watch this week’s episode on YouTube or listen to the audio version here.

This week, we see how to break ChatGPT, how to integrate diverse opinions in an AI and look at a bunch of the most interesting papers from the ML safety workshop happening right now!

Today is the 9th of December and welcome to the ML & AI safety update!

ChatGPT jailbreaks

Last week, we reported that ChatGPT has been released along with text-davinci-003. In the first five days, it received over a million users, a product growth not seen in a long time. And if that wasn’t enough, OpenAI also released WhisperV2 that presents a major improvement to voice recognition.

However, all is not safe with ChatGPT! If you have been browsing Twitter, you’ll have seen the hundreds of users who have found ways to circumvent the model’s learned safety features. Some notable examples include extracting the pre-prompt from the model, getting advice for illegal actions by making ChatGPT pretend or joke, making it give information about the web despite its wishes not to and much more. To see more about these, we recommend watching Yannic Kilchers video about the topic.

Rebecca Gorman and Stuart Armstrong found a fun way to make the models more safe albeit also more conservative, by running the prompt through an Eliezer Yudkowsky-simulating language model prompt. You can read more about this in the description.

Responsible AGI Institutions

ChatGPT is released on the back of OpenAI releasing their alignment strategy which we reported on a few weeks ago. Bensinger publishes Yudkowsky and Soares’ call for other organizations developing AGI to release similar alignment plans and commends OpenAI for releasing theirs, though they do not agree with its content.

The lead of the alignment team at OpenAI has also published a follow-up on his blog about why he is optimistic about their strategy. Jan Leike has five main reasons: 1) AI seem favorable for alignment, 2) we just need to align AI strong enough to help us with alignment, 3) evaluation is easier than generation, 4) alignment is becoming iterable and 5) language models seem to become useful for alignment research.

Generating consensus on diverse human values

One of the most important tasks of value alignment is to understand what “values” mean. This can be done from both a theoretical (such as shard theory) and an empirical view. In this new DeepMind paper, they train a language model to take in diverse opinions and create a consensus text.

Their model reaches a 70% acceptance rate by the opinion-holders, 5% better than a human written consensus text. See the example in their tweet for more context. It is generally awesome to see more empirical alignment work coming out of the big labs than earlier.

Automating interpretability

Redwood Research has released what they call “causal scrubbing”. It is a way to automate the relatively inefficient circuits interpretability work on for example the transformer architecture.

To use causal scrubbing, you create a causal model of how you expect different parts of a neural network to contribute to the output based on a specific type of input. By doing this, the causal scrubbing mechanism will automatically ablate the neural network towards falsifying this causal model. A performance recovery metric is calculated that summarizes how much a causal claim about the model seems to retain the performance when “unrelated” parts of the neural network are removed.

The Plan

Wentworth releases his update of “The Plan”, a text he published a year ago about his view on how we might align AI. He describes a few interesting dynamics of the current field of AI safety, his own updates from 2022 and his team’s work.

Notably, multiple theoretical and empirical approaches to alignment seem to be converging on identifying which parts of neural networks model which parts of the world, such as shard theory, mechanistic interpretability and mechanistic anomaly detection.

NeurIPS ML Safety Workshop

Now to one of the longer parts of this newsletter. The ML Safety Workshop at the NeurIPS conference is happening today! Though the workshop has not started yet, the papers are already available! Here, we summarize a few of the most interesting results:

How well humans recognize images correlates with how easy they are to find adversarial attacks for (poster)
Just like ChatGPT, the Stable diffusion safety filter is easy to circumvent, though it might be even easier, consisting only of a filtering of 17 concepts (poster)
Skalse and Abate disprove the hypothesis that all goals and purposes can be thought of as maximizing some expected received scalar signal by providing examples that disprove this such as the instruction that “you should always be able to return to the start state” and term these tasks “modal tasks” as they have not been investigated in the literature (paper)
A team found ways to detect adversarial attacks simply by looking at how the input data propagates through the model compared to the normal condition (poster)
LLMs seem useful for detecting malware in programs and this project investigates how vulnerable these types of models are to adversarial attacks such as from the malware developers (poster)
This new scaling law formula makes a better regression fit than existing and too simple scaling laws (paper)
Since the most capable AI systems will probably be continually learning and have dynamic goals, this project argues that we should focus more alignment research on what the author calls “dynamic alignment research” (poster)
Korkmaz finds that inverse reinforcement learning is less robust than vanilla reinforcement learning and investigates this in-depth (OpenReview)
We covered this paper before but here, they define the sub-types of out-of-distribution that represent a more specific ontology of OOD (poster)
In a similar vein, this work looks at the difference between out-of-model-scope and out-of-distribution. Out-of-distribution is when examples are outside the training data while out-of-model-scope is when the model cannot understand the input, something it can sometimes do despite the example being out-of-distribution (poster)
This project looks at organizations, nation-states and individuals to discern a model for multi-level AI alignment and use a case study of multi-level content policy alignment on a country-, company- and individual level (poster)
And from our very own Fazl Barez, we have a project that looks into how we can integrate safety-critical symbolic constraints into the reward model of reinforcement learning systems (poster)
These authors find a circuit for indirect object identification in a transformer with name mover transformer heads (poster)
Debate is shown to not help humans answer questions better, which puts cold water to debate as an open-ended strategy to alignment, though this goes quite a bit deeper as well (poster)
Feature visualization is quite important for our interpretability work and this paper finds a way where a network can be adversarially modulated to circumvent feature visualization, something that might become relevant if an AGI attempts to deceive its creators (paper)

Opportunities:

This week, we have a few very interesting opportunities available:

Our Christmas edition Alignment Jam about AI Testing is happening next week and you can win up to $1,000! Check it out on the Alignment Jam website: https://ais.pub/alignmentjam
The London-based independent alignment research organization Conjecture is searching for engineers, research scientists, and operations personnel: https://ais.pub/conjecturejobs.
Additionally, they’re constantly open for what they call “unusual talent”, something you might meet the prerequisites for! https://ais.pub/conjecture-unusual-talent
If you’re interested in the Spanish-speaking AI safety and EA community, we highly encourage you to join the EAGx Latin America conference in Mexico in January. If you don’t feel comfortable spending the money for the trip, you can quite easily seek financial support for the conference: https://ais.pub/eagx-latam
The Survival and Flourishing Fund has doubled their speculative grants funding to accommodate the decrease in funding from FTX and you’re welcome to apply: https://ais.pub/sff

This has been the ML & AI safety update. We look forward to seeing you next week!

[MLAISU W48] NeurIPS Safety & ChatGPT

Fri, 02 Dec 2022 11:00:00 GMT

Listen to this week’s update on YouTube or podcast.

This week, we’re looking at the wild abilities of ChatGPT, exciting articles coming out of the NeurIPS conference and AGI regulation at the EU level.

My name is Esben and welcome to week 48 of updates for the field of ML & AI safety. Strap in!

ChatGPT released

Just two days ago, ChatGPT was released and it is being described as GPT-3.5. We see many bug fixes from previous releases and it is an extremely capable system.

We can already now see it find loopholes in crypto contracts, explain and solve bugs, replace Google search and most importantly, show capability to deceive and circumvent human oversight!

Despite being significantly safer than the previous version (text-davinci-002), we see that it still has the ability to plan around human preferences with quite simple attacks.

Monday, they also released text-davinci-003 which is the next generation of fine-tuned language models from OpenAI. There are rumors of GPT-4 being released in February and we’ll see what crazy and scary capabilities they have developed by then.

The demo app is available on chat.openai.com.

NeurIPS

I’m currently at NeurIPS and have had a wonderful chance to navigate between the many posters and papers presented here. They’re all a year old by now and we’ll see the latest articles come out when the workshops start today.

Chalmers was the first keynote speaker and he dangerously created a timeline for creating conscious AI, one that creates both and S-risk and an X-risk. He set the goal of fish-level AGI consciousness by 2032, though all this really seems to be dependent on your definitions for consciousness and I know many of us would expect it before 2032.

Beyond that, here’s a short list of some interesting papers I’ve seen while walking around:

AlphaGo adversarial examples: This paper showcases how easy it is to find attacks even for highly capable reinforcement learning systems such as AlphaGo. It basically finds board positions where inserting the next move (for black and white) ruins the AI’s ability to predict the next move.
InstructGPT paper: Here, OpenAI fine-tunes a language model to human feedback and achieves both a better and safer model with very little compute needed. It was interesting to speak with the authors and get some deeper details such as their data collection process and more.
MatPlotLib is all you need: This paper showcases issues with differential privacy (sharing private data as statistics to avoid privacy issues) with neural networks. Instead of sending the private images, the application sends the gradients (“internal numbers”) of a neural network. Here, they simply use MatPlotLib and plot the gradients (along with a transformation) and easily reconstruct the private input images.
System 3: This is a paper from our very own Fazl Barez where we input environment constraints into the reward model to do better safety-critical exploration. This achieves better performance in high-risk environments using OpenAI Safety Gym.
LAION-5B: This open source project has collected 5.85 billion text-image pairs and explicitly created an NSFW and SFW split of the dataset, though they have trained the models on the full dataset (chaotic).
Automated copy+paste attacks: This is an interesting paper building on their previous work where they show that you can take a small image on top of a test image (a “patch”) and use it to understand how classes of items in images relate to each other. This work automates that process and they’re working on implementing it for language models, a task that, and I quote, “should be relatively straightforward”.
GriddlyJS: A JS framework for creating RL environments easily. We might even use this for the “Testing AI” hackathon coming up in a couple of weeks! Try it here.

And these are of course just a few of the interesting papers from NeurIPS. You can check out the full publication list, the accepted papers for the ML safety workshop and the scaling laws workshop happening today.

EU AI Act & AGI

In other great news, the EU AI Act received an amendment about general purpose AI systems (such as AGI) that details their ethical use. It even seems to apply to open source systems, though it is unclear whether it applies to models released outside of organizational control, e.g. in open source collectives.

An interesting clause is §4b.5 that requires cooperation between organizations who wish to put general purpose AI into high-risk decision-making scenarios.

Providers of general purpose AI systems shall cooperate with and provide the necessary information to other providers intending to put into service or place such systems on the Union market as high-risk AI systems or as components of high-risk AI systems, with a view to enabling the latter to comply with their obligations under this Regulation. Such cooperation between providers shall preserve, as appropriate, intellectual property rights, and confidential business information or trade secrets.

In this text, we also see that it is any system put to use on “the Union market” which means that the systems may originate from GODAM (Google, OpenAI, DeepMind, Anthropic and Meta) but still be under regulation in the same way that GDPR applies for any European citizen’s data.

In general, the EU AI Act seems very interesting and highly positive for AGI safety compared to what many would expect and we have to thank many individuals from the field of AI safety for this development. See also an article by Gutierrez, Aguirre and Uuk on the EU AI Act’s definition of general purpose AI systems (GPAIS).

Mechanistic anomaly detection

Paul Christiano has released an update on the ELK problem, detailing the Alignment Research Center’s current approach.

The ELK problem was defined December 2021 and is focused on having a model explain its knowledge despite incentive to the opposite. Their example is of an AI guarding a vault containing a diamond and the human evaluating whether it is successful based on a camera looking at the diamond.

However, a thief might tamper with the video feed to show exactly the right image and fool the human, leading to a reward for the AI despite the AI (using other sensors) knowing that the diamond is gone. Then the problem becomes how to know what the AI knows.

In this article, Christiano describes their approach to infer what the model’s internal behavior is when the diamond is in the vault (the normal situation) and detecting anomalies in this normal internal behavior. This is both related to mechanistic interpretability and the field of Trojan detection where we attempt to detect anomalies in the models.

Opportunities

And now to our wonderful weekly opportunities.

Apply to the 3.5 month virtual AI Safety Camp starting in March where you can lead your very own research team. Send in your research ideas and they’ll collaborate with you to make a plan with a research team.
In two weeks, the AI testing hackathon is going down. Here, we collaborate to find novel ways to test AI safety by interacting with state-of-the-art language models and play within reinforcement learning environments.
A group of designers are seeking play-testers for a table-top game where you simulate AI risk scenarios! It seems pretty fun so check it out here.
The Center for AI Safety is running an Intro to ML safety over 8 weeks in the Spring that you can apply to be a participant or a facilitator in now.

Thank you for following along for another week and remember to make AGI safe. See you next week!

Will Humans Be Taken Over by 3-Dimensional Chess Playing AIs? W47

Fri, 25 Nov 2022 11:00:00 GMT

Listen to this week's update on podcast. We have decided to pause the video-version of the updates momentarily.

5 years ago, the Google AlphaGo beated reigning world number 1 in Go, Ke Jie, but if you think the board game playing AI's have stopped evolving since, think twice! Today we will look into the new language model, Cicero's, deceptive abilities along with considerations on what board-game playing AI's teach us about AI-development.

Today is the 25th of November and this is the weekly ML & AI safety update from Apart Research!

The power seeking language model Cicero

Ever felt like you are the absolute best board game strategist in your family? Well, we got some bad news for you: This week a research group from Meta Fundamental AI Research Diplomacy Team (FAIR) showcased their language model, Cicero, trained for the strategic board game, Diplomacy.

Diplomacy is probably one of the most heavy strategic board games available and what makes it genuine is its emphasis on one-on-one private dialogue between all players before all play their turn simultaneously. Players act as empires in Europe and their goal is to control strategic supply centers by moving units into them. However, to efficiently play the game, players need to interact and cooperate, while simultaneously mistrusting each other - and this is what makes Cicero both groundbreaking and scary.
Across 40 games of an anonymous online Diplomacy league, Cicero scored double the average score of human players and ranked in top 10% of participants who have played more than one game.

So stay aware when your brother uses his phone on the next board game night - you might be playing against a deceiving AI disguised as a Roman philosopher and not be in for the treat.

3-dimensional chess playing algorithms is not necessarily power seeking

However, even though Cicero seems to be showcasing the forefront of what started as chess-playing algorithms outperforming Kasparov, two professors from Harvard's Theory of Computation and Machine Learning Foundations groups, do not believe that a 'board-game-Big-Brother' like Cicero might be representative of AI's taking over the world.

According to them, the continuous breakthroughs of AI is not necessarily driving us towards a unitary nigh-omnipotent AI system that acts autonomously to pursue long-term goals. While AI's might be extremely well suited for solving problems, when given an outcome to optimise, it might not be that well suited for defining its strategy itself - or at least not much better than human agents supported by short-term AI tools. This is because AI's superior information processing skills do not extrapolate that well to long-term goals in real world environments with a lot of uncertainty and thus will not be far from human's ability to strategise in such a chaotic environment.

According to this worldview, AI-systems with long-term goals that need to be aligned might not be the main focus of AI Safety, rather we should emphasise more on building just-as-powerful AI systems that can be restricted to short time horizons.

Formalising the presumption of independence

In an paper by Paul Christiano, Eric Neyman and Mark Xu, new light is shed upon how we can use heuristic arguments to supplement AI safety work.
The paper itself is mainly concerned with how heuristic arguments act as mathematically supplements to formal deductive proofs, but because they simplify and presume independence, these arguments work better with novel data inputs than old-school mathematical formal proofs.

In their final appendix, the three researchers extrapolate these findings to the context of alignment research, claiming that heuristic arguments might propose important supplements to interpretability and formal verification work in AI safety. They focus especially on avoiding catastrophic failures and eliciting latent knowledge.

What is important to notice here is the use of 'presumption' (or what is already given by 'heuristics'). By simplifying the math, one might be able to generalise broader and make models applicable for wider ranges, but heuristic arguments can also be overthrown by showing the ignored correlation between parameters; reasoning based on this heuristic is commonplace, intuitively compelling, and often quite successful -- but completely informal and non-rigorous.

Monosematicity in toy models

Also this week, an interpretability-paper was published by Adam Jermyn, Evan Hubinger and Nicholas Schiefer, on the monosematicity of individual neurons in neural networks.

It is known that some neurons in neural networks represent 'natural' features in the input and that these monosemantic units are far easier to interpret than their counterpart: polysemantic neurons. So far so good.
Yet, this paper explores how different restrictions of numbering of units per layer or other architectonic twists can change the amount of monosemantic units without increasing the loss of the model. This can be done by e.g. changing the local minima the training function finds.

Also, the paper finds that

Feature-sparse inputs can make models more monosemantic
More monosemantic loss minima have moderate negative bias and that this can be used to increase monosemanticity, and finally,
That more neurons per layer make models more monosemantic, but that this of course comes with an increased computational cost

Other news

In minor news, Leo Gao clears out the wire-heading term, which he finds to be causing confusion, because of its broad applications.
Also, LessWrong continous to overflow with analysises and considerations on the FTX-situation. In an almost hours read, the user Zvi, lays out the case and its afterplay very thorough. If you are interested in how the crash have thrown some things up in the air, we definitely recommend given this one a read
The user Nick Gabs, has also posted his apprehension of MIRI's "How Likely Is Deceptive Alignment" by Evan Hubinger. Basically, he explains how deceptive alignment is a very likely outcome from training a sufficiently intelligent AI using gradient descent. The deceptive outcome is both more simple and require less computational power than genuine alignment. So no positive views from MIRI yet again.
Finally, we just want to mention our colleagues in Conjecture, who this week published a report on their last 8 months of work. In a field like AI safety, that sometimes (some would say always) is a bit messy, it is always nice with a meta-look on strategic considerations and timelines.

Opportunities:

Remember that you also can take part in AI Safety research in a lot of ways. This week we would like to point to a sample of the available opportunities:

Conjecture looks to be rapidly upscaling and are hiring for both technical and non-technical positions. As they write in the post: "Our culture has a unique flavor. On our website we say some spicy things about hacker/pirate scrappiness, academic empiricism, and wild ambition. But there’s also a lot of memes, rock climbing, late-night karaoke, and insane philosophizing."
https://ais.pub/conj2
If you are not in for a job at Conjecture, you can also take a look at the program: AI safety Mentors and Mentess, that aims to match mentors and mentees to upscale their AI safety work. The program is designed to be "very flexible and lightweight and expected to be done next to a current occupation. https://ais.pub/mentor
We also want to drop a note on the pre-announcement of Open Philantrophy's AI Worldviews Contest that is meant to take place in the early 2023. More info can be found on the EA-forum even though the information is still quite sparse.
Finally, Apart received a mail that pointed our attention to the newly launched AI Alignment Awards. The Awards aim to offer up to $100,000 to anyone who can make progress on two open problems in the field of AI Alignment research. Give their website a visit if you feel like this is something for you! https://www.alignmentawards.com/

How Should AIS Relate to Its Funders, W46

Fri, 18 Nov 2022 11:00:00 GMT

Watch and listen to this week's update on YouTube or podcast.

Considerations on the funding situation for AI Safety, exciting projects from Apart's interpretability hackathon, Meta AI-math transformer interpretability and considerations on what to spend time on in AI Safety.

Today is the 18th of November and welcome to the ML & AI safety update!

Thoughts on FTX and AI safety

Last week, we reported, like all others, of the FTX crash and now being in the aftermath of the shock, it seems appropriate to dive a little into what it means for the AI safety community.

The New York Times published an article about the general impact on EA funding and accurately says that it is a just cause for turbulence in such a young movement and has commentary from the Center on Nonprofits and Philanthropy that it is too easy for billionaires to gain legitimation “as long as the money is flowing”, a risk that happened in this case.

The research community is generally appalled at what FTX has done. The main FTX fund for AI safety research, Future Fund, saw its whole team resign over the deception they were exposed to. Will McAskill and Evan Hubinger in clear terms state that this fraud is completely unacceptable with what effective altruism stands for. Meanwhile, Eliezer Yudkowsky and a lawyer makes sure that the community knows that it’s not to blame for this situation and the legal status of FTX’s donations.

When it comes to funding for AI safety research, one of the two biggest funders has now stopped and the other funder OpenPhil is taking a month’s break to evaluate this turbulence. Nonlinear has set up an emergency fund for smaller grants below $10,000 dollars to compensate pressed organizations in this funding stop.

Holden Karnofsky from OpenPhil recommends organizations to:

Put commitments on hold and wait until there is more clarity of the actual impact
Identify gaps, assess by urgency/importance
Reprioritize and balance portfolios

Interpretability Alignment Jam

The second Alignment Jam about interpretability research finished this weekend with a total 147 participants and 25 submissions of valuable interpretability research.

The first prize was awarded to Alex Foote for his research and algorithm that finds minimally activating examples for neurons in language models using word replacement and sentence pruning. This automatically creates positive and negative examples for what specific neurons activate to and is a highly interpretable method.

The second prize was awarded to three researchers from Stanford who found that when Transformer heads are deactivated in different ways, other Transformer heads take over their task even though they did not show activation normally. This has been shown before but the team found that even the backup heads have backup heads and that all these backup heads are robust to the method of deactivation (or ablation) used on the main heads.

The third prize was awarded to Team Nero for finding flaws in the way the ROME and MEMIT papers replace factual associations. They show that factual association replacements also affect any sentence related to the words in the factual association, indicating that it is not constrained to factual associations.

The fourth place team introduced a way to interpret reinforcement learning agents’ strategies on mathematically solved games. They use the match four game and find that the way the agent sees the board corresponds to how humans generally model the board.

The hackathon sparked a lot of interesting research, which we definitely recommend you check out.

Also, remember to stay tuned for our coming hackathon in December!

Meta AI math Transformer-interpretability

Francois Charton from Meta AI has investigated the failure cases and out-of-distribution behavior on transformers trained on inverse matrices and decomposition of eigenvalues.

Despite research that mathematical language models fail to understand math, he finds that they have a correct understanding of the mathematical problems but that it’s the nature of these problems that affect how correct it is. He shows that the training data generators do not simulate the correct results to learn from, leading to generalization failures for the math models.

It remains like it has always been: The computers only do what we ask them to; the main failure is our expectations and aims.

Thoughts on buying time

Akash, Olivia Jimenez and Thomas Larsen has posted a long list of interventions that could 'buy us time'. In their opinion, they believe the AIS-community should invest more in buying time than technical research because the median researcher's time is far more well spent with consideration for the general risk than really technical alignment.

Their new intervention proposal lists among others demonstrating alignment failure, 1-1 conversations with ML researchers and defining concepts in AI safety better. We have heard these claims before and they also seem to get a bit of pushback from Jan Kulveit and habryka.

Other news

Martin Soto criticizes Vanessa Kosoy's PreDCA-protocol of interpretability for involving betting everything on a specific mathematical formalization of some instructions, which might be problematic
Pablo Vallalobos and others have estimated when training data will be exhausted based on current trends. They predict that we will have exhausted the stock of low quality language data by 2030 to 2050, high-quality language data before 2026, and vision data by 2030 to 2060
Instrumental convergence is proposed to be the argument for why general intelligence is possible
Jessica Mary proposes that model-agnostic interpretability might not be that bad after all though the commenters indicate the opposite.

Opportunities:

This week, we have a few very interesting openings available:

AI impacts is still looking for a senior Research Analyst
Anthropic is still looking for a senior software engineer
Center of AI Safety is looking for a chief of staff
David Krueger’s lab is looking for collaborators

This has been the ML & AI safety update. We look forward to seeing you next week!

Are funding options for AI Safety threatened? W45

Fri, 11 Nov 2022 11:00:00 GMT

Watch and listen to this week's update on YouTube or podcast.

The crypto giant FTX crashes, introducing massive uncertainty in the funding space for AI safety, humans cooperate better with lying AI, and interpretability is promising but also not.

This and other news from the AI Safety world will be addressed today.

It is the 11th of November and welcome to the ML & AI safety update!

FTX drops

Since this is a major story, let's dive into what actually happened with the FTX Foundation.

When Sam Bankman-Fried, the CEO of FTX, announced The Future Fund in late February 2022 with the aim to improve humanity's long-term prospects, it seemed like yet another great initiative in support of the AI Safety community and its ability to operate outside the incentive system of for-profits.

Three days ago, Sam Bankman-Fried tweeted about their liquidity issues as a crypto exchange, marking the start of a series of revelations about FTX, how they have mishandled users’ money, moved funds to their own accounts, and violated their own terms of service. The Department of Justice has initiated an investigation into FTX and their crypto hedge fund, Alameda Research.

Additionally, the recent crash of the Meta stock has seen the second big funder of AI safety research, Open Philanthropy, lose a lot of its money so the future of AI safety looks interesting, to say the least.

Human-AI cooperation

We follow up this serious news with research from a team at Stanford. They show that human-AI cooperation is better when the AI is calibrated on the relationship with the human instead of accuracy.

The authors use AI to give decision-making advice to the participants and find that AI modulated to fit the human-AI interaction gives better performance overall compared to a maximally accurate AI system for the human-AI collaborative system.

This introduces interesting considerations for how AI actually interacts with humans in relation to several ways we might safeguard future AI.

U-shaped inverse scaling

And just as we thought we found some sort of linearity in inverse scaling laws, Google shows that they can become U-shaped. The only thing you need to do is just to scale your models up to extreme sizes. If this is true, it may disprove inverse scaling laws and Google even goes to the degree of stating: "This suggests that the term inverse scaling task is under-specified - a given task may be inverse scaling for one prompt but positive or U-shaped scaling for a different prompt".

However, not all are satisfied with their methods. Ethan Perez' calls the team out for deviating their inverse scaling law tests from the ones they describe as replicating in the paper.

Interpretability in the wild

A wonderful piece of contemporary interpretability work in the wild has been conducted by Redwood Research: Using GPT-2 Small, they investigate “indirect object identification” end-to-end in terms of the internal parts of the circuit in a Transformer, even evaluating the reliability of the model.

What is so ingenious about interpretability work is not only that it really takes the task of interpretability research seriously, but that it also shows how much valuable information proper interpretability research can find.

The team manages to identify 26 attention heads grouped in 7 categories, that comprise the indirect object classification-circuit. Along the way, the team also identified interesting structures from the model internals, for example that the attention heads communicated by using pointers to share a piece of information, instead of copying it.

We really recommend that you check out this interpretability research paper!

Other news

In other news, Eric Drexler and Yudkowsky discuss superintelligence on the alignment forum: Because how many superintelligent AIs are actually the best case scenario when they start interacting with each other?

Also, the Janus team from Conjecture have found the outputs of OpenAI’s human fine-tuned models to have very confident outputs in quite specific situations, having clear preferences for specific numbers, answers, and the like.

MadHatter doubts some of the mesa-optimiser thought scenarios proposed by the researchers in the field and calls to consider far more empirical research on mesa-optimisers.

David Krueger doubts the true value of interpretability and reverse engineering, suggesting that we should get our engineering right instead of 'reversing' that engineering with interpretability.

Nate Soares doubts cognitive interpretability approaches, because we're not building minds but rather training minds, and we have very little grasp of their internal thinking. He doubts our ability to predict if an AGI system will have positive outcomes for humanity

And finally, Apart Research has released a website for interpretability research. We definitely recommend you go check them out and also consider if you should participate in the coming interpretability hackathon this very weekend. Check the links below for more info.

Opportunities

This week, we have a few very interesting opening available:

CHAI is offering an AI Research Internship under one of their mentors
Today is the day the interpretability hackathon starts, open to all
AI impacts is looking for a senior Research Analyst

This has been the ML & AI safety update. We look forward to seeing you next week.

Can we predict the abilities of future AI? W44

Fri, 04 Nov 2022 11:00:00 GMT

Watch this week's update on YouTube or listen to it on Spotify.

This week, we look at broken scaling laws, surgical fine-tuning, interpretability in the wild, and threat models of AI.

Today is November 4th and this is the ML & AI safety update!

Broken scaling laws & surgical fine-tuning

A range of interesting papers have been making the rounds the previous weeks and we selected a few of the most interesting ones.

Scaling laws are important to infer how future AI systems will behave. Existing scaling laws are often fitted linearly or monotonically. Caballero, Krueger and others introduce “broken scaling laws” after critiquing how the normal scaling laws research that do not reflect empirical facts of model training. Their new scaling laws function can show “breaks” which correspond to the sudden non-monotonic shifts in ability we see from neural networks. Their function extrapolates significantly better than the other three function forms.

Robustness of computer vision is important for a range of tasks. A team from Stanford show that fine-tuning single layers works better than fine-tuning the whole neural network in specific adversarial benchmarks. For example surgically fine-tuning early layers gives better performance for input-level shifts such as corruption attacks while late layer fine-tuning induces robustness for output-level shifts.

Debate & interpretability

Parrish, Bowman and others show that debate does not help humans answer hard reading comprehension questions. They show the participants arguments for and against an incorrect and a correct answer to a hard reading comprehension question but find that humans do not benefit from this.

“When Drake and Yoojin went to the store, Yoojin gave a drink to…” A transformer can easily predict that the next word in this sentence is Drake but how does it do it? Redwood Research identifies a circuit of conceptual understanding in the Transformer heads.

We see that the neural heads have specific functions in understanding: Some identify duplicate words, some inhibit specific words, and the three late-stage classes of heads negatively and positively move the word “Drake” into the predicted position. This task is called indirect object identification and is clearly an interesting test case for circuits interpretability.

Threat models in ML safety

The DeepMind safety team created a taxonomy of how the current risks look from artificial intelligence. Their consensus development model is scaled up version of our current models which they don’t think need much innovation to become artificial general intelligence, an AI that is better than humans at most relevant tasks.

The risks that arise from such a model are goal misgeneralization where the models fail to generalize their training to real world scenarios and power-seeking as a result of such misalignment. We don’t expect to catch this due to deception and the most important people in society won’t understand the risks. John Wentworth notes that this multi-stage story is not even necessary since current systems already train to deceive humans.

Michael Cohen shows that existential catastrophe from AI is above 35%. He takes an optimistic perspective on success scenarios such as well-enforced laws that stop dangerous versions of AI, an entity stops it in some way, no one develops advanced AI, or advanced AI is developed in a safe way that violates a series of assumptions Cohen makes (which he doubts). These assumptions focus on the ability of the AI to make hypotheses, follow plans in uncertainty, and use these plans in a way that progresses some proxy reward.

Additionally, he does not put confidence in current AI safety research paradigms and even writes up an “anti review”, where he argues against each contemporary research agenda.

In other news

In other news, Scott Garrabrant discusses so-called “frames” which he describes as creating an agentic first-person perspective on all (third-person) possible worlds, such as uncertainty, choices, and plausible worlds. He claims this is in contrast to the embedded agents view and traditional RL with its environment / agent boundary separation.
Michaeud, Liu, and Tegmark show scaling laws of different function approximators and provide a taxonomy for precision machine learning.
Michael Nielsen and Kanjun Qiu release their book “Vision for Metascience” and describe the funders of research as a detector and discriminator in an imaginative research generation process.
The Future of Life Institute has started a new podcast and the latest episode with Ajeya Cotra covers how AI might cause catastrophe.

Opportunities

This week, we have a few very interesting opening available:

Redwood Research is inviting 30-50 researchers to join them in Berkeley for a very interesting mechanistic interpretability research programme.
Anthropic is looking for operations managers, recruiters, researchers, engineers, and product managers.

Additionally, you can check out some of the new features on AI Safety Ideas and join the interpretability hackathon from anywhere in the world next weekend.

This has been the ML & AI safety update, see you next week!