Listen to this week's update as a podcast. We have decided to pause the video version of the updates for the time being.
Five years ago, DeepMind's AlphaGo beat the reigning world number one in Go, Ke Jie, but if you think board-game-playing AIs have stopped evolving since, think again! Today we look into the deceptive abilities of the new language model Cicero, along with considerations on what board-game-playing AIs teach us about AI development.
Today is the 25th of November and this is the weekly ML & AI safety update from Apart Research!
The power-seeking language model Cicero
Ever felt like you are the absolute best board game strategist in your family? Well, we have some bad news for you: this week a research group from the Meta Fundamental AI Research (FAIR) Diplomacy Team showcased their language model, Cicero, trained to play the strategy board game Diplomacy.
Diplomacy is probably one of the heaviest strategy board games available, and what makes it distinctive is its emphasis on one-on-one private dialogue between all players before everyone plays their turn simultaneously. Players act as European powers whose goal is to control strategic supply centers by moving units into them. To play the game effectively, however, players need to interact and cooperate while never fully trusting each other - and this is what makes Cicero both groundbreaking and scary.
Across 40 games in an anonymous online Diplomacy league, Cicero scored double the average score of the human players and ranked in the top 10% of participants who played more than one game.
So stay alert when your brother pulls out his phone on the next board game night: you might be playing against a deceptive AI disguised as a Roman philosopher, and you might not be in for a treat.
3-dimensional chess-playing algorithms are not necessarily power-seeking
However, even though Cicero seems to showcase the forefront of what started as chess-playing algorithms outperforming Kasparov, two professors from Harvard's Theory of Computation and Machine Learning Foundations groups do not believe that a 'board-game Big Brother' like Cicero is representative of AIs taking over the world.
According to them, the continuous breakthroughs in AI are not necessarily driving us towards a unitary, nigh-omnipotent AI system that acts autonomously to pursue long-term goals. While AIs may be extremely well suited to solving problems when given an outcome to optimise, they may not be nearly as well suited to defining their own strategy, or at least not much better than human agents supported by short-term AI tools. This is because AI's superior information-processing skills do not extrapolate well to long-term goals in real-world environments with a lot of uncertainty, and so AIs will not be far ahead of humans' ability to strategise in such a chaotic environment.
According to this worldview, AI systems with long-term goals that need to be aligned might not be the main focus of AI safety; rather, we should put more emphasis on building just-as-powerful AI systems that can be restricted to short time horizons.
Formalising the presumption of independence
In a paper by Paul Christiano, Eric Neyman and Mark Xu, new light is shed on how we can use heuristic arguments to supplement AI safety work.
The paper itself is mainly concerned with how heuristic arguments can act as mathematical supplements to formal deductive proofs: because they simplify and presume independence, these arguments cope better with novel inputs than old-school formal mathematical proofs do.
In their final appendix, the three researchers extrapolate these findings to the context of alignment research, claiming that heuristic arguments might provide important supplements to interpretability and formal verification work in AI safety. They focus especially on avoiding catastrophic failures and eliciting latent knowledge.
What is important to notice here is the word 'presumption' (or what is already implied by 'heuristic'). By simplifying the maths, one may be able to generalise more broadly and make models applicable to wider ranges, but a heuristic argument can also be overturned by exhibiting a correlation between parameters that it ignored. Reasoning based on such heuristics is commonplace, intuitively compelling, and often quite successful, yet completely informal and non-rigorous.
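To make 'presuming independence' concrete, here is a minimal worked example of our own (not taken from the paper) of a heuristic estimate and how an ignored correlation defeats it:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% A heuristic estimate via the presumption of independence (illustration only).
Let $x$ be a uniformly random integer in $\{1,\dots,10^6\}$, with events
$A$: ``$3 \mid x$'' and $B$: ``$5 \mid x$''. Presuming independence,
\[
  \Pr[A \wedge B] \;\approx\; \Pr[A]\cdot\Pr[B]
  \;=\; \tfrac{1}{3}\cdot\tfrac{1}{5} \;=\; \tfrac{1}{15},
\]
which happens to be essentially exact. But replace $B$ with
$B'$: ``$6 \mid x$'', and the events are correlated (every multiple of $6$
is a multiple of $3$), so the naive estimate
$\tfrac{1}{3}\cdot\tfrac{1}{6} = \tfrac{1}{18}$ is defeated:
the true probability is simply $\Pr[B'] = \tfrac{1}{6}$.
\end{document}
```

Roughly speaking, the hope is to make this style of estimate-and-defeater reasoning systematic rather than ad hoc.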
Monosemanticity in toy models
Also this week, an interpretability paper was published by Adam Jermyn, Evan Hubinger and Nicholas Schiefer on the monosemanticity of individual neurons in neural networks.
It is known that some neurons in neural networks represent 'natural' features in the input, and that these monosemantic units are far easier to interpret than their counterparts, polysemantic neurons. So far, so good.
Yet this paper explores how restricting the number of units per layer, or other architectural tweaks, can change the number of monosemantic units without increasing the model's loss, for instance by changing which local minimum the training process finds.
Also, the paper finds that (see the small toy-model sketch after this list):
Feature-sparse inputs can make models more monosemantic;
More monosemantic loss minima have a moderate negative bias, and this can be exploited to increase monosemanticity; and finally,
More neurons per layer make models more monosemantic, although this of course comes with an increased computational cost.
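To give a feel for the kind of setup this line of research studies, here is a minimal PyTorch sketch of a toy model trained on sparse features, with a crude per-neuron monosemanticity score. The sizes, training details and the score itself are our own assumptions for illustration, not the paper's exact definitions.

```python
# Minimal toy-model sketch (our own illustration; hyperparameters and the
# monosemanticity score are assumptions, not the paper's exact setup).
import torch
import torch.nn as nn

torch.manual_seed(0)

N_FEATURES, N_NEURONS, DIM = 32, 64, 16   # assumed sizes
SPARSITY = 0.05                           # probability a feature is active

def sample_batch(batch_size):
    """Sparse feature vectors: most features are zero on any given input."""
    active = (torch.rand(batch_size, N_FEATURES) < SPARSITY).float()
    return active * torch.rand(batch_size, N_FEATURES)

class ToyModel(nn.Module):
    """Features -> low-dim embedding -> ReLU hidden layer -> reconstruction."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(N_FEATURES, DIM, bias=False)
        self.hidden = nn.Linear(DIM, N_NEURONS)      # bias here can go negative
        self.readout = nn.Linear(N_NEURONS, N_FEATURES, bias=False)

    def forward(self, x):
        return self.readout(torch.relu(self.hidden(self.embed(x))))

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    x = sample_batch(256)
    loss = ((model(x) - x) ** 2).mean()   # reconstruct the sparse features
    opt.zero_grad()
    loss.backward()
    opt.step()

# Crude monosemanticity probe: activate one feature at a time and ask how
# concentrated each neuron's response is on its single favourite feature.
with torch.no_grad():
    probes = torch.eye(N_FEATURES)                        # one-hot features
    acts = torch.relu(model.hidden(model.embed(probes)))  # (feature, neuron)
    share = acts.max(dim=0).values / (acts.sum(dim=0) + 1e-8)
    print(f"final loss: {loss.item():.4f}")
    print(f"mean top-feature share per neuron: {share.mean().item():.2f}")
    print(f"mean hidden bias: {model.hidden.bias.mean().item():.3f}")
```

On a toy run like this you can vary SPARSITY or N_NEURONS and watch the top-feature share move, which is the spirit of the experiments summarised above.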
Other news
In minor news, Leo Gao clears up the term 'wireheading', which he finds to be causing confusion because of its broad range of applications.
Also, LessWrong continues to overflow with analyses and considerations on the FTX situation. In an almost hour-long read, the user Zvi lays out the case and its aftermath very thoroughly. If you are interested in how the crash has thrown some things up in the air, we definitely recommend giving this one a read.
The user Nick Gabs has also posted his take on MIRI's "How Likely Is Deceptive Alignment?" by Evan Hubinger. Basically, he explains how deceptive alignment is a very likely outcome of training a sufficiently intelligent AI with gradient descent: the deceptive outcome is both simpler and requires less computational power than genuine alignment. So no positive views from MIRI yet again.
Finally, we just want to mention our colleagues at Conjecture, who this week published a report on their last 8 months of work. In a field like AI safety, which is sometimes (some would say always) a bit messy, it is always nice to get a meta-level look at strategic considerations and timelines.
Opportunities:
Remember that you can also take part in AI safety research in many ways. This week we would like to point to a sample of the available opportunities:
Conjecture looks to be scaling up rapidly and is hiring for both technical and non-technical positions. As they write in the post: "Our culture has a unique flavor. On our website we say some spicy things about hacker/pirate scrappiness, academic empiricism, and wild ambition. But there’s also a lot of memes, rock climbing, late-night karaoke, and insane philosophizing."
https://ais.pub/conj2
If you are not in for a job at Conjecture, you can also take a look at the program AI Safety Mentors and Mentees, which aims to match mentors and mentees to scale up their AI safety work. The program is designed to be "very flexible and lightweight and expected to be done next to a current occupation". https://ais.pub/mentor
We also want to drop a note about the pre-announcement of Open Philanthropy's AI Worldviews Contest, which is meant to take place in early 2023. More info can be found on the EA Forum, even though the information is still quite sparse.
Finally, Apart received an email pointing our attention to the newly launched AI Alignment Awards. The Awards aim to offer up to $100,000 to anyone who can make progress on two open problems in the field of AI alignment research. Give their website a visit if you feel like this is something for you! https://www.alignmentawards.com/