AGI Progress & Theoretical Views W40
Watch and listen to this week's update on YouTube and podcast.
Today we’ll cover some scary updates in AI development, summarize Stuart Russell and Eliezer Yudkowsky’s dispute on alignment, and look at interpretability tools from Redwood.
— Scary progress in AI
The legendary programmer John Carmack has exited virtual reality development to create AGI, and I quote, “by way of mad science”, with disregard for safety. This is very concerning, and his new venture has already raised $20 million. Carmack is widely respected, and him taking this position feels like a disheartening blow to AI safety.
Meta showcased a video generation model a week ago that wowed everyone, but new, unpublished research demonstrates the ability to also stitch different scenes together into much more interesting, narrative videos. OpenAI has also open-sourced Whisper, an extremely good speech-to-text model.
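If you want to try Whisper yourself, here is a minimal sketch based on the openai-whisper package’s documented usage; the audio file name is a placeholder.

```python
# Minimal Whisper transcription sketch. Assumes `pip install openai-whisper`
# and ffmpeg installed on the system; "audio.mp3" is a placeholder file.
import whisper

model = whisper.load_model("base")      # smaller checkpoints trade accuracy for speed
result = model.transcribe("audio.mp3")  # runs speech-to-text, auto-detecting the language
print(result["text"])
```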
Meanwhile, DeepMind has released a model that discovers new algorithms to speed up matrix multiplication, an operation used everywhere in machine learning and many other computing fields. This Quanta Magazine article summarizes the state-of-the-art algorithms for matrix multiplication; DeepMind’s model has not beaten the best known theoretical bounds, but it can tailor algorithms to specific hardware such as GPUs and TPUs.
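To make concrete what “discovering a faster matrix multiplication algorithm” means, here is a toy sketch of Strassen’s classic trick (our illustration, not DeepMind’s output): multiplying two 2x2 blocks with 7 scalar multiplications instead of the naive 8, a saving that compounds when applied recursively to large matrices. DeepMind’s model searches for decompositions in this same spirit.

```python
# Strassen's 2x2 trick: 7 multiplications instead of 8.
def strassen_2x2(A, B):
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4,           m1 - m2 + m3 + m6]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```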
So clearly, progress is extremely fast, even disregarding the large array of open-sourced models released recently.
— Meta transfers PyTorch to the Linux Foundation
The ownership of PyTorch, one of the most popular machine learning frameworks, has been transferred to the Linux Foundation, which manages 850 open source projects. The foundation generally takes a stance of neutrality, and it is a non-profit, unlike Meta, which previously owned the project. Meta’s chief AI scientist, Yann LeCun, has also recently proposed a path towards AGI, which is itself a concerning development.
So while we all wait for AGI, go play this paperclip clicker game that shows how optimizing your paperclip factory might become a risk to humanity.
— Risk of power-seeking AI
Eli shares his critique of Joe Carlsmith’s report on why power-seeking AI is a risk. He argues that the report, now a canonical text for understanding AI risk, gives overly optimistic probability estimates because it frames the problem as avoiding existential risk rather than as removing that risk and ensuring a good future for humanity. Additionally, it might underestimate the number of actors in the AI space at that point.
And just to summarize Carlsmith’s report, it builds its case on a main argument that goes roughly: 1) it will become possible to build dangerous AI systems, 2) people will have incentives to build them, 3) it will be hard to build systems we can ensure are safe, 4) unsafe systems will fail in high-impact ways, 5) this can lead to the permanent disempowerment of humanity, and 6) that would be an existential catastrophe.
Meanwhile, Vendrov describes three paths we might take towards safe AI. One is to change the technology itself, which is what most AI safety researchers work on. Another is to change the structures that deploy the AI so that they have an incentive to make it safe. And the third is to change how the world works so that it is resilient to dangerous AI.
— Understanding human preferences
Scott Alexander summarizes a theoretical dispute between Stuart Russell, co-author of the standard AI textbook, and Eliezer Yudkowsky, one of the founders of the alignment field. Russell leads the research group CHAI at UC Berkeley, and their approach to safe machine learning is to keep the AI uncertain about human preferences so that it weighs human input far above its own current guesses about what is valuable. So if the AI is misunderstanding the task, it will seek human guidance to get it right.
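To build some intuition for the deferral idea, here is a toy expected-value sketch with made-up numbers (our own illustration, not CHAI’s formal assistance-game models): an AI that is unsure whether an action is good or bad for the human, and assumes a rational human will only approve it when it is good.

```python
# Toy sketch: when does "ask the human" beat "just act"?
# The action is worth +1 to the human if good, -1 if bad; asking has a small cost.
def expected_values(p_good, query_cost=0.05):
    act_now    = p_good * 1 + (1 - p_good) * (-1)            # act on current best guess
    do_nothing = 0.0
    defer      = p_good * 1 + (1 - p_good) * 0 - query_cost  # human vetoes the bad action
    return {"act now": act_now, "do nothing": do_nothing, "defer to human": defer}

for p in (0.3, 0.5, 0.9, 0.99):
    evs = expected_values(p)
    best = max(evs, key=evs.get)
    print(f"P(action is good)={p}: best option is '{best}'")
```

Note that once the toy AI becomes confident enough (the 0.99 case), deferring stops looking best, which previews the flavor of the criticism below.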
MIRI’s criticism is that we do not know how to formally model this type of scenario and that, even if we did, we would not know how to implement it correctly. The basic argument goes that an AI with this ability will misunderstand the options available to it and thereby update its understanding towards something that is still not what we want.
Eliezer kindly reached out to us to note that this is not the correct criticism. Here it is in his own words:
The basic argument goes that an AI whose utility function has been made dependent on hidden information, even if that information is inside humans, won't defer to humans because of that; it gets all the information that's obtainable and then ignores the humans (and kills them). There's never a point where "let the humans shut me off and build another AI" looks like a better strategy than "get all the info out of the humans and then stop listening".
— Loss functions, Andrej’s tutorials, interpretability, and alignment jams
In smaller news, Alex has released a description of four ways loss functions are used in machine learning and how we should understand them. Loss functions quantify how wrong a model’s outputs are, and fields like physics-based deep learning study them intensely.
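As a reminder of the mechanics, here is a minimal sketch (assuming PyTorch) of a loss function doing its job: turning “how wrong is the model” into a number whose gradient drives learning.

```python
# Minimal sketch of a loss function steering training in PyTorch (`pip install torch`).
import torch

model = torch.nn.Linear(1, 1)                        # tiny model: y = w*x + b
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()                         # "how wrong am I?" as mean squared error

x = torch.tensor([[1.0], [2.0], [3.0]])
y = torch.tensor([[2.0], [4.0], [6.0]])              # target function: y = 2x

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)                      # scalar measuring the error
    loss.backward()                                  # gradients of the loss w.r.t. parameters
    optimizer.step()                                 # nudge parameters to reduce the loss

print(loss.item())                                   # close to 0 after training
```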
Andrej Karpathy has begun creating tutorials on YouTube after stepping down as director of AI at Tesla. They are some of the best resources you can find for learning machine learning, and we recommend watching them.
Redwood Research has released an awesome interpretability tool that complements tools from Anthropic and OpenAI. This democratizes interpretability research and the ability to understand neural networks.
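For readers who want to poke at model internals in code, here is a minimal sketch of the kind of attention inspection such tools make interactive (this is not Redwood’s, Anthropic’s, or OpenAI’s tooling, just the Hugging Face transformers library).

```python
# Inspect GPT-2's attention patterns (`pip install transformers torch`).
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

attentions = outputs.attentions              # tuple: one tensor per layer
print(len(attentions), attentions[0].shape)  # 12 layers, each (batch, heads, tokens, tokens)
print(attentions[0][0, 0, -1])               # where head 0 in layer 0 attends from the last token
```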
On the 12th and 13th of November, we are running a hackathon on interpretability, and you are very welcome to register your interest already now. Join at the link in the description. Esben Kran gave an introductory lecture on interpretability, which you can watch at the same link.