Sidney Hough

Reservations about some AI alignment research

13 April 2023

Sometimes I talk to people about what constitutes a reasonable career given continued AI development. I currently think that regulation seems very important, certain types of practical alignment work seem good, and x-risk-flavored alignment work seems ineffective and potentially actively harmful.

This post is intended to be a broad-strokes record of my beliefs in case they’re helpful to anyone thinking about careers.

World 1: Alignment research may be ineffective because it’s too difficult on practical timescales

Arguably the gap between meaningful safety progress and capabilities progress continues to grow. This gap can’t immediately be attributed to core difficulties in alignment, given that far more researchers and engineers are pushing on capabilities than on safety. But there are certainly many intuitive reasons to expect alignment to be difficult (which aren’t particularly controversial, so I won’t list them here). It’s also realistic to expect the gap in expenditure - and hence in progress - to grow with time, at least until alignment becomes a core bottleneck to deploying highly capable AI systems (see world 4).

I’m worried about the gap between the frontier and alignment, but I’m also increasingly worried about the mélange of emerging uses of systems of varying capabilities, which makes it more difficult to reason about alignment. The AI space is seeing developments that aren’t that surprising - e.g. AutoGPT - but there are many of them, and I expect applications to become more numerous and unpredictable as the technology proliferates. How does model interpretability apply, for instance, if models are interacting in simulated markets or plugged into preexisting complex data pipelines, and these systems are the backbone of enormous new recommender systems? It becomes trickier to make assumptions about how ML systems work in practice, yet it’s those assumptions that make alignment problems tractable.

World 2: Alignment research may be ineffective because nobody will adopt solutions

This seems likely if progress in alignment is orthogonal to progress in capabilities. ML models are becoming increasingly commoditized via developer services (“build your own chatbot!”) and open source. If there’s no strong incentive to care about alignment, and millions of people and organizations have access to highly capable ML models, at least a few of them will be using unsafe models. Many more will be using unsafe models if it turns out aligned models cannot compete with unaligned models, i.e. there exists an alignment tax.

This argument is not very relevant to less ambitious safety work, e.g. getting models to not be racist, or detecting whether an output came from a human or a model. These techniques similarly may not see mass adoption, but they will see some adoption, and their benefits are pretty linear. Safety work that aims to stop extremely advanced models from destroying humanity, on the other hand, relies on nearly complete adoption to be worthwhile - a “one bad apple” situation.

World 3: Alignment research will be ineffective because other harms will emerge before runaway superintelligence

Even absent further progress in AI, it seems likely that the worst of misinformation campaigns, deepfakes, and other harms is yet to come as bad actors and clueless users experiment with existing technology. And language/vision models are far from threatening extinction!

As models get better, the magnitude of potential harms increases. At some point they become appealing tools for extremists or very irresponsible users to wreak existential havoc, for instance via engineering of bioweapons. I would be unsurprised if we reach this point before we encounter runaway superintelligence that “wants” to wreak havoc, because we probably don’t need goal-directed, autonomous agents for AI to be very good at crunching scientific data.

World 4: Alignment research will be actively harmful if it contributes to capabilities and other harms emerge before runaway superintelligence

Obviously, one way alignment goes poorly is if it enables serious misuse by extremists or aggressive countries. The other way alignment goes poorly is if it directly or indirectly accelerates AI arms races. We don’t have a good track record here: RLHF in large part took AI mainstream while leaving huge safety holes (direct), and Anthropic broke off from OpenAI nominally to focus on safety, only to declare its intent to commercialize and raise billions of dollars (indirect). Maintaining independence from investors and pressure from customers as a safety-focused research organization seems inherently at odds with exerting influence over important AI labs.

While this is a claim about specific research, it seems quite hard to encourage an exclusively helpful sort of alignment research (and it’s possible that exclusively helpful alignment research just doesn’t exist). So far AI safety field-building looks mostly like orienting around the right words - “alignment,” “safety,” etc. - rather than rock-solid principles, because those principles don’t exist yet. You might argue that RLHF is only a “minimum viable safety technique” and that current language models won’t really matter in the long run, but if it sets any sort of precedent, it’s that companies can justify pushing on capabilities under the veneer of safety.

It’s difficult to assess even the most straightforward safety efforts, e.g. work to make models less racist. If this problem is approached via general strategies that look like “get the model to do what I say, including ‘don’t be racist’”, and those strategies help a company build its next model, there may be negative externalities. Post hoc filtering of racist language in a specific model’s outputs is, by contrast, a very restricted strategy that won’t have those knock-on effects.
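As a concrete (and deliberately toy) illustration of the difference, here is a minimal Python sketch of what such a post hoc filter might look like. The generate function and the blocklist are hypothetical placeholders standing in for a fixed, already-trained model and a real term list; the point is that the intervention wraps the model’s outputs rather than changing how the model is trained, which is why it has little to feed back into the next model’s capabilities.

# Minimal sketch of a post hoc output filter; all names below are hypothetical placeholders.
BLOCKED_TERMS = {"placeholder_term_1", "placeholder_term_2"}  # not a real blocklist

def generate(prompt: str) -> str:
    """Stand-in for a call to some fixed, already-trained model."""
    return f"model output for: {prompt}"

def filtered_generate(prompt: str) -> str:
    """Generate normally, then withhold any output containing a blocked term."""
    output = generate(prompt)
    if any(term in output.lower() for term in BLOCKED_TERMS):
        return "[output withheld by post hoc filter]"
    return output

print(filtered_generate("an example prompt"))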

If alignment turns out to be a must-have for capabilities, and the only existential threat from AI comes from runaway superintelligence - no earlier model could do much damage - we’re possibly in a good position, because there’s an incentive to adopt safety practices when they matter most and no harm could come from an arms race. I find this scenario extremely unlikely.

Conclusion

If one cares about x-risk and is pessimistic about alternative means of intervention, e.g. governance, worlds 1-3 aren’t that important because alignment work still has a big upside. However, I think one of worlds 2 or 4 has to play out - either alignment and capabilities are correlated or they’re not, and either way it seems like we’re in trouble. If one cares about x-risk and thinks that world 4 is not unlikely, it seems important to weigh the potential huge downsides against the upsides when choosing how to spend one’s time. Is it more likely that humanity is driven extinct by a runaway superintelligence, or by a slightly dumber but very powerful AI that a human misdirects?

I’m excited about other ways to make AI development go well, in particular via regulatory structures that ensure AI progress can’t happen without a solution to alignment, and that, given such a solution, the technology proliferates in healthy ways. Since I think that one of worlds 2 or 4 has to be true, I struggle to imagine a world where, without regulatory intervention, AI development goes well, regardless of whether we “solve” the alignment problem. Conversely, I can imagine worlds with strong regulation and no solution to the alignment problem that go okay - in particular if runaway superintelligence never arises. This makes me think that, with respect to order of operations, regulation should be the first priority.

Can we work on governance and alignment in parallel? Only if one thinks that world 2 is the likely failure mode - that alignment work is at worst ineffectual and will never aggravate the situation - so that once good regulatory structures are eventually in place, we’ll have a solution ready to go. But we still need a vast amount of upfront effort invested in governance, since that’s the near-term load-bearing element of the plan.