Policy Options for Preserving Chain of Thought Monitorability

This report was co-authored by Oscar Delaney, Oliver Guest, and Renan Araujo.

Executive Summary

Today’s most advanced artificial intelligence (AI) models use “chain of thought” (CoT) reasoning; monitoring this CoT can be valuable for controlling these systems and ensuring that they will behave as intended. This is because, like humans, AIs need to think in steps in order to solve complex problems. We can monitor the steps that the model writes (i.e., the CoT) and intervene to stop the model’s default course of action, if a concerning intention is expressed in the CoT.

As AI systems are deployed in higher stakes contexts, CoT monitoring could become increasingly important for preventing risks to public safety and national security. In experiments, researchers have already found AIs to cheat on a software development task by altering a test to make the existing code pass, rather than making the code work as intended. The model did this after expressing an intention in the CoT to “fudge” the test. Similarly unreliable AIs deployed in safety-critical systems without adequate control measures could cause serious harm.

However, competitive pressures may incentivize AI developers to create AIs that do not reason in human-language, undermining CoT monitoring. In particular, a shift away from human language CoTs might deliver comparable performance at lower cost. This would be a coordination problem: everyone might benefit if AI developers preserve monitorable architectures, but individual AI developers might not do so for fear of being outcompeted.

Coordination mechanisms such as voluntary agreements between companies, domestic policy measures, and international agreements could help preserve CoT monitoring. Which coordination mechanisms are needed depends on the size of the “monitorability tax”, i.e., how costly it is overall to the developer to preserve monitorability, compared to the benefits to society of preserving monitorability. We sketch out a set of verification measures that could be used to verify compliance across each of these coordination mechanisms.

We also propose several recommendations to enhance CoT monitoring coordination. These would be low-regret to implement, regardless of the size of any monitorability tax:

  • AI developers should refine and implement monitorability evaluations and publish these results in system cards, strengthening norms around CoT monitoring.

  • Governments should establish verification infrastructure in preparation for possible domestic or international CoT monitorability policies.

  • External researchers should pursue monitoring and controllability mechanisms for next-generation AI architectures, in case efforts to preserve existing architectures fail.

Next
Next

Accelerating AI Data Center Security