SAIA Research Proceedings 2023-24 Winter
These research proceedings present projects that SAIA members worked on in Winter 2023-24. The proceedings are non-archival and were selected via an approval process that did not include peer review.
Markovian Agents for Truthful Language Modeling
Scott Viteri, Max Lamparth, Peter Chatain & Clark Barrett
Chain-of-Thought (CoT) reasoning could in principle enable a deeper understanding of a language model's internal reasoning. However, prior work has found that capable models are often robust to changes in their reasoning traces, which suggests that those models are not truly using the traces. We propose a method that factors predictions through intermediate "state" text, which serves as the sole context for predicting subsequent observations. This increases the faithfulness of CoT by guaranteeing that if the language model can predict future tokens, it must have used the CoT to understand its context. To this end, we introduce "Markovian Training," a fine-tuning procedure applicable to pre-trained models or during pre-training itself. Because generating useful state tokens is non-differentiable, we formulate the problem as a reinforcement learning task, using Proximal Policy Optimization (PPO) to optimize the log probability of the next observation given only the "state" text. We demonstrate the effectiveness of our training algorithm on reasoning tasks such as arithmetic problems, show that the model utilizes the chain of thought, and validate that the generated chain of thought is meaningful and usable by other models.
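To make the training signal concrete, the sketch below shows one way the Markovian reward could be computed: the log probability of the next observation conditioned only on the generated "state" text. This is an illustrative sketch, not the authors' released implementation; the model choice ("gpt2"), the helper name, and the toy inputs are assumptions.

```python
# Illustrative sketch of the Markovian reward, not the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

def markovian_reward(state_text: str, observation: str) -> float:
    """Log-probability of the next observation given ONLY the generated
    'state' text -- the quantity the state-writing policy maximizes via PPO."""
    state_ids = tokenizer(state_text, return_tensors="pt").input_ids
    obs_ids = tokenizer(observation, return_tensors="pt").input_ids
    input_ids = torch.cat([state_ids, obs_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict the token at position i + 1, so take the
    # slice that predicts each observation token.
    obs_logits = logits[:, state_ids.shape[1] - 1 : -1, :]
    log_probs = torch.log_softmax(obs_logits, dim=-1)
    token_log_probs = log_probs.gather(-1, obs_ids.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

# The policy samples state_text from the question alone; PPO then updates it
# with this reward, so the state must carry everything needed to predict the
# observation. Toy strings below are for illustration only.
reward = markovian_reward("7 + 5 = 12, so the units digit is 2 ...",
                          " The answer is 12.")
```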
Mechanistic Interpretability (MI) has emerged as a pivotal approach to understanding the intricate workings of neural networks, reverse engineering them from the bottom up much as one would reverse engineer a compiled binary program. In the context of natural language tasks in the social sciences, particularly tasks in psychology that intersect with clinical domains, the importance of explainability cannot be overstated. This project presents an analysis of MI's role in working towards a therapeutic chatbot, highlighting the challenges and importance of understanding AI systems in a social science context.
Can AI Self-Correct? Analyzing the Limitations of Large Language Models
Kai Fronsdal
Learning from model feedback has become a prevalent strategy for scaling models beyond human capabilities (e.g., RLAIF and its variants). The concept of self-correction, in particular, offers the potential to enhance the fundamental reasoning abilities of Large Language Models (LLMs). However, recent studies have indicated that this method, while successful for certain tasks, falls short when applied to mathematical reasoning. Our investigation establishes that smaller models exhibit shortcomings in accurately assessing mathematical arguments, and that fine-tuning for self-correction improves only their calibration rather than their ability to discern correctness. Further, through manual analysis of correct and incorrect solutions, we find that models cannot identify granular calculation mistakes such as carry errors and missing signs, but can correct errors in high-level reasoning.
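For readers unfamiliar with the setup, the following is a minimal sketch of the kind of self-correction loop analyzed here: the model answers, critiques its own answer, and revises. The llm callable and the prompt wording are illustrative assumptions, not the prompts used in the study.

```python
# Schematic self-correction loop: generate, self-critique, revise.
# The llm callable (str -> str) and prompts are illustrative assumptions.
def self_correct(llm, problem: str, rounds: int = 1) -> str:
    answer = llm(f"Solve step by step:\n{problem}")
    for _ in range(rounds):
        critique = llm(
            f"Problem:\n{problem}\n\nProposed solution:\n{answer}\n\n"
            "Review the solution. Is it correct? Point out any errors."
        )
        answer = llm(
            f"Problem:\n{problem}\n\nProposed solution:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nWrite a corrected final solution."
        )
    return answer
```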
Aligning Multi-Model AI Training: Risks and Mitigation Strategies
Aryan Chaudhary
As the field of artificial intelligence rapidly advances, the development of powerful multi-model systems such as Generative Adversarial Networks (GANs) and Multi-Agent Reinforcement Learning (MARL) architectures has opened up new frontiers. However, aligning the behavior of these complex systems with intended objectives and human values remains a significant challenge. This paper explores the key risks and alignment issues that can arise when multiple models train each other in an adversarial or cooperative fashion. For GANs, problems such as mode collapse, bias amplification, reward hacking, and training instabilities are examined, along with potential mitigation strategies such as improved discriminator designs and controlled environment frameworks. In MARL systems, alignment threats include emergent misaligned behaviors, reward hacking, scalability issues, and lack of transparency, which could be addressed through techniques such as adversarial reward shaping, supervised attention mechanisms, and hierarchical reinforcement learning. By analyzing real-world use cases, evaluating proposed solutions, and synthesizing insights across ethics and machine learning, this work underscores the importance of prioritizing AI alignment so that multi-model systems remain interpretable, robust, and aligned with human values as they become increasingly sophisticated and widely deployed.
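To make the adversarial two-model setup concrete, the following is a minimal GAN training step in PyTorch; it shows where pathologies such as mode collapse can enter when one model overpowers the other. The architectures, hyperparameters, and toy data are placeholders, not drawn from the paper.

```python
# Minimal GAN update step, included only to illustrate two models training
# each other adversarially. All sizes and hyperparameters are placeholders.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real: torch.Tensor) -> None:
    batch = real.shape[0]
    # Discriminator update: separate real data from generator samples.
    z = torch.randn(batch, latent_dim)
    fake = G(z).detach()
    d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator update: try to fool the discriminator. If either model
    # dominates, failures such as mode collapse can emerge.
    z = torch.randn(batch, latent_dim)
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

train_step(torch.randn(32, data_dim))  # toy "real" data for illustration
```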
Constitutional AI - A Path Towards Global Value Convergence?
Julian Serra
The rapid advancement of large language models (LLMs) has led to the development of constitutional AI, exemplified by Anthropic's AI assistant, Claude. By training LLMs with a codified set of human ethics and values, constitutional AI aims to align these powerful systems with human intentions while preserving their utility. This approach raises important questions about the democratization of AI ethics and the potential for AI constitutions to shape the moral foundations of both artificial and human intelligence in the future.
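The sketch below illustrates the general critique-and-revision pattern that constitutional approaches rely on, in which responses are repeatedly checked against written principles. The principles, prompts, and llm callable are illustrative assumptions and do not represent Anthropic's actual pipeline.

```python
# Schematic critique-and-revision pass in the spirit of constitutional AI.
# Principles, prompts, and the llm callable (str -> str) are assumptions.
PRINCIPLES = [
    "Choose the response that is least likely to be harmful or unethical.",
    "Choose the response that most respects the user's autonomy and privacy.",
]

def constitutional_revision(llm, prompt: str, response: str) -> str:
    for principle in PRINCIPLES:
        critique = llm(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = llm(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response  # revised responses would then serve as fine-tuning data
```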
A Literature Review of Deceptive Alignment and When It Might Occur
Eugenie Shi
This paper is a literature review of current research on deceptive alignment. First, the paper introduces the definition of deceptive alignment, including the definitions of mesa-optimizer and base-optimizer, and explains how it differs from ordinary dishonesty, as exemplified by sycophantic models. Second, the paper describes concrete examples of deceptive alignment and the process by which it arises, including an empirical example from the Sleeper Agents paper in which a model acts differently in the training and deployment environments. Third, the paper summarizes the conditions under which deceptive alignment may occur, possible ways to prevent or correct it, and how deceptively aligned mesa-optimizers might detect the distributional shift from training to deployment so that they can begin pursuing their mesa-objectives. Lastly, the paper addresses the debate over whether deceptive alignment is likely to occur, presenting arguments on both sides.