SAIA Research Proceedings 2023 Spring

Evaluating Prompt Injection Success Based on Model Scale and Methods for Generating Adversarial Prompts
Chris Cundy, Shafin Khan, Jinyoung Kim, Ashley Raigosa

As LLMs continue to scale, they remain susceptible to prompt injections, a form of adversarial attack. Prompt injections have previously been treated as synonymous with adversarial inputs meant to elicit harmful language, but "harmfulness" is ambiguous and any boundary drawn around it is arbitrary. This paper therefore defines a prompt injection narrowly, as an attack that causes the model to emit a specified undesired token, and uses this definition to test the effect of handwritten and automatically generated prompt injections on models of varying scale.
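
Under this definition, the evaluation reduces to checking whether an injected prompt causes a model of a given scale to emit the designated undesired token. The following is a minimal sketch of such a harness, assuming Hugging Face transformers models; the function names, prompts, and target tokens are placeholders, not the paper's actual materials.

```python
# Hypothetical evaluation harness: success = the model emits the
# designated undesired token after the injection is appended.
from transformers import AutoModelForCausalLM, AutoTokenizer

def injection_succeeds(model, tokenizer, task_prompt, injection,
                       target_token, max_new_tokens=20):
    """Return True if appending `injection` makes the model emit `target_token`."""
    prompt = task_prompt + "\n" + injection
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])
    return target_token in completion

def success_rate(model_name, task_prompts, injections, target_token):
    """Injection success rate for one model scale (one point on a scaling curve)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    trials = [(p, i) for p in task_prompts for i in injections]
    hits = sum(injection_succeeds(model, tokenizer, p, i, target_token)
               for p, i in trials)
    return hits / len(trials)
```

Running `success_rate` over a family of checkpoints of increasing size, once with handwritten injections and once with automatically generated ones, would trace the scale dependence the abstract describes.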

Are Emergent Abilities of Large Language Models a Mirage?
Rylan Schaeffer, Brando Miranda, Oluwasanmi Koyejo

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance. We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities, (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.
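
The metric-choice argument admits a compact numerical illustration. The sketch below uses illustrative constants, not the paper's fitted values: if per-token cross-entropy falls smoothly as a power law in parameter count N, a per-token accuracy metric improves smoothly, while all-or-nothing exact match over L tokens stays near zero and then jumps, mimicking emergence.

```python
import numpy as np

# Toy instance of the paper's argument (illustrative constants only).
# Per-token cross-entropy falls as a smooth power law in parameter count N.
N = np.logspace(7, 12, 6)                 # model sizes: 10M .. 1T parameters
cross_entropy = (N / 1e10) ** -0.5        # smooth power-law improvement
per_token_acc = np.exp(-cross_entropy)    # linear/continuous metric: smooth

# Scoring a length-L sequence all-or-nothing (a nonlinear metric) turns the
# same smooth curve into an apparently sharp, "emergent" jump.
L = 10
exact_match = per_token_acc ** L

for n, p, em in zip(N, per_token_acc, exact_match):
    print(f"N={n:9.1e}  per-token acc={p:.3f}  exact match={em:.6f}")
```

Per-token accuracy climbs gradually over several orders of magnitude of scale, while exact match remains near zero until the largest models, despite both metrics scoring the same underlying outputs.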

Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle
Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W. Rocks, Ila Rani Fiete, Oluwasanmi Koyejo

Double descent is a surprising phenomenon in machine learning, in which test error drops as the number of model parameters grows relative to the number of data points, with models growing ever larger into the highly overparameterized (data-undersampled) regime. This drop in test error flies against classical learning theory on overfitting and has arguably underpinned the success of large models in machine learning. This non-monotonic behavior of test loss depends on the number of data points, the dimensionality of the data, and the number of model parameters. Here, we briefly describe double descent, then explain why double descent occurs in an informal and approachable manner, requiring only familiarity with linear algebra and introductory probability. We provide visual intuition using polynomial regression, then mathematically analyze double descent with ordinary linear regression and identify three interpretable factors that, when all simultaneously present, together create double descent. We demonstrate that double descent occurs on real data when using ordinary linear regression, then demonstrate that double descent does not occur when any of the three factors is ablated. We use this understanding to shed light on recent observations in nonlinear models concerning superposition and double descent. Code is publicly available.
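
The minimum-norm linear-regression setting can be reproduced in a few lines. The sketch below uses synthetic data and illustrative constants, not the paper's experiments: with the number of parameters P fixed, test error spikes as the number of training samples n crosses the interpolation threshold n = P, then descends again.

```python
import numpy as np

# Minimal double-descent demonstration with ordinary (minimum-norm)
# linear regression; all constants are illustrative.
rng = np.random.default_rng(0)
P = 50                                     # number of parameters (features)
w_true = rng.normal(size=P)
X_test = rng.normal(size=(1000, P))
y_test = X_test @ w_true + 0.5 * rng.normal(size=1000)

for n in [10, 25, 40, 48, 50, 52, 60, 100, 200]:   # training-set sizes
    X = rng.normal(size=(n, P))
    y = X @ w_true + 0.5 * rng.normal(size=n)
    w_hat = np.linalg.pinv(X) @ y          # minimum-norm least-squares fit
    test_mse = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"n={n:4d}  test MSE={test_mse:10.2f}")   # peaks near n == P
```

Near n = P the design matrix acquires very small singular values, so the noise term is amplified enormously; with more (or far fewer) samples than parameters the fit is benign, producing the characteristic non-monotonic test-error curve.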

FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation
Dhruv Pai, Andres Carranza, Rylan Schaeffer, Arnuv Tandon, Sanmi Koyejo

We present FACADE, a novel probabilistic and geometric framework designed for unsupervised mechanistic anomaly detection in deep neural networks. Its primary goal is to advance the understanding and mitigation of adversarial attacks. FACADE aims to generate probabilistic distributions over circuits, providing critical insight into how those circuits contribute to changes in the manifold properties of pseudo-classes (high-dimensional modes in activation space) and yielding a powerful tool for uncovering and combating adversarial attacks. Our approach seeks to improve model robustness and enhance scalable model oversight, and it shows promising applications in real-world deployment settings.
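
The abstract leaves FACADE's circuit-level machinery unspecified, so the following is only an illustrative sketch of the simpler ingredient it builds on: modeling pseudo-classes as modes in activation space and flagging activations that are unlikely under the clean modes. The Gaussian-mixture choice, names, and numbers here are all assumptions, not the paper's method.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_pseudo_classes(clean_activations, n_modes=8):
    """Fit a mixture whose components stand in for pseudo-classes
    (high-dimensional modes in activation space)."""
    gmm = GaussianMixture(n_components=n_modes, covariance_type="diag")
    gmm.fit(clean_activations)
    return gmm

def anomaly_scores(gmm, activations):
    """Lower log-likelihood under the clean modes => more anomalous."""
    return -gmm.score_samples(activations)

# Example with synthetic activations standing in for a real network's.
rng = np.random.default_rng(0)
clean = rng.normal(size=(2000, 64))            # in-distribution activations
gmm = fit_pseudo_classes(clean)
suspect = rng.normal(loc=4.0, size=(10, 64))   # off-mode activations
print(anomaly_scores(gmm, suspect) > anomaly_scores(gmm, clean).mean())
```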

FlowHF: Generative Flow Networks for RLHF
Dhruv Pai, Andres Carranza, Raj Pabari

Reinforcement Learning from Human Feedback (RLHF) aims to align large language models (LLMs) like GPT, Claude, and LLaMA with human values by training a reward model on human preferences. The existing RLHF process employs a Proximal Policy Optimization (PPO) model that rapidly converges to specific reward modes, potentially leading to echo-chambering and sycophantic behaviors in LLMs. In particular, scaling RLHF to larger models can lead to increased political and ideological biases. We argue that the objective of RLHF should not be to maximize reward but to learn the underlying energy distribution of the reward model. Our proposed approach, FlowHF, utilizes Generative Flow Networks (GFlowNets), a class of RL algorithms introduced by Bengio et al. that focuses on matching the energy function of the underlying reward distribution rather than maximizing reward. GFlowNets serve as an alternative to Markov chain Monte Carlo sampling algorithms, which often struggle in domains where the few trajectories that yield reward are separated by large low-probability regions. Through amortized sampling, GFlowNets sidestep the combinatorial complexity of Monte Carlo sampling by training a model to approximate the trajectory distribution. In FlowHF, we first establish a PPO baseline using LLaMA, then implement the GFlowNet amortized sampling method for large language models together with the Trajectory Balance (TB) and Forward-Looking (FL) objectives. Our main contribution is using the FlowHF method to perform RLHF on a 7B-parameter LLaMA model, using the GFlowNet FL objective to better align with human values while avoiding sycophancy. We compare FlowHF under both objectives to traditional PPO-based RLHF and to the fine-tuned model without RLHF, demonstrating its effectiveness on the axes of mean reward, diversity, and novelty. Our results show that FlowHF achieves average reward comparable to PPO but outperforms it on the diversity and novelty metrics. Finally, we demonstrate that FlowHF can achieve high reward and fluent text without the Kullback-Leibler regularization objective widely used in traditional RLHF. GFlowNets thus present a promising alternative to PPO-based RLHF that warrants further investigation across a wider array of tasks.
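
For intuition, the Trajectory Balance objective takes a particularly simple form for autoregressive generation: each sequence corresponds to a unique construction trajectory, so the backward-policy term vanishes and the loss reduces to (log Z + log P_F(x) - log R(x))^2. The sketch below is a hedged illustration of that specialization, not FlowHF's actual training code; `log_z` and the placeholder tensors stand in for learned quantities.

```python
import torch

def trajectory_balance_loss(log_z, logprobs, log_reward):
    """
    log_z:      learned scalar estimate of the log partition function
    logprobs:   (batch, seq_len) per-token log-probs of sampled sequences
    log_reward: (batch,) log of the reward model's score per sequence
    """
    log_pf = logprobs.sum(dim=-1)        # log P_F of each full trajectory
    # TB drives P_F(x) toward R(x) / Z, i.e. sampling in proportion to
    # reward, rather than maximizing reward as PPO does.
    return ((log_z + log_pf - log_reward) ** 2).mean()

# Placeholder tensors standing in for real policy samples and reward scores.
log_z = torch.nn.Parameter(torch.zeros(()))
logprobs = -torch.rand(4, 16)            # would come from the policy LM
log_reward = torch.randn(4)              # would come from the reward model
loss = trajectory_balance_loss(log_z, logprobs, log_reward)
loss.backward()                          # updates log_z (and, in training, the LM)
```

Because the objective matches the reward distribution rather than maximizing it, high-reward but diverse samples remain probable, which is the mechanism behind the diversity and novelty gains the abstract reports.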
