Research Papers
Training Large Language Models to Reason in a Continuous Latent Space
Large language models (LLMs) are restricted to reasoning in the "language
space", where they typically express the reasoning process as a
chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue
that language space may not always be optimal for reasoning. For example, most
word tokens are primarily for textual coherence and not essential for
reasoning, while some critical tokens require complex planning and pose huge
challenges to LLMs. To explore the potential of LLM reasoning in an
unrestricted latent space instead of using natural language, we introduce a new
paradigm Coconut (Chain of Continuous Thought). We utilize the last hidden
state of the LLM as a representation of the reasoning state (termed "continuous
thought"). Rather than decoding this into a word token, we feed it back to the
LLM as the subsequent input embedding directly in the continuous space.
Experiments show that Coconut can effectively augment the LLM on several
reasoning tasks. This novel latent reasoning paradigm leads to emergent
advanced reasoning patterns: the continuous thought can encode multiple
alternative next reasoning steps, allowing the model to perform a breadth-first
search (BFS) to solve the problem, rather than prematurely committing to a
single deterministic path like CoT. Coconut outperforms CoT in certain logical
reasoning tasks that require substantial backtracking during planning, with
fewer thinking tokens during inference. These findings demonstrate the promise
of latent reasoning and offer valuable insights for future research.
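For intuition, here is a minimal sketch of the continuous-thought loop described above, written against the Hugging Face transformers API. The model choice, the number of latent steps, and the direct reuse of the final-layer hidden state as the next embedding are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch: feed the last hidden state back as the next input embedding
# instead of decoding a word token. Assumes hidden size == embedding size
# (true for GPT-2); `n_thoughts` is an illustrative choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok("Problem: 3 + 4 * 2 = ?", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)            # (1, T, d)

n_thoughts = 4                                        # latent reasoning steps
with torch.no_grad():
    for _ in range(n_thoughts):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]    # "continuous thought"
        embeds = torch.cat([embeds, thought], dim=1)  # no word token emitted
    # After the latent steps, decode the answer in language space as usual.
    next_id = model(inputs_embeds=embeds).logits[:, -1, :].argmax(-1)
print(tok.decode(next_id))
```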
Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families
Scaling laws for large language models (LLMs) predict model performance based
on parameters like size and training data. However, differences in training
configurations and data processing across model families lead to significant
variations in benchmark performance, making it difficult for a single scaling
law to generalize across all LLMs. On the other hand, training family-specific
scaling laws requires training models of varying sizes for every family. In
this work, we propose Skills Scaling Laws (SSLaws, pronounced as Sloth), a
novel scaling law that leverages publicly available benchmark data and assumes
LLM performance is driven by low-dimensional latent skills, such as reasoning
and instruction following. These latent skills are influenced by computational
resources like model size and training tokens but with varying efficiencies
across model families. Sloth exploits correlations across benchmarks to provide
more accurate and interpretable predictions while alleviating the need to train
multiple LLMs per family. We present both theoretical results on parameter
identification and empirical evaluations on 12 prominent benchmarks from the
Open LLM Leaderboard v1 and v2, demonstrating that Sloth predicts LLM performance
efficiently and offers insights into scaling behaviors for downstream tasks
such as coding and emotional intelligence applications.
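A toy numerical sketch of the idea follows, assuming two latent skills and a sigmoid link from skills to benchmark accuracy; the feature set, dimensions, and parameter values are made up for illustration and are not the paper's fitted model.

```python
# Hedged sketch of a Sloth-style skills scaling law: compute features map to
# a few latent skills with family-specific efficiencies, and shared loadings
# map skills to per-benchmark accuracies. All numbers are illustrative.
import numpy as np

def predict_scores(log_params, log_tokens, family_eff, loadings, bias):
    x = np.array([log_params, log_tokens])    # compute features
    skills = family_eff @ x                   # low-dimensional latent skills
    return 1 / (1 + np.exp(-(loadings @ skills + bias)))  # benchmark scores

family_eff = np.array([[0.6, 0.3],           # skill 1 efficiency (size, tokens)
                       [0.2, 0.5]])          # skill 2 efficiency
loadings = np.array([[1.0, 0.1],             # 3 benchmarks x 2 skills,
                     [0.4, 0.9],             # shared across model families
                     [0.7, 0.7]])
print(predict_scores(np.log(8e9), np.log(15e12), family_eff, loadings,
                     bias=np.array([-25.0, -28.0, -32.0])))
```

Only the family-specific efficiencies need refitting for a new family, which is why a few public benchmark scores can suffice instead of training new models.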
INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
Imagine having a conversation with a socially intelligent agent that can
attentively listen to your words and promptly offer visual and linguistic
feedback. Such seamless interaction allows multiple rounds of conversation
to flow smoothly and naturally. To realize this vision, we propose INFP,
a novel audio-driven head generation framework for dyadic interaction. Unlike
previous head generation works that only focus on single-sided communication,
or require manual role assignment and explicit role switching, our model
drives the agent portrait to alternate dynamically between speaking and
listening states,
guided by the input dyadic audio. Specifically, INFP comprises a Motion-Based
Head Imitation stage and an Audio-Guided Motion Generation stage. The first
stage learns to project facial communicative behaviors from real-life
conversation videos into a low-dimensional motion latent space, and uses the
motion latent codes to animate a static image. The second stage learns the
mapping from the input dyadic audio to motion latent codes through denoising,
leading to the audio-driven head generation in interactive scenarios. To
facilitate this line of research, we introduce DyConv, a large-scale dataset of
rich dyadic conversations collected from the Internet. Extensive experiments
and visualizations demonstrate superior performance and effectiveness of our
method. Project Page: https://grisoon.github.io/INFP/.
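Structurally, the two stages can be pictured with a skeleton like the one below; the layer sizes, feature dimensions, and the simple MLP denoiser are placeholders for illustration, not the INFP architecture.

```python
# Hedged structural sketch of the two INFP stages described above.
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Stage 1: compress facial communicative behavior into motion latents."""
    def __init__(self, feat_dim=256, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))

    def forward(self, face_feats):             # (B, T, feat_dim)
        return self.net(face_feats)            # (B, T, latent_dim)

class AudioToMotion(nn.Module):
    """Stage 2: denoise motion latents conditioned on dyadic audio features."""
    def __init__(self, audio_dim=128, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(audio_dim + latent_dim, 128),
                                 nn.ReLU(), nn.Linear(128, latent_dim))

    def forward(self, noisy_latent, audio_feats):
        return self.net(torch.cat([noisy_latent, audio_feats], dim=-1))
```

A renderer that animates the static portrait from the predicted motion latents, as the abstract describes, would close the loop.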
The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
Language models have shown impressive performance on tasks within their
training distribution, but often struggle with novel problems requiring complex
reasoning. We investigate the effectiveness of test-time training (TTT) --
updating model parameters temporarily during inference using a loss derived
from input data -- as a mechanism for improving models' reasoning capabilities,
using the Abstraction and Reasoning Corpus (ARC) as a benchmark. Through
systematic experimentation, we identify three crucial components for successful
TTT: (1) initial fine-tuning on similar tasks, (2) an auxiliary task format
and augmentations, and (3) per-instance training. TTT significantly improves
performance
on ARC tasks, achieving up to 6x improvement in accuracy compared to base
fine-tuned models; applying TTT to an 8B-parameter language model, we achieve
53% accuracy on the ARC's public validation set, improving the state-of-the-art
by nearly 25% for public and purely neural approaches. By ensembling our method
with recent program generation approaches, we get SoTA public validation
accuracy of 61.9%, matching the average human score. Our findings suggest that
explicit symbolic search is not the only path to improved abstract reasoning in
neural language models; additional test-time compute applied to continued
training on few-shot examples can also be extremely effective.
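The per-instance loop is the distinctive part, so here is a hedged sketch of it, assuming a Hugging Face-style model; `make_aug_batches` and `task` are hypothetical helpers standing in for the paper's augmentation and data handling.

```python
# Hedged sketch of per-instance test-time training: temporarily fine-tune on
# losses from the instance's own (augmented) demonstrations, predict, then
# restore the original weights.
import copy
import torch

def solve_with_ttt(model, task, make_aug_batches, steps=20, lr=1e-4):
    snapshot = copy.deepcopy(model.state_dict())   # updates stay per-instance
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        for batch in make_aug_batches(task.demos): # augmented few-shot pairs
            loss = model(**batch).loss             # batch must include labels
            opt.zero_grad()
            loss.backward()
            opt.step()
    model.eval()
    with torch.no_grad():
        answer = model.generate(**task.query)
    model.load_state_dict(snapshot)                # discard the temporary update
    return answer
```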
Stealing User Prompts from Mixture of Experts
Mixture-of-Experts (MoE) models improve the efficiency and scalability of
dense language models by routing each token to a small number of experts in
each layer. In this paper, we show how an adversary that can arrange for their
queries to appear in the same batch of examples as a victim's queries can
exploit Expert-Choice-Routing to fully disclose a victim's prompt. We
successfully demonstrate the effectiveness of this attack on a two-layer
Mixtral model, exploiting the tie-handling behavior of the torch.topk CUDA
implementation. Our results show that we can extract the entire prompt using
$O(VM^2)$ queries (with vocabulary size $V$ and prompt length $M$) or 100
queries on average per token in the setting we consider. This is the first
attack to exploit architectural flaws for the purpose of extracting user
prompts, introducing a new class of LLM vulnerabilities.
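The side channel is easy to picture with a toy example: under Expert-Choice routing, an expert keeps only the top-k scoring tokens in the batch, and how topk breaks ties between equal scores is implementation-defined. The snippet below is an illustration of that boundary behavior, not the paper's attack code.

```python
# Toy illustration: whether a victim token wins a capacity slot against an
# attacker token with a tied router score depends on topk tie handling, and
# being kept or dropped is observable from the model's output, which leaks
# information about the victim's token.
import torch

router_scores = torch.tensor([0.9, 0.9, 0.5, 0.9])  # attacker + victim tokens
expert_capacity = 2
kept = torch.topk(router_scores, k=expert_capacity).indices
print(kept)  # which tied token is kept depends on the tie-handling rule;
             # the attack probes this boundary token by token
```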
STAR: A Simple Training-free Approach for Recommendations using Large Language Models
Recent progress in large language models (LLMs) offers promising new
approaches for recommendation system (RecSys) tasks. While the current
state-of-the-art methods rely on fine-tuning LLMs to achieve optimal results,
this process is costly and introduces significant engineering complexities.
Conversely, methods that bypass fine-tuning and use LLMs directly are less
resource-intensive but often fail to fully capture both semantic and
collaborative information, resulting in sub-optimal performance compared to
their fine-tuned counterparts. In this paper, we propose a Simple Training-free
Approach for Recommendation (STAR), a framework that utilizes LLMs and can be
applied to various recommendation tasks without the need for fine-tuning. Our
approach involves a retrieval stage that uses semantic embeddings from LLMs
combined with collaborative user information to retrieve candidate items. We
then apply an LLM for pairwise ranking to enhance next-item prediction.
Experimental results on the Amazon Review dataset show competitive performance
for next-item prediction, even with our retrieval stage alone. Our full method
achieves Hits@10 performance of +23.8% on Beauty, +37.5% on Toys and Games, and
-1.8% on Sports and Outdoors relative to the best supervised models. This
framework offers an effective alternative to traditional supervised models,
highlighting the potential of LLMs in recommendation systems without extensive
training or custom architectures.
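The two-stage pipeline can be sketched as follows; `item_vecs`, `cooc`, `llm_prefers`, and the mixing weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a STAR-like training-free pipeline: blend LLM-embedding
# similarity with a collaborative co-occurrence signal for retrieval, then
# rank candidates with LLM pairwise judgments.
import numpy as np

def retrieve(history, item_vecs, cooc, alpha=0.5, k=20):
    hist_vec = item_vecs[history].mean(axis=0)
    semantic = item_vecs @ hist_vec           # LLM-embedding similarity
    collab = cooc[history].mean(axis=0)       # co-interaction counts
    scores = alpha * semantic + (1 - alpha) * collab
    return np.argsort(-scores)[:k]            # candidate items

def rank(candidates, llm_prefers):
    # Keep a running winner via pairwise LLM comparisons (illustrative).
    best = candidates[0]
    for c in candidates[1:]:
        if llm_prefers(c, best):              # "is c the better next item?"
            best = c
    return best
```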
Mars: Situated Inductive Reasoning in an Open-World Environment
Large Language Models (LLMs) trained on massive corpora have shown remarkable
success in knowledge-intensive tasks. Yet, most of them rely on pre-stored
knowledge. Inducing new general knowledge from a specific environment and
performing reasoning with the acquired knowledge -- \textit{situated inductive
reasoning} -- is crucial and challenging for machine intelligence. In this paper,
we design Mars, an interactive environment devised for situated inductive
reasoning. It introduces counter-commonsense game mechanisms by modifying
terrain, survival setting and task dependency while adhering to certain
principles. In Mars, agents need to actively interact with their surroundings,
derive useful rules and perform decision-making tasks in specific contexts. We
conduct experiments on various RL-based and LLM-based methods, finding that
they all struggle on this challenging situated inductive reasoning benchmark.
Furthermore, we explore \textit{Induction from Reflection}, where we instruct
agents to perform inductive reasoning from their historical trajectories. The superior
performance underscores the importance of inductive reasoning in Mars. Through
Mars, we aim to galvanize advancements in situated inductive reasoning and set
the stage for developing the next generation of AI systems that can reason in
an adaptive and context-sensitive way.
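A minimal sketch of the Induction-from-Reflection loop, as described in the abstract: periodically ask an LLM to induce world rules from the trajectory so far, then condition actions on them. The `llm` callable and the `env` interface are hypothetical.

```python
# Hedged sketch: alternate acting with reflection steps that induce rules
# from the recorded trajectory.
def act_with_induction(llm, env, max_steps=100, reflect_every=10):
    trajectory, rules = [], "none yet"
    obs = env.reset()
    for t in range(max_steps):
        if t > 0 and t % reflect_every == 0:  # reflection step
            rules = llm("Induce general rules of this world from:\n"
                        + "\n".join(trajectory))
        action = llm(f"Rules: {rules}\nObservation: {obs}\nNext action:")
        next_obs, reward, done = env.step(action)
        trajectory.append(f"obs={obs} action={action} reward={reward}")
        obs = next_obs
        if done:
            break
    return trajectory
```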
AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation
Large Language Model-based agents have garnered significant attention and are
becoming increasingly popular. In particular, planning ability is a crucial
component of an LLM-based agent, which generally entails achieving a desired
goal from an initial state. This paper investigates enhancing the planning
abilities of LLMs through instruction tuning, referred to as agent training.
Recent studies have demonstrated that utilizing expert-level trajectories for
instruction-tuning LLMs effectively enhances their planning capabilities.
However, existing work primarily focuses on synthesizing trajectories from
manually designed planning tasks and environments. The labor-intensive nature
of creating these environments and tasks impedes the generation of sufficiently
varied and extensive trajectories. To address this limitation, this paper
explores the automated synthesis of diverse environments and a gradual range of
planning tasks, from easy to difficult. We introduce a framework, AgentGen,
that leverages LLMs first to generate environments and subsequently generate
planning tasks conditioned on these environments. Specifically, to improve
environmental diversity, we propose using an inspiration corpus composed of
various domain-specific text segments as the context for synthesizing
environments. Moreover, to increase the difficulty diversity of generated
planning tasks, we propose a bidirectional evolution method, Bi-Evol, that
evolves planning tasks in both easier and harder directions to synthesize a task
set with a smoother difficulty curve. The evaluation results derived from
AgentBoard show that AgentGen greatly improves LLMs' planning ability, e.g.,
the AgentGen instruction-tuned Llama-3.1-8B surpasses GPT-3.5 in overall
performance. Moreover, the AgentGen-tuned Llama-3.1-70B model achieves
state-of-the-art results in planning tasks.
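The generation pipeline with Bi-Evol can be sketched as below; the `llm` callable and all prompt strings are assumptions for illustration, not AgentGen's actual prompts.

```python
# Hedged sketch of the AgentGen loop: synthesize an environment from an
# inspiration snippet, then evolve a seed task in both easier and harder
# directions (Bi-Evol) to smooth the difficulty curve.
import random

def agentgen(llm, inspiration_corpus, n_envs=10):
    data = []
    for _ in range(n_envs):
        snippet = random.choice(inspiration_corpus)  # domain-specific text
        env_spec = llm("Design a planning environment inspired by:\n" + snippet)
        seed = llm("Write a planning task for this environment:\n" + env_spec)
        easier = llm("Rewrite this task to be one step easier:\n" + seed)
        harder = llm("Rewrite this task to be one step harder:\n" + seed)
        data.append((env_spec, [easier, seed, harder]))  # graded difficulty
    return data
```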
The Llama 3 Herd of Models
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Attention, as a core layer of the ubiquitous Transformer architecture, is the
bottleneck for large language models and long-context applications.
FlashAttention elaborated an approach to speed up attention on GPUs through
minimizing memory reads/writes. However, it has yet to take advantage of new
capabilities present in recent hardware, with FlashAttention-2 achieving only
35% utilization on the H100 GPU. We develop three main techniques to speed up
attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to
(1) overlap overall computation and data movement via warp-specialization and
(2) interleave block-wise matmul and softmax operations, and (3) block
quantization and incoherent processing that leverages hardware support for FP8
low-precision. We demonstrate that our method, FlashAttention-3, achieves
speedup on H100 GPUs by 1.5-2.0$\times$ with FP16 reaching up to 740 TFLOPs/s
(75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate
that FP8 FlashAttention-3 achieves 2.6$\times$ lower numerical error than a
baseline FP8 attention.
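To make the FP8 ingredient concrete, here is a conceptual sketch of block quantization in PyTorch (float8 dtypes require PyTorch 2.1 or newer): each block gets its own scale so values fit e4m3's narrow range (max 448), with the scales kept in higher precision. This is the numerical idea only, not the FlashAttention-3 kernel.

```python
# Hedged sketch of per-block FP8 quantization and dequantization.
import torch

def quantize_blocks(x, block=128):
    xb = x.view(-1, block, x.shape[-1])                # (n_blocks, block, d)
    scales = xb.abs().amax(dim=(1, 2), keepdim=True).clamp(min=1e-6) / 448.0
    return (xb / scales).to(torch.float8_e4m3fn), scales

x = torch.randn(1024, 64)
q, s = quantize_blocks(x)
dequant = (q.to(torch.float32) * s).view_as(x)         # per-block rescale
print((dequant - x).abs().max())                       # small block-wise error
```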