Research Papers
Generative Agents: Interactive Simulacra of Human Behavior
Believable proxies of human behavior can empower interactive applications
ranging from immersive environments to rehearsal spaces for interpersonal
communication to prototyping tools. In this paper, we introduce generative
agents--computational software agents that simulate believable human behavior.
Generative agents wake up, cook breakfast, and head to work; artists paint,
while authors write; they form opinions, notice each other, and initiate
conversations; they remember and reflect on days past as they plan the next
day. To enable generative agents, we describe an architecture that extends a
large language model to store a complete record of the agent's experiences
using natural language, synthesize those memories over time into higher-level
reflections, and retrieve them dynamically to plan behavior. We instantiate
generative agents to populate an interactive sandbox environment inspired by
The Sims, where end users can interact with a small town of twenty five agents
using natural language. In an evaluation, these generative agents produce
believable individual and emergent social behaviors: for example, starting with
only a single user-specified notion that one agent wants to throw a Valentine's
Day party, the agents autonomously spread invitations to the party over the
next two days, make new acquaintances, ask each other out on dates to the
party, and coordinate to show up for the party together at the right time. We
demonstrate through ablation that the components of our agent
architecture--observation, planning, and reflection--each contribute critically
to the believability of agent behavior. By fusing large language models with
computational, interactive agents, this work introduces architectural and
interaction patterns for enabling believable simulations of human behavior.
2023-04-07
arXiv
Multi-agent
A Method for Animating Children's Drawings of the Human Figure
Children's drawings have a wonderful inventiveness, creativity, and variety
to them. We present a system that automatically animates children's drawings of
the human figure, is robust to the variance inherent in these depictions, and
is simple and straightforward enough for anyone to use. We demonstrate the
value and broad appeal of our approach by building and releasing the Animated
Drawings Demo, a freely available public website that has been used by millions
of people around the world. We present a set of experiments exploring the
amount of training data needed for fine-tuning, as well as a perceptual study
demonstrating the appeal of a novel twisted perspective retargeting technique.
Finally, we introduce the Amateur Drawings Dataset, a first-of-its-kind
annotated dataset, collected via the public demo, containing over 178,000
amateur drawings and corresponding user-accepted character bounding boxes,
segmentation masks, and joint location annotations.
LLaMA: Open and Efficient Foundation Language Models
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
Multi-Agent Reinforcement Learning is a Sequence Modeling Problem
Large sequence model (SM) such as GPT series and BERT has displayed
outstanding performance and generalization capabilities on vision, language,
and recently reinforcement learning tasks. A natural follow-up question is how
to abstract multi-agent decision making into an SM problem and benefit from the
prosperous development of SMs. In this paper, we introduce a novel architecture
named Multi-Agent Transformer (MAT) that effectively casts cooperative
multi-agent reinforcement learning (MARL) into SM problems wherein the task is
to map agents' observation sequence to agents' optimal action sequence. Our
goal is to build the bridge between MARL and SMs so that the modeling power of
modern sequence models can be unleashed for MARL. Central to our MAT is an
encoder-decoder architecture which leverages the multi-agent advantage
decomposition theorem to transform the joint policy search problem into a
sequential decision making process; this renders only linear time complexity
for multi-agent problems and, most importantly, endows MAT with monotonic
performance improvement guarantee. Unlike prior arts such as Decision
Transformer fit only pre-collected offline data, MAT is trained by online
trials and errors from the environment in an on-policy fashion. To validate
MAT, we conduct extensive experiments on StarCraftII, Multi-Agent MuJoCo,
Dexterous Hands Manipulation, and Google Research Football benchmarks. Results
demonstrate that MAT achieves superior performance and data efficiency compared
to strong baselines including MAPPO and HAPPO. Furthermore, we demonstrate that
MAT is an excellent few-short learner on unseen tasks regardless of changes in
the number of agents. See our project page at
https://sites.google.com/view/multi-agent-transformer.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Transformers are slow and memory-hungry on long sequences, since the time and
memory complexity of self-attention are quadratic in sequence length.
Approximate attention methods have attempted to address this problem by trading
off model quality to reduce the compute complexity, but often do not achieve
wall-clock speedup. We argue that a missing principle is making attention
algorithms IO-aware -- accounting for reads and writes between levels of GPU
memory. We propose FlashAttention, an IO-aware exact attention algorithm that
uses tiling to reduce the number of memory reads/writes between GPU high
bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of
FlashAttention, showing that it requires fewer HBM accesses than standard
attention, and is optimal for a range of SRAM sizes. We also extend
FlashAttention to block-sparse attention, yielding an approximate attention
algorithm that is faster than any existing approximate attention method.
FlashAttention trains Transformers faster than existing baselines: 15%
end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the
MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K),
and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention
and block-sparse FlashAttention enable longer context in Transformers, yielding
higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on
long-document classification) and entirely new capabilities: the first
Transformers to achieve better-than-chance performance on the Path-X challenge
(seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1%
accuracy).
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
LoRA: Low-Rank Adaptation of Large Language Models
An important paradigm of natural language processing consists of large-scale
pre-training on general domain data and adaptation to particular tasks or
domains. As we pre-train larger models, full fine-tuning, which retrains all
model parameters, becomes less feasible. Using GPT-3 175B as an example --
deploying independent instances of fine-tuned models, each with 175B
parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or
LoRA, which freezes the pre-trained model weights and injects trainable rank
decomposition matrices into each layer of the Transformer architecture, greatly
reducing the number of trainable parameters for downstream tasks. Compared to
GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable
parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA
performs on-par or better than fine-tuning in model quality on RoBERTa,
DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher
training throughput, and, unlike adapters, no additional inference latency. We
also provide an empirical investigation into rank-deficiency in language model
adaptation, which sheds light on the efficacy of LoRA. We release a package
that facilitates the integration of LoRA with PyTorch models and provide our
implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at
https://github.com/microsoft/LoRA.
Language Models are Few-Shot Learners
Recent work has demonstrated substantial gains on many NLP tasks and
benchmarks by pre-training on a large corpus of text followed by fine-tuning on
a specific task. While typically task-agnostic in architecture, this method
still requires task-specific fine-tuning datasets of thousands or tens of
thousands of examples. By contrast, humans can generally perform a new language
task from only a few examples or from simple instructions - something which
current NLP systems still largely struggle to do. Here we show that scaling up
language models greatly improves task-agnostic, few-shot performance, sometimes
even reaching competitiveness with prior state-of-the-art fine-tuning
approaches. Specifically, we train GPT-3, an autoregressive language model with
175 billion parameters, 10x more than any previous non-sparse language model,
and test its performance in the few-shot setting. For all tasks, GPT-3 is
applied without any gradient updates or fine-tuning, with tasks and few-shot
demonstrations specified purely via text interaction with the model. GPT-3
achieves strong performance on many NLP datasets, including translation,
question-answering, and cloze tasks, as well as several tasks that require
on-the-fly reasoning or domain adaptation, such as unscrambling words, using a
novel word in a sentence, or performing 3-digit arithmetic. At the same time,
we also identify some datasets where GPT-3's few-shot learning still struggles,
as well as some datasets where GPT-3 faces methodological issues related to
training on large web corpora. Finally, we find that GPT-3 can generate samples
of news articles which human evaluators have difficulty distinguishing from
articles written by humans. We discuss broader societal impacts of this finding
and of GPT-3 in general.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
We introduce a new language representation model called BERT, which stands
for Bidirectional Encoder Representations from Transformers. Unlike recent
language representation models, BERT is designed to pre-train deep
bidirectional representations from unlabeled text by jointly conditioning on
both left and right context in all layers. As a result, the pre-trained BERT
model can be fine-tuned with just one additional output layer to create
state-of-the-art models for a wide range of tasks, such as question answering
and language inference, without substantial task-specific architecture
modifications.
BERT is conceptually simple and empirically powerful. It obtains new
state-of-the-art results on eleven natural language processing tasks, including
pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI
accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering
Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1
(5.1 point absolute improvement).
Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks in an encoder-decoder configuration. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer, based
solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to be
superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014
English-to-German translation task, improving over the existing best results,
including ensembles by over 2 BLEU. On the WMT 2014 English-to-French
translation task, our model establishes a new single-model state-of-the-art
BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction
of the training costs of the best models from the literature. We show that the
Transformer generalizes well to other tasks by applying it successfully to
English constituency parsing both with large and limited training data.