Research Papers
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
Large-scale recommendation systems are characterized by their reliance on
high cardinality, heterogeneous features and the need to handle tens of
billions of user actions on a daily basis. Despite being trained on huge
volumes of data with thousands of features, most Deep Learning Recommendation Models
(DLRMs) in industry fail to scale with compute.
Inspired by success achieved by Transformers in language and vision domains,
we revisit fundamental design choices in recommendation systems. We reformulate
recommendation problems as sequential transduction tasks within a generative
modeling framework ("Generative Recommenders"), and propose a new architecture,
HSTU, designed for high cardinality, non-stationary streaming recommendation
data.
HSTU outperforms baselines on synthetic and public datasets by up to 65.8%
in NDCG, and is 5.3x to 15.2x faster than FlashAttention2-based Transformers on
8192 length sequences. HSTU-based Generative Recommenders, with 1.5 trillion
parameters, improve metrics in online A/B tests by 12.4% and have been deployed
on multiple surfaces of a large internet platform with billions of users. More
importantly, the model quality of Generative Recommenders empirically scales as
a power-law of training compute across three orders of magnitude, up to
GPT-3/LLaMa-2 scale, which reduces the carbon footprint needed for future
model development and paves the way for the first foundational models in
recommendations.
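To make the sequential-transduction framing concrete, here is a minimal sketch
(not the HSTU architecture itself) in which a user's action history is treated
as a token sequence and a plain causal Transformer is trained to predict the
next item, as in language modeling. All module choices and hyperparameters
below are illustrative assumptions.

# Minimal "generative recommender" sketch: next-item prediction over a user's
# action history with a causal Transformer. A stand-in, not HSTU.
import torch
import torch.nn as nn

class NextItemTransducer(nn.Module):
    def __init__(self, num_items=10_000, dim=128, heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, n_layers)
        self.head = nn.Linear(dim, num_items)

    def forward(self, item_ids):                          # item_ids: (batch, seq)
        seq_len = item_ids.size(1)
        pos = torch.arange(seq_len, device=item_ids.device)
        x = self.item_emb(item_ids) + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len).to(item_ids.device)
        h = self.encoder(x, mask=causal)                  # causal attention over the history
        return self.head(h)                               # next-item logits at every position

model = NextItemTransducer()
actions = torch.randint(0, 10_000, (8, 64))               # fake user action histories
logits = model(actions[:, :-1])                           # predict the shifted sequence
loss = nn.functional.cross_entropy(logits.reshape(-1, 10_000),
                                   actions[:, 1:].reshape(-1))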
LLM Multi-Agent Systems: Challenges and Open Problems
This paper explores existing work on multi-agent systems and identifies
challenges that remain inadequately addressed. By leveraging the diverse
capabilities and roles of individual agents within a multi-agent system, these
systems can tackle complex tasks through collaboration. We discuss optimizing
task allocation, fostering robust reasoning through iterative debates, managing
complex and layered context information, and enhancing memory management to
support the intricate interactions within multi-agent systems. We also explore
the potential application of multi-agent systems in blockchain systems to shed
light on their future development and application in real-world distributed
systems.
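As a concrete illustration of the iterative-debate mechanism mentioned above,
the toy sketch below has two agents exchange answers and critiques for a fixed
number of rounds before a judge step selects a final answer. call_llm, the
role prompts, and the round count are hypothetical placeholders, not a
protocol taken from the paper.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a real chat-model call (e.g. an HTTP request to an LLM API)."""
    return f"[{system_prompt[:20]}...] answer to: {user_prompt[:40]}..."

def debate(question: str, rounds: int = 2) -> str:
    roles = ["You are a careful solver.", "You are a critical reviewer."]
    answers = [call_llm(role, question) for role in roles]
    for _ in range(rounds):
        # Each agent sees the other's latest answer and revises its own.
        answers = [
            call_llm(roles[i],
                     f"{question}\nThe other agent said: {answers[1 - i]}\nRevise your answer.")
            for i in range(2)
        ]
    # A final judge/aggregation step picks or merges the answers.
    return call_llm("You are a judge. Pick the better answer.", "\n---\n".join(answers))

print(debate("Is 1009 prime?"))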
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
Large Language Models (LLMs) have achieved remarkable success across a wide
array of tasks. Due to the impressive planning and reasoning abilities of LLMs,
they have been used as autonomous agents to perform many tasks automatically.
Recently, building on the use of a single LLM as a planning or
decision-making agent, LLM-based multi-agent systems have achieved considerable
progress in complex problem-solving and world simulation. To provide the
community with an overview of this dynamic field, we present this survey to
offer an in-depth discussion on the essential aspects of multi-agent systems
based on LLMs, as well as the challenges. Our goal is for readers to gain
substantial insights into the following questions: What domains and environments
do LLM-based multi-agents simulate? How are these agents profiled and how do
they communicate? What mechanisms contribute to the growth of agents'
capacities? For those interested in delving into this field, we also summarize
the commonly used datasets and benchmarks for convenient access. To keep
researchers updated on the latest studies, we maintain an
open-source GitHub repository, dedicated to outlining the research on LLM-based
multi-agent systems.
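One of the questions above is how agents are profiled and how they
communicate. The sketch below is an assumed, minimal illustration: each agent
carries a profile that conditions its prompts, and agents pass messages
sequentially along a pipeline. call_llm is a hypothetical stub, not an API
from any surveyed system.

from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    return f"(model reply to: {prompt[:50]}...)"           # stand-in for a real LLM call

@dataclass
class Agent:
    name: str
    profile: str                                           # role/persona conditioning the prompts
    memory: list = field(default_factory=list)

    def act(self, message: str) -> str:
        self.memory.append(message)
        prompt = f"{self.profile}\nConversation so far:\n" + "\n".join(self.memory)
        reply = call_llm(prompt)
        self.memory.append(reply)
        return reply

agents = [Agent("planner", "You decompose tasks into steps."),
          Agent("coder", "You write code for a given step.")]
msg = "Build a CLI that counts words in a file."
for agent in agents:                                        # simple sequential communication
    msg = agent.act(msg)
print(msg)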
Mixtral of Experts
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model.
Mixtral has the same architecture as Mistral 7B, with the difference that each
layer is composed of 8 feedforward blocks (i.e. experts). For every token, at
each layer, a router network selects two experts to process the current state
and combine their outputs. Even though each token only sees two experts, the
selected experts can be different at each timestep. As a result, each token has
access to 47B parameters, but only uses 13B active parameters during inference.
Mixtral was trained with a context size of 32k tokens and it outperforms or
matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular,
Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and
multilingual benchmarks. We also provide a model fine-tuned to follow
instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo,
Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both
the base and instruct models are released under the Apache 2.0 license.
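The routing described above can be made concrete with a short sketch: a router
scores 8 expert MLPs per token, the top-2 are evaluated, and their outputs are
mixed with renormalized router weights. This is a simplified stand-in written
for clarity, not Mixtral's implementation, and the dimensions are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    def __init__(self, dim=512, hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, dim)
        logits = self.router(x)                             # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)      # per-token top-2 experts
        weights = F.softmax(weights, dim=-1)                # renormalize over the selected 2
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True) # tokens routed to expert e
            if rows.numel():
                out[rows] += weights[rows, slots, None] * expert(x[rows])
        return out

tokens = torch.randn(16, 512)                               # 16 tokens, model dim 512
print(MoEBlock()(tokens).shape)                             # torch.Size([16, 512])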
Mistral 7B
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered
for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B
across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and
code generation. Our model leverages grouped-query attention (GQA) for faster
inference, coupled with sliding window attention (SWA) to effectively handle
sequences of arbitrary length with a reduced inference cost. We also provide a
model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses
the Llama 2 13B -- Chat model both on human and automated benchmarks. Our
models are released under the Apache 2.0 license.
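As a small illustration of the sliding window attention mentioned above, the
sketch below builds the attention mask in which each position attends only to
itself and the previous window - 1 tokens, so attention cost grows linearly
with sequence length. Grouped-query attention and inference-time caching
details are omitted, and the window size is an illustrative assumption.

import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: True where a query position may attend to a key position."""
    i = torch.arange(seq_len)[:, None]                      # query positions
    j = torch.arange(seq_len)[None, :]                      # key positions
    return (j <= i) & (j > i - window)                      # causal AND within the window

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())
# Row 5 is True only at columns 3, 4, 5: that token attends to a 3-token window.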
Llama 2: Open Foundation and Fine-Tuned Chat Models
In this work, we develop and release Llama 2, a collection of pretrained and
fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70
billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for
dialogue use cases. Our models outperform open-source chat models on most
benchmarks we tested, and based on our human evaluations for helpfulness and
safety, may be a suitable substitute for closed-source models. We provide a
detailed description of our approach to fine-tuning and safety improvements of
Llama 2-Chat in order to enable the community to build on our work and
contribute to the responsible development of LLMs.
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Scaling Transformers to longer sequence lengths has been a major problem in
the last several years, promising to improve performance in language modeling
and high-resolution image understanding, as well as to unlock new applications
in code, audio, and video generation. The attention layer is the main
bottleneck in scaling to longer sequences, as its runtime and memory increase
quadratically in the sequence length. FlashAttention exploits the asymmetric
GPU memory hierarchy to bring significant memory saving (linear instead of
quadratic) and runtime speedup (2-4$\times$ compared to optimized baselines),
with no approximation. However, FlashAttention is still not nearly as fast as
optimized matrix-multiply (GEMM) operations, reaching only 25-40\% of the
theoretical maximum FLOPs/s. We observe that the inefficiency is due to
suboptimal work partitioning between different thread blocks and warps on the
GPU, causing either low occupancy or unnecessary shared memory reads/writes. We
propose FlashAttention-2, with better work partitioning to address these
issues. In particular, we (1) tweak the algorithm to reduce the number of
non-matmul FLOPs, (2) parallelize the attention computation, even for a single
head, across different thread blocks to increase occupancy, and (3) within each
thread block, distribute the work between warps to reduce communication through
shared memory. These yield around 2$\times$ speedup compared to FlashAttention,
reaching 50-73\% of the theoretical maximum FLOPs/s on A100 and getting close
to the efficiency of GEMM operations. We empirically validate that when used
end-to-end to train GPT-style models, FlashAttention-2 reaches training speed
of up to 225 TFLOPs/s per A100 GPU (72\% model FLOPs utilization).
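The sketch below illustrates the algorithmic core that FlashAttention and
FlashAttention-2 share: attention is computed block by block over the keys and
values while carrying running softmax statistics, so the full quadratic score
matrix is never materialized. It is a plain PyTorch reference for the math
only; the GPU-level work partitioning across thread blocks and warps that this
paper actually contributes is not represented here.

import torch

def tiled_attention(q, k, v, block=128):
    """Blockwise attention with online softmax; q, k, v have shape (seq, dim)."""
    seq, dim = q.shape
    scale = dim ** -0.5
    out = torch.zeros_like(q)                               # unnormalized output accumulator
    m = torch.full((seq, 1), float("-inf"))                 # running row-wise max of scores
    l = torch.zeros(seq, 1)                                 # running softmax normalizer
    for start in range(0, seq, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                              # scores against this block only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)
        correction = torch.exp(m - m_new)                   # rescale the old accumulators
        l = l * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        m = m_new
    return out / l

q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v     # standard attention
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4))  # True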
Recommender Systems in the Era of Large Language Models (LLMs)
With the prosperity of e-commerce and web applications, Recommender Systems
(RecSys) have become an important component of our daily life, providing
personalized suggestions that cater to user preferences. While Deep Neural
Networks (DNNs) have made significant advancements in enhancing recommender
systems by modeling user-item interactions and incorporating textual side
information, DNN-based methods still face limitations, such as difficulty in
understanding users' interests and capturing textual side information, and an
inability to generalize to various recommendation scenarios or reason about
their predictions. Meanwhile, the emergence of Large Language Models
(LLMs), such as ChatGPT and GPT4, has revolutionized the fields of Natural
Language Processing (NLP) and Artificial Intelligence (AI), due to their
remarkable abilities in fundamental tasks of language understanding and
generation, as well as impressive generalization and reasoning
capabilities. As a result, recent studies have attempted to harness the power
of LLMs to enhance recommender systems. Given the rapid evolution of this
research direction in recommender systems, there is a pressing need for a
systematic overview that summarizes existing LLM-empowered recommender systems,
to provide researchers in relevant fields with an in-depth understanding.
Therefore, in this paper, we conduct a comprehensive review of LLM-empowered
recommender systems from various aspects including Pre-training, Fine-tuning,
and Prompting. More specifically, we first introduce representative methods to
harness the power of LLMs (as a feature encoder) for learning representations
of users and items. Then, we review recent techniques of LLMs for enhancing
recommender systems from three paradigms, namely pre-training, fine-tuning, and
prompting. Finally, we comprehensively discuss future directions in this
emerging field.
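To illustrate the feature-encoder paradigm described above, here is a minimal
sketch that embeds item descriptions and a user's history with a text encoder
and ranks items by similarity. llm_embed is a hypothetical stand-in for any
LLM or text-embedding API, and the catalog is made up for the example.

import zlib
import numpy as np

def llm_embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder: a deterministic fake embedding in lieu of a real LLM encoder."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

items = {
    "i1": "Wireless noise-cancelling headphones",
    "i2": "Stainless steel chef's knife",
    "i3": "Bluetooth portable speaker",
}
user_history = ["Over-ear studio headphones", "USB audio interface"]

user_vec = np.mean([llm_embed(t) for t in user_history], axis=0)
scores = {iid: float(llm_embed(text) @ user_vec) for iid, text in items.items()}
ranking = sorted(scores, key=scores.get, reverse=True)      # highest-scoring items first
print(ranking)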
A Survey on Large Language Models for Recommendation
Large Language Models (LLMs) have emerged as powerful tools in the field of
Natural Language Processing (NLP) and have recently gained significant
attention in the domain of Recommendation Systems (RS). These models, trained
on massive amounts of data using self-supervised learning, have demonstrated
remarkable success in learning universal representations and have the potential
to enhance various aspects of recommendation systems through effective transfer
techniques such as fine-tuning and prompt tuning. The crucial aspect
of harnessing the power of language models in enhancing recommendation quality
is the utilization of their high-quality representations of textual features
and their extensive coverage of external knowledge to establish correlations
between items and users. To provide a comprehensive understanding of the
existing LLM-based recommendation systems, this survey presents a taxonomy that
categorizes these models into two major paradigms, respectively Discriminative
LLM for Recommendation (DLLM4Rec) and Generative LLM for Recommendation
(GLLM4Rec), with the latter being systematically organized for the first time.
Furthermore, we systematically review and analyze existing LLM-based
recommendation systems within each paradigm, providing insights into their
methodologies, techniques, and performance. Additionally, we identify key
challenges and several valuable findings to provide researchers and
practitioners with inspiration. We have also created a GitHub repository to
index relevant papers on LLMs for recommendation,
https://github.com/WLiK/LLM4Rec.
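As a toy illustration of the generative (GLLM4Rec) side of this taxonomy, the
sketch below places the user's history in a prompt and asks the model to
generate candidate items directly rather than score a fixed candidate set.
call_llm is a stub with a canned reply, and the prompt template is an
illustrative assumption, not one proposed by the survey.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned numbered list."""
    return "1. Dune\n2. Arrival\n3. Interstellar"

def recommend(history: list[str], k: int = 3) -> list[str]:
    prompt = (
        "The user recently watched: " + ", ".join(history) + ".\n"
        f"Recommend {k} other movies they are likely to enjoy, one per line, numbered."
    )
    reply = call_llm(prompt)
    # Parse the numbered list back into item titles.
    return [line.split(". ", 1)[1] for line in reply.splitlines() if ". " in line]

print(recommend(["Blade Runner 2049", "The Martian"]))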
QLoRA: Efficient Finetuning of Quantized LLMs
We present QLoRA, an efficient finetuning approach that reduces memory usage
enough to finetune a 65B parameter model on a single 48GB GPU while preserving
full 16-bit finetuning task performance...
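A simplified sketch of the idea: the base weight is stored in a low-bit
quantized form and kept frozen, while a small low-rank (LoRA) adapter stays in
higher precision and is the only part that receives gradients. The 4-bit
quantizer below is a crude absmax stand-in rather than the paper's NF4 data
type, and the shapes are illustrative.

import torch
import torch.nn as nn

class QLoRALinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 8, alpha: int = 16):
        super().__init__()
        # Quantize the frozen base weight to 4-bit integers with a per-row scale.
        scale = weight.abs().amax(dim=1, keepdim=True) / 7
        q = torch.clamp((weight / scale).round(), -8, 7).to(torch.int8)
        self.register_buffer("q_weight", q)                 # frozen, stored quantized
        self.register_buffer("scale", scale)
        out_f, in_f = weight.shape
        # Trainable low-rank adapter: effective weight is W + (alpha/rank) * B @ A.
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        w = self.q_weight.float() * self.scale              # dequantize on the fly
        return x @ w.T + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

layer = QLoRALinear(torch.randn(256, 128))
y = layer(torch.randn(4, 128))
y.sum().backward()                                          # gradients reach only lora_A / lora_B
print(layer.lora_A.grad is not None, layer.q_weight.requires_grad)  # True False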