Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations

Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Michael He, Yinghai Lu, Yu Shi
Large-scale recommendation systems are characterized by their reliance on high-cardinality, heterogeneous features and the need to handle tens of billions of user actions on a daily basis. Despite being trained on huge volumes of data with thousands of features, most Deep Learning Recommendation Models (DLRMs) in industry fail to scale with compute. Inspired by the success of Transformers in the language and vision domains, we revisit fundamental design choices in recommendation systems. We reformulate recommendation problems as sequential transduction tasks within a generative modeling framework ("Generative Recommenders"), and propose a new architecture, HSTU, designed for high-cardinality, non-stationary streaming recommendation data. HSTU outperforms baselines on synthetic and public datasets by up to 65.8% in NDCG, and is 5.3x to 15.2x faster than FlashAttention2-based Transformers on 8192-length sequences. HSTU-based Generative Recommenders, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users. More importantly, the model quality of Generative Recommenders empirically scales as a power law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale, which reduces the carbon footprint of future model development and paves the way for the first foundation models in recommendations.
2024-02-27 arXiv cs.IR cs.LG
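The abstract above recasts recommendation as sequential transduction over a user's action history. A minimal sketch of that framing, assuming a toy item vocabulary and a vanilla causal Transformer as a stand-in; the paper's HSTU architecture is not reproduced here, and all sizes and names are illustrative:

```python
# Hedged sketch: next-action prediction over a user's item-ID sequence.
# A plain causal Transformer stands in for the paper's HSTU architecture.
import torch
import torch.nn as nn

NUM_ITEMS, DIM, SEQ_LEN = 10_000, 64, 128  # toy sizes (assumptions)

class NextItemTransducer(nn.Module):
    def __init__(self):
        super().__init__()
        self.item_emb = nn.Embedding(NUM_ITEMS, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, NUM_ITEMS)            # scores over the item vocabulary

    def forward(self, item_ids):                          # item_ids: (batch, seq)
        x = self.item_emb(item_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(item_ids.size(1))
        h = self.encoder(x, mask=mask)                    # causal: position t sees only <= t
        return self.head(h)                               # (batch, seq, NUM_ITEMS)

model = NextItemTransducer()
actions = torch.randint(0, NUM_ITEMS, (2, SEQ_LEN))       # fake user action history
logits = model(actions[:, :-1])                           # predict each next action
loss = nn.functional.cross_entropy(
    logits.reshape(-1, NUM_ITEMS), actions[:, 1:].reshape(-1))
```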

LLM Multi-Agent Systems: Challenges and Open Problems

Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, Zhaozhuo Xu, Chaoyang He
This paper surveys existing work on LLM-based multi-agent systems and identifies challenges that remain inadequately addressed. By leveraging the diverse capabilities and roles of individual agents, such systems can tackle complex tasks through collaboration. We discuss optimizing task allocation, fostering robust reasoning through iterative debates, managing complex and layered context information, and enhancing memory management to support the intricate interactions within multi-agent systems. We also explore the potential application of multi-agent systems in blockchain systems to shed light on their future development and application in real-world distributed systems.
2024-02-05 arXiv Memory Management in Multi-agent Systems, Multi-agent Systems, Task Allocation Optimization
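A minimal sketch of the iterative-debate pattern the paper above discusses for fostering robust reasoning, assuming a placeholder `ask` function for the underlying LLM call (no particular framework's API is implied):

```python
# Hedged sketch of round-based multi-agent debate: each agent answers,
# then revises after seeing the other agents' latest answers.
# `ask` is a placeholder (assumption) for a real LLM call.
def ask(agent_role: str, prompt: str) -> str:
    raise NotImplementedError("wire this to an actual LLM backend")

def debate(question: str, roles=("proposer", "critic", "judge"), rounds: int = 2) -> str:
    # initial independent answers
    answers = {r: ask(r, f"You are the {r}. Answer: {question}") for r in roles}
    for _ in range(rounds):
        for r in roles:
            others = "\n".join(f"{o}: {answers[o]}" for o in roles if o != r)
            answers[r] = ask(
                r,
                f"You are the {r}. Question: {question}\n"
                f"Other agents said:\n{others}\n"
                "Revise or defend your answer.",
            )
    return answers["judge"]  # final, judge-consolidated answer
```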

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang
Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks. Owing to their impressive planning and reasoning abilities, LLMs have been used as autonomous agents to perform many tasks automatically. Recently, building on the use of a single LLM as a planning or decision-making agent, LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation. To provide the community with an overview of this dynamic field, we present this survey to offer an in-depth discussion of the essential aspects of LLM-based multi-agent systems, as well as the challenges. Our goal is for readers to gain substantial insights into the following questions: What domains and environments do LLM-based multi-agents simulate? How are these agents profiled, and how do they communicate? What mechanisms contribute to the growth of agents' capacities? For those interested in delving into this field of study, we also summarize the commonly used datasets and benchmarks for convenient access. To keep researchers updated on the latest studies, we maintain an open-source GitHub repository dedicated to outlining the research on LLM-based multi-agent systems.
2024-01-21 arXiv Complex Problem-solving, Large Language Models, Multi-agent Systems

Mixtral of Experts

Thibaut Lavril, Marie-Anne Lachaux, Timothée Lacroix, Guillaume Lample, Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Lélio Renard Lavaud, Lucile Saulnier, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thomas Wang, William El Sayed
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e., experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token sees only two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens, and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B -- Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B -- Chat on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
2024-01-08 arXiv Inference Optimization, Language Model Architecture, Sparse Mixture of Experts
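A minimal sketch of the per-token top-2 expert routing described in the Mixtral abstract above, with toy sizes; it illustrates the mechanism only and is not the released implementation:

```python
# Hedged sketch: sparse MoE feed-forward block with a top-2 router.
import torch
import torch.nn as nn

DIM, HIDDEN, NUM_EXPERTS, TOP_K = 64, 256, 8, 2        # toy sizes (assumptions)

class Top2MoE(nn.Module):
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(DIM, NUM_EXPERTS)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(DIM, HIDDEN), nn.SiLU(), nn.Linear(HIDDEN, DIM))
            for _ in range(NUM_EXPERTS)
        )

    def forward(self, x):                                # x: (tokens, DIM)
        logits = self.router(x)                          # (tokens, NUM_EXPERTS)
        weights, idx = torch.topk(logits, TOP_K, dim=-1) # pick 2 experts per token
        weights = torch.softmax(weights, dim=-1)         # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for k in range(TOP_K):                           # combine the two experts' outputs
            for e in range(NUM_EXPERTS):
                sel = idx[:, k] == e
                if sel.any():
                    out[sel] += weights[sel, k:k + 1] * self.experts[e](x[sel])
        return out

tokens = torch.randn(10, DIM)
y = Top2MoE()(tokens)                                    # each token used only 2 of the 8 experts
```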

Mistral 7B

Thibaut Lavril, Marie-Anne Lachaux, Timothée Lacroix, Guillaume Lample, Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Lélio Renard Lavaud, Lucile Saulnier, Pierre Stock, Teven Le Scao, Thomas Wang, William El Sayed
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
2023-10-10 arXiv Inference Optimization, Instruction-following Model, Language Model Engineering
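A minimal sketch of the two attention tricks named in the Mistral 7B abstract above, grouped-query attention and sliding-window attention, with toy sizes; it illustrates the ideas only, not Mistral's implementation:

```python
# Hedged sketch: grouped-query attention (GQA) -- several query heads share one
# key/value head, shrinking the KV cache -- combined with a sliding-window
# causal mask (SWA). Toy sizes (assumptions), not Mistral's code.
import torch

B, T, N_Q_HEADS, N_KV_HEADS, HEAD_DIM, WINDOW = 1, 16, 8, 2, 32, 4

q = torch.randn(B, N_Q_HEADS, T, HEAD_DIM)
k = torch.randn(B, N_KV_HEADS, T, HEAD_DIM)
v = torch.randn(B, N_KV_HEADS, T, HEAD_DIM)

# Each group of 8 / 2 = 4 query heads reuses the same K/V head.
k = k.repeat_interleave(N_Q_HEADS // N_KV_HEADS, dim=1)
v = v.repeat_interleave(N_Q_HEADS // N_KV_HEADS, dim=1)

# Sliding-window causal mask: position i attends to j with i - WINDOW < j <= i.
i = torch.arange(T).unsqueeze(1)
j = torch.arange(T).unsqueeze(0)
mask = (j <= i) & (j > i - WINDOW)

scores = q @ k.transpose(-2, -1) / HEAD_DIM ** 0.5
scores = scores.masked_fill(~mask, float("-inf"))
out = torch.softmax(scores, dim=-1) @ v                 # (B, N_Q_HEADS, T, HEAD_DIM)
```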

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron (Meta AI), Louis Martin (Meta AI), Kevin Stone (Meta AI), Peter Albert (Meta AI), Amjad Almahairi (Meta AI), Yasmine Babaei (Meta AI), Nikolay Bashlykov (Meta AI), Soumya Batra (Meta AI), Prajjwal Bhargava (Meta AI), Shruti Bhosale (Meta AI), Dan Bikel (Meta AI), Lukas Blecher (Meta AI), Cristian Canton Ferrer (Meta AI), Moya Chen (Meta AI), Thomas Scialom (Meta AI)
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
2023-07-18 arXiv cs.AI cs.CL

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao
Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory savings (linear instead of quadratic) and runtime speedups (2-4x compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low occupancy or unnecessary shared-memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs, (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around a 2x speedup compared to FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speeds of up to 225 TFLOPs/s per A100 GPU (72% model FLOPs utilization).
2023-07-17 arXiv FlashAttention Optimization, GPU Parallelism in Attention, Work Partitioning in Attention
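A minimal NumPy sketch of the tiling-plus-online-softmax idea underlying FlashAttention, processing keys and values one block at a time so the full score matrix is never materialized; it illustrates the algorithm only, not the GPU kernel or FlashAttention-2's thread-block and warp partitioning:

```python
# Hedged sketch: online-softmax attention over key/value blocks for one query row.
import numpy as np

def blockwise_attention(q, K, V, block=128):
    """Attention for a single query vector q against K (T, d) and V (T, d),
    visiting K/V one block at a time (the tiling idea behind FlashAttention)."""
    d = q.shape[-1]
    m = -np.inf                                    # running max of the scores (stability)
    l = 0.0                                        # running softmax normalizer
    acc = np.zeros_like(V[0])                      # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q / np.sqrt(d)                 # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                  # rescale previous partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l

T, d = 1000, 64
K, V, q = np.random.randn(T, d), np.random.randn(T, d), np.random.randn(d)
# reference: exact softmax attention, for checking the blockwise result
s = K @ q / np.sqrt(d)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(blockwise_attention(q, K, V), ref)
```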

Recommender Systems in the Era of Large Language Models (LLMs)

Zihuai Zhao, Wenqi Fan, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Zhen Wen, Fei Wang, Xiangyu Zhao, Jiliang Tang, Qing Li
With the prosperity of e-commerce and web applications, Recommender Systems (RecSys) have become an important component of our daily life, providing personalized suggestions that cater to user preferences. While Deep Neural Networks (DNNs) have made significant advancements in enhancing recommender systems by modeling user-item interactions and incorporating textual side information, DNN-based methods still face limitations, such as difficulty in understanding users' interests and capturing textual side information, and an inability to generalize to diverse recommendation scenarios or reason about their predictions. Meanwhile, the emergence of Large Language Models (LLMs), such as ChatGPT and GPT-4, has revolutionized the fields of Natural Language Processing (NLP) and Artificial Intelligence (AI), thanks to their remarkable abilities in language understanding and generation, as well as impressive generalization and reasoning capabilities. As a result, recent studies have attempted to harness the power of LLMs to enhance recommender systems. Given the rapid evolution of this research direction, there is a pressing need for a systematic overview of existing LLM-empowered recommender systems that provides researchers in relevant fields with an in-depth understanding. Therefore, in this paper, we conduct a comprehensive review of LLM-empowered recommender systems covering pre-training, fine-tuning, and prompting. More specifically, we first introduce representative methods that harness the power of LLMs (as a feature encoder) for learning representations of users and items. Then, we review recent techniques for enhancing recommender systems with LLMs from three paradigms, namely pre-training, fine-tuning, and prompting. Finally, we comprehensively discuss future directions in this emerging field.
2023-07-05 arXiv Large Language Models, Natural Language Processing, Recommender Systems
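A minimal sketch of the "LLM as a feature encoder" pattern the survey above describes: embed item text with a pretrained encoder and rank candidates against a user profile built from their history. `encode_text` is a placeholder (assumption) for any such encoder:

```python
# Hedged sketch: using a text encoder (LLM-style) as a frozen feature encoder
# for recommendation. `encode_text` is a placeholder (assumption); plug in any
# pretrained model that maps strings to fixed-size vectors.
import numpy as np

def encode_text(texts: list[str]) -> np.ndarray:
    raise NotImplementedError("plug in a pretrained text/LLM encoder here")

def rank_items(user_history: list[str], candidate_items: list[str], top_n: int = 5):
    item_vecs = encode_text(candidate_items)               # (num_items, dim)
    user_vec = encode_text(user_history).mean(axis=0)      # profile = mean of history embeddings
    # cosine similarity between the user profile and every candidate item
    sims = item_vecs @ user_vec / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(user_vec) + 1e-9)
    order = np.argsort(-sims)[:top_n]
    return [candidate_items[i] for i in order]
```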

A Survey on Large Language Models for Recommendation

Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, Hui Xiong, Enhong Chen
Large Language Models (LLMs) have emerged as powerful tools in the field of Natural Language Processing (NLP) and have recently gained significant attention in the domain of Recommendation Systems (RS). These models, trained on massive amounts of data using self-supervised learning, have demonstrated remarkable success in learning universal representations and have the potential to enhance various aspects of recommendation systems through effective transfer techniques such as fine-tuning and prompt tuning. The crucial aspect of harnessing the power of language models to enhance recommendation quality is the use of their high-quality representations of textual features and their extensive coverage of external knowledge to establish correlations between items and users. To provide a comprehensive understanding of existing LLM-based recommendation systems, this survey presents a taxonomy that categorizes these models into two major paradigms, Discriminative LLMs for Recommendation (DLLM4Rec) and Generative LLMs for Recommendation (GLLM4Rec), with the latter systematically surveyed for the first time. Furthermore, we systematically review and analyze existing LLM-based recommendation systems within each paradigm, providing insights into their methodologies, techniques, and performance. Additionally, we identify key challenges and several valuable findings to provide researchers and practitioners with inspiration. We have also created a GitHub repository to index relevant papers on LLMs for recommendation, https://github.com/WLiK/LLM4Rec.
2023-05-31 arXiv Large Language Models for Recommendation, Natural Language Processing in Recommendation, Transfer Techniques in Recommendation with LLMs
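To contrast with the encoder-style sketch above, here is a minimal prompting-style sketch in the spirit of the survey's generative (GLLM4Rec) paradigm; `complete` is a placeholder (assumption) for any LLM completion call, and the prompt format is purely illustrative:

```python
# Hedged sketch: prompting an LLM to recommend directly from a user's history
# (the generative paradigm, GLLM4Rec, in the survey's taxonomy).
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM completion call")

def recommend(history: list[str], candidates: list[str], k: int = 3) -> str:
    prompt = (
        "The user recently interacted with these items:\n"
        + "\n".join(f"- {h}" for h in history)
        + f"\n\nFrom the following candidates, list the {k} items "
        "the user is most likely to enjoy, one per line:\n"
        + "\n".join(f"- {c}" for c in candidates)
    )
    return complete(prompt)
```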

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance...
2023-05-23 arXiv 4-bit NormalFloat (NF4), Efficient Finetuning of Quantized LLMs, Low Rank Adapters (LoRA)
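A minimal sketch of a LoRA-style adapter as referenced in the entry above: the base weight stays frozen (in QLoRA it would additionally be stored in 4-bit NF4) and only a low-rank update is trained. This illustrates the adapter idea only; QLoRA's quantization and paged optimizers are not reproduced here, and all sizes are assumptions:

```python
# Hedged sketch: a LoRA adapter around a frozen linear layer.
# Only the low-rank matrices A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)              # frozen base weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # B = 0 -> starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")                     # only A and B are trainable
```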