Research Papers
Offline Reinforcement Learning for LLM Multi-Step Reasoning
Improving the multi-step reasoning ability of large language models (LLMs)
with offline reinforcement learning (RL) is essential for quickly adapting them
to complex tasks. While Direct Preference Optimization (DPO) has shown promise
in aligning LLMs with human preferences, it is less suitable for multi-step
reasoning tasks because (1) DPO relies on paired preference data, which is not
readily available for multi-step reasoning tasks, and (2) it treats all tokens
uniformly, making it ineffective for credit assignment in multi-step reasoning
tasks, which often come with sparse rewards. In this work, we propose OREO
(Offline Reasoning Optimization), an offline RL method for enhancing LLM
multi-step reasoning. Building on insights from prior work on maximum entropy
reinforcement learning, it jointly learns a policy model and a value function
by optimizing the soft Bellman equation. We show in principle that it
reduces the need to collect pairwise data and enables better credit assignment.
Empirically, OREO surpasses existing offline learning methods on multi-step
reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and
embodied agent control (ALFWorld). The approach can be extended to a
multi-iteration framework when additional resources are available. Furthermore,
the learned value function can be leveraged to guide tree search for free,
which further boosts performance at test time.
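As background for the OREO abstract above, the soft Bellman equations from
maximum entropy RL take the following standard form. This is only a sketch of
the general framework; the exact objective OREO optimizes (e.g., how it
regularizes against a reference policy) may differ.
```latex
% Soft Bellman backup, soft value function, and induced policy in
% maximum-entropy RL with temperature \alpha (background framework only;
% OREO's precise objective may differ):
Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\!\big[ V(s_{t+1}) \big],
\qquad
V(s_t) = \alpha \log \sum_{a} \exp\!\big( Q(s_t, a) / \alpha \big),
\qquad
\pi^*(a \mid s) \propto \exp\!\big( Q(s, a) / \alpha \big).
```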
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
We propose to synthesize high-quality and synchronized audio, given video and
optional text conditions, using a novel multimodal joint training framework
MMAudio. In contrast to single-modality training conditioned on (limited) video
data only, MMAudio is jointly trained with larger-scale, readily available
text-audio data to learn to generate semantically aligned high-quality audio
samples. Additionally, we improve audio-visual synchrony with a conditional
synchronization module that aligns video conditions with audio latents at the
frame level. Trained with a flow matching objective, MMAudio achieves new
video-to-audio state-of-the-art among public models in terms of audio quality,
semantic alignment, and audio-visual synchronization, while having a low
inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio
also achieves surprisingly competitive performance in text-to-audio generation,
showing that joint training does not hinder single-modality performance. Code
and demo are available at: https://hkchengrex.github.io/MMAudio
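For context on the flow matching objective mentioned in the MMAudio abstract, a
common conditional flow matching loss with a linear interpolation path is shown
below; the parameterization and the conditioning variables c_video and c_text
are illustrative assumptions, and MMAudio's exact formulation may differ.
```latex
% Conditional flow matching with a linear path between noise x_0 and data x_1;
% v_\theta predicts the constant velocity (x_1 - x_0) given the interpolant,
% the time t, and the (video, text) conditions.
x_t = (1 - t)\, x_0 + t\, x_1, \qquad x_0 \sim \mathcal{N}(0, I),\ x_1 \sim p_{\mathrm{data}},
\qquad
\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\, x_0,\, x_1}
  \big\| v_\theta(x_t, t \mid c_{\mathrm{video}}, c_{\mathrm{text}}) - (x_1 - x_0) \big\|^2 .
```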
Parallelized Autoregressive Visual Generation
Autoregressive models have emerged as a powerful approach for visual
generation but suffer from slow inference speed due to their sequential
token-by-token prediction process. In this paper, we propose a simple yet
effective approach for parallelized autoregressive visual generation that
improves generation efficiency while preserving the advantages of
autoregressive modeling. Our key insight is that parallel generation depends on
visual token dependencies: tokens with weak dependencies can be generated in
parallel, while strongly dependent adjacent tokens are difficult to generate
together, as their independent sampling may lead to inconsistencies. Based on
this observation, we develop a parallel generation strategy that generates
distant tokens with weak dependencies in parallel while maintaining sequential
generation for strongly dependent local tokens. Our approach can be seamlessly
integrated into standard autoregressive models without modifying the
architecture or tokenizer. Experiments on ImageNet and UCF-101 demonstrate that
our method achieves a 3.6x speedup with comparable quality and up to 9.5x
speedup with minimal quality degradation across both image and video generation
tasks. We hope this work will inspire future research in efficient visual
generation and unified autoregressive modeling. Project page:
https://epiphqny.github.io/PAR-project.
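A minimal, purely illustrative sketch of the scheduling idea from the PAR
abstract: tokens that are far apart on the grid are sampled together in one
step, while spatially adjacent tokens fall into different, sequential steps.
The grouping rule, the `region` parameter, and the `model(tokens)` interface
are assumptions for illustration, not the paper's implementation.
```python
import torch

def parallel_groups(grid_size: int, region: int):
    """Schedule token positions so that each step samples one token per
    region x region block; tokens generated together are at least `region`
    apart, while adjacent tokens land in different (sequential) steps."""
    steps = []
    for dy in range(region):
        for dx in range(region):
            steps.append([(y, x)
                          for y in range(dy, grid_size, region)
                          for x in range(dx, grid_size, region)])
    return steps

def generate(model, grid_size=16, region=4, device="cpu"):
    """Hypothetical driver: `model(tokens)` is assumed to return logits of
    shape (grid_size*grid_size, vocab) conditioned on the tokens filled so far
    (unfilled positions are marked with -1)."""
    tokens = torch.full((grid_size, grid_size), -1, dtype=torch.long, device=device)
    for group in parallel_groups(grid_size, region):
        logits = model(tokens)                      # one forward pass per step
        for (y, x) in group:                        # positions sampled independently
            probs = torch.softmax(logits[y * grid_size + x], dim=-1)
            tokens[y, x] = torch.multinomial(probs, 1).item()
    return tokens
```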
LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
Building safe Large Language Models (LLMs) across multiple languages is
essential for ensuring both safe access and linguistic diversity. To this end,
we introduce M-ALERT, a multilingual benchmark that evaluates the safety of
LLMs in five languages: English, French, German, Italian, and Spanish. M-ALERT
includes 15k high-quality prompts per language, totaling 75k, following the
detailed ALERT taxonomy. Our extensive experiments on 10 state-of-the-art LLMs
highlight the importance of language-specific safety analysis, revealing that
models often exhibit significant inconsistencies in safety across languages and
categories. For instance, Llama3.2 shows high unsafety in the category
crime_tax for Italian but remains safe in other languages. Similar differences
can be observed across all models. In contrast, certain categories, such as
substance_cannabis and crime_propaganda, consistently trigger unsafe responses
across models and languages. These findings underscore the need for robust
multilingual safety practices in LLMs to ensure safe and responsible usage
across diverse user communities.
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
Quantization has become one of the most effective methodologies for compressing
LLMs to a smaller size. However, existing quantization solutions still suffer
from either a non-negligible accuracy drop or system inefficiency. In this
paper, we present a comprehensive analysis of general quantization principles
and their effect on the triangle of accuracy, memory consumption, and system
efficiency. We propose MixLLM, which explores the new optimization space of
mixed-precision quantization between output features, based on the insight that
different output features matter differently in the model. MixLLM identifies
the output features with high salience in a global view, rather than within
each single layer, effectively assigning larger bit-widths to the output
features that need them most to achieve good accuracy with low memory
consumption. We present a sweet-spot quantization configuration, found through
algorithm-system co-design, that leads to both high accuracy and system
efficiency. To address the system challenge, we design a two-step
dequantization that makes easy use of the int8 Tensor Cores, together with fast
data-type conversion to significantly reduce dequantization overhead, and we
present a software pipeline that overlaps memory access, dequantization, and
the MatMul as much as possible. Extensive experiments show that, with only 10%
more bits, the perplexity increase can be reduced from about 0.5 for the SOTA
to within 0.2 for Llama 3.1 70B, while MMLU-Pro improves by 0.93 on average
over the SOTA across three popular models. In
addition to its superior accuracy, MixLLM also achieves state-of-the-art system
efficiency.
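A toy sketch of the global salience idea described in the MixLLM abstract:
output features (channels) are ranked across all layers at once, and only the
globally most salient fraction receives the higher bit-width. The salience
scores, the 4/8-bit split, and the `high_frac` budget are illustrative
assumptions, not the paper's actual algorithm.
```python
import numpy as np

def assign_bitwidths(salience_per_layer, high_bits=8, low_bits=4, high_frac=0.1):
    """Globally rank output features by salience and give the top `high_frac`
    fraction the higher bit-width; all others stay at the low bit-width.
    `salience_per_layer`: dict layer_name -> 1-D array of per-output-feature
    salience scores (how MixLLM actually measures salience may differ)."""
    # Flatten (layer, channel) pairs so the ranking is global, not per layer.
    entries = [(layer, idx, s)
               for layer, scores in salience_per_layer.items()
               for idx, s in enumerate(scores)]
    entries.sort(key=lambda e: e[2], reverse=True)
    n_high = int(len(entries) * high_frac)

    bits = {layer: np.full(len(scores), low_bits)
            for layer, scores in salience_per_layer.items()}
    for layer, idx, _ in entries[:n_high]:
        bits[layer][idx] = high_bits
    return bits

# A layer whose channels are globally less salient may get no high-precision
# channels at all, unlike per-layer top-k selection.
example = {
    "layer0.out_proj": np.array([0.9, 0.1, 0.2, 0.05]),
    "layer1.out_proj": np.array([0.03, 0.02, 0.04, 0.01]),
}
print(assign_bitwidths(example, high_frac=0.25))
```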
HEC-GCN: Hypergraph Enhanced Cascading Graph Convolution Network for Multi-Behavior Recommendation
Multi-behavior recommendation (MBR) has garnered growing attention recently
due to its ability to mitigate the sparsity issue by inferring user preferences
from various auxiliary behaviors to improve predictions for the target
behavior. Although existing research on MBR has yielded impressive results, it
still faces two major limitations. First, previous methods mainly focus on
modeling fine-grained interaction information between users and items under
each behavior, which may suffer from the sparsity issue. Second, existing
models usually concentrate on exploiting dependencies between two consecutive
behaviors, leaving intra- and inter-behavior consistency largely unexplored. To
this end, we propose a novel approach named Hypergraph Enhanced Cascading Graph
Convolution Network for multi-behavior recommendation (HEC-GCN). To be
specific, we first explore both fine- and coarse-grained correlations among
users or items of each behavior by simultaneously modeling the
behavior-specific interaction graph and its corresponding hypergraph in a
cascaded manner. Then, we propose a behavior consistency-guided alignment
strategy that ensures consistent representations between the interaction graph
and its associated hypergraph for each behavior, while also maintaining
representation consistency across different behaviors. Extensive experiments
and analyses on three public benchmark datasets demonstrate that our proposed
approach is consistently superior to previous state-of-the-art methods due to
its capability to effectively attenuate the sparsity issue as well as preserve
both intra- and inter-behavior consistencies. The code is available at
https://github.com/marqu22/HEC-GCN.git.
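One common way to realize the consistency-guided alignment described in the
HEC-GCN abstract is an InfoNCE-style contrastive loss between each user's
interaction-graph embedding and hypergraph embedding under a behavior b. The
notation below (z, sim, temperature tau) is illustrative only; the paper's
actual alignment objective may differ.
```latex
% Illustrative intra-behavior alignment for behavior b: pull together the
% interaction-graph view z_u^{g,b} and hypergraph view z_u^{h,b} of the same
% user u, push apart other users v (temperature \tau):
\mathcal{L}_{\mathrm{align}}^{(b)}
  = - \sum_{u} \log
    \frac{\exp\!\big( \mathrm{sim}(z_u^{g,b},\, z_u^{h,b}) / \tau \big)}
         {\sum_{v} \exp\!\big( \mathrm{sim}(z_u^{g,b},\, z_v^{h,b}) / \tau \big)} .
```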
SAFERec: Self-Attention and Frequency Enriched Model for Next Basket Recommendation
Transformer-based approaches such as BERT4Rec and SASRec demonstrate strong
performance in Next Item Recommendation (NIR) tasks. However, applying these
architectures to Next-Basket Recommendation (NBR) tasks, which often involve
highly repetitive interactions, is challenging due to the vast number of
possible item combinations in a basket. Moreover, frequency-based methods such
as TIFU-KNN and UP-CF still demonstrate strong performance in NBR tasks,
frequently outperforming deep-learning approaches. This paper introduces
SAFERec, a novel algorithm for NBR that enhances transformer-based
architectures from NIR by incorporating item frequency information,
consequently improving their applicability to NBR tasks. Extensive experiments
on multiple datasets show that SAFERec outperforms all other baselines,
specifically achieving an 8% improvement in Recall@10.
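As a rough illustration of how item frequency information could be fused into a
transformer-based next-basket recommender like the one described in the SAFERec
abstract, the module below adds a learnable frequency term to content-based
item scores. The class, its fusion rule, and its tensor shapes are hypothetical
assumptions, not SAFERec's actual architecture.
```python
import torch
import torch.nn as nn

class FrequencyEnrichedScorer(nn.Module):
    """Toy sketch: combine a transformer-derived user representation with the
    user's personal item-frequency vector when scoring candidate items for the
    next basket (SAFERec's actual fusion mechanism may differ)."""
    def __init__(self, num_items: int, hidden: int):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, hidden)
        self.freq_proj = nn.Linear(1, 1)  # learnable weight/bias on raw frequency

    def forward(self, user_repr: torch.Tensor, item_freq: torch.Tensor):
        # user_repr: (batch, hidden) output of a transformer over past baskets
        # item_freq: (batch, num_items) how often this user bought each item
        content_scores = user_repr @ self.item_emb.weight.T       # (batch, num_items)
        freq_scores = self.freq_proj(item_freq.unsqueeze(-1)).squeeze(-1)
        return content_scores + freq_scores                       # final ranking scores
```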
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
Key-Value (KV) cache has become a bottleneck of LLMs for long-context
generation. Despite numerous efforts in this area, optimization of the decoding
phase is generally ignored. However, we believe such optimization is crucial,
especially for long-output generation tasks, based on the following two
observations: (i) excessive compression during the prefill phase, which
requires the specific full context, impairs comprehension of the reasoning
task; (ii) deviation of heavy hitters occurs in reasoning tasks with long
outputs. We therefore introduce SCOPE, a simple yet efficient framework that
performs KV cache optimization separately for the prefill and decoding phases.
Specifically, the KV cache from the prefill phase is preserved to maintain the
essential information, while a novel sliding strategy is proposed to select
essential heavy hitters for the decoding phase. Memory usage
and memory transfer are further optimized using adaptive and discontinuous
strategies. Extensive experiments on LongGenBench show the effectiveness and
generalization of SCOPE and its compatibility as a plug-in to other
prefill-only KV compression methods.
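A toy sketch of the general prefill/decoding split the SCOPE abstract
describes: the prefill KV cache is kept in full, while decoding-phase entries
are pruned to a recent sliding window plus the highest-scoring heavy hitters.
The scoring proxy, budgets, and function interface are illustrative
assumptions, not SCOPE's actual strategy.
```python
import torch

def compress_decoding_kv(attn_scores, prefill_len, window=64, heavy_budget=64):
    """Return the KV-cache positions to keep.
    attn_scores: (seq_len,) accumulated attention mass each past position has
    received (a common heavy-hitter proxy; SCOPE's criterion may differ).
    Prefill entries are preserved in full; only decoding-phase entries are
    pruned to a recent window plus the top `heavy_budget` heavy hitters."""
    seq_len = attn_scores.shape[0]
    keep = set(range(prefill_len))                                    # full prefill cache
    keep.update(range(max(prefill_len, seq_len - window), seq_len))   # recent window

    decode_positions = torch.arange(prefill_len, seq_len)
    if len(decode_positions) > 0:
        scores = attn_scores[decode_positions]
        top = scores.topk(min(heavy_budget, len(decode_positions))).indices
        keep.update(decode_positions[top].tolist())                   # heavy hitters
    return sorted(keep)
```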
Sequence Matters: Harnessing Video Models in 3D Super-Resolution
3D super-resolution aims to reconstruct high-fidelity 3D models from
low-resolution (LR) multi-view images. Early studies primarily focused on
single-image super-resolution (SISR) models to upsample LR images into
high-resolution images. However, these methods often lack view consistency
because they operate independently on each image. Although various
post-processing techniques have been extensively explored to mitigate these
inconsistencies, they have yet to fully resolve the issues. In this paper, we
perform a comprehensive study of 3D super-resolution by leveraging video
super-resolution (VSR) models. By utilizing VSR models, we ensure a higher
degree of spatial consistency and can reference surrounding spatial
information, leading to more accurate and detailed reconstructions. Our
findings reveal that VSR models can perform remarkably well even on sequences
that lack precise spatial alignment. Given this observation, we propose a
simple yet practical approach to align LR images, without involving fine-tuning
or generating a 'smooth' trajectory from 3D models trained over the LR images.
Experimental results show that these surprisingly simple algorithms achieve
state-of-the-art results on 3D super-resolution tasks over standard benchmark
datasets, such as the NeRF-synthetic and MipNeRF-360 datasets.
Project page: https://ko-lani.github.io/Sequence-Matters
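A purely illustrative sketch of one way to turn an unordered set of multi-view
LR images into a VSR-friendly sequence, as the Sequence Matters abstract
suggests: greedily walk to the nearest unused camera so consecutive frames
change smoothly. The nearest-camera heuristic and the function interface are
assumptions; the paper's actual ordering criterion may differ.
```python
import numpy as np

def order_views(cam_positions):
    """Greedily order multi-view LR images into a 'smooth' sequence by always
    stepping to the nearest unused camera, so consecutive frames resemble a
    video and can be fed to a video super-resolution model."""
    cam_positions = np.asarray(cam_positions, dtype=float)
    order, unused = [0], set(range(1, len(cam_positions)))
    while unused:
        last = cam_positions[order[-1]]
        nxt = min(unused, key=lambda i: np.linalg.norm(cam_positions[i] - last))
        order.append(nxt)
        unused.remove(nxt)
    return order

# Four cameras on a rough circle get ordered into a smooth sweep around it.
print(order_views([[1, 0, 0], [0, 1, 0], [-1, 0, 0], [0, -1, 0]]))
```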
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE)
Vision-Language Models that significantly improves upon its predecessor,
DeepSeek-VL, through two major upgrades. For the vision component, we
incorporate a dynamic tiling vision encoding strategy designed for processing
high-resolution images with different aspect ratios. For the language
component, we leverage DeepSeekMoE models with the Multi-head Latent Attention
mechanism, which compresses Key-Value cache into latent vectors, to enable
efficient inference and high throughput. Trained on an improved vision-language
dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks,
including but not limited to visual question answering, optical character
recognition, document/table/chart understanding, and visual grounding. Our
model series is composed of three variants: DeepSeek-VL2-Tiny,
DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated
parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art
performance with similar or fewer activated parameters compared to existing
open-source dense and MoE-based models. Code and pre-trained models are
publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
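A toy sketch of what a dynamic tiling strategy for high-resolution,
variable-aspect-ratio images can look like, in the spirit of the DeepSeek-VL2
abstract: pick the tile grid whose aspect ratio best matches the image under a
tile budget. The candidate grids, tile size, and selection rule are
illustrative assumptions, not DeepSeek-VL2's actual tiling scheme.
```python
def choose_tile_grid(width: int, height: int, tile: int = 384, max_tiles: int = 9):
    """Pick a (cols, rows) tile grid whose aspect ratio best matches the image,
    subject to a total tile budget, and report the resize target so the image
    can be split into cols*rows fixed-size tiles for the vision encoder."""
    candidates = [(c, r) for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    img_ratio = width / height
    cols, rows = min(candidates, key=lambda cr: abs(cr[0] / cr[1] - img_ratio))
    return {"grid": (cols, rows), "resize_to": (cols * tile, rows * tile)}

# A wide 1600x900 image maps to a wider-than-tall tile grid.
print(choose_tile_grid(1600, 900))
```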