Supervised Learning-enhanced Multi-Group Actor Critic for Live Stream Allocation in Feed

November 28, 2024 · Declared Dead · 🏛 Knowledge Discovery and Data Mining

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Jingxin Liu, Xiang Gao, Yisha Li, Xin Li, Haiyang Lu, Ben Wang arXiv ID 2412.10381 Category cs.IR: Information Retrieval Cross-listed cs.AI Citations 2 Venue Knowledge Discovery and Data Mining Last Checked 4 months ago

Abstract

In the context of a short video & live stream mixed recommendation scenario, the live stream recommendation system (RS) decides whether to allocate at most one live stream into the video feed for each user request. To maximize long-term user engagement, it is crucial to determine an optimal live stream policy for accurate live stream allocation. The inappropriate live stream allocation policy can significantly affect the duration of the usage app and user retention, which ignores the long-term negative impact of live stream allocation. Recently, reinforcement learning (RL) has been widely applied in recommendation systems to capture long-term user engagement. However, traditional RL algorithms often face divergence and instability problems, which restricts the application and deployment in the large-scale industrial recommendation systems, especially in the aforementioned challenging scenario. To address these challenges, we propose a novel Supervised Learning-enhanced Multi-Group Actor Critic algorithm (SL-MGAC). Specifically, we introduce a supervised learning-enhanced actor-critic framework that incorporates variance reduction techniques, where multi-task reward learning helps restrict bootstrapping error accumulation during critic learning. Additionally, we design a multi-group state decomposition module for both actor and critic networks to reduce prediction variance and improve model stability. We also propose a novel reward function to prevent overly greedy live stream allocation. Empirically, we evaluate the SL-MGAC algorithm using offline policy evaluation (OPE) and online A/B testing. Experimental results demonstrate that the proposed method not only outperforms baseline methods under the platform-level constraints but also exhibits enhanced stability in online recommendation scenarios.