Offline approaches perform poorly, even though they improve resource efficiency.
They place actor and rollout models on different devices, allowing trajectories used for training to be generated from previous model versions.
In recent asynchronous RL systems for LLM post-training, however, this advantage is severely diminished. The fundamental limitation is their batch-oriented design, which cannot effectively hide the generation latency of long-tail trajectories.
The core problem is global weight synchronization.
While this reduces the long-tail bubble, it introduces two major problems. (1) The pause-and-synchronize cycle, repeated in every RL iteration, forces the rollout to rebuild the KVCache (i.e., re-prefill) for each interrupted trajectory, wasting GPU resources without advancing generation. (2) Generating a single response with inconsistent policy versions harms model convergence and slows it down, as our experiments in §8.2 show. §9 provides a more comprehensive discussion and additional related work.
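To make the re-prefill waste in problem (1) concrete, the toy calculation below counts the tokens that must be recomputed after a pause-and-synchronize cycle. The function name and the trajectory lengths are made-up illustrations, not measurements from the system.

```python
# Hedged illustration: when generation is paused for a global weight sync,
# each interrupted trajectory's KVCache is dropped, so on resume the prompt
# plus every token generated so far must be re-prefilled before any new
# token is produced. All numbers below are invented for illustration.

def refill_tokens(prompt_len, generated_so_far):
    """Tokens re-processed on resume without producing any new output."""
    return prompt_len + generated_so_far

# (prompt_len, tokens_generated_before_interruption) per trajectory
interrupted = [(512, 300), (512, 1800), (1024, 4000)]
wasted = sum(refill_tokens(p, g) for p, g in interrupted)
print(wasted)  # 8148 tokens of pure recomputation for this one sync
```

Because this cost recurs in every RL iteration, it scales with both the sync frequency and the lengths of in-flight trajectories.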
A rollout fetches the latest actor weights immediately after finishing generation of a batch of data. Because each rollout runs independently and completes generation at its own pace, rollout updates can occur at any point during actor training.
However, individual rollouts can still get stuck in long-tail generation (Figure 3(e)).
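The per-rollout update scheme described above can be sketched with a shared weight store that the trainer writes and each rollout reads at its own pace. The `WeightStore` class and the loop structure are illustrative assumptions, not Laminar's actual API.

```python
import threading

class WeightStore:
    """Holds the latest actor weights; rollouts read without blocking the trainer."""
    def __init__(self, weights):
        self._lock = threading.Lock()
        self._weights, self._version = weights, 0

    def push(self, weights):
        # Trainer publishes a new version after each model update.
        with self._lock:
            self._weights, self._version = weights, self._version + 1

    def pull(self):
        # A rollout fetches whatever is newest *right now*; no global barrier.
        with self._lock:
            return self._weights, self._version

store = WeightStore(weights={"w": 0.0})
seen_versions = []

def rollout_loop(n_batches):
    for _ in range(n_batches):
        weights, version = store.pull()  # fetch latest at this rollout's own pace
        seen_versions.append(version)
        # ... generate one batch of trajectories with `weights` ...

# Trainer pushes interleave with rollout pulls at arbitrary points in time.
rollout_loop(1)
store.push({"w": 0.1})
rollout_loop(1)
print(seen_versions)  # [0, 1]: each batch used whichever version was newest
```

The key property is that `push` never waits for any rollout, and `pull` never waits for training, so neither side imposes a synchronization barrier on the other.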
The continuous, asynchronous training workflow of Laminar is designed to maintain high training throughput when scaling up. It begins as rollouts pull prompts from the prompt pool to generate trajectories on their GPUs (step ①). For fault tolerance, in-progress trajectories are streamed to the partial response pool (step ②). Upon generation completion, they are moved to the experience buffer (step ③). In parallel with rollout generation, the trainer samples completed trajectories from the experience buffer to perform model training (step ④). This fundamental decoupling of data production from consumption is key to the system’s scalability.
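The producer/consumer decoupling in steps ①–④ can be sketched with queues standing in for the prompt pool, partial response pool, and experience buffer. Names like `rollout_worker` and `fake_generate` are hypothetical stand-ins, not the system's real components.

```python
import queue
import threading

prompt_pool = queue.Queue()          # prompts awaiting generation
experience_buffer = queue.Queue()    # completed trajectories for training
partial_response_pool = {}           # rollout_id -> in-progress trajectory (fault tolerance)

def fake_generate(prompt):
    # Placeholder for autoregressive LLM decoding.
    yield from (f"{prompt}_tok{i}" for i in range(3))

def rollout_worker(rollout_id, n_prompts):
    """Steps 1-3: pull prompts, stream partial output, commit finished trajectories."""
    for _ in range(n_prompts):
        prompt = prompt_pool.get()                          # step 1
        trajectory = []
        for token in fake_generate(prompt):
            trajectory.append(token)
            partial_response_pool[rollout_id] = trajectory  # step 2: streamed checkpoint
        partial_response_pool.pop(rollout_id, None)
        experience_buffer.put((prompt, trajectory))         # step 3

def trainer(n_samples):
    """Step 4: sample completed trajectories independently of the producers."""
    return [experience_buffer.get() for _ in range(n_samples)]

# Producers and the consumer run concurrently; neither blocks the other globally.
for p in ["p0", "p1", "p2", "p3"]:
    prompt_pool.put(p)
workers = [threading.Thread(target=rollout_worker, args=(i, 2)) for i in range(2)]
for w in workers:
    w.start()
batch = trainer(4)
for w in workers:
    w.join()
print(len(batch))  # 4 trajectories consumed
```

Because the trainer only touches the experience buffer, a slow rollout delays its own trajectories but never stalls training as long as other rollouts keep the buffer populated.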
Weight loading is memory-bound: copying weights from host memory to the GPU is time-consuming.
After a model update, the trainer pushes the new actor weights to the master relay and immediately resumes its next training iteration without waiting for weights to be fully distributed to other relays or rollouts (step ⑤). The master relay then broadcasts the new weights directly to all other relays using RDMA, which occurs in the background on CPU memory without affecting ongoing GPU-based generation on the same machine (step ⑥). A rollout can fetch the latest weights from its colocated relay at any time, over high-speed PCIe with minimal latency (step ⑦). This decouples parameter dependencies across the actor and all rollouts in the system.
