HybridFlow

RLHF: Reinforcement Learning from Human Feedback

PPO-based RLHF:

Three Stages

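Generation: the actor generates responses for a batch of prompts
Preparation: the training data is prepared by scoring the generated responses with forward passes
Learning: the actor and critic are updated from human preference through forward and backward passes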
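
The three stages can be sketched as one training iteration. This is a minimal toy, not the framework's code: the response space, the lookup-table reward model, the scalar critic, and the names `generate`, `prepare`, `learn`, and `run` are all assumptions made for illustration. For brevity the learning step is a plain policy-gradient update with a baseline, not PPO's clipped objective.

```python
import math
import random

# Toy stand-ins (assumptions): three canned responses and a lookup-table reward model.
RESPONSES = ["helpful", "neutral", "rude"]
REWARD = {"helpful": 1.0, "neutral": 0.0, "rude": -1.0}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(actor_logits, num_prompts, rng):
    """Stage 1: the actor samples one response index per prompt."""
    probs = softmax(actor_logits)
    return rng.choices(range(len(RESPONSES)), weights=probs, k=num_prompts)

def prepare(samples, critic_value):
    """Stage 2: forward-only scoring; returns (action, advantage, reward) rows."""
    rows = []
    for action in samples:
        reward = REWARD[RESPONSES[action]]
        rows.append((action, reward - critic_value, reward))
    return rows

def learn(actor_logits, critic_value, rows, lr=0.5):
    """Stage 3: update the actor (policy gradient) and the critic (baseline)."""
    probs = softmax(actor_logits)
    grads = [0.0] * len(actor_logits)
    for action, advantage, _ in rows:
        for i in range(len(grads)):
            # d log pi(action) / d logit_i = 1[i == action] - pi(i)
            grads[i] += advantage * ((1.0 if i == action else 0.0) - probs[i])
    n = len(rows)
    new_logits = [l + lr * g / n for l, g in zip(actor_logits, grads)]
    mean_reward = sum(r for _, _, r in rows) / n
    new_critic = critic_value + lr * (mean_reward - critic_value)
    return new_logits, new_critic

def run(iterations=200, batch=64, seed=0):
    rng = random.Random(seed)
    logits, value = [0.0, 0.0, 0.0], 0.0
    for _ in range(iterations):
        samples = generate(logits, batch, rng)      # stage 1
        rows = prepare(samples, value)              # stage 2
        logits, value = learn(logits, value, rows)  # stage 3
    return softmax(logits), value
```

Calling `run()` returns the final response probabilities and the critic's baseline. Only stage 3 changes any parameters; stages 1 and 2 are forward-only.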

Intra-node: multi-controller paradigm (a node here is one model's distributed computation in the RLHF dataflow)
Inter-node: single controller
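
A rough sketch of how the two levels fit together, under heavy simplification. `WorkerGroup` and `controller` are hypothetical names, not the framework's API, and threads stand in for per-GPU processes. Inside a node every rank runs the same function on its own shard; between nodes a single controller function decides the order and passes data along.

```python
from concurrent.futures import ThreadPoolExecutor

class WorkerGroup:
    """One dataflow node: `world_size` ranks that each run the same function on a shard."""

    def __init__(self, name, world_size):
        self.name = name
        self.world_size = world_size

    def run(self, fn, batch):
        # Scatter: rank r takes items r, r + world_size, r + 2 * world_size, ...
        shards = [batch[r::self.world_size] for r in range(self.world_size)]
        with ThreadPoolExecutor(max_workers=self.world_size) as pool:
            outputs = list(pool.map(lambda shard: [fn(x) for x in shard], shards))
        # Gather: put each rank's results back in the original order.
        merged = [None] * len(batch)
        for r, out in enumerate(outputs):
            merged[r::self.world_size] = out
        return merged

def controller(prompts):
    """Single controller: sequences the nodes and moves data between them."""
    actor = WorkerGroup("actor", world_size=2)
    reward = WorkerGroup("reward", world_size=3)
    responses = actor.run(lambda p: p + " -> reply", prompts)  # toy generation
    scores = reward.run(len, responses)                        # toy scoring
    return responses, scores
```

The two groups use different `world_size` values on purpose: the controller only sees whole batches, so each node is free to choose its own parallel layout.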

3D-HybridEngine: executes training and generation of the actor model with high computational efficiency and a zero-redundancy transition between the two.