[Paper Notes] ChatGPT Series 1.2: PPO (chatllama & ColossalAI code walkthrough)

编程之家690 Updated: 2026-04-04 04:09:54

ChatGPT training consists of three steps: pretraining/fine-tuning (Pretrain/FT), Reward Model training, and PPO.

GitHub - hpcaitech/ColossalAI: Making large AI models cheaper, faster and more accessible

nebullvm/apps/accelerate/chatllama at main · nebuly-ai/nebullvm · GitHub

I. Actor model training (fine-tuning GPT)

This step performs supervised pretraining/fine-tuning of the Actor model, i.e. GPT.

The model is GPT2LMHeadModel, and the loss function is softmax cross-entropy.
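To make the loss concrete, here is a minimal pure-Python sketch of next-token softmax cross-entropy. It is not the ColossalAI or chatllama code: the function names `log_softmax` and `causal_lm_loss`, and the list-of-lists layout for the logits, are assumptions made for illustration. The one-position shift mirrors how a causal language model such as GPT2LMHeadModel scores each position against the following token, and `-100` follows the PyTorch convention for positions the loss should ignore.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def causal_lm_loss(logits, input_ids, ignore_index=-100):
    """Mean next-token cross-entropy.

    logits:    list of T rows, each a list of V scores (one row per position)
    input_ids: list of T token ids; the token at t+1 is predicted from row t
    """
    total, count = 0.0, 0
    # Shift by one: the scores at position t are compared with the token at t+1.
    for row, target in zip(logits[:-1], input_ids[1:]):
        if target == ignore_index:  # e.g. padding positions
            continue
        total += -log_softmax(row)[target]
        count += 1
    return total / count if count else 0.0

# Uniform scores over a vocabulary of 4 give a loss of ln(4).
uniform = [[0.0] * 4 for _ in range(3)]
loss = causal_lm_loss(uniform, [1, 2, 3])
```

In practice the same quantity is computed on tensors by the training framework; the sketch only spells out the arithmetic behind "softmax cross-entropy" for a language model.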

class SFTDataset(Dataset):
    def __init__(self, dataset, tokenizer: Callable,
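The `SFTDataset` excerpt breaks off partway through its constructor signature, so the rest of the class is not shown here. Purely as an illustration of what a dataset for this step needs to provide, and not the ColossalAI implementation, the sketch below builds a hypothetical map-style dataset. The class name `ToySFTDataset`, the `max_length` and `pad_id` parameters, and the character-level `toy_tokenizer` are all assumptions. It uses plain Python lists instead of tensors so it runs without any dependencies.

```python
from typing import Callable, Dict, List, Sequence

class ToySFTDataset:
    """Hypothetical map-style dataset: one tokenized training text per item."""

    def __init__(self, texts: Sequence[str], tokenizer: Callable[[str], List[int]],
                 max_length: int = 8, pad_id: int = 0) -> None:
        self.examples: List[Dict[str, List[int]]] = []
        for text in texts:
            ids = tokenizer(text)[:max_length]  # truncate long inputs
            n_pad = max_length - len(ids)
            self.examples.append({
                "input_ids": ids + [pad_id] * n_pad,
                "attention_mask": [1] * len(ids) + [0] * n_pad,
                # For causal LM fine-tuning the labels are the inputs themselves;
                # -100 marks padding so that a loss can skip those positions.
                "labels": ids + [-100] * n_pad,
            })

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int) -> Dict[str, List[int]]:
        return self.examples[idx]

def toy_tokenizer(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): maps each character to a small positive id."""
    return [ord(c) % 50 + 1 for c in text]

ds = ToySFTDataset(["hi", "hello world"], toy_tokenizer, max_length=4)
```

Any object that implements `__len__` and `__getitem__` in this way can be indexed like a map-style dataset; a real pipeline would use the model's own tokenizer and return tensors.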

Published: 2025-08-10
