InstructGPT, and the arrival of RLHF

OpenAI recently published InstructGPT, and it's fantastic work. It's the culmination of a series of papers from 2017, 2019, and 2021, and I think it'll be a big deal. Beyond the obvious alignment angle, even purely capabilities-focused folks should be interested, given the comparable performance with ~100x fewer parameters (1.3B vs. 175B).

Some additional directions I think will be interesting to pursue include reward modeling (Anthropic recently published related work), disentangling the effects of the algorithmic bits from those of the human preference data collected, RL algorithms (e.g. my old labmates' work on utilizing off-policy data in a similar RLHF setup), modeling diverse or conflicting feedback (e.g. in the style of Jury Learning, distributional RL, or Who Said What), and learning from other sources of ratings (e.g. natural language feedback after deployment, or implicit preferences).