Training Models to Critique Themselves
We study the setting of large language models critiquing their own outputs in natural language. We find that:
Critiques help humans find flaws in summaries that they would have otherwise missed.
Larger models write more helpful critiques and, on most tasks, are better at critiquing their own outputs.
Larger models can act on their own self-critiques, refining their summaries into better ones.
We suggest a methodology for detecting this, and find evidence that our models’ critiques may not surface all of the models’ relevant knowledge of flaws.
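The critique-and-refine loop behind these findings can be sketched as below. This is a minimal illustration, not the authors' implementation: `generate` is a hypothetical stand-in for a language-model call, replaced here by a deterministic toy stub so the control flow is runnable.

```python
def generate(prompt: str) -> str:
    """Toy stand-in for a language-model completion call (assumption:
    any function mapping a prompt string to a completion string)."""
    if prompt.startswith("Critique:"):
        # Pretend the model spots a missing detail in the summary.
        return "The summary omits the publication year."
    if prompt.startswith("Refine:"):
        # Pretend the model repairs the flagged omission.
        return "GPT-2 (2019) is a large language model."
    # Default branch: produce an initial (flawed) summary.
    return "GPT-2 is a large language model."


def self_refine(text: str) -> str:
    """Summarize, self-critique, then refine the summary using the critique."""
    summary = generate(f"Summarize: {text}")
    critique = generate(f"Critique: {summary}")
    refined = generate(f"Refine: {summary}\nCritique: {critique}")
    return refined
```

With a real model in place of the stub, the same three-call structure lets a single model draft, critique, and improve its own summary.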
Jeff Wu is a research engineer at OpenAI working on language modeling (e.g. GPT-2) and alignment (InstructGPT).