Training Models to Critique Themselves
We study the setting of large language models critiquing their own outputs in natural language. We find that:
Critiques help humans find flaws in summaries that they would have otherwise missed.
Larger models write more helpful critiques and, on most tasks, are better at critiquing their own outputs.
Larger models can act on their own self-critiques, refining their summaries into better ones.
We suggest a methodology for detecting this, and find evidence that our models’ critiques may not surface all of the models’ relevant knowledge of flaws.
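The critique-and-refine loop behind these findings can be sketched as below. This is a minimal illustration, not the authors' implementation: `generate` is a hypothetical stand-in for a language-model call, replaced here by a deterministic toy stub so the control flow is runnable.

```python
def generate(prompt: str) -> str:
    """Toy stand-in for a language-model completion call (assumption:
    any function mapping a prompt string to a completion string)."""
    if prompt.startswith("Critique:"):
        # Pretend the model spots a missing detail in the summary.
        return "The summary omits the publication year."
    if prompt.startswith("Refine:"):
        # Pretend the model repairs the flagged omission.
        return "GPT-2 (2019) is a large language model."
    # Default branch: produce an initial (flawed) summary.
    return "GPT-2 is a large language model."


def self_refine(text: str) -> str:
    """Summarize, self-critique, then refine the summary using the critique."""
    summary = generate(f"Summarize: {text}")
    critique = generate(f"Critique: {summary}")
    refined = generate(f"Refine: {summary}\nCritique: {critique}")
    return refined
```

With a real model in place of the stub, the same three-call structure lets a single model draft, critique, and improve its own summary.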
Jeff Wu is a research engineer at OpenAI working on language modeling (e.g. GPT-2) and alignment (InstructGPT).