Syntactic Generalization in Natural Language Inference
Neural network models for natural language processing often perform very well on examples that are drawn from the same distribution as the training set. Do they accomplish such success by learning to solve the task as a human might solve it, or do they adopt heuristics that happen to work well on the data set in question, but do not reflect the normative definition of the task (how one “should” solve the task)? This question can be addressed effectively by testing how the system generalizes to examples constructed specifically to diagnose whether the system relies on such fallible heuristics. In my talk, I will discuss ongoing work applying this methodology to the natural language inference (NLI) task.
I will show that a standard neural model — BERT fine-tuned on the MNLI corpus — achieves high accuracy on the MNLI test set, but shows little sensitivity to syntactic structure when tested on our diagnostic data set (HANS); instead, the model relies on word overlap between the premise and the hypothesis, and concludes, for example, that “the doctor visited the lawyer” entails “the lawyer visited the doctor”. While accuracy on the test set is very stable across fine-tuning runs with different weight initializations, generalization behavior varies widely, with accuracy on some classes of examples ranging from 0% to 66%. Finally, augmenting the training set with a moderate number of examples that contradict the word overlap heuristic leads to a dramatic improvement in generalization accuracy. This improvement generalizes to constructions that were not included in the augmentation set. Overall, our results suggest that the syntactic deficiencies of the fine-tuned model do not arise primarily from poor abstract syntactic representations in the underlying BERT model; rather, because of its weak inductive bias, BERT requires a strong fine-tuning signal to favor those syntactic representations over simpler heuristics.
Tal Linzen is an assistant professor of cognitive science (with a joint appointment in Computer Science) at Johns Hopkins University, and affiliated faculty at the JHU Center for Language and Speech Processing. Before moving to Johns Hopkins, he was a postdoctoral researcher at the École Normale Supérieure in Paris, and before that he obtained his PhD from the Department of Linguistics at New York University. At Johns Hopkins, Tal directs the Computation and Psycholinguistics Lab, which develops computational models of human language comprehension and acquisition, as well as psycholinguistically-informed methods for interpreting, evaluating and extending neural network models for natural language processing. The lab’s work has appeared in venues such as ACL, CoNLL, EMNLP, ICLR, NAACL and TACL, as well as in journals such as Cognitive Science and Journal of Neuroscience. Tal co-organized the first two editions of the BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (EMNLP 2018, ACL 2019) and is a co-chair of CoNLL 2020.