Multimodal Conversation Structure Understanding
Co-sponsored by the Berkeley Institute for Data Science, the School of Information, and the Department of Scandinavian.
While multimodal large language models (LLMs) excel at dialogue, whether they can adequately parse the structure of conversation, including conversational roles and threading, remains underexplored.
In this work, we introduce a suite of tasks and release TV-MMPC, a new annotated dataset, for multimodal conversation structure understanding. Our evaluation reveals that while all multimodal LLMs outperform our heuristic baseline, even the best-performing model we consider suffers a substantial drop in performance when the character identities in the conversation are anonymized.
Beyond evaluation, we carry out a sociolinguistic analysis of 350,842 utterances in TVQA. We find that while female characters initiate conversations at rates proportional to their speaking time, they are 1.2 times more likely than men to be cast as an addressee or side-participant, and the presence of side-participants shifts the conversational register from personal to social.
Space is limited. Submit the application form to request an invitation.
Speaker
Kent Chang
Kent K. Chang is a Ph.D. candidate at the University of California, Berkeley, researching natural language processing (NLP) and cultural analytics, advised by Prof. David Bamman at the School of Information and Berkeley Artificial Intelligence Research (BAIR).
Kent’s research sits at the intersection of NLP and cultural studies: he seeks to leverage theoretical and textual resources in the humanities and social sciences to improve systems and develop new tasks in NLP, while also offering researchers in relevant fields more tools to study language, social interaction, and cultural representation at scale.
