Kilolanguage Learning, Projection, and Translation
The breadth of information digitized in the world’s languages creates opportunities for linguistic insights and computational tools with a pan-lingual perspective. We can achieve this by projecting lexical information across languages, at either the type or the token level. First, we project information between thousands of languages at the type level to investigate the classic color word hypotheses of Berlin and Kay. Applying fourteen computational linguistic measures of color word basicness/secondariness, we find cross-linguistic support for the hypotheses and add nuance to them. Second, we project information between thousands of languages at the token level to create fine-grained morphological analyzers and generators. We show applications to pronoun clusivity and multilingual MT. Finally, we produce morphological tools grounded in UniMorph that improve on strong initial models and generalize across languages.
Arya McCarthy is a Ph.D. candidate at Johns Hopkins University, working on massively multilingual natural language processing. He is advised by David Yarowsky in the Center for Language and Speech Processing; his work is funded by DARPA LORELEI, the International Olympic Committee, and the American Political Science Association (APSA). His work focuses on improving translation and computational modeling of rare languages. He approaches this primarily through weakly supervised natural language processing at the scale of thousands of languages. Arya has previously spent time at Google, Duolingo, Facebook, Harvard University, and the University of Edinburgh. Arya is the PI for an APSA grant geared toward better integrating computational and social sciences. In this effort, he is partnering with Tom Lippincott, Kathy McKeown, David Mimno, Philip Resnik, and Noah Smith.