Information Access Seminar

Experience with Open Source Linguistic Data and Tools for Analysis of Chinese Text

Friday, October 28, 2016

3:10 pm - 5:00 pm

Alex Amies

(Note the room change.)

The speaker will discuss his experience accessing linguistic data while developing a reader and corpus analysis tools for Chinese Buddhist texts. The open source projects used or investigated in this work come with much promise and hype. Some of these sources are truly useful, such as CBETA, which provides a digitization of several versions of the Chinese Buddhist canon. The presentation will describe the practical challenges of working with these sources, including the inferior quality, small volume, and critical missing pieces of open source data compared with copyrighted or otherwise locked materials. It will then describe approaches to overcoming these challenges, in particular tools such as Jupyter and Pandas for efficient, semi-automated curation of linguistic data.

Alex Amies is building a Chinese text reader (ntireader.org) for the Taishō Shinshū Daizōkyō version of the Chinese Buddhist canon, as part of a Master of Arts in applied Buddhist studies at Nan Tien Institute, Australia. Alex works full time as a technical solution consultant specializing in cloud computing. He graduated with an M.S. in civil engineering from Stanford and a B.S. in computer science from the University of New South Wales, Australia.

Last updated: October 28, 2016


Getting to South Hall