Morphosyntactic annotation of CHILDES transcripts

Kenji Sagae; Eric Davis; Alon Lavie; Brian MacWhinney; Shuly Wintner

doi:10.1017/S0305000909990407

J Child Lang. 2010 Jun;37(3):705-29. doi: 10.1017/S0305000909990407. Epub 2010 Mar 25.

Authors

Kenji Sagae¹, Eric Davis, Alon Lavie, Brian MacWhinney, Shuly Wintner

Affiliation

¹ Institute for Creative Technologies, University of Southern California, CA 90292, USA. sagae@usc.edu

Abstract

Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database with grammatical relations in the form of labeled dependency structures. We have produced a corpus of over 18,800 utterances (approximately 65,000 words) with manually curated gold-standard grammatical relation annotations. Using this corpus, we have developed a highly accurate data-driven parser for the English CHILDES data, which we used to automatically annotate the remainder of the English section of CHILDES. We have also extended the parser to Spanish, and are currently working on supporting more languages. The parser and the manually and automatically annotated data are freely available for research purposes.

Publication types

Evaluation Study
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Adult
Algorithms
Automation
Child
Child Language
Computer Simulation
Databases, Factual*
Humans
Interpersonal Relations
Language
Linguistics*
Speech
Speech Production Measurement
Speech Recognition Software*
