Published online by Cambridge University Press: 30 May 2017
In written Chinese, personal pronouns are commonly dropped when they can be inferred from context. This practice is particularly common in informal genres such as Short Message Service (SMS) messages sent via cell phones. Restoring dropped personal pronouns can be a useful preprocessing step for information extraction. Dropped personal pronoun recovery can be divided into two subtasks: (1) detecting dropped personal pronoun slots and (2) determining the identity of the pronoun for each slot. We address a simpler version of the task in which only the person and number of each dropped pronoun are identified. After applying a word segmenter, we used a linear-chain conditional random field to predict which words were at the start of an independent clause. Then, using the independent clause start information, as well as lexical and syntactic information, we applied a conditional random field or a maximum-entropy classifier to predict whether a dropped personal pronoun immediately preceded each word and, if so, the person and number of the dropped pronoun. We conducted a series of experiments using a manually annotated corpus of Chinese SMS messages. Our approaches substantially outperformed a rule-based approach based partially on rules developed by Chung and Gildea (2010, Effects of Empty Categories on Machine Translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. pp. 636–45). Our approaches also outperformed, though by a considerably smaller margin, a machine-learning approach based closely on work by Yang, Liu, and Xue (2015, Recovering Dropped Pronouns from Chinese Text Messages. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics. pp. 309–13). Features derived from parsing largely did not help our approaches.
We conclude that, given independent clause start information, the parse information we used was largely superfluous for identifying dropped personal pronouns.
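The task framing described above, one label per word indicating whether a pronoun was dropped immediately before it and with which person and number, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the label names (`NONE`, `2SG`), the feature names, the helper functions, and the toy sentence are hypothetical and are not taken from the authors' system or corpus.

```python
from typing import Dict, List

# Hypothetical label for "no pronoun dropped before this word".
NONE = "NONE"


def label_tokens(tokens: List[str], dropped: Dict[int, str]) -> List[str]:
    """Return one label per token: the person/number tag of a pronoun
    dropped immediately before that token, or NONE."""
    return [dropped.get(i, NONE) for i in range(len(tokens))]


def token_features(tokens: List[str], clause_starts: List[bool], i: int) -> Dict[str, object]:
    """Build illustrative lexical and clause-start features for token i.
    A sequence labeler or classifier would consume one such dict per token."""
    return {
        "word": tokens[i],
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "<EOS>",
        "clause_start": clause_starts[i],
    }


# Toy segmented message (invented): "[you] arrived home yet?"
tokens = ["到", "家", "了", "吗"]
clause_starts = [True, False, False, False]   # assumed output of the first stage
labels = label_tokens(tokens, {0: "2SG"})     # second-person singular dropped before token 0
features = [token_features(tokens, clause_starts, i) for i in range(len(tokens))]
```

In this framing, detection and identification collapse into a single per-word prediction, and the clause-start flag enters the second stage as an ordinary feature.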
Also affiliated with the Department of Linguistics, Georgetown University, 3700 O Street NW, Washington, DC, USA.
We are thankful for the assistance provided by our MITRE colleagues. Dr Sichu Li annotated, efficiently and professionally, a large subset of the SMS messages we downloaded from the National University of Singapore. Dr John Prange, Mr Rob Case, and Mr Rod Holland provided valuable feedback on a presentation we gave describing our preliminary research findings. We are also thankful for the assistance provided by our colleagues at other institutions. Professor Nianwen (Bert) Xue at Brandeis University, Boston, USA shared his thoughts and expertise on Chinese dropped pronoun detection at an early stage of our research. Professor Derek F. Wong and Mr Junwen Xing at the University of Macau, Macau SAR, PRC applied their word segmenter to the National University of Singapore corpus.