Dropped personal pronoun recovery in Chinese SMS*

Published online by Cambridge University Press: 30 May 2017

CHRIS GIANNELLA
Affiliation:
Department of Human Language Technology, The MITRE Corporation, 7515 Colshire Drive, McLean, VA, 22102, USA
RANSOM WINDER
Affiliation:
Department of Human Language Technology, The MITRE Corporation, 7515 Colshire Drive, McLean, VA, 22102, USA
STACY PETERSEN
Affiliation:
Department of Human Language Technology, The MITRE Corporation, 7515 Colshire Drive, McLean, VA, 22102, USA

Abstract

In written Chinese, personal pronouns are commonly dropped when they can be inferred from context. This practice is particularly common in informal genres like Short Message Service messages sent via cell phones. Restoring dropped personal pronouns can be a useful preprocessing step for information extraction. Dropped personal pronoun recovery can be divided into two subtasks: (1) detecting dropped personal pronoun slots and (2) determining the identity of the pronoun for each slot. We address a simpler version of restoring dropped personal pronouns wherein only the person numbers are identified. After applying a word segmenter, we used a linear-chain conditional random field to predict which words were at the start of an independent clause. Then, using the independent clause start information, as well as lexical and syntactic information, we applied a conditional random field or a maximum-entropy classifier to predict whether a dropped personal pronoun immediately preceded each word and, if so, the person number of the dropped pronoun. We conducted a series of experiments using a manually annotated corpus of Chinese Short Message Service messages. Our approaches substantially outperformed a rule-based approach based partially on rules developed by Chung and Gildea (2010, Effects of Empty Categories on Machine Translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. pp. 636–45). Our approaches also outperformed (though by a considerably smaller margin) a machine-learning approach based closely on work by Yang, Liu, and Xue (2015, Recovering Dropped Pronouns from Chinese Text Messages. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics. pp. 309–13). Features derived from parsing largely did not help our approaches.
We conclude that, given independent clause start information, the parse information we used was largely superfluous for identifying dropped personal pronouns.
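
The per-word labeling described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical annotation format in which dropped pronouns appear as marker tokens (`*pro-1SG*` and so on) in an already segmented message; the marker names, label names, and feature names below are illustrative assumptions, not the authors' actual formats or feature set. The sketch converts an annotated token sequence into (word, label) pairs, where the label records whether a dropped pronoun immediately precedes the word, and builds a simple feature dictionary that includes an independent-clause-start flag.

```python
from typing import Dict, List, Tuple

# Hypothetical marker tokens for dropped pronouns, mapped to hypothetical
# person-number labels. The paper's real annotation scheme is not shown here.
MARKERS: Dict[str, str] = {
    "*pro-1SG*": "DP-1SG",
    "*pro-2SG*": "DP-2SG",
    "*pro-3SG*": "DP-3SG",
}
NONE_LABEL = "NONE"


def to_word_labels(annotated: List[str]) -> List[Tuple[str, str]]:
    """Remove marker tokens and label the word that follows each marker.

    A marker with no following word is discarded, since there is no word
    for the label to attach to.
    """
    pairs: List[Tuple[str, str]] = []
    pending = NONE_LABEL
    for token in annotated:
        if token in MARKERS:
            pending = MARKERS[token]  # remember the label for the next word
            continue
        pairs.append((token, pending))
        pending = NONE_LABEL
    return pairs


def word_features(words: List[str], clause_starts: List[bool], i: int) -> Dict[str, object]:
    """Build illustrative lexical features plus a clause-start flag for word i."""
    return {
        "word": words[i],
        "prev_word": words[i - 1] if i > 0 else "<BOS>",
        "next_word": words[i + 1] if i + 1 < len(words) else "<EOS>",
        "clause_start": clause_starts[i],
    }


if __name__ == "__main__":
    # "(You) eaten yet?" with the second-person subject dropped.
    annotated = ["*pro-2SG*", "吃饭", "了", "吗"]
    pairs = to_word_labels(annotated)
    words = [w for w, _ in pairs]
    # Assumed output of a separate clause-start predictor.
    clause_starts = [True, False, False]
    for i, (word, label) in enumerate(pairs):
        print(word, label, word_features(words, clause_starts, i))
```

In a full system, the feature dictionaries and labels would be passed to a sequence labeler or a maximum-entropy classifier; that training step is omitted here because the toolkit configuration is not given in this section.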

Type
Articles
Copyright
Copyright © Cambridge University Press 2017 

Footnotes

1

Also affiliated with The Dept. of Linguistics, Georgetown University, 3700 O Street NW, Washington DC USA.

*

We are thankful for the assistance provided by our MITRE colleagues. Dr Sichu Li annotated, in efficient and professional fashion, a large subset of the SMS we downloaded from the National University of Singapore. Dr John Prange, Mr Rob Case, and Mr Rod Holland provided valuable feedback on a presentation we gave describing our preliminary research findings. We are also thankful for the assistance provided by our colleagues at other institutions. Professor Nianwen (Bert) Xue at Brandeis University, Boston, USA shared his thoughts and expertise on Chinese dropped pronoun detection, at an early stage of our research. Professor Derek F. Wong and Mr Junwen Xing at the University of Macau, Macau, SAR PRC applied their word segmenter to the National University of Singapore corpus.

References

Baran, E., Yang, Y., and Xue, N. 2012. Annotating dropped pronouns in Chinese newswire text. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA). pp. 2795–9.
Cai, S., Chiang, D., and Goldberg, Y. 2011. Language-independent parsing with empty elements. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 212–6.
Chen, C., and Ng, V. 2013. Chinese zero pronoun resolution: some recent advances. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 1360–5.
Chen, T., and Kan, M.-Y. 2013. Creating a live, public short message service corpus: the NUS SMS corpus. Language Resources and Evaluation 47 (2): 299–335. doi: 10.1007/s10579-012-9197-9.
Chen, C., and Ng, V. 2014. Chinese zero pronoun resolution: an unsupervised approach combining ranking and integer linear programming. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, Palo Alto, CA USA, Association for the Advancement of Artificial Intelligence Press. pp. 1622–8.
Chung, T., and Gildea, D. 2010. Effects of empty categories on machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 636–45.
Edington, E., and Onghena, P. 2007. Randomization Tests, 4th ed. Boca Raton, FL, USA: CRC Press, Taylor & Francis Group. ISBN: 978-1-58488-589-4.
Grosz, B., Joshi, A., and Weinstein, S. 1995. Centering: a framework for modeling the local coherence of discourse. Computational Linguistics 21 (2): 203–25.
Huang, C. T. J. 1989. Pro-drop in Chinese: a generalized control theory. In Jaeggli, O. and Safir, K. (eds.), Studies in Natural Language and Linguistic Theory: The Null Subject Parameter, vol. 15, pp. 185–214. Netherlands: Springer. doi: 10.1007/978-94-009-2540-3_6.
Kawahara, D., and Kurohashi, S. 2005. Zero pronoun resolution based on automatically constructed case frames and structural preference of antecedents. In Su, K.-Y., Tsujii, J., Lee, J.-L., and Kwong, O. Y. (eds.), Lecture Notes in Computer Science, vol. 3248, pp. 12–21. Berlin Heidelberg: Springer. doi: 10.1007/978-3-540-30211-7_2.
Kong, F., and Zhou, G. 2010. A tree kernel-based unified framework for Chinese zero anaphora resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 882–91.
Kong, F., and Zhou, G. 2013. A clause-level hybrid approach to Chinese empty element recovery. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), Palo Alto, CA USA, Association for the Advancement of Artificial Intelligence Press. pp. 2113–9.
Lafferty, J., McCallum, A., and Pereira, F. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), Burlington, MA USA, Morgan Kaufmann. pp. 282–9.
Levy, R., and Andrew, G. 2006. Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA). pp. 2231–4.
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 55–60.
McCallum, A. 2002. MALLET: a machine learning for language toolkit. Accessed July 16, 2013. http://mallet.cs.umass.edu.
Rahman, A., and Ng, V. 2012. Translation-based projection for multilingual coreference resolution. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Palo Alto, CA USA, Association for Computational Linguistics. pp. 1051–60.
Rao, S., Ettinger, A., Daume, H. III, and Resnik, P. 2015. Dialogue focus tracking for zero pronoun resolution. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Palo Alto, CA USA, Association for Computational Linguistics. pp. 494–502.
Sasano, R., and Kurohashi, S. 2011. A discriminative approach to Japanese zero anaphora resolution with large-scale lexicalized case frames. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 758–66.
Seki, K., Fujii, A., and Ishikawa, T. 2002. A probabilistic method for analyzing Japanese anaphora integrating zero pronoun detection and resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 1–7.
Wang, L., Wong, D., Chao, L., and Xing, J. 2012. CRFs-based Chinese word segmentation for micro-blog with small-scale data. In Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing, Stroudsburg, PA USA, Association for Computational Linguistics. pp. 51–7.
Xue, N., and Yang, Y. 2011. Chinese sentence segmentation as comma classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 631–5.
Xue, N., and Yang, Y. 2013. Dependency-based empty category detection via phrase structure trees. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 1051–60.
Xue, N., Xia, F., Huang, S., and Kroch, A. 2000. The bracketing guidelines for the Penn Chinese treebank (3.0). Technical Report No. IRCS-00-08, University of Pennsylvania Institute for Research in Cognitive Science. http://repository.upenn.edu/ircs_reports/39/.
Yang, W., Dai, R., and Cui, X. 2008. Zero pronoun resolution in Chinese using machine learning plus shallow parsing. In Proceedings of the IEEE International Conference on Information and Automation, New York, NY USA, Institute of Electrical and Electronics Engineers. pp. 905–10.
Yang, Y. 2014. Reading between the lines: recovering implicit information from Chinese texts. Ph.D. Dissertation, Department of Computer Science, Brandeis University.
Yang, Y., Liu, Y., and Xue, N. 2015. Recovering dropped pronouns from Chinese text messages. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 309–13.
Yang, Y., and Xue, N. 2010. Chasing the ghost: recovering empty categories in the Chinese treebank. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Beijing, P.R. CHINA, Tsinghua University Press. pp. 1382–90.
Yeh, A. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 947–53.
Yeh, C.-L., and Chen, Y.-C. 2007. Zero anaphora resolution in Chinese with shallow parsing. Journal of Chinese Language and Computing 17 (1): 41–56.
Zhao, S., and Ng, H.T. 2007. Identification and resolution of Chinese zero pronouns: a machine learning approach. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 541–50.