Segmentation and alignment of parallel text for statistical machine translation

YONGGANG DENG; SHANKAR KUMAR; WILLIAM BYRNE

doi:10.1017/S1351324906004293

Published online by Cambridge University Press: 06 July 2006

YONGGANG DENG: Affiliation:
Center for Language and Speech Processing, Department of Electrical and Computer Engineering, The Johns Hopkins University, 3400 N. Charles St., Baltimore, MD 21218, USA
SHANKAR KUMAR: Affiliation:
Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
WILLIAM BYRNE: Affiliation:
Department of Engineering, Cambridge University, Trumpington Street, Cambridge CB2 1PZ, UK

Abstract

We address the problem of extracting bilingual chunk pairs from parallel text to create training sets for statistical machine translation. We formulate the problem in terms of a stochastic generative process over text translation pairs, and derive two alignment procedures from the underlying alignment model. The first is a now-standard dynamic programming alignment procedure, which we use to generate an initial coarse alignment of the parallel text. The second is a divisive clustering procedure, which we use to refine the first-pass alignments. This latter procedure is novel in that it permits the segmentation of the parallel text into sub-sentence units, which may be reordered to improve the chunk alignment. The quality of the chunk pairs is measured by the performance of machine translation systems trained on them. We show the practical benefits of divisive clustering, as well as how system performance can be improved by exploiting portions of the parallel text that would otherwise have to be discarded. We also show that chunk alignment as a first step in word alignment can significantly reduce the word alignment error rate.
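
To make the first-pass procedure more concrete, the following is a minimal sketch of a monotone dynamic programming alignment over two lists of sentences. It is not the authors' model: the cost is a toy length-mismatch score standing in for the probabilities of the generative model, and the names `BEADS`, `bead_cost` and `align`, the set of allowed alignment types, and the penalty values are all hypothetical choices made for illustration.

```python
from math import inf

# Allowed alignment types: (source sentences consumed, target sentences consumed)
# mapped to a fixed penalty. Both the set and the penalties are assumptions.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def bead_cost(src_len, tgt_len, penalty):
    """Toy cost: relative word-count mismatch plus the penalty for this bead type."""
    total = src_len + tgt_len
    mismatch = abs(src_len - tgt_len) / total if total else 0.0
    return mismatch + penalty


def align(src, tgt):
    """Return the lowest-cost monotone alignment as a list of (src_chunk, tgt_chunk)."""
    n, m = len(src), len(tgt)
    slen = [len(s.split()) for s in src]
    tlen = [len(t.split()) for t in tgt]
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    # Row-major order guarantees every predecessor cell is final before it is used.
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = bead_cost(sum(slen[i:ni]), sum(tlen[j:nj]), penalty)
                if best[i][j] + step < best[ni][nj]:
                    best[ni][nj] = best[i][j] + step
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end of both texts to recover the chunk pairs.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    source = ["a b c d", "e f g h"]
    target = ["p q r s t u v w"]
    for src_chunk, tgt_chunk in align(source, target):
        print(src_chunk, "<->", tgt_chunk)
```

The sketch only yields a coarse, order-preserving alignment of whole sentences. The divisive clustering refinement described in the abstract, which splits aligned chunks into sub-sentence units and allows reordering, is not covered here.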

Type: Papers
Information: Natural Language Engineering, Volume 13, Issue 3, September 2007, pp. 235-260

DOI: https://doi.org/10.1017/S1351324906004293

Footnotes

This work was supported by ONR MURI grant N00014-01-1-0685.
