The Talent system: TEXTRACT architecture and data model

MARY S. NEFF; ROY J. BYRD; BRANIMIR K. BOGURAEV

doi:10.1017/S1351324904003493

Published online by Cambridge University Press: 11 October 2004

MARY S. NEFF: Affiliation:
IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA
ROY J. BYRD: Affiliation:
IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA
BRANIMIR K. BOGURAEV: Affiliation:
IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA

Abstract

We present the architecture and data model for TEXTRACT, a robust, scalable and configurable document analysis framework. TEXTRACT is engineered as a pipeline architecture, which allows rapid prototyping and application development by freely mixing existing, reusable language analysis plugins with new, custom plugins with customizable functionality. We discuss design issues that arise from requirements of industrial-strength efficiency and scalability, and that are further constrained by plugin interactions, both among the plugins themselves and with a common data model comprising an annotation store, a document vocabulary and a lexical cache. We exemplify some of these issues by focusing on a meta-plugin: an interpreter for annotation-based finite state transduction, through which many linguistic filters can be implemented as stand-alone plugins. The framework and its component plugins have been extensively deployed in both research and industrial environments, for a broad range of text analysis and mining tasks.
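
As a rough illustration of the pipeline idea described in the abstract, the following minimal Python sketch shows plugins run in sequence over a shared data model that holds an annotation store, a document vocabulary and a lexical cache. It is not TEXTRACT's actual API: every name here (`Annotation`, `DataModel`, `tokenizer`, `vocabulary_builder`, `capitalized_filter`, `run_pipeline`) is hypothetical, and the plugins are deliberately trivial.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    """A typed span over the document text (hypothetical structure)."""
    start: int
    end: int
    type: str


@dataclass
class DataModel:
    """Shared state that every plugin reads from and writes to (assumed layout)."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)  # annotation store
    vocabulary: Dict[str, int] = field(default_factory=dict)     # document vocabulary
    lexical_cache: Dict[str, str] = field(default_factory=dict)  # surface form -> normalized form


# In this sketch a plugin is simply a callable that mutates the data model.
Plugin = Callable[[DataModel], None]


def tokenizer(doc: DataModel) -> None:
    """Add one 'token' annotation per whitespace-delimited word."""
    pos = 0
    for word in doc.text.split():
        start = doc.text.index(word, pos)
        end = start + len(word)
        doc.annotations.append(Annotation(start, end, "token"))
        pos = end


def vocabulary_builder(doc: DataModel) -> None:
    """Count normalized tokens, reusing cached normalizations where available."""
    for ann in doc.annotations:
        if ann.type == "token":
            surface = doc.text[ann.start:ann.end]
            key = doc.lexical_cache.setdefault(surface, surface.lower())
            doc.vocabulary[key] = doc.vocabulary.get(key, 0) + 1


def capitalized_filter(doc: DataModel) -> None:
    """Layer a 'capitalized' annotation over tokens that start with an uppercase letter."""
    for ann in list(doc.annotations):  # copy, since we append while iterating
        if ann.type == "token" and doc.text[ann.start].isupper():
            doc.annotations.append(Annotation(ann.start, ann.end, "capitalized"))


def run_pipeline(text: str, plugins: List[Plugin]) -> DataModel:
    """Run each plugin in order over a single shared data model."""
    doc = DataModel(text)
    for plugin in plugins:
        plugin(doc)
    return doc
```

Calling `run_pipeline("TEXTRACT analyzes text and text", [tokenizer, vocabulary_builder, capitalized_filter])` returns a data model in which later plugins have built on the annotations left by earlier ones. Plugin order matters in this sketch, which is one simple instance of the plugin interaction constraints the abstract mentions.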

Type: Papers
Information: Natural Language Engineering, Volume 10, Issue 3-4, September 2004, pp. 307-326

DOI: https://doi.org/10.1017/S1351324904003493
