Information representation

doi:10.29085/9781856049740.007

Introduction

We build information retrieval systems to help people satisfy their information needs. Although new computer interfaces have increased the ways in which users can satisfy those needs, the predominant interfaces still require users to describe their needs with words. Similarly, most retrieval systems represent the items in their collection using words. The words provided by the user are then compared to the words attached to the items in the collection. If the user's words match those of an item, then that item might be what the user wants. If the words do not match, then the item is not returned by the retrieval system. Although this explanation simplifies matters, the way we represent documents is the first step towards obtaining a high-quality retrieval system. We begin by discussing issues in representing text items that pertain to both manual and automatic representation techniques.

Text representation

Textual items in our collections could be books, journal articles, web pages, XML documents, emails, word processing files and so forth. These items vary in length and structure and may contain non-text items such as images. In all cases, we will refer to the text items in a collection as documents or items, but it is important to remember that there is a large variety of possible text items.

Although there are many possible representations of documents, our focus will be on representations that use words or tokens derived from words. The process of deciding which words to use to describe a document is called indexing, and the chosen words are called index terms. Sometimes we want to represent documents with more than just words, and then it makes sense to talk about features, which are more generic than index terms. An example of a non-word feature is the number of words in a document.
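As a rough illustration of the distinction between index terms and more generic features, the following minimal sketch represents a single document by its distinct lowercase word tokens and by one non-word feature, its word count. The function name `represent`, the dictionary keys and the tokenization rule (runs of letters and digits) are assumptions made for this example only; a real indexing process would make more careful decisions about which words to keep.

```python
import re


def represent(text):
    """Return a simple feature representation of one document.

    Assumption for this sketch: a token is any run of ASCII letters
    or digits, compared case-insensitively.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {
        # Index terms: the distinct words chosen to describe the document.
        "index_terms": sorted(set(tokens)),
        # A non-word feature: the number of words in the document.
        "word_count": len(tokens),
    }


doc = "Information retrieval systems help people satisfy information needs."
features = represent(doc)
print(features["index_terms"])
print(features["word_count"])
```

Here every distinct word becomes an index term, which is the simplest possible indexing decision; the word count shows how a feature can describe a document without being a word itself.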

When we, as humans, manually index documents, we look at the item, read it, and then make decisions ourselves about what index terms to use. When we automatically index, we write computer algorithms to process digital forms of the documents and make the decisions about index terms.
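Automatic indexing, and the word matching that a retrieval system performs on top of it, might be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular system: the helper names `tokenize`, `build_index` and `search`, the sample documents and the rule that a document is returned when it shares at least one word with the query are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (an assumed, simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Automatically index documents: map each index term to document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that share at least one word with the query."""
    matches = set()
    for term in tokenize(query):
        # .get avoids creating empty entries for unknown query words.
        matches |= index.get(term, set())
    return sorted(matches)


# Hypothetical three-document collection.
documents = {
    "d1": "Indexing web pages and XML documents",
    "d2": "Email archives and word processing files",
    "d3": "Images in journal articles",
}
index = build_index(documents)
print(search(index, "XML email"))
print(search(index, "books"))
```

The algorithm makes every indexing decision itself, with no human reading the documents. A query whose words appear in no document, such as "books" here, returns nothing, which mirrors the behaviour described in the introduction.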
