5 - Information representation

Published online by Cambridge University Press:  08 June 2018

Mark D. Smucker
Affiliation:
University of Waterloo

Summary

Introduction

We build information retrieval systems to help people satisfy their information needs. Although new computer interfaces have increased the ways that users can satisfy their information needs, the predominant interfaces require users to describe their needs with words. Similarly, most retrieval systems represent the items in their collection using words. The words provided by the user are then compared to the words attached to the items in the collection. If the user's words match those of the item, then that item might be what the user wants. If the words do not match, then the item is not returned by the retrieval system. While this explanation is a simplification, the way we represent documents is the first step towards obtaining a high-quality retrieval system. We begin by discussing issues in representing text items that pertain to both manual and automatic representation techniques.
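
The simplified matching process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular retrieval system: the function names `tokenize` and `retrieve`, the lowercase alphanumeric tokenization, the rule that every query word must appear in an item, and the two sample items are all hypothetical choices made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens (an assumed, simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, collection):
    """Return the ids of items whose words include every word of the query."""
    query_words = set(tokenize(query))
    if not query_words:
        # An empty query matches nothing in this sketch.
        return []
    results = []
    for item_id, text in collection.items():
        item_words = set(tokenize(text))
        # The item is returned only if all query words occur among its words.
        if query_words <= item_words:
            results.append(item_id)
    return results


# Hypothetical two-item collection, invented for illustration.
collection = {
    "d1": "Information retrieval systems help people find documents.",
    "d2": "Cooking pasta requires boiling water.",
}

print(retrieve("retrieval systems", collection))
print(retrieve("pasta retrieval", collection))
```

The first query returns only the item that contains both of its words; the second returns nothing, because no single item contains both words. Real systems usually rank partial matches rather than discard them, which is one way in which the explanation above is a simplification.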

Text representation

Textual items in our collections could be books, journal articles, web pages, XML documents, emails, word processing files and so forth. These items vary in length and structure and may contain non-text items such as images. In all cases, we will refer to the text items in a collection as documents or items, but it is important to remember that there is a large variety of possible text items.

Although there are many possible representations of documents, our focus will be on representations that use words or tokens derived from words. The process of deciding which words to use to describe a document is called indexing, and the chosen words are called index terms. Sometimes we want to represent documents with more than words, and then it makes sense to talk about features, which are more generic than index terms. An example of a non-word feature is the number of words in a document.

When we, as humans, manually index documents, we look at the item, read it, and then make decisions ourselves about what index terms to use. When we automatically index, we write computer algorithms to process digital forms of the documents and make the decisions about index terms.
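
As a rough illustration of automatic indexing, the sketch below processes a document's text and makes a simple decision about which words become index terms. It also records the document's word count as a non-word feature. The function name `index_document`, the tiny stopword list, and the tokenization rule are assumptions introduced for the example; the chapter summary does not prescribe any of them.

```python
import re

# Hypothetical, deliberately tiny list of words we choose not to index.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def index_document(text):
    """Return (index_terms, features) for one document.

    index_terms: sorted distinct lowercase words, minus the assumed stopwords.
    features: non-word features; here only the total number of words.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    index_terms = sorted({t for t in tokens if t not in STOPWORDS})
    features = {"num_words": len(tokens)}
    return index_terms, features


terms, features = index_document("The cat and the hat")
print(terms)
print(features)
```

For the sample sentence, the chosen index terms are `cat` and `hat`, and the `num_words` feature is 5, since the word count includes the words that were not selected as index terms. A manual indexer makes the same kind of decision by reading the item; here the decision rule is fixed in code.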

Type
Chapter