Web archiving

doi:10.29085/9781856049863.007

Introduction

To conserve artefacts in the long term, it is necessary to understand not only their content but also their organization, including their physical structure. Librarians have been able to play their historic role as the preservers of books because they have an in-depth understanding not only of the content of books but also of their tabular organization and their physical form. Based on this knowledge, they have built the methods, from collection development to preservation policy, including shelving and cataloguing, that enable them to fulfil their mission.

There is a wealth of information on the web. The size of the web is difficult to estimate, but it is certainly in the hundreds of terabytes (1 terabyte = 1024 gigabytes), if not in the petabytes (1 petabyte = 1024 terabytes). The largest libraries in the world hold tens of millions of volumes, which represent, if we consider the text only, tens of terabytes of information (1 megabyte per volume on average). If we take images into account, the total can be estimated at 50 times more. If we consider moving images, the information held by libraries is larger still. The Bibliothèque nationale de France (BnF), for instance, is running a digitization project for its videotape collection that will yield an estimated 400 TB of data. Nevertheless, it is reasonable to assume that the volume of information the web contains already equals, if not exceeds, that of the largest libraries in the world.
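The order-of-magnitude reasoning above can be checked with a short calculation. The following is only a sketch: the figure of 30 million volumes is a hypothetical stand-in for "tens of millions", the function names are illustrative, and the conversion 1 gigabyte = 1024 megabytes is assumed by analogy with the binary units given in the text.

```python
# Back-of-the-envelope estimate of library holdings, following the text's figures.
# Assumption: binary units throughout (1 GB = 1024 MB, 1 TB = 1024 GB, 1 PB = 1024 TB).
MB_PER_GB = 1024
GB_PER_TB = 1024
TB_PER_PB = 1024


def text_volume_tb(volumes, mb_per_volume=1.0):
    """Size in terabytes of the plain text of `volumes` books."""
    return volumes * mb_per_volume / (MB_PER_GB * GB_PER_TB)


def with_images_tb(text_tb, image_factor=50):
    """Scale a text-only estimate by the text's '50 times more' factor for images."""
    return text_tb * image_factor


if __name__ == "__main__":
    holdings = 30_000_000  # hypothetical: "tens of millions of volumes"
    text_tb = text_volume_tb(holdings)
    image_tb = with_images_tb(text_tb)
    print(f"text only:   {text_tb:.1f} TB")
    print(f"with images: {image_tb:.1f} TB ({image_tb / TB_PER_PB:.2f} PB)")
```

Under these assumptions the text-only figure lands in the tens of terabytes, and the image-inclusive figure passes the petabyte mark, which is the same range the text gives for the web.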

Content on the web has not, for the most part, been selected and edited. We can estimate that around 40 million active webmasters coordinate the output of hundreds of millions of contributors. A very conservative estimate of the number of web logs indicates that 10 million people use this new publication tool and thereby enrich the content available on the web.

In the analogue world, this impressive growth in size and in the number of producers would have been a catastrophe from the preservation standpoint. But because it has happened in a digital realm, the consequence is more a displacement of problems and practices than a disaster. The main reason is that the preservation of born-digital material can benefit from the same automatic processing and falling storage costs that have enabled such growth.
