Data Analysis Phase

Data Archiving and Networked Services

doi:10.1017/9789048513802.006

In this chapter, we turn to important issues that should be addressed during the analysis phase when project staff are actively working with data files to investigate their research questions.

Master datasets and work files

As analysis proceeds, there will be various changes, additions, and deletions to the dataset. Despite the most rigorous data cleaning, additional errors will undoubtedly be discovered. The need to construct new variables might arise. Staff members might want to subset the data by cases and/or variables. Thus, there is a good chance that before long multiple versions of the dataset will be in use. It is not uncommon for a research group to discover that when it comes time to prepare a final version of the data for archiving, there are multiple versions that must be merged to include all of the newly created variables. This problem can be avoided to a degree if the research files are stored on a network where a single version of the data is maintained.

It is a good practice to maintain a master version of the dataset that is stored on a read-only basis. Only one or two staff members should be allowed to change this dataset. Ideally, this dataset should be the basis of all analyses, and other staff members should be discouraged from making copies of it. If a particular user of the data wants to create new variables and save them, a choice should be made between creating a work file for that researcher or adding the new variables to the master dataset. If the latter route is chosen, then all of the standard checks for outliers, inconsistencies, and the like need to be made on the new variables, and full documentation should be prepared. The final dataset reflecting published analyses is the version to archive.

Data and documentation versioning

One way to keep track of changes is to maintain explicit versions of a dataset. The first version might come from the data collection process, the second version from data cleaning, the third from composite variable construction, and so forth. With explicit version numbers, which are reflected in dataset names, it becomes easier to match documentation to datasets and to keep track of what was done by whom and when.

Book contents

5 - Data Analysis Phase

Summary

Access options

Book contents

5 - Data Analysis Phase

Summary

Access options

Save book to Kindle

Save book to Dropbox

Save book to Google Drive