Published online by Cambridge University Press: 24 March 2017
Abstract
The Big Data era creates many exciting opportunities for new developments in economics and econometrics. At the same time, however, the analysis of large datasets poses difficult methodological problems that must be addressed appropriately; these are the subject of the present chapter.
Introduction
‘Big Data’ has become a buzzword in academic as well as in business and policy circles. It is used to cover a variety of data-driven phenomena that have very different implications for empirical methods. This chapter discusses some of these methodological challenges.
In the simplest case, ‘Big Data’ means a large dataset that otherwise has a standard structure. For example, Chapter 13 describes how researchers are gaining increasing access to administrative datasets or business records covering entire populations rather than population samples. The size of these datasets allows for better controls and more precise estimates and is a bonus for researchers. This may raise challenges for data storage and handling, but does not raise any distinct methodological issues.
However, ‘Big Data’ often means much more than just large versions of standard datasets. First, large numbers of units of observation often come with large numbers of variables, that is, large numbers of possible covariates. To illustrate with the same example, the possibility of linking different administrative datasets increases the number of variables attached to each statistical unit. Likewise, business records typically contain all consumer interactions with the business. This can create a tension in estimation between the objective of ‘letting the data speak’ and obtaining accurate (in a way to be specified later) coefficient estimates. Second, Big Data sets often have a very different structure from those we are used to in economics; examples include web search queries, real-time geolocation data and social media. These data raise questions about how to structure and possibly re-aggregate them.
The chapter starts with a description of the ‘curse of dimensionality’, which arises from the fact that both the number of units of observation and the number of variables associated with each unit are large. This feature is present in many of the Big Data applications of interest to economists. One extreme example of this problem occurs when there are more parameters to estimate than observations.
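To make the extreme case concrete, the following Python sketch (illustrative only, not taken from the chapter; the data-generating process, the choice of the lasso and the penalty value are assumptions) shows why ordinary least squares breaks down when there are more covariates than observations, while a penalised estimator can still recover a sparse set of coefficients.

```python
# Minimal sketch of the p > n problem: OLS is not identified,
# a penalised estimator (here the lasso) still yields stable,
# sparse coefficient estimates.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 500                              # more covariates than observations
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]       # only 5 covariates truly matter (assumed)
y = X @ beta + rng.standard_normal(n)

# X'X is 500 x 500 but has rank at most 100, so it is singular:
# the OLS coefficients are not uniquely determined.
rank = np.linalg.matrix_rank(X.T @ X)
print(f"rank of X'X: {rank} (dimension {p}) -> OLS not identified")

# The lasso adds an L1 penalty, which regularises the problem and sets
# most coefficients exactly to zero, selecting a small set of covariates.
lasso = Lasso(alpha=0.1).fit(X, y)
nonzero = np.flatnonzero(lasso.coef_)
print(f"lasso selects {nonzero.size} covariates, e.g. indices {nonzero[:10]}")
```

In this stylised setting the lasso recovers (approximately) the five relevant covariates out of 500, which is one way of reconciling ‘letting the data speak’ with accurate coefficient estimates when the number of candidate variables is large.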