Published online by Cambridge University Press: 03 January 2019
Principal component analysis (PCA) is widely used to summarize data and reduce dimensionality. However, one disadvantage of PCA is that the principal components (PCs) are difficult to interpret, especially in a high-dimensional database. This study aims to analyze the patterns of variance accumulation according to PCA loadings and to approximate PCs with input variables from sample data sets.
Three data sets of various sizes were used to assess the performance of PC approximation: Hitters; the SF-12v2 subset of the 2004 to 2011 Medical Expenditure Panel Survey (MEPS); and the full set of 1996 to 2011 MEPS data. The variables in all three data sets were centered and scaled before PCA. PC approximation was studied with two approaches. First, the PC loadings were squared to estimate the variance contributed by each variable to each PC. Second, forward-stepwise regression was used to approximate PCs with all input variables as candidate predictors.
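The two approaches can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the helper names (`standardize`, `pca`, `r_squared`, `forward_stepwise`) are hypothetical, and the greedy R-squared selection rule and the 0.8 stopping threshold are assumptions chosen to mirror the eighty percent figure reported in the results.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca(Z):
    """PCA of standardized data via SVD: loadings, scores, PC variances."""
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt.T                        # columns are unit-length loading vectors
    scores = Z @ loadings                  # PC scores, one column per PC
    variances = s**2 / (Z.shape[0] - 1)    # variance carried by each PC
    return loadings, scores, variances

def r_squared(A, y):
    """R^2 of least-squares regression of y on A (both assumed centered)."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / (y @ y)

def forward_stepwise(Z, y, target=0.8):
    """Greedily add the variable that raises R^2 most until `target` is met."""
    selected, path = [], []
    remaining = list(range(Z.shape[1]))
    while remaining:
        best_j, best_r2 = max(
            ((j, r_squared(Z[:, selected + [j]], y)) for j in remaining),
            key=lambda t: t[1],
        )
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(best_r2)
        if best_r2 >= target:
            break
    return selected, path

# Synthetic stand-in data (assumption): 200 rows, 6 correlated variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(200, 6))

Z = standardize(X)
loadings, scores, variances = pca(Z)

# Approach 1: squared loadings of PC1, accumulated from largest to smallest.
sq = loadings[:, 0] ** 2                   # sums to 1 for each PC
order = np.argsort(sq)[::-1]
cumulative = np.cumsum(sq[order])

# Approach 2: forward-stepwise regression of PC1 scores on the input variables.
selected, path = forward_stepwise(Z, scores[:, 0], target=0.8)

print("variables ranked by squared loading:", order.tolist())
print("cumulative squared loadings:", np.round(cumulative, 3).tolist())
print("stepwise selection order:", selected)
print("R^2 after each step:", np.round(path, 3).tolist())
```

Because every PC is an exact linear combination of the standardized variables, the stepwise R-squared reaches 1 once all variables are included, so the loop always terminates at or before the full model. Comparing how quickly `cumulative` and `path` approach the threshold gives a rough analogue of the comparison described above.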
The first few PCs accounted for large portions of the total variance in each data set. In all three data sets, approximating PCs with stepwise regression identified the input variables that explain large portions of PC variance more efficiently than approximating according to PCA loadings. Only a few variables were needed to explain more than eighty percent of the PC variance.
Approximating and interpreting PCs with stepwise regression is highly feasible. Approximating PCs can help i) interpret PCs in terms of input variables, ii) understand the major sources of variance in data sets, iii) select unique sources of information and iv) search for and rank input variables according to the proportion of PC variance explained. The approach offers a systematic way to understand databases and to search for variables that are highly representative of them.