
STATISTICAL INFERENCE WITH F-STATISTICS WHEN FITTING SIMPLE MODELS TO HIGH-DIMENSIONAL DATA

Published online by Cambridge University Press: 27 September 2021

Hannes Leeb* (University of Vienna)
Lukas Steinberger (University of Vienna)

* Address correspondence to Hannes Leeb, Department of Statistics, University of Vienna, Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria; e-mail: [email protected]

Abstract


We study linear subset regression in the context of the high-dimensional overall model $y = \vartheta + \theta' z + \epsilon$ with univariate response y and a d-vector of random regressors z, independent of $\epsilon$. Here, "high-dimensional" means that the number d of available explanatory variables is much larger than the number n of observations. We consider simple linear submodels where y is regressed on a set of p regressors given by $x = M'z$, for some $d \times p$ matrix M of full rank $p < n$. The corresponding simple model, that is, $y = \alpha + \beta' x + e$, is usually justified by imposing appropriate restrictions on the unknown parameter $\theta$ in the overall model; otherwise, this simple model can be grossly misspecified in the sense that relevant variables may have been omitted. In this paper, we establish asymptotic validity of the standard F-test on the surrogate parameter $\beta$, in an appropriate sense, even when the simple model is misspecified, that is, without any restrictions on $\theta$ whatsoever and without assuming Gaussian data.
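The following is a minimal simulation sketch, not taken from the paper, that illustrates the setting described in the abstract: data are generated from a high-dimensional overall model with d much larger than n, a simple p-variable submodel with regressors $x = M'z$ is fitted by least squares, and the classical F-statistic for testing $\beta = 0$ is computed. The specific choices of M, the coefficient vector, and the sample sizes are illustrative assumptions.

```python
# Minimal simulation sketch (illustrative assumptions, not the paper's setup):
# fit a p-variable submodel to data from a high-dimensional overall model and
# compute the classical F-statistic for H0: beta = 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d, p = 100, 1000, 5                      # n observations, d >> n regressors, p-variable submodel

theta = rng.normal(size=d) / np.sqrt(d)     # dense overall coefficient vector (no restrictions imposed)
Z = rng.normal(size=(n, d))                 # high-dimensional random regressors z
y = 1.0 + Z @ theta + rng.normal(size=n)    # overall model: y = vartheta + theta'z + eps

M = np.zeros((d, p))                        # full-rank d x p selection matrix M (here: first p coordinates of z)
M[:p, :p] = np.eye(p)
X = Z @ M                                   # submodel regressors x = M'z

# Least-squares fit of the (possibly misspecified) simple model y = alpha + beta'x + e
Xc = np.column_stack([np.ones(n), X])
coef, rss, _, _ = np.linalg.lstsq(Xc, y, rcond=None)
rss1 = float(rss[0])                        # residual sum of squares of the fitted submodel
rss0 = float(np.sum((y - y.mean()) ** 2))   # RSS under H0: beta = 0 (intercept-only fit)

F = ((rss0 - rss1) / p) / (rss1 / (n - p - 1))   # standard F-statistic with (p, n - p - 1) degrees of freedom
p_value = stats.f.sf(F, p, n - p - 1)
print(f"F = {F:.3f}, p-value = {p_value:.3f}")
```

The paper's result concerns the validity of exactly this kind of textbook F-test about the surrogate parameter $\beta$ when the submodel omits relevant variables, so the statistic above is computed as in the correctly specified Gaussian case, without any adjustment for misspecification.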

Type
ARTICLES
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (https://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is included and the original work is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use.
Copyright
© The Author(s), 2021. Published by Cambridge University Press

Footnotes

The first author’s research was partially supported by FWF projects P 26354-N26 and P 28233-N32.
