Hostname: page-component-745bb68f8f-grxwn Total loading time: 0 Render date: 2025-01-11T06:59:20.048Z Has data issue: false hasContentIssue false

Linear algebra software for large-scale accelerated multicore computing*

Published online by Cambridge University Press:  23 May 2016

A. Abdelfattah
Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
H. Anzt
Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
J. Dongarra
Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
M. Gates
Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
A. Haidar
Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
J. Kurzak
Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
P. Luszczek
Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
S. Tomov
Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
I. Yamazaki
Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
A. YarKhan
Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems.

To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems.

Type
Research Article
Copyright
© Cambridge University Press, 2016 

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

REFERENCES18

Aasen, J. (1971), ‘On the reduction of a symmetric matrix to tridiagonal form’, BIT 11, 233242.CrossRefGoogle Scholar
Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S. and Tomov, S. (2010), Faster, cheaper, better: A hybridization methodology to develop linear algebra software for GPUs. In GPU Computing Gems (Mei, W. and Hwu, W., eds), Vol. 2, Morgan Kaufmann.Google Scholar
Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P. and Tomov, S. (2009a), ‘Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects’, J. Phys. Conf. Ser. 180, #1.Google Scholar
Agullo, E., Hadri, B., Ltaief, H. and Dongarra, J. (2009b), Comparative study of one-sided factorizations with multiple software packages on multi-core hardware. In SC ’09: Proc. Conference on High Performance Computing Networking, Storage and Analysis, ACM, #20.Google Scholar
Alvarado, F. L. and Schreiber, R. (1993), ‘Optimal parallel solution of sparse triangular systems’, SIAM J. Sci. Comput. 14, 446460.CrossRefGoogle Scholar
Amestoy, P. R., Duff, I. S. and L’Excellent, J.-Y. (2000), ‘Multifrontal parallel distributed symmetric and unsymmetric solvers’, Comput. Methods Appl. Mech. Eng. 184, 501520.Google Scholar
Amestoy, P. R., Duff, I. S., L’Excellent, J.-Y. and Koster, J. (2001), ‘A fully asynchronous multifrontal solver using distributed dynamic scheduling’, SIAM J. Matrix Anal. Appl. 23, 1541.Google Scholar
Amestoy, P. R., Guermouche, A., L’Excellent, J.-Y. and Pralet, S. (2006), ‘Hybrid scheduling for the parallel solution of linear systems’, Parallel Comput. 32, 136156.CrossRefGoogle Scholar
Anderson, E. and Dongarra, J. (1989), Evaluating block algorithm variants in LAPACK. In Proc. 4th Conference on Parallel Processing for Scientific Computing, pp. 38.Google Scholar
Anderson, E. and Dongarra, J. (1990a) Implementation guide for LAPACK. Technical report UT-CS-90-101, Computer Science Department, University of Tennessee. LAPACK Working Note 18.Google Scholar
Anderson, E. and Dongarra, J. (1990b), Evaluating block algorithm variants in LAPACK. Technical report UT-CS-90-103, Computer Science Department, University of Tennessee. LAPACK Working Note 19.Google Scholar
Anderson, E. and Saad, Y. (1989), ‘Solving sparse triangular systems on parallel computers’, Int. J. High Speed Comput. 1, 7396.Google Scholar
Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Ducroz, J., Greenbaum, A., Hammarling, S., McKenney, A. and Sorensen, D. (1999), LAPACK Users’ Guide, third edition, SIAM.Google Scholar
Anderson, M., Sheffield, D. and Keutzer, K. (2012), A predictive model for solving small linear algebra problems in GPU registers. In Proc. IEEE 26th International Parallel and Distributed Processing Symposium: IPDPS 2012, pp. 213.Google Scholar
Andrews, H. C. and Patterson, C. L. (1976), ‘Singular value decompositions and digital image processing’, IEEE Trans. Acoust. Speech Signal Process. 24, 2653.Google Scholar
Anzt, H. (2012) Asynchronous and multiprecision linear solvers: Scalable and fault-tolerant numerics for energy efficient high performance computing. PhD thesis, Institute for Applied and Numerical Mathematics, Karlsruhe Institute of Technology.Google Scholar
Anzt, H., Chow, E. and Dongarra, J. (2015), Iterative sparse triangular solves for preconditioning. In Euro-Par 2015: Parallel Processing, (Träff, J. L., Hunold, S. and Versaci, F., eds), Vol. 9233 of Lecture Notes in Computer Science, Springer, pp. 650661.Google Scholar
Anzt, H., Tomov, S., Dongarra, J. and Heuveline, V. (2013), ‘A block-asynchronous relaxation method for graphics processing units’, J. Parallel Distrib. Comput. 73, 16131626.Google Scholar
Arioli, M. and Duff, I. S. (2008) Using FGMRES to obtain backward stability in mixed precision. Technical report RAL-TR-2008-006, Rutherford Appleton Laboratory.Google Scholar
Arioli, M., Demmel, J. W. and Duff, I. S. (1989), ‘Solving sparse linear systems with sparse backward error’, SIAM J. Matrix Anal. Appl. 10, 165190.Google Scholar
Ashcraft, C., Grimes, R. and Lewis, J. (1998), ‘Accurate symmetric indefinite linear equation solvers’, SIAM J. Matrix Anal. Appl. 20, 513561.CrossRefGoogle Scholar
Ashcraft, C., Grimes, R., Lewis, J., Peyton, B. W. and Simon, H. (1987), ‘Progress in sparse matrix methods in large sparse linear systems on vector supercomputers’, Int. J. Supercomput. Appl. 1, 1030.Google Scholar
Axelsson, O. and Vassilevski, P. S. (1991), ‘A black box generalized conjugate gradient solver with inner iterations and variable-step preconditioning’, SIAM J. Matrix Anal. Appl. 12, 625644.Google Scholar
Baboulin, M., Becker, D. and Dongarra, J. (2012), A parallel tiled solver for dense symmetric indefinite systems on multicore architectures. In Proc. IEEE 26th International Parallel and Distributed Processing Symposium: IPDPS 2012, pp. 1424. LAPACK Working Note 261.Google Scholar
Baboulin, M., Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Langou, J., Luszczek, P. and Tomov, S. (2009), ‘Accelerating scientific computations with mixed precision algorithms’, Comput. Phys. Comm. 180, 25262533.Google Scholar
Baboulin, M., Dongarra, J., Herrmann, J. and Tomov, S. (2013), ‘Accelerating linear system solutions using randomization techniques’, ACM Trans. Math. Softw. 39, #8.Google Scholar
Ballard, G., Becker, D., Demmel, J., Dongarra, J., Druinsky, A., Peled, I., Schwartz, O., Toledo, S. and Yamazaki, I. (2013), Implementing a blocked Aasen’s algorithm with a dynamic scheduler on multicore architectures. In Proc. IEEE 27th International Parallel and Distributed Processing Symposium: IPDPS, pp. 895907.Google Scholar
Ballard, G., Becker, D., Demmel, J., Dongarra, J., Druinsky, A., Peled, I., Schwartz, O., Toledo, S. and Yamazaki, I. (2014), ‘Communication-avoiding symmetric-indefinite factorization’, SIAM J. Matrix Anal. Appl. 35, 13641406.Google Scholar
Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C. and van der Vorst, H. (1994a), Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, second edition, SIAM. Postscript file available at www.netlib.org/templates/Templates.htmlGoogle Scholar
Barrett, R. F., Chan, T. H. F., D’Azevedo, E. F., Jaeger, E. F., Wong, K. and Wong, R. Y. (2010), ‘Complex version of high performance computing LINPACK benchmark (HPL)’, Concurrency Computat. Pract. Exper. 22, 573587.Google Scholar
Bartels, R. H., Golub, G. H. and Saunders, M. (1971), Numerical techniques in mathematical programming. In Nonlinear Programming, Academic Press, pp. 123176.Google Scholar
Becker, D., Baboulin, M. and Dongarra, J. (2012), Reducing the amount of pivoting in symmetric indefinite systems. In 9th International Conference on Parallel Processing and Applied Mathematics: PPAM 2011, (Wyrzykowski, R., Dongarra, J., Karczewski, K. and Wasniewski, J., eds), Vol. 7203 of Lecture Notes in Computer Science, Springer, pp. 133142.Google Scholar
Bendali, A., Boubendir, Y. and Fares, M. (2007), ‘A FETI-like domain decomposition method for coupling finite elements and boundary elements in large-size problems of acoustic scattering’, Comput. Struct. 85, 526535.CrossRefGoogle Scholar
Benzi, M., Joubert, W. and Mateescu, G. (1999), ‘Numerical experiments with parallel orderings for ILU preconditioners’, Electron. Trans. Numer. Anal. 8, 88114.Google Scholar
Bergman, K. et al. (2008), Exascale computing study: Technology challenges in achieving exascale systems. DARPA IPTO ExaScale Computing Study.Google Scholar
Bientinesi, P., Igual, F. D., Kressner, D. and Quintana-Ortí, E. S. (2010), Reduction to condensed forms for symmetric eigenvalue problems on multi-core architectures. In Proc. 8th international Conference on Parallel Processing and Applied Mathematics: PPAM 2009, Part I, Springer, pp. 387395.CrossRefGoogle Scholar
Bischof, C. (1993), A summary of block schemes for reducing a general matrix to Hessenberg form. Technical report ANL/MCS-TM-175, Argonne National Laboratory.Google Scholar
Bischof, C. and Van Loan, C. (1987), ‘The WY representation for products of Householder matrices’, SIAM J. Sci. Statist. Comput. 8, s2s13.Google Scholar
Bischof, C. H., Lang, B. and Sun, X. (2000), ‘Algorithm 807: The SBR Toolbox – software for successive band reduction’, ACM Trans. Math. Softw. 26, 602616.Google Scholar
Björck, Å (1996), Numerical Methods for Least Squares Problems, SIAM.CrossRefGoogle Scholar
Blackford, L. S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J. J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D. and Whaley, R. C. (1997), ScaLAPACK Users’ Guide, SIAM. www.netlib.org/scalapack/slug/Google Scholar
Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H. and Zhou, Y. (1995), Cilk: An efficient multithreaded runtime system. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: PPOPP ’95, ACM, pp. 207216.Google Scholar
Braman, K., Byers, R. and Mathias, R. (2002a), ‘The multishift QR algorithm I: Maintaining well-focused shifts and level 3 performance’, SIAM J. Matrix Anal. Appl. 23, 929947.Google Scholar
Braman, K., Byers, R. and Mathias, R. (2002b), ‘The multishift QR algorithm II: Aggressive early deflation’, SIAM J. Matrix Anal. Appl. 23, 948973.Google Scholar
Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S. and Namyst, R. (2010), hwloc: a generic framework for managing hardware affinities in HPC applications. In Proc. IEEE 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing: PDP 2010, pp. 180186.Google Scholar
Bunch, J. and Kaufman, L. (1977), ‘Some stable methods for calculating inertia and solving symmetric linear systems’, Math. Comp. 31, 163179.Google Scholar
Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Luszczek, P. and Tomov, S. (2006), The impact of multicore on math software. In Applied Parallel Computing: State of the Art in Scientific Computing, PARA 2006, (Kågström, B., Elmroth, E., Dongarra, J. and Wasniewski, J., eds), Vol. 4699 of Lecture Notes in Computer Science, Springer, pp. 110.Google Scholar
Buttari, A., Dongarra, J., Kurzak, J., Luszczek, P. and Tomov, S. (2008a), ‘Using mixed precision for sparse matrix computations to enhance the performance while achieving 64-bit accuracy’, ACM Trans. Math. Softw. 34, #17.Google Scholar
Buttari, A., Dongarra, J., Langou, J., Langou, J., Luszczek, P. and Kurzak, J. (2007), ‘Mixed precision iterative refinement techniques for the solution of dense linear systems’, Int. J. High Performance Comput. Appl. 21, 457466.Google Scholar
Buttari, A., Langou, J., Kurzak, J. and Dongarra, J. (2008b), ‘Parallel tiled QR factorization for multicore architectures’, Concurrency Computat. Pract. Exper. 20, 15731590.CrossRefGoogle Scholar
Buttari, A., Langou, J., Kurzak, J. and Dongarra, J. (2009), ‘A class of parallel tiled linear algebra algorithms for multicore architectures’, Parallel Comput. Syst. Appl. 35, 3853.Google Scholar
Castaldo, A. and Whaley, R. (2010), Scaling LAPACK panel operations using parallel cache assignment. In Proc. 15th AGM SIGPLAN Symposium on Principle and Practice of Parallel Programming, pp. 223232.Google Scholar
Chan, E., Quintana-Orti, E. S., Quintana-Orti, G. and van de Geijn, R. (2007), Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA ’07: Proc. 19th Annual ACM Symposium on Parallel Algorithms and Architectures, ACM, pp. 116125.Google Scholar
Choi, J. (1995), A proposal for a set of parallel basic linear algebra subprograms. Technical report UT-CS-95-292, University of Tennessee, Knoxville. LAPACK Working Note 100.Google Scholar
Chow, E. and Patel, A. (2015), ‘Fine-grained parallel incomplete LU factorization’, SIAM J. Sci. Comput. 37, C169C193.Google Scholar
Chow, E., Anzt, H. and Dongarra, J. (2015), Asynchronous iterative algorithm for computing incomplete factorizations on GPUs. In High Performance Computing, Vol. 9137 of Lecture Notes in Computer Science, pp. 116.Google Scholar
Cook, R. D., Malkus, D. S., Plesha, M. E. and Witt, R. J. (2007), Concepts and Applications of Finite Element Analysis, Wiley.Google Scholar
Cuthill, E. and Mckee, J. (1969), Reducing the bandwidth of sparse symmetric matrices. In Proc. 1969 24th National Conference: ACM ’69, ACM, pp. 157172.Google Scholar
Datta, B. D. (1995), Numerical Linear Algebra and Applications, Brooks Cole.Google Scholar
Davis, T. A. and Hu, Y. (1994), ‘The University of Florida sparse matrix collection’, ACM Trans. Math. Softw. 38, #1.Google Scholar
Demmel, J. W. (1997), Applied Numerical Linear Algebra, SIAM.CrossRefGoogle Scholar
Demmel, J., Grigori, L., Hoemmen, M. and Langou, J. (2008a), Implementing communication-optimal parallel and sequential QR factorizations. arXiv:0809.2407Google Scholar
Demmel, J., Grigori, L., Hoemmen, M. and Langou, J. (2012), ‘Communication-optimal parallel and sequential QR and LU factorizations’, SIAM J. Sci. Comput. 34, A206A239. Also available as technical report UCB/EECS-2008-89, EECS Department, University of California, Berkeley.Google Scholar
Demmel, J., Hida, Y., Kahan, W., Li, X. S., Mukherjee, S. and Riedy, E. J. (2006), ‘Error bounds from extra-precise iterative refinement’, ACM Trans. Math. Softw. 32, 325351.Google Scholar
Demmel, J., Hida, Y., Li, X. S. and Riedy, E. J. (2007), Extra-precise iterative refinement for overdetermined least squares problems. Technical report EECS-2007-77, UC Berkeley. LAPACK Working Note 188.Google Scholar
Demmel, J., Marques, O., Parlett, B. N. and Vomel, C. (2008b), ‘Performance and accuracy of LAPACK’s symmetric tridiagonal eigensolvers’, SIAM J. Sci. Comput. 30, 15081526.Google Scholar
Doi, S. (1991), ‘On parallelism and convergence of incomplete LU factorizations’, Appl. Numer. Math. 7, 417436.Google Scholar
Dong, T., Dobrev, V., Kolev, T., Rieben, R., Tomov, S. and Dongarra, J. (2014), A step towards energy efficient computing: Redesigning a hydrodynamic application on CPU–GPU. In Proc. IEEE 28th International Parallel and Distributed Processing Symposium: IPDPS 2014, pp. 972981.Google Scholar
Dongarra, J. J. (1983), ‘Improving the accuracy of computed singular values’, SIAM J. Sci. Statist. Comput. 4, 712719.Google Scholar
Dongarra, J. and Whaley, R. C. (1995), A user’s guide to the BLACS v1.1, Technical report UT-CS-95-281, University of Tennessee, Knoxville. LAPACK Working Note 94, updated 5 May 1997 (version 1.1).Google Scholar
Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart, G. W. (1979), LINPACK Users’ Guide, SIAM.Google Scholar
Dongarra, J. J., Du Croz, J., Hammarling, S. and Duff, I. S. (1990), ‘A set of Level 3 Basic Linear Algebra Subprograms’, ACM Trans. Math. Softw. 16, 117.Google Scholar
Dongarra, J., Faverge, M., Ltaief, H. and Luszczek, P. (2011), Achieving numerical accuracy and high performance using recursive tile LU factorization. Technical report ICL-UT-11-08, Computer Science Department, University of Tennessee, Knoxville.Google Scholar
Dongarra, J., Faverge, M., Ltaief, H. and Luszczek, P. (2012), ‘Exploiting fine-grain parallelism in recursive LU factorization’, Advances in Parallel Computing Special Issue 22, 429436.Google Scholar
Dongarra, J. J., Gustavson, F. G. and Karp, A. (1984), ‘Implementing linear algebra algorithms for dense matrices on a vector pipeline machine’, SIAM Review 26, 91112.CrossRefGoogle Scholar
Dongarra, J., Haidar, A., Kurzak, J., Luszczek, P., Tomov, S. and Yarkhan, A. (2014), ‘Model-driven one-sided factorizations on multicore accelerated systems’, Int. J. Supercomputing Frontiers and Innovations 1, #1.Google Scholar
Dongarra, J. J., Moler, C. B. and Wilkinson, J. H. (1983), ‘Improving the accuracy of computed eigenvalues and eigenvectors’, SIAM J. Numer. Anal. 20, 2345.Google Scholar
Dongarra, J. J., Sorensen, D. C. and Hammarling, S. J. (1989), ‘Block reduction of matrices to condensed forms for eigenvalue computations’, J. Comput. Appl. Math. 27, 215227.Google Scholar
Ducroz, J., Dongarra, J. J. and Higham, N. J. (1992), ‘Stability of methods for matrix inversion’, IMA J. Numer. Anal. 12, 119.Google Scholar
Duff, I. S. and Meurant, G. A. (1989), ‘The effect of ordering on preconditioned conjugate gradients’, BIT 29, 635657.CrossRefGoogle Scholar
Duff, I. S. and Reid, J. K. (1983), ‘The multifrontal solution of indefinite sparse symmetric linear equations’, ACM Trans. Math. Softw. 9, 302325.Google Scholar
Edelman, A. (1993), ‘Large dense numerical linear algebra in 1993: the parallel computing influence’, Int. J. High Performance Comput. Appl. 7, 113128.Google Scholar
Farra, V. and Madariaga, R. (1988), ‘Non-linear reflection tomography’, Geophys. J. Int. 95, 135147.Google Scholar
Frommer, A. and Szyld, D. B. (2000), ‘On asynchronous iterations’, J. Comput. Appl. Math. 123, 201216.Google Scholar
Gansterer, W. N., Kvasnicka, D. F. and Ueberhuber, C. W. (1999), Multi-sweep algorithms for the symmetric eigenproblem. In Vector and Parallel Processing: VECPAR’98, Vol. 1573 of Lecture Notes in Computer Science, Springer, pp. 2028.Google Scholar
Gates, M., Haidar, A. and Dongarra, J. (2014), Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem. In High Performance Computing for Computational Science: VECPAR 2014, Springer, pp. 182191.Google Scholar
Golub, G. H. and Kahan, W. (1965), ‘Calculating the singular values and pseudoinverse of a matrix’, SIAM J. Numer. Anal. 2, 205224.Google Scholar
Golub, G. H. and Reinsch, C. (1971), Singular value decomposition and least squares solutions. In Handbook for Automatic Computation II: Linear Algebra (Wilkinson, J. and Reinsch, C., eds), Springer, pp. 403420.Google Scholar
Golub, G. H. and Van Loan, C. (1996), Matrix Computations, third edition, Johns Hopkins University Press.Google Scholar
Golub, G. H. and Wilkinson, J. H. (1976), ‘Ill-conditioned eigensystems and the computation of the Jordan canonical form’, SIAM Rev. 18, 578619.Google Scholar
Golub, G. H. and Ye, Q. (2000), ‘Inexact preconditioned conjugate gradient method with inner–outer iteration’, SIAM J. Sci. Comput. 21, 13051320.Google Scholar
Grcar, J. F. (2011), ‘Mathematicians of Gaussian elimination’, Notices Amer. Math. Soc. 58, 782792.Google Scholar
Grigori, L., Demmel, J. and Xiang, H. (2011), ‘CALU: a communication optimal LU factorization algorithm’, SIAM. J. Matrix Anal. Appl. 32, 13171350.Google Scholar
Gu, M. and Eisenstat, S. C. (1995), ‘A divide-and-conquer algorithm for the bidiagonal SVD’, SIAM J. Matrix Anal. Appl. 16, 7992.Google Scholar
Gustavson, F. (1997), ‘Recursive leads to automatic variable blocking for dense linear-algebra algorithms’, IBM J. Research and Development 41, 737755.Google Scholar
Hadri, B., Ltaief, H., Agullo, E. and Dongarra, J. (2009), Enhancing parallelism of tile QR factorization for multicore architectures. LAPACK Working Note 222.Google Scholar
Hadri, B., Ltaief, H., Agullo, E. and Dongarra, J. (2010), Tile QR factorization with parallel panel processing for multicore architectures. In Proc. 24th IEEE International Parallel and Distributed Processing Symposium: IPDPS 2010. INRIA HAL Technical report inria-00548899: https://hal.inria.fr/inria-00548899Google Scholar
Haidar, A., Cao, C., Yarkhan, A., Luszczek, P., Tomov, S., Kabir, K. and Dongarra, J. (2014), Unified development for mixed multi-GPU and multi-coprocessor environments using a lightweight runtime environment. In Proc. IEEE 28th International Parallel and Distributed Processing Symposium: IPDPS, pp. 491500.Google Scholar
Haidar, A., Kurzak, J. and Luszczek, P. (2013a), An improved parallel singular value algorithm and its implementation for multicore hardware. In Proc. SC ’13: International Conference for High Performance Computing, Networking, Storage and Analysis, #90.Google Scholar
Haidar, A., Ltaief, H. and Dongarra, J. (2011), Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. In Proc. SC ’11: International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, #8.Google Scholar
Haidar, A., Ltaief, H., Luszczek, P. and Dongarra, J. (2012), A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction. In Proc. IEEE 26th International Parallel and Distributed Processing Symposium: IPDPS 2012, pp. 2535.Google Scholar
Haidar, A., Luszczek, P. and Dongarra, J. (2014b), New algorithm for computing eigenvectors of the symmetric eigenvalue problem. In 15th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing: PDSEC 2014 (Best Paper), pp. 11501159.Google Scholar
Haidar, A., Solcà, R., Gates, M., Tomov, S., Schulthess, T. C. and Dongarra, J. (2013b), Leading edge hybrid multi-GPU algorithms for generalized eigenproblems in electronic structure calculations. In Supercomputing, (Kunkel, J. M., Ludwig, T. and Meuer, H. W., eds), Vol. 7905 of Lecture Notes in Computer Science, Springer, pp. 6780.Google Scholar
Haidar, A., Tomov, S., Dongarra, J., Solcà, R. and T., Schulthess (2014), ‘A novel hybrid CPU–GPU generalized eigensolver for electronic structure calculations based on fine-grained memory aware tasks’, Int. J. High Performance Comput. Appl. 28, 196209.Google Scholar
Hammarling, S., Dongarra, J., Du Croz, J. and Hanson, R. (1988), ‘An extended set of Fortran Basic Linear Algebra Subprograms’, ACM Trans. Math. Softw. 14, 132.Google Scholar
Hammond, S. W. and Schreiber, R. (1992), ‘Efficient ICCG on a shared memory multiprocessor’, Int. J. High Speed Comput. 4, 121.Google Scholar
Hanson, R. J. (1971), ‘A numerical method for solving Fredholm integral equations of the first kind using singular values’, SIAM J. Numer. Anal. 8, 616626.Google Scholar
Harrington, R. (1990), ‘Origin and development of the method of moments for field computation’, IEEE Antennas Propag. 32, 3135.Google Scholar
Hess, J. L. (1990), ‘Panel methods in computational fluid dynamics’, Annu. Rev. Fluid Mech. 22, 255274.Google Scholar
Higham, N. J. (2002), Accuracy and Stability of Numerical Algorithms, second edition, SIAM.Google Scholar
Hotelling, H. (1933), ‘Analysis of a complex of statistical variables into principal components’, J. Educ. Psych. 24, 417441; 498–520.Google Scholar
Hotelling, H. (1935), ‘Simplified calculation of principal components’, Psychometrica 1, 2735.Google Scholar
Im, E.-J., Yelick, K. and Vuduc, R. (2004), ‘Sparsity: Optimization framework for sparse matrix kernels’, Int. J. High Perform. Comput. Appl. 18, 135158.Google Scholar
Intel (2014) , Intel®64 and IA-32 architectures software developer’s manual. http://download.intel.com/products/processor/manual/Google Scholar
Jaeger, E., Harvey, R., Berry, L., Myra, J., Dumont, R., Philips, C., Smithe, D., Barrett, R., Batchelor, D., Bonoli, P., Carter, M., D’Azevedo, E., D’Ippolito, D., Moore, R. and Wright, J. (2006), ‘Global-wave solutions with self-consistent velocity distributions in ion cyclotron heated plasmas’, Nuclear Fusion 46, S397S408.Google Scholar
Jessup, E. R. and Sorensen, D. C. (1989), A divide and conquer algorithm for computing the singular value decomposition. In Proc. Third SIAM Conference on Parallel Processing for Scientific Computing, SIAM, pp. 6166.Google Scholar
Jiang, E. P. and Berry, M. W. (1998), Information filtering using the Riemannian SVD (R-SVD). In Solving Irregularly Structured Problems in Parallel: Proc. 5th International Symposium, IRREGULAR ’98, (Ferreira, A., Rolim, J. D. P., Simon, H. D. and Teng, S.-H., eds), Vol. 1457 of Lecture Notes in Computer Science, Springer, pp. 386395.Google Scholar
Joffrain, T., Quintana-Orti, E. S. and van de Geijn, R. A. (2006), Rapid development of high-performance out-of-core solvers. In Applied Parallel Computing: State of the Art in Scientific Computing, (Dongarra, J., Madsen, K. and Waśniewski, J., eds), Vol. 3732 of Lecture Notes in Computer Science, Springer, pp. 413422.Google Scholar
Kabir, K., Haidar, A., Tomov, S. and Dongarra, J. (2015), Performance analysis and design of a Hessenberg reduction using stabilized blocked elementary transformations for new architectures. In Proc. Symposium on High Performance Computing, Society for Computer Simulation International, pp. 135142.Google Scholar
Kågström, B., Kressner, D. and Shao, M. (2012), On aggressive early deflation in parallel variants of the QR algorithm. In Applied Parallel and Scientific Computing, (Jónasson, K., ed.), Vol. 7133 of Lecture Notes in Computer Science, Springer, pp. 110.Google Scholar
Karlsson, L. and Kågström, B. (2011), ‘Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures’, Parallel Computing 37, 771782.Google Scholar
Kurzak, J. and Dongarra, J. J. (2007), ‘Implementation of mixed precision in solving systems of linear equations on the Cell processor’, Concurrency Computat. Pract. Exper. 19, 13711385.Google Scholar
Kurzak, J., Buttari, A. and Dongarra, J. J. (2008), ‘Solving systems of linear equations on the CELL processor using Cholesky factorization’, IEEE Trans. Parallel and Distributed Systems 19, 111.Google Scholar
Lang, B. (1999), ‘Efficient eigenvalue and singular value computations on shared memory machines’, Parallel Computing 25, 845860.CrossRefGoogle Scholar
Langou, J., Langou, J., Luszczek, P., Kurzak, J., Buttari, A. and Dongarra, J. J. (2006), Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy. In Proc. 2006 ACM/IEEE Conference on Supercomputing.Google Scholar
Lawson, C. L., Hanson, R. J., Kincaid, D. and Krogh, F. T. (1979), ‘Basic linear algebra subprograms for Fortran usage’, ACM Trans. Math. Softw. 5, 308323.Google Scholar
Ltaief, H., Luszczek, P., Haidar, A. and Dongarra, J. (2012), Enhancing parallelism of tile bidiagonal transformation on multicore architectures using tree reduction. In 9th International Conference on Parallel Processing and Applied Mathematics: PPAM 2011, (Wyrzykowski, R., Dongarra, J., Karczewski, K and Wasniewski, J., eds), Vol. 7203 of Lecture Notes in Computer Science, Springer, pp. 661670.Google Scholar
Lukarski, D. (2012), Parallel sparse linear algebra for multi-core and many-core platforms: Parallel solvers and preconditioners. PhD thesis, Karlsruhe Institute of Technology (KIT), Germany.Google Scholar
Luszczek, P. and Dongarra, J. (2012), Anatomy of a globally recursive embedded LINPACK benchmark. In Proc. 2012 IEEE Conference on High Performance Extreme Computing Conference: HPEC 2012.Google Scholar
Luszczek, P., Ltaief, H. and Dongarra, J. (2011), Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In IEEE International Parallel and Distributed Processing Symposium: IPDPS 2011, IEEE, pp. 944955.Google Scholar
Mayer, J. (2009), ‘Parallel algorithms for solving linear systems with sparse triangular matrices’, Computing 86, 291312.Google Scholar
Messer, O., Harris, J., Parete-Koon, S. and Chertkow, M. (2012), Multicore and accelerator development for a leadership-class stellar astrophysics code. In Proc. PARA 2012: Applied Parallel and Scientific Computing, (Manninen, P. and Öster, P., eds), Vol. 7782 of Lecture Notes in Computer Science, pp. 92106.Google Scholar
Meuer, H. W., Strohmaier, E., Dongarra, J. J. and Simon, H. D. (2011), TOP500 Supercomputer Sites, 38th edition. www.netlib.org/benchmark/top500.htmlGoogle Scholar
Moler, C. B. (1967), ‘Iterative refinement in floating point’, J. Assoc. Comput. Mach. 14, 316321.Google Scholar
Moler, C. B. (1972), ‘Matrix computations with Fortran and paging’, Comm. Assoc. Comput. Mach. 15, 268270.Google Scholar
Moore, B. C. (1981), ‘Principal component analysis in linear systems: Controllability, observability, and model reduction’, IEEE Trans. Automatic Control 26, 1732.Google Scholar
Moore, G. E. (1965), ‘Cramming more components onto integrated circuits’, Electronics 38(8).Google Scholar
Naumov, M. (2011), Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU. Technical report NVR-2011-001, NVIDIA.Google Scholar
Nédélec, J.-C. (2001), Acoustic and Electromagnetic Equations: Integral Representations for Harmonic Problems, Springer.Google Scholar
Notay, Y. (2000), ‘Flexible conjugate gradients’, SIAM J. Sci. Comput. 22, 14441460.Google Scholar
Oettli, W. and Prager, W. (1964), ‘Compatibility of approximate solution of linear equations with given error bounds for coefficients and right-hand sides’, Numer. Math. 6, 405409.Google Scholar
Park, J., Smelyanskiy, M., Sundaram, N. and Dubey, P. (2014), Sparsifying synchronization for high-performance shared-memory sparse triangular solver. In Supercomputing, Vol. 8488 of Lecture Notes in Computer Science, Springer, pp. 124140.Google Scholar
Parker, D. S. (1995a), Random butterfly transformations with applications in computational linear algebra. Technical report CSD-950023, Computer Science Department, University of California.Google Scholar
Parker, D. S. (1995b), A randomizing butterfly transformation useful in block matrix computations. Technical report CSD-950024, Computer Science Department, University of California.Google Scholar
Poole, E. L. and Ortega, J. M. (1987), ‘Multicolor ICCG methods for vector computers’, SIAM J. Numer. Anal. 24, 13941417.Google Scholar
Pothen, A. and Alvarado, F. (1992), ‘A fast reordering algorithm for parallel sparse triangular solution’, SIAM J. Sci. Statist. Comput. 13, 645653.Google Scholar
Quaranta, E. and Drikakis, D. (2009), ‘Noise radiation from a ducted rotor in a swirling-translating flow’, J. Fluid Mech. 641, 463473.Google Scholar
Rotem, E., Naveh, A., Rajwan, D., Ananthakrishnan, A. and Weissmann, E. (2012), ‘Power-management architecture of the Intel microarchitecture code-named Sandy Bridge’, IEEE Micro 32, 2027.Google Scholar
Rozloz̆ník, M., Shklarski, G. and Toledo, S. (2011), ‘Partitioned triangular tridiagonalization’, ACM Trans. Math. Softw. 37, 116.Google Scholar
Saad, Y. (1993), ‘A flexible inner–outer preconditioned GMRES algorithm’, SIAM J. Sci. Comput. 14, 461469.CrossRefGoogle Scholar
Saad, Y. (2003), Iterative Methods for Sparse Linear Systems, SIAM.Google Scholar
Saltz, J. H. (1990), ‘Aggregation methods for solving sparse triangular systems on multiprocessors’, SIAM J. Sci. Statist. Comput. 11, 123144.Google Scholar
Schreiber, R. and Van Loan, C. (1989), ‘A storage-efficient WY representation for products of Householder transformations’, SIAM J. Sci. Statist. Comput. 10, 5357.Google Scholar
Simoncini, V. and Szyld, D. B. (2003), ‘Flexible inner–outer Krylov subspace methods’, SIAM J. Numer. Anal. 40, 22192239.Google Scholar
Singh, D. J. (1994), Planewaves, Pseudopotentials, and the LAPW Method, Kluwer.Google Scholar
Skeel, R. D. (1980), ‘Iterative refinement implies numerical stability for Gaussian elimination’, Math. Comput. 35, 817832.Google Scholar
Solcà, R., Kozhevnikov, A., Haidar, A., Tomov, S., Dongarra, J. and Schulthess, T. C. (2015), Efficient implementation of quantum materials simulations on distributed CPU–GPU systems. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, pp. 110.Google Scholar
Stewart, G. W. (1973), Introduction to Matrix Computations, Academic Press.Google Scholar
Toledo, S. (1997), ‘Locality of reference in LU decomposition with partial pivoting’, SIAM J. Matrix Anal. Appl. 18, 10651081.Google Scholar
Tomov, S., Nath, R. and Dongarra, J. (2010a), ‘Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing’, Parallel Computing 36, 645654.Google Scholar
Tomov, S., Nath, R., Du, P. and Dongarra, J. (2009), MAGMA version 0.2 Users’ Guide.Google Scholar
Tomov, S., Nath, R., Ltaief, H. and Dongarra, J. (2010b), Dense linear algebra solvers for multicore with GPU accelerators. In Proc. IEEE IPDPS ’10, pp. 18.Google Scholar
Trefethen, L. and Schreiber, R. (1990), ‘Average case analysis of Gaussian elimination’, SIAM J. Math. Anal. Appl. 11, 335360.Google Scholar
Tuma, M. and Benzi, M. (1998), ‘A comparative study of sparse approximate inverse preconditioners’, Appl. Numer. Math 30, 305340.Google Scholar
Turk, M. A. and Pentland, A. P. (1991), Face recognition using eigenfaces. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1991: Proc. CVPR’91, IEEE, pp. 586591.Google Scholar
Turner, K. and Walker, H. F. (1992), ‘Efficient high accuracy solutions with GMRES(m)’, SIAM J. Sci. Statist. Comput. 13, 815825.Google Scholar
van Duin, A. C. N. (1996), ‘Scalable parallel preconditioning with the sparse approximate inverse of triangular matrices’, SIAM J. Matrix Anal. Appl. 20, 9871006.Google Scholar
van den Eshof, J., Sleijpen, G. L. G. and van Gijzen, M. B. (2005), ‘Relaxation strategies for nested Krylov methods’, J. Comput. Appl. Math. 177, 347365.Google Scholar
van der Vorst, H. (1982), ‘A vectorizable variant of some ICCG methods’, SIAM J. Sci. Statist. Comput. 3, 350356.Google Scholar
van der Vorst, H. A. and Vuik, C. (1994), ‘GMRESR: A family of nested GMRES methods’, Numer. Linear Algebra Appl. 1, 369386.Google Scholar
Villa, O., Fatica, M., Gawande, N. A. and Tumeo, A. (2013a), Power/performance trade-offs of small batched LU based solvers on GPUs. In 19th International Conference on Parallel Processing: Euro-Par 2013, Vol. 8097 of Lecture Notes in Computer Science, pp. 813825.Google Scholar
Villa, O., Gawande, N. A. and Tumeo, A. (2013b), Accelerating subsurface transport simulation on heterogeneous clusters. In 2013 IEEE International Conference on Cluster Computing: CLUSTER IEEE, pp. 18.Google Scholar
Volkov, V. and Demmel, J. W. (2008), LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical report UCB/EECS-2008-49, University of California, Berkeley. LAPACK Working Note 202.Google Scholar
Vuik, C. (1995), ‘New insights in GMRES-like methods with variable preconditioners’, J. Comput. Appl. Math. 61, 189204.Google Scholar
Wainwright, I. and Sweden, H. P. C. (2013), Optimized LU-decomposition with full pivot for small batched matrices. In 2013 NVIDIA GPU Tech. Conf.Google Scholar
Wilkinson, J. H. (1963), Rounding Errors in Algebraic Processes, Prentice Hall.Google Scholar
Wilkinson, J. H. (Ed.) (1988), The Algebraic Eigenvalue Problem, Oxford University Press.Google Scholar
Wolf, M., Heroux, M. and Boman, E. (2011), Factors impacting performance of multithreaded sparse triangular solve. In High Performance Computing for Computational Science: VECPAR 2010, Vol. 6449 of Lecture Notes in Computer Science, Springer, pp. 3244.Google Scholar
Yamazaki, I., Tomov, S. and Dongarra, J. (2015), ‘Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs’, SIAM J. Sci. Comput. 37, C307C330.Google Scholar
Yamazaki, I., Tomov, S., Dong, T. and Dongarra, J. (2014), Mixed-precision orthogonalization scheme and adaptive step size for improving the stability and performance of CA-GMRES on GPUs. In 11th International Conference on High Performance Computing for Computational Science: VECPAR 2014, Revised Selected Papers (Best Paper Award), pp. 1730.Google Scholar
YarKhan, A., Kurzak, J. and Dongarra, J. (2011), QUARK Users’ Guide, Innovative Computing Laboratory, Electrical Engineering and Computer Science, University of Tennessee.Google Scholar
Yeralan, S. N., Davis, T. A. and Ranka, S. (2013), Sparse multifrontal QR on the GPU. Technical report, University of Florida.Google Scholar
Yeung, M.-C. and Chan, T. F. (1995), Probabilistic analysis of Gaussian elimination without pivoting. Technical report CAM95-29, Department of Mathematics, University of California, Los Angeles.Google Scholar
Ypma, T. J. (1995), ‘Historical development of the Newton–Raphson method’, SIAM Review 37, 531551.Google Scholar
Zhang, Y., Taylor, M., Sarkar, T., Moon, H. and Yuan, M. (2008), ‘Solving large complex problems using a higher-order basis: Parallel in-core and out-of-core integral-equation solvers’, IEEE Antennas Propag. 50, 1330.Google Scholar