Hostname: page-component-586b7cd67f-dlnhk Total loading time: 0 Render date: 2024-11-25T00:48:25.431Z Has data issue: false hasContentIssue false

Communication lower bounds and optimal algorithms for numerical linear algebra*

Published online by Cambridge University Press:  12 May 2014

G. Ballard
Affiliation:
Sandia National Laboratories, Livermore, CA 94551, USA, E-mail: [email protected]
E. Carson
Affiliation:
EECS Department, UC Berkeley, Berkeley, CA 94704, USA, E-mail: [email protected], [email protected], [email protected]
J. Demmel
Affiliation:
EECS Department, UC Berkeley, Berkeley, CA 94704, USA, E-mail: [email protected], [email protected], [email protected] Mathematics Department, UC Berkeley, Berkeley, CA 94704, USA, E-mail: [email protected]
M. Hoemmen
Affiliation:
Sandia National Laboratories, Albuquerque, NM 87185, USA, E-mail: [email protected]
N. Knight
Affiliation:
EECS Department, UC Berkeley, Berkeley, CA 94704, USA, E-mail: [email protected], [email protected], [email protected]
O. Schwartz
Affiliation:
EECS Department, UC Berkeley, Berkeley, CA 94704, USA, E-mail: [email protected], [email protected], [email protected]

Extract

The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or between parallel processors over a network. In this paper we summarize recent progress in three aspects of this problem. First we describe lower bounds on communication. Some of these generalize known lower bounds for dense classical (O(n3)) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices. We also present lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices. Second, we compare these lower bounds to widely used versions of these algorithms, and note that these widely used algorithms usually communicate asymptotically more than is necessary. Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice.

Type
Research Article
Copyright
Copyright © Cambridge University Press 2014 

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Footnotes

*

We acknowledge funding from Microsoft (award 024263) and Intel (award 024894), and matching funding by UC Discovery (award DIG07-10227). Additional support comes from ParLab affiliates National Instruments, Nokia, NVIDIA, Oracle and Samsung, as well as MathWorks. Research is also supported by DOE grants DE-SC0004938, DE-SC0005136, DE-SC0003959, DE-SC0008700, DE-SC0010200, DE-FC02-06-ER25786, AC02-05CH11231, and DARPA grant HR0011-12-2-0016. This research is supported by grant 3-10891 from the Ministry of Science and Technology, Israel, and grant 2010231 from the US-Israel Bi-National Science Foundation. This research was supported in part by an appointment to the Sandia National Laboratories Truman Fellowship in National Security Science and Engineering, sponsored by Sandia Corporation (a wholly owned subsidiary of Lockheed Martin Corporation) as Operator of Sandia National Laboratories under its US Department of Energy Contract DE-AC04-94AL85000.

Colour online for monochrome figures available at journals.cambridge.org/anu.

References

REFERENCES

Aasen, J. O. (1971), ‘On the reduction of a symmetric matrix to tridiagonal form’, BIT Numer. Math. 11, 233242.Google Scholar
Abdelmalek, N. N. (1971), ‘Round off error analysis for Gram–Schmidt method and solution of linear least squares problems’, BIT Numer. Math. 11, 345367.Google Scholar
Agarwal, R., Balle, S., Gustavson, F., Joshi, M. and Palkar, P. (1995), ‘A three-dimensional approach to parallel matrix multiplication’, IBM J. Res. Dev. 39, 575582.Google Scholar
Agarwal, R., Gustavson, F. and Zubair, M. (1994), ‘A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication’, IBM J. Res. Dev. 38, 673681.Google Scholar
Aggarwal, A. and Vitter, J. (1988), ‘The input/output complexity of sorting and related problems’, Comm. Assoc. Comput. Mach. 31, 11161127.Google Scholar
Aggarwal, A., Chandra, A. K. and Snir, M. (1990), ‘Communication complexity of PRAMs’, Theoret. Comput. Sci. 71, 328.Google Scholar
Ahmed, N. and Pingali, K. (2000), Automatic generation of block-recursive codes. In Proc. 6th International Euro-Par Conference on Parallel Processing, Springer, pp. 368378.Google Scholar
Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Croz, J. D., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S. and Sorensen, D. (1992), LAPACK Users' Guide, SIAM. Also available from http://www.netlib.org/lapack/.Google Scholar
Anderson, M., Ballard, G., Demmel, J. and Keutzer, K. (2011), Communication-avoiding QR decomposition for GPUs. In Proc. 2011 IEEE International Parallel and Distributed Processing Symposium: IPDPS '11, pp. 4858.Google Scholar
Arnoldi, W. E. (1951), ‘The principle of minimized iterations in the solution of the matrix eigenvalue problem’, Quart. Appl. Math. 9, 1729.Google Scholar
Bai, Z. and Day, D. (2000), Block Arnoldi method. In Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide (Bai, Z., Demmel, J. W., Dongarra, J. J., Ruhe, A. and van der Vorst, H., eds), SIAM, pp. 196204.Google Scholar
Bai, Z., Day, D., Demmel, J. and Dongarra, J. (1997 a), A test matrix collection for non-Hermitian eigenvalue problems. Technical report CS-97-355, Department of CS, University of Tennessee.Google Scholar
Bai, Z., Demmel, J. and Gu, M. (1997 b), ‘An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems’, Numer. Math. 76, 279308.CrossRefGoogle Scholar
Bai, Z., Hu, D. and Reichel, L. (1991), An implementation of the GMRES method using QR factorization. In Proc. Fifth SIAM Conference on Parallel Processing for Scientific Computing, pp. 8491.Google Scholar
Bai, Z., Hu, D. and Reichel, L. (1994), ‘A Newton basis GMRES implementation’, IMA J. Numer. Anal. 14, 563581.Google Scholar
Ballard, G. (2013), Avoiding communication in dense linear algebra. PhD thesis, EECS Department, UC Berkeley.Google Scholar
Ballard, G., Becker, D., Demmel, J., Dongarra, J., Druinsky, A., Peled, I., Schwartz, O., Toledo, S. and Yamazaki, I. (2013 a), Communication-avoiding symmetric-indefinite factorization. Technical report UCB/EECS-2013-127, EECS Department, UC Berkeley.Google Scholar
Ballard, G., Becker, D., Demmel, J., Dongarra, J., Druinsky, A., Peled, I., Schwartz, O., Toledo, S. and Yamazaki, I. (2013 b), Implementing a blocked Aasen's algorithm with a dynamic scheduler on multicore architectures. In Proc. 27th IEEE International Parallel Distributed Processing Symposium: IPDPS '13, pp. 895907.Google Scholar
Ballard, G., Buluç, A., Demmel, J., Grigori, L., Lipshitz, B., Schwartz, O. and Toledo, S. (2013 c), Communication optimal parallel multiplication of sparse random matrices. In Proc. 25th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '13, ACM, pp. 222231.Google Scholar
Ballard, G., Demmel, J. and Dumitriu, I. (2011a), Communication-optimal parallel and sequential eigenvalue and singular value algorithms. Technical Report EECS-2011-14, UC Berkeley.Google Scholar
Ballard, G., Demmel, J. and Gearhart, A. (2011 b), Brief announcement: Communication bounds for heterogeneous architectures. In Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '11, ACM, pp. 257258.Google Scholar
Ballard, G., Demmel, J. and Knight, N. (2012 a), Communication avoiding successive band reduction. In Proc. 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: PPoPP '12, ACM, pp. 3544.Google Scholar
Ballard, G., Demmel, J. and Knight, N. (2013 d), Avoiding communication in successive band reduction. Technical report UCB/EECS-2013-131, EECS Department, UC Berkeley.Google Scholar
Ballard, G., Demmel, J., Grigori, L., Jacquelin, M., Nguyen, H. D. and Solomonik, E. (2014), Reconstructing Householder vectors from Tall-Skinny QR. In Proc. 2014 IEEE International Parallel and Distributed Processing Symposium: IPDPS '14, to appear.Google Scholar
Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2010), ‘Communication-optimal parallel and sequential Cholesky decomposition’, SIAM J. Sci. Comput. 32, 34953523.Google Scholar
Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2011 c), Graph expansion and communication costs of fast matrix multiplication. In Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '11, ACM, pp. 112.Google Scholar
Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2011 d), ‘Minimizing communication in numerical linear algebra’, SIAM J. Matrix Anal. Appl. 32, 866901.Google Scholar
Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2012 b), ‘Graph expansion and communication costs of fast matrix multiplication’, J. Assoc. Comput. Mach. 59, #32.CrossRefGoogle Scholar
Ballard, G., Demmel, J., Holtz, O. and Schwartz, O. (2012 c), Sequential communication bounds for fast linear algebra. Technical report EECS-2012-36, UC Berkeley.Google Scholar
Ballard, G., Demmel, J., Holtz, O., Lipshitz, B. and Schwartz, O. (2012 d), Brief announcement: Strong scaling of matrix multiplication algorithms and memory independent communication lower bounds. In Proc. 24th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '12, ACM, pp. 7779.Google Scholar
Ballard, G., Demmel, J., Holtz, O., Lipshitz, B. and Schwartz, O. (2012 e), Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proc. 24th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '12, ACM, pp. 193204.Google Scholar
Ballard, G., Demmel, J., Holtz, O., Lipshitz, B. and Schwartz, O. (2012 f), Graph expansion analysis for communication costs of fast rectangular matrix multiplication. In Design and Analysis of Algorithms (Even, G. and Rawitz, D., eds), Vol. 7659 of Lecture Notes in Computer Science, Springer, pp. 1336.Google Scholar
Ballard, G., Demmel, J., Lipshitz, B., Schwartz, O. and Toledo, S. (2013 f), Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout. In Proc. 25th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '13, ACM, pp. 232240.Google Scholar
Barrett, R., Berry, M., Chan, T. F., Demmel, J. W., Donato, J., Dongarra, J. J., Eijkhout, V., Pozo, R., Romine, C. and van der Vorst, H. (1994), Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, second edition, SIAM.Google Scholar
Bender, M. A., Brodal, G. S., Fagerberg, R., Jacob, R. and Vicari, E. (2010), ‘Optimal sparse matrix dense vector multiplication in the I/O-model’, Theory Comput. Syst. 47, 934962.Google Scholar
Bennett, J., Carbery, A., Christ, M. and Tao, T. (2010), ‘Finite bounds for Holder-Brascamp-Lieb multilinear inequalities’, Math. Res. Lett. 17, 647666.CrossRefGoogle Scholar
Berntsen, J. (1989), ‘Communication efficient matrix multiplication on hypercubes’, Parallel Comput. 12, 335342.Google Scholar
Bilardi, G. and Preparata, F. P. (1999), ‘Processor-time tradeoffs under bounded-speed message propagation II: Lower bounds’, Theory Comput. Syst. 32, 531559.Google Scholar
Bilardi, G., Pietracaprina, A. and D'Alberto, P. (2000), On the space and access complexity of computation DAGs. In Graph-Theoretic Concepts in Computer Science: 26th International Workshop (Brandes, U. and Wagner, D., eds), Vol. 1928 of Lecture Notes in Computer Science, Springer, pp. 4758.Google Scholar
Bischof, C. and Loan, C. Van (1987), ‘The WY representation for products of Householder matrices’, SIAM J. Sci. Statist. Comput. 8, 213.Google Scholar
Bischof, C. H., Lang, B. and Sun, X. (2000a), ‘Algorithm 807: The SBR Toolbox, Software for successive band reduction’, ACM Trans. Math. Softw. 26, 602616.CrossRefGoogle Scholar
Bischof, C., Lang, B. and Sun, X. (2000b), ‘A framework for symmetric band reduction’, ACM Trans. Math. Softw. 26, 581601.CrossRefGoogle Scholar
Björck, A. (1967), ‘Solving linear least squares problems by Gram–Schmidt orthogonalization’, BIT Numer. Math. 7, 121.Google Scholar
Blackford, L. S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D. and Whaley, R. C. (1997), ScaLAPACK Users' Guide, SIAM. Also available from http://www.netlib.org/scalapack/.Google Scholar
Blackford, L. S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufman, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K. and Whaley, R. C. (2002), ‘An updated set of basic linear algebra subroutines (BLAS)’, J. ACM Trans. Math. Softw. 28, 135151.Google Scholar
Börm, S. and Grasedyck, L. (2006), HLib package. http://www.hlib.org/hlib.htmlGoogle Scholar
Börm, S., Grasedyck, L. and Hackbusch, W. (2004), Hierarchical matrices. http://www.mis.mpg.de/scicomp/Fulltext/WS_HMatrices.pdfGoogle Scholar
Braman, K., Byers, R. and Mathias, R. (2002 a), ‘The multishift QR algorithm I: Maintaining well-focused shifts and level 3 performance’, SIAM J. Matrix Anal. Appl. 23, 929947.Google Scholar
Braman, K., Byers, R. and Mathias, R. (2002b), ‘The multishift QR algorithm II: Aggressive early deflation’, SIAM J. Matrix Anal. Appl. 23, 948973.Google Scholar
Bunch, J. and Kaufman, L. (1977), ‘Some stable methods for calculating inertia and solving symmetric linear systems’, Math. Comp. 31, 163179.Google Scholar
Buttari, A., Langou, J., Kurzak, J. and Dongarra, J. J. (2007), A class of parallel tiled linear algebra algorithms for multicore architectures. LAPACK Working Note 191.Google Scholar
Byun, J., Lin, R., Yelick, K. and Demmel, J. (2012), Autotuning sparse matrix-vector multiplication for multicore. Technical report UCB/EECS-2012-215, EECS Department, UC Berkeley.Google Scholar
Cannon, L. (1969), A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University.Google Scholar
Carson, E. and Demmel, J. (2014), ‘A residual replacement strategy for improving the maximum attainable accuracy of s-step Krylov subspace methods’, SIAM J. Matrix Anal. Appl. 35, 2243.Google Scholar
Carson, E., Knight, N. and Demmel, J. (2013), ‘Avoiding communication in non-symmetric Lanczos-based Krylov subspace methods’, SIAM J. Sci. Comput. 35, S42S61.Google Scholar
Catalyurek, U. V. and Aykanat, C. (1999), ‘Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication’, IEEE Trans. Parallel Distributed Systems 10, 673693.Google Scholar
Catalyiirek, Ü. V. and Aykanat, C. (2001), A fine-grain hypergraph model for 2D decomposition of sparse matrices. In Proc. 15th IEEE International Parallel and Distributed Processing Symposium: IPDPS '01.Google Scholar
Chan, E., Heimlich, M., Purkayastha, A. and van de Geijn, R. (2007), ‘Collective communication: Theory, practice, and experience’, Concurrency and Computation: Pract. Exp. 19, 17491783.Google Scholar
Chow, E. (2000), ‘Apriori sparsity patterns for parallel sparse approximate inverse preconditioners’, SIAM J. Sci. Comput. 21, 18041822.Google Scholar
Chow, E. (2001), ‘Parallel implementation and practical use of sparse approximate inverses with a priori sparsity patterns’, Internat. J. High Perf. Comput. Appl. 15, 5674.Google Scholar
Christ, M., Demmel, J., Knight, N., Scanlon, T. and Yelick, K. (2013), Communication lower bounds and optimal algorithms for programs that reference arrays, part 1. Technical report UCB/EECS-2013-61, EECS Department, UC Berkeley.Google Scholar
Chronopoulos, A. and Gear, C. (1989 a), ‘On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy’, Parallel Comput. 11, 3753.Google Scholar
Chronopoulos, A. and Gear, C. (1989 b), ‘s-step iterative methods for symmetric linear systems’, J. Comput. Appl. Math. 25, 153168.Google Scholar
Chronopoulos, A. and Swanson, C. (1996), ‘Parallel iterative s-step methods for unsymmetric linear systems’, Parallel Comput. 22, 623641.Google Scholar
Cohen, E. (1994), Estimating the size of the transitive closure in linear time, in Proc. 35th Ann. Symp. Found. Comp. Sci., IEEE, pp. 190200.Google Scholar
Cohen, E. (1997), ‘Size-estimation framework with applications to transitive closure and reachability’, J. Comput. System Sci. 55, 441453.Google Scholar
Committee on the Analysis of Massive Data; Committee on Applied and Theoretical Statistics; Board on Mathematical Sciences and Their Applications; Division on Engineering and Physical Sciences; National Research Council (2013), Frontiers in Massive Data Analysis, The National Academies Press.Google Scholar
da Cunha, R. D., Becker, D. and Patterson, J. C. (2002), New parallel (rank-revealing) QR factorization algorithms. In Euro-Par 2002 Parallel Processing, Springer, pp. 677686.Google Scholar
Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J. and Yelick, K. (2008), Stencil computation optimization and autotuning on state-of-the-art multicore architectures. In Proc. 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, p. 4.Google Scholar
Davis, T. and Hu, Y. (2011), ‘The University of Florida sparse matrix collection’, ACM Trans. Math. Softw. 38, 125.Google Scholar
Dekel, E., Nassimi, D. and Sahni, S. (1981), ‘Parallel matrix and graph algorithms’, SIAM J. Comput. 10, 657675.Google Scholar
Demmel, J. (1997), Applied Numerical Linear Algebra, SIAM.Google Scholar
Demmel, J. (2013), An arithmetic complexity lower bound for computing rational functions, with applications to linear algebra. Technical report UCB/EECS-2013-126, EECS Department, UC Berkeley.Google Scholar
Demmel, J., Dumitriu, I. and Holtz, O. (2007 a), ‘Fast linear algebra is stable’, Numer. Math. 108, 5991.Google Scholar
Demmel, J., Dumitriu, I., Holtz, O. and Kleinberg, R. (2007 b), ‘Fast matrix multiplication is stable’, Numer. Math. 106, 199224.Google Scholar
Demmel, J., Eliahu, D., Fox, A., Kamil, S., Lipshitz, B., Schwartz, O. and Spillinger, O. (2013 a), Communication-optimal parallel recursive rectangular matrix multiplication. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 261272.Google Scholar
Demmel, J., Gearhart, A., Lipshitz, B. and Schwartz, O. (2013 b), Perfect strong scaling using no additional energy. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 649660.Google Scholar
Demmel, J., Grigori, L., Gu, M. and Xiang, H. (2013 c), Communication avoiding rank revealing QR factorization with column pivoting. Technical report UCB/EECS-2013-46, EECS Department, UC Berkeley.Google Scholar
Demmel, J., Grigori, L., Hoemmen, M. and Langou, J. (2008 a), Communication-avoiding parallel and sequential QR and LU factorizations: Theory and practice. LAPACK Working Note.Google Scholar
Demmel, J., Grigori, L., Hoemmen, M. and Langou, J. (2012), ‘Communication-optimal parallel and sequential QR and LU factorizations’, SIAM J. Sci. Comput. 34, A206A239.Google Scholar
Demmel, J., Hoemmen, M., Mohiyuddin, M. and Yelick, K. (2007 c), Avoiding communication in computing Krylov subspaces. Technical report UCB/EECS-2007-123, EECS Department, UC Berkeley.Google Scholar
Demmel, J., Hoemmen, M., Mohiyuddin, M. and Yelick, K. (2008 b), Avoiding communication in sparse matrix computations. In Proc. 2008 IEEE International Parallel and Distributed Processing Symposium: IPDPS 2008, pp. 112.Google Scholar
Demmel, J., Marques, O., Parlett, B. and Vomel, C. (2008 c), ‘Performance and accuracy of LAPACK's symmetric tridiagonal eigensolvers’, SIAM J. Sci. Comput. 30, 15081526.Google Scholar
Devine, K. D., Boman, E. G., Heaphy, R. T., Bisseling, R. H. and Catalyurek, U. V. (2006), Parallel hypergraph partitioning for scientific computing. In Proc. 20th IEEE International Parallel and Distributed Processing Symposium: IPDPS 2006.Google Scholar
Dongarra, J. J., Croz, J. D., Duff, I. S. and Hammarling, S. (1990 a), ‘Algorithm 679: A set of level 3 basic linear algebra subprograms’, ACM Trans. Math. Softw. 16, 1828.Google Scholar
Dongarra, J. J., Croz, J. D., Duff, I. S. and Hammarling, S. (1990 b), ‘A set of level 3 basic linear algebra subprograms’, ACM Trans. Math. Softw. 16, 117.Google Scholar
Dongarra, J. J., Croz, J. D., Hammarling, S. and Hanson, R. J. (1988 a), ‘Algorithm 656: An extended set of Fortran basic linear algebra subprograms’, ACM Trans. Math. Softw. 14, 1832.CrossRefGoogle Scholar
Dongarra, J. J., Croz, J. D., Hammarling, S. and Hanson, R. J. (1988 b), ‘An extended set of Fortran basic linear algebra subprograms’, ACM Trans. Math. Softw. 14, 117.CrossRefGoogle Scholar
Dongarra, J. J., Moler, C. B., Bunch, J. R. and Stewart, G. W. (1979), LINPACK Users' Guide, SIAM.Google Scholar
Douglas, C. C., Hu, J., Kowarschik, M., Rüde, U. and Weiβ, C. (2000), ‘Cache optimization for structured and unstructured grid multigrid’, Electron. Trans. Numer. Anal. 10, 2140.Google Scholar
Driscoll, M., Georganas, E., Koanantakool, P., Solomonik, E. and Yelick, K. (2013), A communication-optimal W-body algorithm for direct interactions. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 10751084.Google Scholar
Dursun, H., Nomura, K.-I., Peng, L., Seymour, R., Wang, W., Kalia, R. K., Nakano, A. and Vashishta, P. (2009), A multilevel parallelization framework for highorder stencil computations. In Euro-Par 2009 Parallel Processing, Springer, pp. 642653.CrossRefGoogle Scholar
Elmroth, E. and Gustavson, F. (1998), New serial and parallel recursive QR factorization algorithms for SMP systems. In Applied Parallel Computing: Large Scale Scientific and Industrial Problems (Kågström, B.et al., eds), Vol. 1541 of Lecture Notes in Computer Science, Springer, pp. 120128.Google Scholar
Floyd, R. (1962), ‘Algorithm 97: Shortest path’, Commun. Assoc. Comput. Mach. 5, 345.Google Scholar
Frens, J. D. and Wise, D. S. (2003), ‘QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism’, ACM SIGPLAN Notices 38, 144154.Google Scholar
Frigo, M. and Strumpen, V. (2005), Cache oblivious stencil computations. In Proc. 19th Annual International Conference on Supercomputing, ACM, pp. 361366.Google Scholar
Frigo, M. and Strumpen, V. (2009), ‘The cache complexity of multithreaded cache oblivious algorithms’, Theory Comput. Syst. 45, 203233.Google Scholar
Frigo, M., Leiserson, C. E., Prokop, H. and Ramachandran, S. (1999), Cache-oblivious algorithms. In Proc. 40th Annual Symposium on Foundations of Computer Science: FOCS '99, IEEE Computer Society, pp. 285297.Google Scholar
Fuller, S. H. and Millett, L. I., eds (2011), The Future of Computing Performance: Game Over or Next Level? The National Academies Press. http://www.nap.eduGoogle Scholar
Gannon, D. and Rosendale, J. Van (1984), ‘On the impact of communication complexity on the design of parallel numerical algorithms’, Trans. Comput. 100, 11801194.Google Scholar
Georganas, E., Gonzalez-Dominguez, J., Solomonik, E., Zheng, Y., Tourino, J. and Yelick, K. (2012), Communication avoiding and overlapping for numerical linear algebra. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis: SC '12, pp. 111.Google Scholar
George, A. (1973), ‘Nested dissection of a regular finite element mesh’, SIAM J. Numer. Anal. 10, 345363.Google Scholar
Gilbert, J. R. and Tarjan, R. E. (1987), ‘The analysis of a nested dissection algorithm’, Numer. Math. pp. 377404.Google Scholar
Giraud, L. and Langou, J. (2003), ‘A robust criterion for the modified Gram–Schmidt algorithm with selective reorthogonalization’, SIAM J. Sci. Comput. 25, 417441.Google Scholar
Giraud, L., Langou, J. and Rozloznik, M. (2005), ‘The loss of orthogonality in the Gram–Schmidt orthogonalization process’, Comput. Math. Appl. 50, 10691075.Google Scholar
Golub, G. and Loan, C. Van (1996), Matrix Computations, third edition, Johns Hopkins University Press.Google Scholar
Golub, G. H., Plemmons, R. J. and Sameh, A. (1988), Parallel block schemes for large-scale least-squares computations. In High-Speed Computing: Scientific Applications and Algorithm Design, University of Illinois Press, pp. 171179.Google Scholar
Graham, S. L., Snir, M. and Patterson, C. A., eds (2004), Getting up to Speed: The Future of Supercomputing, Report of National Research Council of the National Academies Sciences, The National Academies Press.Google Scholar
Granat, R., Kågström, B., Kressner, D. and Shao, M. (2012), Parallel library software for the multishift QR algorithm with aggressive early deflation. Report UMINF 12.06, Department of Computing Science, Umea University, SE-901.Google Scholar
Greenbaum, A. (1997 a), ‘Estimating the attainable accuracy of recursively computed residual methods’, SIAM J. Matrix Anal. Appl. 18, 535551.Google Scholar
Greenbaum, A. (1997 b), Iterative Methods for Solving Linear Systems, SIAM.Google Scholar
Greenbaum, A., Rozložník, M. and Strakoš, Z. (1997), ‘Numerical behavior of the modified Gram–Schmidt GMRES implementation’, BIT Numer. Math. 37, 706719.Google Scholar
Greiner, G. and Jacob, R. (2010 a), Evaluating non-square sparse bilinear forms on multiple vector pairs in the I/O-model. In Mathematical Foundations of Computer Science 2010, Springer, pp. 393404.Google Scholar
Greiner, G. and Jacob, R. (2010 b), The I/O complexity of sparse matrix dense matrix multiplication. In LATIN 2010: Theoretical Informatics, Springer, pp. 143156.Google Scholar
Grigori, L. and Moufawad, S. (2013), Communication avoiding ILU(0) preconditioned Research report RR-8266, INRIA.Google Scholar
Grigori, L., David, P.-Y., Demmel, J. and Peyronnet, S. (2010), Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem. In Proc. 22nd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA '10, ACM, pp. 7981.Google Scholar
Grigori, L., Demmel, J. and Xiang, H. (2011), ‘CALU: A communication optimal LU factorization algorithm’, SIAM J. Matrix Anal. Appl. 32, 13171350.Google Scholar
Gu, M. and Eisenstat, S. (1996), ‘Efficient algorithms for computing a strong rank-revealing QR factorization’, SIAM J. Sci. Comput. 17, 848869.Google Scholar
Gunter, B. C. and van de Geijn, R. A. (2005), ‘Parallel out-of-core computation and updating of the QR factorization’, ACM Trans. Math. Softw. 31, 6078.Google Scholar
Gustavson, F. G. (1997), ‘Recursion leads to automatic variable blocking for dense linear-algebra algorithms’, IBM J. Res. Dev. 41, 737756.CrossRefGoogle Scholar
Gutknecht, M. (1997), Lanczos-type solvers for nonsymmetric linear systems of equations. In Acta Numerica, Vol. 6, Cambridge University Press, pp. 271398.Google Scholar
Gutknecht, M. and Ressel, K. (2000), ‘Look-ahead procedures for Lanczos-type product methods based on three-term Lanczos recurrences’, SIAM J. Matrix Anal. Appl. 21, 10511078.Google Scholar
Gutknecht, M. and Strakoš, Z. (2000), ‘Accuracy of two three-term and three two-term recurrences for Krylov space solvers’, SIAM J. Matrix Anal. Appl. 22, 213229.Google Scholar
Hackbusch, W. (2006), Hierarchische Matrizen: Algorithmen und Analysis. http://www.mis.mpg.de/scicomp/Fulltext/hmvorlesung.psGoogle Scholar
Haidar, A., Luszczek, P., Kurzak, J. and Dongarra, J. (2013), An improved parallel singular value algorithm and its implementation for multicore hardware. LAPACK Working Note 283.Google Scholar
Hestenes, M. R. and Stiefel, E. (1952), ‘Methods of conjugate gradients for solving linear systems’, J. Res. Nat. Bur. Standards 49, 409436.Google Scholar
Hindmarsh, A. and Walker, H. (1986), Note on a Householder implementation of the GMRES method. Technical report UCID-20899, Lawrence Livermore National Laboratory.Google Scholar
Hoemmen, M. (2010), Communication-avoiding Krylov subspace methods. PhD thesis, EECS Department, UC Berkeley.Google Scholar
Hoffman, A. J., Martin, M. S. and Rose, D. J. (1973), ‘Complexity bounds for regular finite difference and finite element grids’, SIAM J. Numer. Anal. 10, 364369.Google Scholar
Hong, J. W. and Kung, H. T. (1981), I/O complexity: The red-blue pebble game. In Proc. 13th Annual ACM Symposium on Theory of Computing: STOC '81, ACM, pp. 326333.Google Scholar
Howell, G. W., Demmel, J., Fulton, C. T., Hammarling, S. and Marmol, K. (2008), ‘Cache efficient bidiagonalization using BLAS 2.5 operators’, ACM Trans. Math. Softw. 34, #14.Google Scholar
Hupp, P. and Jacob, R. (2013), Tight bounds for low dimensional star stencils in the external memory model. In Algorithms and Data Structures (Dehne, F., Solis-Oba, R. and Sack, J.-R., eds), Vol. 8037 of Lecture Notes in Computer Science, Springer, pp. 415426.Google Scholar
Irigoin, F. and Triolet, R. (1988), Supernode partitioning. In Proc. 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ACM, pp. 319329.Google Scholar
Irony, D., Toledo, S. and Tiskin, A. (2004), ‘Communication lower bounds for distributed-memory matrix multiplication’, J. Parallel Distrib. Comput. 64, 10171026.Google Scholar
Jalby, W. and Philippe, B. (1991), ‘Stability analysis and improvement of the block Gram–Schmidt algorithm’, SIAM J. Sci. Statist. Comput. 12, 10581073.Google Scholar
Johnsson, S. L. (1992), Minimizing the communication time for matrix multiplication on multiprocessors’, Parallel Comput.Google Scholar
Joubert, W. and Carey, G. (1992), ‘Parallelizable restarted iterative methods for nonsymmetric linear systems I: Theory’, Internat. J. Comput. Math. 44, 243267.Google Scholar
Kågström, B., Kressner, D. and Shao, M. (2012), On aggressive early deflation in parallel variants of the QR algorithm. In Applied Parallel and Scientific Computing, Springer, pp. 110.Google Scholar
Karlsson, L. and Kågström, B. (2011), ‘Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures’, Parallel Comput. 37, 771782.Google Scholar
Kepner, J. and Gilbert, J. (2011), Graph Algorithms in the Language of Linear Algebra, Vol. 22, SIAM.Google Scholar
Kielbasinski, A. (1974), ‘Numerical analysis of the Gram–Schmidt orthogonalization algorithm (analiza numeryczna algorytmu ortogonalizacji Grama–Schmidta)’, Roczniki Polskiego Towarzystwa Matematycznego, Seria III: Matematyka Stosowana II, pp. 1535.Google Scholar
Kim, S. and Chronopoulos, A. (1992), ‘An efficient nonsymmetric Lanczos method on parallel vector computers’, J. Comput. Appl. Math. 42, 357374.Google Scholar
Knight, N., Carson, E. and Demmel, J. (2014), Exploiting data sparsity in parallel matrix powers computations. In Proc. PPAM '13, Vol. 8384 of Lecture Notes in Computer Science, Springer, to appear.Google Scholar
Kolda, T. G. and Bader, B. W. (2009), ‘Tensor decompositions and applications’, SIAM Review 51, 455500.Google Scholar
LaMielle, A. and Strout, M. (2010), Enabling code generation within the sparse polyhedral framework. Technical report CS-10-102, Colorado State University.Google Scholar
Lanczos, C. (1950), An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators, United States Government Press Office.Google Scholar
Lawson, C. L., Hanson, R. J., Kincaid, D. and Krogh, F. T. (1979), ‘Basic linear algebra subprograms for Fortran usage’, ACM Trans. Math. Softw. 5, 308323.Google Scholar
Leiserson, C. E., Rao, S. and Toledo, S. (1997), ‘Efficient out-of-core algorithms for linear relaxation using blocking covers’, J. Comput. System Sci. 54, 332344.Google Scholar
Lev, G. and Valiant, L. (1983), ‘Size bounds for superconcentrators’, Theor. Comput. Sci. 22, 233251.Google Scholar
Lipshitz, B. (2013), Communication-avoiding parallel recursive algorithms for matrix multiplication. Master's thesis, EECS Department, UC Berkeley.Google Scholar
Lipshitz, B., Ballard, G., Demmel, J. and Schwartz, O. (2012), Communication-avoiding parallel Strassen: Implementation and performance. In Proc. International Conference on High Performance Computing, Networking, Storage and Analysis: SC '12, #101.Google Scholar
Loomis, L. H. and Whitney, H. (1949), ‘An inequality related to the isoperimetric inequality’, Bull. Amer. Math. Soc. 55, 961962.Google Scholar
McColl, W. and Tiskin, A. (1999), ‘Memory-efficient matrix multiplication in the BSP model’, Algorithmica 24, 287297.Google Scholar
Meurant, G. (2006), The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations, SIAM.Google Scholar
Meurant, G. and Strakos, Z. (2006), The Lanczos and conjugate gradient algorithms in finite precision arithmetic. In Acta Numerica, Vol. 15, Cambridge University Press, pp. 471542.Google Scholar
Mohanty, S. and Gopalan, S. (2012), I/O efficient QR and QZ algorithms, in 2012 19th International Conference on High Performance Computing: HiPC, pp. 19.Google Scholar
Mohiyuddin, M. (2012), Tuning hardware and software for multiprocessors. PhD thesis, EECS Department, UC Berkeley.Google Scholar
Mohiyuddin, M., Hoemmen, M., Demmel, J. and Yelick, K. (2009), Minimizing communication in sparse matrix solvers. In Proc. International Conference on High Performance Computing Networking, Storage and Analysis: SC '09.Google Scholar
Nakatsukasa, Y. and Higham, N. (2012), Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. MIMS EPrint 2012.52, University of Manchester.Google Scholar
Nishtala, R., Vuduc, R. W., Demmel, J. W. and Yelick, K. A. (2007), ‘When cache blocking of sparse matrix vector multiply works and why’, Applicable Algebra in Engineering, Communication and Computing 18, 297311.Google Scholar
Park, J.-S., Penner, M. and Prasanna, V. (2004), ‘Optimizing graph algorithms for improved cache performance’, IEEE Trans. Parallel Distributed Systems 15, 769782.Google Scholar
Parlett, B. (1995), The new QD algorithms. In Acta Numerica, Vol. 4, Cambridge University Press, pp. 459491.Google Scholar
Parlett, B. and Reid, J. (1970), ‘On the solution of a system of linear equations whose matrix is symmetric but not definite’, BIT Numer. Math. 10, 386397.Google Scholar
Parlett, B., Taylor, D. and Liu, Z. (1985), ‘A look-ahead Lanczos algorithm for unsymmetric matrices’, Math. Comp. 44, 105124.Google Scholar
Pfeifer, C. (1963), Data flow and storage allocation for the PDQ-5 program on the Philco-2000. Technical report, Westinghouse Electric Corp. Bettis Atomic Power Lab., Pittsburgh.Google Scholar
Philippe, B. and Reichel, L. (2012), ‘On the generation of Krylov subspace bases’, Appl. Numer. Math. 62, 11711186.Google Scholar
Puglisi, C. (1992), ‘Modification of the Householder method based on compact WY representation’, SIAM J. Sci. Statist. Comput. 13, 723726.Google Scholar
Reichel, L. (1990), ‘Newton interpolation at Leja points’, BIT 30, 332346.Google Scholar
Rozloznik, M., Shklarski, G. and Toledo, S. (2011), ‘Partitioned triangular tridiagonalization’, ACM Trans. Math. Softw. 37, #38.Google Scholar
Saad, Y. (1985), ‘Practical use of polynomial preconditionings for the conjugate gradient method’, SIAM J. Sci. Statist. Comput. 6, 865881.Google Scholar
Saad, Y. (2003), Iterative Methods for Sparse Linear Systems, second edition, SIAM.Google Scholar
Saad, Y. and Schultz, M. H. (1986), ‘GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems’, SIAM J. Sci. Statist. Comput. 7, 856869.Google Scholar
Saad, Y., Yeung, M., Erhel, J. and F. Guyomarc'h (2000), ‘A deflated version of the conjugate gradient algorithm’, SIAM J. Sci. Comput. 21, 19091926.Google Scholar
Savage, J. E. (1995), Extending the Hong-Kung model to memory hierarchies. In Computing and Combinatorics, Vol. 959, Springer, pp. 270281.Google Scholar
Schatz, M., Poulson, J. and van de Geijn, R. (2013), Scalable universal matrix multiplication algorithms: 2D and 3D variations on a theme. Technical report, University of Texas.Google Scholar
Schreiber, R. and Loan, C. Van (1989), ‘A storage-efficient WY representation for products of Householder transformations’, SIAM J. Sci. Statist. Comput. 10, 5357.Google Scholar
Schwarz, H. A. (1870), ‘über einen Grenzübergang durch alternierendes Verfahren’, Vierteljahrsschrift der Naturforschenden Gesellschaft in Zürich 15, 272286.Google Scholar
Scquizzato, M. and Silvestri, F. (2014), Communication lower bounds for distributed-memory computations. In 31st International Symposium on Theoretical Aspects of Computer Science: STACS 2014 (Mayr, E. W. and Portier, N., eds), Vol. 25 of Leibniz International Proceedings in Informatics: LIPIcs, Schloss Dagstuhl-Leibniz-Zentrum für Informatik, pp. 627638.Google Scholar
Sleijpen, G. and van der Vorst, H. (1996), ‘Reliable updated residuals in hybrid Bi-CG methods’, Computing 56, 141163.Google Scholar
Smith, B. T., Boyle, J. M., Dongarra, J. J., Garbow, B. S., Ikebe, Y., Klema, V. C. and Moler, C. B. (1976), Matrix Eigensystem Routines: EISPACK Guide, second edition, Springer.Google Scholar
Smoktunowicz, A., Barlow, J. L. and Langou, J. (2006), ‘A note on the error analysis of classical Gram-Schmidt’, Numer. Math. 105, 299313.Google Scholar
Solomonik, E. and Demmel, J. (2011), Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In Euro-Par 2011 Parallel Processing (Jeannot, E., Namyst, R. and Roman, J., eds), Vol. 6853 of Lecture Notes in Computer Science, Springer, pp. 90109.Google Scholar
Solomonik, E., Buluc, A. and Demmel, J. (2013), Minimizing communication in all-pairs shortest-paths. In Proc. 27th IEEE International Parallel Distributed Processing Symposium: IPDPS '13, pp. 548559.Google Scholar
Solomonik, E., Carson, E., Knight, N. and Demmel, J. (2014), Tradeoffs between synchronization, communication, and work in parallel linear algebra computations. Technical report UCB/EECS-2014-8, EECS Department, UC Berkeley.Google Scholar
Solomonik, E., Matthews, D., Hammond, J. and Demmel, J. (2013 c), Cyclops tensor framework: Reducing communication and eliminating load imbalance in massively parallel contractions. In Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS '13, pp. 813824.Google Scholar
Sorensen, D. (1985), ‘Analysis of pairwise pivoting in Gaussian elimination’, IEEE Trans. Computers C-34, 274278.Google Scholar
Sorensen, D. (1992), ‘Implicit application of polynomial filters in a k-step Arnoldi method’, SIAM J. Matrix Anal. Appl. 13, 357385.CrossRefGoogle Scholar
Stathopoulos, A. and Wu, K. (2002), ‘A block orthogonalization procedure with constant synchronization requirements’, SIAM J. Sci. Comput. 23, 21652182.Google Scholar
Stewart, G. (2008), ‘Block Gram-Schmidt orthogonalization’, SIAM J. Sci. Comput. 31, 761775.CrossRefGoogle Scholar
Strassen, V. (1969), ‘Gaussian elimination is not optimal’, Numer. Math. 13, 354356.Google Scholar
Strout, M. M., Carter, L. and Ferrante, J. (2001), Rescheduling for locality in sparse matrix computations. In Computational Science: ICCS 2001, Springer, pp. 137146.Google Scholar
Sturler, E. (1996), ‘A performance model for Krylov subspace methods on meshbased parallel computers’, Parallel Comput. 22, 5774.Google Scholar
Tang, Y., Chowdhury, R. A., Kuszmaul, B. C., Luk, C.-K. and Leiserson, C. E. (2011), The Pochoir stencil compiler. In Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures, ACM, pp. 117128.Google Scholar
Thakur, R., Rabenseifner, R. and Gropp, W. (2005), ‘Optimization of collective communication operations in MPICH’, Internat. J. High Performance Comput. Appl. 19, 4966.Google Scholar
Tiskin, A. (2002), ‘Bulk-synchronous parallel Gaussian elimination’, J. Math. Sci. 108, 977991.Google Scholar
Tiskin, A. (2007), ‘Communication-efficient parallel generic pairwise elimination’, Future Generation Computer Systems 23, 179188.Google Scholar
Toledo, S. (1995), Quantitative performance modeling of scientific computations and creating locality in numerical algorithms. PhD thesis, MIT.Google Scholar
Toledo, S. (1997), ‘Locality of reference in LU decomposition with partial pivoting’, SIAM J. Matrix Anal. Appl. 18, 10651081.Google Scholar
Tong, C. and Ye, Q. (2000), ‘Analysis of the finite precision bi-conjugate gradient algorithm for nonsymmetric linear systems’, Math. Comp. 69, 15591576.Google Scholar
Trefethen, L. and Schreiber, R. (1990), ‘Average-case stability of Gaussian elimination’, SIAM J. Matrix Anal. Appl. 11, 335360.Google Scholar
van de Geijn, R. A. and Watts, J. (1997), ‘SUMMA: Scalable universal matrix multiplication algorithm’, Concurrency: Pract. Exp. 9, 255274.Google Scholar
der Vorst, H. Van and Ye, Q. (1999), ‘Residual replacement strategies for Krylov subspace iterative methods for the convergence of true residuals’, SIAM J. Sci. Comput. 22, 835852.Google Scholar
Rosendale, J. Van (1983), Minimizing inner product data dependencies in conjugate gradient iteration. Technical report 172178, ICASE-NASA.Google Scholar
Vanderstraeten, D. (1999), A stable and efficient parallel block Gram–Schmidt algorithm. In Euro-Par99 Parallel Processing, Springer, pp. 11281135.Google Scholar
Vuduc, R., Demmel, J. and Yelick, K. (2005), OSKI: A library of automatically tuned sparse matrix kernels. In Proc. of SciDAC 2005, J. of Physics Conference Series, Institute of Physics.Google Scholar
Vuduc, R. W. (2003), Automatic performance tuning of sparse matrix kernels. PhD thesis, EECS Department, UC Berkeley.Google Scholar
Walker, H. (1988), ‘Implementation of the GMRES method using Householder transformations’, SIAM J. Sci. Statist. Comput. 9, 152163.Google Scholar
Warshall, S. (1962), ‘A theorem on Boolean matrices’, J. Assoc. Comput. Mach. 9, 1112.Google Scholar
Williams, S., Lijewski, M., Almgren, A., Straalen, B. Van, Carson, E., Knight, N. and Demmel, J. (2014), s-step Krylov subspace methods as bottom solvers for geometric multigrid. In Proc. 2014 IEEE International Parallel and Distributed Processing Symposium: IPDPS '14, to appear.Google Scholar
Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K. and Demmel, J. (2009), ‘Optimization of sparse matrix-vector multiplication on emerging multicore platforms’, Parallel Comput. 35, 178194.Google Scholar
Williams, V. (2012), Multiplying matrices faster than Coppersmith–Winograd. In Proc. 44th Annual Symposium on Theory of Computing: STOC 12, ACM, pp. 887898.Google Scholar
Wise, D. (2000), Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free. In Euro-Par 2000 Parallel Processing (Bode, A., Ludwig, T., Karl, W. and Wismüoller, R., eds), Vol. 1900 of Lecture Notes in Computer Science, Springer, pp. 774783.Google Scholar
Wolf, M. M., Boman, E. G. and Hendrickson, B. (2008), ‘Optimizing parallel sparse matrix-vector multiplication by corner partitioning’, PARA08, Trondheim, Norway.Google Scholar
Xia, J., Chandrasekaran, S., Gu, M. and Li, X. S. (2010), ‘Fast algorithms for hierarchically semiseparable matrices’, Numer. Linear Algebra Appl. 17, 953976.CrossRefGoogle Scholar
Yotov, K., Roeder, T., Pingali, K., Gunnels, J. and Gustavson, F. (2007), An experimental comparison of cache-oblivious and cache-conscious programs. In Proc. 19th Annual ACM Symposium on Parallel Algorithms and Architectures: SPAA '07, ACM, pp. 93104.Google Scholar
Yzelman, A. and Bisseling, R. H. (2011), ‘Two-dimensional cache-oblivious sparse matrix-vector multiplication’, Parallel Comput. 37, 806819.Google Scholar