
Transparent fault tolerance for scalable functional computation

Published online by Cambridge University Press:  17 March 2016

ROBERT STEWART
Affiliation:
Mathematical & Computer Sciences, Heriot-Watt University, Edinburgh, UK (e-mail: [email protected])
PATRICK MAIER
Affiliation:
School of Computing Science, Glasgow, UK (e-mail: [email protected], [email protected])
PHIL TRINDER
Affiliation:
School of Computing Science, Glasgow, UK (e-mail: [email protected], [email protected])

Abstract


Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model, which provides explicit supervision and recovery of actors with isolated state. We investigate scalable, transparent, fault-tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work stealing protocol using the SPIN model checker. HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher-level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault-tolerant execution without any refactoring. Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and a High Performance Computing (HPC) platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well, e.g. up to 1,400 cores on the HPC platform; reliability and recovery overheads are consistently low even at scale.
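
The claim that replaying stateless tasks leaves the program result unchanged can be illustrated with a small simulation. The sketch below is not HdpH-RS code and does not use its API: it is written in Python rather than Haskell, and the names `NodeFailure`, `run_on_node`, `supervised_map` and the `failure_rate` parameter are all hypothetical, introduced only for illustration. A single supervisor, standing in for the root node and assumed never to fail, re-executes any task whose simulated worker node crashes.

```python
import random


class NodeFailure(Exception):
    """Raised when a simulated worker node dies while holding a task."""


def run_on_node(task, arg, rng, failure_rate):
    # Hypothetical stand-in for remote execution: the node may crash
    # before the task's result reaches the supervisor.
    if rng.random() < failure_rate:
        raise NodeFailure()
    return task(arg)


def supervised_map(task, args, failure_rate=0.3, seed=0):
    """Apply a stateless task to each argument, replaying tasks lost to faults.

    Returns the list of results and the number of replays performed.
    """
    rng = random.Random(seed)
    results = []
    recoveries = 0
    for arg in args:
        while True:
            try:
                results.append(run_on_node(task, arg, rng, failure_rate))
                break
            except NodeFailure:
                # The task has no state, so replaying it is safe and
                # yields the same value as an uninterrupted run.
                recoveries += 1
    return results, recoveries


def square(x):
    return x * x


# Same workload, once without faults and once with injected faults.
fault_free, _ = supervised_map(square, range(10), failure_rate=0.0)
faulty, recoveries = supervised_map(square, range(10), failure_rate=0.5, seed=42)
```

Under these assumptions `faulty` equals `fault_free` regardless of how many replays occur, because each task depends only on its argument. The sketch is sequential and leaves out work stealing, failure detection and parallel execution, which the abstract attributes to the HdpH-RS runtime system.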

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 
