
Transparent fault tolerance for scalable functional computation

Published online by Cambridge University Press:  17 March 2016

ROBERT STEWART
Affiliation:
Mathematical & Computer Sciences, Heriot-Watt University, Edinburgh, UK (e-mail: [email protected])
PATRICK MAIER
Affiliation:
School of Computing Science, University of Glasgow, Glasgow, UK (e-mail: [email protected], [email protected])
PHIL TRINDER
Affiliation:
School of Computing Science, University of Glasgow, Glasgow, UK (e-mail: [email protected], [email protected])

Abstract


Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model, which provides explicit supervision and recovery of actors with isolated state. We investigate scalable transparent fault-tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work stealing protocol using the SPIN model checker. HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher-level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault-tolerant execution without any refactoring. Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and a High Performance Computing (HPC) platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well, e.g. up to 1,400 cores on the HPC platform; and reliability and recovery overheads are consistently low even at scale.
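To make the fork/join-with-futures style concrete, the sketch below uses the shared-memory monad-par library (Marlow et al., 2011, cited in the references below) rather than HdpH-RS itself: tasks are forked with spawn and their results are read back from IVars with get. This is an illustrative assumption, not the HdpH-RS API; HdpH-RS lifts the same pattern to distributed memory using serialisable closures, and fault tolerance is opted into at execution time rather than by rewriting code of this shape.

import Control.Monad.Par (Par, IVar, runPar, spawn, get)

-- Naive parallel Fibonacci in the fork/join style: each forked task is an
-- independent, stateless computation whose result arrives in an IVar (future).
parFib :: Int -> Par Integer
parFib n
  | n < 2     = return (fromIntegral n)
  | otherwise = do
      left  <- spawn (parFib (n - 1))  -- fork a child task
      right <- parFib (n - 2)          -- evaluate the other branch in place
      l     <- get left                -- join: wait for the child's result
      return (l + right)

main :: IO ()
main = print (runPar (parFib 25))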

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 

References

Aljabri, M., Loidl, H.-W. & Trinder, P. W. (2014) The design and implementation of GUMSMP: A multilevel parallel Haskell implementation. In Proceedings of Implementation and Application of Functional Languages (IFL'13). New York, NY: ACM, pp. 37–48.
Armstrong, J. (2010) Erlang. Commun. ACM 53 (9), 68–75.
Barroso, L. A., Clidaras, J. & Hölzle, U. (2013) The Datacenter as a Computer, 2nd ed. Morgan & Claypool.
Boije, J. & Johansson, L. (2009, December) Distributed Mandelbrot Calculations. Tech. rept. KTH Royal Institute of Technology.
Borwein, P. B., Ferguson, R. & Mossinghoff, M. J. (2008) Sign changes in sums of the Liouville function. Math. Comput. 77 (263), 1681–1694.
Cappello, F. (2009) Fault tolerance in Petascale/Exascale systems: Current knowledge, challenges and research opportunities. IJHPCA 23 (3), 212–226.
Chandy, K. M. & Lamport, L. (1985) Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst. 3 (1), 63–75.
Chechina, N., Li, H., Ghaffari, A., Thompson, S. & Trinder, P. (2016) Improving network scalability of Erlang. J. Parallel Distrib. Comput. 90–91, 22–34.
Cleary, S. (2009, May) Detection of Half-Open (Dropped) Connections. Technical Report, Microsoft. http://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html.
Cole, M. I. (1988) Algorithmic Skeletons: A Structured Approach to the Management of Parallel Computation. PhD Thesis, Computer Science Department, University of Edinburgh.
Dean, J. & Ghemawat, S. (2008) MapReduce: Simplified data processing on large clusters. Commun. ACM 51 (1), 107–113.
Dinu, F. & Ng, T. S. E. (2011) Hadoop's overload tolerant design exacerbates failure detection and recovery. In Proceedings of 6th International Workshop on Networking Meets Databases (NETDB 2011), Athens, Greece, June.
Edinburgh Parallel Computing Centre (EPCC). (2008) HECToR National UK Super Computing Resource, Edinburgh. https://www.hector.ac.uk.
Elnozahy, E. N., Alvisi, L., Wang, Y.-M. & Johnson, D. B. (2002) A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34 (3), 375–408.
Epstein, J., Black, A. P. & Jones, S. L. P. (2011) Towards Haskell in the cloud. In Proceedings of the 4th ACM SIGPLAN Symposium on Haskell (Haskell 2011), Tokyo, Japan, 22 September 2011, pp. 118–129.
Gupta, M. (2012) Akka Essentials. Packt Publishing Ltd.
Ha, S., Rhee, I. & Xu, L. (2008) CUBIC: A new TCP-friendly high-speed TCP variant. Oper. Syst. Rev. 42 (5), 64–74.
Halstead, R. H. Jr. (1985) Multilisp: A language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7 (4), 501–538.
Hammond, K., Zain, A. Al, Cooperman, G., Petcu, D. & Trinder, P. W. (2007) SymGrid: A framework for symbolic computation on the grid. In Proceedings of 13th International Euro-Par Conference, Rennes, France, August 28–31, 2007, pp. 457–466.
Harris, T., Marlow, S. & Jones, S. L. P. (2005) Haskell on a shared-memory multiprocessor. In Proceedings of the ACM SIGPLAN Workshop on Haskell (Haskell 2005), Tallinn, Estonia, September 30, 2005, pp. 49–61.
Herington, D. (2006–2013) Haskell Library: HUnit Package. A Unit Testing Framework for Haskell. http://hackage.haskell.org/package/HUnit.
Hoff, T. (2010, December) Netflix: Continually Test by Failing Servers with Chaos Monkey. http://highscalability.com.
Holzmann, G. J. (2004) The SPIN Model Checker - Primer and Reference Manual. Addison-Wesley.
John, A., Konnov, I., Schmid, U., Veith, H. & Widder, J. (2013) Towards modeling and model checking fault-tolerant distributed algorithms. In Model Checking Software - Proceedings of 20th International Symposium (SPIN 2013), Stony Brook, NY, July 2013, pp. 209–226.
Kuper, L., Turon, A., Krishnaswami, N. R. & Newton, R. R. (2014) Freeze after writing: Quasi-deterministic parallel programming with LVars and handlers. In Proceedings of POPL 2014, San Diego. ACM, pp. 257–270.
Lamport, L. (1978) Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21 (7), 558–565.
Litvinova, A., Engelmann, C. & Scott, S. L. (2010, February 16–18) A proactive fault tolerance framework for high-performance computing. In Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 2010), Innsbruck, Austria.
Loogen, R., Ortega-Mallén, Y. & Peña-Marí, R. (2005) Parallel functional programming in Eden. J. Funct. Program. 15 (3), 431–475.
Maier, P., Livesey, D., Loidl, H.-W. & Trinder, P. (2014a) High-performance computer algebra: A Hecke algebra case study. In Proceedings of Euro-Par 2014 Parallel Processing - 20th International Conference, Porto, Portugal, August 25–29, 2014, Silva, F. M. A., de Castro Dutra, I. & Costa, V. S. (eds), Lecture Notes in Computer Science, vol. 8632. Springer, pp. 19–35.
Maier, P., Stewart, R. J. & Trinder, P. W. (2014b) Reliable scalable symbolic computation: The design of SymGridPar2. Comput. Lang. Syst. Struct. 40 (1), 19–35.
Maier, P., Stewart, R. J. & Trinder, P. (2014c) The HdpH DSLs for scalable reliable computation. In Proceedings of the 2014 ACM SIGPLAN Symposium on Haskell, Gothenburg, Sweden, September 4–5, 2014. ACM, pp. 65–76.
Maier, P. & Trinder, P. (2012) Implementing a high-level distributed-memory parallel Haskell in Haskell. In Implementation and Application of Functional Languages, 23rd International Symposium (IFL 2011), Lawrence, KS, USA, October 3–5, 2011. Revised Selected Papers. Lecture Notes in Computer Science, vol. 7257. Springer, pp. 35–50.
Marlow, S., Jones, S. L. P. & Singh, S. (2009) Runtime support for multicore Haskell. In Proceedings of ICFP, Edinburgh, Scotland, pp. 65–78.
Marlow, S. & Newton, R. (2013) Source code for monad-par library. https://github.com/simonmar/monad-par.
Marlow, S., Newton, R. & Jones, S. L. P. (2011) A monad for deterministic parallelism. In Proceedings of the 4th ACM SIGPLAN Symposium on Haskell (Haskell 2011), Tokyo, Japan, 22 September 2011, pp. 71–82.
Mattsson, H., Nilsson, H. & Wikström, C. (1999) Mnesia - a distributed robust DBMS for telecommunications applications. In Proceedings of PADL, San Antonio, Texas, USA, pp. 152–163.
Meredith, M., Carrigan, T., Brockman, J., Cloninger, T., Privoznik, J. & Williams, J. (2003) Exploring Beowulf clusters. J. Comput. Sci. Colleges 18 (4), 268–284.
Michie, D. (1968) "Memo" functions and machine learning. Nature 218 (5136), 19–22.
Peyton Jones, S. (2002) Tackling the awkward squad: Monadic input/output, concurrency, exceptions, and foreign-language calls in Haskell. In Engineering Theories of Software Construction, Marktoberdorf Summer School, pp. 47–96.
Pnueli, A. (1977) The temporal logic of programs. In Proceedings of 18th Annual Symposium on Foundations of Computer Science, Providence, Rhode Island, 31 October–1 November 1977. IEEE Computer Society, pp. 46–57.
Postel, J. (1980, August) User Datagram Protocol. RFC 768 Standard. http://www.ietf.org/rfc/rfc768.txt.
Prior, A. N. (1957) Time and Modality. Oxford University Press.
Ramalingam, G. & Vaswani, K. (2013) Fault tolerance via idempotence. In Proceedings of POPL, Rome, Italy, pp. 249–262.
Rivin, I., Vardi, I. & Zimmerman, P. (1994) The N-queens problem. Am. Math. Mon. 101 (7), 629–639.
Scholz, S.-B. (2003) Single assignment C: Efficient support for high-level array operations in a functional setting. J. Funct. Program. 13 (6), 1005–1059.
Schroeder, B. & Gibson, G. A. (2007) Understanding failures in Petascale computers. J. Phys.: Conf. Ser. 78, 012022 (11pp). http://stacks.iop.org/1742-6596/78/012022.
Scott, J. & Kazman, R. (2009) Realizing and Refining Architectural Tactics: Availability. Technical Report. Carnegie Mellon University, Software Engineering Institute.
Stewart, R. (2013a, December) Promela Abstraction of the HdpH-RS Scheduler. https://github.com/robstewart57/phd-thesis/blob/master/spin_model/hdph_scheduler.pml.
Stewart, R. (2013b, November) Reliable Massively Parallel Symbolic Computing: Fault Tolerance for a Distributed Haskell. PhD Thesis, Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, Scotland.
Stewart, R. & Maier, P. (2013) HdpH-RS source code. https://github.com/robstewart57/hdph-rs.
Stewart, R., Maier, P. & Trinder, P. (2015, June) Open access dataset for "Transparent Fault Tolerance for Scalable Functional Computation". http://dx.doi.org/10.5525/gla.researchdata.189.
Trinder, P. W., Hammond, K., Mattson, J. S. Jr., Partridge, A. S. & Jones, S. L. P. (1996) GUM: A portable parallel implementation of Haskell. In Proceedings of ACM Programming Language Design and Implementation (PLDI'96), Philadelphia, Pennsylvania, May, pp. 79–88.
White, T. (2012) Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale, 3rd ed., revised and updated. O'Reilly.
Xu, C. & Lau, F. C. (1997) Load Balancing in Parallel Computers: Theory and Practice. Norwell, MA: Kluwer Academic Publishers.