
Oracle-guided scheduling for controlling granularity in implicitly parallel languages*

Published online by Cambridge University Press: 10 November 2016

UMUT A. ACAR
Affiliation:
Carnegie Mellon University, Pittsburgh, PA, USA; Inria, Paris, France (e-mail: [email protected])
ARTHUR CHARGUÉRAUD
Affiliation:
Inria, Université Paris-Saclay, Palaiseau, France; LRI, CNRS & Univ. Paris-Sud, Université Paris-Saclay, Orsay, France (e-mail: [email protected])
MIKE RAINEY
Affiliation:
Inria, Paris, France (e-mail: [email protected])

Abstract


A classic problem in parallel computing is determining whether to execute a thread in parallel or sequentially. If small threads are executed in parallel, the overheads due to thread creation can overwhelm the benefits of parallelism, resulting in suboptimal efficiency and performance. If large threads are executed sequentially, processors may sit idle, again resulting in suboptimal efficiency and performance. This "granularity problem" is especially important in implicitly parallel languages, where the programmer expresses all potential for parallelism, leaving it to the system to exploit that parallelism by creating threads as necessary. Although granularity control has long been recognized as important, it is not well understood: broadly applicable solutions remain elusive. In this paper, we propose techniques for automatically controlling granularity in implicitly parallel programming languages to achieve parallel efficiency and performance. To this end, we first extend a classic result, Brent's theorem (a.k.a. the work-time principle), to account for thread-creation overheads. Using a cost semantics for a general-purpose language in the style of the lambda calculus with parallel tuples, we then present a precise accounting of thread-creation overheads and bound their impact on efficiency and performance. To reduce these overheads, we propose an oracle-guided semantics that relies on estimates of the sizes of parallel threads. We show that, if the oracle provides accurate estimates in constant time, then the oracle-guided semantics reduces thread-creation overheads for a reasonably large class of parallel computations. We describe how to approximate the oracle-guided semantics in practice by combining static and dynamic techniques: we require the programmer to provide the asymptotic complexity cost of each parallel thread and use runtime profiling to determine hardware-specific constant factors. We present an implementation of the proposed approach as an extension of the Manticore compiler for Parallel ML. Our empirical evaluation shows that our techniques can reduce thread-creation overheads, leading to good efficiency and performance.
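As context for the theoretical development: Brent's theorem (the work-time principle) states that a computation with total work $W$ and span (critical-path length) $S$ can be executed on $p$ processors in time

$T_p \le W/p + S.$

A minimal sketch of an overhead-aware variant, under the illustrative assumption that each of the $N$ thread creations incurs a fixed cost $\tau$ (the paper's precise statement may differ), is

$T_p \le (W + \tau N)/p + S,$

which makes explicit that thread-creation overheads stay negligible only when $N$ is small relative to $W/\tau$; granularity control aims to keep $N$ in that regime by creating threads only for sufficiently large computations.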
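To make the oracle-guided approach concrete, the following is a minimal, self-contained sketch in OCaml. The paper's implementation targets Parallel ML in the Manticore compiler; all names below (par, kappa, spawn_or_seq, the per-element cost) are hypothetical illustrations under assumed constants, not the actual API.

(* A sketch of oracle-guided granularity control. [par] stands in for a
   real parallel-tuple construct; here it is sequential so that the
   sketch runs as-is. *)
let par f g = (f (), g ())

(* Hypothetical hardware-specific constant: the smallest amount of work
   (in seconds) for which creating a thread pays off. The paper obtains
   such constants by runtime profiling; here it is fixed arbitrarily. *)
let kappa = 1e-5

(* The oracle: run the two branches in parallel only if the constant-time
   cost estimate predicts enough work to amortize thread creation. *)
let spawn_or_seq cost f g =
  if cost () >= kappa then par f g else (f (), g ())

(* Example: parallel array sum. The programmer supplies the asymptotic
   complexity (linear in the segment length); 1e-8 plays the role of a
   profiled per-element cost, fixed arbitrarily here. *)
let rec sum a lo hi =
  if hi - lo <= 1 then (if hi > lo then a.(lo) else 0)
  else
    let mid = (lo + hi) / 2 in
    let cost () = 1e-8 *. float_of_int (hi - lo) in
    let l, r = spawn_or_seq cost (fun () -> sum a lo mid)
                                 (fun () -> sum a mid hi) in
    l + r

let () =
  let a = Array.init 1_000_000 (fun i -> i) in
  Printf.printf "%d\n" (sum a 0 (Array.length a))

With an actual parallel [par] (e.g., parallel tuples or futures), recursive calls whose predicted cost exceeds the cutoff would fork, while small subcomputations run sequentially; this is the behavior that the oracle-guided semantics formalizes and bounds.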

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 

Footnotes

*

This research was partially supported by the National Science Foundation (grants CCF-1320563 and CCF-1408940), the European Research Council (grant ERC-2012-StG-308246), and Microsoft Research.

References

Acar, U. A. & Blelloch, G. (2015a). 15210: Algorithms: Parallel and sequential. Accessed August 2016. Available at: http://www.cs.cmu.edu/~15210/.
Acar, U. A. & Blelloch, G. (2015b). Algorithm design: Parallel and sequential. Accessed August 2016. Available at: http://www.parallel-algorithms-book.com.
Acar, U. A., Blelloch, G. E. & Blumofe, R. D. (2002). The data locality of work stealing. Theory Comput. Syst. 35 (3), 321–347.
Acar, U. A., Charguéraud, A. & Rainey, M. (2011). Oracle scheduling: Controlling granularity in implicitly parallel languages. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pp. 499–518.
Acar, U. A., Charguéraud, A. & Rainey, M. (2013). Scheduling parallel programs by work stealing with private deques. In PPoPP '13.
Acar, U. A., Charguéraud, A. & Rainey, M. (2015a). An introduction to parallel computing in C++. Available at: http://www.cs.cmu.edu/15210/pasl.html.
Acar, U. A., Charguéraud, A. & Rainey, M. (2015b). A work-efficient algorithm for parallel unordered depth-first search. In Proceedings of the ACM/IEEE Conference on High Performance Computing (SC). New York, NY, USA: ACM.
Aharoni, G., Feitelson, D. G. & Barak, A. (1992). A run-time algorithm for managing the granularity of parallel functional programs. J. Funct. Program. 2, 387–405.
Arora, N. S., Blumofe, R. D. & Plaxton, C. G. (1998). Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures. SPAA '98. ACM Press, pp. 119–129.
Arora, N. S., Blumofe, R. D. & Plaxton, C. G. (2001). Thread scheduling for multiprogrammed multiprocessors. Theory Comput. Syst. 34 (2), 115–144.
Barnes, J. & Hut, P. (December 1986). A hierarchical O(N log N) force calculation algorithm. Nature 324, 446–449.
Bergstrom, L., Fluet, M., Rainey, M., Reppy, J. & Shaw, A. (2010). Lazy tree splitting. In ICFP 2010. ACM Press, pp. 93–104.
Blelloch, G. & Greiner, J. (1995). Parallelism in sequential functional languages. In Proceedings of the 7th International Conference on Functional Programming Languages and Computer Architecture. FPCA '95. ACM, pp. 226–237.
Blelloch, G. E., Fineman, J. T., Gibbons, P. B. & Simhadri, H. V. (2011). Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures. SPAA '11, pp. 355–366.
Blelloch, G. E. & Gibbons, P. B. (2004). Effectively sharing a cache among threads. In SPAA.
Blelloch, G. E. & Greiner, J. (1996). A provable time and space efficient implementation of NESL. In Proceedings of the 1st ACM SIGPLAN International Conference on Functional Programming. ACM, pp. 213–225.
Blelloch, G. E., Hardwick, J. C., Sipelstein, J., Zagha, M. & Chatterjee, S. (1994). Implementation of a portable nested data-parallel language. J. Parallel Distrib. Comput. 21 (1), 4–14.
Blelloch, G. E. & Sabot, G. W. (February 1990). Compiling collection-oriented languages onto massively parallel computers. J. Parallel Distrib. Comput. 8, 119–134.
Blumofe, R. D. & Leiserson, C. E. (September 1999). Scheduling multithreaded computations by work stealing. J. ACM 46, 720–748.
Brent, R. P. (1974). The parallel evaluation of general arithmetic expressions. J. ACM 21 (2), 201–206.
Chakravarty, M. M. T., Leshchinskiy, R., Peyton Jones, S., Keller, G. & Marlow, S. (2007). Data Parallel Haskell: A status report. In Workshop on Declarative Aspects of Multicore Programming. DAMP '07, pp. 10–18.
Chowdhury, R. A., Silvestri, F., Blakeley, B. & Ramachandran, V. (April 2010). Oblivious algorithms for multicores and network of processors. In Proceedings of the International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1–12.
Cole, R. & Ramachandran, V. (2010). Resource oblivious sorting on multicores. In Proceedings of the 37th International Colloquium Conference on Automata, Languages and Programming. ICALP '10. Springer-Verlag, pp. 226–237.
Crary, K. & Weirich, S. (2000). Resource bound certification. In Proceedings of the 27th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. POPL '00, pp. 184–198.
Feeley, M. (1992). A message passing implementation of lazy task creation. In Proceedings of Parallel Symbolic Computing, pp. 94–107.
Feeley, M. (1993). An Efficient and General Implementation of Futures on Large Scale Shared-Memory Multiprocessors. PhD Thesis, Brandeis University, Waltham, MA, USA. UMI Order No. GAX93-22348.
Fluet, M., Rainey, M. & Reppy, J. (2008). A scheduling framework for general purpose parallel languages. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming (ICFP). ACM, pp. 241–252.
Fluet, M., Rainey, M., Reppy, J. & Shaw, A. (2011). Implicitly threaded parallelism in Manticore. J. Funct. Program. 20 (5–6), 1–40.
Frens, J. D. & Wise, D. S. (1997). Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP '97. New York, NY, USA: ACM, pp. 206–216.
Frigo, M., Leiserson, C. E. & Randall, K. H. (1998). The implementation of the Cilk-5 multithreaded language. In PLDI, pp. 212–223.
Goldsmith, S. F., Aiken, A. S. & Wilkerson, D. S. (2007). Measuring empirical computational complexity. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM Symposium on the Foundations of Software Engineering, pp. 395–404.
Gulwani, S., Mehra, K. K. & Chilimbi, T. (2009). SPEED: Precise and efficient static estimation of program computational complexity. In Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 127–139.
Halstead, R. H. (1985). Multilisp: A language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7, 501–538.
Hiraishi, T., Yasugi, M., Umatani, S. & Yuasa, T. (2009). Backtracking-based load balancing. In PPoPP '09. ACM, pp. 55–64.
Huelsbergen, L., Larus, J. R. & Aiken, A. (1994). Using the run-time sizes of data structures to guide parallel-thread creation. In Proceedings of the 1994 ACM Conference on Lisp and Functional Programming. LFP '94, pp. 79–90.
Jost, S., Hammond, K., Loidl, H. & Hofmann, M. (2010). Static determination of quantitative resource usage for higher-order programs. In Principles of Programming Languages (POPL), pp. 223–236.
Leroy, X., Doligez, D., Garrigue, J., Rémy, D. & Vouillon, J. (2005). The Objective Caml System.
Lopez, P., Hermenegildo, M. & Debray, S. (June 1996). A methodology for granularity-based control of parallelism in logic programs. J. Symbol. Comput. 21, 715–734.
Mohr, E., Kranz, D. A. & Halstead, R. H. Jr. (1990). Lazy task creation: A technique for increasing the granularity of parallel programs. In Conference Record of the 1990 ACM Conference on Lisp and Functional Programming. New York, NY, USA: ACM Press, pp. 185–197.
Narlikar, G. J. (1999). Space-Efficient Scheduling for Parallel, Multithreaded Computations. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA, USA.
Pehoushek, J. & Weening, J. (1990). Low-cost process creation and dynamic partitioning in Qlisp. In Ito, T. & Halstead, R. (eds), Parallel Lisp: Languages and Systems. Lecture Notes in Computer Science, vol. 441. Springer Berlin/Heidelberg, pp. 182–199.
Peyton Jones, S. L. (2008). Harnessing the multicores: Nested data parallelism in Haskell. In APLAS, p. 138.
Peyton Jones, S. L., Leshchinskiy, R., Keller, G. & Chakravarty, M. M. T. (2008). Harnessing the multicores: Nested data parallelism in Haskell. In FSTTCS, pp. 383–414.
Plummer, H. C. (March 1911). On the problem of distribution in globular star clusters. Mon. Not. R. Astron. Soc. 71, 460–470.
Rainey, M. (August 2010). Effective Scheduling Techniques for High-Level Parallel Programming Languages. PhD Thesis, University of Chicago.
Rosendahl, M. (1989). Automatic complexity analysis. In FPCA '89: Functional Programming Languages and Computer Architecture. ACM, pp. 144–156.
Sanchez, D., Yoo, R. M. & Kozyrakis, C. (2010). Flexible architectural support for fine-grain scheduling. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems. ASPLOS '10. New York, NY, USA: ACM, pp. 311–322.
Sands, D. (September 1990). Calculi for Time Analysis of Functional Programs. PhD Thesis, University of London, Imperial College.
Sivaramakrishnan, K. C., Ziarek, L. & Jagannathan, S. (2014). MultiMLton: A multicore-aware runtime for Standard ML. J. Funct. Program. FirstView, 1–62.
Spoonhower, D. (2009). Scheduling Deterministic Parallel Programs. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA, USA.
Spoonhower, D., Blelloch, G. E., Harper, R. & Gibbons, P. B. (2008). Space profiling for parallel functional programs. In International Conference on Functional Programming.
Tzannes, A., Caragea, G. C., Vishkin, U. & Barua, R. (September 2014). Lazy scheduling: A runtime adaptive scheduler for declarative parallelism. TOPLAS 36 (3), 10:1–10:51.
Valiant, L. G. (August 1990). A bridging model for parallel computation. CACM 33, 103–111.
Weening, J. S. (1989). Parallel Execution of Lisp Programs. PhD Thesis, Stanford University. Computer Science Technical Report STAN-CS-89-1265.