An algebra for distributed Big Data analytics

LEONIDAS FEGARAS

doi:10.1017/S0956796817000193

An algebra for distributed Big Data analytics

Part of: Big Data Special Collection

Published online by Cambridge University Press: 11 December 2017

LEONIDAS FEGARAS

Show author details

LEONIDAS FEGARAS*: Affiliation:
Department of Computer Science and Engineering, University of Texas at Arlington, Arlington TX 76019, USA (e-mail: [email protected])

Article contents

Abstract
References

Rights & Permissions

Abstract

Core share and HTML view are not available for this content. However, as you have access to this content, a full PDF is available via the ‘Save PDF’ action button.

We present an algebra for data-intensive scalable computing based on monoid homomorphisms that consists of a small set of operations that capture most features supported by current domain-specific languages for data-centric distributed computing. This algebra is being used as the formal basis of MRQL, which is a query processing and optimization system for large-scale distributed data analysis. The MRQL semantics is given in terms of monoid comprehensions, which support group-by and order-by syntax and can work on heterogeneous collections without requiring any extension to the monoid algebra. We present the syntax and semantics of monoid comprehensions and provide rules to translate them to the monoid algebra. We give evidence of the effectiveness of our algebra by presenting some important optimization rules, such as converting nested queries to joins.

Type: Research Article
Information: Journal of Functional Programming , Volume 27 , 2017 , e27

DOI: https://doi.org/10.1017/S0956796817000193 [Opens in a new window]
Copyright: Copyright © Cambridge University Press 2017

References

Acar, U. A., Blelloch, G. E., Blume, M., Harper, R. & Tangwongsan, K. (2009) An experimental analysis of self-adjusting computation. ACM Trans. Program. Lang. Syst. (TOPLAS) 32 (1), 3:1–53.Google Scholar

Aït-Kaci, H. (2013) An abstract, reusable & extensible programming language design architecture. In In Search of Elegance in the Theory and Practice of Computation, Springer 2013, LNCS 8000, pp. 112–166. Available at http://hassan-ait-kaci.net/pdf/hak-opb.pdf.Google Scholar

Alexandrov, A., Katsifodimos, A., Krastev, G. & Markl, V. (2016) Implicit parallelism through deep language embedding. SIGMOD Record 45 (1), 51–58.Google Scholar

Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A. & Zaharia, M. (2015) Spark SQL: Relational data processing in Spark. In International Conference on Management of Data (SIGMOD). pp. 1383–1394.Google Scholar

Apache Flink. (2017) Available at: http://flink.apache.org/, accessed January 2, 2017.Google Scholar

Apache Giraph. (2017) Available at: http://giraph.apache.org/, accessed January 2, 2017.Google Scholar

Apache Hadoop. (2017) Available at: http://hadoop.apache.org/, accessed January 2, 2017.Google Scholar

Apache Hama. (2017) Available at: http://hama.apache.org/, accessed January 2, 2017.Google Scholar

Apache Hive. (2017) Available at: http://hive.apache.org/, accessed January 2, 2017.Google Scholar

Apache MRQL (incubating). (2017) Available at: http://mrql.incubator.apache.org/ The MRQL syntax is described at http://wiki.apache.org/mrql/LanguageDescription, accessed January 2, 2017.Google Scholar

Apache Spark. (2017) Available at: http://spark.apache.org/, accessed January 2, 2017.Google Scholar

Backhouse, R. & Hoogendijk, P. (1993) Elements of a relational theory of datatypes. In Formal Program Development, IFIP TC2/WG 2.1 State of the Art Seminar, Springer-Verlag 1993, LNCS, vol. 755, pp. 7–42.Google Scholar

Bancilhon, F., Briggs, T., Khoshafian, S. & Valduriez, P. (1987) FAD, a powerful and simple database language. In Proceedings of the International Conference on Very Large Data Bases. pp. 97–105.Google Scholar

Battre, D., Ewen, S., Hueske, F., Kao, O., Markl, V. & Warneke, D. (2010) Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In 1st ACM Symposium on Cloud Computing (SOCC'10). pp. 119–130.Google Scholar

Blelloch, G. (1993) NESL: A nested data-parallel language. Technical report, Carnegie Mellon University. CMU-CS-93-129.Google Scholar

Blelloch, G. & Sabot, G. (1990) Compiling collection-oriented languages onto massively parallel computers. J. Parallel Distrib. Comput. 8 (2), 119–134.Google Scholar

Boykin, O., Ritchie, S., O'Connell, I. & Lin, J. (2014) Summingbird: A framework for integrating batch and online MapReduce computations. Proc. VLDB Endowment (PVLDB) 7 (13), 1441–1451.Google Scholar

Bryant, R. E. (2011) Data-intensive scalable computing for scientific applications. Comput. Sci. Eng. 13 (6), 25–33.Google Scholar

Buneman, P., Libkin, L., Suciu, D., Tannen, V. & Wong, L. (1994) Comprehension syntax. SIGMOD Record 23 (11), 87–96.Google Scholar

Chaiken, R., Jenkins, B., Larson, P.-A., Ramsey, B., Shakib, D., Weaver, S. & Zhou, J. (2008) SCOPE: Easy and efficient parallel processing of massive data Sets. Proc. VLDB Endowment (PVLDB) 1 (2), 1265–1276.Google Scholar

Chakrabarti, D., Zhan, Y. & Faloutsos, C. (2004) R-MAT: A recursive model for graph mining. In SIAM International Conference on Data Mining (SDM). pp. 442–446.Google Scholar

Dean, J. & Ghemawat, S. (2004) MapReduce: Simplified data processing on large clusters. In Symposium on Operating System Design and Implementation (OSDI).Google Scholar

Fegaras, L. (2012) Supporting bulk synchronous parallelism in map-reduce queries. In International Workshop on Data Intensive Computing in the Clouds (DataCloud).Google Scholar

Fegaras, L. (2016) Incremental query processing on big data streams. IEEE Trans. Knowl. Data Eng. 28 (11), 2998–3012. Available at: https://lambda.uta.edu/tkde16-preprint.pdf.Google Scholar

Fegaras, L., Li, C., Gupta, U. & Philip, J. J. (2011) XML query optimization in map-reduce. In International Workshop on the Web and Databases (WebDB).Google Scholar

Fegaras, L., Li, C. & Gupta, U. (2012) An optimization framework for map-reduce queries. In International Conference on Extending Database Technology (EDBT). pp. 26–37.Google Scholar

Fegaras, L. & Maier, D. (1995) Towards an effective calculus for object query languages. In International Conference on Management of Data (SIGMOD). pp. 47–58.Google Scholar

Fegaras, L. & Maier, D. (2000) Optimizing object queries using an effective calculus. ACM Trans. Database Syst. (TODS) 25 (4), 457–516. Available at: https://lambda.uta.edu/tods00.pdf.Google Scholar

Gates, A. F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S. M., Olston, C., Reed, B., Srinivasan, S. & Srivastava, U. (2009) Building a high-level dataflow system on top of map-reduce: the pig Experience. Proc. VLDB Endowment (PVLDB) 2 (2), 1414–1425.Google Scholar

Gibbons, J. (1996) The third homomorphism theorem. J. Funct. Program. 6 (4), 657–665.Google Scholar

Gibbons, J. (2016) Comprehending ringads: For Phil Wadler on the occasion of his 60th birthday. In A List of Successes That Can Change the World. Springer 2016, LNCS, vol. 9600, pp. 132–151.Google Scholar

Giorgidze, G., Grust, T., Schweinsberg, N. & Weijers, J. (2011) Bringing back monad comprehensions. In Haskell Symposium, pp. 13–22.Google Scholar

Grust, T. & Scholl, M. H. (1999) How to comprehend queries functionally. J. Intell. Inform. Syst. 12 (2–3), 191–218.Google Scholar

Holsch, J., Grossniklaus, M. & Scholl, M. H. (2016) Optimization of nested queries using the NF² algebra. In ACM SIGMOD International Conference on Management of Data. pp. 1765–1780.Google Scholar

Isard, M. & Yu, Y. (2009) Distributed data-parallel computing using a high-level programming language. In ACM SIGMOD International Conference on Management of Data. pp. 987–994.Google Scholar

Lin, J. & Dyer, C. (2010) Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers.Google Scholar

Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C. & Hellerstein, J. M. (2012) Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endowment (PVLDB) 5 (8), 716–727.Google Scholar

Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N. & Czajkowski, G. (2010) Pregel: A system for large-scale graph processing. In ACM SIGMOD International Conference on Management of Data. pp. 135–146.Google Scholar

Olston, C., Reed, B., Srivastava, U., Kumar, R. & Tomkins, A. (2008) Pig Latin: A not-so-foreign language for data processing. In ACM SIGMOD International Conference on Management of Data. pp. 1099–1110.Google Scholar

Power, R. & Li, J. (2010) Piccolo: Building fast, distributed programs with partitioned tables. In Symposium on Operating System Design and Implementation (OSDI).Google Scholar

Shinnar, A., Cunningham, D., Herta, B. & Saraswat, V. (2012) M3R: Increased performance for in-memory hadoop jobs. Proc. VLDB Endowment (PVLDB) 5 (12), 1736–1747.Google Scholar

Steele, G. L. Jr. (2009) Organizing functional code for parallel execution or, foldl and foldr considered slightly harmful. In ICFP. pp. 1–2.Google Scholar

Tannen, V. B., Buneman, P. & Naqvi, S. (1991) Structural recursion as a query language. In International Workshop on Database Programming Languages: Bulk Types and Persistent Data (DBPL). pp. 9–19.Google Scholar

Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Antony, S., Liu, H., Wyckoff, P. & Murthy, R. (2009) Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endowment (PVLDB) 2 (2), 1626–1629.Google Scholar

Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H. & Murthy, R. (2010) Hive: A petabyte scale data warehouse using hadoop. In IEEE International Conference on Data Engineering (ICDE). pp. 996–1005.Google Scholar

Trinder, P. & Wadler, P. (1989) Improving list comprehension database queries. In TENCON. pp. 186–192.Google Scholar

Trinder, P. W. (1991) Comprehensions, a query notation for DBPLs. In International Workshop on Database Programming Languages (DBPL). pp. 55–68.Google Scholar

Valiant, L. G. (1990) A bridging model for parallel computation. Commun. ACM (CACM) 33 (8), 103–111.Google Scholar

Wadler, P. (1990) Comprehending monads. In ACM Symposium on Lisp and Functional Programming. pp. 61–78.Google Scholar

Wadler, P. (1987) List comprehensions. In The Implementation of Functional Programming Languages, Peyton Jones, S. (ed.). Prentice Hall, Chapter 7.Google Scholar

Wadler, P. & Peyton Jones, S. (2007) Comprehensive comprehensions (comprehensions with ‘Order by’ and ‘Group by’). In Haskell Symposium. pp. 61–72.Google Scholar

Wong, L. (2000) Kleisli, a functional query system. J. Funct. Program. 10 (1), 19–56.Google Scholar

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S. & Stoica, I. (2012) Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX Symposium on Networked Systems Design and Implementation (NSDI).Google Scholar

Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S. & Stoica, I. (2013) Discretized streams: Fault-tolerant streaming computation at scale. In Symposium on Operating Systems Principles (SOSP).Google Scholar

Submit a response

Discussions

No Discussions have been published for this article.

Article contents

An algebra for distributed Big Data analytics

Abstract

References

Discussions

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests