Hostname: page-component-586b7cd67f-t7czq Total loading time: 0 Render date: 2024-11-22T04:57:04.778Z Has data issue: false hasContentIssue false

Scaling-up reasoning and advanced analytics on BigData

Published online by Cambridge University Press:  05 September 2018

TYSON CONDIE
Affiliation:
ARIYAM DAS
Affiliation:
MATTEO INTERLANDI
Affiliation:
ALEXANDER SHKAPSKY
Affiliation:
MOHAN YANG
Affiliation:
CARLO ZANIOLO
Affiliation:
Rights & Permissions [Opens in a new window]

Abstract

Core share and HTML view are not available for this content. However, as you have access to this content, a full PDF is available via the ‘Save PDF’ action button.

BigDatalog is an extension of Datalog that achieves performance and scalability on both Apache Spark and multicore systems to the point that its graph analytics outperform those written in GraphX. Looking back, we see how this realizes the ambitious goal pursued by deductive database researchers beginning 40 years ago: this is the goal of combining the rigor and power of logic in expressing queries and reasoning with the performance and scalability by which relational databases managed BigData. This goal led to Datalog which is based on Horn Clauses like Prolog but employs implementation techniques, such as semi-naïve fixpoint and magic sets, that extend the bottom-up computation model of relational systems, and thus obtain the performance and scalability that relational systems had achieved, as far back as the 80s, using data-parallelization on shared-nothing architectures. But this goal proved difficult to achieve because of major issues at (i) the language level and (ii) at the system level. The paper describes how (i) was addressed by simple rules under which the fixpoint semantics extends to programs using count, sum and extrema in recursion, and (ii) was tamed by parallel compilation techniques that achieve scalability on multicore systems and Apache Spark. This paper is under consideration for acceptance in Theory and Practice of Logic Programming.

Type
Survey Article
Copyright
Copyright © Cambridge University Press 2018 

Footnotes

*This work was supported in part by NSF under Grants IIS-1218471, IIS-1302698 and CNS-1351047, and in part by NIH BigData to Knowledge (BD2K) under Grant U54EB020404.

References

Abiteboul, S. and Hull, R. 1988. Data functions, datalog and negation (extended abstract). In Proc. of ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, June 1–3, 143–153.Google Scholar
Abiteboul, S., Hull, R. and Vianu, V., Eds. 1995. Foundations of Databases: The Logical Level, 1st ed., Addison-Wesley Longman Publishing, Boston, MA, USA.Google Scholar
Agrawal, R. et al. 1994. Fast algorithms for mining association rules. In Proc. of 20th International Conference on Very Large Data Bases, Vol. 1215, 487–499.Google Scholar
Ameloot, T. J., Neven, F. and Van den Bussche, J. 2011. Relational transducers for declarative networking. In Proc. of 30th Principles of Database Systems (PODS), 283–292.Google Scholar
Aref, M. et al. 2015. Design and implementation of the logicblox system. In Proc. of International Conference on Management of Data (SIGMOD). ACM, 1371–1382.Google Scholar
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A. and Zaharia, M. 2015. Spark SQL: Relational data processing in spark. In Proc. of International Conference on Management of Data (SIGMOD), 1383–1394.Google Scholar
Arni, F., Ong, K., Tsur, S., Wang, H. and Zaniolo, C. 2003. The deductive database system LDL++. Theory and Practice of Logic Programming 3, 1, 6194.Google Scholar
Bell, D. A., Shao, J. and Hull, M. E. C. 1991. A pipelined strategy for processing recursive queries in parallel. Data & Knowledge Engineering 6, 5, 367391.Google Scholar
Borkar, V. R. et al. 2012. Declarative systems for large-scale machine learning. IEEE Data Engineering Bulletin 35, 2, 2432.Google Scholar
Borkar, V. R., Carey, M. J., Grover, R., Onose, N. and Vernica, R. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proc. of 27th International Conference on Data Engineering (ICDE), 1151–1162.Google Scholar
Bu, Y., Borkar, V. R., Carey, M. J., Rosen, J., Polyzotis, N., Condie, T., Weimer, M. and Ramakrishnan, R. 2012. Scaling datalog for machine learning on big data. CoRR abs/1203.0160.Google Scholar
Cardoso, J. C., Baquero, C. and Almeida, P. S. 2009. Probabilistic estimation of network size and diameter. In Proc. of 4th Latin-American Symposium on Dependable Computing (LADC'09). IEEE, 33–40.Google Scholar
Chimenti, D., O'Hare, A. B., Krishnamurthy, R., Tsur, S., West, C. and Zaniolo, C. 1987. An overview of the LDL system. IEEE Data Engineering Bulletin 10, 4, 5262.Google Scholar
Cohen, S. and Wolfson, O. 1989. Why a single parallelization strategy is not enough in knowledge bases. In Proc. of 8th Principles of Database Systems (PODS), 200–216.Google Scholar
Condie, T., Chu, D., Hellerstein, J. M. and Maniatis, P. 2008. Evita raced: Metacompilation for declarative networks. Proceedings of the VLDB Endowment 1, 1, 11531165.Google Scholar
Conway, N., Marczak, W. R., Alvaro, P., Hellerstein, J. M. and Maier, D. 2012. Logic and lattices for distributed programming. In ACM Symposium on Cloud Computing (SOCC '12). San Jose, CA, USA, October 14–17.Google Scholar
Das, A. and Zaniolo, C. 2016. Fast lossless frequent itemset mining in data streams using crucial patterns. In Proc. of SIAM International Conference on Data Mining. Miami, Florida, USA, May 5–7, 576–584.Google Scholar
de Kergommeaux, J. C. and Codognet, P. 1994. Parallel logic programming systems. ACM Computing Surveys 26, 3, 295336.Google Scholar
Dean, J. and Ghemawat, S. 2004. Mapreduce: Simplified data processing on large clusters. In Proc. of 6th Symposium on Operating System Design and Implementation (OSDI), 137–150.Google Scholar
Erdem, E., Gelfond, M. and Leone, N. 2016. Applications of answer set programming. AI Magazine 37, 3, 5368.Google Scholar
Faber, W., Pfeifer, G. and Leone, N. 2011. Semantics and complexity of recursive aggregates in answer set programming. Artificial Intelligence 175, 1, 278298.Google Scholar
Fang, M., Shivakumar, N., Garcia-molina, H., Motwani, R. and Ullman, J. D. 1998. Computing iceberg queries efficiently. In Proc. of 24rd International Conference on Very Large Data Bases (VLDB), 299–310.Google Scholar
Ganguly, S., Greco, S. and Zaniolo, C. 1991. Minimum and maximum predicates in logic programming. In Proc. of 10th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '91), 154–163.Google Scholar
Ganguly, S., Greco, S. and Zaniolo, C. 1995. Extrema predicates in deductive databases. Journal of Computer and System Sciences 51, 2, 244259.Google Scholar
Ganguly, S., Silberschatz, A. and Tsur, S. 1990. A framework for the parallel processing of datalog queries. In Proc. of International Conference on Management of Data (SIGMOD), 143–152.Google Scholar
Ganguly, S., Silberschatz, A. and Tsur, S. 1992. Parallel bottom-up processing of datalog queries. Journal of Logic Programming 14, 1, 101126.Google Scholar
Gebser, M., Kaminski, R., Kaufmann, B. and Schaub, T. 2014. Clingo= asp + control: Preliminary report. arXiv:1405.3694.Google Scholar
Gelfond, M. and Zhang, Y. 2014. Vicious circle principle and logic programs with aggregates. Theory and Practice of Logic Programming 14, 4–5, 587601. CoRR abs/1405.3637.Google Scholar
Giacometti, A., Li, D. H., Marcel, P. and Soulet, A. 2014. 20 years of pattern mining: A bibliometric survey. SIGKDD Explorations Newsletter 15, 1, 4150.Google Scholar
Giannotti, F. and Manco, G. 2002. LDL-Mine: Integrating data mining with intelligent query answering. In Proc. of Logics in Artificial Intelligence, European Conference, JELIA, Cosenza, Italy, September, 23–26, 517–520.Google Scholar
Giannotti, F., Manco, G. and Turini, F. 2004. Specifying mining algorithms with iterative user-defined aggregates. IEEE Transactions on Knowledge and Data Engineering 16, 10, 12321246.Google Scholar
Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J. and Stoica, I. 2014. Graphx: Graph processing in a distributed dataflow framework. In Proc. of 11th USENIX Conference on Operating Systems Design and Implementation (OSDI), 599–613.Google Scholar
Greco, S., Zaniolo, C. and Ganguly, S. 1992. Greedy by choice. In Proc. of 11th Symposium on Principles of Database Systems (PODS). ACM, 105–113.Google Scholar
Gupta, G., Pontelli, E., Ali, K. A., Carlsson, M. and Hermenegildo, M. V. 2001. Parallel execution of prolog programs: A survey. ACM Transactions on Programming Languages and Systems 23, 4, 472602.Google Scholar
Halperin, D., de Almeida, V. T., Choo, L. L., Chu, S., Koutris, P., Moritz, D., Ortiz, J., Ruamviboonsuk, V., Wang, J., Whitaker, A., Xu, S., Balazinska, M., Howe, B. and Suciu, D. 2014. Demonstration of the myria big data management service. In Proc. of International Conference on Management of Data (SIGMOD), Snowbird, UT, USA, June 22–27, 881–884.Google Scholar
Han, J., Pei, J. and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proc. of International Conference on Management of Data (SIGMOD). ACM, 1–12.Google Scholar
Hu, T., Sung, S. Y., Xiong, H. and Fu, Q. 2008. Discovery of maximum length frequent itemsets. Information Sciences 178, 1, 6987.Google Scholar
Interlandi, M. and Tanca, L. 2015. On the CALM principle for BSP computation. In Proc. of Alberto Mendelzon International Workshop on Foundations of Data Management.Google Scholar
Kang, U., Tsourakakis, C. E., Appel, A. P., Faloutsos, C. and Leskovec, J. 2011. Hadi: Mining radii of large graphs. ACM Transactions on Knowledge Discovery from Data 5, 2, 8:18:24.Google Scholar
Kemp, D. B. and Stuckey, P. J. 1991. Semantics of logic programs with aggregates. In Proc. of International Symposium on Logic Programming (ISLP). 387–401.Google Scholar
Kowalski, R. A. 1979. Algorithm = logic + control. Communications of the ACM 22, 7, 424436.Google Scholar
Leone, N. et al. 2006. The DLV system for knowledge representation and reasoning. Transactions on Computational Logic 7, 3, 499562.Google Scholar
Lewis, D. D. 1998. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proc. of 10th European Conference on Machine Learning (ECML '98). Springer-Verlag, London, UK, 4–15.Google Scholar
Lifschitz, S. and Vianu, V. 1998. A probabilistic view of datalog parallelization. Theoretical Computer Science 190, 2, 211239.Google Scholar
Loo, B. T., Condie, T., Garofalakis, M. N., Gay, D. E., Hellerstein, J. M., Maniatis, P., Ramakrishnan, R., Roscoe, T. and Stoica, I. 2006. Declarative networking: Language, execution and optimization. In Proc. of International Conference on Management of Data (SIGMOD). ACM, 97–108.Google Scholar
Loo, B. T., Condie, T., Hellerstein, J. M., Maniatis, P., Roscoe, T. and Stoica, I. 2005. Implementing declarative overlays. In Proc. of 20th ACM Symposium on Operating Systems Principles (SOSP). ACM, 75–90.Google Scholar
Martínez-Angeles, C. A., Dutra, I. and Costa, V. S. 2014. A datalog engine for GPUs. Declarative Programming and Knowledge Management, Springer, 152168.Google Scholar
Martínez-Angeles, C. A., Wu, H., Dutra, I., Costa, V. S. and Buenabad-Chávez, J. 2016. Relational learning with GPUs: Accelerating rule coverage. International Journal of Parallel Programming 44, 3, 663685.Google Scholar
Matula, D. W. and Beck, L. L. 1983. Smallest-last ordering and clustering and graph coloring algorithms. Journal of the ACM 30, 3, 417427.Google Scholar
Mazuran, M., Serra, E. and Zaniolo, C. 2013a. A declarative extension of horn clauses, and its significance for datalog and its applications. Theory and Practice of Logic Programming 13, 4–5, 609623.Google Scholar
Mazuran, M., Serra, E. and Zaniolo, C. 2013b. Extending the power of datalog recursion. The VLDB Journal 22, 4, 471493.Google Scholar
Minker, J., Seipel, D. and Zaniolo, C. 2014. Logic and databases: A history of deductive databases. In Computational Logic, Elsevier, 571627.Google Scholar
Mitchell, T. M. 1997. Machine Learning. McGraw-Hill, Boston, MA.Google Scholar
Morris, K. A., Ullman, J. D. and Gelder, A. V. 1986. Design overview of the nail! system. In Proc. of 3rd International Conference on Logic Programming, Imperial College of Science and Technology. London, UK, July 14–18, 554–568.Google Scholar
Motik, B., Nenov, Y., Piro, R., Horrocks, I. and Olteanu, D. 2014. Parallel materialisation of datalog programs in centralised, main-memory RDF systems. In Proc. of 28th AAAI Conference on Artificial Intelligence (AAAI'14). AAAI Press, 129–137.Google Scholar
Mumick, I. S., Pirahesh, H. and Ramakrishnan, R. 1990. The magic of duplicates and aggregates. In Proc. of 16th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers, 264–277.Google Scholar
Murray, D. G., McSherry, F., Isaacs, R., Isard, M., Barham, P. and Abadi, M. 2013. Naiad: A timely dataflow system. In Proc. of 24th Symposium on Operating Systems Principles (SOSP), 439–455.Google Scholar
Mutharaju, R., Maier, F. and Hitzler, P. 2010. A mapreduce algorithm for SC. In Proc. of 23rd International Workshop on Description Logics (DL'10), 456.Google Scholar
Pelov, N., Denecker, M. and Bruynooghe, M. 2007. Well-founded and stable semantics of logic programs with aggregates. Theory and Practice of Logic Programming 7, 3, 301353.Google Scholar
Przymusinski, T. C. 1988. Perfect model semantics. In Proc. of International Conference and Symposium on Logic Programming (ICLP/SLP), 1081–1096.Google Scholar
Quinlan, J. R. 1986. Induction of decision trees. Machine Learning 1, 1, 81106.Google Scholar
Ramakrishnan, R., Srivastava, D. and Sudarshan, S. 1992. CORAL – Control, relations and logic. In Proc. of 18th International Conference on Very Large Data Bases, August 23-27. Vancouver, Canada, 238–250.Google Scholar
Ross, K. A. and Sagiv, Y. 1992. Monotonic aggregation in deductive databases. In Proc. of 11th Symposium on Principles of Database Systems (PODS). ACM, 114–126.Google Scholar
Seib, J. and Lausen, G. 1991. Parallelizing datalog programs by generalized pivoting. In Proc. of 10th Symposium on Principles of Database Systems (PODS), 241–251.Google Scholar
Seo, J., Guo, S. and Lam, M. S. 2013. SociaLite: Datalog extensions for efficient social network analysis. In Proc. of International Conference on Data Engineering (ICDE'13). IEEE, 278–289.Google Scholar
Seo, J., Park, J., Shin, J. and Lam, M. S. 2013. Distributed socialite: A datalog-based language for large-scale graph analysis. Proceedings of the VLDB Endowment 6, 14, 19061917.Google Scholar
Shin, K., Eliassi-Rad, T. and Faloutsos, C. 2016. Corescope: Graph mining using k-core analysis – Patterns, anomalies and algorithms. In Proc. of 16th International Conference on Data Mining (ICDM). IEEE, 469–478.Google Scholar
Shkapsky, A., Yang, M., Interlandi, M., Chiu, H., Condie, T. and Zaniolo, C. 2016. Big data analytics with datalog queries on spark. In Proc. of 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1135–1149.Google Scholar
Shkapsky, A., Zeng, K. and Zaniolo, C. 2013. Graph queries in a next-generation datalog system. Proceedings of the VLDB Endowment 6, 12, 12581261.Google Scholar
Simons, P., Niemelä, I. and Soininen, T. 2002. Extending and implementing the stable model semantics. Artificial Intelligence 138, 1–2, 181234.Google Scholar
Son, T. C. and Pontelli, E. 2007. A constructive semantic characterization of aggregates in answer set programming. Theory and Practice of Logic Programming 7, 3, 355375.Google Scholar
Sudarshan, S. and Ramakrishnan, R. 1991. Aggregation and relevance in deductive databases. In Proc. of 17th International Conference on Very Large Data Bases (VLDB), 501–511.Google Scholar
Swift, T. and Warren, D. S. 2010. Tabling with answer subsumption: Implementation, applications and performance. In Proc. of European Workshop on Logics in Artificial Intelligence (JELIA). 300–312.Google Scholar
Swift, T. and Warren, D. S. 2012. XSB: Extending prolog with tabled logic programming. Theory and Practice of Logic Programming 12, 1–2, 157187.Google Scholar
Tachmazidis, I., Antoniou, G. and Faber, W. 2014. Efficient computation of the well-founded semantics over big data. Theory and Practice of Logic Programming 14, 4–5, 445459.Google Scholar
Tachmazidis, I., Antoniou, G., Flouris, G., Kotoulas, S. and McCluskey, L. 2012. Large-scale parallel stratified defeasible reasoning. In Proc. of 20th European Conference on Artificial Intelligence. IOS Press, 738–743.Google Scholar
Tsur, S. 1991. Deductive databases in action. In Proc. of 10th Symposium on Principles of Database Systems (PODS '91). ACM, New York, NY, USA, 142–153.Google Scholar
Urbani, J., Jacobs, C. J. and Krötzsch, M. 2016. Column-oriented Datalog Materialization for large knowledge graphs. In Proc. of 30th Conference on Artificial Intelligence (AAAI), 258–264.Google Scholar
Urbani, J., Kotoulas, S., Maassen, J., Van Harmelen, F. and Bal, H. 2012. Webpie: A web-scale parallel inference engine using MapReduce. Web Semantics: Science, Services and Agents on the World Wide Web 10, 5975.Google Scholar
Vaghani, J., Ramamohanarao, K., Kemp, D. B., Somogyi, Z., Stuckey, P. J., Leask, T. S. and Harland, J. 1994. The Aditi deductive database system. VLDB Journal 3, 2, 245288.Google Scholar
Van Gelder, A. 1993. Foundations of aggregation in deductive databases. In Proc. of International Conference on Deductive and Object-Oriented Databases. Springer, 13–34.Google Scholar
Venu, B. 2011. Multi-core processors – An overview. CoRR abs/1110.3535.Google Scholar
Wang, J., Balazinska, M. and Halperin, D. 2015. Asynchronous and fault-tolerant recursive Datalog evaluation in shared-nothing engines. Proceedings of the VLDB Endowment 8, 12, 15421553.Google Scholar
Wolfson, O. and Ozeri, A. 1990. A new paradigm for parallel and distributed rule-processing. In Proc. of International Conference on Management of Data (SIGMOD), 133–142.Google Scholar
Wolfson, O. and Silberschatz, A. 1988. Distributed processing of logic programs. In Proc. of International Conference on Management of Data (SIGMOD), 329–336.Google Scholar
Yang, M. 2017. Declarative Languages and Scalable Systems for Graph Analytics and Knowledge Discovery. Ph.D. thesis, UCLA.Google Scholar
Yang, M., Shkapsky, A. and Zaniolo, C. 2015. Parallel bottom-up evaluation of logic programs: DeALS on shared-memory multicore machines. In Technical Communications of ICLP, Cork, Ireland.Google Scholar
Yang, M., Shkapsky, A. and Zaniolo, C. 2017. Scaling up the performance of more powerful datalog systems on multicore machines. VLDB Journal 26, 2, 229248.Google Scholar
Yang, M. and Zaniolo, C. 2014. Main memory evaluation of recursive queries on multicore machines. In Proc. of IEEE International Conference on Big Data, 251–260.Google Scholar
Yu, Y., Gunda, P. K. and Isard, M. 2009. Distributed aggregation for data-parallel computing: Interfaces and implementations. In Proc. of 22nd Symposium on Operating Systems Principles (SOSP '09). ACM, New York, NY, USA, 247–260.Google Scholar
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S. and Stoica, I. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. of 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2–2.Google Scholar
Zaniolo, C., Yang, M., Interlandi, M., Das, A., Shkapsky, A. and Condie, T. 2017. Fixpoint semantics and optimization of recursive datalog programs with aggregates. Theory and Practice of Logic Programming 17, 5–6, 10481065.Google Scholar
Zaniolo, C., Yang, M., Interlandi, M., Das, A., Shkapsky, A. and Condie, T. 2018. Declarative bigdata algorithms via aggregates and relational database dependencies. In Proc. of 12th Alberto Mendelzon International Workshop on Foundations of Data Management, Cali, Colombia, May 21–25.Google Scholar
Zhang, W., Wang, K. and Chau, S.-C. 1995. Data partition and parallel evaluation of datalog programs. IEEE Transactions on Knowledge and Data Engineering 7, 1, 163176.Google Scholar