Scaling-up reasoning and advanced analytics on BigData

TYSON CONDIE; ARIYAM DAS; MATTEO INTERLANDI; ALEXANDER SHKAPSKY; MOHAN YANG; CARLO ZANIOLO

doi:10.1017/S1471068418000418

Published online by Cambridge University Press: 05 September 2018

Affiliation (all authors): University of California, Los Angeles, CA, USA


Abstract

BigDatalog is an extension of Datalog that achieves performance and scalability on both Apache Spark and multicore systems, to the point that its graph analytics outperform those written in GraphX. Looking back, we see how this realizes the ambitious goal pursued by deductive database researchers beginning 40 years ago: combining the rigor and power of logic in expressing queries and reasoning with the performance and scalability by which relational databases managed BigData. This goal led to Datalog, which is based on Horn clauses like Prolog but employs implementation techniques, such as semi-naïve fixpoint and magic sets, that extend the bottom-up computation model of relational systems and thus obtain the performance and scalability that relational systems had achieved, as far back as the 1980s, using data parallelization on shared-nothing architectures. But this goal proved difficult to achieve because of major issues at (i) the language level and (ii) the system level. The paper describes how (i) was addressed by simple rules under which the fixpoint semantics extends to programs using count, sum and extrema in recursion, and (ii) was tamed by parallel compilation techniques that achieve scalability on multicore systems and Apache Spark. This paper is under consideration for acceptance in Theory and Practice of Logic Programming.
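
To make two of the ideas named in the abstract concrete, the following is a minimal single-machine sketch in Python. It is not BigDatalog code and does not reflect how the system is implemented. The first function illustrates semi-naïve fixpoint evaluation on a transitive-closure program; the second illustrates an extremum (min) used inside recursion, under the assumption that edge weights are non-negative so that the iteration terminates. The function names, the rule notation in the comments, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Semi-naive bottom-up evaluation of the (informally written) rules
         tc(X, Y) <- edge(X, Y).
         tc(X, Y) <- tc(X, Z), edge(Z, Y).
    Only facts that were new in the previous round are joined with edge.
    """
    succ = defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)

    total = set(edges)  # every tc fact derived so far
    delta = set(edges)  # tc facts first derived in the previous round
    while delta:
        derived = {(x, y) for (x, z) in delta for y in succ.get(z, ())}
        delta = derived - total  # keep only genuinely new facts
        total |= delta
    return total


def shortest_paths(weighted_edges):
    """Min aggregate inside recursion (illustrative notation):
         path(X, Y, min<D>) <- edge(X, Y, D).
         path(X, Y, min<D>) <- path(X, Z, D1), edge(Z, Y, D2), D = D1 + D2.
    Assumes non-negative weights; a fact is propagated only when it
    improves the current minimum for its (X, Y) pair.
    """
    out = defaultdict(list)
    for src, dst, w in weighted_edges:
        out[src].append((dst, w))

    best = {}   # current minimum distance per (X, Y)
    delta = {}  # pairs whose minimum improved in the previous round
    for src, dst, w in weighted_edges:
        if w < best.get((src, dst), float("inf")):
            best[(src, dst)] = w
            delta[(src, dst)] = w

    while delta:
        new_delta = {}
        for (x, z), d1 in delta.items():
            for y, d2 in out.get(z, ()):
                d = d1 + d2
                if d < best.get((x, y), float("inf")):
                    best[(x, y)] = d
                    new_delta[(x, y)] = d
        delta = new_delta
    return best


if __name__ == "__main__":
    print(sorted(transitive_closure([(1, 2), (2, 3), (3, 4)])))
    print(shortest_paths([("a", "b", 1), ("b", "c", 2), ("a", "c", 5)]))
```

In both functions the loop stops when a round produces no new or improved facts, which is the fixpoint. Restricting each round to the delta avoids re-deriving facts already known, and this is the saving that semi-naïve evaluation offers over naïve re-evaluation of the full relation.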

Keywords

Deductive databases; Datalog; BigData; parallel distributed computing

Type: Survey Article
Information: Theory and Practice of Logic Programming, Volume 18, Special Issue 5-6: Special Issue on Parallel and Distributed Logic Programming, September 2018, pp. 806-845

DOI: https://doi.org/10.1017/S1471068418000418
Copyright: © Cambridge University Press 2018

Footnotes

*This work was supported in part by NSF under Grants IIS-1218471, IIS-1302698 and CNS-1351047, and in part by NIH BigData to Knowledge (BD2K) under Grant U54EB020404.
