1 Introduction
Knowledge Representation and Reasoning (KRR) languages based on logic rules are experiencing a significant resurgence in the context of Knowledge Graph (KG) systems. Alongside the graph-based data model of property graphs (Angles 2018), more and more companies and users seek to harness domain knowledge in the intensional components of the graph (Bellomarini et al. 2019) and to use it to solve complex reasoning tasks, well beyond the initial applications of ontological reasoning of the early times.
KRR languages play a fundamental role in this and, in order to cope with complex real-life applications, should support a number of desiderata. First, the knowledge management community has shown a strong appreciation for rule-based languages, wishing to benefit from all their typical advantages over procedural approaches, such as user-orientation, modularity, and, of course, explainability. Then, KRR languages should be syntactically simple and achieve high expressive power in order to deal with complex industrial use cases. At the same time, they should guarantee low data complexity (when the set of rules is fixed and the data vary), so as to be executable in practice and scale to large volumes of data (Bellomarini et al. 2017).
vadalog is a logic KRR language based on Warded Datalog$^\pm$ (Bellomarini et al. 2018), a member of the Datalog$^\pm$ family (Calì et al. 2012). Datalog$^\pm$ languages generalize Datalog by adopting existential rules, that is, rules with existential quantification in the head. The presence of existential quantification and recursion makes reasoning undecidable in general, as infinitely many symbols may be generated in the logical entailment (Gottlob and Pieris 2015). Warded Datalog$^\pm$ introduces syntactic restrictions that limit the propagation of nulls while preserving high expressive power.
Let us start with the following example to introduce the reasoning task.
Example 1.1 Consider a Knowledge Graph G. The facts in its extensional component describe domain relationships between constants a, b, c, l, m, n as follows:
Let us extend G with the following Warded Datalog $^\pm$ existential rules (the intensional component of the KG), encoding a portion of the credit domain, from our industrial use cases:
Rule (1) encodes that if a lender x is of type y (e.g. a bank, a small company) and lenders of type y are subject to RegulatoryRestrictions, enforced by financial supervision authorities, that require loans to be covered by a specific Guarantee (known as "collateral") of type z (e.g. securities, real estate properties, cash), then such a Guarantee for lender x exists and involves a guarantor v, the entity issuing the guarantee itself, be it an individual, a bank, or a financial intermediary. LenderClass in Rule (2) defines a taxonomy of lenders, classifying them into classes (e.g. mortgage lenders, retail lenders, direct lenders), so that if y is a subclass of z (e.g. credit unions are a subclass of retail lenders) and x is of type y, it follows that x is of type z as well. In practice, loans are formalized by Contracts: in the body of Rule (3), x is the lender, z the borrower, and y the type of contract, encoding the type of loan. Different types of loans give rise to different forms w of financial Exposure, that is, the type of repayment obligation that z has toward x. Rule (3) justifies the existence of another Contract, from z to x, witnessing such repayment obligation encoded by the contract type w associated with the exposure. Finally, Rule (4) complements Rule (2): if a Contract from x to z is in place to satisfy a restriction that requires a guarantee encoded by the contract type y, based on a lender type w, then x is of type w.
An example of (ontological) reasoning task over G corresponds to answering the query: “What are all the $\textsf{Contracts}$ and $\textsf{Guarantees}$ that are expected to be entailed by the KG?”. They are $\textsf{Contract}(c,l,a)$ and $\textsf{Guarantee}(c,l,v_0)$ , where $v_0$ is a fresh arbitrary value (a labeled null).
Let us consider a modified setting for Example 1.1, in which the assertions are not definitive, but hold with different strengths, due to heterogeneous implementations of the regulations by the different financial intermediaries. We prefix our rules with a weight proportional to the bias we have for them to hold (indicated by the number before the :: symbol) or, from another perspective, that depends on how often the rules are satisfied in the available example data sets.
Example 1.2 We extend G of Example 1.1 with the following set of weight-prefixed Warded Datalog $^\pm$ rules.
We represent here some notion of uncertainty related to our domain: depending on how each financial intermediary implements the regulations, some rules may apply or not. In particular, Rules 5–7 tend to be respected more or less regularly, as reflected by their weights, whereas Rule 8 is a hard rule.
A probabilistic reasoning task would consist in answering, over logic programs, queries like: “What is the probability for each $\textsf{Contract}$ and Guarantee to be entailed?”. We wish to compute the marginal probability of entailed facts, so, for example, of $\textsf{Contract}(c,l,a)$ and $\textsf{Guarantee}(c,l,v_0)$ .
Beyond our running example, practical applications of probabilistic reasoning in Warded Datalog$^\pm$ are common and span many domains in which we have direct experience with Datalog$^\pm$: besides the financial domain (Bellomarini et al. 2019), the automation of data science pipelines (Bellomarini et al. 2018), data acquisition (Michels et al. 2017), data extraction (Fayzrakhmanov et al. 2018), and others.
To enable such scenarios, we need KRR languages able to perform probabilistic reasoning under the requirements of ontological reasoning: (i) powerful existential quantification, supporting the quantification of SPARQL and OWL 2 QL; (ii) full recursion, capturing full Datalog, to express general non-ground inductive definitions (e.g. transitive closure, as in the standard path definition $\textit{edge}(X,Y)\to \textit{path}(X,Y)$; $\textit{edge}(X,Z), \textit{path}(Z,Y)\to \textit{path}(X,Y)$) (Fierens et al. 2015; Lee and Wang 2016). Although probabilistic reasoning is of interest in four broad research areas, namely probabilistic logic programming (e.g. ProbLog) (Alberti et al. 2017; De Raedt and Kimmig 2015; Poole 2008; Riguzzi 2007; Sato 1995; Sato and Kameya 1997; Vennekens et al. 2004), probabilistic programming languages (e.g. BLOG) (Goodman et al. 2008; Kersting and Raedt 2008; Milch et al. 2005; Pfeffer and River Analytics 2009), statistical relational learning (e.g. Markov Logic Networks) (Jaeger 2018; Lee and Wang 2016; Richardson and Domingos 2006), and probabilistic knowledge bases (Borgwardt et al. 2018; Jung and Lutz 2012), as we shall see, none of the existing approaches fits our requirements. In particular, they either fail to provide simultaneous support for recursion and existential quantification or do not allow for inductive definitions in uncertain ontological reasoning settings.
Contribution. In this work, we introduce a theoretical framework and a workable system that implement probabilistic reasoning on vadalog KGs. In particular, we contribute the following:
- We present soft vadalog, a probabilistic extension of vadalog. This KRR language allows the definition of Probabilistic Knowledge Graphs (PKGs), that is, a probabilistic version of the notion of a knowledge base or KG commonly adopted in automated reasoning contexts. In particular, a soft vadalog program and a database define a probability distribution over the nodes of a chase network, a data structure (technically, a probabilistic graphical model) built by the application of the chase procedure. Here we combine the experience of the database community, where chase procedures generate the entailed facts, with that of statistical relational learning, as soft vadalog programs are templates for chase networks, along the lines of Markov Logic Networks, where weighted FO formulas generate Markov networks.
- We characterize the problem of reasoning on PKGs and conclude it is #P-hard.
- Based on the mentioned intractability result, we propose the MCMC-chase, an approximate technique for marginal inference combining a Markov Chain Monte Carlo method (specifically, the Metropolis-Hastings algorithm) with a chase-based procedure. Chase procedures are used in databases to enforce logic rules by generating entailed facts. Here, the chase procedure is guided by the MCMC, and marginal inference is performed in the process. We present the details of the algorithm and define and prove its theoretical underpinnings.
- We illustrate applications of probabilistic reasoning to record linkage (Christen 2012) and data fusion (Bleiholder and Naumann 2008), two core data management problems.
- Finally, an extension of the Vadalog system implementing the algorithm is experimentally evaluated on real-world and synthetic cases in a corporate economics setting.
This invited paper is a substantially extended version of a recent short work (Bellomarini et al. 2020), where we initially introduced soft vadalog. In particular, all the sections contain relevant new material and more detailed explanations. The background is broadly developed; theoretical results are fully reported, proven and discussed; applications to data management problems and experimental settings are completely new.
Overview. In Section 2, we provide motivation for our approach and analyze the related work. In Section 3, we introduce Warded Datalog $^\pm$ , vadalog, and the needed background. In Section 4, we introduce PKGs, soft vadalog, and probabilistic reasoning with them. In Section 5, we present the MCMC-chase algorithm and its theoretical underpinnings. In Section 6, we discuss the application use cases. Experimental settings are in Section 7. Section 8 concludes the paper.
2 Motivation and related work
The combination of logic-based approaches and uncertainty in the form of probability has been a long-standing goal of artificial intelligence. Seminal theoretical works (Halpern 1989; Bacchus 1990) make a preliminary distinction between statistical inquiries on general characteristics of the domain (e.g. in our example, "what is the probability for any random $\textsf{Contract}$ to be entailed by a matching $\textsf{LenderType}$ and $\textsf{RegulatoryRestriction}$?") and degrees of belief (e.g. "what is the probability that $\textsf{Contract}(c,l,v_0)$ is entailed by $\textsf{LenderType}(c,m)$ and $\textsf{RegulatoryRestriction}(m,l)$?"), and provide a framework to combine the two approaches. The former requires accounting for a probability distribution over the domain, the latter over the possible worlds. Along the lines of successful Statistical Relational Learning approaches (Singla and Domingos 2005), for ontological reasoning on KGs, we are interested in both perspectives and need to handle both uncertainty on the rules (and so the degree of belief) and, as a special case of it, on the data (and so the statistical characteristics of the domain). Beyond the early frameworks, which assume a fixed probabilistic distribution over the possible worlds, the complex domains in which KGs are used require handling a broader range of probabilistic relations between the domain objects.
In the areas of Probabilistic Logic Programming (PLP), Probabilistic Programming Languages (PPL), Statistical Relational Learning (SRL), and Probabilistic Knowledge Bases (PKB), we find different angles from which to approach uncertainty in reasoning to this broader extent. All these areas can be considered related to our work, yet none of them singularly satisfies the desiderata we have laid out in the introduction, and none can be directly used for our purpose of providing a probabilistic extension to vadalog and Warded Datalog$^\pm$ in particular.
Probabilistic Logic Programming (PLP) approaches (De Raedt and Kimmig 2015) mostly adopt the well-known distribution semantics (Sato 1995). According to this semantics, a program induces a probability distribution over a set of different programs (worlds). The marginal probability of a fact is then obtained as proportional to the number of models in which that fact is true, when finite. On this basis, different variants of the distribution semantics exist, depending on the adopted notion of model for a world, such as the standard ones: minimal stratified model, stable model, well-founded model.
PLP offers insufficient support for handling recursion and existentials together in a single decidable fragment. Apart from PLP languages that do not support recursion, most of the other frameworks and systems do not allow for non-ground recursive probabilistic rules including the creation of new values via existential quantification, or do not disclose details about how it would be handled. This is the case of Probabilistic Logic Programs (Dantsin 1991), Probabilistic Horn Abduction (Poole 1993), CP-Logic (Vennekens et al. 2009), ICL (Poole 2008), PRISM (Sato and Kameya 1997), LPAD (Riguzzi 2007; Vennekens et al. 2004), ProbLog (De Raedt and Kimmig 2015), cPlint (Alberti et al. 2017), and cProbLog (Latour et al. 2017).
These limitations are self-evident going back to Example 1.2 and trying to answer the query "What are all the entailed Contracts?" over G. The existing PLP techniques fail to conclude $\mathrm{Contract}(c,l,v_0)$, because they abort when running into probabilistic recursive rules that involve the creation of new values, such as Rules 5 and 8, or Rule 6.
Probabilistic Programming Languages (PPL) systems and frameworks, such as BLOG (Milch et al. 2005), BLP (Kersting and Raedt 2008), Church (Goodman et al. 2008), and Figaro (Pfeffer and River Analytics 2009), are outside our scope of interest. They are in fact typically based on an underlying Bayesian network model (Koller and Friedman 2009) and forbid recursion or existential quantification by construction.
Statistical Relational Learning (SRL) approaches have the great merit of pursuing a conciliation of logic-based reasoning and probability. Our yardstick here is Markov Logic (Domingos and Lowd 2019), and Markov Logic Networks (MLN) (Richardson and Domingos 2006) in particular, which are a probabilistic generalization of both FOL and probabilistic graphical models, such as Relational Bayesian Networks (Jaeger 2018). Markov logic relaxes the notion of hard constraints of ordinary logical knowledge bases and introduces soft constraints. An MLN is composed of a set of rules adorned with weights expressing each rule's relative importance in the domain description. In other words, FO rules are a template for the definition of Markov networks, where the actual probabilistic inference takes place after the construction of the network (grounding). A model for an MLN need not satisfy all the MLN rules, and the likelihood of a model depends on the weight of the rules it satisfies. The marginal probability of a fact in turn aggregates the marginal probability of the models in which it holds. This also applies to facts not entailed by the FOL theory: a fact holding only in some models (and thus not logically entailed) can have non-zero probability. Unfortunately, in the ontological reasoning context, this behavior results in the major limitation that MLNs cannot enforce marginal probability zero for models having facts not in the range of the transitive closure. This is just one example of a broader issue, the "ability to express (non-ground) inductive definitions" (Fierens et al. 2015), such as a graph path in terms of its edges. Example 1.2 immediately shows how this can lead to potentially incorrect results. For example, under the closed-world assumption (CWA) used in MLNs, the premise $\{\textsf{Contract}(a,b,c),\textsf{Exposure}(b,n)\}$ of Rule 7 is falsified ($\textsf{Exposure}(b,n) \notin D$ and is therefore false by CWA), so the rule does not require the presence of the fact $\textsf{Contract}(c,n,a)$ in models. Yet, it is impossible to enforce probability zero for models containing $\textsf{Contract}(c,n,a)$, or $\textsf{Contract}(c,b,a)$, or any other fact consistent with the theory.
MLNs are therefore unsuitable for direct adoption in ontological reasoning, where constructive inductive definitions are understood as a natural form of representation of human knowledge. In the context of KG reasoning, the requirement for general inductive definitions is even stronger, as nontrivial navigation of graph-based structures and individuation of edge patterns are core applications. For example, consider again the standard path definition, where rules hold with different strengths, for example, for data quality reasons: $0.9 :: \textit{edge}(X,Y)\to \textit{path}(X,Y)$; $0.8 :: \textit{edge}(X,Z), \textit{path}(Z,Y)\to \textit{path}(X,Y)$. Given the database $D=\{\textit{edge}(a,b),\textit{edge}(c,d)\}$, it would be impossible to assign probability zero to the fact $\textit{path}(a,d)$, expressing a non-existing path.
To complete the overview of SRL, it is worth mentioning two more proposals for probabilistic logical reasoning that also draw inspiration from Markov logic. One is $\text{LP}^{\text{MLN}}$, a relevant logic programming approach (Lee and Wang 2016) largely based on MLNs. Unfortunately, it is unsuitable for our purposes as it relies on stable model semantics instead of well-founded or simply stratified semantics, which we consider the standard option. The other has appeared in the context of Datalog$^\pm$ ontologies (Gottlob et al. 2013), yet has a specific focus on database repair and, more importantly, strict adherence to Markov logic semantics. We also draw inspiration from these approaches, but consider the KG reasoning setting and the consequent desiderata, which make Markov logic not directly applicable for the reasons we have seen in this section.
Probabilistic Knowledge Bases (PKB). Querying large-scale probabilistic knowledge bases is an interesting perspective on uncertain reasoning, with a broad range of proposals (Borgwardt et al. 2018). Those based on initial models of probabilistic databases (Dalvi and Suciu 2007) assume the probabilistic independence of the facts, while still encountering major computational challenges, as captured by the notorious dichotomy result for the evaluation of conjunctive queries (Dalvi and Suciu 2012). Currently, more advanced approaches stemming from probabilistic databases allow one to model essentially any probability distribution over the set of possible worlds (Green and Tannen 2006). Recognizing that probabilistic query evaluation is closely connected to the problem of weighted first-order model counting (Beame et al. 2014), knowledge compilation, database factorization, and tensor factorization techniques have been proposed (Olteanu 2016; Olteanu and Schleich 2016; Krompaß et al. 2014) to cope with computational complexity and, in this line, a number of reasoning systems based on approximate query answering arose, with remarkable examples such as SlimShot (Gribkoff and Suciu 2016), MayBMS (Huang et al. 2009), Tuffy (Niu et al. 2011), and others, as summarized in a recent survey (den Broeck and Suciu 2017). This entire line of research shares with our work the need for representing increasingly complex probabilistic relations between domain entities, yet it does not immediately lead to frameworks that are directly applicable to ontological reasoning with uncertainty on knowledge graphs, because it adopts the closed-domain assumption, usual in the database context.
Removing the closed-domain assumption, uncertain reasoning has been studied in the context of ontology-based data access (OBDA) to probabilistic data (Poggi et al. 2008) with both lightweight description logic (Jung and Lutz 2012) and Datalog$^\pm$ formalisms (Borgwardt et al. 2017). For the first category, relevant works (Ceylan and Peñaloza 2015; d'Amato et al. 2008) combine the $\mathcal{EL}$ and DL-Lite families of description logics with Bayesian networks. Yet, while admitting recursion and existential quantification, such ontological languages do not offer support for multi-attributed structures, for example those needed in property graphs even for basic reasoning tasks such as symmetric relations, and are thus unsuitable for KGs (Marx et al. 2017). The second category connects to the frameworks we have seen in the SRL area, such as those combining Datalog$^\pm$ and Markov logic (Gottlob et al. 2013), with the explored limitations for transitive closure.
3 Preliminaries
This section recalls the preliminary notions of Warded Datalog$^\pm$.
Let $\mathbf{C}$, $\mathbf{N}$, and $\mathbf{V}$ be disjoint countably infinite sets of constants, (labeled) nulls, and (regular) variables, respectively. A (relational) schema $\mathbf{S}$ is a finite set of relation symbols (or predicates) with associated arity. A term is either a constant or a variable. An atom over $\mathbf{S}$ is an expression $R(\bar v)$, where $R \in \mathbf{S}$ is of arity $n > 0$ and $\bar v$ is an n-tuple of terms. A database instance (or simply database) over $\mathbf{S}$ associates to each relation symbol in $\mathbf{S}$ a relation of the respective arity over the domain of constants and nulls. The members of relations are called tuples.
Warded Datalog$^\pm$, a member of the Datalog$^\pm$ family (Calì et al. 2012), extends Datalog (Ceri et al. 1989) with existential quantification to support ontological reasoning (whence the $+$ symbol in $\pm$) and stratified negation, while at the same time restricting other aspects of the syntax in order to guarantee decidability and tractability of the reasoning task (the $-$ symbol in $\pm$). A rule is a first-order sentence of the form $\forall \bar x \forall \bar y (\varphi(\bar x,\bar y)\ \rightarrow\ \exists \bar z \, \psi(\bar x, \bar z))$, where $\varphi$ (the body) and $\psi$ (the head) are conjunctions of atoms. We omit universal quantifiers and use commas to denote conjunction, as usual in this context. A program is a set of rules. Toward introducing the semantics of Datalog$^\pm$, let us ignore existential quantification for the moment and recall the standard Datalog semantics.
Datalog semantics. The semantics of a Datalog program P can be defined quite easily in terms of model theory (Ceri et al. 1989). The Herbrand Base (HB) is the set of all the facts that we can express in P, that is, all facts of the form $F(c_1,\ldots,c_k)$, where the $c_i$ are constants and F is a predicate. The Extensional Herbrand Base (EHB) is the subset of HB containing all facts whose predicate F never appears in a rule head of P; conversely, the Intensional Herbrand Base (IHB) is the subset of HB containing all facts whose predicate F appears in a rule head of P. Let $S\subseteq P$ be a finite set of Datalog rules, and let us discuss what is usually meant by logical consequence in the Datalog context. A fact F is a logical consequence of a set of rules S ($S\models F$) if every interpretation I that satisfies S also satisfies F. We limit ourselves to the Herbrand Interpretations (HI): those that can be built by subsetting the HB. A Herbrand Model (HM) for S is then a HI satisfying S. Note that, as usual in this context, the closed-world assumption is adopted. We name $\textit{cons}(S)$ the set of all facts F that are logical consequences of S, that is, $\textit{cons}(S)=\bigcap\,\{I~|~I~\text{is a}~\textit{HM}~\text{of}~S\}$. As the intersection of HMs is a HM, we have that $\textit{cons}(S)$ is the Least Herbrand Model (LHM) of S. The semantics of a Datalog program P is then a mapping $\mathcal{M}_P$ from the powerset of EHB to the powerset of IHB that associates every possible database $D \subseteq$ EHB to the set $\mathcal{M}_P(D)$ of intensional derived facts defined as $\mathcal{M}_P(D) = \textit{cons}(P \cup D) \cap \text{IHB}$.
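To make the fixpoint construction concrete, the following sketch (a minimal naive-evaluation illustration of ours in Python, not the Vadalog system's implementation; all names are our own) computes the LHM of the transitive-closure program mentioned in Section 1 bottom-up:

```python
def is_var(t):
    # in our encoding, variables are uppercase strings, constants lowercase
    return isinstance(t, str) and t[:1].isupper()

def match(atom, fact, subst):
    # unify a body atom against a ground fact, extending the substitution
    (pred, terms), (fpred, fterms) = atom, fact
    if pred != fpred or len(terms) != len(fterms):
        return None
    s = dict(subst)
    for t, f in zip(terms, fterms):
        if is_var(t):
            if s.setdefault(t, f) != f:
                return None
        elif t != f:
            return None
    return s

def lhm(edb, rules):
    # least Herbrand model: apply all rules bottom-up until a fixpoint
    facts = set(edb)
    changed = True
    while changed:
        changed = False
        for body, (hpred, hterms) in rules:
            substs = [{}]
            for atom in body:  # join the body atoms against the current facts
                substs = [s2 for s in substs for f in facts
                          if (s2 := match(atom, f, s)) is not None]
            for s in substs:
                new = (hpred, tuple(s.get(t, t) for t in hterms))
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts

# edge(X,Y) -> path(X,Y);  edge(X,Z), path(Z,Y) -> path(X,Y)
rules = [([("edge", ("X", "Y"))], ("path", ("X", "Y"))),
         ([("edge", ("X", "Z")), ("path", ("Z", "Y"))], ("path", ("X", "Y")))]
print(sorted(lhm({("edge", ("a", "b")), ("edge", ("b", "c"))}, rules)))
# the two edges plus path(a,b), path(b,c), path(a,c)
```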
With these model-theoretic premises, let us extend our consideration to existential rules, with an operational approach. The semantics of a set of existential rules $\Sigma$ over an instance D, denoted $\Sigma(D)$ , can be defined via the chase procedure. The chase adds new facts to D until $\Sigma(D)$ satisfies all the existential rules. The terms of the facts may include freshly generated symbols, namely labeled nulls or marked nulls, to satisfy existential quantification.
Example 3.1 Consider the database $D = \{\mathrm{Person}(\mathrm{Alice})\}$, and the set of existential rules:

$$\text{Rule 1:}\quad \mathrm{Person}(x) \rightarrow \exists y\ \mathrm{HasMother}(x,y)$$

$$\text{Rule 2:}\quad \mathrm{HasMother}(x,y) \rightarrow \mathrm{Person}(y)$$
The ground fact triggers Rule 1, and the following facts are added to D by the chase, where $\nu_1$ is a labeled null: $\mathrm{HasMother}(\mathrm{Alice},\nu_1)$ and $\mathrm{Person}(\nu_1)$. The fact $\mathrm{Person}(\nu_1)$ triggers Rule 1 again, and the chase adds the facts $\mathrm{HasMother}(\nu_1,\nu_2)$ and $\mathrm{Person}(\nu_2)$, where $\nu_2$ is a new labeled null. Finally, the chase result is the following instance, where $\nu_1, \nu_2, \ldots$ are labeled nulls: $\{\mathrm{Person}(\mathrm{Alice}),\mathrm{HasMother}(\mathrm{Alice},\nu_1)\}\cup\bigcup_{i>0}\{\mathrm{Person}(\nu_i),\mathrm{HasMother}(\nu_i,\nu_{i+1})\}$.
The chase. Let us see the chase more formally. Initially, we have $\Sigma(D) = D$. By a unifier we mean a mapping from variables to constants or labeled nulls. We say that a rule $\rho = \varphi(\bar x,\bar y) \rightarrow \exists \bar z\, \psi(\bar x,\bar z)$ is applicable to $\Sigma(D)$ if there is a unifier $\theta_\rho$ such that $\varphi(\bar x \theta_\rho,\bar y \theta_\rho)\subseteq \Sigma(D)$ and $\theta_\rho$ has not been used to generate new facts in $\Sigma(D)$ via $\rho$. If $\rho$ is applicable to $\Sigma(D)$ with a unifier $\theta_\rho$, then the chase performs a chase step, that is, it generates new facts $\psi(\bar x \theta_\rho',\bar z\theta_\rho')$ that are added to $\Sigma(D)$, where $\bar x \theta_\rho = \bar x \theta'_{\rho}$ and $z_i \theta'_{\rho}$, for each $z_i \in \bar z$, is a fresh labeled null that does not occur in $\Sigma(D)$. The chase step easily generalizes to a set of rules. The procedure performs chase steps until no rule in $\Sigma$ is applicable. $\Sigma(D)$ is potentially infinite because of the generation of infinitely many labeled nulls. However, we will consider the chase up to isomorphism of facts, which is sufficient for our logical reasoning task in Warded Datalog$^\pm$ and is finite (Bellomarini et al. 2018), as we shall see.
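The following sketch (ours, reusing the match helper from the previous snippet; the actual chase in the Vadalog system is far more sophisticated) implements one such chase step, inventing a fresh labeled null per existentially quantified variable and recording used unifiers so that no unifier fires twice:

```python
import itertools

_nulls = itertools.count()

def fresh_null():
    return f"nu{next(_nulls)}"  # a fresh labeled null

def chase_step(facts, rule, applied):
    # rule = (body, head_atoms, exist_vars); `applied` holds the unifiers
    # already used for this rule, per the applicability condition above
    body, head_atoms, exist_vars = rule
    substs = [{}]
    for atom in body:
        substs = [s2 for s in substs for f in facts
                  if (s2 := match(atom, f, s)) is not None]
    new_facts = set()
    for s in substs:
        key = tuple(sorted(s.items()))
        if key in applied:
            continue
        applied.add(key)
        s = dict(s)
        for z in exist_vars:  # extend the unifier with fresh labeled nulls
            s[z] = fresh_null()
        for pred, terms in head_atoms:
            new_facts.add((pred, tuple(s.get(t, t) for t in terms)))
    return new_facts

# Person(x) -> exists y HasMother(x,y), as in Example 3.1
rule = ([("Person", ("X",))], [("HasMother", ("X", "Y"))], ["Y"])
facts, applied = {("Person", ("Alice",))}, set()
facts |= chase_step(facts, rule, applied)
print(facts)  # adds HasMother(Alice, nu0), with nu0 a labeled null
```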
Wardedness. Let us define as frontier variables the universally quantified variables of a rule that also appear in the head. Wardedness introduces syntactic restrictions to limit the propagation of labeled nulls via the frontier variables. Given a predicate p appearing in a set of rules $\Sigma$, a position p[i] is the i-th term of p, with $i=1,\ldots$. A position p[i] is affected if: (i) p[i] holds an existentially quantified variable in the head of some rule $\rho$ of $\Sigma$ (e.g. $\textsf{HasMother}[2]$ in Example 3.1 is affected); or, (ii) for some rule $\rho$ of $\Sigma$, a frontier variable x appears only in affected body positions of $\rho$ and in position p[i] in the head (e.g. $\textsf{Person}[1]$ is affected).
Affectedness induces the following classification of variables. If a variable appears only in affected positions of a rule $\rho$, then it is harmful, otherwise it is harmless, with respect to that rule. A harmful frontier variable is called dangerous. For instance, the variable y in the second rule of Example 3.1 is dangerous. A rule is warded if (i) all the dangerous variables appear in the body in a single atom (the ward), and (ii) all the variables of the ward that are in common with other body atoms are harmless. A program is warded if all its rules are warded. The program in Example 3.1 is warded and, in particular, the wards are $\textsf{Person}$ in the body of the first rule and $\textsf{HasMother}$ in the body of the second one.
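As an illustration, affected positions and the wardedness conditions can be checked syntactically with a simple fixpoint, sketched below under our own rule encoding (body, head atom, existential variables); this is only a didactic approximation of the checks performed by actual warded reasoners:

```python
def positions(atom_list, x):
    # 0-indexed positions (predicate, i) at which variable x occurs
    return [(p, i) for p, ts in atom_list for i, t in enumerate(ts) if t == x]

def affected_positions(rules):
    aff = set()
    # (i) head positions holding existentially quantified variables
    for body, (hp, hts), ev in rules:
        aff |= {(hp, i) for i, t in enumerate(hts) if t in ev}
    changed = True
    while changed:  # (ii) propagate through frontier variables
        changed = False
        for body, (hp, hts), ev in rules:
            for x in {t for t in hts if t[:1].isupper() and t not in ev}:
                occs = positions(body, x)
                if occs and set(occs) <= aff:
                    for i, t in enumerate(hts):
                        if t == x and (hp, i) not in aff:
                            aff.add((hp, i))
                            changed = True
    return aff

def is_warded(rules):
    aff = affected_positions(rules)
    for body, (hp, hts), ev in rules:
        harmful = {x for _, ts in body for x in ts if x[:1].isupper()
                   and set(positions(body, x)) <= aff}
        dangerous = harmful & {t for t in hts if t[:1].isupper() and t not in ev}
        if not dangerous:
            continue
        wards = [set(ts) for _, ts in body if dangerous <= set(ts)]
        if not wards:  # all dangerous variables must share one body atom
            return False
        shared = {x for _, ts in body if set(ts) != wards[0] for x in ts}
        if wards[0] & shared & harmful:  # shared ward variables must be harmless
            return False
    return True

# Example 3.1: Person(x) -> exists y HasMother(x,y); HasMother(x,y) -> Person(y)
rules = [([("Person", ("X",))], ("HasMother", ("X", "Y")), {"Y"}),
         ([("HasMother", ("X", "Y"))], ("Person", ("Y",)), set())]
print(is_warded(rules))  # True
```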
Warded reasoning. Let D be a database instance over the domain of constants $\mathbf{C}$ , and $\Sigma$ be a program. Given a query $Q = (\Sigma,\mathrm{Ans})$ where Ans is an n-ary predicate, an answer Q(D) is the set of all facts $\mathrm{Ans}(\bar t)$ , where the tuple ${\bar t} \in (\mathit{dom}(D) \cup \mathbf{N})^n$ , such that $\mathrm{Ans}(\bar t) \in \Sigma(D)$ .
First of all, observe that we allow labeled nulls in the answers. In fact, it is often the case that the set of rules $\Sigma$ , modeling the domain of interest, cannot completely generate from D all the values of the intensional atoms, and yet we wish to return the labeled null values, for instance for later comparisons between the tuples in the query answer.
Also, since $\Sigma(D)$ is potentially infinite, the number of answers to a query could be infinite as well. In this work, we choose to find a general representative answer that subsumes all the others. To this end, we borrow the notion of universal answer (Fagin et al. 2005), very common in data exchange settings. A universal answer $\mathcal{U}$ is such that for any other answer $\mathcal{U}^\prime$, there exists a substitution of labeled nulls $\theta_U$ such that $\mathcal{U}\theta_U$ (i.e. the application of the substitution to all the facts contained in the answer) coincides with $\mathcal{U}^\prime$. Intuitively, as it represents the most general answer, a universal answer can be mapped onto any other answer by assigning the labeled nulls appearing in its facts. In our setting, a logical reasoning task, or more simply reasoning task, amounts to computing a universal answer. For a Warded Datalog$^\pm$ program, the reasoning task is decidable and is PTIME in data complexity (i.e. when the query is fixed and the data size varies) (Gottlob and Pieris 2015).
Two facts are isomorphic when they have the same terms up to renaming of the labeled nulls; more technically, when there exists a bijection between their terms that preserves constants. Given a database instance D and a set of warded rules $\Sigma$, let $\Sigma(D)/\mathcal{I}$ be the quotient structure induced by the fact isomorphism relation $\mathcal{I}$. We define the warded semantics or warded chase $\Sigma_{W}(D)$ as the set of all the class representatives of $\Sigma(D)$ in $\Sigma(D)/\mathcal{I}$, one for each equivalence class. Roughly speaking, $\Sigma_{W}(D)$ can be seen as a "flat" version of a standard quotient structure, where for each equivalence class, a representative is chosen. We have that $\Sigma_{W}(D)$ is finite, as facts have finite arity and labeled nulls can appear in a finite number of positions, eventually giving rise to isomorphic copies.
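For instance, under our hypothetical encoding where labeled nulls are strings prefixed with "nu", the isomorphism check between two facts reduces to building a bijective renaming of nulls (a sketch of ours, not the system's implementation):

```python
def is_null(t):
    return isinstance(t, str) and t.startswith("nu")  # our null encoding

def isomorphic(f1, f2):
    (p1, ts1), (p2, ts2) = f1, f2
    if p1 != p2 or len(ts1) != len(ts2):
        return False
    fwd, bwd = {}, {}
    for a, b in zip(ts1, ts2):
        if is_null(a) and is_null(b):
            # the renaming of labeled nulls must be a bijection
            if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
                return False
        elif a != b:  # constants must be preserved exactly
            return False
    return True

print(isomorphic(("HasMother", ("Alice", "nu1")),
                 ("HasMother", ("Alice", "nu7"))))  # True
print(isomorphic(("HasMother", ("Alice", "nu1")),
                 ("HasMother", ("Bob", "nu1"))))    # False
```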
Operationally, $\Sigma_W(D)$ is generated by applying reasoning algorithms that execute only a finite number of chase steps, based on the recognition of "repeating patterns" in the chase (Bellomarini et al. 2018). In particular, given two isomorphic facts h and $h^\prime$, one needs to explore only h, and so never performs chase steps starting from $h^\prime$. In this sense, we will use warded chase step to mean a chase step limited to those unifiers allowed by the isomorphism criterion, and fact-isomorphic to refer to database instances for which there exists a bijection between their facts such that the corresponding facts are isomorphic. The warded chase semantics is uniquely defined for D and independent of the rule application order, modulo fact isomorphism. In fact: (i) all applicable rules are applied in the chase, and (ii) after a normalization step that eliminates joins on harmful variables, whenever two facts are isomorphic, the chase graph portions derived from them are guaranteed to be isomorphic as well (Bellomarini et al. 2018, Th. 2). In total, no class representatives are lost in any rule application order.
An interesting practical consequence of wardedness is that queries can be evaluated against the finite instance $\Sigma_W(D)$, with the following guarantee: (1) a fact $\bar t$ is in $\Sigma_W(D)$ iff $\bar t$ is in $\Sigma(D)$, that is, the usual chase semantics and the warded semantics are equivalent with respect to the reasoning task. The forward implication of (1) ($\Rightarrow$) holds since $\Sigma_{W}(D) \subseteq \Sigma(D)$ by construction; the backward implication of (1) ($\Leftarrow$) directly descends from Bellomarini et al. (2018, Th. 2): intuitively, the only case in which a fact $\bar t \in \Sigma(D)$ would not correspond to any isomorphic fact $\bar t^\prime \in \Sigma_W(D)$ would postulate the existence of an ancestor $\bar s^\prime\in\Sigma(D)$ for which no warded chase step has been activated in $\Sigma_W(D)$ because an isomorphic fact $\bar s^{\prime\prime}\in\Sigma_W(D)$ had already triggered a rule. Yet, this would contradict Bellomarini et al. (2018, Th. 2), which guarantees that the chase subgraphs derived from $\bar s^\prime$ and $\bar s^{\prime\prime}$ are isomorphic, and so that $\bar t^\prime \in \Sigma_W(D)$.
It is also worth remarking that, besides atomic queries (only one body predicate), a larger class of conjunctive queries, namely warded conjunctive queries, can be evaluated against the finite instance $\Sigma_W(D)$, with a query equivalence guarantee that directly descends from Bellomarini et al. (2018, Th. 2). We say that a conjunctive query $Q: q(\mathbf{x}) \leftarrow \boldsymbol{\phi}(\mathbf{x},\mathbf{y})$, where $\boldsymbol{\phi}(\mathbf{x},\mathbf{y})$ is a conjunction of atoms and $\mathbf{x},\mathbf{y}$ are tuples of terms in $\mathit{dom}(D) \cup \mathbf{V}$, is warded w.r.t. a set of rules $\Sigma$ if $\Sigma \cup \{\rho_Q\}$ is warded, where ${\rho_Q}$ is a rule expressing Q as $\nu(\mathbf{x}) \leftarrow \boldsymbol{\phi}(\mathbf{x},\mathbf{y})$ and $\nu$ is an invented atom. Operationally, if not already atomic, the conjunctive query Q is translated into a rule $\rho_Q$ of $\Sigma$ and simulated by the atomic query $\hat{Q}:q(\mathbf{x})\leftarrow\nu(\mathbf{x})$.
Additional features. vadalog is a logical language based on Warded Datalog$^\pm$ that extends it with features of practical utility such as negation, aggregation, algebraic operators, and so on. The language is fully implemented and engineered in the Vadalog system (Bellomarini et al. 2018), a KG management system. While in this work we concentrate on the core of vadalog, Warded Datalog$^\pm$, we will use negation and aggregations in the application cases.
vadalog adopts the usual stratified negation semantics (Dantsin et al. 2001), whose principle is operationally straightforward: provided that a negated atom "$\textit{not}~a$" appearing in a rule $\rho$ of $\Sigma$ is safe, that is, has as terms only variables appearing in positive body atoms of $\rho$, for each unifier for which $\rho$ is applicable (not considering a), the rule body is evaluated only if such unifier does not bind a to any fact that has already been generated. Intuitively, the stratification guarantees that negation is not used in a circular way in the predicate dependency graph of $\Sigma$. Hence, in the evaluation of "$\textit{not}~a$," the extension of a can be entirely and unambiguously determined beforehand. The stratification condition can be syntactically checked.
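As a small illustration in the vocabulary of Example 1.1 (a hypothetical two-strata program of ours, with $\mathit{Secured}$ and $\mathit{Unsecured}$ as invented predicates), the rules

$$\mathit{Guarantee}(x,z,v) \rightarrow \mathit{Secured}(x) \qquad \mathit{LenderType}(x,y), \textit{not}~\mathit{Secured}(x) \rightarrow \mathit{Unsecured}(x)$$

are stratified, as the extension of $\mathit{Secured}$ can be fully computed before the negated atom is evaluated; replacing the first rule with $\mathit{Unsecured}(x) \rightarrow \mathit{Secured}(x)$ would instead create a cycle through negation and violate the condition.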
Aggregations have been introduced in multiple contexts using a logical formalism, and a careful definition of their semantics arose in all of them, especially when a procedural semantics is used for operational reasons besides a model-based one. Here, we adopt a simple solution, based on stratified semantics (Mumick et al. 1990), which is however enough for our purposes. For each unifier for which a rule is applicable, the aggregation condition is evaluated on all the possible unifiers (defining its operands) that bind the rule body to facts that have already been generated. In a rule $\varphi(\bar{x},\bar{y}), v=\textit{aggr}(q) \to \psi(\bar{x},v)$, where q is a variable of $\bar{x}$ and aggr is a generic aggregation operator, the value of variable v is computed by aggregating the values of q over the distinct groups defined by the values of $\bar{x}\setminus\{q\}$.
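Operationally, under stratified semantics the aggregation behaves like a group-by over already-generated facts. A minimal sketch of ours (predicate and column names are illustrative, not part of vadalog):

```python
from collections import defaultdict

def evaluate_aggregation(body_facts, q_index, aggr=sum):
    # group the body facts by the terms of x other than q,
    # then aggregate the values of q within each group
    groups = defaultdict(list)
    for terms in body_facts:
        key = terms[:q_index] + terms[q_index + 1:]
        groups[key].append(terms[q_index])
    return {key + (aggr(vals),) for key, vals in groups.items()}

# hypothetical Exposure(lender, amount) facts, summed per lender
print(evaluate_aggregation([("c", 100.0), ("c", 250.0), ("l", 40.0)], q_index=1))
# {('c', 350.0), ('l', 40.0)}
```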
4 A framework for probabilistic reasoning with knowledge graphs
In this section, we are going to introduce Probabilistic Knowledge Graphs (PKGs) and soft vadalog. Let us outline the approach first. A rule $\rho: \forall \bar x \forall \bar y\,(\varphi(\bar x,\bar y) \rightarrow \exists \bar z\, \psi(\bar x,\bar z))$ of a program $\Sigma$ can be viewed as a hard constraint over D. The chase of D under $\Sigma$ enforces all the constraints in $\Sigma$, by applying rules until they are all satisfied. Query answering is then performed on the database instance derived in this way.
Probabilistic Knowledge Graphs soften these constraints and admit incomplete answers derived from a partial chase. A partial chase can be seen as a derived database not satisfying some of the rules of $\Sigma$ or, in operational terms, one where the chase steps needed to generate the conclusions of some rules of $\Sigma$ have not been applied. Along the lines of Markov Logic Networks, in PKGs each rule has an associated weight representing the difference in log probability between a database instance that satisfies the rule and one that does not. In this sense, the rule weights are not probabilities, but measure the importance of a rule in $\Sigma$.
More precisely, PKGs induce a probability distribution on the facts in all the partial chases $\Sigma(D)$ , such that their likelihood is proportional to the weight of the rules that have been applied.
Approach outline. Our approach consists of the following two pillars: (i) we define a Probabilistic Knowledge Graph as the combination of an input database D and a set $\Sigma$ of probabilistic existential rules expressed in soft vadalog (a probabilistic extension of vadalog); (ii) we construct a structure, called chase network, that comprises all possible databases that can be obtained from D by applying rules in $\Sigma$. This structure suffices to compute the marginal probabilities of the facts in the answer of a given query Q and to solve the probabilistic reasoning task. Points (i) and (ii) are dealt with in Section 4.1.
However, in order to make the framework applicable, we need to mitigate two issues: logical inference in the presence of general FO rules is undecidable or intractable; computing exact marginal probabilities is intractable as well (#P-hard). For the first issue, we leverage Warded Datalog$^\pm$, for which reasoning is polynomial, as we have seen. For the second, we compute approximate marginal probability. Thus, in Section 5 we introduce an MCMC method that simultaneously performs logical and marginal inference.
4.1 soft vadalog and probabilistic knowledge graphs
We extend Warded Datalog $^\pm$ , and as a consequence vadalog, to soft vadalog by introducing soft rules. A soft vadalog rule is a pair $(\rho, w(\rho))$ , where $\rho$ is a warded rule and $w(\rho)\in \mathbb{R} \cup \{+ \infty, - \infty\}$ is a weight, reflecting how strong a constraint is and so the absolute bias for a model to respect it (or not to respect it, in the case of negative weights). A soft rule $(\rho, + \infty)$ is called a hard rule. A soft vadalog program is a set of soft vadalog rules.
Semantics of soft vadalog. A soft vadalog program specifies a probability distribution over a chase network, a graph holding the database instances generated by all the partial chase applications over a given database instance.
A Probabilistic Knowledge Graph is a pair $\langle D, \Sigma \rangle$, where D is a database instance over the domain of constants $\mathbf{C}$ and $\Sigma$ is a soft vadalog program. A PKG can be viewed as a template for constructing chase networks. We define the closure $\mathit{cl}_\Sigma(D)$ of a database D under a set of warded rules $\Sigma$ as the database $D^\prime$ obtained by computing the chase $D^\prime = \Sigma^\prime_W(D)$, where $\Sigma^\prime \subseteq \Sigma$ is the set of the hard rules in $\Sigma$. By wardedness, the database $D^\prime$ is finite and unique modulo fact isomorphism (Section 3). D is closed under $\Sigma$ if D is fact-isomorphic to $\Sigma_W(D)$, that is, $D = \Sigma_W(D)$. Given a PKG $\mathcal{G} = \langle D,\Sigma\rangle$, a chase network $\Gamma(\mathcal{G})$ is a triple $\langle \mathbf{W}, \mathbf{T}, W_0\rangle$, where:
1. $\mathbf{W}$ is a set of nodes and $\mathbf{T}$ is a set of edges.
2. Each node of $\mathbf{W}$ corresponds to a class C of reachable database instances, each closed under the hard rules of $\Sigma$ and with relation symbols from $D\cup\Sigma$. In particular, each class C represents a set of fact-isomorphic database instances.
3. $W_0 \in \mathbf{W}$ is a source node associated with $cl_\Sigma(D)$, that is, $W_0$ is associated with the closure of D.
4. There is an edge $t_s \in \mathbf{T}$ from W to $W^\prime$ iff the database instance associated with $W^\prime$ can be obtained from the one associated with W by one transition step. A transition step s from W to $W^\prime$ consists of a warded chase step of at least one applicable soft rule with one unifier, followed by the closure with respect to the hard rules of $\Sigma$. Edge $t_s$ is then labeled by $\sum_{\rho\in \sigma} w(\rho)$, where $\sigma$ is the set of soft rules applied.
A node W of $\Gamma$ represents a class of fact-isomorphic databases. By wardedness, $\mathbf{W}$ and each database associated with a node $W\in\mathbf{W}$ are finite. Each class of fact-isomorphic instances is represented by exactly one node, and all the paths of the chase network connecting $W_0$ to the same class of instances converge into the same terminal node. In fact, multiple isomorphic versions of the same instances should not be considered at all, because their facts are semantically equivalent for query answering, and so they should be for marginal probability as well. Nevertheless, for its computation, we will take into account the influence of all possible transition steps leading to isomorphic instances. Each transition step can only add facts to a node $W_i$; therefore, the chase network does not contain directed cycles. Moreover, the chase network is a multigraph, since two nodes W and $W^\prime$ can be connected by multiple edges, one for each possible transition step from W to $W^\prime$.
Let $\Pi_\mathcal{G}(W)$ be the set of all the edges appearing in any path from $W_0$ to a node W in $\Gamma(\mathcal{G})$. We define the weight w(W) of W as the sum of the edge labels of the edges in $\Pi_\mathcal{G}(W)$. The chase network induces the following probability distribution over its nodes:

$$P(W) = \frac{1}{Z}~\mathrm{exp}~w(W). \qquad (1)$$
By considering the summation of the weights along all the possible paths to a node W, we want to capture the relevance of W with respect to the entire derivation process, so not only which rules/unifiers it satisfies, but also how "reachable" it is in a chase execution. The normalization constant Z is a partition function making P(W) a proper distribution; it is defined as $Z= \sum_{W} ~\mathrm{exp}~w(W)$. For a given fact f, its marginal probability P(f) can be calculated as the summation of the likelihood of the nodes it appears in:

$$P(f) = \sum_{W \in \mathbf{W} \,:\, f \in W} P(W). \qquad (2)$$
Figure 1 shows a fragment of a possible chase execution (left part) and the respective chase network (on the right) for the inductive definition in Example 1.2. In the chase execution diagram, the nodes are the facts f in the database instances associated with nodes $W_i$ of the chase network. Facts f are annotated with a set $\{W_0,\ldots,W_n\}$ of nodes of the chase network such that for each $W_i$ in the set, $f \in W_i$ . Solid edges are warded chase steps applying hard rules; dashed edges are for soft rules, with weight $\gamma$ . In the chase network, nodes are database instances; they are connected by edges annotated with the rule $(\rho)$ that has been applied and its weight (in boldface).
From the chase network, we can now compute the marginal probability of Contracts. We have $w(W_0)=0$; then it follows that $w(W_1)=0.7$, $w(W_2)=0.7+0.8=1.5$, $w(W_3)=0.7+0.9=1.6$, and $w(W_4)=0.7+(0.8+0.9)\times 2=4.1$. We can then calculate the marginal probability of Contract(c,l,a) with Equation 2. This fact appears in $W_1$, $W_2$, $W_3$, $W_4$, so we have: $(e^{0.7}+e^{1.5}+e^{1.6}+e^{4.1}) / Z = 0.99$, with $Z=1+e^{0.7}+e^{1.5}+e^{1.6}+e^{4.1}$. Similarly, for Contract$(c,l,v_0)$, we have: $(e^{1.6} + e^{4.1}) / Z = 0.9$.
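As a quick sanity check, the following snippet (ours; the node weights and node memberships are hard-coded from the example above) reproduces these marginals numerically:

```python
import math

# node weights from the worked example above
w = {"W0": 0.0, "W1": 0.7, "W2": 1.5, "W3": 1.6, "W4": 4.1}
Z = sum(math.exp(v) for v in w.values())  # partition function of Equation 1

def marginal(nodes):
    # Equation 2: sum the likelihoods of the nodes the fact appears in
    return sum(math.exp(w[n]) for n in nodes) / Z

print(round(marginal(["W1", "W2", "W3", "W4"]), 2))  # Contract(c,l,a):  0.99
print(round(marginal(["W3", "W4"]), 2))              # Contract(c,l,v0): 0.9
```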
Discussion. As our framework is based on Datalog$^\pm$ features, it incorporates existential quantification with the expressive power allowed by the warded fragment (see Section 3), also when recursion is involved, and is thus suitable for ontological reasoning (see requirement (i) of Section 1). We use the chase as an operational tool to generate all the facts induced by a database instance and a Warded Datalog$^\pm$ program. Unlike in MLNs, the program can encode inductive definitions, for example involving transitive closures, and supports full recursion, as directly inherited from Datalog semantics. In fact, according to the model-theoretic semantics of Datalog, the generated facts are those defining the Least Herbrand Model of the program (see Section 3), and the facts that, while satisfying the program rules, are not directly entailed by the chase, and so not part of the LHM, are not derived. Unlike in MLNs, these facts therefore assume zero probability by construction (requirement (ii)).
It is worth making a final note about the origin of soft vadalog programs. While there is a substantial and soaring literature about first-order rule learning, we consider it out of scope here and assume that rules are defined by domain experts. As weights reflect the relative importance of rules in the KG, in our experience they can be effectively assigned and tuned by domain experts as well, within an iterative process. They can also be efficiently learnt from relational data by optimizing a pseudo-likelihood measure: while they cannot be directly estimated with maximum likelihood, the concavity of the log-likelihood function allows them to be derived via gradient-based, quasi-Newton estimations or iterative scaling (Nocedal and Wright 1999), as in MLNs.
4.2 Probabilistic reasoning
Let us now formally study the probabilistic reasoning task. Given a probabilistic knowledge graph $\mathcal{G} =\langle D,\Sigma\rangle$, let $Q = (\Sigma,\mathrm{Ans})$ be a query, where $\mathrm{Ans}$ is an n-ary predicate. A pair $\langle {\bar t}, p \rangle$, where ${\bar t}$ is a tuple and $p \in [0,1]$ is a real number, is in a probabilistic answer Q(D) if $\mathrm{Ans}(\bar t)$ is a fact of some instance associated with nodes of $\mathbf{W}$ in $\Gamma(\mathcal{G})$, and p is the marginal probability $P(\bar t)$ of $\mathrm{Ans}(\bar t)$. The probabilistic reasoning task then consists in computing the probabilistic answer Q(D) as the set of pairs $\{ \langle {\bar t}, P(\bar t) \rangle \}$.
Probabilistic reasoning consists of two phases: (i) grounding, that is, the construction of the chase network; and (ii) marginal inference, that is, the computation of the marginal probability of each fact in the query answer. Grounding requires an exponential number of chase executions, each with polynomial complexity (by wardedness) in the size of D. Marginal inference in soft vadalog is #P-hard when the program is assumed to be fixed.
Proposition 1 The probabilistic reasoning task is $\#P$ -hard in data complexity.
Proof. We adapt the proof of $\#P$-hardness of query answering over probabilistic databases from Suciu et al. (2011). Let $\Phi = \bigvee_{(i,j)\in E} X_i \wedge Y_j$ be a PP2DNF formula, where $X_1, X_2, \ldots$ and $Y_1, Y_2, \ldots$ are disjoint sets of propositional variables. It is known that the problem of counting the number of satisfying assignments for PP2DNF formulas is hard for $\#P$ (Provan and Ball 1983). Suppose we have three EDBs $R/1, T/1$ and $S/2$. We define the database D as the set of facts $\{R(X_1), R(X_2),\ldots, T(Y_1), T(Y_2),\ldots\} \cup \{S(X_i, Y_j) \mid (i, j) \in E\}$. Additionally, we define a set of rules: the two weight-0 soft rules $0 :: R(x) \rightarrow R^\prime(x)$ and $0 :: T(y) \rightarrow T^\prime(y)$, and the hard rule $R^\prime(x), S(x,y), T^\prime(y) \rightarrow Q()$.
Intuitively speaking, the weight 0 of both soft rules ensures that all possible worlds have equal weights. With worlds having equal weights, computing probabilities comes down to counting, as is our goal. Let us now give the details. We associate an assignment $\theta$ for the variables $X_i$ and $Y_j$ with a possible world $\mathcal{D}$ as follows: $\theta(X_i) = \mathit{true}$ iff $R'(X_i) \in \mathcal{D}$ and $\theta(Y_j) = \mathit{true}$ iff $T'(Y_j)\in \mathcal{D}$, which establishes a one-to-one correspondence. Note that $Q() \in \mathcal{D}$ iff $\Phi$ evaluates to true under $\theta$. Then the number q of satisfying assignments for $\Phi$ equals $2^n \cdot P(Q())$, where n is the number of the Boolean variables $X_i$ and $Y_j$. In fact, $P(Q()) = q\cdot e^0 / Z = q/2^n$, with $Z=2^n\cdot e^0$. Thus, an algorithm that computes the marginal probability P(Q()) also counts the satisfying assignments of $\Phi$, which is $\#P$-hard.
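To make the counting correspondence concrete, a brute-force check (ours; the toy formula is illustrative and not part of the proof) confirms that $P(Q()) = q/2^n$:

```python
from itertools import product

def count_and_probability(n_x, n_y, E):
    # count the satisfying assignments q of Phi = OR_{(i,j) in E} X_i AND Y_j;
    # all 2^n worlds carry weight e^0 = 1, hence P(Q()) = q / 2^n
    n = n_x + n_y
    q = sum(1 for bits in product([False, True], repeat=n)
            if any(bits[i] and bits[n_x + j] for i, j in E))
    return q, q / 2 ** n

# Phi = (X_1 and Y_1) or (X_2 and Y_2), 0-indexed: 7 of 16 assignments satisfy
print(count_and_probability(2, 2, E=[(0, 0), (1, 1)]))  # (7, 0.4375)
```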
5 The MCMC-chase algorithm
In order to cope with the high complexity of reasoning on PKGs, we introduce the MCMC-chase, a technique that blurs the conceptual distinction between grounding and marginal inference and performs marginal inference by sampling the chase network. Instead of performing the full grounding of the chase network and then sampling it, we make the chase driven by an MCMC algorithm, so that only a representative "subspace" of the chase network, a subgraph, is built. On that subgraph, the MCMC-chase computes exact marginal inference by applying Equation 2. The relevant subgraph of the chase network is chosen depending on the weights of the rules and, as a consequence, on the induced instances for the nodes in $\mathbf{W}$.
In particular, the MCMC-chase is an independence-sampling (Tierney 1994) MCMC where the chase procedure is seen as a Markov process (Gilks et al. 1995) over the nodes of the chase network. Given a PKG $\mathcal{G} = \langle D,\Sigma\rangle$, the MCMC-chase applies soft rules of $\Sigma$ with a probability proportional to the rule weight and generates nodes of $\Gamma(\mathcal{G})$. The algorithm starts from D and applies rules, creating new nodes of $\mathbf{W}$. The algorithm keeps track of the weight of the current node and decides whether to accept or reject it according to an acceptance probability, in Metropolis-Hastings (Hastings 1970) style. After a fixed number N of iterations, the algorithm stops and returns the probability of all the generated instances computed by Equation 1, so that Equation 2 can be applied to determine the marginal probability of the facts. Higher values of N result in deeper chase networks, and thus more precise marginal probabilities. If N is high enough to compute the full chase, the precision is maximum, as the MCMC-chase degenerates into exact marginal inference over the chase network.
Algorithm 1 gives the pseudocode for the MCMC-chase. It takes as input a PKG $\mathcal{G}$ and returns samples from the distribution $P(W) = \frac{1}{Z}~\mathrm{exp}~w(W)$ over the nodes $\mathbf{W}$ of the chase network $\Gamma(\mathcal{G})$. The algorithm performs N iterations, each consisting of S steps, with S extracted from a Poisson (jump) distribution (line 5). In each step, forward or backward depending on a value $\delta$ uniformly chosen, the algorithm selects subsets $\mathbf{R_a}$ and $\mathbf{R_u}$ of applicable or undoable rules from $\Sigma$, each with a probability proportional to $w(\rho)$ (lines 10–11). This is obtained by uniformly choosing $\mu$ in the (0,1) interval and checking whether $\mu<1-e^{-w(\rho)}$, whose likelihood of being satisfied grows with the weight of $\rho$, as the amount $e^{-w(\rho)}$ decreases. For simplicity, we only consider positive weights in the pseudocode, as the extension to negative ones is straightforward (i.e. $\mu > 1-e^{w(\rho)}$). As defined in detail in Section 3, a rule $\rho = \varphi(\bar x,\bar y) \rightarrow \exists \bar z\, \psi(\bar x,\bar z)$ is applicable if for some unifier $\theta_\rho$, a warded chase step can produce new facts not in $\Sigma(D)$ via $\rho$. Vice versa, it is undoable if (i) there is a fact $\psi(\bar x \theta_\rho',\bar z\theta_\rho') \in \Sigma(D)$ generated by $\rho$ with some body unifier $\theta_\rho$ extending to $\theta_\rho^\prime$ with $\bar x \theta_\rho = \bar x \theta'_{\rho}$ and $z_i \theta'_{\rho}$, for each $z_i \in \bar z$, a fresh labeled null of $\Sigma(D)$, and (ii) there are no facts $\psi^\prime(\bar x \delta_{\rho^\prime}',\bar z\delta_{\rho^\prime}') \in \Sigma(D)$ generated by some rule $\rho^\prime = \psi(\bar x,\bar y) \rightarrow \exists \bar z\, \psi^\prime(\bar x,\bar z)$ with body unifier $\delta_{\rho^\prime}$ extending to $\delta^\prime_{\rho^\prime}$ with $\bar x \delta_{\rho^\prime} = \bar x \delta'_{\rho^\prime}$ and $z_i \delta'_{\rho^\prime}$, for each $z_i \in \bar z$, a labeled null of $\Sigma(D)$. Intuitively, a rule is undoable if the facts it generated have not been used by any other rule to produce new facts in $\Sigma(D)$, and thus they are leaves of the chase network.
Forward transition steps (line 12) try to apply a transition step with the selected applicable rules $\mathbf{R_a}$ to the current node $\mathcal{T}$ of the chase network. Backward transition steps (line 13) try to undo a transition step with the selected undoable rules in $\mathbf{R_u}$ . Algorithm 2 gives the pseudocode for both. In the forward case, for each selected soft rule $\rho$ , a unifier $\theta$ is uniformly chosen from the existing ones, and a warded chase step is applied; this results in an updated instance $\mathcal{T}$ (lines 2–3). Instance $\mathcal{T}$ is then updated with its closure with respect to the hard rules in $\mathbf{R}$ (line 4), and its weight is incremented by the weights of all the applied soft rules. In the backward case, first the facts generated by the hard rules of $\mathbf{R}$ are removed, with a process that is intuitively a backward closure; then, the effects of all the soft rules of $\mathbf{R}$ are canceled, in the sense that the facts they generated are removed from $\mathcal{T}$ and their weights subtracted accordingly. Note that hard rules never affect the total weight. After S steps, an acceptance function $f(\mathbf{Y}) = \mathrm{exp}~w(\mathbf{Y})$ evaluates the acceptability of the current node (lines 14–15 of Algorithm 1) in a Metropolis-Hastings style. Finally, all accepted nodes and their weights are returned.
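Since $f(\mathbf{Y}) = \mathrm{exp}~w(\mathbf{Y})$ , the acceptance test reduces to a standard Metropolis-Hastings ratio on the node weights. The following is a minimal sketch under that assumption, with the chase machinery abstracted away and all names hypothetical.

```python
import math
import random

def accept(w_candidate: float, w_current: float) -> bool:
    """Metropolis-Hastings test: accept with probability min(1, f(Y')/f(Y)),
    where f(Y) = exp(w(Y)), so the ratio is exp(w(Y') - w(Y))."""
    alpha = math.exp(w_candidate - w_current)
    return random.random() < min(1.0, alpha)

random.seed(0)
# A heavier candidate node is always accepted; a lighter one only sometimes.
print(accept(3.2, 2.5), accept(1.0, 2.5))
```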
Observe that the stochastic process underlying the MCMC-chase is a Markov process or, equivalently, that it satisfies the Markov property. In fact, the MCMC-chase is memoryless, in the sense that a future process state depends only on the present one: a candidate node inherits all its facts from exactly one previously generated node, and facts are added to or removed from it by the applicable (resp. undoable) rules. The Markov process associated with the MCMC-chase also has favorable properties, namely detailed balance and ergodicity. A Markov process has detailed balance if its transition probabilities respect the following law: the flow $P\left(\mathcal{D}^i\right)P\left(\mathcal{D}^i \rightarrow \mathcal{D}^{i+1}\right)$ between each pair of states $\mathcal{D}^i$ and $\mathcal{D}^{i+1}$ is equal to the reverse flow $P\left(\mathcal{D}^{i+1}\right)P\left(\mathcal{D}^{i+1} \rightarrow \mathcal{D}^{i}\right)$ , where $P(\mathcal{D}^{i})$ and $P\left(\mathcal{D}^{i+1}\right)$ are the equilibrium probabilities of being in states $\mathcal{D}^i$ and $\mathcal{D}^{i+1}$ , respectively (Stuart and Ord Reference Stuart and Ord1991). Intuitively, detailed balance guarantees that the probability of flowing from one node $W_i$ to a connected node $W_j$ of the chase network by applying a forward transition step equals the probability of flowing from $W_j$ back to $W_i$ by applying a backward transition step. Ergodicity ensures the absence of blocking “trap states,” so that all the nodes of the chase network are eventually visited.
Proposition 2 The Markov chain generated by the MCMC-chase satisfies detailed balance and ergodicity.
Proof. First we prove detailed balance. Let $\mathcal{D}$ be a state at an iteration of the for-loop in lines 8–17 of Algorithm 1, and let $\mathcal{D}'$ be a state resulting from either applying or undoing rules R in line 12 or 13. With $P\left(\mathcal{D}^i\right)$ being the probability of state $\mathcal{D}^i$ , we denote by $P'(\mathcal{D} \rightarrow \mathcal{D}')$ the probability of selecting the rules R to be applied or undone to go from $\mathcal{D}$ to $\mathcal{D}'$ ; by construction, it is equal to $P'(\mathcal{D}' \rightarrow \mathcal{D})$ . The next sample state $\mathcal{D}^{i+1}$ is reachable from the current state $\mathcal{D}^i$ by S steps of applying or undoing warded chase steps. The transition probability $P\left(\mathcal{D}^i \rightarrow \mathcal{D}^{i+1}\right)$ can be written as $\min(1, \alpha) \cdot L\left(\mathcal{D}^i \rightarrow \mathcal{D}^{i+1}\right),$ where $\alpha = \frac{ f\left(\mathcal{D}^{i+1}\right)}{ f\left(\mathcal{D}^i\right)}$ and $L\left(\mathcal{D}^i \rightarrow \mathcal{D}^{i+1}\right)$ denotes $\sum_{\mathcal{D}^i = D_1, \ldots, D_S = \mathcal{D}^{i+1}} \prod^{S-1}_{j = 1} P'\left(D_j \rightarrow D_{j+1}\right).$
Here the sum is over all possible paths in the chase network from $\mathcal{D}^i $ to $\mathcal{D}^{i+1}$ of length S. Note that $L\left(\mathcal{D}^i \rightarrow \mathcal{D}^{i+1}\right)= L\left(\mathcal{D}^{i+1} \rightarrow \mathcal{D}^{i}\right)$ since paths between the states are undirected. Then, recalling that $P(\mathcal{D}) = \frac{1}{Z}f(\mathcal{D})$ with $f(\mathcal{D}) = \mathrm{exp}~w(\mathcal{D})$ ,
$$P\left(\mathcal{D}^i\right)P\left(\mathcal{D}^i \rightarrow \mathcal{D}^{i+1}\right) = \frac{f\left(\mathcal{D}^i\right)}{Z}\min\left(1,\frac{f\left(\mathcal{D}^{i+1}\right)}{f\left(\mathcal{D}^i\right)}\right)L\left(\mathcal{D}^i \rightarrow \mathcal{D}^{i+1}\right) = \frac{1}{Z}\min\left(f\left(\mathcal{D}^i\right),f\left(\mathcal{D}^{i+1}\right)\right)L\left(\mathcal{D}^i \rightarrow \mathcal{D}^{i+1}\right),$$
and, since the last expression is symmetric in $\mathcal{D}^i$ and $\mathcal{D}^{i+1}$ , it equals $P\left(\mathcal{D}^{i+1}\right)P\left(\mathcal{D}^{i+1} \rightarrow \mathcal{D}^{i}\right)$ , which proves detailed balance.
Let us show ergodicity. For this, it is enough to show that it is possible to reach any state from any other state with non-zero probability. For any two possible nodes $\mathcal{D}$ and $\mathcal{D}'$ , there is a path in the chase network; this path represents a sequence of applications or undoings of warded chase steps. Let S’ be the length of such a path. There is a non-zero probability that the number S’ is sampled in line 5, that exactly the same chase applications or undoings are performed in lines 12 and 13, and that the resulting state is finally accepted in line 16. Therefore, the MCMC-chase satisfies ergodicity.
6 Application use cases
Record linkage and data fusion are two relevant faces of information integration, both concerned with heterogeneity at the instance level. Probabilistic knowledge graphs offer a well-founded and integrated framework for such problems. In this section, we discuss two relevant use cases.
6.1 Record linkage
Record linkage consists in deciding which records of a database refer to the same real-world entities. It plays a crucial role in standard information integration (Ullman Reference Ullman1997), data mining (McCallum et al. Reference McCallum, Tejada and Quass2003), and numerous industrial applications (Christen Reference Christen2012). Beyond the seminal statistical approaches (Fellegi and Sunter Reference Fellegi and Sunter1969; Agresti and Kateri Reference Agresti and Kateri2011), more modern techniques (Singla and Domingos Reference Singla and Domingos2005; Culotta and McCallum Reference Culotta and McCallum2005) combine the decision about the matching of one specific pair of entities with decisions about other pairs, even with transitive closure (McCallum and Wellner Reference McCallum and Wellner2004). MLN frameworks for record linkage effectively generalize the mentioned techniques (Singla and Domingos Reference Singla and Domingos2006), but inherit the semantic limitations discussed in Section 2. We show how our framework can handle this domain.
Example 6.1 Consider a Knowledge Graph G with facts describing a network of companies to be matched. It has the following predicates: $\textsf{Company}(\mathrm{company})$ ; $\textsf{Industry}(\mathrm{company},\mathrm{industry})$ ; $\textsf{Size}(\mathrm{company},\mathrm{size})$ , where the size is in terms of the known number of employees (e.g. 1–10, 11–50 employees); $\textsf{Group}(\mathrm{company},\mathrm{group})$ , denoting the participation of a company in a group; $\textsf{Subsidiary}(\mathrm{company},\mathrm{company})$ , if the first company is a subsidiary of the second; $\textsf{SameSize}(\mathrm{size},\mathrm{size})$ , representing approximate equivalence of company dimensions. Finally, $\textsf{Match}(\mathrm{company},\mathrm{company})$ witnesses that two companies are the same real-world entity. We extend G with the following rules:
Rules (1) and (2) increase the matching probability on the basis of common industry and comparable size, two features that we verified to be selective in this respect: especially in small markets, companies of the same size active in the same economic area tend to form small clusters. Rule (3) defines when the same size can be assumed for x and y on the basis of an absolute maximum deviation $\epsilon$ . Hard Rules (4) and (5) make the condition of “being part of a group” transitive; in particular, assuming that every company belongs to a group (the singleton one as a limit case), whenever x is a subsidiary of y, then x inherits the groups of y. Finally, Rule (6) establishes the probability that two companies are of the same size whenever they are part of the same group and operate in the same industry, based on a large body of supporting data.
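To fix the intuition, the program just described could be rendered as weighted rule templates. The Python sketch below is only a plausible encoding: the predicate names follow the example, while the weights, the rule strings, and the threshold EPSILON are illustrative assumptions, not the actual program.

```python
# Illustrative encoding of the record-linkage program of Example 6.1.
# All weights are hypothetical; in practice they would be set by experts
# or learned from labeled matching decisions.
EPSILON = 25  # assumed absolute maximum size deviation for Rule (3)

soft_rules = [
    (0.7, "Industry(x,i), Industry(y,i) -> Match(x,y)"),         # Rule (1)
    (0.6, "Size(x,s), Size(y,t), SameSize(s,t) -> Match(x,y)"),  # Rule (2)
    (0.4, "Group(x,g), Group(y,g), Industry(x,i), Industry(y,i),"
          " Size(x,s), Size(y,t) -> SameSize(s,t)"),             # Rule (6)
]

hard_rules = [
    "Company(x) -> exists g: Group(x,g)",         # Rule (4): every company is in a group
    "Subsidiary(x,y), Group(y,g) -> Group(x,g)",  # Rule (5): subsidiaries inherit groups
]

def same_size(s: int, t: int) -> bool:
    """Rule (3): two sizes are equivalent up to an absolute deviation EPSILON."""
    return abs(s - t) <= EPSILON
```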
6.2 Data fusion
Data fusion addresses the challenge of merging the facts about the same real-world entity into one single fact (Bleiholder and Naumann Reference Bleiholder and Naumann2008). To achieve this goal, data fusion is concerned with solving attribute-level conflicts that can originate from disagreeing or poor-quality sources and from schema-level heterogeneity. Most of the techniques that have been proposed (Yin et al. Reference Yin, Han and Yu2008; Berti-Équille et al. Reference Berti-Équille2009; Dong et al. Reference Dong, Berti-Équille and Srivastava2015) adopt a “truth discovery” approach and perform metadata- and instance-based conflict resolution. In this section, we show an example where probabilistic knowledge graphs are effectively used to model a data fusion setting where multiple and mutually dependent sources need to be harmonized. The use of PKGs generalizes early SRL approaches to data fusion, for example, with Bayesian networks (Laurenza Reference Laurenza2015).
Consider the data fusion use case shown in Figure 2 about six financial data providers (A–F), providing three indicators (income, exposure, and capitalization) about companies.
Example 6.2 For each company, data providers express a value, encoded by a fact $\textsf{Vote}(\mathrm{source},\mathrm{company},\mathrm{feature},\mathrm{value})$ , in our PKG. Providers s have different levels $\alpha^s_f$ of accuracy for each feature f or, in other terms, $1-\alpha^s_f$ is the error rate of s for feature f. This value is given a priori, on the basis of the experts’ knowledge and trust in the source for specific data. In our PKG, we express the accuracy of each source with a ground rule of the form $\alpha :: \textsf{Accuracy}(\mathrm{source},\mathrm{feature})$ . Moreover, sources are not independent, but copy from one another with a given probability. This is expressed in the dependency graph in Figure 2, where edges stand for the “copied by” relationship and are labeled with the feature and the respective copy likelihood $\lambda$ . The copy relationships are modeled with ground rules of the form $\lambda :: \textsf{Copies}(\mathrm{source},\mathrm{target},\mathrm{feature})$ . Then, the following hard rules allow decisions to be taken on conflicting values.
According to Rule (2), a vote expressed by a source s for a value v of the feature f of company c will turn into a fact for Value with a probability that is positively affected by the accuracy of s on f and negatively affected by the fact that s copies such a value from some other source. Rule (1) accounts for the features f for which s is a copier. The negation used here has stratified semantics, as we have seen in Section 3. Rule (3) uses recursion to model the propagation of copies in the influence graph, and so affects the marginal probability of facts for Value accordingly. Overall, the marginal probability of each single Value fact represents the likelihood of choosing v as the value to solve conflicts on f. More precisely, for each pair $\langle c,f \rangle$ , the value v corresponding to the fact with the maximum likelihood will be the solution to the conflict posed by the data fusion setting.
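The final decision step is then a simple argmax over marginals. The sketch below assumes the marginal probabilities of the Value facts have already been computed by the MCMC-chase; the dictionary and its contents are hypothetical.

```python
# Hypothetical marginals P(Value(c, f, v)) computed by the MCMC-chase,
# keyed by (company, feature) and mapping candidate values to probabilities.
marginals = {
    ("acme", "income"): {100: 0.81, 120: 0.19},
    ("acme", "exposure"): {40: 0.55, 55: 0.45},
}

def fuse(marginals: dict) -> dict:
    """For each (company, feature), pick the value with maximum marginal probability."""
    return {(c, f): max(votes, key=votes.get)
            for (c, f), votes in marginals.items()}

print(fuse(marginals))  # {('acme', 'income'): 100, ('acme', 'exposure'): 40}
```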
7 Implementation and experiments
We implemented the soft vadalog framework for probabilistic reasoning on KGs as an extension of the vadalog system. Coherently with the overall architecture of the system, for the MCMC-chase we adopted a pipeline approach. A set of soft vadalog rules is compiled into an execution pipeline, where each pipeline node implements a rule, and nodes are connected if there is a dependency between rules. Data flow along the pipeline, from the source nodes, the EDBs, toward the target node, which corresponds to the Ans atom of the reasoning task. The execution pipeline undergoes an optimization phase, where common query answering heuristics are applied. At runtime, we adopt a pull-based approach: the target node actively polls its predecessors in the pipeline for new facts, and the polling recursively reaches the source nodes, feeding the pipeline with EDB facts. The isomorphism check applied in the warded chase steps is implemented in the form of a termination strategy, a component that filters the facts originating from the pipeline nodes and guarantees termination of the process.
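As an illustration of the pull-based evaluation strategy, here is a minimal sketch of a pipeline node, under the simplifying assumption that each node wraps its rule as a fact transformer; the actual system is considerably more sophisticated (optimizations, termination strategy, streaming execution).

```python
from typing import Callable, Iterator

class PipelineNode:
    """A pull-based pipeline node: facts are produced only on demand."""

    def __init__(self, transform: Callable[[tuple], Iterator[tuple]],
                 predecessors: list | None = None,
                 source_facts: list | None = None):
        self.transform = transform               # the rule this node implements
        self.predecessors = predecessors or []
        self.source_facts = source_facts or []   # non-empty only for EDB nodes

    def poll(self) -> Iterator[tuple]:
        # Source (EDB) nodes feed the pipeline directly.
        for fact in self.source_facts:
            yield from self.transform(fact)
        # Intermediate nodes recursively pull from their predecessors.
        for pred in self.predecessors:
            for fact in pred.poll():
                yield from self.transform(fact)

# Usage: an EDB node polled by a node that keeps only ownerships above 50%.
edb = PipelineNode(lambda f: iter([f]),
                   source_facts=[("Own", "a", "b", 0.6), ("Own", "a", "c", 0.2)])
target = PipelineNode(lambda f: iter([f]) if f[3] > 0.5 else iter([]),
                      predecessors=[edb])
print(list(target.poll()))  # [('Own', 'a', 'b', 0.6)]
```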
The MCMC-chase is implemented with two main techniques: a routing strategy and an edge filter. The routing strategy decides the number and order of activation of the pipeline nodes (lines 9 and 12 of Algorithm 1). The edge filters sample the facts moving from one node to the other, choosing those to be actually propagated according to the weight of the rules they represent (and so implementing lines 10–11). A probability manager handles the partitioning of facts into nodes of the chase network and takes decisions on their acceptability (line 13); it handles a set of supplementary buffer caches, allowing the effects of rules that need to be rolled back to be reverted. A shared in-memory structure, the chase graph, allows applicable and undoable rules to be efficiently identified by tracing the provenance of each fact in terms of the applied rule and unifier.
7.1 Experimental settings
We tested our approach using a total of 16 real-world and synthetic KGs. For the real-world KGs, we used 4 graphs of increasing density derived from the real graph of European Financial Companies, by considering increasingly broader notions of company ownership. The details of the topology are shown in Figure 3 (upper part). For the synthetic KGs, we built 12 scale-free graphs, that is, graphs where the fraction P(k) of nodes with degree k goes as $k^{-\gamma}$ , with the parameter $\gamma$ such that $2<\gamma<3$ (Bollobás et al. Reference Bollobás, Borgs, Chayes and Riordan2003; Hidalgo and Barabási Reference Hidalgo and Barabási2008). Scale-free networks can be shaped via four parameters: n, the expected number of nodes; $\alpha$ , the probability of adding a new node connected to an existing one; $\beta$ and $\gamma$ , the probabilities of adding an edge between two existing nodes, randomly chosen from the in-degree or the out-degree distribution, respectively. We used three graph topologies, denser than the real-world ones (namely, BASE, DENSE, and SUPER-DENSE), shown in Figure 3 (lower part), and we varied the number of nodes from 100 to 1K.
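Such graphs can be generated, for instance, with the implementation of the Bollobás et al. model available in networkx; the parameter values below are merely illustrative, not the ones used in our experiments.

```python
import networkx as nx

# Directed scale-free graph (Bollobás et al. 2003); alpha + beta + gamma = 1.
# alpha: add a new node with an edge to an existing node (chosen by in-degree);
# beta:  add an edge between two existing nodes;
# gamma: add a new node with an edge from an existing node (chosen by out-degree).
g = nx.scale_free_graph(n=1000, alpha=0.41, beta=0.54, gamma=0.05, seed=42)

in_degrees = [d for _, d in g.in_degree()]
print(g.number_of_nodes(), g.number_of_edges(), max(in_degrees))
```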
The soft vadalog rules. Our KGs are augmented with a soft vadalog program describing the domain. A company x controls a company y if either of the following holds: (i) x owns more than 50% of the shares of y; (ii) x controls a set of companies that jointly, and possibly together with x’s direct possession, own more than 50% of y. Uncertainty can derive from three causes: relations present in the EDB that do not exist in reality; invalid shares in direct ownerships (greater than one or less than zero); invalid shares in indirect ownerships (greater than one).
Example 7.1 We model the three sources of uncertainty and assign specific weights to rules, normalizing them between zero and one as follows.
Rule (1) extracts valid relationships; the high weight witnesses a 10% error rate in the original data source. Rule (2) extracts invalid relationships, which, with some low probability, correspond to actually existing shares, whose amount is replaced with labeled nulls; they are marked as “unreliable.” Rules (3) and (4) handle direct control in the unreliable and reliable case, respectively. Unreliable ownerships produce control with 50% probability, as the share amount is actually unknown. Rule (5) extends control with a reliable ownership, generating a reliable control. Rule (6) extends control in recursive cases with unreliable ownerships. Since by Rule (7) every company controls itself, Rules (5) and (6) also consider direct possession. In Rule (5), the sum function denotes an aggregate summation operator that accumulates the values s for Own.
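Setting weights and uncertainty aside, the underlying control semantics amounts to a fixpoint computation. The following sketch illustrates it on a handful of hypothetical ownership shares; it is a deterministic toy version of the recursion described above, not the soft vadalog program itself.

```python
# Hypothetical direct ownership shares: own[(x, y)] = fraction of y owned by x.
own = {("a", "b"): 0.6, ("b", "c"): 0.4, ("a", "c"): 0.2}
companies = {c for edge in own for c in edge}

def controls(own: dict, companies: set) -> set:
    """x controls y if x, jointly with the companies it controls, owns > 50% of y."""
    control = {(x, x) for x in companies}  # Rule (7): every company controls itself
    changed = True
    while changed:
        changed = False
        for x in companies:
            for y in companies:
                if (x, y) in control:
                    continue
                # Accumulate the shares of y held by x or by companies x controls.
                total = sum(s for (z, t), s in own.items()
                            if t == y and (x, z) in control)
                if total > 0.5:
                    control.add((x, y))
                    changed = True
    return control

# a controls b directly (0.6) and c jointly via b (0.2 + 0.4 > 0.5).
print(sorted(controls(own, companies)))
```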
Settings and metrics. For each KG, we considered the reasoning task consisting in querying the Control relation, that is, enumerating all the pairs of controller-controlled companies.
Full grounding and exact inference: We calculated the full grounding of the chase network with the vadalog system and exact marginal inference by exhaustively exploring all the nodes.
MCMC-chase: We compared the MCMC-chase settings with exact inference on (i) execution time: the time (averaged over 5 executions) needed to run a predefined number of iterations in the MCMC-chase vs. the time needed for exact inference. A three-hour timeout was considered, and executions exceeding it were aborted; (ii) error rate: the percentage difference (averaged over 5 executions and all the facts) between the marginal probability of the facts generated by the MCMC-chase and the marginal probability of the facts generated by exact inference; (iii) acceptance rate: the fraction of explored possible worlds that are accepted throughout the sampling. For the MCMC-chase, we applied an increasing number of iterations, proportional to the number of facts in the EDB of each KG, so that $N_i$ denotes as many iterations as i times the number of facts. This empirical choice proved effective in highlighting the entire error-rate spectrum.
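For concreteness, the two quality metrics can be computed as follows; here mcmc and exact are assumed to map facts to their marginal probabilities, while accepted and proposed are counters maintained by the sampler (all names hypothetical).

```python
def error_rate(mcmc: dict, exact: dict) -> float:
    """Mean percentage difference between MCMC and exact marginal probabilities."""
    diffs = [abs(mcmc[fact] - exact[fact]) / exact[fact] * 100 for fact in exact]
    return sum(diffs) / len(diffs)

def acceptance_rate(accepted: int, proposed: int) -> float:
    """Fraction of explored possible worlds accepted during sampling."""
    return accepted / proposed

# Two hypothetical facts: differences of 4% and ~2.3%, mean ~3.1%.
print(error_rate({"f1": 0.48, "f2": 0.90}, {"f1": 0.50, "f2": 0.88}))
print(acceptance_rate(834, 1000))
```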
Results. In the real-world settings, in Figure 4(a-c), exact inference did not exceed the timeout only for SCALE-1 and SCALE-2, which completed in $02\mathrm{:}09\mathrm{:}53$ and $02\mathrm{:}30\mathrm{:}32$ , respectively. This confirms that exact inference is not affordable in most real cases. The MCMC-chase with $N_{100}$ outperformed exact inference, completing in $\sim$ 21 mins with an error rate below 2%. Clearly, the MCMC approach keeps the elapsed time dependent only on the number of iterations. In fact, for $N_{100}$ , the elapsed time is stable, between 18 and 21 mins, also for SCALE-2, SCALE-3, and SCALE-4. The same behavior can be observed for the other configurations $N_i$ . As topologies get denser, we need more iterations: the complete graph is the densest and therefore the hardest topology to sample. This can be observed with SCALE-2, where for $N_{100}$ the error rate rises to 10%. Configuration $N_{10}$ takes less than 9 mins, with 12% and 23% error rates for SCALE-1 and SCALE-2, respectively. Observe that the MCMC-chase also completes in very short time for SCALE-3 and SCALE-4, though there the error rate cannot be calculated, because we could not compute the exact inference baseline. The observed average acceptance rate is $83.43\%$ , hence fully satisfactory. Results in Figure 4(a,c) have been obtained by fixing $\lambda=5$ as the parameter of the jump distribution. Interestingly, for executions with a high number of iterations, higher values of $\lambda$ tend to produce higher elapsed times and a smaller acceptance rate, while the error rate is stable. For example, we observed a $78.22\%$ acceptance rate for $\lambda=7$ in SCALE-1 for $N_{100}$ , with more than 30 mins elapsed. Smaller values of $\lambda$ reduce the elapsed time and do not improve the acceptance rate.
The synthetic settings, in Figure 4(b,d), are more time-intensive than the real cases, due to the higher density of the graphs. In particular, exact inference exceeded the timeout for DENSE with $n=1K$ and for SUPER-DENSE with $n=500$ and $n=1K$ . For KGs with small n, independently of the topology, the MCMC-chase is extremely performant and accurate; for example, for SUPER-DENSE with $n=250$ it terminates in $\sim 3$ mins (vs $02\mathrm{:}35\mathrm{:}20$ ) for $N_{100}$ with an error rate $<3\%$ . Also in these settings, less dense topologies require fewer iterations: for example, $N_{100}$ for DENSE with $n=250$ achieves a $<2\%$ error rate, and for BASE even 1%. The acceptance rate was satisfactory (81% on average), and we observed variations similar to the real-world cases when adjusting $\lambda$ .
8 Conclusion
A probabilistic extension of Warded Datalog $^\pm$ enables effective uncertain reasoning on Knowledge Graphs, with the possibility of encoding complex domains of interest. The probabilistic toolbox available in the literature offers insufficient support, being unable to deal with existential quantification and recursion together.
In this paper, we considered the reasoning desiderata for KGs and introduced the syntax and semantics of soft vadalog. Within a new probabilistic reasoning framework, soft vadalog induces a probability distribution over the facts defined through the warded chase, a finite logical derivation procedure under which query answering is decidable and tractable. We introduced the notion of Probabilistic Knowledge Graphs as a template for chase networks, a new probabilistic graphical model where marginal inference can be performed. To cope with the intractability of marginal inference, we introduced the MCMC-chase, whose core idea is performing logical and probabilistic inference at the same time, while sampling the chase space with a Monte Carlo technique.
It is our intention to continue evolving the theoretical perspectives as well as their implementation in the system. The probabilistic semantics we have introduced so far is coupled to the adopted chase, in that multiple isomorphic copies are not considered in warded chase steps. Possible chase variants, such as other more tolerant forms of restricted and terminating chase (e.g. the Skolem chase), would potentially introduce multiple copies of isomorphic facts, arbitrarily increasing their marginal probability. In this respect, the warded semantics is the most compact and least redundant one for probabilistic reasoning and is, at the same time, sufficient for query answering, as we have seen. Along these lines, we plan to evolve the approach to adopt our isomorphism-based semantics while being chase-independent: independently of the applied chase variant, the probabilities should be computed on the isomorphism quotient set.
Competing interests
The author(s) declare none.
Acknowledgements
The work on this paper was partially supported by the Vienna Science and Technology Fund (WWTF) grant VRG18-013.