# Plan, Activity, and Intent Recognition: Theory and Practice, FIRST EDITION (2014)

### Part I. Plan and Goal Recognition

### Chapter 2. Weighted Abduction for Discourse Processing Based on Integer Linear Programming

Naoya Inoue^{a}, Ekaterina Ovchinnikova^{b}, Kentaro Inui^{a} and Jerry Hobbs^{b}, ^{a}Tohoku University, Sendai, Japan, ^{b}USC/ISI, Marina del Rey, CA, USA

**Abstract**

This chapter explores the logical framework called weighted abduction as applied to solving discourse-processing tasks. Weighted abduction incorporates a cost propagation mechanism allowing us to estimate the likelihood of the obtained abductive proofs. We use a tractable implementation of weighted abduction based on Integer Linear Programming and a large knowledge base generated automatically. We first perform an experiment on plan recognition using the dataset originally developed for Ng and Mooney’s system *[39]*. Then we apply our discourse processing pipeline for predicting whether one text fragment logically entails another one (Recognizing Textual Entailment task). The study we describe is the first attempt to apply tractable inference-based natural language processing on a large scale.

**Keywords**

**Integer Linear Programming**

**weighted abduction**

**natural language understanding**

**planning**

**Recognizing Textual Entailment**

**Acknowledgments**

This work was supported by Grant-in-Aid for JSPS Fellows (22-9719). We also would like to thank Johan Bos for making the *Boxer* parser available for our experiments.

**2.1 Introduction**

Inference-based discourse processing was a thriving area of research during the 1970s and 1980s. But in spite of successful theoretical progress and small-scale systems, work on large-scale, “real-life” systems foundered on two main difficulties: (1) there was no sufficiently large knowledge base of the right sort for language processing and (2) reasoning procedures were not efficient enough.

In the last few years the situation has changed with respect to the available machine-readable knowledge and efficient reasoning. A number of resources have been developed that encode the kinds of knowledge needed for language processing [*22*,*51*,*15*,*43*]. In addition, progress has been made recently in efficient reasoning techniques [*8*,*29*]. Thus, it seems to be time to look again at the possibilities of inference-based discourse processing.

This chapter explores discourse processing based on a mode of inference called *abduction*, or inference to the best explanation. Abduction-based discourse processing was studied intensively during the 1980s and 1990s [*11*,*28*]. This framework is appealing because it is a realization of the observation that we understand new material by linking it with what we already know. It instantiates in natural language understanding the more general principle that we understand our environment by coming up with the best explanation for the observables in it. Hobbs et al. *[28]* show that the lowest-cost abductive proof provides the solution to a whole range of natural language pragmatics problems such as word sense disambiguation, anaphora and metonymy resolution, interpretation of noun compounds and prepositional phrases, and detection of discourse relations.

Previous experiments on applying abduction to large-scale discourse processing showed that existing implementations of abductive inference quickly become intractable given increased background knowledge [*4*,*41*]. We have developed a new efficient abductive inference framework based on the Integer Linear Programming (ILP) technique *[29]*. Our reasoning system converts a problem of abduction into an ILP problem, and it solves the problem by using efficient existing techniques developed by the ILP research community.

In this chapter, we describe experiments on applying ILP-based weighted abduction to two discourse-processing tasks: plan recognition and textual entailment recognition. The present study is the first attempt to apply abduction-based discourse processing in a realistic setting: (1) with a large knowledge base, (2) using a scalable inference engine, and (3) evaluated on large real-life discourse-processing problems. None of the previous studies on applying weighted abduction to plan recognition [*46*,*54*] or natural language understanding [*4*,*41*] have been done in a similar setting.

This chapter is organized as follows. Related work is described in Section *2.2*. In Section *2.3*, we present the weighted abduction framework. We describe the ILP-based implementation of weighted abduction in Section *2.4*. Section *2.5* presents an experiment on plan recognition using the dataset originally developed for Ng and Mooney’s system *[39]*. In Section *2.6*, we introduce our discourse-processing pipeline based on the *Boxer* semantic parser *[5]* for converting text into a logical form and a large knowledge base of commonsense knowledge using two lexical semantic resources, WordNet *[22]* and FrameNet *[51]*. Section *2.7* presents an evaluation of our discourse-processing pipeline on a knowledge-intensive NLP task—recognizing textual entailment. Section *2.8* concludes the chapter.

**2.2 Related Work**

Previous work on abductive inference can be grouped into two categories in terms of the hypothesis evaluation measure: cost-based approaches and probabilistic approaches. In cost-based approaches [*28*,*13*], the system finds a hypothesis that has a minimum cost among other competing hypotheses, and identifies it as the best hypothesis. Weighted abduction *[28]*, which we adopted, belongs to this group. Attempts to define probabilistic semantics to the cost-based approaches are discussed in Charniak and Shimony and Blythe et al. [*12*,*4*]. In probabilistic approaches [*11*,*44*,*36*,*46*,*4*], etc., the system identifies the highest probability hypothesis as the best hypothesis.

Regarding the inference engine, a number of methods for abduction have been proposed in the literature [*52*,*33*,*45*,*1*,*14*,*25*]. However, most of them focus on propositional logic-based abduction. One must thus transform knowledge bases written in first-order predicate logic to the propositional level (perform *grounding*) in order to employ these methods. Typically, grounding-based approaches generate a huge search space, and do not scale to larger problems as discussed in Section *2.4*.

The closest approaches to the one presented in this chapter are the implementations of weighted abduction described in Mulkar et al. *[38]* and Blythe et al. *[4]*. However, given increasing sizes of observations and knowledge bases, these implementations quickly suffer from exponentially growing computational cost [*4*,*41*,*40*]. To avoid this problem, we adopted an ILP-based solution *[29]* inspired by Santos *[52]*, who formalized cost-based abduction as a linear constraint satisfaction problem and efficiently obtained the best hypothesis by solving it with linear programming techniques.

A series of studies in machine reading *[21]* are aimed at recovering implicit information from texts. Although knowledge bases are acquired automatically, most of the machine reading applications to discourse are specifically designed for solving one particular discourse phenomenon; see, for example, Penas and Hovy *[43]*. In contrast, our framework is quite general and not tuned to a specific task. It can be modularized as a discourse-processing module for any NLP task. Different applications of weighted abduction to different discourse understanding problems are described, for example, in Ovchinnikova et al. and Ovchinnikova [*41*,*40*].

**2.3 Weighted Abduction**

Abduction is inference to the best explanation. Formally, logical abduction is defined as follows:

**Given:** Background knowledge $B$ and observations $O$, where both $B$ and $O$ are sets of first-order logical formulas.

**Find:** A hypothesis $H$ such that $H \cup B \models O$ and $H \cup B$ is consistent, where $H$ is a set of first-order logical formulas.

Typically, there exist several hypotheses $H$ explaining $O$. Each of them is called a *candidate hypothesis*. To rank candidate hypotheses according to plausibility, we use the framework of weighted abduction as defined by Hobbs et al. *[28]*.

In this framework, observation $O$ is a conjunction of propositions existentially quantified with the widest possible scope. Each proposition has a positive real-valued cost. We use the notation $p^{\$c}$ to indicate that proposition $p$ has cost $c$, and $\mathrm{cost}(P)$ to represent the cost of $P$. Observation costs imply the unequal treatment of atomic observations, reflecting the demand for propositions to be proved. Those propositions that are the most likely to be proved are expensive to assume.

The background knowledge $B$ is a set of first-order logic formulas of the form

$$P_1^{w_1} \wedge \cdots \wedge P_n^{w_n} \rightarrow Q$$

All variables occurring in the antecedents of such axioms are universally quantified with the widest possible scope. Variables occurring in the consequent only are existentially quantified within the scope of the universal quantifiers. Propositions in the antecedents are assigned positive real-valued *weights*. We use the notation $P^{w}$ to indicate that proposition $P$ has weight $w$.

The two main inference operations in weighted abduction are backward chaining and unification. *Backward chaining* is the introduction of new assumptions given an observation and background knowledge. For example, given $O = q(x)^{\$10}$ and $B = \{p(x)^{1.2} \rightarrow q(x)\}$, there are two candidate hypotheses: $H_1 = q(x)^{\$10}$ and $H_2 = p(x)^{\$12}$. We say that $p$ in $H_2$ is *assumed* and it *explains* $q$. A literal is assumed in a candidate hypothesis if it is included in that candidate hypothesis and not explained.

In weighted abduction, a *cost function* $f$ is used to calculate assumption costs. The function takes two arguments: the cost $c$ of the propositions backchained on and the weight $w$ of the assumption. Usually, a multiplication function is used, i.e., $f(c, w) = c \cdot w$. For example, if $q$ costs \$10 and the weight of $p$ is 1.2 in the preceding example, then assuming $p$ in $H_2$ costs \$12.

*Unification* is the merging of propositions with the same predicate name by assuming that their arguments are the same and assigning the smallest cost to the result of the unification. For example, if $O = p(x)^{\$10} \wedge p(y)^{\$20}$, then there is a candidate hypothesis $H = p(x)^{\$10} \wedge x = y$. Note that the smallest cost is assigned to the unified literal (the result of the unification) in $H$.

Both operations—backchaining and unification—can be applied as many times as possible to generate a possibly infinite set of hypotheses. The generation of the set of hypotheses initialized as an empty set can be formalized as follows.

**Backchaining**

**(2.1)**

**Unification**

**(2.2)**

Weighted abduction defines a *cost* of a candidate hypothesis $H$ as

**(2.3)** $\quad \mathrm{cost}(H) = \sum_{h \in \mathrm{assumed}(H)} \mathrm{cost}(h)$

where $h$ is an atomic conjunct in $H$, also called an *elemental hypothesis* (e.g., $p(x)$ in the previous $H$), and $\mathrm{assumed}(H)$ is the set of conjuncts in $H$ that are assumed rather than explained.

In this framework, minimum-cost hypotheses are the best hypotheses. Given Eq. *(2.3)*, weighting of the candidate hypotheses works as follows:

(1) Proofs with fewer assumptions are favored.

(2) Proofs are favored that exploit the implicit redundancy (unification of predicates with costs > 0 always reduces the overall cost of the hypothesis).

(3) Plausible axioms are favored over less plausible axioms.

Weighted abduction also solves the problem of where to stop drawing inferences, which could easily be unlimited in number; an inference is appropriate if it is part of the lowest-cost proof of the observation.
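As a minimal sketch of this cost bookkeeping (our toy code, not the chapter's reasoner; the \$10 observation and 1.2 weight follow the running example above):

```python
# Toy cost bookkeeping for the two weighted-abduction operations
# (illustrative only).

def backchain_cost(cost, weight):
    """Cost of assuming an antecedent introduced by backchaining: f(c, w) = c * w."""
    return cost * weight

def unify_cost(cost1, cost2):
    """The merged literal keeps the smaller of the two costs."""
    return min(cost1, cost2)

# Observation q(x) with cost $10 and axiom p(x)^{1.2} -> q(x):
h1 = 10.0                        # hypothesis 1: assume q(x) itself
h2 = backchain_cost(10.0, 1.2)   # hypothesis 2: assume p(x) instead ($12)
best = min(h1, h2)               # the cheaper hypothesis wins
```

Here the observation itself is the lowest-cost proof; with a weight below 1, backchaining would win instead.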

**2.4 ILP-based Weighted Abduction**

**2.4.1 ILP Formulation of Weighted Abduction**

Recently, we have developed an efficient implementation of weighted abduction based on Integer Linear Programming (ILP) [*29*,*30*]. The key idea of this implementation is that finding hypotheses in weighted abduction can be modeled as a constrained combinatorial optimization problem. This gives us three benefits. First, we reduce the search space of candidate hypotheses as compared to fully grounding approaches, because we represent the search space through the combination of literals and variable substitutions (see a detailed comparison later in this section). Second, we exploit the state-of-the-art combinatorial optimization technology developed in operations research. Specifically, our optimization problem can be naturally formulated as an ILP problem, which can be efficiently solved by existing ILP solvers. Third, the resulting framework is highly extensible; for example, we can easily incorporate linguistically motivated heuristics by simply adding some ILP variables and/or constraints to an optimization problem, keeping the overall framework unchanged.

Let us give an informal description of the ILP-based implementation of weighted abduction (see Inoue and Inui *[29]* for more details). *Figure 2.1* shows an example. Given an observation and background knowledge, the reasoner selects the best candidate hypothesis as follows. First, a set of *potential elemental hypotheses* is constructed by applying backchaining and unification inference operations (step 1 in *Figure 2.1*). This procedure is called *search-space generation*.

**FIGURE 2.1** Illustration of our ILP-based reasoning procedure.

Then, ILP variables and constraints are generated, which delimit the set of all possible candidate hypotheses (step 2 in *Figure 2.1*). The four main ILP variables are $h_p$, $r_p$, $u_{p,q}$, and $s_{x,y}$, where $p$ and $q$ are potential elemental hypotheses and $x$ and $y$ are first-order logical variables or constants used in them. $h_p$ is used to represent whether $p$ is hypothesized ($h_p = 1$) or not ($h_p = 0$).

A potential elemental hypothesis $p$ is *hypothesized* in a candidate hypothesis if it is assumed or explained in this candidate hypothesis. $r_p$ is used to represent whether $p$ pays its cost ($r_p = 0$, i.e., it is assumed) or not ($r_p = 1$, i.e., it is explained). For example, in *Figure 2.1*, $r_p$ is set to 1 for a literal that is explained by backchaining. $u_{p,q}$ and $s_{x,y}$ are used to represent unification. When $p$ and $q$ are unified, $u_{p,q}$ is set to 1, and to 0 otherwise. $s_{x,y}$ is set to 1 if $x$ and $y$ are assumed to be equal (i.e., $x = y$). For instance, when two literals in *Figure 2.1* are unified, the corresponding $u_{p,q}$ and $s_{x,y}$ are set to 1, since the literals are unified and the equality of their arguments is assumed.

The ILP objective function is as follows:

**(2.4)** $\quad \min \sum_{p \in P} \mathrm{cost}(p) \cdot (h_p - r_p)$

Thus, the cost of a candidate hypothesis $H$ is the sum of the costs of those $p$ that are hypothesized and *not* explained in $H$.

The space of candidate hypotheses is restricted by several ILP constraints.

C1: For every candidate hypothesis, observed literals must be hypothesized, which is expressed as follows:

**(2.5)** $\quad h_p = 1 \ \text{ for all } p \in O$

For example, in *Figure 2.1*, such an equality is introduced for each observed literal.

C2: Variables $r_p$ depend on other variables. $r_p$ can be set to 1 only if (1) $p$ is explained by some other literal or (2) $p$ is unified with a literal that has a smaller cost than $p$. For example, to set $r_p = 1$ for a literal in *Figure 2.1* (i.e., the literal does not have to pay its cost), both of its parents must be hypothesized in that candidate hypothesis; for another literal, $r$ can be set to 1 only if that literal is unified with a cheaper one. This can be expressed as follows:

**(2.6)** $\quad r_p \leq \sum_{q \in \mathrm{expl}(p)} h_q + \sum_{q \in \mathrm{unif}(p)} u_{p,q}$

where $\mathrm{expl}(p)$ is the set of potential elemental hypotheses that explain $p$, and $\mathrm{unif}(p)$ is the set of potential elemental hypotheses that are unifiable with $p$ and have smaller costs than $p$. In *Figure 2.1*, a constraint of this form is introduced for each such literal.

Furthermore, we use an additional constraint that forces the variables for literals introduced by the same conjunction (e.g., the two conjoined literals in *Figure 2.1*) to take the same value. We introduce the following equality:

**(2.7)** $\quad h_p = h_q \ \text{ for all } q \in \mathrm{conj}(p)$

where $\mathrm{conj}(p)$ denotes the set of literals $q$ such that $q$ is conjoined with $p$ by a conjunction. In *Figure 2.1*, such an equality is generated to represent that one conjoined literal is hypothesized if and only if the other is.

C3: Two elemental hypotheses $p$ and $q$ can be unified only if both are hypothesized, which is represented as $2u_{p,q} \leq h_p + h_q$. In *Figure 2.1*, for example, one such constraint is introduced for each unifiable pair.

C4: Two literals $p(x_1, \ldots, x_n)$ and $q(y_1, \ldots, y_n)$ can be unified (i.e., $u_{p,q} = 1$) only if their corresponding arguments are assumed to be equal ($s_{x_k, y_k} = 1$ for all $k$). This can be expressed as follows:

**(2.8)** $\quad u_{p,q} \leq s_{x_k, y_k} \ \text{ for all } k \in \{1, \ldots, n\}$

In *Figure 2.1*, such a constraint is generated because two literals can be unified only if one argument is substituted for the other.

A second type of unification constraint is used for ensuring that the equality relation over logical terms is transitive.

C5: The relation $s$ is transitive; namely, $s_{x,z}$ must be 1 if $s_{x,y} = 1$ and $s_{y,z} = 1$. This can be expressed as follows:

**(2.9)** $\quad s_{x,y} + s_{y,z} - s_{x,z} \leq 1$

**(2.10)** $\quad s_{x,y} - s_{y,z} + s_{x,z} \leq 1$

**(2.11)** $\quad -s_{x,y} + s_{y,z} + s_{x,z} \leq 1$
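To make the encoding concrete, here is a brute-force sketch (ours; a real instance is handed to an ILP solver) of the objective together with constraints C1 and C2 on a two-literal example, where an observed literal $q$ (\$10) can be explained by assuming $p$ (\$12):

```python
from itertools import product

# Brute-force stand-in for the ILP: enumerate 0-1 assignments of h_p
# (p is hypothesized) and r_p (p's cost is waived because p is explained)
# for a tiny hand-built instance. Names and numbers are ours.

costs = {"q": 10.0, "p": 12.0}    # q observed ($10); p from backchaining ($12)
explains = {"q": ["p"], "p": []}  # p explains q; nothing explains p

best_cost, best_assign = None, None
lits = sorted(costs)
for bits in product([0, 1], repeat=2 * len(lits)):
    h = dict(zip(lits, bits[:len(lits)]))
    r = dict(zip(lits, bits[len(lits):]))
    if h["q"] != 1:                 # C1: observed literals must be hypothesized
        continue
    if any(r[p] > sum(h[e] for e in explains[p]) for p in lits):
        continue                    # C2: cost waived only if an explainer is active
    cost = sum(costs[p] * (h[p] - r[p]) for p in lits)
    if best_cost is None or cost < best_cost:
        best_cost, best_assign = cost, (h, r)
```

The optimum simply assumes the observation (\$10), since explaining it via the more expensive antecedent would cost \$12.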

In practice, however, an exhaustive enumeration of potential elemental hypotheses is intractable. In the experiments that follow, we use a simple heuristic to control the search: a depth parameter $d$ such that we stop backward chaining at depth $d$; that is, the longest inference chains have length $d$.^{1}

As mentioned in Section *2.2*, our work is most similar to Santos’s ILP-based formulation *[52]* of propositional logic-based abduction. There are two main differences between our approach and his *[52]*.

The first difference is that we account for the specificity of explanations, which is important for abduction-based NLP, as discussed in Hobbs et al. *[28]*. In Santos *[52]*, most-specific explanations are favored. Suppose $O = q^{\$c}$ and $B = \{p^{w} \rightarrow q\}$. There are two candidate hypotheses. The first one is $H_1 = q^{\$c}$, which simply assumes the observation. Backward chaining on $q$ results in the second hypothesis, $H_2 = p^{\$wc}$, which is more specific than $H_1$.

The cost of $H_2$ is equal to $w \cdot c$. Note that we do not count the cost of $q$ because $q$ is *not* assumed anymore. If $w < 1$, then the more specific hypothesis $H_2$ is selected as the best explanation; otherwise, the less specific hypothesis $H_1$ is selected as the best explanation. The choice is controlled by the ILP variables $r_p$ and Constraint C2. To summarize, in our approach the choice of the best explanation is based on how well the explanation is supported by observations.

Another important difference is that our approach directly models abduction in first-order logic, while Santos *[52]* employs propositional logic using the grounding procedure of replacing variables with all possible constants. Suppose $O = p(x, y)$ and the set of all possible constants is $C = \{A, B\}$. To ground this observation, we need to generate a disjunctive clause for $p$, replacing $x$ and $y$ with all possible combinations from $C$; that is, $p(A, A) \vee p(A, B) \vee p(B, A) \vee p(B, B)$.

The problem arises in the search-space generation process: there are $|C|^2$ potential elemental hypotheses ($p(c_1, c_2)$ for all $c_1, c_2 \in C$) for the literal $p(x, y)$. In addition, backchaining on each of them with an axiom $q(x, y) \rightarrow p(x, y)$ results in $|C|^2$ further potential elemental hypotheses ($q(c_1, c_2)$ for all $c_1, c_2 \in C$). In contrast, we generate the set $\{p(x, y), q(x, y)\}$. Thus, our approach seems to be more robust to the domain size. In discourse processing, this robustness is important because natural language predicates usually have more than two arguments referring to participants of the same event.
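The size gap is easy to quantify (our back-of-the-envelope arithmetic, matching the argument above):

```python
# Grounded vs. first-order search-space sizes for a single literal.

def grounded_size(num_constants, arity):
    """Number of ground instances p(c1, ..., ck) over |C| constants."""
    return num_constants ** arity

first_order_size = 1          # the single non-ground literal p(x, y)

small = grounded_size(2, 2)   # p(x, y) over C = {A, B} -> 4 instances
big = grounded_size(100, 3)   # a 3-ary event predicate over 100 constants
```

Each additional argument multiplies the grounded space by another factor of $|C|$, while the first-order representation stays constant.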

**2.4.2 Handling Negation**

The ILP-based framework described before does not provide full support for first-order predicate logic (FOPL) since it cannot represent negation. Only axioms in the form of Horn clauses and positive literals in the observations are accepted. However, the capability to handle negation might be crucial for a wide range of reasoning tasks.

We extend ILP-based weighted abduction to handle negated literals (e.g., $\neg p(x)$) and inequalities of variables (e.g., $x \neq y$); see Inoue and Inui *[30]* for more details. Given this extension, negated literals and variable inequalities are allowed both in observations and in background knowledge.

Additional unification constraints need to be introduced in order to handle negation. First, we consider the case where two propositions $p(x_1, \ldots, x_n)$ and $\neg p(y_1, \ldots, y_n)$ occur in the set of potential elemental hypotheses and are potentially unifiable. We want to prohibit the two propositions from being hypothesized simultaneously. In other words, the two propositions cannot both be hypothesized ($h_p = 1$ and $h_{\neg p} = 1$) if the variable substitutions are activated ($s_{x_k, y_k} = 1$ for all $k$). This can be expressed as follows:

**(2.12)** $\quad h_p + h_{\neg p} + \sum_{k=1}^{n} s_{x_k, y_k} \leq n + 1$

Formulation of the ILP constraints corresponding to variable inequality is straightforward. For each pair of variables $x$ and $y$ such that $x \neq y$ is stated, the following equality is introduced:

**(2.13)** $\quad s_{x,y} = 0$
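As a small sketch (our rendering of the check behind such a constraint): $p$ and $\neg p$ may not both be hypothesized once every corresponding argument pair is assumed equal.

```python
# Negation check mirroring a linear constraint of the form
#   h_p + h_notp + sum_k s_k <= n + 1   for n argument pairs
# (illustrative rendering, not the chapter's exact encoding).

def negation_violated(h_p, h_notp, s_pairs):
    n = len(s_pairs)
    return h_p + h_notp + sum(s_pairs) > n + 1
```

The constraint only bites when both literals are active and all argument equalities hold; dropping any one of them restores feasibility.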

**2.4.3 Cutting Plane Inference**

The ILP-based reasoning framework generates $O(n^3)$ transitivity constraints, where $n$ is the number of logical terms, making inference intractable in a large-scale setting. Our solution to this problem is based on the idea that there is no need to check all transitivity constraints at once. We gradually optimize and add transitivity constraints, if violated, in an iterative manner.

We employ Cutting Plane Inference (CPI), an iterative optimization strategy from operations research. It is an exact inference technique originally developed for solving large linear programming problems *[19]*. CPI has been successfully applied to a wide range of constrained optimization problems [*50*,*49*,*34*,*3*].

We developed an algorithm called *CPI4CBA* that applies the CPI technique to our optimization problem. The pseudo-code of *CPI4CBA* is given in *Algorithm 2.1*. First, the algorithm creates an ILP optimization problem without any transitivity constraints (line 1), consisting of a set of ILP variables and a set of ILP constraints. Lines 2 through 12 describe the iteration process. A solution is found for the current ILP optimization problem (line 3). Then, for each pair of logical atomic terms $(x, y)$ unified in the solution (line 4), the algorithm finds a logical term $z$ that can be unified with $x$ or $y$ (line 5). If a transitive relation over $x$, $y$, and $z$ is violated, then constraints preventing this violation are generated and stored (lines 6–8). The constraint set is extended with the new constraints (line 11). The iteration terminates when no transitivity constraints are violated (line 12).

**Algorithm 2.1** CPI4CBA(Background Knowledge , Observation )

The key advantage of *CPI4CBA* is that the algorithm reduces the time of search-space generation and the time of ILP optimization; see Inoue and Inui *[30]* for a theoretical proof and an empirical evaluation. *CPI4CBA* does not generate all transitivity constraints before starting optimization. In addition, the optimization problems to be solved become smaller than the original problems when not all transitivity constraints need to be considered.
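The iterative scheme can be sketched in miniature (our toy example: three terms, a made-up merging reward, and brute-force search standing in for the ILP solver):

```python
from itertools import product, combinations

# Toy cutting-plane loop: solve without transitivity constraints, then add
# only the violated ones and re-solve (terms and rewards are made up).

terms = ["x", "y", "z"]
pairs = list(combinations(terms, 2))                    # s_xy, s_xz, s_yz
gain = {("x", "y"): 5, ("x", "z"): -8, ("y", "z"): 5}   # merging x,z is bad
triangles = [(("x", "y"), ("y", "z"), ("x", "z")),
             (("x", "y"), ("x", "z"), ("y", "z")),
             (("x", "z"), ("y", "z"), ("x", "y"))]      # s_a + s_b - s_c <= 1

def solve(constraints):
    """Best 0-1 assignment satisfying the currently active constraints."""
    best = None
    for bits in product([0, 1], repeat=len(pairs)):
        s = dict(zip(pairs, bits))
        if any(s[a] + s[b] - s[c] > 1 for a, b, c in constraints):
            continue
        value = sum(gain[p] * s[p] for p in pairs)
        if best is None or value > best[0]:
            best = (value, s)
    return best

active = []
while True:
    value, s = solve(active)
    violated = [t for t in triangles
                if t not in active and s[t[0]] + s[t[1]] - s[t[2]] > 1]
    if not violated:
        break
    active.extend(violated)   # add cutting planes and re-solve
```

The relaxed optimum merges $x,y$ and $y,z$ without $x,z$; the one violated triangle is added as a cut, and the re-solve settles on a single consistent merge.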

**2.5 Weighted Abduction for Plan Recognition**

A *plan* is an agent’s set of intentions and beliefs about achieving a goal. The task of plan recognition consists of inferring an agent’s plan from observed actions or utterances. Recognizing plans is crucial for solving knowledge-intensive NLP tasks such as story understanding and dialogue planning. Since the task of plan recognition can be modeled as finding the best explanation (i.e., a plan) given an observation (i.e., utterances), most of the proposed methods have been based on abduction [*11*,*39*,*54*].

We tested the ILP-based abductive reasoner using a subset of the dataset originally developed for the evaluation of the abductive plan recognition system *ACCEL* *[39]*. We extracted 50 plan recognition problems and 107 background axioms from the dataset. The plan recognition problems represent an agent’s actions as conjunctions of propositions. For example, problem *t2* consists of the following propositions:

This logical form represents the sequence of actions that can be described as follows: Bob got a gun; he got off the bus at the liquor store. The task is to infer Bob’s plan given the preceding observations and background knowledge. For example, the dataset contains the following background axiom:

For each axiom, we set the sum of the assumption weights equal to 1.2 by default; that is, $\sum_{i} w_i = 1.2$.
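Assuming the 1.2 total is split uniformly across an axiom's antecedents (our reading of this default; the chapter does not state the split):

```python
# Uniform split of the default total assumption weight over n antecedents
# (our assumption about how the 1.2 total is distributed).

def antecedent_weights(n, total=1.2):
    return [total / n] * n

w = antecedent_weights(3)   # three antecedent literals, 0.4 each
```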

We first evaluate the processing time of our ILP-based reasoner. *Table 2.1* shows the runtime of inference for different values of the depth parameter $d$ that restricts the possible depth of the inference chains (system: ILPWA). For the sake of comparison, we present the performance of the state-of-the-art abductive reasoner *Mini-TACITUS*^{2} *[38]* on the same dataset (system: M-TAC).

**Table 2.1**

**The Processing Time of Each System**

For all experiments, we restrict the maximum runtime of inference to 120 seconds. In the table, we show the percentage of the problems optimally solved by each reasoner within 120 seconds (% of solved) and the inference time averaged on such problems. For the ILP-based reasoner, we also show the runtime of each processing step (generation of ILP variables and constraints, ILP optimization).

To evaluate how well both systems do in plan recognition, we followed the evaluation strategy of Singla and Mooney *[54]* and focused on top-level plans and their role fillers (e.g., the agents and themes of plans). Top-level plans include *smarket_shopping*, *liqst_shopping*, *shopping*, *robbing*, *going_by_plane*, *going_by_bus*, *rest_dining*, *drinking*, *going_by_taxi*, *paying*, *jogging*, and *partying*. We used precision (the fraction of inferred literals that are correct), recall (the fraction of correct literals that are inferred by the system), and F-measure (the harmonic mean of precision and recall) as evaluation measures.
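These measures amount to simple set arithmetic over inferred versus gold literals (the literal names below are invented for illustration):

```python
# Precision, recall, and F-measure over sets of literals.

def prf(inferred, gold):
    tp = len(inferred & gold)                 # correctly inferred literals
    p = tp / len(inferred) if inferred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = prf({"robbing(r1)", "agent(r1,bob)", "paying(p1)"},
              {"robbing(r1)", "agent(r1,bob)", "instr(r1,gun)"})
```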

*Table 2.1* shows that our reasoner can find optimal solutions for all problems within 120 seconds at every depth, while *Mini-TACITUS* cannot find optimal solutions for 72% of the problems (36/50) within 120 seconds even for depth 1. At greater depths, 80% of the problems (40/50) could not be solved by *Mini-TACITUS* in 120 seconds. This indicates that *Mini-TACITUS* is sensitive to the depth parameter, which is a significant drawback for abductive inference with large-scale background knowledge.

Currently we are developing an automatic method for tuning the cost function in a supervised learning framework *[32]*. We report preliminary results of the learning procedure in the rest of this section (see Inoue et al. *[32]* for more detail). In the learning framework, we model the cost function with a weighted linear model and learn axiom weights with the Passive-Aggressive algorithm *[17]*, a large-margin online learning algorithm that finds weight parameters that classify the given training examples while maximizing the margin (i.e., the minimal distance between the separating hyperplane and the examples). To make the predicates representing top-level plans disjoint, we used 73 ILP constraints that represent pairwise disjointness of top-level plan predicates in our experiment.^{3}

*Table 2.2* shows the results of plan recognition with learned parameters. In the “closed test” setting, we used the development set for both training and testing. In the “open test” setting, we used the development set for training and the test set for testing. The results show that our training procedure successfully learns the cost function from the dataset and has a generalization ability. We compare the results of the open test (0.58 F-score) with two existing systems: (1) the *ACCEL* system *[39]* and (2) a Markov Logic Networks-based (MLN-based) system *[54]*.

**Table 2.2**

**Results of Plan Recognition with Learned Parameters**

Singla and Mooney *[54]* report that *ACCEL* achieves 0.89 F-score on the test set. However, the cost function of *ACCEL* (called *coherence* metric in Ng and Mooney *[39]*) is particularly tuned for the task of story understanding and is not trainable. Our system is more general and applicable to a wide range of tasks because of the learning framework. Singla and Mooney *[54]* also report that the MLN-based system achieves 0.72 F-score on the test set. The MLN-based system can learn the cost function as we do. However, it relies on manually initialized and tuned weights, while our weight vector is initialized with zeros in the experiment described.^{4}

To summarize, the performance of our system is still not as good as that of the existing systems exploiting manually tuned weights. However, our system achieves 0.58 F-score, which is close to the 0.62 of the logical abduction setting in *Table 2.2*, *without* any manual tuning of weights. This indicates that our learning framework is promising. In the future, we plan to evaluate the learning framework on a larger dataset and apply it to a wide range of discourse-processing tasks.

**2.6 Weighted Abduction for Discourse Processing**

This section describes how we apply weighted abduction to discourse processing. Our goal is to recover implicit information from natural language texts. The implicit information includes semantic relations between discourse entities, anaphoric relations, characters’ intentions, and so on. Such information has been shown to be useful in several NLP tasks, such as question answering and recognizing textual entailment *[26]*.

**2.6.1 NL Pipeline**

Our natural language pipeline produces interpretations of texts given a knowledge base. A text is first input to the English semantic parser *Boxer* *[5]*. For each segment, the parse produced by *Boxer* is a first-order fragment of the DRS language used in Discourse Representation Theory *[35]*. An add-on to *Boxer* converts the DRS into a logical form (LF) in the style of Hobbs *[27]*.

The LF is a conjunction of propositions, which have generalized eventuality arguments that can be used for showing relationships among the propositions. Hobbs *[27]* extends Davidson’s approach *[20]* to all predications and posits that corresponding to any predication that can be made in natural language, there is an eventuality. Similarly, any predication in the logical notation has an extra argument, which refers to the “condition” of that predication being true. Thus, in the logical form $run'(e_1, x) \wedge John'(e_2, x)$ for the sentence *John runs*, $e_1$ is a running event by $x$ and $e_2$ is a condition of $x$ being named “John.”
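As a sketch of the data shape only (not Boxer's actual output format), such a flat LF can be held as a list of predications, each with its extra eventuality argument:

```python
# Hobbs-style logical form as data: each predication carries an extra
# "eventuality" argument; shared variables link the predications.

# "John runs": e1 is the running eventuality, e2 the condition of x1
# being named John.
lf = [("run", "e1", "x1"), ("John", "e2", "x1")]

shared_args = set(lf[0][2:]) & set(lf[1][2:])   # x1 ties runner to name
```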

In terms of weighted abduction, logical forms represent observations, which need to be explained by background knowledge. In the context of discourse processing, we call a hypothesis explaining a logical form *an interpretation* of this LF. The interpretation of the text is carried out by the abductive system. The system tries to prove the logical form of the text, allowing assumptions whenever necessary. Where the system is able to prove parts of the LF, it is anchoring it in what is already known from the overall discourse or from a knowledge base. Where assumptions are necessary, it is gaining new information.

Let us illustrate the procedure with a simple example. Suppose we need to interpret the text fragment “*the Boston office secretary*.” The logical form of this text fragment is as follows:

$Boston(x_1) \wedge nn(x_1, x_2) \wedge office(x_2) \wedge nn(x_2, x_3) \wedge secretary(x_3)$

Suppose our knowledge base contains the following axioms. Axioms 1 and 2 represent the facts that being in a location and working in a workspace can be expressed by a noun compound in English. Axioms 3, 4, and 5 say that Boston is a location, every secretary is a person, and every office is a workspace.

(1) $location(x) \wedge in(y, x) \rightarrow nn(x, y)$

(2) $person(y) \wedge workspace(x) \wedge work\text{-}in(y, x) \rightarrow nn(x, y)$

(3) $Boston(x) \rightarrow location(x)$

(4) $secretary(x) \rightarrow person(x)$

(5) $office(x) \rightarrow workspace(x)$

Using these axioms, as illustrated in *Figure 2.2*, we can infer that the secretary works in the office located in Boston, which is an inference we can draw from the original text.

**FIGURE 2.2** Weighted abduction for discourse processing.

**2.6.2 Knowledge Base**

The proposed discourse-processing procedure is based on a knowledge base (KB) consisting of a set of axioms. To obtain a reliable KB with a large coverage we exploited existing lexical semantic resources.

First, we have extracted axioms from WordNet *[22]*, version 3.0, which has already proved itself to be useful in knowledge-intensive NLP applications. The central entity in WordNet (WN) is called a *synset*. Synsets correspond to word senses, so that every lexeme can participate in several synsets. We used the lexeme-synset mapping for generating axioms. For example, one such axiom maps the verb *compose* to *synset-X*, which represents one of its senses.

Moreover, we have converted the following WordNet relations defined on synsets and word senses into axioms: hypernymy (type-of), instantiation, entailment, similarity, causation, meronymy (part-of), and derivation. In total, 333,192 axioms have been generated from WordNet.

The second resource that we used as a source of axioms is FrameNet 1.5 *[51]*. FrameNet has a shorter history in NLP applications than WordNet, but lately more and more researchers have been demonstrating its potential to improve the quality of question answering *[53]* and recognizing textual entailment *[7]*. The lexical meaning of predicates in FrameNet is represented in terms of frames that describe prototypical situations spoken about in natural language. Every frame in FrameNet (FN) contains a set of roles corresponding to the participants of the described situation. Predicates with similar semantics are assigned to the same frame. For most of the lexemes, FrameNet provides syntactic patterns showing the surface realization of these lexemes and their arguments. We used the patterns for deriving axioms. For example, this axiom corresponds to phrases like “*John gave a book to Mary*.”

GIVING(e1) ∧ DONOR(e1, x1) ∧ RECIPIENT(e1, x2) ∧ THEME(e1, x3) → give(e1, x1, x3) ∧ to(e2, e1, x2)

FrameNet also introduces semantic relations defined on frames, such as inheritance, causation, or precedence; for example, the GIVING and GETTING frames are connected with the causation relation. Roles of the connected frames are also linked (e.g., DONOR in GIVING is linked with SOURCE in GETTING). To generate corresponding axioms, we used previous work on axiomatizing frame relations *[42]*. An example of an axiomatized relation follows. In total, 31,861 axioms have been extracted from FrameNet.

GIVING(e1) ∧ DONOR(e1, x1) ∧ RECIPIENT(e1, x2) ∧ THEME(e1, x3) → GETTING(e2) ∧ SOURCE(e2, x1) ∧ RECIPIENT(e2, x2) ∧ THEME(e2, x3)

Axiom weights are calculated using the frequency of the corresponding word senses in annotated corpora. The information about frequency is provided both by WordNet and by FrameNet. In our framework, axioms of the type *species* → *genus* should have weights exceeding 1, which means that assuming *species* costs more than assuming *genus*, because there might be many possible *species* for the same *genus*. The weights of such axioms are heuristically defined as ranging from 1 to 2.

To assign a weight to a sense of a lexeme, we use information about the frequency of the word sense in the annotated corpora. An obvious way of converting the frequency into a weight is the following equation:

**(2.14)** w(s_i) = 2 − freq(s_i) / Σ_{s∈S} freq(s)

where S is the set of all senses of the lexeme, so that the most frequent sense receives the lowest (cheapest) weight.
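One plausible reading of this frequency-to-weight conversion, with weights between 1 and 2 and more frequent senses cheaper to assume, can be sketched as follows; the function and sense names are illustrative, not the chapter's code.

```python
def sense_weights(freqs):
    """Map per-sense corpus frequencies to axiom weights in (1, 2]:
    the more frequent a sense, the cheaper it is to assume.

    freqs: dict mapping a sense identifier to its corpus frequency
    for a single lexeme."""
    total = sum(freqs.values())
    return {sense: 2.0 - f / total for sense, f in freqs.items()}

# Two senses of one lexeme; the dominant sense gets the lower weight.
weights = sense_weights({"synset-X": 30, "synset-Y": 10})
# weights["synset-X"] == 1.25, weights["synset-Y"] == 1.75
```

A sense carrying all of a lexeme's occurrences would get weight 1, and vanishingly rare senses approach weight 2, matching the heuristic range stated above.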

**2.6.3 Overmerging**

Frequently, the lowest-cost interpretation results from unification (i.e., identifying two entities with each other) so that their common properties only need to be proved or assumed once. This is one of the principal methods by which coreference is resolved.

However, *overmerging*—unification of incompatible propositions—is quite a serious problem in large-scale processing. For example, given the observations p(x) and p(y), weighted abduction incorrectly assumes that x equals y, even when x and y refer to distinct entities. For “*John runs and Bill runs*,” with the logical form

run(e1, x1) ∧ John(x1) ∧ run(e2, x2) ∧ Bill(x2)

weighted abduction assumes John and Bill are the same individual just because they are both running.

The problem we will address is this: If p(x) and p(y) both hold, how good is that as evidence that x and y are the same entity, given what else we know about x and y? To approach this problem, we impose constraints on unification based on linguistic observations.

**2.6.3.1 Argument constraints**

The rule for generating argument constraints can be formulated as follows: For any two propositions p(x1, …, xn) and p(y1, …, yn), if

• x1 is not equal to y1,

• p is not a noun predicate,

• there is a k > 1 such that x_k is not equal to y_k, and both x_k and y_k occur as arguments of predicates other than p,

then p(x1, …, xn) and p(y1, …, yn) cannot be unified.

This is expressed by the following ILP constraint:

**(2.15)**

This rule ensures that nouns can be unified without any restriction, and that other predicates can be merged only if all their nonfirst arguments are equal (due to previous mergings) or uninstantiated. As seen from the preceding statements, the argument unification restriction concerns nonfirst arguments only. First arguments of all predicates in the logical forms are “handles” referring to conditions in which the predicate is true of its arguments (i.e., referring to the predication itself) rather than to its semantic arguments.

The proposed nonmerge rule is a heuristic corresponding to the intuition that the same noun is unlikely to refer to different entities in a short discourse, while for other predicates this is possible. According to this rule, the two *run* propositions in the logical form for “*John runs and Bill runs*” cannot be unified, because their second arguments are instantiated by distinct entities (John and Bill). At the same time, two *run* propositions whose nonfirst arguments are equal (due to previous mergings) or uninstantiated can be unified.
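The nonmerge rule can be sketched as a check over pairs of propositions sharing a predicate. This is a minimal sketch under an assumed toy representation (predicate, argument list, noun flag); the real system encodes the rule as an ILP constraint rather than a Python check.

```python
def can_unify(p, q, arg_occurrences):
    """Argument (nonmerge) constraint, sketched.

    p, q: (predicate, args, is_noun) triples with the same predicate;
    args[0] is the 'handle' and is exempt from the restriction.
    arg_occurrences: maps a variable to the set of predicates it
    appears under, so instantiated variables can be told apart."""
    pred, p_args, is_noun = p
    _, q_args, _ = q
    if is_noun:                      # nouns unify without restriction
        return True
    for xk, yk in zip(p_args[1:], q_args[1:]):
        if xk != yk:
            # Both arguments occur under other predicates
            # (i.e., are instantiated) -> block unification.
            if (arg_occurrences.get(xk, set()) - {pred} and
                    arg_occurrences.get(yk, set()) - {pred}):
                return False
    return True

# "John runs and Bill runs":
#   run(e1, x1) & John(x1) & run(e2, x2) & Bill(x2)
occ = {"x1": {"run", "John"}, "x2": {"run", "Bill"}}
can_unify(("run", ["e1", "x1"], False),
          ("run", ["e2", "x2"], False), occ)   # blocked: John != Bill
```

Two *run* propositions with an uninstantiated second argument would pass the check and remain candidates for unification.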

**2.6.3.2 Compatibility constraints**

We define sets of incompatible properties (e.g., *dog* and *cat*) and prohibit unification of two propositions if it implies unifying variables with incompatible properties. For example, we prohibit the unification x = y if dog(x) and cat(y) are hypothesized. This kind of constraint can be formulated as the logical constraint p(x) ∧ q(y) → x ≠ y, where p and q are predicates that have the property of exclusion, and then expressed as a corresponding ILP constraint. One can reduce the number of constraints by using ontological hierarchy information, so that constraints are introduced only for the most general classes. In the experiments described next, we set all antonyms and sister terms in WordNet to be disjoint.
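The compatibility check itself is straightforward; the following sketch, with an illustrative disjointness table, shows the test that would gate a candidate unification x = y.

```python
# Toy disjointness table; in the experiments this is populated from
# WordNet antonyms and sister terms.
DISJOINT = {frozenset({"dog", "cat"}), frozenset({"alive", "dead"})}

def compatible_merge(props_x, props_y):
    """Return False if identifying x with y would give one entity two
    mutually exclusive properties (e.g., dog(x) & cat(y) -> x != y).

    props_x, props_y: sets of property predicates hypothesized
    for the two variables."""
    for p in props_x:
        for q in props_y:
            if frozenset({p, q}) in DISJOINT:
                return False
    return True

compatible_merge({"dog", "animal"}, {"cat"})   # blocked
compatible_merge({"dog"}, {"animal"})          # allowed
```

Restricting the table to the most general disjoint classes, as suggested above, keeps the number of pairwise checks (and the corresponding ILP constraints) manageable.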

Another possible approach to overmerging is to evaluate the plausibility of the equality statements during inference [*9*,*10*]. Charniak and Goldman [*9*,*10*] perform first-order, logic-based plan recognition using Bayesian Networks (BNs). In the BN, each node represents the presence of an event or entity (e.g., (hang k1), (rope r2)) or an equality statement that means two entities are the same (e.g., (rope-of k1) = r2). Each edge connects causally related or implicationally related nodes (e.g., (hang k1) and (kill k1)).

Once the BN is constructed, plan recognition can be performed by maximum a posteriori (MAP) inference over the BN given observed actions. In this framework, probabilities are defined over equality nodes. This direction is taken in Inoue et al. *[31]*, who model the likelihood of equalities using a weighted linear feature function whose parameters are discriminatively trained on a corpus with coreference annotations.

**2.7 Evaluation on Recognizing Textual Entailment**

The discourse interpretation procedure and the KB derived from WordNet and FrameNet are fairly general and not tuned for any particular type of inferences or tasks. We have performed an evaluation on the Recognizing Textual Entailment (RTE) task, which is a generic task that seems to capture major semantic inference needs across many NLP applications. In this task, the system is given a text (T) and a hypothesis (H) and must decide whether the hypothesis is entailed by the text plus commonsense knowledge. Suppose an RTE system gets the following input:

**T: ***John gave Mary a book.*

**H1: ***Mary got a book.*

**H2: ***Mary read a book.*

The task consists in predicting that T entails H1 but not H2. It was previously reported that inferring information implicit in the text and the hypothesis can improve the performance of an RTE system *[26]*. We intend to test whether implicit information produced by our system is helpful for solving the RTE task.

RTE has been previously approached by inference-based methods; see Riabinin and Dagan et al. [*47*,*18*] for overviews. Most of them are based on deductive reasoning. Deductive inference for RTE has a serious limitation: If some piece of knowledge required to infer H from T is missing from the KB, then the classic reasoner fails to find a proof and predicts “no entailment.” To overcome this problem, deduction-based RTE systems employ two solutions.

Some systems introduce different heuristics and consider logical inference to be just one of the features used to classify T–H pairs as “entailment” or “no entailment” *[6]*. Other systems decompose logical forms into atomic propositions and measure whether each proposition from H is entailed by a proposition from T [*2*,*23*,*55*].

Abduction solves the problem of knowledge incompleteness by allowing assumptions. Attempts to use weighted abduction for RTE are presented in Blythe et al., Ovchinnikova et al., and Ovchinnikova [*4*,*41*,*40*]. Ovchinnikova et al. *[41]* employ the reasoning system *Mini-TACITUS* (mentioned in Section *2.5*). Blythe et al. *[4]* implement weighted abduction as Markov Logic Networks *[48]*. Both systems suffer from the problem of computational inefficiency of inference that makes large-scale discourse processing intractable.

**2.7.1 Weighted Abduction for Recognizing Textual Entailment**

Our approach is to interpret both the text and the hypothesis using the abductive reasoner, and then to see whether adding information derived from the text to the knowledge base will reduce the cost of the best abductive proof of the hypothesis as compared to using the original knowledge base only. This idea is summarized in *Figure 2.3*.

**FIGURE 2.3** Weighted abduction for recognizing textual entailment.

To judge whether the text entails the hypothesis, we use a Support Vector Machine (SVM) *[56]*, a powerful and widely used binary classification algorithm, encoding the best proof as SVM features. This procedure can be summarized as follows:

1. Construct the best interpretation of the text.

2. Construct the best interpretation of the hypothesis.

3. Construct the best interpretation of the hypothesis given the best interpretation of the text.

4. For each entailment pair, extract the following features for SVM classification:

• For each part of speech, number of proven/not proven propositions in H.

• For each part of speech, number of unified propositions in H.

• Number of proven/not proven *not* propositions in H (negation).

We used an RBF-kernel SVM classifier* ^{5}* to classify input T–H pairs as “entailment” or “no entailment.”
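The feature extraction in step 4 can be sketched as follows. This is a minimal sketch with illustrative names; the actual system derives proven/unified status from the best abductive proof, and the resulting counts are fed to the RBF-kernel SVM.

```python
from collections import Counter

def rte_features(h_props, proven, unified):
    """Per part of speech, count proven / not-proven and unified
    propositions in H, plus counts for negated propositions.

    h_props: list of (predicate, pos, negated) triples from H's
    logical form; proven/unified: predicate sets read off the
    best proof. All names are illustrative."""
    feats = Counter()
    for pred, pos, negated in h_props:
        key = "proven" if pred in proven else "not_proven"
        feats[f"{pos}_{key}"] += 1
        if pred in unified:
            feats[f"{pos}_unified"] += 1
        if negated:                       # 'not' propositions
            feats[f"neg_{key}"] += 1
    return dict(feats)

# H: "Mary got a book." -- suppose 'get' was proven via the
# GIVING -> GETTING axiom and 'book' was unified with T's 'book'.
rte_features([("get", "verb", False), ("book", "noun", False)],
             proven={"get", "book"}, unified={"book"})
```

A fully proven, heavily unified hypothesis yields a feature vector the classifier can separate from one dominated by unproven assumptions.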

**2.7.2 Experiment**

We evaluate our procedure on the RTE 1–5 Challenge datasets *[18]*. The evaluation measure in the RTE challenges is *accuracy*, calculated as the percentage of correct judgments against the gold standard. The results for the corresponding datasets are presented in *Table 2.3*. For each challenge, the table shows the number of pairs in the development and test sets and the average accuracy achieved by the best runs of the original participants. We also replicate the results of Katrenko and Toledo *[37]*, calculating a baseline on the basis of the lexical overlap between T and H by using a support vector classifier with a polynomial kernel.

**Table 2.3**

**RTE Challenge Results**

First, we evaluated the processing time of our ILP-based reasoner. In this experiment, we have used the RTE-2 development and test datasets and the knowledge base containing axioms from WordNet and FrameNet. In *Table 2.4*, we present the results for different values of the depth parameter (D) that restricts the possible depth of the inference chains. The table contains information about the average number of potential elemental hypotheses (PEH), ILP variables (VAR), and ILP constraints (CON) generated per sentence. The deeper the inference chains are, the more variables and constraints are generated, because more axioms can be applied on each inference step. The reasoner was given a 2-minute time limit to finish the inference. If all candidate hypotheses could not be evaluated by that time, the reasoner was requested to output the best solution found so far and terminate.

**Table 2.4**

**Processing Time of the ILP-based Reasoner**

The column ALL in *Table 2.4* shows data for how many sentences all hypotheses were evaluated in 2 minutes and what the average processing time was for such sentences (in seconds). Our reasoner is dramatically faster than the *Mini-TACITUS* system, which needed more than 30 minutes per sentence on average on the same dataset *[41]*. Blythe et al. *[4]* could run to completion only 28 of 100 selected RTE-2 problems with an average processing time of 7.5 minutes.

Then we performed five runs for each RTE dataset: without a knowledge base, with compatibility constraints (CC), with axioms from WordNet and compatibility constraints (WN), with axioms from FrameNet and compatibility constraints (FN), and with both axiom sets and compatibility constraints. The accuracy results are presented in *Table 2.5*. We also report how many original participants we outperform in each challenge (Rank).* ^{6}* For RTE 1, 2, and 5 our system outperformed the baseline shown in *Table 2.3*. In addition, both WordNet and FrameNet helped entailment recognition for RTE 1, 2, and 4. However, for RTE 3 and 5, adding lexical-semantic knowledge was not helpful for predicting entailment, and our system could not beat the baseline.

**Table 2.5**

**Accuracy Results for RTE Challenge Datasets**

We compare our results for RTE 2 with the results reported in Ovchinnikova et al. *[41]*. The developed compatibility constraints give an advantage over the empty KB setting in Ovchinnikova et al. *[41]* (i.e., 60.8% vs. 57%).* ^{7}* Our constraints preventing overmerging have been evaluated using a corpus annotated with coreference relations; see Inoue et al. *[31]* for a detailed description.

Obviously, WordNet and FrameNet alone are not enough to predict entailment. In the example that follows, our system inferred that *president* is related to *presidential*, that Tehran is a part of Iran, that *mayor* and *official* can refer to the same person, and that *runoff* and *election* can mean the same, using WordNet relations. However, all this information does not help us to predict entailment.

**T: ***Iran will hold the first runoff presidential election in its history, between President Akbar Hashemi Rafsanjani and Tehran’s hard-line mayor, election officials said Saturday.*

**H: ***Hashemi Rafsanjani will face Tehran’s hard-line mayor in Iran’s first runoff presidential election ever, officials said Saturday.*

The analysis of knowledge needed for RTE is given, for example, in Clark et al. *[16]* and Garoufi *[24]*. In both works, the conclusion is that lexical-semantic relations are just one type of knowledge required; thus, our knowledge base needs a significant extension. Moreover, a much more elaborate treatment of logical connectors and quantifiers, such as *if*, *not*, *or*, *all*, *each*, and others, is needed. In the next example, H is (lexically) fully contained in T, but there is still no entailment.

**T: ***Drew Walker, NHS Tayside’s public health director, said: “It is important to stress that this is not a confirmed case of rabies.”*

**H: ***A case of rabies was confirmed.*

**2.8 Conclusion**

In this chapter, we have explored the logical framework of weighted abduction as applied to discourse processing. We used a tractable implementation of weighted abduction based on Integer Linear Programming and a large knowledge base generated automatically.

To the best of our knowledge, this is the first successful attempt to apply tractable inference to NLP on a large scale. We have shown that the proposed implementation of weighted abduction based on ILP is dramatically faster than previous implementations [*4*,*41*] evaluated on the same datasets.

Furthermore, the ILP-based implementation proved to be open to the introduction of linguistically motivated heuristics. For example, we have proposed a partial solution to the problem of coreference resolution as modeled in an abductive framework.

The inference procedure and the knowledge base are general and not tuned for a particular task. We performed evaluations on plan recognition and textual entailment recognition. The results allow us to conclude that the proposed procedure shows performance comparable to that of state-of-the-art systems. We consider the results of our experiments promising for many possible applications. In other papers, we describe evaluations on other knowledge-intensive natural language understanding tasks, such as coreference resolution *[31]*.

The experiments we have carried out show that there is still considerable room for improving the procedure. In future work, we plan to investigate different strategies of weight initialization, evaluate our cost and weight learning framework on a larger dataset, and apply it to a wider range of discourse-processing tasks.

Another future work direction concerns the knowledge base. Lexical-semantic dictionaries offer an easily accessible source of inference rules. However, such resources have limited coverage and are difficult to update. For example, they systematically lack proper names. In the future, we would like to experiment with lexical-semantic resources automatically generated from large amounts of texts; see, for example, Penas and Hovy *[43]*.

We also plan to elaborate our treatment of natural language expressions standing for logical connectors and quantifiers. This advance is needed to achieve more precise inferences, which at the moment rely on unifying the core information content (“aboutness”) of texts.

The ILP-based abductive reasoner is freely available for download at *https://github.com/naoya-i/henry-n700/*. The developed discourse processing pipeline is freely available for download at *https://github.com/metaphor-adp/Metaphor-ADP*.

**References**

1. Abdelbar AM, Hefny M. An efficient LP-based admissible heuristic for cost-based abduction. *JETAI*. 2005;17(3):297-303.

2. Akhmatova E, Mollá D. Recognizing textual entailment via atomic propositions. In: Proceedings of the PASCAL Challenges Workshop on Recognizing Textual Entailment; 2005. p. 385–403.

3. Berant J, Dagan I, Goldberger J. Global learning of typed entailment rules. *ACL*. 2011:610-619.

4. Blythe J, Hobbs JR, Domingos P, Kate RJ, Mooney RJ. Implementing weighted abduction in Markov logic. *Proceedings of IWCS*. 2011:55-64.

5. Bos J. Wide-coverage semantic analysis with Boxer. In: Bos J, Delmonte R, eds. *Proceedings of STEP. Research in Computational Semantics*. College Publications; 2008:277-286.

6. Bos J, Markert K. Recognizing textual entailment with logical inference. *Proceedings of EMNLP*. 2005:628-635.

7. Burchardt A, Pennacchiotti M, Thater S, Pinkal M. Assessing the impact of frame semantics on textual entailment. *Nat Lang Eng*. 2009;15(4):527-550.

8. Chalupsky H, MacGregor RM, Russ T. PowerLoom manual. Technical Report. University of Southern California; 2006. *<www.isi.edu/isd/LOOM/PowerLoom/documentation/manual/manual.pdf>*.

9. Charniak E, Goldman RP. Plan recognition in stories and in life. *Proceedings of the Fifth Annual Conference on Uncertainty in Artificial Intelligence*. 1989:343-352.

10. Charniak E, Goldman RP. Plan recognition in stories and in life. *Proceedings of the 11th International Joint Conference on Artificial Intelligence*. 1989:1074-1079.

11. Charniak E, Goldman RP. A probabilistic model of plan recognition. *Proceedings of AAAI*. 1991:160-165.

12. Charniak E, Shimony SE. Probabilistic semantics for cost based abduction. *Proceedings of the Eighth National Conference on Artificial Intelligence*. vol. 1. 1990:106-111.

13. Charniak E, Shimony SE. Cost-based abduction and MAP explanation. *Artif Intell J*. 1994;66(2):345-374.

14. Chivers ST, Tagliarini GA, Abdelbar AM. An evolutionary optimization approach to cost-based abduction, with comparison to PSO. *IJCNN*. 2007:2926-2930.

15. Chklovski T, Pantel P. VerbOcean: mining the web for fine-grained semantic verb relations. In: Lin D, Wu D, eds. *Proceedings of EMNLP*. 2004:33-40.

16. Clark P, Murray W, Thompson J, Harrison P, Hobbs J, Fellbaum C. On the role of lexical and world knowledge in RTE3. In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing; 2007. p. 54–9.

17. Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y. Online passive-aggressive algorithms. *J Mach Learn Res*. 2006;7:551-585.

18. Dagan I, Dolan B, Magnini B, Roth D. Recognizing textual entailment: rational, evaluation and approaches—Erratum. *Nat Lang Eng*. 2010;16(1):105.

19. Dantzig GB, Fulkerson R, Johnson SM. Solution of a large-scale traveling salesman problem. *Oper Res*. 1954;2(4):393-410.

20. Davidson D. The logical form of action sentences. In: Rescher N, ed. *The logic of decision and action*. University of Pittsburgh Press; 1967:81-120.

21. Etzioni O, Banko M, Cafarella MJ. Machine reading. *Proceedings of AAAI*. vol. 6. 2006:1517-1519.

22. Fellbaum C, ed. *WordNet: an electronic lexical database*. MIT Press; 1998.

23. Fowler A, Hauser B, Hodges D, Niles I, Novischi A, Stephan J. Applying COGEX to recognize textual entailment. In: Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment; 2005. p. 69–72.

24. Garoufi K. Towards a better understanding of applied textual entailment: annotation and evaluation of the RTE-2 dataset. Master’s thesis. Saarland University; 2007.

25. Guinn C, Shipman W, Addison E. The parallelization of membrane computers to find near optimal solutions to cost-based abduction. *GEM*. 2008:241-247.

26. Hickl A, Bensley J. A discourse commitment-based framework for recognizing textual entailment. In: Proceedings of ACL-PASCAL Workshop on Textual Entailment and Paraphrasing; 2007. p. 171–6.

27. Hobbs JR. Ontological promiscuity. *Proceedings of ACL*. 1985:61-69.

28. Hobbs JR, Stickel M, Martin P, Edwards D. Interpretation as abduction. *Artif Intell*. 1993;63:69-142.

29. Inoue N, Inui K. ILP-based reasoning for weighted abduction. In: Proceedings of AAAI Workshop on Plan, Activity and Intent Recognition; 2011. p. 25–32.

30. Inoue N, Inui K. Large-scale cost-based abduction in full-fledged first-order predicate logic with cutting plane inference. *Proceedings of the 13th European Conference on Logics in Artificial Intelligence*. 2012:281-293.

31. Inoue N, Ovchinnikova E, Inui K, Hobbs JR. Coreference resolution with ILP-based weighted abduction. *Proceedings of the 24th International Conference on Computational Linguistics*. 2012:1291-1308.

32. Inoue N, Yamamoto K, Watanabe Y, Okazaki N, Inui K. Online large-margin weight learning for first-order logic-based abduction. In: Proceedings of the 15th Information-based Induction Sciences Workshop, 2012. p. 143–50.

33. Ishizuka M, Matsuo Y. SL Method for computing a near-optimal solution using linear and non-linear programming in cost-based hypothetical reasoning. *PRCAI*. 1998:611-625.

34. Joachims T, Finley T, Yu CJ. Cutting-plane training of structural SVMs. *J Mach Learn*. 2009;77(1):27-59.

35. Kamp H, Reyle U. From discourse to logic: introduction to model-theoretic semantics of natural language, formal logic and discourse representation theory. *Studies in Linguistics and Philosophy*. Kluwer; 1993.

36. Kate RJ, Mooney RJ. Probabilistic abduction using Markov logic networks. In: Proceedings of the IJCAI-09 Workshop on Plan, Activity, and Intent Recognition; July 2009.

37. Katrenko S, Toledo A. A comparison study of lexical and syntactic overlap for textual entailment recognition. *Proceedings of BNAIC 2011*. 2011.

38. Mulkar R, Hobbs J, Hovy E. Learning from reading syntactically complex biology texts. *Proceedings of the Eighth International Symposium on Logical Formalizations of Commonsense Reasoning*. 2007.

39. Ng HT, Mooney RJ. Abductive plan recognition and diagnosis: a comprehensive empirical evaluation. *Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning*. October 1992:499-508.

40. Ovchinnikova E. *Integration of world knowledge for natural language understanding*. Atlantis Press, Springer; 2012.

41. Ovchinnikova E, Montazeri N, Alexandrov T, Hobbs JR, McCord M, Mulkar-Mehta R. Abductive reasoning with a large knowledge base for discourse processing. *Proceedings of IWCS*. 2011:225-234.

42. Ovchinnikova E, Vieu L, Oltramari A, Borgo S, Alexandrov T. Data-driven and ontological analysis of FrameNet for natural language reasoning. *Proceedings of LREC. European Language Resources Association*. 2010:3157-3164.

43. Penas A, Hovy E. Filling knowledge gaps in text for machine reading. *Proceedings of COLING: Posters*. 2010:979-987.

44. Poole D. Logic programming, abduction and probability: a top-down anytime algorithm for estimating prior and posterior probabilities. *New Gen Comput*. July 1993;11(3–4):377-400.

45. Prendinger H, Ishizuka M. First-order diagnosis by propositional reasoning: a representation based approach. In: Proceedings of the 10th International Workshop on Principles of Diagnosis; 1999. p. 220–5.

46. Raghavan S, Mooney RJ. Abductive plan recognition by extending Bayesian logic programs. *Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases*. 2011:629-644.

47. Riabinin Y. Recognizing textual entailment using logical inference: a survey of the PASCAL RTE Challenge, 2008. *http://www.yaroslavriabinin.com/CSC2519_Paper.pdf*.

48. Richardson M, Domingos P. Markov logic networks. *Mach Learn*. 2006:107-136.

49. Riedel S. Improving the accuracy and efficiency of MAP inference for Markov logic. *UAI*. 2008:468-475.

50. Riedel S, Clarke J. Incremental integer linear programming for non-projective dependency parsing. *EMNLP*. 2006:129-137.

51. Ruppenhofer J, Ellsworth M, Petruck M, Johnson C, Scheffczyk J. FrameNet II: extended theory and practice. Technical Report, Berkeley, CA. 2010.

52. Santos E. Polynomial solvability of cost-based abduction. *Artif Intell*. 1996;86:157-170.

53. Shen D, Lapata M. Using semantic roles to improve question answering. *Proceedings of EMNLP-CoNLL*. 2007:12-21.

54. Singla P, Mooney R, Domingos P. Abductive Markov logic for plan recognition. *Proceedings of AAAI-11*. 2011:1069-1075.

55. Tatu M, Moldovan D. A semantic approach to recognizing textual entailment. *Proceedings of HLT and EMNLP*. 2005:371-378.

56. Vapnik VN. *The nature of statistical learning theory*. Springer; 1995.

* ^{1}* It is possible to develop smarter heuristics. For example, one can estimate in advance whether an axiom application can reduce the hypothesis cost or not to prune the search. We will address this problem in future work.

^{2}*www.rutumulkar.com/*

* ^{3}* For example, we use .

* ^{4}* For both systems, the results depend on how we initialize the weight values because the learning problem is a nonconvex optimization problem. In future work, we plan to investigate how the strategy of weight initialization affects the performance.

^{5}*libSVM*, *www.csie.ntu.edu.tw/**∼**cjlin/libsvm/*

* ^{6}* X/Y means that we outperform *X* systems out of *Y*.

* ^{7}* The accuracy results in Ovchinnikova et al. *[41]* are rounded to the nearest integer.