Plan, Activity, and Intent Recognition: Theory and Practice, FIRST EDITION (2014)

Part V. Applications

Chapter 14. Intent Recognition for Human-Robot Interaction

Richard Kelley, Alireza Tavakkoli, Christopher King, Amol Ambardekar, Liesl Wigand, Monica Nicolescu and Mircea Nicolescu, University of Nevada, Reno, NV, USA


For robots to operate in social environments, they must be able to recognize human intentions. In the context of social robotics, intent recognition must rely on imperfect sensors, such as depth cameras, and must operate in real time. This chapter introduces several approaches for recognizing intentions by physical robots. We show how such systems can use sensors, such as the Microsoft Kinect, as well as temporal and contextual information obtained from resources such as Wikipedia.



Keywords: Hidden Markov model; Lexical graph; Human–robot interaction

14.1 Introduction

For robots to operate in unstructured environments, they must be capable of interacting with humans. Although social interaction between robots and humans presently is extremely simple, one of the main goals of social robotics is to develop robots that can function in complicated settings such as homes, offices, and hospitals. To achieve this goal, robots must be capable of recognizing the intentions of the humans with whom they are designed to interact. This presents both opportunities and challenges to researchers developing intent-recognition techniques. The goal of this chapter is to provide an overview of these opportunities and challenges and to present a system we have developed that begins to address some of them.

In the context of human–robot interaction (HRI), the challenges and the opportunities for intent-recognition systems largely stem from the capabilities and constraints of the underlying robot’s hardware. For instance, a modern robot may have access to fairly sophisticated sensor systems, such as a camera, that are capable of measuring the three-dimensional structure of the environment directly and in real time (as in the case of the Kinect, described in Section 14.3.1). However, the same robot may have severely limited processing capabilities that render complicated reasoning difficult to perform in real time. Or, as is increasingly likely, the robot may be operating in a networked environment in which it must offload some of its intent-recognition capabilities to other processors. In such cases, network latency adds further challenges to real-time operation.

Although the (soft) real-time constraint has not been considered essential in many approaches to intent recognition, it is largely inescapable in the case of intent-recognition systems designed for HRI; the research that has been done on timing of actions in HRI suggests that actions must not only be correct but also must be timed to occur at interactionally significant points to have maximum effect [18]. This constraint leads to a number of design choices that are embodied in the systems we describe.

Our approach is ultimately based on psychological and neuroscientific evidence for a theory of mind [1], which suggests that the ease with which humans recognize the intentions of others is the result of an innate mechanism for representing, interpreting, and predicting others’ actions. The mechanism relies on taking the perspective of others [2], which allows humans to correctly infer intentions. Although this process is innate to humans, it does not take place in a vacuum. Intuitively, it would seem that our understanding of others’ intentions depends heavily on the contexts in which we find ourselves and those we observe. This intuition is supported by neuroscientific results [3], which suggest that the context of an activity plays an important and sometimes decisive role in correctly inferring underlying intentions.

Our approach to developing this ability in robots consists of two stages: activity modeling followed by intent recognition. During activity modeling, our robot performs the activities it will later be expected to understand, using data it collects to train parameters of hidden Markov models (HMMs) representing the activities. Each HMM represents a single “basic activity.” The hidden states of those HMMs correspond to small-scale goals or subparts of the activities. Most important, the visible states of a model represent the way in which parameters relevant to the activity change over time. For example, a visible state distance-to-goal may correspond to the way in which an observed agent’s distance to some activity’s goal is changing—growing larger, smaller, or staying the same.

During intent recognition, the robot observes other agents interacting and performing various activities. The robot takes the perspective of the agents it is tracking and from there calculates the changes in all parameters of interest. It uses the results of the calculations as inputs to its previously trained HMMs, inferring intentions using those models in conjunction with its prior knowledge of likely intent given the robot’s (previously determined) spatiotemporal context. For example, a robot meant to assist with cooking should be able to observe the actions of a person gathering eggs, milk, flour, and sugar in the kitchen, recognize the intention to bake a cake from this context, and assist, perhaps by finding a bowl.

After describing our system, we analyze its strengths and weaknesses, discuss how it can be extended and improved, and conclude with some general thoughts about the application of intent recognition to human–robot interaction.

14.2 Previous Work in Intent Recognition

This section briefly reviews some of the major work in intent recognition. Along the way, we highlight some of the limitations that make this work difficult to apply in real-time systems. We begin with a discussion of logical and Bayesian methods outside of robotics and then move on to methods that have found wider application in real-time systems.

14.2.1 Outside of Robotics

Outside of robotics, methods for plan recognition have, broadly speaking, fallen into two camps: logical and probabilistic. Some of the earliest approaches viewed plan recognition as an inverse problem to logical plan synthesis and were themselves logically based. The most notable of such approaches is that of Kautz [15]. As has been repeatedly observed in many areas of artificial intelligence (AI), purely logical methods suffer from a number of shortcomings; the most pronounced of them is their inability to account for the pervasive uncertainty found in most of the natural world [16]. In the field of plan recognition, one of the earliest arguments for the necessity of a probabilistic approach was provided by Charniak and Goldman [17]. In their paper, these authors observe a number of limitations of purely logical approaches and contend that Bayesian networks address them. They describe a system for natural-language understanding that successfully uses their approach.

14.2.2 In Robotics and Computer Vision

Unfortunately, both the logical and the Bayesian approaches just described are difficult to apply in human–robot interaction. There are a number of difficulties, largely dealing with resource constraints and the need to produce estimates at a rate of up to 30 hertz (Hz). We detail these issues later but here provide some discussion of methods that computer vision researchers and roboticists have used to predict intention in humans.

Previous work on intent recognition in robotics has focused on significantly simpler methods capable of working with sensor data under challenging time constraints. Much of the early work comes from the computer vision community or makes extensive use of computer vision techniques. Many of the systems that have aimed for real-time operation use fairly simple techniques (e.g., hidden Markov models).

Whenever one wants to perform statistical classification in a system that is evolving over time, HMMs may be appropriate [4]. Such models have been successfully used in problems involving speech recognition [5]. There is also some evidence that hidden Markov models may be just as useful in modeling activities and intentions. For example, HMMs have been used by robots to perform a number of manipulation tasks [6–8]. These approaches all have a crucial problem: They only allow the robot to detect that a goal has been achieved after the activity has been performed. To the extent that intent recognition is about prediction, these systems do not use HMMs in a way that facilitates the recognition of intentions. Moreover, there are reasons to believe (see Section 14.5.1) that without considering the disambiguation component of intent recognition, there will be unavoidable limitations on a system, regardless of whether it uses HMMs or any other classification approach.

The problem of recognizing intentions is important in situations where a robot must learn from or collaborate with a human. Previous work has shown that forms of simulation or perspective-taking can help robots work with people on joint tasks [10]. More generally, much of the work in learning by demonstration has either an implicit or an explicit component dealing with interpreting ambiguous motions or instructions. The work we present here differs from that body of research in that the focus is mostly on recognition in which the human is not actively trying to help the robot learn—ultimately, intent recognition and learning by demonstration differ in this respect.

The use of HMMs in real-time intent recognition (emphasizing the prediction element of the intent-recognition problem) was first suggested in Tavakkoli et al. [9]. That paper also elaborates on the connection between the HMM approach and theory of mind. However, the system proposed there has shortcomings that the present work seeks to overcome. Specifically, the authors show that in the absence of additional contextual information, a system that uses HMMs alone will have difficulty predicting intentions when two or more of the activities the system has been trained to recognize appear very similar. The model of perspective-taking that uses HMMs to encode low-level actions alone is insufficiently powerful to make predictions in a wide range of everyday situations.

14.3 Intent Recognition in Human–Robot Interaction

Performing operations with a robot places a number of constraints on an intent-recognition system. The most obvious is that the system must operate in real time, particularly if social interaction is required of the robot. However, a more crucial constraint is that the intent-recognition system must rely on the robot’s sensors and actuators to obtain information about and to manipulate the world. This section introduces some of the key sensors our systems use to enable intent recognition. We also touch on actuators that may be relevant to this goal.

14.3.1 Sensors

Although humans use most of their senses to infer the intentions of others, robots are presently limited in the sensors they can use to recognize humans’ intentions. In particular, the preferred sensor modality for robotic intent recognition is vision. Computer vision is a mature research area with well-established methods for performing tasks such as foreground–background segmentation, tracking, and filtering—all of which are important to intent recognition. In previous systems, we have used standard cameras to perform these tasks. With Microsoft’s release of the Kinect, which provides depth information, we have moved to range images and point clouds.

In traditional camera systems, depth information has to be inferred through stereo algorithms [14]. Creating dense depth maps from conventional stereo rigs is challenging and computationally expensive. More recently, projected texture stereo has been used to improve the performance of traditional stereo [12]. Along these lines, Microsoft’s Kinect provides a low-cost system that produces dense depth maps at 30 Hz.

To process the output of these systems, there are essentially two format options: range images and point clouds. A range image is similar to a standard RGB image except that the value of a pixel represents the distance from the camera to the point in the scene that would have been imaged by a standard camera. A point cloud is simply a set of points in (usually) three-dimensional (3D) space. Given a range image and some easily estimated parameters of a camera, it is straightforward to produce a point cloud in which each point represents a sample from the scene. Both formats are useful because different methods are being developed for each. Many of the techniques of classic computer vision are applicable to range images [14]. Point clouds, however, require different techniques [13] (see Figure 14.1).
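The conversion from a range image to a point cloud can be sketched as follows. This is a minimal illustration, not the chapter's implementation: it assumes a pinhole camera model, assumes each pixel stores depth along the optical axis in meters, and treats a value of 0 as a missing reading. The function name and the intrinsic parameters `fx`, `fy`, `cx`, and `cy` (focal lengths and principal point, in pixels) are hypothetical names chosen for this example.

```python
def range_image_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a range image into a list of (x, y, z) points.

    depth[v][u] is the assumed depth (meters, along the optical axis)
    at pixel column u and row v. Non-positive values are treated as
    missing readings and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no valid depth at this pixel
            # Pinhole model: pixel offset from the principal point,
            # scaled by depth and divided by the focal length.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 range image with two valid readings (values are made up).
depth = [[0.0, 2.0],
         [1.0, 0.0]]
cloud = range_image_to_point_cloud(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In practice the intrinsics would come from camera calibration, and a library such as the Point Cloud Library would typically handle this step.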


FIGURE 14.1 An example of a point cloud. Note: The image on the left of the display is captured from the RGB camera on a Kinect. The point cloud on the right is segmented so as to show only the human and the objects on the table.

Processing camera data

We assume that the observer robot is stationary and observes a human interacting with various objects in a household or office setting over time. As it operates, the system takes input from a camera and performs the following steps:

1. Estimation of the 3D scene. In the case where the input comes from a regular camera, the system estimates three-dimensional information from the sequence of images making up the video stream. When the input comes from a Kinect, the system computes a point cloud and passes it to the next step of the processing pipeline.

2. Foreground–background segmentation. The system begins by segmenting out uninteresting regions from the cloud. In our office scene, this includes the floor, walls, and the table on which the objects of interest lie. This segmentation is performed using standard tools in the Point Cloud Library [13]. The output of this stage is a set of clouds corresponding to the objects of interest in the scene.

3. Appearance-based object recognition. Offline, we train a Gaussian mixture model for each object we want recognized. At runtime we use these models to perform classification for each segmented cloud. The output of this stage is a set of object labels and positions (centroids) computed from the classifier and the cloud information in step 1. This includes the locations of the human’s head and hands.

4. Interaction modeling. Once the position of each object is known, the system tracks the human’s motion across consecutive frames to determine whether he or she is reaching for anything. If a reaching action is detected, that information is sent to the intent-recognition system for further analysis.

At the end of the pipeline, the system produces either a classification or an action. Details regarding this further analysis appear in the next sections.

It is worth noting that in practice the preceding process is somewhat fragile. The information from the cameras is often noisy and sometimes simply wrong. We then perform symbolic reasoning on this data, which may suffer from additional modeling error (as discussed by Charniak and Goldman). Essentially, each step in the pipeline may add errors to the system.

14.3.2 Actuators

Although intent recognition relies primarily on a robot’s sensors, the actuators available to a system may place some constraints on the system’s ability to learn. A robot may gather data by performing the activity itself, which may be difficult or impossible. In systems that rely on theory of mind, it may be difficult for a robot to perform perspective-taking with respect to a human if the robot is not a humanoid. For example, if a person is typing, a robot is unlikely to be able to perform that action. To address this problem on wheeled mobile robots, we have developed an alternate learning approach that is described in Section 14.4. Actuators also influence the system’s ability to respond to the perceived intentions, which allows the robot to give additional feedback to humans.

14.4 HMM-Based Intent Recognition

As mentioned previously, our system uses HMMs to model activities that consist of a number of parts that have intentional significance. Recall that a hidden Markov model consists of a set of hidden states, a set of visible states, a probability distribution that describes the probability of transitioning from one hidden state to another, and a probability distribution that describes the probability of observing a particular visible state given that the model is in a particular hidden state. To apply HMMs, one must give an interpretation to both the hidden states and the visible states of the model, as well as an interpretation for the model as a whole. In our case, each model represents a single well-defined activity. The hidden states of the model represent the intentions underlying the parts of the activity, and the visible symbols represent changes in measurable parameters that are relevant to the activity. Notice, in particular, that our visible states correspond to the activity’s dynamic properties so that our system can perform recognition as the observed agents are interacting.

As an example, consider the activity of meeting another person. To a first approximation, the act of meeting someone consists of approaching the person up to a point, interacting with the stationary person in some way (e.g., talking, exchanging something), and then parting. In our framework, we would model meeting using a single HMM. The hidden states would correspond to approach, halt, and part, since these correspond to the short-term intermediate goals of the meeting activity. When observing two people meeting, the two parameters of interest that we can use to characterize the activity are the distance and the angle between the two agents we are observing; in a meeting activity, we would expect that both the distance and the angle between two agents would decrease as the agents approach and face one another. With this in mind, we make the visible states represent changes in the distance and angle between two agents. Since each of these parameters is a real number, it can either be positive, negative, or (approximately) zero. There are then nine possibilities for a pair representing “change in distance” and “change in angle,” and each of these possibilities represents a single visible state that our system can observe.
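The nine visible states described above can be sketched as a simple discretization. This is an illustrative encoding only: the threshold values that decide when a change counts as "approximately zero", the symbol numbering, and the function names are assumptions made for this example, not details given in the chapter.

```python
def sign(delta, eps):
    """Return -1, 0, or +1 for a decreasing, steady, or increasing value."""
    if delta > eps:
        return 1
    if delta < -eps:
        return -1
    return 0


def visible_state(d_distance, d_angle, eps_distance=0.05, eps_angle=0.05):
    """Map a (change in distance, change in angle) pair to a symbol 0..8.

    The thresholds are hypothetical; they define what counts as
    "approximately zero" change between two consecutive frames.
    """
    return 3 * (sign(d_distance, eps_distance) + 1) + (sign(d_angle, eps_angle) + 1)


def encode(distances, angles):
    """Turn per-frame distance and angle measurements into visible symbols."""
    symbols = []
    for t in range(1, len(distances)):
        symbols.append(visible_state(distances[t] - distances[t - 1],
                                     angles[t] - angles[t - 1]))
    return symbols


# Two agents approach and turn toward each other, then stop (made-up data).
symbols = encode([3.0, 2.5, 2.0, 2.0], [1.0, 0.6, 0.6, 0.6])
```

With this numbering, symbol 0 means both distance and angle are decreasing, and symbol 4 means neither is changing.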

14.4.1 Training

We train our system in two ways. In situations in which the robot can perform the activity, we have it perform that activity. With a Pioneer robot, this approach makes sense for activities such as “follow an agent” or “pass by that person.” As the robot performs the activity, it records features related to its motion (e.g., speed, direction, changes in its position relative to other agents). These are then converted to discrete symbols as described in the previous section. The symbols are then used to train HMMs representing each activity.

In situations in which the robot cannot perform the activity (in our case, this included reaching for most objects), the system observes a human performing the task. The same features of the motion are recorded as in the previous training method and are used to train an HMM.

In both cases, the topologies of the HMMs and the interpretations of the hidden and visible states are determined by hand. The number of training examples generated with either method was limited due to the fact that a human had to perform the actions. In all cases that follow, we found that, with just one or two dozen performances of the activities, the system was able to train reasonably effective HMMs.

14.4.2 Recognition

During recognition, the stationary robot observes a number of individuals interacting with one another and with stationary objects. It tracks those individuals using the visual capabilities described before and takes the perspective of the agents it is observing. Based on its perspective-taking and its prior understanding of the activities it has been trained to understand, the robot infers the intention of each agent in the scene. It does this using maximum-likelihood estimation, calculating the most probable intention given the observation sequence it has recorded up to the current time for each pair of interacting agents.
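One way to realize this kind of maximum-likelihood selection is to score the observations seen so far under each trained HMM with the forward algorithm and pick the best-scoring model. The sketch below is a generic illustration of that technique, not the chapter's code: the dictionary layout, the function names, the two toy models, and all probabilities are invented for the example. Visible symbols are 0 (closer), 1 (same distance), and 2 (farther).

```python
import math


def log_likelihood(model, observations):
    """Scaled forward algorithm: log P(observations | model)."""
    start, trans, emit = model["start"], model["trans"], model["emit"]
    n = len(start)
    log_prob = 0.0
    alpha = None
    for obs in observations:
        if alpha is None:
            # Initialization: start distribution times emission probability.
            alpha = [start[i] * emit[i][obs] for i in range(n)]
        else:
            # Induction: propagate through transitions, then weight by emission.
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_prob += math.log(scale)
    return log_prob


def most_likely_activity(models, observations):
    """Pick the model name with the highest likelihood for the prefix seen so far."""
    return max(models, key=lambda name: log_likelihood(models[name], observations))


# Hypothetical two-state models; every number here is made up.
MODELS = {
    "meet": {"start": [1.0, 0.0],
             "trans": [[0.7, 0.3], [0.0, 1.0]],
             "emit": [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]},     # approach, halt
    "pass_by": {"start": [1.0, 0.0],
                "trans": [[0.7, 0.3], [0.0, 1.0]],
                "emit": [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]},  # approach, depart
}

best = most_likely_activity(MODELS, [0, 0, 1, 1])
```

Because the score is computed on a prefix of the observation sequence, it can be updated at every frame rather than only after the activity ends. Note that the two toy models are indistinguishable while the agents are still approaching, which is the kind of ambiguity that motivates adding context.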

14.5 Contextual Modeling and Intent Recognition

In this section we argue for the addition of contextual information to assist in the prediction of intent. We start by exploring the distinction between activity recognition and intent recognition.

14.5.1 Activity Recognition and Intent Recognition

Although some researchers consider the problems of activity recognition and intent recognition to be essentially the same, a common claim is that intent recognition differs from activity recognition in that intent recognition has a predictive component. That is, by determining an agent’s intentions, we are in effect making a judgment about what we believe are the likely actions of the agent in the immediate or near future; whereas activity recognition is recognizing what is happening now. Emphasizing the predictive component of intent recognition is important, but it may not reveal all the significant facets of the problem. For further discussion of intent-versus-activity recognition, see Yiannis [19].

In contrast to the more common view of intent recognition in the computer vision community, which ignores intent or considers it equivalent to action, we contend that disambiguation of activities based on underlying intentions is an essential task that any completely functional intent-recognition system must be capable of performing. For example, if a person’s activity is reading a book, her or his intention may be homework or entertainment. In emphasizing the disambiguation component of an intent-recognition system, we recognize that there are some pairs of actions that may appear identical in all respects except for their underlying intentions.

For an example of intent recognition as disambiguation, consider an agent playing chess. When the agent reaches for a chess piece, we can observe that activity and ascribe to the agent any number of possible intentions. Before the game, an agent reaching for a chess piece may be putting the piece into its initial position; during the game, the agent may be making a move using that piece; and after the game, the agent may be cleaning up by putting the piece away. In each of these cases, it is entirely possible (if not likely) that the activity of reaching for the piece will appear identical to the other cases. It is only the intentional component of each action that distinguishes it from the others. Moreover, this component is determined by the context of the agent’s activity: before, during, or after the game. Notice that we need to infer the agent’s intention in this example even when we are not interested in making any predictions. Disambiguation in such circumstances is essential to even a basic understanding of the agent’s actions.
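The chess example can be made concrete with a small numerical sketch. It assumes, purely for illustration, that the observed reach is equally likely under all three intentions and that context enters as a prior combined with the likelihood by Bayes' rule. The intention names, context names, function name, and all probabilities are hypothetical; this is not necessarily how the system described here combines the two sources of information.

```python
def posterior_over_intentions(likelihoods, context_prior):
    """Combine activity likelihoods with a context-dependent prior (Bayes' rule)."""
    scores = {intent: likelihoods[intent] * context_prior.get(intent, 0.0)
              for intent in likelihoods}
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no intention is possible under this context")
    return {intent: score / total for intent, score in scores.items()}


# The reach looks the same whatever the intention (made-up likelihoods).
reach_likelihood = {"set_up_board": 0.2, "make_move": 0.2, "clean_up": 0.2}

# Hypothetical priors for each temporal context.
priors = {
    "before_game": {"set_up_board": 0.8, "make_move": 0.1, "clean_up": 0.1},
    "during_game": {"set_up_board": 0.1, "make_move": 0.8, "clean_up": 0.1},
    "after_game": {"set_up_board": 0.1, "make_move": 0.1, "clean_up": 0.8},
}

posterior = posterior_over_intentions(reach_likelihood, priors["before_game"])
best_intent = max(posterior, key=posterior.get)
```

When the activity likelihoods are identical, the posterior reduces to the context prior, so the context alone decides among the candidate intentions.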

14.5.2 Local and Global Intentions

In our work, we distinguish between two kinds of intentions, which we call local and global intentions. Local intentions exist on smaller time scales and may correspond to the individual parts of a complex activity. For example, if two agents are performing a “meeting” activity, they may approach one another, stop for some length of time, and then part ways. Each of these three components would correspond to a different local intention. In our approach, the local intentions are modeled using the hidden states of our HMMs, although, of course, there will be other ways to achieve the same result. As this modeling choice implies, though, local intentions are closely tied to particular activities, and it may not even be sensible to discuss these sorts of intentions outside of a given activity or set of activities.

In contrast, global intentions exist on larger time scales and correspond to complex activities in a particular context. In our chess example, “setting up the board,” “making a move,” and “cleaning up” would all correspond to possible global intentions of the system.

This distinction between local and global intentions may be most useful during the activity-modeling stage, if the activities being considered are sufficiently simple that they lack the internal structure that would lead to several local intentions; it may be that HMMs are not necessary for the system, so a simpler purely Bayesian approach could be used instead. In this way, the distinction between local and global intentions can be used to develop a sense of the complexity of the activities being modeled in a given application.

14.5.3 Lexical Directed Graphs

Given that context is sometimes the decisive factor enabling human intent recognition [3], it makes sense to create robot architectures that use contextual information to improve performance. While there are many sources of contextual information that may be useful to infer intentions, we chose to focus primarily on the information provided by object affordances—features that indicate the actions that one can perform with an object (e.g., a handle that allows grabbing). The problem, once this choice is made, is one of training and representation: given that we wish the system to infer intentions from contextual information provided by knowledge of object affordances, how do we learn and represent those affordances? We would like, for each object our system may encounter, to build a representation that contains the likelihood of all actions that can be performed on that object.

Although there are many possible approaches to constructing such a representation, we chose to use a representation based heavily on a graph–theoretic approach to natural language—in particular, English. Given an object, our goal is to connect it to related actions and objects. To this end, we collect sentences that contain that object and include the words in a graph that describes the relationships between them. Specifically, we construct a directed graph in which the vertices are words and a labeled, weighted edge exists between two vertices if and only if the words corresponding to the vertices exist in some kind of grammatical relationship. The label indicates the nature of the relationship, and the edge weight is proportional to the frequency with which the pair of words exists in that particular relationship.

This is a lexical directed graph, or digraph. For example, we may have the vertices drink and water, along with an edge from drink to water labeled as a direct-object relation with weight 4, indicating that the word “water” appears as a direct object of the verb “drink” four times in the experience of the system. From this graph, we compute probabilities that provide the necessary context to interpret an activity. It is likely that spoken and written natural language is not enough (on its own) to create reasonable priors for activity and intent recognition. However, we suggest that for a wide range of problems, natural language can provide information that improves prediction over systems that do not use contextual information at all.

Using language for context

The use of a linguistic approach is well motivated by human experience. Natural language is a highly effective vehicle for expressing facts about the world, including object affordances. Moreover, it is often the case that such affordances can be easily inferred directly from grammatical relationships, as in the preceding example.

From a computational perspective, we would prefer time- and space-efficient models, both to build and to use. If the graph we construct to represent our affordances is sufficiently sparse, then it should be space efficient. As we discuss in the following, the graph we use has a number of edges that is linear in the number of vertices, which is in turn linear in the number of sentences the system “reads.” We thus attain space efficiency. Moreover, we can efficiently access the neighbors of any vertex using standard graph algorithms.

In practical terms, the wide availability of texts that discuss or describe human activities and object affordances means that an approach to modeling affordances based on language can scale well beyond a system that uses another means for acquiring affordance models. For example, the use of online encyclopedias to create word graphs provides us with an immense breadth of concepts to connect to an object, as well as decent relationship estimates. The act of “reading” about the world can, with the right model, replace direct experience for the robot in many situations.

Note that this discussion makes an important assumption that, although convenient, may not be accurate in all situations. Namely, we assume that for any given action–object pair, the likelihood of the edge representing that pair in the graph is at least approximately equal to the likelihood that the action takes place in the world. Or in other words, we assume that linguistic frequency sufficiently approximates action frequency. Such an assumption is intuitively reasonable. We are more likely to read a book than we are to throw a book; as it happens, this fact is represented in our graph. However, depending on the source of the text, the frequencies may be skewed. A news website might suggest extreme situations (e.g., accidents and disasters), while a blog might focus on more mundane events. We are currently exploring the extent to which the text occurrence assumption is valid and may be safely relied on; at this point, though, it appears that the assumption is valid for a wide enough range of situations to allow for practical use in the field.

Dependency parsing and graph representation

To obtain our pairwise relationships between words, we use the Stanford labeled dependency parser. The parser takes as input a sentence and produces the set of all pairs of words that are grammatically related in the sentence along with a label for each pair, as in the previous “water” example.

Using the parser, we construct a graph, G = (V, E), where E is the set of all labeled pairs of words returned by the parser for all sentences and each edge is given an integer weight equal to the number of times the edge appears in the text parsed by the system. V then consists of the words that appear in the corpus processed.

Graph construction and complexity

Given a labeled dependency parser and a set of documents, graph construction is straightforward. Briefly, the steps are:

1. Tokenize each document into sentences.

2. For each sentence, build the dependency parse of the sentence.

3. Add each edge of the resulting parse to the graph.

Each of these steps can be performed automatically with reasonably good results using well-known language-processing algorithms. The end result is the previously described graph that the system stores for later use.
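The three construction steps can be sketched as below. This is an illustrative outline rather than the system's code. The sentence splitter is a naive regular expression standing in for a real tokenizer, and `parse` is a hypothetical callable that returns (head, dependent, label) triples for a sentence; here it is backed by a hand-written lookup table instead of an actual dependency parser.

```python
import re
from collections import Counter


def split_sentences(document):
    """Step 1: naive sentence splitting on '.', '!' and '?' (a stand-in)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def build_lexical_digraph(documents, parse):
    """Steps 2 and 3: parse each sentence and add its labeled edges.

    Returns (vertices, edges), where edges maps (head, dependent, label)
    to the number of times that labeled pair was seen.
    """
    edges = Counter()
    for document in documents:
        for sentence in split_sentences(document):
            for head, dependent, label in parse(sentence):
                edges[(head, dependent, label)] += 1
    vertices = {word for head, dependent, _ in edges for word in (head, dependent)}
    return vertices, edges


# Hand-written parses used in place of a real parser (hypothetical data).
TOY_PARSES = {
    "I drink water.": [("drink", "I", "nsubj"), ("drink", "water", "dobj")],
    "We drink water.": [("drink", "We", "nsubj"), ("drink", "water", "dobj")],
}


def toy_parse(sentence):
    return TOY_PARSES.get(sentence, [])


vertices, edges = build_lexical_digraph(["I drink water. We drink water."], toy_parse)
```

Storing edges in a `Counter` keyed by the labeled pair keeps the weight update to a single increment per parser output.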

One of the greatest strengths of the dependency-grammar approach is its space efficiency: the output of the parser is either a tree on the words of the input sentence or it is a graph made up of a tree plus a (small) constant number of additional edges. This means that the number of edges in our graph is a linear function of the number of nodes in the graph, which (assuming a bounded number of words per sentence in our corpus) is linear in the number of sentences the system processes. In our experience, the digraphs the system has produced have had statistics confirming this analysis, as can be seen by considering the graph used in our recognition experiments.

For our corpus, we used two sources: first, the simplified-English Wikipedia, which contains many of the same articles as the standard Wikipedia except with a smaller vocabulary and simpler grammatical structure, and second, a collection of children’s stories about the objects in which we were interested [20]. In Figure 14.2, we show the number of edges in the Wikipedia graph as a function of the number of vertices at various points during the construction of the graph. The scales on both axes are identical, and the plot shows that the number of edges does depend linearly on the number of vertices.


FIGURE 14.2 The number of edges in the Wikipedia graph as a function of the number of vertices during the process of graph growth.

The final Wikipedia graph we used in our experiments contained 244,267 vertices and 2,074,578 edges. The children’s story graph is much smaller, being built from just a few hundred sentences: it consists of 1754 vertices and 3873 edges. This graph was built to fill in gaps in the information contained in the Wikipedia graph. The stories were selected from what could be called “children’s nonfiction”; the books all contained descriptions and pictures of the world and were chosen to cover the kinds of situations in which we trained our system to work. To create the final graph, we merged the two graphs by taking the union of their vertex and edge sets and then adding the weights of any edges that appeared in both graphs.

Induced subgraphs and lexical “noise”

In some instances, our corpus may contain strings of characters that do not correspond to words in English. This is especially a problem if the system automatically crawls a resource, such as the worldwide Web, to find its sentences. We use the term lexical noise to refer to tokens that have vertices in our graph but are not in fact words in English. The extent to which such noise is a problem depends in large part on how carefully the documents are acquired, cleaned up, and tokenized into sentences before being given to the parser. Given the highly variable quality of many sources (e.g., blogs and other webpages), and the imperfect state of the art in sentence tokenization, it is necessary that we have a technique for removing lexical noise. Our current approach to such a problem is to work with induced subgraphs.

Suppose that we have a lexical digraph, $G = (V, E)$, and a set of words, $S \subseteq V$. We assume that we do not care about the “words” in $V \setminus S$ (in fact, they may not even be words in our target language). Then instead of working with the graph $G$, we use the graph $G' = (S, E')$, where

$$E' = \{(x, y) \in E : x \in S \text{ and } y \in S\}.$$

In addition to solving the problem of lexical noise, this approach has the benefit that it is easy to limit the system’s knowledge to a particular domain, if appropriate. For instance, we might make $S$ a set of words about cars if we know we will be using the system in a context where cars are the only objects of interest. In this manner, we can carefully control the linguistic knowledge of our system and remove a source of error that is difficult to avoid in a fully automated knowledge-acquisition process.
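As a rough illustration of the merging step described earlier and of the induced-subgraph filter, the sketch below represents each graph as a dictionary from (source, target) pairs to integer weights. The function names, the toy edges, and the noise token `xj9q` are all invented for this example.

```python
from collections import Counter


def merge_graphs(edges_a, edges_b):
    """Union of two weighted edge sets; weights of shared edges are added."""
    return Counter(edges_a) + Counter(edges_b)


def induced_subgraph(edge_weights, keep):
    """Keep only the edges whose endpoints both lie in the word set `keep`."""
    keep = set(keep)
    return {
        (x, y): w
        for (x, y), w in edge_weights.items()
        if x in keep and y in keep
    }


# Toy graphs: "xj9q" plays the role of a lexical-noise token.
wiki = {("glass", "drink"): 5, ("glass", "xj9q"): 1}
stories = {("glass", "drink"): 2, ("glass", "wash"): 3}

merged = merge_graphs(wiki, stories)
clean = induced_subgraph(merged, {"glass", "drink", "wash"})
```

Restricting the word set to a domain vocabulary (for instance, words about cars) uses exactly the same filter with a different `keep` set.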

14.5.4 Application to Intent Recognition

To use contextual information to perform intent recognition, we must decide how we want to model the relationship between intentions and contexts. This requires that we describe what intentions and contexts are and that we specify how they are related. There are at least two plausible ways to deal with the latter consideration: We could choose to make intentions “aware” of contexts, or we might make contexts “aware” of intentions.

In the first possibility, each intention knows all the contexts in which it can occur. This would imply that we know in advance all contexts that are possible in our environment. Such an assumption may or may not be appropriate, given a particular application. On the other hand, we might make contexts aware of intentions. This would require that each context know, either deterministically or probabilistically, which intentions are possible in it. The corresponding assumption is that we know in advance all the possible (or at least likely) intentions of the agents we may observe. Either of these approaches is possible and may be appropriate for a particular application. In the present work, we adopt the latter approach by making each context aware of its possible intentions. This awareness is achieved by specifying the content of intention models and context models.

An intention model consists of two parts: an activity model, which is given by a particular HMM, and a name. This is the minimal amount of information necessary to allow a robot to perform disambiguation. If necessary or desirable, intentions could be augmented with additional information that a robot could use to support interaction. As an example, we might augment an intention model to specify an action to take in response to detecting a particular sequence of hidden states from the activity model.

A context model, at a minimum, must consist of a name or other identifier to distinguish it from other possible contexts in the system, as well as some method for discriminating between intentions. This method might take the form of a set of deterministic rules, or it might be a discrete probability distribution defined over the intentions about which the context is aware. In general, a context model can contain as many or as few features as are necessary to distinguish the intentions of interest.
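One possible way to hold these two kinds of model in code is sketched below. The class and field names are assumptions made for illustration, the HMM type is deliberately left open, and the probabilities in the example are made up; only the chess-piece intentions come from the chapter.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class IntentionModel:
    """Minimal intention model: a name plus an activity model (an HMM)."""
    name: str
    activity_model: Any  # the HMM; its concrete type is left open here


@dataclass
class ContextModel:
    """Minimal context model: a name plus a distribution over intentions."""
    name: str
    intention_probs: Dict[str, float] = field(default_factory=dict)

    def possible_intentions(self) -> List[str]:
        """Names of the intentions this context considers possible."""
        return [name for name, p in self.intention_probs.items() if p > 0.0]


# Hypothetical probabilities for the chess-piece example.
after_game = ContextModel(
    "game over", {"making a move": 0.0, "cleaning up": 1.0}
)
move = IntentionModel("making a move", activity_model=None)
```

A deterministic rule set could replace the probability table without changing the interface, since `possible_intentions` only needs to know which intentions are admissible.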

For our work, we focused on two kinds of information: (1) the location of the event being observed and (2) the identities of any objects being interacted with by an agent. Context of the first kind was useful for basic experiments comparing the performance of our system against a system that uses no contextual information. These experiments did not use lexical digraphs at all; the contexts and possible intentions were determined entirely by hand. Our other source of context, object identities, relied entirely on lexical digraphs as a way to represent object affordances. Object affordances are one of the major sources of information when inferring intent.

Affordances indicate the types of actions that can be performed with a particular object and, through their relative probabilities, constrain the possible intentions a person can have when interacting with an object. For example, one can drink from, break, empty, or wash a glass; all have different probabilities. At the same time, the state of the object can further constrain the potential intentions: It is more likely that one would drink from a full glass, while for an empty, dirty glass, the most probable intention would be to wash it. We use the system described in Section 14.5.3 to extract information about object affordances.

The goal was to build a representation that contains, for each object, the likelihood of all actions that can be performed on that object. The system produces a weighted graph linking words that are connected in a dependency parse of a sentence in the corpus. The weights count the number of times each relationship appears and must be converted to probabilities. To obtain the probability of each action given an object, $O$, we look at all verbs, $V$, in relation to $O$ and compute the probability of $V$ given $O$:

$$P(V \mid O) = \frac{w(O, V)}{\sum_{V_j \in N_O} w(O, V_j)},$$

where $N_O$ consists of all verbs in the digraph that receive an arc from the $O$ node and $w(O, V)$ is the weight of the arc from $O$ to $V$, which we use as an approximation to the probability $P(O, V)$.

For objects that have different states (e.g., full vs. empty, open vs. closed), we infer the biased probabilities as follows:

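• Merge the state vertex $v_s$ and the object vertex $v_o$ to obtain a new vertex $v$.

• Update each edge weight $w(v, v_{\text{neighbor}})$, as follows:

– 0 if $v_{\text{neighbor}}$ was not adjacent to both $v_s$ and $v_o$.

– $\min\bigl(w(v_s, v_{\text{neighbor}}),\, w(v_o, v_{\text{neighbor}})\bigr)$, otherwise.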

• Normalize probabilities as in the stateless case.

• Return the probability distribution.
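The stateless and stateful cases can be sketched as follows. The arc counts are invented for illustration, the function names are hypothetical, and combining the two weights of a merged vertex by taking their minimum is an assumption of this sketch rather than something the surrounding text guarantees.

```python
def action_probabilities(weights, obj):
    """P(verb | obj) obtained by normalizing the arc weights leaving `obj`."""
    arcs = {v: w for (src, v), w in weights.items() if src == obj and w > 0}
    total = sum(arcs.values())
    return {v: w / total for v, w in arcs.items()} if total else {}


def merge_state(weights, state, obj):
    """Merge the state and object vertices into one vertex `(state, obj)`.

    A verb keeps a nonzero weight only if it was adjacent to both vertices.
    Using the minimum of the two weights is an assumption of this sketch.
    """
    merged = dict(weights)
    state_arcs = {v: w for (src, v), w in weights.items() if src == state}
    for (src, v), w in weights.items():
        if src == obj and v in state_arcs:
            merged[((state, obj), v)] = min(w, state_arcs[v])
    return merged


# Invented arc counts: (source word, verb) -> number of occurrences.
weights = {
    ("glass", "drink"): 6, ("glass", "wash"): 3, ("glass", "break"): 1,
    ("full", "drink"): 4, ("full", "empty"): 2,
}

stateless = action_probabilities(weights, "glass")
full_glass = action_probabilities(
    merge_state(weights, "full", "glass"), ("full", "glass")
)
```

With these toy counts, a full glass admits only "drink", because it is the only verb adjacent to both the state vertex and the object vertex.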

In this way, we can extract probabilities of actions for objects that are stateless as well as objects containing state.

Inference algorithm

Suppose that we have an activity model (i.e., an HMM) denoted by $w$. Let $s$ denote an intention, let $c$ denote a context, and let $v$ denote a sequence of visible states from the activity model $w$. If we are given a context and a sequence of observations, we would like to find the intention that is the most likely. Mathematically, we want to find

$$\arg\max_{s} \; p(s \mid v, c),$$

where the probability structure is determined by the activity model $w$.

To find the correct $s$, we start by observing that by Bayes’s rule we have:

$$\max_{s} \; p(s \mid v, c) = \max_{s} \; \frac{p(v \mid s, c)\, p(s \mid c)}{p(v \mid c)}.$$

We can further simplify matters by noting that the denominator is independent of our choice of $s$. Moreover, we assume that the possible observable symbols are independent of the current context once the intention is known, so that $p(v \mid s, c) = p(v \mid s)$. Based on these observations, we can write

$$\max_{s} \; p(s \mid v, c) \approx \max_{s} \; p(v \mid s)\, p(s \mid c).$$

This approximation suggests an algorithm for determining the most likely intention given a series of observations and a context. For each possible intention $s$ for which $p(s \mid c) > 0$, we compute the probability $p(v \mid s)\, p(s \mid c)$ and choose as our intention the $s$ for which this probability is greatest. The probability $p(s \mid c)$ is available, either by assumption or from our linguistic model, and if the HMM $w$ represents the activity model associated with intention $s$, then we assume that $p(v \mid s) = p(v \mid w)$. This assumption may be made in the case of location-based context for simplicity, or in the case of object affordances because we focus on simple activities (e.g., reaching), where the same HMM $w$ is used for multiple intentions $s$. Of course, a perfectly general system would have to choose an appropriate HMM dynamically given the context; we leave the task of designing such a system as future work and focus on dynamically deciding on the context to use, based on the digraph information.

Intention-based control

In robotics applications, simply determining an observed agent’s intentions may not be enough. Once a robot knows what another’s intentions are, the robot should be able to act on its knowledge to achieve a goal. With this in mind, we developed a simple method to allow a robot to dispatch a behavior based on its intent-recognition capabilities. The robot first infers the global intentions of all the agents it is tracking and, for the activity corresponding to the inferred global intention, determines the most likely local intention. If the robot determines over multiple timesteps that a certain local intention has the highest probability, it can dispatch a behavior in response to the situation it believes is taking place.

For example, consider the activity of stealing an object. The local intentions for this activity might include “approaching the object,” “picking up the object,” and “walking off with the object.” If the robot knows that in its current context the local intention “picking up the object” is not acceptable and it infers that an agent is in fact picking up the object, it can execute a behavior—for example, stopping the thief or warning another person or robot of the theft.
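The inference rule and the dispatch policy described above can be sketched together. This is a minimal illustration under stated assumptions: the HMM is a small discrete model stored as nested lists, its parameters and the context probabilities are invented, the function names are hypothetical, and the three-timestep threshold is an arbitrary choice standing in for "multiple timesteps".

```python
def forward_likelihood(hmm, observations):
    """p(v | w) for a discrete HMM, computed with the forward algorithm.

    hmm = (initial, transition, emission), each given as nested lists.
    """
    initial, transition, emission = hmm
    n = len(initial)
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n)]
    for obs in observations[1:]:
        alpha = [
            emission[j][obs] * sum(alpha[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return sum(alpha)


def most_likely_intention(observations, context_probs, hmm_for):
    """argmax over s of p(v | s) * p(s | c), skipping s with p(s | c) = 0."""
    scores = {
        s: forward_likelihood(hmm_for[s], observations) * p
        for s, p in context_probs.items()
        if p > 0
    }
    return max(scores, key=scores.get), scores


def should_dispatch(recent_intentions, target, min_steps=3):
    """True once `target` has been the winner for the last `min_steps` steps."""
    tail = recent_intentions[-min_steps:]
    return len(tail) == min_steps and all(s == target for s in tail)


# One "reaching" HMM shared by two intentions; only the context differs.
reach = ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]])
hmm_for = {"making a move": reach, "cleaning up": reach}
during_game = {"making a move": 0.9, "cleaning up": 0.1}
after_game = {"making a move": 0.0, "cleaning up": 1.0}

best_during, _ = most_likely_intention([0, 0, 1], during_game, hmm_for)
best_after, scores_after = most_likely_intention([0, 0, 1], after_game, hmm_for)
```

Because both intentions share one HMM here, the observation likelihood is identical for each and the context term alone decides the outcome, which mirrors the "one activity, many intentions" case.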

14.6 Experiments on Physical Robots

This section shows how to apply the ideas described in Sections 14.4 and 14.5 to physical robots operating in unstructured environments. We provide quantitative and qualitative evaluation of several intent recognition systems in situations where service robots could be expected to operate.

14.6.1 Setup

To validate our approach, we performed experiments in two different settings: a surveillance setting and a household setting. In the surveillance setting, we performed experiments using a Pioneer 2DX mobile robot with an onboard computer, a laser range finder, and a Sony PTZ camera. In the household setting, we performed experiments using both a Pioneer robot and a humanoid Nao robot.

Surveillance setting

We trained our Pioneer to understand three basic activities: following, in which one agent trails behind another; meeting, in which two agents approach one another directly; and passing, in which two agents move past each other without otherwise directly interacting (see Figure 14.3).


FIGURE 14.3 HMM structure for the follow activity.

We placed our trained robot in an indoor environment and had it observe the interactions of multiple human agents with each other and with multiple static objects. In our experiments, we considered both the case where the robot acts as a passive observer and the case where the robot executes an action on the basis of the intentions it infers from the agents it is watching.

We were particularly interested in the system’s performance in two cases. First we wanted to determine the performance of the system when a single activity could have different underlying intentions based on the current context. So, returning to our example in Section 14.5.1, the activity of “moving one’s hand toward a chess piece” could be interpreted as “making a move” during a game or as “cleaning up” after the game is over. This case deals directly with the problem that in some situations two apparently identical activities may in fact be very different, although the difference may lie entirely in the contextually determined intentional component of the activity.

In the second case of interest, we sought to determine the performance of the system in disambiguating two activities that were in fact different but, due to environmental conditions, appeared superficially very similar. This situation represents one of the larger stumbling blocks of systems that do not incorporate contextual awareness.

In the first set of experiments, the same footage was given to the system several times, each with a different context, to determine whether the system could use context alone to disambiguate agents’ intentions. We considered three pairs of scenarios: leaving the building on a normal day/evacuating the building, getting a drink from a vending machine/repairing a vending machine, and going to a movie during the day/going to clean the theater at night. We would expect our intent-recognition system to correctly disambiguate between each of these pairs using knowledge of its current context (see Figure 14.4).


FIGURE 14.4 Using context to infer that an agent is leaving a building under normal circumstances. The human (with identifier 0 in the image) is moving toward the door (identifier 4), and the system is 99% confident that agent 0’s intent is to exit the building. Agent 0 is not currently interacting with objects 2 or 3, so the system does not attempt to classify agent 0’s intentions with respect to those objects.

The second set of experiments was performed in a lobby; it had agents meeting each other and passing each other both with and without contextual information about which of these two activities was more likely in the context of the lobby. To the extent that meeting and passing appear to be similar, we would expect that the use of context would help to disambiguate the activities.

Finally, to test our intention-based control, we set up two scenarios. In the first scenario (the “theft” scenario), a human enters his office carrying a bag. As he enters, he sets his bag down by the entrance. Another human enters the room, takes the bag, and leaves. Our robot was set up to observe these actions and send a signal to a “patrol robot” in the hall that a theft had occurred. The patrol robot is then supposed to follow the thief for as long as possible (see Figures 14.5 and 14.6).


FIGURE 14.5 An observer robot catches an agent stealing a bag. The top left video is the observer’s viewpoint, the top right bars represent possible intentions, the bottom left bars are the robot’s inferred intentions for each agent (with corresponding probabilities), and the bottom right video is the patrol robot’s viewpoint.


FIGURE 14.6 A patrol robot, notified that a theft has occurred, sees the thief in the hallway and follows him. Note: The video shows the patrol robot’s viewpoint superimposed on a map of the building.

In the second scenario, our robot is waiting in the hall and observes a human leaving the bag in the hallway. The robot is supposed to recognize this as a suspicious activity and follow the human who dropped the bag for as long as possible.

Household setting

In the household setting, we performed two sets of experiments that further tested the system’s ability to predict intentions and perform actions based on those predictions. In the first set, we trained the Pioneer to recognize a number of household objects and activities and to disambiguate between similar activities based on contextual information. Specifically, we had the system observe three different scenarios: a homework scenario, in which a human was observed reading books and typing on a laptop; a meal scenario, in which a human was observed eating and drinking; and an emergency scenario, in which a human was observed using a fire extinguisher to put out a fire in a trash can.

In the second set of experiments, we trained a humanoid robot to observe a human eating or doing homework. The robot was programmed to predict the observed person’s intentions and offer assistance at socially appropriate moments. We used these scenarios to evaluate the performance of the lexical digraph approach.

14.6.2 Results

In both settings, our robots were able to effectively observe the agents within their fields of view and correctly infer the intentions of the agents they observed. Videos of system performance for both the Pioneer and the humanoid robots can be found at ∼rkelley/robot-videos.html.

To provide a quantitative evaluation of intent-recognition performance, we use two measures:

Accuracy rate = the ratio of the number of observation sequences for which the winning intentional state matches the ground truth to the total number of test sequences.

Correct duration = $C/T$, where $C$ is the total time during which the intentional state with the highest probability matches the ground truth and $T$ is the number of observations.
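Both measures are simple ratios, as the following sketch shows. It assumes that observations arrive at a fixed rate, so that time can be counted in observation steps; the function names and the sample sequence are invented for illustration.

```python
def accuracy_rate(winning_intentions, ground_truths):
    """Fraction of test sequences whose final winning intention is correct."""
    matches = sum(w == g for w, g in zip(winning_intentions, ground_truths))
    return matches / len(ground_truths)


def correct_duration(per_step_best, ground_truth):
    """C / T for one sequence.

    C counts the observations at which the highest-probability intention
    matches the ground truth; T is the total number of observations.
    """
    correct = sum(s == ground_truth for s in per_step_best)
    return correct / len(per_step_best)


# An invented sequence: the system hesitates at first, then settles.
per_step = ["pass", "meet", "meet", "meet"]
duration = correct_duration(per_step, "meet")
accuracy = accuracy_rate(["meet", "pass"], ["meet", "pass"])
```

The example makes the difference between the measures concrete: a sequence can end on the right answer, and so count toward the accuracy rate, while still spending part of its duration on the wrong one.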

The accuracy rate of our system is 100%: The system ultimately chose the correct intention in all the scenarios on which it was tested. In practice, this means very little. Much more interesting is the correct duration. Next, we consider the correct duration measure in more detail for each of the cases in which we were interested.

One activity, many intentions

The first six rows of Table 14.1 indicate the system’s disambiguation performance. For example, we see that in the case of the scenario Leave building, the intentions normal and evacuation are correctly inferred 96.2% and 96.4% of the time, respectively. We obtain similar results in two other scenarios where the only difference between the two activities in question is the intentional information represented by the robot’s current context. Thus, we see that the system is able to use this contextual information to correctly disambiguate intentions.

Table 14.1

Quantitative Evaluation

| Scenario (with context) | Correct duration (%) |
|---|---|
| Leave building (normal) | 96.2 |
| Leave building (evacuation) | 96.4 |
| Theater (cleanup) | |
| Theater (movie) | |
| Vending (getting drink) | |
| Vending (repair) | |
| Meet (no context) - Agent 1 | |
| Meet (no context) - Agent 2 | |
| Meet (context) - Agent 1 | |
| Meet (context) - Agent 2 | 100.0 |

Similar-looking activities

As we can see from the last four rows of Table 14.1, the system performs substantially better when using context than it does without contextual information. Because meeting and passing can, depending on the position of the observer, appear very similar, without context it may be difficult to decide what the two agents are trying to do. With the proper contextual information, though, it becomes much easier to determine the intentions of the agents in the scene.

Intention-based control

In both the scenarios we developed to test our intention-based control, our robot correctly inferred the ground-truth intention and correctly responded to the inferred intention. In the theft scenario, the robot correctly recognized the theft and reported it to the patrol robot in the hallway; that robot was able to track the thief. In the bag drop scenario, the robot correctly recognized that dropping a bag off in a hallway is a suspicious activity and was able to follow the suspicious agent through the hall. Both examples indicate that dispatching actions based on inferred intentions using context and hidden Markov models is a feasible approach (see Figure 14.5).

Lexical-digraph-based system

Pioneer robot experiments. To test the lexically informed system in the household setting, we considered three different scenarios. In the first, the robot observed a human during a meal, eating and drinking. In the second, the human was doing homework—reading a book and taking notes on a computer. In the last scenario, the robot observed a person sitting on a couch eating candy. A trash can in the scene then catches on fire, and the robot observes the human using a fire extinguisher to put the fire out (see Figure 14.6).

In the first set of experiments (homework scenario), the objects, their states, and the available activities were:

• Book (open): read, keep, copy, have, put, use, give, write, own, hold, study

• Book (closed): have, put, use, give, own, open, take

• Mouse: click, move, use

• Bottle (full): find, drink, squeeze, shake, have, put, take

• Laptop (open): boot, configure, break, take, leave

• Laptop (closed): boot, configure, break, take, leave

For the eating scenario, the objects, states, and activities were:

• Pitcher: find, drink, shake, have, throw, put, take, pour

• Glass (full): hold, break, drink

• Glass (empty): hold, break

• Plate (full): eat, think of, sell, give

• Plate (empty): throw

For the fire scenario, the objects and activities were:

• Snack: eat, think of, sell, give

• Extinguisher: keep, activate, use

In each scenario, the robot observed a human interacting with objects by performing some of the activities in the lists.

Defining a ground truth for these scenarios is slightly more difficult than for the previous ones, since the observed agent performs multiple activities and the boundaries between successive activities are not clearly defined. However, we can report that, except on the boundary between two activities, the correct duration of the system is 100%. Performance on the boundary is more variable, but it is not clear that this is an avoidable phenomenon. We are currently working on carefully ground-truthed videos to allow us to better compute the accuracy rate and the correct duration for these sorts of scenarios.

Humanoid robot experiments. To test the system performance on another robot platform, we had our humanoid, Nao, observe a human doing homework and eating. The objects, states, and activities for these scenarios were the same as in the Pioneer experiments, with one additional object in the homework scenario: We trained the system to recognize a blank piece of paper, along with the intention of writing. We did this so that the robot could offer a pen to the human after recognizing the human’s intention to write.

To demonstrate that it detects human intentions, the robot takes certain actions or speaks to the person as soon as the intention is recognized. This is based on a basic dialogue system in which, for each intention, the robot has a certain repertoire of actions or utterances it can perform. Our experiments indicate that the robot correctly detects user intentions before the human’s actions are finalized. Moreover, no delays or misidentified intentions occurred, ensuring that the robot’s responses to the human were not inappropriate for the human’s activities. Tables 14.2 and 14.3 detail the interactions between the human and the robot in these scenarios.
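A repertoire of this kind can be pictured as a lookup from recognized intentions to candidate responses. The sketch below is purely illustrative: the intention labels, response strings, and function name are hypothetical, and it is not a description of the dialogue system actually used on the Nao.

```python
import random

# Hypothetical repertoire: recognized intention -> possible robot responses.
REPERTOIRE = {
    "write": ["offer pen"],
    "eat": ["offer fork"],
}


def respond(intention, repertoire=REPERTOIRE, rng=random):
    """Pick one response for a recognized intention, or None if unknown."""
    options = repertoire.get(intention)
    return rng.choice(options) if options else None


action = respond("write")
```

Returning `None` for an unrecognized intention lets the robot stay silent rather than respond inappropriately.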

Table 14.2

Homework Scenario


Note: This table describes the interactions that take place between the human and our humanoid robot. At the end of the scenario, the robot grabs a pen and hands it to the human.

Table 14.3

Eating Scenario


Note: When the human accepts the robot’s offer of a fork, the robot hands the fork to the human. At the end of the scenario, the robot walks to the human, takes the plate from his hand, and throws it away.

14.7 Discussion

The system just described illustrates a number of issues that must be dealt with to deploy an intent-recognition system on a robot intended for human interaction. In this section, we discuss a number of general concerns for future systems that may be designed for human–robot interaction. We then go on to present some of the other methods we are exploring.

14.7.1 General Concerns

The first and foremost requirement for intent-recognition systems in HRI is speed. Often, a robot receives sensor data several times per second and must integrate that data, make a prediction, and act on that prediction within a few hundred milliseconds. Even with improvements in robot hardware, it is likely that intent-recognition systems will have to be explicitly designed around time constraints.

Just as important, future systems should consider the sources of their inputs and should plan for robustness in the face of occasionally unreliable input. It is likely that probabilistic methods will continue to do well with this issue. However, explicitly planning for such inputs will likely require new techniques.

Finally, we would like to propose the following considerations for future system designs:

Better sensing. As robots continue to be deployed in social situations, it will be increasingly important for designers to include more and more varied sensors to allow robots to predict intentions. The use of cameras (including depth cameras) should allow robots to eventually predict intentions much in the same way that humans do, but better-than-human performance could be enabled by intelligent sensor use.

Better evaluation. The preliminary system we have developed performs well on the simple tasks on which it was tested. Before this and similar systems can be deployed in real-world scenarios, further testing is required to determine their limits. In particular, there are two forms of testing that would be extremely useful: (1) testing using carefully designed ground-truthed scenarios and (2) testing involving human responses to robots with and without intent recognition enabled. Only with more testing will we be able to decide which systems are suitable for human interaction.

System integration. With robots, intent-recognition systems rarely operate in isolation. Future research will be required to determine the best way to integrate intent recognition into larger systems that can perform useful tasks for humans.

14.7.2 Additional Approaches and Future Work

In addition to the approach outlined in this chapter, we have experimented with a few alternative methods for predicting intentions. Earlier, in Figure 14.1, we showed a history-based method for predicting intentions. In that approach, we used HMMs to recognize low-level intentions that then triggered events that led to changes in a state machine—pictured below the images in the figure. Paths through the state machine were used to identify ongoing activities. Extensions to this event-based approach are continuing and planned for future work.

Additionally, we have explored alternate methods of representing and learning contextual information. In particular, we have found that contextual information about scenes can be compactly represented using sparse autoencoders [11]. In future work, we hope to both extend the limits of this approach and combine it with our other contextual models to improve the robustness of our system.

14.8 Conclusion

Understanding intentions in context is an essential human activity, and it is highly probable that it will be just as essential in any robot that must function in social domains. The approach we propose is based on perspective-taking and experience gained by the robot using its own sensory-motor capabilities. The robot carries out inference using its previous experience and its awareness of its own spatiotemporal context. We described the visual capabilities that support our robots’ intent recognition and validated our approach on a physical robot that was able to correctly determine the intentions of a number of people performing multiple activities in a variety of contexts.


1. Premack D, Woodruff G. Does the chimpanzee have a theory of mind? Behav Brain Sci. 1978;1(4):515-526.

2. Gopnick A, Moore A. Changing your views: how understanding visual perception can lead to a new theory of mind. In: Lewis C, Mitchell P, eds. Children’s early understanding of mind. Lawrence Erlbaum; 1994:157-181.

3. Iacobini M, Molnar-Szakacs I, Gallese V, Buccino G, Mazziotta J, Rizzolatti G. Grasping the intentions of others with one’s own mirror neuron system. PLoS Biol. 2005;3(3):e79.

4. Duda R, Hart P, Stork D. Pattern classification. Wiley-Interscience; 2000.

5. Rabiner LR. A tutorial on hidden-Markov models and selected applications in speech recognition. Proc IEEE. 1989;77(2).

6. Pook P, Ballard D. Recognizing teleoperating manipulations. International Conference on Robotics and Automation. 1993:578-585.

7. Hovland G, Sikka P, McCarragher B. Skill acquisition from human demonstration using a hidden Markov model. International Conference on Robotics and Automation. 1996:2706-2711.

8. Ogawara K, Takamatsu J, Kimura H, Ikeuchi K. Modeling manipulation interactions by hidden Markov models. International Conference Intelligent Robots and Systems. 2002:1096-1101.

9. Tavakkoli A, Kelley R, King C, Nicolescu M, Nicolescu M, Bebis G. A vision-based architecture for intent recognition. Proceedings of the International Symposium on Visual Computing. 2007:173-182.

10. Gray J, Breazeal C, Berlin M, Brooks A, Lieberman J. Action parsing and goal inference using self as simulator. IEEE International Workshop on Robot and Human Interactive Communication. 2005.

11. Kelley R, Wigand L, Hamilton B, Browne K, Nicolescu MN, Nicolescu M. Deep networks for predicting human intent with respect to objects. Proc HRI. 2012:171-172.

12. Konolige K. Projected texture stereo. ICRA. 2010.

13. Rusu RB, Cousins S. 3D is here: Point cloud library (pcl). International Conference on Robotics and Automation. 2011.

14. Trucco E, Verri A. Introductory techniques for 3-D computer vision. Prentice Hall PTR; 1998.

15. Kautz H. A formal theory of plan recognition. PhD thesis, Department of Computer Science, University of Rochester, 1987 [Tech. report 215].

16. Thrun S, Burgard W, Fox D. Probabilistic robotics. MIT Press; 2005.

17. Charniak E, Goldman R. A Bayesian model of plan recognition. Artif Intell. 1993;64:53-79.

18. Yamazaki A, Yamazaki K, Kuno Y, Burdelski M, Kawashima M, Kuzuoka H. Precision timing in human–robot interaction: coordination of head movement and utterance. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2008.

19. Demiris Y. Prediction of intent in robotics and multi-agent systems. Cognitive Process. 2007.

20. Wikipedia. The Free Encyclopedia; 2004.