
Relevance Ranking for Vertical Search Engines, FIRST EDITION (2014)

Chapter 7. Multi-Aspect Relevance Ranking

Abstract

In this chapter, we describe learning to rank with multi-aspect relevance for vertical searches. Many vertical searches, such as local search, focus on specific domains. The meaning of relevance in these verticals is domain-specific and usually consists of multiple well-defined aspects. For example, in local search, text matching and distance are two important aspects in assessing relevance. Usually the overall relevance between a query and a document is a tradeoff among multiple aspect relevancies. Given a single vertical, such a tradeoff can vary for different types of queries or in different contexts. In this chapter, we explore these vertical-specific aspects in the learning-to-rank setting. We propose a novel formulation in which the relevance between a query and a document is assessed with respect to each aspect, forming the multi-aspect relevance. To compute a ranking function, we study two types of learning-based approaches to estimate the tradeoff among these aspect relevancies: a label aggregation method and a model aggregation method. Since there are only a few aspects, a minimal amount of training data is needed to learn the tradeoff. We conduct both offline and online bucket-test experiments on a local vertical search engine, and the experimental results show that our proposed multi-aspect relevance formulation is very promising. The two types of aggregation methods perform more effectively than a set of baseline methods, including a conventional learning-to-rank method.

Keywords

Vertical search

multi-aspect relevance

mapping function

label aggregation

model aggregation

Introduction

Along with the popularity of search engines, end users’ information needs continue to be refined. One of the emerging trends is vertical search intent. For example, a user might want to find a restaurant near her current location; another user might want to follow the recent developments of a breaking news event such as an earthquake in Japan. Some recent studies show that at least 20% of Web queries have some local intent [357]. As a result, vertical searches have started attracting more and more attention. For example, many search engines provide specialized vertical searches such as local search [1,3] and real-time search [55]. Furthermore, vertical search results are often slotted into general Web search results [15,16] to provide more diversified search results. Thus, designing effective ranking functions for vertical searches has become important to improve users’ search experience.

A natural way to build a vertical search is to apply existing ranking techniques to the vertical. In the TREC conference [2], several specialized tracks, such as the blog and chemical tracks, have been introduced to build testbeds for studying retrieval tasks on these special text collections. The main focus of these tracks is on content-based relevance, and most participants extend traditional IR techniques to consider a few task-specific ranking signals. Recently, learning-to-rank approaches [187,47,407] have been studied extensively and shown to be effective at combining many useful signals into a ranking function. To adapt such a technique to a vertical, an intuitive approach is to construct a training dataset by collecting a set of queries and documents that belong to the vertical and asking human editors to give a single relevance label between a query and a document. A ranking function can thus be learned for the vertical.

However, we observe that in many verticals, the meaning of relevance is domain-specific and usually consists of multiple well-defined aspects. For example, in real-time search, content matching and freshness are both important: a stale result that matches a query perfectly is no longer interesting to end users. In local search, as shown in Figure 7.1, to find a “restaurant” in “Sunnyvale CA,” a satisfactory search result should not only be about a dining place (matching aspect); the place should also be close to the search location (distance aspect) and have good user reviews (reputation aspect). Usually, no single result is the best in all these aspects. Ranking the results based on a single aspect such as matching is not optimal. A desired ranked list of results needs to trade off these different aspects appropriately. In such a vertical, blindly applying conventional learning-to-rank approaches while ignoring this vertical-specific domain knowledge may not be a cost-effective way to collect training data, for the following reasons: (1) Since there are several pertinent aspects in a vertical, human editors naturally need to consider and trade off the relevance from different aspects before making the overall relevance judgment. Thus, assessing aspect relevance is a necessary step. (2) Trading off multiple aspects is not trivial, since such a tradeoff can vary for different queries or in different contexts. For example, in local search, the reputation aspect can be more important for “restaurants” queries, but the distance aspect can be more important for “banking centers” queries. (3) For different verticals, different aspects are involved, and the tradeoff among aspects is vertical-dependent. Collecting training data with overall relevance for a new vertical requires human editors to learn how to appropriately trade off different aspects.


FIGURE 7.1 An example of a local search result.

In this chapter, we propose a novel formulation to leverage vertical-specific knowledge in a cost-effective way. Instead of asking editors to provide the overall relevance directly, our key idea is to collect aspect relevance labels and learn the tradeoff among them in a quantitative way. Specifically, in our formulation, the relevance between a query and a document is judged only with respect to individual aspects. Intuitively, the aspect relevance is more finely specified; thus it is less dependent on other contexts and can presumably be judged by editors with less effort. To learn a ranking function using our multi-aspect relevance formulation, we study two types of learning-based approaches to trade off these aspect relevancies: a label aggregation approach and a model aggregation approach.

• In the label aggregation approach, we first learn an aggregation function that predicts the overall relevance given the aspect relevance labels. After we get the overall relevance labels, conventional learning-to-rank algorithms can be applied to obtain a ranking function.

• In the model aggregation approach, we first train several aspect-specific ranking models based on the aspect relevance labels. The model aggregation function is then learned to combine the output of these aspect ranking models to generate the final overall relevance output.

Compared with the first approach, the second one has the advantage that we can use different ranking models for different aspects. Since there are only a few aspects in a vertical, a minimal amount of data is needed to learn either the label aggregation function or the model aggregation function. Furthermore, in our aggregation approaches, a mapping function that converts editorial labels (such as “exact match,” “plausible match,” and “no match”) to numerical values for each aspect is necessary. Such a function can be defined heuristically, but we show that we can automatically learn such a function based on training data. Thus our proposed methods are completely free from heuristics.

A main advantage of our learning-based approaches is that they are vertical-independent and can be easily applied to different vertical searches. Specifically, in this chapter, we focus on learning a generic ranking function for each of the two types of queries in local search: business name queries (e.g., “walmart”) and category queries (e.g., “restaurant”). We use a training dataset with relative preferences to learn the aggregation functions and study several variants on a large dataset of local search. The experimental results show that our proposed methods for the multi-aspect relevance formulation are quite promising. The two types of aggregation methods perform more effectively than a set of baseline methods, including a conventional learning-to-rank method.

The rest of the chapter is organized as follows: In Section 7.1, we introduce related work. We define our problem in Section 7.2 and describe our methods to aggregate aspect relevancies in Section 7.3. We present our experimental results in Section 7.4 and conclude our chapter in Section 7.5.

7.1 Related Work

Modeling relevance is the central topic in the information retrieval community, and most of the past work focuses on overall relevance. In particular, almost all the evaluation methodology is based on overall relevance. For example, in the TREC conference [2], benchmark datasets are labeled by human editors with overall relevance. The relevance labels can be either binary or graded [178]. In the past, many models were proposed to capture overall relevance [301,293,281,401,47]. Most of the existing work treats overall relevance as a unit and has not studied it at a finer granularity.

Vertical searches have become popular recently. For example, in the TREC conference, there are multiple tracks, some of which, such as the blog track and the legal track, focus on articles from specific domains. Specific characteristics have been explored for vertical search ranking mechanisms, and most participants in the TREC conference designed task-specific ranking features. For example, Elsas et al. [114] went beyond single blog articles and studied how to rank blog feeds. Das Sarma et al. [86] advocated a comparison-based ranking scheme to rank items in Twitter-like forums. Yi et al. [388] tried to automatically discover implicit geographic local search intents of queries. Lu et al. [239] proposed using geographic information to personalize Web search results. On top of different verticals, past work such as [15,16] studied how to select appropriate verticals by predicting the vertical intents of different queries based on multiple sources of evidence. Our work focuses on learning-to-rank approaches for individual verticals, and our multi-aspect relevance formulation is novel for vertical searches.

Our work is related to multifaceted search [386,404]. The goal of multifaceted search is to use the facet metadata of a domain to help users narrow their search results along different dimensions. In a recent TREC blog track [249], a special “faceted blog distillation” task was initiated; its goal is to find results relevant to a single facet of a query in the blog collection. This task is tied to multifaceted search in that it studies how to rank blog articles after a user has selected a facet (e.g., “opinionated” or “factual”) to refine the original query. Whereas facets are usually categorical and intended to help users explore search results, our aspects are closely related to the notion of relevance and intended to capture partial relevance.

We note that our model aggregation technique is closely related to rank aggregation, which includes score-based and rank-based methods such as [231,111] and model interpolation [134]. We are not contributing any new techniques to this body of work. Rather, we show that supervised score-based aggregation techniques can be used in our multi-aspect relevance formulation. Our work is also different from multilabel learning [139], the primary focus of which is to explore label correlation to improve learning accuracy.

Multi-objective optimization has been applied to both recommendation [294,5,6] and general search problems [336]. For example, in [5] and [6], “click shaping” and its personalized version are proposed for online news article recommendation to consider both article clicks and downstream utility such as time spent. These pieces of work are in the recommendation setting, whereas our focus is on vertical searches. In [336], the focus is to optimize multiple objectives in the learning-to-rank framework for Web search: it considers multiple label sources and designs optimization procedures based on the LambdaMART method. In that work, the priority of different label sources and the combination weights are predefined; thus it is not easy to handle a large number of label sources. In our work, the combination weights between different aspects are learned automatically, and our formulation is more scalable with respect to the number of aspects. Furthermore, in our experiments we compare with a rule-based baseline method that is similar to their graded measure in the sense that we predefine the aspect priority and use the labels with lower priority to break the ties of labels with higher priority.

7.2 Problem Formulation

In this section, we formally define our problem. We first describe a conventional learning-to-rank approach for vertical searches and then propose our relevance formulation.

7.2.1 Learning to Rank for Vertical Searches

Although vertical searches involve multiple aspects such as matching, reputation, and distance, we can still apply conventional learning-to-rank methods. Given a query $q$, let $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ be the training data of $n$ documents, where $\mathbf{x}_i$ is the feature vector and $y_i$ is the overall relevance label of the $i$-th document. In a ranking problem, $\{\mathbf{x}_i\}$ is given as input and a permutation $\pi$ of $\{1, \dots, n\}$ is returned as output. $\mathbf{x}_i$ is ranked higher than $\mathbf{x}_j$ if $\pi(i) < \pi(j)$, which means document $d_i$ is more relevant to $q$ than $d_j$. Typically, a ranking function $f$ is trained and applied to each $\mathbf{x}_i$, and a permutation (ranking) $\pi$ is generated by ordering the scores $f(\mathbf{x}_i)$ in descending order.
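To make this setup concrete, here is a minimal sketch of how a learned scoring function induces a ranking by sorting scores in descending order; the feature values and the linear stand-in for $f$ below are illustrative assumptions, not the chapter's actual model.

```python
import numpy as np

# Toy feature vectors x_i for n = 4 candidate documents of one query
# (made-up values; real features would include matching, distance, and
# reputation signals).
X = np.array([
    [0.9, 0.2, 0.5],
    [0.4, 0.9, 0.1],
    [0.7, 0.7, 0.8],
    [0.1, 0.3, 0.2],
])

def f(x):
    """Stand-in scoring function; in practice f is learned (e.g., with GBRank)."""
    w = np.array([0.5, 0.3, 0.2])  # illustrative weights only
    return float(w @ x)

scores = np.array([f(x) for x in X])
pi = np.argsort(-scores)  # permutation: document indices, best first
print("ranking (best first):", pi.tolist())
```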

The overall relevance label $y_i$ is a discrete label given by human editors. In this work, we follow a common five-grade labeling scheme: {Perfect, Excellent, Good, Fair, Bad}. To reduce disagreement among editors, it is common practice to draw up an editorial guideline. Any editorial guideline is essentially a set of rules that specify a condition for each grade of the overall relevance label.

In this section, we use local vertical search as an example to describe an editorial guideline. A unique characteristic of local search is that the query intent is to find some location, such as a restaurant, hotel, or business center. This also implies that users intend to use some services provided by the local businesses. Therefore, a user would prefer a location that is close and, at the same time, has a good reputation for its services. For example, to find a restaurant in local search, a satisfactory search result should not only be about a dining place (matching aspect); the place should also be close to the search location (distance aspect) and have good user reviews (reputation aspect). Overall, we have found that three aspects, i.e., matching, distance, and reputation, are the most common and important aspects for local information needs.

Figure 7.2 shows an example of an editorial guideline for local searches. We have several questions that are intended to capture the desired properties of a local search. Questions 1, 2, and 3 are meant to capture the matching, distance, and reputation aspects, respectively. Each question has a graded judgment that will be labeled by editors. Finally, an aggregation guideline specifies a rule for each grade of the overall relevance label. An aggregation guideline is important since it defines the learning targets. It is necessary for most conventional learning-to-rank tasks, especially those that involve multiple aspects of relevance. Without a good aggregation guideline, the training data will have too much noise to train a good ranking function due to disagreement among editors regarding the overall relevance.


FIGURE 7.2 An example of an editorial guideline for local search. Note that the overall relevance question is NOT needed in our aggregation methods. The overall relevance is used only for the conventional method.

For example, given the query “Bank of America, Los Angeles, CA 90018,” the result “Bank of America, 2907 Crenshaw Blvd, Los Angeles, CA” has a distance of about 1.01 miles, and the average rating from two people is 3 out of 5. In this case, the labels for these questions are Exact match, Same location, and Good rating. The overall relevance is assigned as Excellent, considering all these aspects.

Such a rule-based approach is similar to [336] in the sense that we predefine the aspect priority and use the labels with lower priority to break the ties of labels with higher priority. The drawbacks of the conventional rule-based approach are: (1) defining the right rules requires deep domain knowledge and is thus nontrivial; (2) the rules are very coarse and cannot capture the true preferences at a finer granularity; and (3) this method will not scale, since the complexity of defining rules can grow exponentially as the number of aspects increases, though it is feasible for the three aspects in our work.
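For illustration, here is a minimal sketch of one way such a rule-based scheme can be realized as lexicographic ordering, assuming the aspect priority matching > distance > reputation; the label names that do not appear in the chapter's guideline excerpts (e.g., “Nearby,” “Fair rating”) are placeholders, not the actual editorial grades.

```python
# Assumed three-grade label orders per aspect (best first). "Exact match",
# "Plausible match", "No match", "Same location", "Good rating", and
# "Bad rating" appear in the chapter's guideline excerpts; the remaining
# label names are placeholders.
MATCHING = ["Exact match", "Plausible match", "No match"]
DISTANCE = ["Same location", "Nearby", "Far"]
REPUTATION = ["Good rating", "Fair rating", "Bad rating"]

def rule_key(labels):
    """Lexicographic key: the matching label dominates; distance and then
    reputation only break ties (assumed priority order)."""
    m, d, r = labels
    return (MATCHING.index(m), DISTANCE.index(d), REPUTATION.index(r))

results = [
    ("listing A", ("Exact match", "Nearby", "Bad rating")),
    ("listing B", ("Exact match", "Same location", "Fair rating")),
    ("listing C", ("Plausible match", "Same location", "Good rating")),
]
ranked = sorted(results, key=lambda item: rule_key(item[1]))  # smaller key = better
print([name for name, _ in ranked])  # ['listing B', 'listing A', 'listing C']
```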

7.2.2 Multi-Aspect Relevance Formulation

In the conventional learning setting, the intermediate questions regarding aspects are only used to help editors reach the final overall relevance and are usually discarded when training a ranking function. Ways to leverage these aspect relevance labels effectively are not well explored. In this section, we propose a new learning-to-rank framework for multi-aspect relevance to tackle the drawbacks of the conventional rule-based overall relevance scheme.

First, we define the following concepts:

Definition 1 Aspect relevance Given a query q, a document d, and the k-th aspect, the corresponding aspect relevance is the relevance between q and d with respect to this aspect. An aspect relevance label $r_k \in L_k = \{l_{k,1}, \dots, l_{k,c_k}\}$ is used to represent the degree of the relevance, where $l_{k,i} \prec l_{k,j}$ ($l_{k,i} \succ l_{k,j}$) means the left label is less (more) relevant than the right label.

For example, in the editorial guideline shown in Figure 7.2, each intermediate question is to assess a single aspect relevance label. An aspect relevance label is independent of other aspect relevance labels.

Definition 2 Multi-aspect relevance Given a vertical that has m pertinent aspects, the multi-aspect relevance between a query and a document is an m-dimensional vector, with each entry corresponding to an aspect relevance label between the query and the document.

Each entry in a multi-aspect relevance vector corresponds to an aspect relevance label. This label can be mapped to a numerical value by a mapping function as defined here:

Definition 3 Mapping function A mapping function $h_k$ of the k-th aspect maps an aspect relevance label to a numerical value. For example, $h_{\text{matching}}(\text{Plausible match}) = 0.5$. A mapping function is consistent if $h_k(l_{k,i}) < h_k(l_{k,j})$ for $l_{k,i} \prec l_{k,j}$. We use $h = (h_1, \dots, h_m)$ as the general notation for the m aspect mapping functions.

A mapping function can be manually defined or learned. In the following, for ease of exposition, we use the notation $\mathbf{r} = (r_1, \dots, r_m)$ to represent a multi-aspect relevance vector of either labels or values unless clarified in the context.
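As a concrete (assumed) instance of such mapping functions, the sketch below assigns the evenly spaced values 1, 0.5, and 0 to three grades per aspect, as in Table 7.1; only the matching grades are taken from the guideline, and the other label names are placeholders.

```python
# A consistent mapping function h_k per aspect: better labels get strictly
# higher values. The values 1 / 0.5 / 0 follow Table 7.1; label names other
# than the matching grades are illustrative placeholders.
H = {
    "matching":   {"Exact match": 1.0, "Plausible match": 0.5, "No match": 0.0},
    "distance":   {"Same location": 1.0, "Nearby": 0.5, "Far": 0.0},
    "reputation": {"Good rating": 1.0, "Fair rating": 0.5, "Bad rating": 0.0},
}

def to_values(labels):
    """Convert a multi-aspect label vector into its numeric vector r."""
    return [H[aspect][label] for aspect, label in labels.items()]

r = to_values({"matching": "Plausible match",
               "distance": "Same location",
               "reputation": "Good rating"})
print(r)  # [0.5, 1.0, 1.0]
```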

7.2.3 Label Aggregation

Given our definitions, the basic idea is to train a function that can quantitatively aggregate the multi-aspect relevance values into an overall relevance value.

Definition 4 Label aggregation function A label aggregation function $g$ is a function that maps a multi-aspect relevance vector $\mathbf{r}$ to an absolute overall relevance value $y$, i.e., $y = g(\mathbf{r})$.

To learn an aggregation function $g$, we need training data with overall relevance signals, either absolute relevance labels or relative preferences. In this chapter, we focus on relative preferences and use $P_o = \{(\mathbf{r}_i, \mathbf{r}_j)\}$ to represent the data, where a pair $(\mathbf{r}_i, \mathbf{r}_j)$ denotes that $\mathbf{r}_i$ is preferred to $\mathbf{r}_j$. Since there are only a few aspects in a vertical, training the aggregation function needs a minimal amount of data. After learning an aggregation function, we can then apply it to the large amount of multi-aspect relevance labels and thus generate a large amount of training data with overall relevance.

In summary, we have a large dataset with ranking features $\{\mathbf{x}_i\}$ and the corresponding multi-aspect relevance vectors $\{\mathbf{r}_i\}$, together with a small set of relative preference data $P_o$. Since there is a one-to-one correspondence between $\mathbf{x}_i$ and $\mathbf{r}_i$ in our data, we use either $\mathbf{x}_i$ or $\mathbf{r}_i$ interchangeably. We have the following steps:

• Learn an aggregation function $g$ (and a mapping function $h$ if not manually defined) using $P_o$.

• Apply $g$ to $\{\mathbf{r}_i\}$ and generate the dataset $\{(\mathbf{x}_i, g(\mathbf{r}_i))\}$.

• Train a ranking function $f$ using $\{(\mathbf{x}_i, g(\mathbf{r}_i))\}$ based on a conventional learning-to-rank method (see the sketch below).
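The following minimal sketch walks through these three steps end to end; the feature values, the already-learned weight vector standing in for $g$, and the use of a gradient-boosted regressor in place of the chapter's GBRank are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# A large dataset: ranking features X and mapped multi-aspect relevance
# vectors R (matching, distance, reputation), all synthetic here.
X = rng.random((500, 10))
R = rng.choice([0.0, 0.5, 1.0], size=(500, 3))

# Step 1: an aggregation function g learned from the small preference set P_o.
# Here we simply plug in an already-learned non-negative weight vector.
w = np.array([0.6, 0.25, 0.15])     # assumed weights for illustration

# Step 2: apply g to generate overall relevance targets for the large dataset.
y = R @ w

# Step 3: train a conventional ranker on (X, y). GBRank itself is a pairwise
# method; a gradient-boosted regressor is used here only as a stand-in.
ranker = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
ranker.fit(X, y)
print(ranker.predict(X[:3]))
```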

7.2.4 Model Aggregation

The label aggregation method converts the problem of learning from multi-aspect relevance into a conventional learning-to-rank problem, and all the ranking features related to different aspects are treated uniformly. The idea of model aggregation, in contrast, is to train an individual ranking model for each aspect; the aggregation function then combines the outputs of these aspect ranking models to generate the overall relevance score.

Definition 5 Aspect ranking function An aspect ranking function $f_k$ is a function that maps a feature vector $\mathbf{x}$ to an aspect relevance score.

Definition 6 Model aggregation function A model aggregation function $\phi$ is a function that aggregates the estimated aspect relevance scores $(f_1(\mathbf{x}), \dots, f_m(\mathbf{x}))$ into the final overall relevance score.

In summary, we have the following steps for model aggregation:

1. For each aspect $k$, learn an aspect ranking function $f_k$ based on the aspect relevance labels and the mapping function.

2. For each $\mathbf{x}_i$ in $P_o$, generate an $m$-dimensional vector $\mathbf{s}_i = (f_1(\mathbf{x}_i), \dots, f_m(\mathbf{x}_i))$.

3. Train the aggregation function $\phi$ based on the feature vectors $\mathbf{s}_i$ and the training pairs in $P_o$.

For this method, the final ranking score is computed as $\phi(f_1(\mathbf{x}), \dots, f_m(\mathbf{x}))$. The central question, then, is how to learn these aggregation functions (and the mapping function $h$). We explore different formulations in the next section.
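As an illustration of how a final score is assembled under this formulation, here is a minimal sketch with stand-in aspect ranking functions and an assumed (already-learned) linear aggregation; none of the functions or weights below come from the chapter.

```python
import numpy as np

# Stand-in aspect ranking functions f_k; in the chapter these are learned
# (e.g., GBRank for matching, linear regression for distance and reputation).
def f_matching(x):
    return float(x[:5].sum())      # illustrative only

def f_distance(x):
    return float(-x[5])            # e.g., closer listings score higher

def f_reputation(x):
    return float(x[6])

aspect_models = [f_matching, f_distance, f_reputation]
u = np.array([0.6, 0.25, 0.15])    # assumed aggregation weights learned from P_o

def final_score(x):
    """Overall score phi(x): aggregate the aspect scores s = (f_1(x), ..., f_m(x))."""
    s = np.array([f(x) for f in aspect_models])
    return float(u @ s)

x = np.random.default_rng(1).random(8)
print(final_score(x))
```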

7.3 Learning Aggregation Functions

In this section, we propose various methods to learn the aggregation functions based on pairwise preferences.

7.3.1 Learning Label Aggregation

We propose two different approaches: a linear aggregation approach and a joint learning approach.

7.3.1.1 A Linear Aggregation Method

In this section, we explore a linear model for aggregation, assuming that we have a predefined mapping function. A simple mapping function for an aspect label set $L_k = \{l_{k,1} \prec \cdots \prec l_{k,c_k}\}$ can be constructed by spacing the labels evenly in $[0, 1]$:

$$h_k(l_{k,j}) = \frac{j - 1}{c_k - 1}, \quad j = 1, \dots, c_k.$$

In our local search, such a fixed mapping function is given in Table 7.1. We have an unknown parameter vector $\mathbf{w} = (w_1, \dots, w_m)$, and the linear aggregation function takes the form $g(\mathbf{r}) = \mathbf{w}^{\top}\mathbf{r}$. We use the following loss function on the pairwise training data:

$$\mathcal{L}(\mathbf{w}) = \sum_{(\mathbf{r}_i, \mathbf{r}_j) \in P_o} \bigl(\max\{0,\ \mathbf{w}^{\top}\mathbf{r}_j - \mathbf{w}^{\top}\mathbf{r}_i + \tau\}\bigr)^2,$$

where $\mathbf{r}_i$ is preferred to $\mathbf{r}_j$ and $\tau > 0$ is a margin parameter.

Furthermore, to ensure monotonicity, we constrain $\mathbf{w}$ to be non-negative element-wise:

$$w_k \ge 0, \quad k = 1, \dots, m.$$

Table 7.1

The aspect relevance mapping function for local search.


We solve this optimization problem using a simple gradient descent approach, in a manner similar to the joint learning model described in the next section.
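A minimal sketch of this optimization under the squared-hinge pairwise loss written above (with margin $\tau$), using clipping as the projection onto the non-negativity constraint; the preference pairs and hyperparameters are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mapped multi-aspect relevance vectors r and synthetic preference pairs
# (i, j) meaning "document i is preferred to document j".
R = rng.choice([0.0, 0.5, 1.0], size=(40, 3))
prefs = [(i, j) for i in range(20) for j in range(20, 40) if R[i].sum() > R[j].sum()]

tau, lr = 0.1, 0.05                 # assumed margin and step size
w = np.ones(3) / 3                  # initial non-negative weights

for _ in range(200):
    grad = np.zeros(3)
    for i, j in prefs:
        slack = max(0.0, R[j] @ w - R[i] @ w + tau)
        grad += 2.0 * slack * (R[j] - R[i])   # gradient of the squared hinge
    w -= lr * grad / max(len(prefs), 1)
    w = np.clip(w, 0.0, None)                 # projection onto w >= 0
print("learned aspect weights:", np.round(w, 3))
```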

7.3.1.2 A Joint Learning Method

The linear method assumes that we have a predefined mapping function and learns the aggregation function directly on the numeric values of aspect relevance. However, such a mapping is defined in an ad hoc fashion. In this section we propose a joint learning model that learns the mapping function and the aggregation weights simultaneously. Without loss of generality, for each aspect we assign 0 to its lowest relevance label and 1 to its highest one, i.e., $h_k(l_{k,1}) = 0$ and $h_k(l_{k,c_k}) = 1$ for $k = 1, \dots, m$. Our joint learning method automatically determines the numeric values for the middle labels.

Formally, our goal is to learn the mapped values of all the labels in $L_1, \dots, L_m$ and the weights $\mathbf{w}$ to minimize the following loss function:

$$\mathcal{L}(\mathbf{w}, h) = \sum_{(\mathbf{l}_i, \mathbf{l}_j) \in P_o} \bigl(\max\{0,\ \mathbf{w}^{\top} h(\mathbf{l}_j) - \mathbf{w}^{\top} h(\mathbf{l}_i) + \tau\}\bigr)^2.$$

Here we use $\mathbf{l} = (l_1, \dots, l_m)$ to specifically denote an aspect label vector, and $h(\mathbf{l}) = (h_1(l_1), \dots, h_m(l_m))$ is the vector after applying the mapping function. Compared with the linear method, there are two differences: (1) $h(\mathbf{l})$ is also unknown and needs to be optimized. (2) We have the following additional consistency constraint, which ensures that a better label gets a higher mapped value:

$$h_k(l_{k,i}) \le h_k(l_{k,j}) \quad \text{for } l_{k,i} \prec l_{k,j},\ k = 1, \dots, m.$$

It is easy to verify that the spaces defined by the constraints are convex. However, such a problem is not easy to optimize due to the quadratic terms in the objective function. Since the dimensionality of the problem is not high, we propose a gradient descent approach with projection to optimize the objective function. Let $s_i = \mathbf{w}^{\top} h(\mathbf{l}_i)$ and $e_{ij} = \max\{0,\ s_j - s_i + \tau\}$ for a preference pair $(\mathbf{l}_i, \mathbf{l}_j) \in P_o$. The gradient for each variable with respect to the objective function is

$$\nabla_{\mathbf{w}} \mathcal{L} = \sum_{(\mathbf{l}_i, \mathbf{l}_j) \in P_o} 2\, e_{ij}\, \bigl(h(\mathbf{l}_j) - h(\mathbf{l}_i)\bigr), \qquad \frac{\partial \mathcal{L}}{\partial h_k(l)} = \sum_{(\mathbf{l}_i, \mathbf{l}_j) \in P_o} 2\, e_{ij}\, w_k \bigl(\mathbb{1}[\text{the $k$-th label of } \mathbf{l}_j = l] - \mathbb{1}[\text{the $k$-th label of } \mathbf{l}_i = l]\bigr).$$

We use $t$ to denote the iteration number in the gradient descent. After each iteration $t$, we project the estimated parameters onto the convex space defined by the constraints, using the norm-2 distance for the projection:

$$(\mathbf{w}^{(t)}, h^{(t)}) = \arg\min_{(\mathbf{w}, h) \in \mathcal{C}} \bigl\|(\mathbf{w}, h) - (\tilde{\mathbf{w}}^{(t)}, \tilde{h}^{(t)})\bigr\|_2^2,$$

where $(\tilde{\mathbf{w}}^{(t)}, \tilde{h}^{(t)})$ denotes the parameters after the gradient step and $\mathcal{C}$ is the feasible region defined by the non-negativity and consistency constraints.

The projection can be efficiently solved, since it is a standard quadratic programming problem [303].
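As an illustration of this projection step, the sketch below projects candidate parameters for a single three-grade aspect onto the constraint set with a general-purpose solver; the variable layout and the use of SciPy are choices made for this sketch, not the chapter's implementation (for these simple box constraints, clipping would suffice, but the quadratic-programming form generalizes to longer monotone label chains).

```python
import numpy as np
from scipy.optimize import minimize

def project(w_tilde, h_mid_tilde):
    """Euclidean projection of (w, h_mid) onto the constraint set for one
    three-grade aspect: w >= 0 element-wise, and 0 <= h_mid <= 1 because the
    lowest and highest grades are fixed at 0 and 1."""
    z0 = np.concatenate([w_tilde, [h_mid_tilde]])

    def objective(z):
        return float(np.sum((z - z0) ** 2))   # squared norm-2 distance

    cons = [
        {"type": "ineq", "fun": lambda z: z[:-1]},       # w >= 0
        {"type": "ineq", "fun": lambda z: z[-1]},        # h_mid >= 0
        {"type": "ineq", "fun": lambda z: 1.0 - z[-1]},  # h_mid <= 1
    ]
    res = minimize(objective, z0, constraints=cons)      # SLSQP handles the constraints
    return res.x[:-1], float(res.x[-1])

w_proj, h_mid_proj = project(np.array([0.7, -0.2, 0.4]), 1.3)
print(np.round(w_proj, 3), round(h_mid_proj, 3))
```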

7.3.2 Learning Model Aggregation

In this section, we propose another method for our multi-aspect relevance formulation, the model aggregation method, which is formulated in Section 7.2.4. In this method, we first learn an aspect ranking function $f_k$ for each aspect $k$ and use a supervised linear model as our aggregation function:

$$\phi(\mathbf{x}) = \mathbf{u}^{\top}\mathbf{s} = \sum_{k=1}^{m} u_k f_k(\mathbf{x}),$$

where $\mathbf{s} = (f_1(\mathbf{x}), \dots, f_m(\mathbf{x}))$ and $\mathbf{u} = (u_1, \dots, u_m)$ is the weight vector.

To learn an aspect ranking function $f_k$ using a method such as GBRank [407], we need to assign numerical values to aspect labels, i.e., the mapping function. We have proposed to automatically learn the mapping function in the joint learning method in the previous section; thus we use the obtained mapping function $h$ to convert the aspect labels.

To learn the parameter $\mathbf{u}$, we use the following loss function:

$$\mathcal{L}(\mathbf{u}) = \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in P_o} \bigl(\max\{0,\ \mathbf{u}^{\top}\mathbf{s}_j - \mathbf{u}^{\top}\mathbf{s}_i + \tau\}\bigr)^2,$$

where $\mathbf{s}_i = (f_1(\mathbf{x}_i), \dots, f_m(\mathbf{x}_i))$. This model is very similar to the linear model in the label aggregation methods. The difference is that we replace the labels with the outputs of the aspect ranking functions.

Compared to the label aggregation methods, the model aggregation method has two benefits. First, we can use a different type of model for each aspect ranking function. This is desirable in the sense that different aspects are not necessarily homogeneous. For example, the matching aspect can be complex and thus needs a ranking function with high model complexity, whereas a simple regression model may be good enough for the distance aspect. In particular, we use a GBRank model [407] for the matching aspect, which shows excellent performance in learning such a function. On the other hand, we use two linear regression models for the distance and reputation aspects, respectively. Hence, the combination of various types of models for different aspects gives great flexibility to the final ranking function.
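A minimal sketch of this heterogeneous setup, using scikit-learn's gradient-boosted regressor as a stand-in for GBRank on the matching aspect and ordinary linear regression for the distance and reputation aspects; the features and targets are synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
X_match = rng.random((n, 20))   # rich text-matching features (synthetic)
X_dist = rng.random((n, 1))     # e.g., distance to the search location (synthetic)
X_rep = rng.random((n, 2))      # e.g., average rating and review count (synthetic)

# Mapped aspect relevance targets in {0, 0.5, 1} (synthetic).
y_match = rng.choice([0.0, 0.5, 1.0], size=n)
y_dist = rng.choice([0.0, 0.5, 1.0], size=n)
y_rep = rng.choice([0.0, 0.5, 1.0], size=n)

# A complex model for the complex matching aspect; simple linear models for
# the distance and reputation aspects.
f_match = GradientBoostingRegressor(random_state=0).fit(X_match, y_match)
f_dist = LinearRegression().fit(X_dist, y_dist)
f_rep = LinearRegression().fit(X_rep, y_rep)

def aspect_scores(x_match, x_dist, x_rep):
    return np.array([f_match.predict([x_match])[0],
                     f_dist.predict([x_dist])[0],
                     f_rep.predict([x_rep])[0]])

print(np.round(aspect_scores(X_match[0], X_dist[0], X_rep[0]), 3))
```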

Second, in the model aggregation method, we can exploit preference data inferred from other sources, such as user clicks, to learn the aggregation function. Unlike in the label aggregation methods, each document in $P_o$ does not need aspect relevance labels $\mathbf{r}_i$; we only need $\mathbf{s}_i$, the outputs of the aspect ranking functions, to learn the aggregation function. This provides flexibility to quickly adapt the aggregation function to different contexts. For example, it makes it possible to provide personalized search rankings: we may collect preference data $P_u$ for a user $u$ and use it to learn a user-specific aggregation function $\phi_u$. Note that we do not need to learn aspect ranking functions for each user, since each aspect ranking should be common among users, but the tradeoff among aspects depends on personal preference. Similarly, we can provide customized search rankings for different search applications. For example, in mobile search, the distance aspect may be more important than in desktop search. We can easily build a separate ranking function for mobile search using the preference data obtained from user clicks in mobile search.

7.4 Experiments

In this section, we present experimental results to validate our approaches. The main objective of our experiments is to study the effectiveness of our multi-aspect relevance formulation and the proposed aggregation methods in local search ranking. We report both offline results and online bucket-test results in this section.

7.4.1 Datasets

The datasets we use are from a commercial local search engine, where a document is called a business listing, or listing for short. We follow an editorial guideline similar to the one in Figure 7.2 to obtain training data with only multi-aspect relevance. Specifically, each query in our data has an associated location (e.g., “Target Sunnyvale CA”). For each (query, listing) pair, we consider the three aspects discussed in Section 7.2.1, matching, distance, and reputation, and we ask editors to provide the three aspect relevance labels. Note that we do not ask editors to provide the overall relevance labels, which reduces the labeling cost. Table 7.2 shows the statistics of this dataset. In particular, we have two types of queries: category queries and business name queries. A category query such as “Chinese restaurant Los Angeles, CA 90018” is similar to an informational query and can be broadly matched by any Chinese restaurant, whereas a business name query such as “Bank of America, Los Angeles, CA 90018” is more like a navigational query and can only be matched by the corresponding bank centers. Intuitively, the relative importance of each aspect for these two types of queries can be different.

Table 7.2

Statistics of multi-aspect relevance data.


In Figure 7.3, we show the distribution of the labels for different types of queries and different aspects. We can see clear differences between them. The differences in the matching and distance aspects are due to the different densities of businesses for the two query categories: there are more matching businesses for category queries than for business name queries (e.g., typically there are more restaurants than Target stores in a city). The high percentage of the “Bad Rating” label for the reputation aspect for business name queries is due to two reasons: (1) when there is no user review, the listing is considered a “Bad Rating” in our guideline, and (2) about 50% of the business name queries are chain queries, such as “Walmart,” and users typically do not bother reviewing chain stores. Indeed, we will show that the reputation aspect is not particularly important for business name queries.


FIGURE 7.3 Distribution of aspect relevance labels. The 1, 0.5, and 0 correspond to the aspect labels in Table 7.1. (a) Category queries; (b) business name queries.

We obtain the overall relative preferences using side-by-side comparison as follows: Given a query, we randomly sample a pair of documents and put them side by side. The positions of the two documents are randomly shuffled to avoid any kind of bias. We then ask the editors to judge which one is better. The obtained overall training signals are thus relative preferences. Previous studies have shown that relative preferences are more reliable than absolute judgments in many scenarios [189,24]. This is especially important when the training data are small. However, it is expensive to obtain a large amount of such data. Fortunately, our experimental results show that only a small amount of such data is needed to obtain highly accurate aggregation functions. Table 7.3 summarizes the statistics of this dataset. We split these data into training ($P_o^{\text{train}}$) and test ($P_o^{\text{test}}$) sets to evaluate our aggregation functions.

Table 7.3

Statistics of overall relative preference datasets obtained through side-by-side comparison.


7.4.2 Ranking Algorithms

We compare the following ranking algorithms:

Rule: The traditional single overall-relevance-label scheme described in Section 7.2.1.

Ed-overall: A ranking function trained directly using the editorial pairwise preference data $P_o$.

Click: A ranking function trained using pairwise preference data induced from the click model [186].

Linear: The linear aggregation function described in Section 7.3.1.1.

Joint: The joint learning model described in Section 7.3.1.2.

ModAgg: The model aggregation method described in Section 7.3.2.

Rule, Ed-overall, and Click serve as baselines. Rule, the traditional single overall-relevance-label scheme described in Section 7.2.1, is similar to the graded measure used in [336] in the sense that the secondary labels are used to break the ties of the primary labels. Ed-overall and Click are additional baselines used to show the benefit of our multi-aspect relevance; both are learned using GBRank models [407]. The click data are easy to obtain, but they are noisy. We have 1,441,975 and 58,782 click-based training (query, listing) pairs for category and name queries, respectively. In all our label aggregation methods, the final ranking functions are learned using GBRank models [407].

7.4.3 Offline Experimental Results

We report our offline experiment results based on the datasets with editorial labels.

7.4.3.1 Evaluation Metrics

To evaluate aggregation functions, we consider two types of pair accuracy with respect to the test dataset $P_o^{\text{test}}$. (1) Label aggregation accuracy: How accurately does an aggregation function aggregate the multi-aspect relevance to generate the overall relevance? This accuracy applies only to label aggregation methods. (2) Ranking accuracy: How effective is a ranking function trained using either label or model aggregation methods?

Let $g$ be a label aggregation function and $f$ be a final ranking function trained using either label aggregation or model aggregation methods. The label aggregation accuracy of $g$ is

$$\mathrm{Acc}(g) = \frac{\bigl|\{(\mathbf{r}_i, \mathbf{r}_j) \in P_o^{\text{test}} : g(\mathbf{r}_i) > g(\mathbf{r}_j)\}\bigr|}{|P_o^{\text{test}}|}, \quad (7.1)$$

and the ranking accuracy of $f$ is

$$\mathrm{Acc}(f) = \frac{\bigl|\{(\mathbf{x}_i, \mathbf{x}_j) \in P_o^{\text{test}} : f(\mathbf{x}_i) > f(\mathbf{x}_j)\}\bigr|}{|P_o^{\text{test}}|}. \quad (7.2)$$

Note that an NDCG-like metric might seem applicable for comparing the ranking accuracy of different aggregation methods, since a label aggregation function can generate absolute overall relevance values. However, different label aggregation methods can generate different overall values for the same (query, listing) pair, so we do not have a common ground on which to compare them. Hence, we use pair accuracy as a more isolated and unbiased evaluation metric.
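For concreteness, a minimal sketch of the pair-accuracy computation in Eqs. (7.1) and (7.2); the scoring function and the preference pairs below are made-up placeholders.

```python
def pair_accuracy(score, preference_pairs):
    """Fraction of held-out preference pairs (a, b), where a is preferred to b,
    that a scoring function orders correctly. `score` can be a label
    aggregation function g applied to r-vectors (Eq. 7.1) or a ranking
    function f applied to feature vectors (Eq. 7.2)."""
    correct = sum(1 for a, b in preference_pairs if score(a) > score(b))
    return correct / len(preference_pairs)

# Toy usage with an assumed linear aggregation over (matching, distance,
# reputation) values; the test pairs are made up.
g = lambda r: 0.6 * r[0] + 0.25 * r[1] + 0.15 * r[2]
test_pairs = [((1.0, 0.5, 1.0), (0.5, 0.5, 0.0)),
              ((1.0, 1.0, 0.0), (1.0, 0.5, 0.5))]
print(pair_accuracy(g, test_pairs))  # 1.0
```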

7.4.3.2 Results on Label Aggregation Accuracy

Table 7.4 shows the comparison of various aggregation functions based on the label aggregation accuracy defined in Eq. (7.1). In this experiment, we use only the data in Table 7.3 and learn the aggregation functions for each type of query separately. Table 7.4 reports the results for category queries and business name queries, respectively. From this table, we make the following observations. (1) The Rule method performs much worse than all the learning-based methods on both types of queries. For example, for category queries, the rule-based method has about 64% accuracy, whereas all the other methods have 82.5% accuracy. This shows that the rules are too coarse to capture the desired tradeoff. (2) The learning-based methods perform comparably to one another on both query types, with a pair accuracy of 83% on category queries and 93% on business name queries. This demonstrates high consistency between the aggregated overall relevance and the true overall relevance. (3) Comparing the two types of queries, we see higher accuracy for business name queries than for category queries. This is expected, since the relevance for category queries is naturally more complex than for business name queries.

Table 7.4

Evaluation of aggregation functions on label aggregation accuracy. Linear and Joint are statistically significantly better than Rule.


7.4.3.3 Results on Ranking Accuracy

Table 7.5 shows the comparison of the ranking accuracy for both label and model aggregation methods. For a label aggregation method, we first use the training data in Table 7.3 to learn the label aggregation function and then apply it to obtain the overall relevance for the set of training queries in Table 7.2; we then test the resulting ranking function using $P_o^{\text{test}}$. The pair ranking accuracy is defined in Eq. (7.2). For the model aggregation method, we use the training data with aspect relevance labels in Table 7.2 to learn the aspect ranking functions and then use the training data in Table 7.3 to learn the model aggregation function. We then apply the obtained aspect ranking functions and model aggregation function together on the test data. From this table, we make the following observations: (1) For category queries, all the learning-based methods are significantly better than the Rule method, and the ModAgg method is slightly better than all the other methods. (2) For business name queries, however, only the ModAgg method is significantly better than the Rule method, whereas the improvements of the Linear and Joint methods over the Rule method are not statistically significant. One possible reason is that business name queries are relatively easy and all the methods have achieved high accuracy.

Table 7.5

Evaluation of aggregation functions on ranking accuracy. Statistically significant differences compared to Rule are highlighted in bold.


Overall, we can see that our learning methods are quite effective and the model aggregation method is the most stable one for both classes of queries.

7.4.3.4 Benefit of Multi-Aspect Relevance

Table 7.5 also shows the benefit of our multi-aspect relevance formulation. We can see that Ed-overall performs much worse than the ModAgg method, mainly because of the limited amount of training data. The Click method also performs worse, mainly because click data are noisy. Our method leverages the aspect relevance and thus can utilize the small amount of Ed-overall data more effectively.

7.4.3.5 Comparison of Aspect Importance

The Linear, Joint, and ModAgg methods output a weight vector over the aspects (matching, distance, reputation). These weights indicate the relative importance of the aspects in the overall relevance, and the learned weights differ between the two query types: for category queries the reputation weight exceeds the distance weight, whereas for business name queries the distance weight exceeds the reputation weight.

This result confirms our intuition that the reputation aspect is more important than the distance aspect for category queries and vice versa for business name queries. This finding shows that different types of queries work best with different aggregation functions.

In our current work, we mainly focus on two broad types of queries. These results show that there is probably much room for improvement if we can make our aggregation functions query-dependent or personalized. We leave these tasks as future work.

7.4.4 Online Experimental Results

Ranking functions can be compared in terms of pair accuracy or DCG in the offline setting, but they can also be compared in terms of how users interact with the search results in the online setting. In this section, we report our online results.

7.4.4.1 Experiment Design

In our online experiments, we conduct “bucket tests” for a certain period to compare different ranking algorithms. Buckets are created based on user cookies, and each cookie is assigned to a fixed bucket for the duration of the test period. Each bucket corresponds to a small percentage of the user population of our local search engine. In different buckets, we show the search results of different ranking algorithms. If one ranking algorithm is better than another, we would expect the user experience metric to be better.

Online experiments are expensive, so we cannot test many different algorithms; in our experiments, we were able to test only two functions during the same time period in our commercial local search engine. For two weeks, we compared Rule and Linear. For another two weeks, we compared Linear and ModAgg. To compare two functions, we use the clickthrough rate (CTR) as our user experience metric. Specifically, we use the CTR for the top $k$ positions and denote it as $\mathrm{CTR}_k$:

$$\mathrm{CTR}_k = \frac{\#\{\text{clicks on the top } k \text{ positions}\}}{\#\{\text{search result page views}\}}.$$

For confidentiality reasons, we do not report the exact CTR; instead, we report the CTR normalized by a fixed constant.
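A minimal sketch of computing such a top-$k$ CTR from hypothetical impression logs; the log format and the normalization constant are assumptions made for this sketch.

```python
def ctr_at_k(impressions, k=5, norm=1.0):
    """impressions: list of (num_listings_shown, clicked_positions) per result
    page view, where clicked_positions are 1-based ranks that received clicks.
    Returns clicks on the top-k positions per page view, divided by `norm`
    so that absolute values are not disclosed."""
    clicks = sum(sum(1 for pos in clicked if pos <= k) for _, clicked in impressions)
    views = len(impressions)
    return (clicks / views) / norm if views else 0.0

# Hypothetical log entries: (number of listings shown, clicked ranks).
logs = [(10, [1]), (10, []), (10, [2, 7]), (10, [4])]
print(ctr_at_k(logs, k=5))  # 0.75
```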

7.4.4.2 Results

In Figure 7.4, we report the daily trends of the clickthrough rate (CTR) of the ranking functions. The CTRs naturally fluctuate from day to day. We can see that Linear is consistently better than Rule and that ModAgg outperforms Linear throughout their respective test periods. The average $\mathrm{CTR}_5$ of Rule and Linear during the first test period is 0.315 and 0.325, respectively. The average $\mathrm{CTR}_5$ of Linear and ModAgg during the second test period is 0.308 and 0.314, respectively. These differences are statistically significant.


FIGURE 7.4 The clickthrough rate (CTR) comparisons based on online experiments for a commercial local search engine. CTRs are normalized so that absolute values are not disclosed. Each comparison is done for a different test period due to online experiment constraints. (a) CTR5 comparison of Rule and Linear; (b) CTR5 comparison of Linear and ModAgg.

This result is consistent with our offline experimental results and shows that our multi-aspect relevance framework outperforms the traditional single overall-relevance scheme. In addition, it demonstrates that the model aggregation method is more effective than the label aggregation methods; thus, using different ranking models for different aspects is more suitable for our multi-aspect relevance aggregation.

7.5 Conclusions and Future Work

The meaning of relevance in vertical searches is domain-specific and usually consists of multiple aspects. In this chapter, we proposed a multi-aspect relevance formulation for vertical searches in the learning-to-rank setting. In our formulation, the relevance between a query and a document is assessed with respect to each aspect, forming the multi-aspect relevance. To learn a ranking function, we studied two types of learning-based approaches, a label aggregation method and a model aggregation method, to estimate the tradeoff among these aspect relevancies; a ranking function is then learned based on the multi-aspect relevance formulation. Since there are only a few aspects, a minimal amount of training data is needed to learn the tradeoff. We studied several methods to learn aggregation functions and conducted experiments on local search engine data. Our experimental results show that our multi-aspect relevance formulation is promising, and the proposed aggregation approaches are effective at learning the tradeoff among different aspects.

Our work can be extended as follows: First, we proposed to learn the ranking functions and aggregation functions separately in this chapter. Finding a way to learn them jointly is a promising direction. Second, our methods rely on manually identified aspects, and thus ways to automatically discover the aspects given a vertical are worth studying. Third, our multi-aspect relevance formulation provides a new base to study how to learn context-sensitive or personalized ranking functions. For example, it is possible to dynamically update our aggregation function based on short-term user interaction history.