
Relevance Ranking for Vertical Search Engines, FIRST EDITION (2014)

Chapter 5. Mobile Search Ranking

Abstract

The wide availability of Internet access on mobile devices, such as phones and personal media players, has allowed users to search and access Web information on the go. The availability of continuous, fine-grained location information on these devices has enabled mobile local search, which employs user location as a key factor in searching for local entities (e.g., a restaurant, store, gas station, or attraction), to account for a significant part of the query volume. This is also evident in the rising popularity of location-based search engines on mobile devices, such as Bing Local, Google Local, Yahoo! Local, and Yelp. The quality of any mobile local search engine is mainly determined by its ranking function, which formally specifies how local entities are retrieved and ranked in response to a user's query. Acquiring effective ranking signals and heuristics to develop such a ranking function is arguably the single most important research problem in mobile local search. This chapter first overviews the ranking signals in mobile local search (e.g., the distance to a business and its customer rating score), which have been recognized to be quite different from those of general Web search. We then present a recent data analysis that studies the behavior of mobile local search ranking signals using a large-scale query log and reveals interesting heuristics that can guide the exploitation of different signals to develop effective ranking features. Finally, we discuss several interesting future research directions.

Keywords

Local search

mobile local search

ranking signals

ranking heuristics

Introduction

The wide availability of Internet access on mobile devices, such as phones and personal media players, has allowed users to search and access Web information on the go. According to a recent report from comScore,1 as of July 2011 there were over 234 million mobile users in the United States, with nearly image percent searching on their mobile devices. The availability of continuous, fine-grained location information on these devices has enabled mobile local search, which employs user location as a key factor in searching for local entities and has come to account for a significant part of the query volume. Sohn et al.'s study [326] found that image of mobile information needs are local. This is also evident in recent reports by BIA/Kelsey2 projecting that image of all search volume will be local in nature by 2015, as well as in the rising popularity of location-based search applications on mobile devices such as Bing Local, Google Local, Yahoo! Local, and Yelp. A screenshot of Bing Local is shown in Figure 5.1.


FIGURE 5.1 Sample search result page of Bing Mobile Local.

The service quality of any mobile local search engine is mainly determined by its ranking function, which formally specifies how to retrieve and rank local entities in response to a user's query. Even though it is similar to general Web search in that both boil down to a similar problem of relevance/click prediction and result ranking, there are fundamental differences as well as challenges in developing effective ranking functions for mobile local search.

First, the ranking signals in mobile local search are quite different from those of general Web search. On one hand, Web search handles a wide range of Web objects, particularly Web pages, whereas mobile local search focuses mostly on ranking local entities and businesses (e.g., restaurants). Therefore, special domain knowledge about the ranked objects in mobile local search can be exploited to improve ranking accuracy. For instance, businesses may receive ratings and reviews from their customers thanks to Web 2.0 services, and these have been shown to be useful signals for ranking businesses [26,402]. On the other hand, local search users generally prefer businesses that are physically close to them; this is particularly critical for mobile users who are on the go and whose range of reach may be limited. For example, a user would be more likely to visit a restaurant within 1 kilometer than one within 2 kilometers to get breakfast if the two restaurants are similarly good in other respects. The distance between the result business and the user's location has been recognized as an important ranking signal in mobile local search [132,26,248]. In fact, the customer rating score, the number of reviews, and the distance are all shown explicitly to users in the search result user interface of mobile local search, as shown in Figure 5.1, and therefore play an important role in influencing the user's click decisions. Properly studying and modeling how this information affects user click behavior is arguably the key to improving ranking accuracy.

In spite of the recognition of these new ranking signals, early studies have mostly treated them as a black box, extracting very basic features (e.g., raw rating scores and distance values) without going inside the signals to study how exactly they affect the relevance or click rate of a business. For example, it is unclear how a business's click rate changes with its rating score. Does a lower rating score necessarily lead to a lower click rate? In the aforementioned restaurant example, a 1 kilometer difference in distance may lead to significantly different click rates for two restaurants, but would the same 1 kilometer difference cause a similar click rate difference for two restaurants that are farther away from the user's location, say, 10 and 11 kilometers instead of 1 and 2 kilometers away?

It is critical to understand the underlying behaviors and heuristics of a ranking signal, since such understanding guides the design of more effective ranking features; this has been demonstrated extensively in the development of retrieval functions for general Web search [290,317,119,78,247]. For example, the term frequency signal, which assigns a higher score to a document containing more occurrences of the query term, should be normalized to prevent the contribution of repeated occurrences from growing too large due to the burstiness phenomenon [290,119], and it should also be normalized by document length, since long documents tend to use the same terms repeatedly [290,317,119]. All effective retrieval models in Web search implement these heuristics [119], and previous work has shown that a retrieval function tends to work poorly if any desirable retrieval heuristic is violated [119,247].

Inspired by the successes and lessons from the development of retrieval models, Lv et al. [246] explored the ranking heuristics in mobile local search and tried to understand how exactly a ranking signal relates to the click rate or relevance of a business. They followed a data-driven methodology to study the behavior of these ranking signals using a large-scale query log; their findings reveal interesting heuristics that can guide the exploitation of different signals. For example, users often take the mean value of a signal (e.g., rating) in the business result list as a "pivot" score and tend to demonstrate different click behaviors on businesses with signal values below and above the pivot, and the click rate of a business generally decreases sublinearly with its distance to the user. Guided by these heuristics, they further proposed different transformation methods to generate more effective ranking features.

The rest of this chapter is organized as follows: We first overview the special ranking signals in mobile local search in Section 5.1. Section 5.2 presents and discusses a recent data analysis that explores the ranking heuristics behind these special signals to develop more effective ranking features. We summarize current achievements in acquiring ranking signals and heuristics and discuss several interesting future research directions in Section 5.3.

5.1 Ranking Signals

There have been several large-scale studies of mobile search query logs aimed at deciphering mobile search query patterns [193,194,71,387,69]. The goal of these studies is to provide quantitative statistics on various aspects of mobile search that help gain better insight into mobile users' information needs, but few of these efforts have contributed concrete ranking signals. More recently, the exploration of ranking signals has attracted many efforts [10,209,26,248], which have recognized that the ranking signals in mobile local search are quite different from those of general Web search. We summarize these special ranking signals in this section.

5.1.1 Distance

Local search users generally prefer businesses that are physically close to them; this issue is particularly critical for mobile users who are on the go and have a limited range of reach. For example, a user would be more likely to visit a nearby restaurant than a faraway one to get breakfast if the two restaurants are similarly good in other respects. The distance between the result business and the user's location has been recognized as an important ranking signal in mobile local search [132,209,26,248,246].

Measuring distance itself is easy: the geographic distance between a user and a local entity or business can be computed once their locations are known. However, it has been noted that the absolute distance value sometimes does not really reflect the effort required for a user to travel to the business. For example, how sensitive a user is to geographic distance depends on the type of business being ranked (e.g., users may be willing to drive 20 minutes for a furniture store but not for a coffee shop) [209,26], the search time (e.g., users may be willing to travel a longer distance for dinner than for lunch) [209], and the traffic status and attractiveness of a location (e.g., users tend to travel longer distances in Texas than in New York) [26,248].
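To make the basic computation concrete, the following minimal sketch (in Python, with hypothetical coordinates) derives the raw geographic distance between a user and a business from latitude/longitude pairs using the haversine formula; any travel-effort adjustments of the kind discussed above would be applied on top of such a raw value.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Hypothetical user and business coordinates (Seattle and Redmond).
user = (47.6062, -122.3321)
business = (47.6740, -122.1215)
print(round(haversine_km(*user, *business), 2), "km")
```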

Although distance is an effective ranking signal, Teevan et al. [344] report that users care about how close a business is to their current location in only a portion, image, of mobile local searches. Instead, mobile local searchers are often in transit (image) and want information related to their destination location (image), en route to their destination (12%), or near their destination (12%) [344]. Therefore, automatically identifying these different cases and predicting the destination location would be an interesting future direction in mobile local search.

5.1.2 Customer Reviews and Ratings

Mobile local search focuses mostly on ranking local businesses (e.g., restaurants). Therefore, special domain knowledge about the ranked objects in mobile local search can be exploited to improve ranking accuracy. For instance, businesses may receive ratings and reviews from their customers thanks to Web 2.0 services. Intuitively, the customer rating score and the number of reviews of a business significantly affect users' click behaviors in mobile local search, since users are accustomed to consulting other people's ratings and opinions about an entity to help make their own decisions [273]. Recent work has shown that this information can provide useful signals for ranking businesses [26,248,246].

Nevertheless, customer reviews and ratings are not unique to mobile local search; they have also been exploited in other search tasks. For example, in opinion retrieval [273] the goal is to locate documents (primarily blog posts) that have opinionated content; Ganesan and Zhai [133] studied using the content of customer reviews to represent an entity (e.g., a business) in entity ranking; and Zhang et al. [402] proposed different methods to aggregate the counts of thumbs-up and thumbs-down votes for rating prediction. But the roles of customer ratings and reviews are clearly different in mobile local search.

However, many businesses, especially new ones, might not have any customer ratings or reviews. For instance, we analyzed a sample of 5 million businesses that were shown or clicked by real users on a commercial search engine and found that over 4 million of them had not received any reviews. Given this level of data sparseness, properly extracting and effectively using these signals for ranking becomes challenging.

5.1.3 Personal Preference

Compared to general Web search, mobile local search is more "actionable" [248,344] in the sense that mobile users often take an action (such as visiting a restaurant) as soon as they finish a local search session. On the other hand, mobile local queries (e.g., the most popular query in our analysis is "restaurants") are often general and not well specified [71,344,30],3 and can match many businesses from different categories, not all of which interest the user. Therefore, understanding the user's personal preference is particularly important for successfully answering a mobile local query. For instance, knowing that a mobile user searching for restaurants prefers Chinese food, we can rank more Chinese restaurants near the top to avoid bothering the user with other types of restaurants.

However, in contrast to Web search, it is nontrivial to build user profiles that capture personal preference for businesses in mobile local search. On one hand, the text associated with each business is often very sparse, so it is hard to build content-based user profiles as proposed previously [312,339]. On the other hand, due to the local nature of the task, a user tends to click only nearby businesses, so it is hard to find users who live far from each other yet share similar business click patterns, making it difficult to apply collaborative filtering approaches (e.g., [331,107]) or statistical topic modeling approaches (e.g., [161,33,251]) for user profiling.

Due to these challenges, to the best of our knowledge, few efforts have exploited personalization for mobile local search except Lv et al.’s work [246], which we discuss in more detail in Section 5.2.

5.1.4 Search Context: Location, Time, and Social Factors

We need to balance multiple ranking signals to develop an effective ranking function. However, because of the actionable nature of mobile local search, the user's search context, such as location, time, and social activities, plays a particularly important role in how these signals are balanced. For instance, a mobile user in Manhattan on a Friday around 5:00 p.m. most probably is willing to travel only a short distance to visit a business because of the heavy traffic, whereas a mobile user in Texas might be willing to drive a longer distance because of easy access to the highway. In contrast, the context of a desktop user who runs similar searches from the comfort of his house right before going to sleep on a Monday night might be irrelevant, given that the search session might be for a plan in the following week.

5.1.4.1 Location

Location is one of the most important context factors in mobile local search, since users at different locations tend to make different decisions, either due to geographic properties of their regions or demographics of the area they live in. For example, a recent data analysis [248] shows that only image of the mobile local queries in the state of Texas result in clicking a business within 1 km of the user location, whereas this percentage doubles (image) when we consider the state of New York.

Lymberopoulos et al. [248] have attempted to exploit location context in mobile local search ranking. They augmented the feature space with a set of new location-aware features, such as the average traveling distance and the standard deviation of traveling distance within the ZIP code, and then applied a machine learning algorithm to build ranking models. More specifically, they evaluated three approaches: (1) training a single model to automatically leverage these features, which implicitly takes location context into account; (2) training a model for each specific location and choosing the corresponding ranking model based on the location of the query; and (3) combining these two models adaptively based on additional statistics of the location. Their experiments show that all three approaches generally work more effectively than location-independent ranking, the third one in particular, though building a different ranking model for each location might not scale well in real applications.

Location is important in mobile local search not only because people are often mobile while searching but also because they are searching for places with a geographic location. Therefore, besides the searcher's current location, mobile local ranking is also often sensitive to the destination location, i.e., the location of businesses [344,26]. One interesting observation is the phenomenon of distance asymmetry [26]. For example, it is common for users from Redmond, Washington, to drive to Seattle for dinner, but the converse does not hold. Berberich et al. [26] presented an approach that indirectly incorporates distance asymmetry into the ranking function by modeling the destination location's attractiveness in terms of its popularity and its distance to other businesses.

5.1.4.2 Time

Time plays a particularly important role in the decision process of mobile local users [209,344]. On one hand, people's behavior and the activities they want to engage in vary depending on whether it is a weekday or a weekend and whether it is evening or morning. For example, Church and Smyth [70] found that 8% of mobile queries mention time explicitly, and Teevan et al. [344] reported that 14% of mobile local searches were at least partly triggered by the time (e.g., breakfast or dinner time). On the other hand, users usually take an action immediately following their searches. For example, Teevan et al. [344] observed that image of mobile local searchers plan to visit the business on the same day as the search, including image who plan to visit as soon as possible and image as part of their current trip. Consequently, people's preferences in mobile local search are influenced by time. For example, Lane et al.'s data analysis [209] shows that when searching for restaurants, people prefer fast-food places, informal restaurants, and local coffee shops during weekday mornings, but the preference for these businesses drops significantly during weekends and weekday evenings, with the largest drop during weekend evenings.

Although time has been shown to be an effective ranking signal in [209], its performance appears to be inconclusive in two recent studies [248,246]. Thus it would be interesting to further explore how to effectively extract and model the time signal in mobile local search.

5.1.4.3 Social Factors

Social factors appear to be surprisingly important in mobile local search [82,344,286]. In Teevan et al.'s survey [344], a large majority (63%) of respondents said their most recent mobile search was discussed with someone else, suggesting that mobile local searches are often triggered by social means, such as a conversation or a group need. After finishing a search, mobile users often like to share results by simply speaking aloud or by showing their phone screen to others, as reported in [286].

To the best of our knowledge, such kinds of social factors have not been studied before in the ranking problem of mobile local search. It would be a promising direction to explore and understand the ranking issues when mobile meets social.

5.2 Ranking Heuristics

In spite of recognizing these new ranking signals, early studies have mostly treated them as a black box, extracting very basic features (e.g., raw rating scores and distance values) without studying how exactly the signals affect the relevance or click rate of a business. Although some statistics of these signals are often used as complementary features, such as the average and standard deviation of distance at the current location [248], these studies rely purely on machine learning techniques to combine all features without looking inside the signals.

On the other hand, in mobile local search, customer rating scores, numbers of reviews, and distance values are not only key back-end ranking signals; they are also important information displayed explicitly in the search result user interface (UI), as shown in Figure 5.1. Although the "personal preference" signal and the context factors (e.g., location, time, and social factors) are not explicitly shown to users, users certainly know their own preference and search context. In other words, all these ranking signals are directly observable by users, and users' click behaviors presumably depend heavily on them. Therefore, understanding how exactly a user's decision relates to these signals could lead to better ways of modeling them and thus to improved mobile local search.

In this section, we present and discuss a recent data analysis of the behavior of some of these ranking signals, including customer rating scores, numbers of reviews, distance, and user preference, in mobile local search using a large-scale query log [246], which reveals interesting heuristics that can be used to guide the exploitation of these signals. We first give an overview of the dataset and the experimental settings used in [246] to discover these heuristics.

5.2.1 Dataset and Experimental Setting

Recognizing the challenges of acquiring relevance judgments in mobile local search, Lv et al. [246] used clickthrough information to approximate relevance judgments. Although each individual user click may not be very reliable, the aggregation of a great number of user clicks from a large-scale query log can still provide a powerful indicator of relevance. Thus, features that help improve click prediction are very likely to be useful in ranking as well.

Therefore, the problem setting was, given a query, to predict whether a candidate business would be clicked, and then rank the candidate businesses based on the click prediction. In the experiments, the query was sampled from the search log, whereas the candidate businesses were all businesses that were shown to the user for that query. To learn and evaluate a click prediction model, the query log was split into four parts. The first nine months of data were kept out as the “history” data and were used purely for estimating the popularity of a business and the user’s personal preference. Queries were sampled from the next three, one, and one months of data, respectively, for training, validating, and testing the click prediction models.

Three preprocessing steps were applied to make these four datasets more practical and representative: (1) Queries that received no clicks or that received clicks on every candidate business were excluded, since such queries do not influence the average ranking performance in the experiments. (2) Queries that exactly match a business name, e.g., "Starbucks," were identified and filtered out; answering such queries is a relatively easy task, since users usually have clearer information needs (e.g., visiting a nearby Starbucks) than with more general queries, e.g., "Coffee." This places the emphasis of the study on difficult queries. (3) Queries from "users" who issued more than 1,000 queries in total across the training, validation, and test datasets were removed from all four datasets, because such "users" are more likely to be robots. Finally, 60,475, 18,491, and 23,152 queries were obtained as the training, validation, and test queries; the average number of clicks and the average number of candidate businesses per query were 1.34 and 17, respectively.
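The following is a minimal sketch of these three filtering steps, assuming each logged query is represented as a dictionary with a user ID, the query text, and one click label per candidate business; the field names, the exact-name matching rule, and the toy log are illustrative assumptions rather than the actual log schema used in [246].

```python
def preprocess(queries, business_names, robot_threshold=1000):
    """Apply the three query-log filters described above (simplified sketch)."""
    # (3) Drop "users" who issued more than `robot_threshold` queries in total.
    counts = {}
    for q in queries:
        counts[q["user_id"]] = counts.get(q["user_id"], 0) + 1
    queries = [q for q in queries if counts[q["user_id"]] <= robot_threshold]

    kept = []
    for q in queries:
        clicks = q["clicks"]  # one 0/1 label per candidate business
        # (1) Keep only queries with at least one click and at least one non-click.
        if sum(clicks) == 0 or sum(clicks) == len(clicks):
            continue
        # (2) Skip queries that exactly match a business name, e.g., "Starbucks".
        if q["text"].strip().lower() in business_names:
            continue
        kept.append(q)
    return kept

# Usage with a toy log.
log = [{"user_id": "u1", "text": "coffee", "clicks": [1, 0, 0]},
       {"user_id": "u2", "text": "starbucks", "clicks": [1, 0]}]
print(len(preprocess(log, business_names={"starbucks"})))  # 1
```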

Since multiple signals need to be leveraged for click prediction, this work relies on machine learning. Lv et al. adopted MART [372], a learning tool based on multiple additive regression trees, to provide a common learning framework on top of which the performance of different ranking features can be compared. MART is based on the stochastic gradient boosting approach described in [131,129], which performs gradient descent optimization in function space. In the click-prediction experiments of [246], the log likelihood was used as the loss function, steepest descent (gradient descent) as the optimization technique, and binary decision trees as the fitting function.

A training instance for each query-business pair was constructed, which consists of a set of features (e.g., distance, rating, etc.) and a click label that indicates whether the user clicked the business (1 for click and 0 otherwise). The training and validation data were fed into MART to build a binary classification model, which was used to estimate the probability of clicks in the test data.
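MART itself is not publicly packaged, but the same recipe (logistic loss, gradient descent in function space, regression trees as the fitting function) is implemented by scikit-learn's gradient-boosted trees. The sketch below uses GradientBoostingClassifier as a stand-in to train a per-(query, business) click model and rerank candidates by predicted click probability; the feature values and hyperparameters are illustrative, not those of [246].

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# One row per (query, business) pair; columns are illustrative features:
# [distance_km, popularity, history_click_rate, rating, num_reviews, time_code]
X_train = np.array([[0.8, 120, 0.10, 4.5,  85, 3],
                    [5.2,  40, 0.02, 3.0,  12, 3],
                    [1.5, 300, 0.15, 4.0, 210, 1],
                    [9.0,  10, 0.00, 0.0,   0, 1]])
y_train = np.array([1, 0, 1, 0])  # click labels: 1 = clicked, 0 = not clicked

# Boosted regression trees with logistic loss, in the spirit of MART.
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)

# Rank the candidate businesses of a new query by predicted click probability.
X_test = np.array([[0.5, 200, 0.12, 4.2, 150, 3],
                   [6.0,  15, 0.01, 4.8,   5, 3]])
scores = model.predict_proba(X_test)[:, 1]
print(np.argsort(-scores))  # candidate indices, best first
```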

Note that the choice of machine learning algorithm is not critical in this data analysis: machine learning is used only as a black-box tool to evaluate the proposed heuristics and features, so any other appropriate learning algorithm, such as a more advanced learning-to-rank algorithm [226], could be used instead of MART. MART was chosen mainly because it can potentially handle nonlinear combinations of features and is widely adopted in commercial search and advertisement ranking engines.

The main goal of the experiments was to explore and evaluate effective ranking heuristics for boosting the special ranking signals of mobile local search. The baseline feature set contained six representative features selected based on previous studies [132,209,248]: (1) the distance between the query and business locations, (2) the popularity of the business, defined as its number of clicks in the history search logs, (3) the click rate of the business in the history data, defined as the number of clicks divided by the number of impressions (0 if the business does not occur in the history data), (4) the customer rating score of the business, in a range of image, (5) the number of customer reviews of the business, and (6) a time code that represents both the timeframe within a day (one of five timeframes) in which the query was submitted and the day of the week on which it was submitted (weekend or weekday). Since only top-ranked businesses from a commercial mobile local search engine were reranked in the experiments, and businesses that did not match the query keywords at all had already been eliminated, the experiments do not involve any textual matching features and focus only on the special signals of mobile local search.
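As an illustration of feature (6), the snippet below encodes a query timestamp into a single time code combining a within-day timeframe and a weekend flag; the five timeframe boundaries are assumptions for illustration, since [246] does not specify them.

```python
from datetime import datetime

def time_code(ts: datetime) -> int:
    """Encode a query timestamp as a single integer time code."""
    # Five illustrative timeframes; hours before 06:00 and from 22:00 on
    # fall into the same "night" frame.
    boundaries = [6, 11, 14, 18, 22]
    frame = sum(ts.hour >= b for b in boundaries) % 5
    weekend = 1 if ts.weekday() >= 5 else 0  # Saturday or Sunday
    return frame * 2 + weekend

print(time_code(datetime(2011, 7, 15, 19, 30)))  # a Friday evening query
```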

Retrieval performance was compared in terms of mean average precision (MAP) and precision at different recall levels. Because the experiments only involved reranking a set of top-ranked businesses in response to a query, the recall score is the same for any ranking model; MAP is therefore influenced only by the positions of the relevant results and serves as a good measure of the ranking accuracy of top-ranked results. In addition, the importance of each feature in constructing the MART model was compared, following the relative importance measure proposed in [131].
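For reference, MAP over reranked candidate lists can be computed directly from the click labels, as in the following sketch with made-up toy lists.

```python
def average_precision(ranked_clicks):
    """AP for one query, given click labels (1 = clicked) in ranked order."""
    hits, precisions = 0, []
    for rank, clicked in enumerate(ranked_clicks, start=1):
        if clicked:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_query_clicks):
    """MAP over queries; each query is a list of click labels in ranked order."""
    return sum(average_precision(q) for q in per_query_clicks) / len(per_query_clicks)

# Two toy queries: click labels of the reranked candidate lists.
print(round(mean_average_precision([[0, 1, 0, 1], [1, 0, 0]]), 3))  # 0.75
```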

5.2.2 Customer Rating

Intuitively, the customer rating score of a business should significantly affect users' click behaviors in mobile local search, since users are accustomed to consulting other people's ratings and opinions about an entity to help make their own decisions [273]. To verify this intuition, a model is trained using the baseline features described in the previous section. The relative importance of the different features is shown in Table 5.1, indicating that, contrary to intuition, the importance of the rating score as a feature is relatively low in the baseline system.

Table 5.1

Relative feature importance in baseline.


5.2.2.1 Likelihood of Clicks/Relevance

To examine this observation, this work [246] analyzes the likelihood of relevance (i.e., click rate) for businesses of all rating scores and plots these likelihoods against the rating score to obtain a click pattern. Intuitively, the likelihoods should increase monotonically with the rating score.

To estimate these likelihoods, all the businesses in the training data are sorted in increasing order of rating score and divided into equally sized "bins" of 1,000 businesses each, yielding 1,032 bins. The mean rating score in each bin is then used to represent the bin in the plots that follow. We can then compute the probability that a randomly selected business from the i-th bin is clicked, i.e., the number of clicked businesses in the i-th bin divided by the bin size. In terms of conditional probability, given a business b, this ratio for the i-th bin can be written as p(click = 1 | b in bin i).
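The binning analysis can be sketched as follows; the random ratings and clicks are toy data used only to show the mechanics of computing a per-bin click rate.

```python
import numpy as np

def bin_click_rates(ratings, clicks, bin_size=1000):
    """Sort businesses by rating, cut into fixed-size bins, and return
    (mean rating, click rate) per bin, i.e., the quantities plotted in Figure 5.2."""
    order = np.argsort(ratings)
    ratings = np.asarray(ratings, dtype=float)[order]
    clicks = np.asarray(clicks, dtype=float)[order]
    points = []
    for start in range(0, len(ratings), bin_size):
        r = ratings[start:start + bin_size]
        c = clicks[start:start + bin_size]
        points.append((r.mean(), c.mean()))  # click rate = clicked / bin size
    return points

# Toy data: 5,000 businesses with random ratings and click labels.
rng = np.random.default_rng(0)
ratings = rng.uniform(1, 5, 5000)
clicks = rng.integers(0, 2, 5000)
for mean_rating, click_rate in bin_click_rates(ratings, clicks):
    print(f"{mean_rating:.2f}\t{click_rate:.3f}")
```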

Figure 5.2(a) shows how the probabilities obtained from this analysis relate to the mean rating score in a bin. Surprisingly, there seems to be no clear relationship between the likelihood of a click and the rating score.


FIGURE 5.2 Probability of clicks for businesses from a bin, plotted against the mean rating score of this bin.

This counterintuitive observation could be caused by rating scores that are not comparable across queries; result businesses retrieved by some queries may have higher rating scores than those retrieved by others. To verify this possibility, the popular zero-one score normalization method is adopted, which linearly normalizes the rating scores of every query to the range [0, 1]. Such a normalization strategy has been widely used for feature normalization in many learning-to-rank tasks, e.g., [59]. We then repeat the analysis of the probability of a click, but against the normalized rating score. The results are shown in Figure 5.2(b). Unfortunately, there is still no clear relationship, suggesting that the ineffectiveness of the rating score as a feature is not caused purely by incomparable score ranges.

5.2.2.2 The “Mean” Normalization Scheme

To diagnose the problem, we look further into the distribution of rating scores of businesses that get clicked. Since the distribution of rating scores is intuitively related to the type or category of business, we do this analysis in a category-aware way. Specifically, we compare the rating score of a clicked business with the mean rating score of all businesses from the same category, in order to understand whether the clicked businesses are among the highly rated businesses in a category. The comparison results are illustrated in Figure 5.3, which shows image of randomly selected clicked businesses from the training data.


FIGURE 5.3 Rating of clicked businesses, plotted against the mean rating score of the corresponding business category.

Interestingly, we can see that the rating scores of most clicked businesses lie above their corresponding category means. For example, when the category mean rating score is 4, few businesses with a rating score lower than 4 get clicked. This shows that the rating score is indeed useful and directly related to users' click behaviors. But how does a user know the mean rating score of a category so as to click businesses above this score?

In reality, we expect that users do not really know the mean score of a category. However, users may be able to estimate this mean score approximately by looking over the retrieved business list: the retrieved businesses for a query often belong to the same category and thus can be viewed as a sample of businesses from that category, whose mean rating score approximates the category mean. If this assumption is true, it suggests that users often take the mean rating score of the business result list as a "pivot" score and tend to click businesses with scores above this pivot. This intuitively makes sense: a user's click decision, although influenced by the rating score, is not entirely based on it; but if the rating of a business is above the user's expectation (i.e., the pivot), which the user learns from the result list in an ad hoc way, the business is more likely to be clicked.

Inspired by this analysis, Lv et al. proposed to normalize rating scores by the mean rating score of the retrieved businesses. A straightforward approach is to divide the original rating score by this mean value, which not only makes rating scores more comparable across queries but also aligns them at the mean rating score of each query. The probabilities of clicks against the normalized rating scores are then plotted in Figure 5.2(c). It is clear that the probability of a click increases monotonically with the normalized rating when the normalized rating is larger than 1 (i.e., the mean point), whereas the probability tends to be random when the normalized rating is lower than 1. This empirically verifies that users tend to take the mean rating score of the observed results as a "pivot" score and clearly demonstrates different click behaviors on businesses with scores below and above the pivot.
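A sketch of the two normalization schemes compared here, applied to the rating scores of a single query's result list (the rating values are made up): zero-one normalization rescales to [0, 1], while mean normalization divides by the result-list mean so that 1.0 marks the pivot.

```python
import numpy as np

def zero_one_norm(values):
    """Linearly rescale one query's feature values to [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return (values - values.min()) / span if span > 0 else np.zeros_like(values)

def mean_norm(values):
    """Divide by the mean of the result list, so that 1.0 marks the pivot."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    return values / mean if mean > 0 else values

# Hypothetical rating scores of the businesses retrieved for one query.
ratings = [4.5, 3.0, 4.0, 2.5, 5.0]
print(zero_one_norm(ratings))  # comparable across queries, but the pivot is lost
print(mean_norm(ratings))      # values above 1 exceed the user's expected rating
```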

To examine whether this simple mean normalization scheme can really improve ranking accuracy, the normalized rating score was used in place of the original rating score to learn a new model, labeled MeanNorm, whose performance is compared with the baseline model. In addition, the widely used zero-one normalization strategy is taken as another baseline run, labeled ZeroOneNorm. The comparison results, reported in Table 5.2, show that the mean normalization scheme works best, achieving significant improvements over both baseline runs, whereas zero-one normalization does not improve the accuracy of the baseline model. This suggests that ranking performance can be improved by explicitly exposing the mean rating value.

Table 5.2

Comparison of methods for modeling rating scores. Norm+Pred combines methods tagged using image. image indicates the significance over Baseline, ZeroOneNorm, MeanNorm, AutoNorm-C, AutoNorm-Q, and RatingPred, respectively, at the 0.001 level using the Wilcoxon nondirectional test.


An alternative to explicitly normalizing the rating of a business by the mean rating of all result businesses is to encode this mean rating as an additional feature and let the training algorithm itself decide how to do the normalization. To evaluate this approach, a new model, labeled AutoNorm-Q, was trained by adding the mean rating score of the businesses retrieved for the same query to the baseline feature set. Another model, labeled AutoNorm-C, was trained by instead adding the mean rating score of businesses from the same category. The performance of these two runs is shown in Table 5.2. The results demonstrate that AutoNorm-Q works much better than AutoNorm-C, confirming that users select the "pivot" from the businesses shown to them. AutoNorm-Q improves over the baseline by approximately image. Moreover, AutoNorm-Q improves over MeanNorm, suggesting that the mean normalization scheme can be boosted by optimizing the way the mean is exploited in a supervised manner.

5.2.2.3 Cluster-Based Smoothing of Ratings

It is often the case that a business has not received any customer rating. In the presence of missing rating scores, a default value of 0 is often used, which may be inaccurate because (1) the absence of customer ratings does not necessarily mean that a business should be rated low, and (2) it can be unfair to use the same default rating score for all businesses. Even when a business does receive a rating score, the score may still be inaccurate if it is contributed by only a very small number of customers. Therefore, more accurate prediction/smoothing of ratings could potentially improve ranking accuracy, and Lv et al. [246] proposed a cluster-based method to predict or smooth rating values.

The basic idea is based on the cluster hypothesis [254]: the rating scores r(b') of all businesses b' in the same cluster C(b) are averaged to smooth the rating r(b) of business b, yielding an updated rating r̂(b). The intuition is that businesses in the same cluster should receive similar ratings, and the rating stability of a cluster benefits an individual business. Formally,

r̂(b) = λ(b) · r(b) + (1 − λ(b)) · (1 / |C(b)|) Σ_{b' ∈ C(b)} r(b'), (5.1)

where λ(b) is a function that controls how much the original rating is updated toward the cluster average. The key component is thus the business cluster. Two types of clusters are used in [246]: business category and business chain. The former allows the ratings of all businesses in a given category to be used to estimate the rating of an unrated business in that category (e.g., use all businesses in the "Coffee & Tea" category to estimate the rating score of a Starbucks business). Because a business may belong to multiple categories, it would intuitively help to predict the primary categories of businesses [196]. The latter allows the rating score of a business to be estimated by exploiting the rating scores of other businesses in the same chain (e.g., use other Starbucks coffeehouses' rating scores to estimate the rating score of an unrated Starbucks coffeehouse).

There are two challenges with this approach: how to choose the function λ, and how to leverage the evidence from the two types of clusters. Inspired by the effectiveness of automatic feature normalization in the previous section, the same learning algorithm can be used to optimize these two factors in a supervised way. Specifically, both the category mean rating and the business-chain mean rating are provided as two separate features to the learning algorithm, along with two descriptive variables for these features, namely the size of the category and the size of the business chain.
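The following sketch shows one way the two cluster-level statistics and their sizes could be assembled as extra features for the learner; the field names and toy businesses are hypothetical, and handling of multiple categories per business is omitted for brevity.

```python
from collections import defaultdict

def cluster_mean_features(businesses):
    """For each business, emit the category and chain mean ratings plus the
    two cluster sizes, to be fed to the learner alongside the raw rating.
    Unrated businesses (rating is None) contribute nothing to the means."""
    sums = {"category": defaultdict(float), "chain": defaultdict(float)}
    counts = {"category": defaultdict(int), "chain": defaultdict(int)}
    for b in businesses:
        if b["rating"] is None:
            continue
        for key in ("category", "chain"):
            sums[key][b[key]] += b["rating"]
            counts[key][b[key]] += 1

    features = []
    for b in businesses:
        row = {}
        for key in ("category", "chain"):
            n = counts[key][b[key]]
            row[f"{key}_mean_rating"] = sums[key][b[key]] / n if n else 0.0
            row[f"{key}_size"] = n
        features.append(row)
    return features

# Toy example: an unrated Starbucks borrows evidence from its chain and category.
shops = [
    {"name": "Starbucks #1", "chain": "Starbucks", "category": "Coffee & Tea", "rating": 4.0},
    {"name": "Starbucks #2", "chain": "Starbucks", "category": "Coffee & Tea", "rating": None},
    {"name": "Joe's Coffee", "chain": "Joe's", "category": "Coffee & Tea", "rating": 3.0},
]
print(cluster_mean_features(shops)[1])
```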

The resulting method is labeled RatingPred, and the experimental results are presented in Table 5.2. RatingPred improves significantly over the baseline, suggesting that the proposed cluster-based smoothing can indeed improve rating values. Furthermore, RatingPred can be combined with the proposed feature normalization methods, leading to a new run labeled Norm+Pred. Table 5.2 shows that Norm+Pred outperforms either method alone, suggesting that smoothing ratings and normalizing ratings are complementary. Norm+Pred improves over the baseline by more than image.

5.2.3 Number of Reviews

The number of reviews is another signal derived from opinionated content and intuitively reflects the popularity of a business. However, the importance of this signal is also low in the baseline model, as shown in Table 5.1. Similarly to the rating score analysis, Lv et al. [246] show that this is because users often take the mean number of reviews of the businesses they observe as a pivot and demonstrate different click patterns on businesses with fewer and more reviews than the pivot, and the learning algorithm fails to capture this information from the raw counts. To better exploit this signal, the pivot (i.e., the mean number of reviews) also needs to be fed to the learning algorithm, either by manually normalizing the number of reviews by the pivot or by introducing the pivot as an additional feature for automatic normalization.

The experimental results presented in Table 5.3 show that ranking performance can be significantly improved by the "mean" normalization scheme. The notations in the table are defined analogously to their counterparts in Table 5.2 but applied to the review count. Specifically, the simple mean normalization method (MeanNorm) performs significantly better than the widely used zero-one normalization method (ZeroOneNorm), which does not leverage the mean value; furthermore, automatic normalization (AutoNorm-Q) works more effectively than manual normalization (MeanNorm).

Table 5.3

Comparison of methods for modeling review counts. Norm+Pred combines the methods tagged using image. The notations image are defined as in Table 5.2.


In addition, similarly to the rating score, ranking performance can be improved by smoothing the number of reviews based on the cluster hypothesis to make it more accurate (this run is labeled ReviewsPred). The mean normalization scheme and the cluster-based smoothing can also be combined (Norm+Pred) to further improve performance, achieving a image improvement in MAP.

5.2.4 Distance

Local search differs from other search tasks mainly in that its ranking signals include geographic distance. Indeed, distance is shown to be one of the most important features in Table 5.1.

To understand how distance affects users' click patterns, Lv et al. [246] also plot, in Figure 5.4(a), the likelihood of a click on a business against its distance from the user. There is indeed a monotonically decreasing trend of the likelihood with respect to distance, which may explain why distance appears to be the most important feature in Table 5.1. Furthermore, the likelihood decreases sublinearly with distance: it decreases as distance grows, but the rate of decrease drops as the distance becomes large. This intuitively makes sense: a restaurant within 1 mile of a user has a clear advantage over a similar restaurant within 2 miles, but two restaurants at 9 miles and 10 miles from the user may not differ much. That is, the relevance of a business is more sensitive to its distance when the distance is small.


FIGURE 5.4 Probability of click for businesses from a bin, plotted against the mean distance and log (distance) scores of this bin.

A Box-Cox transformation analysis [41] shows that the relationship is approximately logarithmic. To illustrate this, Lv et al. [246] plotted the likelihood of clicks against the logarithm of distance, as shown in Figure 5.4(b). Linear scaling of distance would overly penalize businesses that are relatively far from the user, whereas the logarithm transformation generally improves the modeling of distance.

Similarly to ratings and review counts, distance is also observable to users, so one interesting question is whether there is also a pivot phenomenon here. To answer this question, the distance of clicked businesses is plotted against the mean traveling distance to businesses in the same category, as shown in Figure 5.5. Indeed, the plot shows that users tend to click businesses closer than the category mean distance, suggesting that the mean normalization scheme could also be applicable to the distance feature. Yet the pivot phenomenon for distance is not as clear as for ratings and review counts. One possible reason is that users generally understand distance better than ratings and review counts: distance is a concrete concept, whereas ratings and review counts are more abstract and subjective. As a result, users tend to rely more on the statistics of the observed business list to make sense of ratings and review counts, whereas absolute distance is already meaningful on its own. This is also consistent with the earlier observation that there is a clear relationship between raw distance values and the probability of a click, as shown in Figure 5.4(a).


FIGURE 5.5 Distance of clicked businesses, plotted against the mean (traveling) distance of the corresponding business category.

To verify this analysis, empirical results are reported in Table 5.4. The simple mean normalization method is first applied to divide distance by the mean distance of all observed businesses for the same query; this run is labeled MeanNorm. It significantly improves over the baseline system, suggesting that feature normalization still helps even though absolute distance values are already largely comparable. Next, MeanNorm is compared with MeanNorm+, in which the simple mean normalization method is applied to log(distance). MeanNorm+ clearly works more effectively, confirming the previous analysis that the logarithm transformation is useful for modeling distance. In addition, another run, ZeroOneNorm+, is created, which differs from MeanNorm+ in that zero-one normalization is used. ZeroOneNorm+ works significantly worse than MeanNorm+, confirming that the "mean" normalization scheme works well for distance and suggesting that users prefer to click businesses with a distance smaller than the pivot (i.e., the mean distance in the result list).

Table 5.4

Comparison of methods for modeling distance. Methods with an indicator “+” apply logarithm transformation. image indicates the significance over Baseline, ZeroOneNorm+, MeanNorm, and AutoNorm, respectively, at the 0.05 level using the Wilcoxon nondirectional test.


Finally, two automatic mean normalization runs, AutoNorm and AutoNorm+, are also evaluated, in which the mean values of distance and log(distance) in the search results are added as additional features so that the learning algorithm can automatically normalize the distance and log(distance) features, respectively. First, AutoNorm+ outperforms AutoNorm, though the improvement is small; this suggests that a sublinear transformation of distance is beneficial, yet its advantage tends to be weakened under automatic feature normalization, because MART can potentially handle nonlinear combinations automatically. Second, comparing AutoNorm(+) with MeanNorm(+), automatic normalization works better than or comparably to the simple mean normalization. Overall, both automatic and manual normalization improve over the baseline by approximately image with the proposed new features.
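The distance transformations discussed above can be summarized in a small sketch that produces, for one query's candidate list, the raw, log-transformed, and mean-normalized distance features along with the per-query means that the automatic variants would receive; the +1 offset inside the logarithm is an assumption to keep the transform finite near zero.

```python
import math

def distance_features(distances_km):
    """Per-query distance features: raw, log-transformed, and mean-normalized
    variants (MeanNorm / MeanNorm+ style), plus the per-query means that the
    automatic variants (AutoNorm / AutoNorm+) would receive as extra features."""
    logs = [math.log(1.0 + d) for d in distances_km]  # +1 keeps the log finite near zero
    mean_d = sum(distances_km) / len(distances_km)
    mean_log = sum(logs) / len(logs)
    return [
        {
            "distance": d,
            "log_distance": ld,
            "mean_norm": d / mean_d,              # MeanNorm
            "mean_norm_log": ld / mean_log,       # MeanNorm+
            "query_mean_distance": mean_d,        # extra feature for AutoNorm
            "query_mean_log_distance": mean_log,  # extra feature for AutoNorm+
        }
        for d, ld in zip(distances_km, logs)
    ]

for row in distance_features([0.5, 2.0, 9.0, 10.0]):
    print(row)
```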

5.2.5 Personal Preference

The goal here is to build user profiles so that the user's preference for a business can be computed and businesses can be ranked in a user-adaptive way. However, it is nontrivial to build content-based user profiles [312,339] in mobile local search, since the text associated with each business is often very sparse. Lv et al. therefore chose a collaborative filtering approach [161,251,33] based on the history click data to estimate the likelihood that a user u likes business b, formally the conditional probability p(b | u). Yet there is another challenge: due to the local nature of the task, a user tends to click only nearby businesses, so the co-occurrences are also local in nature and thus very sparse; for example, it is very hard to find users who live far from each other but share similar business click patterns. To address this problem, Lv et al. [246] exploited the domain knowledge of businesses and instead estimated the likelihood that a user u likes a business category c, i.e., p(c | u): because a business category can cover businesses from different locations, co-occurrences of categories and users can happen across locations. As a byproduct, the number of categories, about 3,000, is only about image of the number of businesses, significantly reducing the dimensionality. Lv et al.'s experiments show that building profiles for 1 million users can be done in several hours on a single machine.

Although the categories of the businesses that a user has clicked can be obtained directly from the history data, because of the data sparseness problem for many users, Lv et al. [246] followed the idea of statistical topic modeling to estimate p(c | u) in a smoother way. First, a hidden variable z is introduced for every user-category pair, where the set of possible states of z is assumed to be finite and of size K, with K empirically set to 100 in this work. The problem can then be mapped to the standard topic modeling problem: the original document-term matrix is replaced by a user-category matrix, and the original co-occurrence relationship is replaced by a click. Lv et al. adopted both PLSA [161] and LDA [33]; since the two models performed similarly in their experiments, only the results based on PLSA were reported.

Considering observations in the form of clicks (u, c) of categories by users, PLSA models the probability of each click (i.e., a user u clicks a business of category c) as a mixture of conditionally independent multinomial distributions:

p(u, c) = Σ_z p(z) p(u | z) p(c | z). (5.2)

Since the problem is to estimate user preference, the conditional model is used:

p(c | u) = Σ_z p(z | u) p(c | z). (5.3)

The model can be estimated using the expectation-maximization (EM) algorithm to obtain the parameters p(z | u) and p(c | z), which yield the user profile p(c | u). Then, given any business b, since b may belong to multiple categories, the conditional probabilities of the corresponding categories are averaged to obtain the user's preference for the business, i.e., p(b | u) is approximated by the average of p(c | u) over the categories c of b.
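A compact EM sketch of the conditional PLSA model in Eqs. (5.2)-(5.3), fit to a toy user-by-category click matrix; this is an illustration of the technique, not the implementation used in [246], and K is kept small for the demo (the chapter uses K = 100).

```python
import numpy as np

def plsa(N, K=100, iters=50, seed=0):
    """Fit the conditional PLSA model p(c|u) = sum_z p(z|u) p(c|z) by EM.
    N is a user-by-category click-count matrix; returns the p(c|u) profiles."""
    rng = np.random.default_rng(seed)
    U, C = N.shape
    p_z_u = rng.random((U, K)); p_z_u /= p_z_u.sum(axis=1, keepdims=True)  # p(z|u)
    p_c_z = rng.random((K, C)); p_c_z /= p_c_z.sum(axis=1, keepdims=True)  # p(c|z)
    for _ in range(iters):
        # E-step: responsibilities p(z|u,c), shape (U, C, K).
        joint = p_z_u[:, None, :] * p_c_z.T[None, :, :]
        joint /= joint.sum(axis=2, keepdims=True) + 1e-12
        weighted = N[:, :, None] * joint  # n(u,c) * p(z|u,c)
        # M-step: re-estimate p(z|u) and p(c|z).
        p_z_u = weighted.sum(axis=1)
        p_z_u /= p_z_u.sum(axis=1, keepdims=True) + 1e-12
        p_c_z = weighted.sum(axis=0).T
        p_c_z /= p_c_z.sum(axis=1, keepdims=True) + 1e-12
    return p_z_u @ p_c_z  # p(c|u) for every user-category pair

# Toy 4-user x 5-category click matrix (illustrative only).
N = np.array([[5, 3, 0, 0, 1],
              [4, 4, 0, 1, 0],
              [0, 0, 6, 2, 0],
              [0, 1, 5, 3, 0]], dtype=float)
profiles = plsa(N, K=2, iters=100)
print(np.round(profiles, 3))  # row u gives p(category | user u)
```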

Some users have more history clicks than others, and the profiles of these users are intuitively more reliable than those of users with few clicks. To make the model able to automatically learn how much personalization to apply for each user, Lv et al. encoded both the user preference, i.e., p(b | u), and the number of history clicks of the user into the learning algorithm. No other normalization was used, and this run is labeled NoNorm. Compared with the baseline, NoNorm achieves a statistically significant but quite small improvement.

To examine the reason, following Section 5.2.2.1, the probability of clicks for businesses is plotted against the user preference, as shown in Figure 5.6(a). The probability of clicks generally does not vary much as the user preference changes, which may be one reason NoNorm does not work very well. The zero-one normalization is then applied to generate the plot in Figure 5.6(b); it essentially stretches the plot along the x-axis, and there appears to be an increasing trend only when the user preference is very small. Next, the mean normalization method is applied in Figure 5.6(c). It shows clearly that when the user preference is below the mean value of the current search results (i.e., a normalized preference below 1), the probability of clicks increases monotonically with user preference, and the rate of increase decreases as the user preference approaches the mean; after the user preference passes the mean, the probability of a click even tends to decrease slightly. Again, this observation shows that users take the mean value as a pivot score and exhibit different click behaviors on the two sides of the pivot. Furthermore, it is interesting that the probability of clicks is maximized when the user preference is around the mean value: too low a preference may mean that the business is not interesting to the user (i.e., an irrelevant business), whereas too high a preference may indicate that the business is too similar to what the user has clicked before (i.e., a redundant business). This pivot observation suggests that a user's decision resembles an exploration-exploitation tradeoff: exploit what the user knows, but meanwhile explore what the user does not know.


FIGURE 5.6 Probability of click for businesses from a bin plotted against the mean user preference of this bin.

Inspired by this analysis, Lv et al. developed a manual mean normalization run (MeanNorm) and an automatic mean normalization run (AutoNorm) for feature normalization. According to the results in Table 5.5, AutoNorm improves significantly over both the baseline and MeanNorm, whereas MeanNorm does not perform very well. This may be due to the sublinear increasing curve under mean normalization shown in Figure 5.6(c): as with distance normalization, automatic normalization using MART can handle nonlinear combinations, whereas manual normalization cannot. To verify this intuition, a logarithm transformation based on the Box-Cox transformation analysis [41] is first applied to the user preference, and the two normalization methods are then applied on top of the transformed feature, leading to two new runs, MeanNorm+ and AutoNorm+. Table 5.5 shows that these two runs perform similarly well and are the best among all methods, verifying the hypothesis and showing the necessity of a sublinear transformation. Overall, normalizing the personalization features yields MAP improvements of over 2%.

Table 5.5

Comparison of methods for modeling user preference. Methods with an indicator “+” apply logarithm transformation. image indicates the significance over Baseline, NoNorm, and MeanNorm, respectively, at the 0.01 level using the Wilcoxon nondirectional test.


5.2.6 Sensitivity Analysis

The most effective modeling methods for all four signals are then combined into a final model, including Norm+Pred from Table 5.2 for modeling rating, Norm+Pred from Table 5.3 for modeling review count, AutoNorm+ and MeanNorm+ from Table 5.4 for modeling distance, and AutoNorm+ and MeanNorm+ from Table 5.5 for modeling user preference. The final model is labeled All and shown in Table 5.6. Table 5.7 lists the 23 features in the All model as well as where each feature comes from. We can see that the All model outperforms the baseline by more than image, suggesting that understanding the behaviors and heuristics behind ranking signals can indeed lead to better modeling methods and thus improve ranking accuracy.

Table 5.6

Sensitivity analysis. These data show that combining the proposed new features (i.e., All) improves over the Baseline by more than image.


Table 5.7

Relative feature importance in the final model.


It has been shown that the selected modeling methods (now components in the final All model) perform very well in modeling each individual signal. To examine how sensitive these methods are when combined, Lv et al. removed one or more of the new modeling methods at a time while keeping all other modeling methods and the baseline features. For example, All-Rating in Table 5.6 was constructed by excluding from the All model the novel features of the Norm+Pred method for rating modeling, while the features of the other models, including all baseline features, were kept. Table 5.6 shows that removing Distance hurts performance the most, suggesting that Distance is very sensitive in the final model and its effect cannot be replaced. However, when Rating, Reviews, or Personalization is excluded, performance decreases only slightly or not at all. This suggests that the effects of these signals overlap substantially; although they perform well as individual signals, their contributions do not simply add up.

To go a step further, Lv et al. removed Rating and Reviews at the same time and found that performance degrades much more than for All-Rating or All-Reviews, confirming that ratings and review counts are highly redundant with each other. After removing Rating and Reviews, Personalization is also removed in the last row of Table 5.6. The resulting performance degradation is much larger than when Personalization is removed from the All model, suggesting that the personalization features also tend to be redundant with Rating and Reviews, probably because of the general consistency between global preference and individual preference.

Finally, a feature-level analysis examines the relative importance of each feature, as shown in Table 5.7. Distance, with the proposed normalization, appears to be the most important feature, and the two popularity measures from the baseline are the second and third most important. These observations show that distance and popularity dominate the ranking signals of mobile local search. However, the other signals also contribute many useful features; for example, the top six features include contributions from all of the signals.

Two particularly interesting observations are that (1) three mean values are ranked very highly, and (2) 8 out of the top 12 features are directly related to the mean normalization scheme, suggesting that the proposed “mean” normalization scheme indeed helps model signals in mobile local search. Due to the effectiveness of the proposed new features, many features from the baseline have been pushed to the bottom of Table 5.7.

5.3 Summary and Future Directions

It has been recognized that the ranking signals in mobile local search, such as distance, customer ratings, and reviews, are quite different from those of general Web search. We have summarized the current understanding of these signals in existing studies; for example, how sensitive users are to distance depends on the type of business being ranked (e.g., users may be willing to drive 20 minutes for a furniture store but not for a coffee shop), the search time (e.g., users may be willing to travel a longer distance for dinner than for lunch), the traffic status and attractiveness of a location (e.g., users tend to travel longer distances in Texas than in New York), and so on. We have also presented and discussed in depth several heuristics that can guide the development of effective ranking features. For example, a common heuristic across many ranking signals is that users often take the mean value of a signal in the business result list as a pivot score and demonstrate different click behaviors on businesses with signal values below and above the pivot; encoding this pivot information into feature normalization or into model training has been shown to significantly improve the modeling of these signals.

Mobile local search is an emerging search task, and there are many interesting research problems that are underexplored. Here we discuss a few of them.

5.3.1 Evaluation of Mobile Local Search

Acquiring relevance judgments to evaluate mobile local search is a challenging problem. First, the Cranfield-style evaluation used for many traditional information retrieval and search tasks [302] does not work well here, since relevance judgments in mobile local search depend heavily on the search context, e.g., the location of the user [209,248]. Second, asking users to make explicit relevance judgments can be very costly, because a diverse set of queries in different contexts must be covered. Third, although Joachims et al. have developed methods for extracting relative relevance judgments from user clickthrough data in general Web search [190], it is unclear whether these methods also work for mobile local search, where the position bias and the interpretation of clickthrough may differ from general Web search. For example, Ghose et al. [140] have shown that search results appearing at top positions are more likely to be clicked in mobile search than in Web search.

In addition, a recent study [286] reports that mobile searches are often conducted not to satisfy a specific information need but rather to kill time (e.g., while waiting for the bus) or to satisfy curiosity. In such "casual" search scenarios, finding the relevant answer to a given query and ranking it as high as possible may not be the main goal [370]. This suggests that the success of a casual search should also be measured by how much users enjoy the search process, posing another challenge for the evaluation of mobile local search. These challenges make the evaluation of mobile local search an open problem.

5.3.2 User Modeling and Personalized Search

Mobile local search is more "actionable" in the sense that mobile users often take an action (e.g., visiting a restaurant) as soon as they finish a local search session. Therefore, understanding the user's personal preference is particularly important for successfully answering a mobile local query. However, mobile local search logs are often severely sparse because clickthrough is geographically distributed across locations, as discussed in Section 5.2. On the other hand, users carry their mobile devices wherever they go, so mobile devices can know more about a user than traditional PCs. This provides opportunities to go beyond search logs and build better user models for mobile local search from rich user activities, but it also raises an interesting challenge: how to best utilize and integrate all the available information to build an optimal user model?


1 www.comscore.com.

2 www.biakelsey.com.

3 Although there are also navigational queries aiming at finding a specific business such as “Pizza Hut,” they are easily identified and processed and thus not considered in our work.