
Relevance Ranking for Vertical Search Engines, FIRST EDITION (2014)

Chapter 2. News Search Ranking

Abstract

News search is one of the most important Internet user activities. For a commercial news search engine, it is critical to provide users with the most relevant and fresh ranking results. Furthermore, it is necessary to group related news articles so that users can browse search results in terms of news stories rather than individual news articles. This chapter describes a few algorithms for news search engines, including ranking algorithms and clustering algorithms. For the ranking problem, the main challenge is achieving an appropriate balance between topical relevance and freshness. For the clustering problem, the main challenge is grouping related news articles into clusters in a scalable manner. We begin by introducing a few news search ranking approaches, including a learning-to-rank approach (Section 2.1) and a joint learning approach from clickthroughs (Section 2.2). We then describe a scalable clustering approach to group news search results (Section 2.3).

Keywords

News search

freshness

relevance

clustering

temporal features

2.1 The Learning-to-Rank Approach

The main challenge for ranking in news search is how to strike an appropriate balance between two factors: relevance and freshness. Here relevance includes both topical relevance and news source authority.

A widely adopted approach in practice is to use a simple formula to combine relevance and freshness. For example, the final ranking score for a news article can be computed as

$S(q, d) = R(q, d) \cdot e^{-\lambda \, t_d}$  (2.1)

where $R(q, d)$ is the value representing the relevance between query $q$ and news article $d$, $t_d$ is the news article age, and $e^{-\lambda \, t_d}$ is a time decay term by which the older a news article is, the more penalty the article receives in its final ranking. The parameter $\lambda$ is used to control the relative importance of freshness in the final ranking result. In the information retrieval literature, document is usually used to refer to a candidate item in ranking tasks. In this chapter, we use the terms document and news article interchangeably, because the application here is to rank news articles in a search.
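As a concrete illustration, the following minimal sketch computes such a blended score under the multiplicative exponential-decay form written above (the function and parameter names, and the example numbers, are ours, not from the original system):

```python
import math

def heuristic_news_score(relevance: float, age_hours: float, decay: float) -> float:
    """Blend topical relevance with freshness via exponential time decay.

    relevance : relevance score R(q, d) between the query and the article
    age_hours : article age t_d (e.g., hours since publication)
    decay     : lambda, controlling how strongly freshness affects the final score
    """
    return relevance * math.exp(-decay * age_hours)

# A fresher article can outrank a slightly more relevant but older one.
print(heuristic_news_score(relevance=0.80, age_hours=2.0, decay=0.05))   # ~0.72
print(heuristic_news_score(relevance=0.90, age_hours=48.0, decay=0.05))  # ~0.08
```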

The advantage of such a heuristic approach to combining relevance and freshness is its efficiency in practice: only the value of the parameter $\lambda$ needs to be tuned using some ranking examples. Furthermore, an appropriate $\lambda$ value often leads to good ranking results for many queries, which also makes this approach effective.

The drawback of this approach is that it is incapable of further improving ranking performance, because such a heuristic rule is too naive to handle more complicated ranking cases. For example, in (2.1), time decay is represented by the term $e^{-\lambda \, t_d}$, which depends only on the document age. In fact, an appropriate time decay factor should also rely on the nature of the query, since different queries have different time sensitivities: If a query is related to breaking news, such as an earthquake that has just happened and has extensive media reports on casualties and rescue, then freshness should be very important, because even a document published only one hour ago could be outdated. On the other hand, if a query is for an event that happened weeks ago, then relevance is more important in ranking, because the user would like to find the most relevant and comprehensive reports in the search results.

2.1.1 Related Work

Many prior works have exploited the temporal dimension in search. For example, Baeza-Yates et al. [22] studied the relation among Web dynamics, structure, and page quality and demonstrated that PageRank is biased against new pages. In the T-Rank Light and T-Rank algorithms [25], both activity (i.e., update rates) and freshness (i.e., timestamps of most recent updates) of pages and links are taken into account in link analysis. Cho et al. [66] proposed a page quality ranking function in order to alleviate the problem of popularity-based ranking, and they used the derivatives of PageRank to forecast future PageRank values for new pages. Nunes [269] proposed to improve Web information retrieval in the temporal dimension by combining the temporal features extracted from both individual documents and the whole Web. Pandey et al. [276] studied the tradeoff between new-page exploration and high-quality-page exploitation, based on a ranking method that randomly promotes some new pages so that they can accumulate links quickly.

The temporal dimension is also considered in other information retrieval applications. Del Corso et al. [94] proposed a ranking framework to model news article generation, topic clustering, and story evolution over time; this ranking algorithm takes publication time and linkage time into consideration as well as news source authority. Li et al. [221] proposed the TS-Rank algorithm, which considers page freshness in the stationary probability distribution of Markov chains, since the dynamics of Web pages are also important for ranking. This method proves effective in the application of publication search. Pasca [277] used temporal expressions to improve question-answering results for time-related questions. Answers are obtained by aggregating matching pieces of information and the temporal expressions they contain. Furthermore, Arikan et al. [20] incorporated temporal expressions into a language model and demonstrated experimental improvement in retrieval effectiveness.

Recency query classification plays an important role in recency ranking. Diaz [98] determined the newsworthiness of a query by predicting the probability of a user clicking on the news display of the query. König et al. [204] estimated the clickthrough rate for dedicated news search results with a supervised model, in order to satisfy the requirement of adapting quickly to emerging news events.

2.1.2 Combining Relevance and Freshness

Learning-to-rank algorithms have shown significant and consistent success in various applications [226,184,406,54]. Such machine-learned ranking algorithms learn a ranking mechanism by optimizing particular loss functions based on editorial annotations. An important assumption in those learning methods is that document relevance for a given query is generally stationary over time, so that, as long as the coverage of the labeled data is broad enough, the learned ranking functions would generalize well to future unseen data. Such an assumption is often true in Web searches, but it is less likely to hold in news searches because of the dynamic nature of news events and the lack of timely annotations.

A typical procedure is as follows:

• Collect query-URL pairs.

• Ask editors to label the query-URL pairs with relevance grades.

• Apply a learning-to-rank algorithm to train the ranking model.

Traditionally, in learning-to-rank, editors label query-URL pairs with relevance grades, which usually take four or five values, such as perfect, excellent, good, fair, and bad. Editorial labeling information is used for ranking model training and ranking model evaluation. For training, these relevance grades are directly mapped to numeric values as learning targets.

For evaluation, we desire an evaluation metric that supports graded judgments and penalizes errors near the beginning of the ranked list. In this work, we use DCG [175],

$DCG_n = \sum_{i=1}^{n} \frac{G_i}{\log_2(i + 1)}$  (2.2)

where $i$ is the position in the document list and $G_i$ is a function of the relevance grade of the document at position $i$. Because the range of DCG values is not consistent across queries, we adopt NDCG as our primary ranking metric,

$NDCG_n = Z_n \sum_{i=1}^{n} \frac{G_i}{\log_2(i + 1)}$  (2.3)

where $Z_n$ is a normalization factor used to make the NDCG of the ideal list equal to 1. We can use NDCG at different cutoff positions $n$ to evaluate the ranking results.
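As a small worked sketch of Eqs. 2.2 and 2.3, the snippet below assumes the common gain mapping $G_i = 2^{grade_i} - 1$ and a $\log_2(i + 1)$ discount; the exact gain function is not pinned down here, so treat these choices as illustrative:

```python
import math

def dcg(grades, n=None):
    """DCG_n = sum_i G_i / log2(i + 1), with gain G_i = 2**grade - 1 (illustrative choice)."""
    grades = grades[:n] if n else grades
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades, n=None):
    """NDCG_n = Z_n * DCG_n, where Z_n normalizes the ideally ordered list to 1."""
    ideal = dcg(sorted(grades, reverse=True), n)
    return dcg(grades, n) / ideal if ideal > 0 else 0.0

# Relevance grades of a ranked list (4 = perfect ... 0 = bad).
ranked = [3, 4, 2, 0, 1]
print(round(dcg(ranked, n=5), 3), round(ndcg(ranked, n=5), 3))
```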

We extend the learning-to-rank algorithm to news search, for which we mainly make two modifications due to the dynamic nature of news search: (1) training sample collection and (2) the editorial labeling guideline.

2.1.2.1 Training Sample Collection

For news search, the training sample collection has to happen in near real time, through the following steps:

1. Sample the latest queries from the news search query log.

2. Immediately get the candidate URLs for the sampled queries.

3. Immediately ask editors to judge the query-URL pairs with relevance and freshness grades.

We can see that all the steps need to be accomplished in a short period. Therefore, the training sample collection has to be well planned in advance; otherwise, any delay during this procedure would affect the reliability of the collected data. If queries are sampled from an outdated query log or if all of the selected candidate URLs are outdated, they cannot represent the real data distribution. If editors do not label query-URL pairs on time, it will be difficult for them to provide accurate judgments, because editors’ judgments rely on their good understanding of the related news events, which becomes more difficult as time elapses.

2.1.2.2 Editorial Labeling

In a news search, editors should provide query-URL grades on both traditional relevance and freshness. Although document age is usually available in news searches, it is impossible to determine a document’s freshness based solely on document age. A news article published one day ago could either be very fresh or very old, depending on the nature of the related news event. So, we ask editors to provide subjective judgments on document freshness, which can have a few different grades, such as the following:

• Very fresh: latest documents (promote grade: 1)

• Fresh (promote grade: 0)

• A little bit outdated (demote grade: −1)

• Totally outdated and useless (demote grade: −2)

For ranking model training, we combine the relevance grade and the freshness grade into a new grade that serves as the learning target. An example of such a grade combination is shown in the list above. If the document is very fresh, we promote it one grade from its original relevance grade. For example, if a document has a good grade on relevance and a very fresh grade on freshness, its final grade is excellent, which is one grade higher than its relevance grade, good. If the document is fresh, we neither promote nor demote. If the document is a little bit outdated, we demote it one grade from its original relevance grade. If the document is totally outdated and useless, we demote it two grades.
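A minimal sketch of this promotion/demotion scheme is shown below (the grade names follow the lists above; clipping at the two ends of the scale is our assumption):

```python
# Ordered relevance scale used for editorial judgments (Section 2.1.2).
RELEVANCE_GRADES = ["bad", "fair", "good", "excellent", "perfect"]

# Promotion/demotion offsets from the freshness judgment list above.
FRESHNESS_OFFSET = {
    "very fresh": +1,
    "fresh": 0,
    "a little bit outdated": -1,
    "totally outdated": -2,
}

def combined_grade(relevance: str, freshness: str) -> str:
    """Combine a relevance grade and a freshness grade into a single learning target."""
    idx = RELEVANCE_GRADES.index(relevance) + FRESHNESS_OFFSET[freshness]
    idx = max(0, min(idx, len(RELEVANCE_GRADES) - 1))  # clip to the valid grade range
    return RELEVANCE_GRADES[idx]

print(combined_grade("good", "very fresh"))            # excellent
print(combined_grade("excellent", "totally outdated")) # fair
```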

For ranking model evaluation, we can compute DCG values based on either the combined grades or the relevance grade and freshness grade separately.

To evaluate freshness in isolation, we also include a freshness metric based on DCG, called discounted cumulative freshness (DCF):

$DCF_n = \sum_{i=1}^{n} \frac{F_i}{\log_2(i + 1)}$  (2.4)

where $i$ is the position in the document list and $F_i$ is the freshness label (1 or 0). A query may have multiple very fresh documents; for example, multiple news sources may simultaneously publish updates to some ongoing news story. Note that DCF is a recency measurement that is independent of overall relevance. Therefore, when we evaluate a ranking, we should first consider the demoted NDCG, which represents the overall relevance, and then inspect the value of the DCF. We also define normalized discounted cumulative freshness (NDCF) in the same way as NDCG in Eq. 2.3.

2.2 Joint Learning Approach from Clickthroughs

Given the highly dynamic nature of news events and the sheer scale of news reported around the globe, it is often impractical for human editors to constantly keep track of the news and provide timely relevance and freshness judgments. To validate this conjecture, we asked editors to annotate the news query log for August 9, 2011, one day afterward, and then demoted the news search results based on their judgments using the method from Dong et al. [106]. The relation between the observed CTR and the demoted grades is visualized by a scatter plot in Figure 2.1.


FIGURE 2.1 Scatter plot of CTR versus editor’s relevance judgments.

From Figure 2.1, we observe that the clicks are not strictly correlated with the demoted grades; the average Pearson correlation between them across the queries is 0.5764 with a standard deviation 0.6401. The main reason for this inconsistency is the hard demotion rule: Users might have different demotion preferences for different queries, and it’s almost impossible for an editor to predefine the combination rules, given the plurality of possibilities. As a result, the uncertainty from this heuristically derived ranking grade will limit the performance of subsequent learning-to-rank algorithms.

Suppose that, when a user submits a query to a news search engine and gets a list of ranked news documents, she would first judge the usefulness of each document by her underlying sense of relevance and freshness and give it an overall impression grade based on her preference for relevance and freshness at that particular time. Once she has such impressions in mind, she would deliberately click the documents most interesting to her and skip all the others.

Inspired by this example, we proposed to model the users’ click behavior in news searches as a direct consequence of examining the relevance and freshness aspects of the returned documents. Besides, for different queries, users’ relative emphasis on these two aspects can vary substantially, reflecting the searching intention for the specific news event. Therefore, a good ranking function should be able to infer such a tradeoff and return the optimally combined ranking results for individual queries. However, we cannot explicitly obtain users’ relevance/freshness judgments and the preferences for these two aspects, since their evaluation process is not directly observable from the search engine. Fortunately, users’ click patterns are recorded, from which we can assume that the clicked documents are more meaningful to them than the unclicked ones [188]. Therefore, we model relevance and freshness as two latent factors and assume that a linear combination of these two, which is also latent, generates the observed click preferences.

To better determine the temporal property of the news documents and detect the recency preference imposed for the query, we design a set of novel temporal features from click statistics and content-analysis techniques. In the following sections, we introduce the proposed model and temporal features in detail.

2.2.1 Joint Relevance and Freshness Learning

The basic assumption of our proposed model is that a user's overall impression assessment, combining relevance and freshness, should be higher for the clicked URLs than for the unclicked ones, and such a combination is specific to the issued query. Therefore, our method falls into the pairwise learning-to-rank framework.

Formally, we have $N$ different queries, and for the $i$-th query $Q_i$ we observe $m_i$ different URL click preference pairs $(U_{ij_1} \succ U_{ij_2})$, in which $U_{ij_1}$ is clicked but $U_{ij_2}$ is not. We denote $\mathbf{x}^r_{ij}$ and $\mathbf{x}^f_{ij}$ as the relevance and freshness features for $U_{ij}$ under query $Q_i$, and $f_r(\mathbf{x}^r_{ij})$ and $f_f(\mathbf{x}^f_{ij})$ are the corresponding relevance and freshness scores for this URL given by the relevance model $f_r$ and freshness model $f_f$, respectively. In addition, we denote $\alpha_{Q_i}$ as the relative emphasis on the freshness aspect estimated by the query model $g$, i.e., $\alpha_{Q_i} = g(\mathbf{q}_i)$, based on the features $\mathbf{q}_i$ describing query $Q_i$. To make relevance/freshness scores comparable across all the URLs, we require them to be properly normalized. As a result, the user's latent assessment $S(Q_i, U_{ij})$ of URL $U_{ij}$ for a particular query $Q_i$ is assumed to be a linear combination of its relevance and freshness scores:

$S(Q_i, U_{ij}) = \alpha_{Q_i} \, f_f(\mathbf{x}^f_{ij}) + (1 - \alpha_{Q_i}) \, f_r(\mathbf{x}^r_{ij})$  (2.5)

Since we have observed the click preference $U_{ij_1} \succ U_{ij_2}$, we can safely conclude that $S(Q_i, U_{ij_1}) > S(Q_i, U_{ij_2})$.

Based on the previous discussion, we characterize the observed pairwise click preferences as the joint consequences of examining the combined relevance and freshness aspects of the URLs for the query. For a given collection of click logs, we are looking for a set of optimal models $(f_r, f_f, g)$ that can explain as many of the observed pairwise preferences as possible. As a result, we formalize this problem as an optimization task,

$\min_{f_r, f_f, g} \; \Omega(f_r, f_f, g) + C \sum_{i=1}^{N} \sum_{j=1}^{m_i} \xi_{ij}$  (2.6)

$\text{s.t. } S(Q_i, U_{ij_1}) \geq S(Q_i, U_{ij_2}) + 1 - \xi_{ij}, \quad \xi_{ij} \geq 0, \quad 0 \leq g(\mathbf{q}_i) \leq 1, \quad \forall i, j$  (2.7)

where $S(Q_i, U_{ij_1})$ and $S(Q_i, U_{ij_2})$ are defined in Eq. 2.5, the nonnegative slack variables $\xi_{ij}$ are introduced to account for noise in the clicks, $\Omega(f_r, f_f, g)$ is the functional norm (to be defined later) describing the complexity of the models, and $C$ is the tradeoff parameter between model complexity and training error.
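To make Eqs. 2.5 through 2.7 concrete, here is a minimal sketch, under the linear instantiation introduced below, of the query-weighted score and the relaxed hinge-style objective (the helper names are ours, the constraint $0 \leq \alpha \leq 1$ is not enforced here, and no solver is included):

```python
import numpy as np

def jrfl_score(w_r, w_f, w_q, x_r, x_f, q):
    """Latent assessment of Eq. 2.5: alpha * freshness + (1 - alpha) * relevance."""
    alpha = float(np.dot(w_q, q))            # query-specific freshness emphasis
    return alpha * np.dot(w_f, x_f) + (1.0 - alpha) * np.dot(w_r, x_r)

def pairwise_hinge_objective(w_r, w_f, w_q, preference_pairs, C=1.0):
    """Relaxed objective of Eqs. 2.6-2.7: regularizer + C * sum of hinge (slack) terms.

    preference_pairs: iterable of (q, (x_r_clicked, x_f_clicked), (x_r_skipped, x_f_skipped)).
    """
    reg = 0.5 * (np.dot(w_r, w_r) + np.dot(w_f, w_f) + np.dot(w_q, w_q))
    slack = 0.0
    for q, (xr1, xf1), (xr2, xf2) in preference_pairs:   # clicked URL listed first
        margin = jrfl_score(w_r, w_f, w_q, xr1, xf1, q) - \
                 jrfl_score(w_r, w_f, w_q, xr2, xf2, q)
        slack += max(0.0, 1.0 - margin)                  # xi_ij as a hinge on the margin
    return reg + C * slack
```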

Figure 2.2 depicts the intuition behind the proposed Joint Relevance and Freshness Learning (JRFL) model. From the figure, we can clearly see the difference between our proposed JRFL and classic pairwise learning-to-rank algorithms, e.g., RankSVM [184] and GBRank [406]. A classic pairwise learning-to-rank algorithm uses only one scoring function to account for all the observed click preferences, and different ranking criteria cannot be easily incorporated. Besides, even though the RankSVM model shares a similar objective function with JRFL, the two are still quite different: JRFL simultaneously learns a relevance model and a freshness model and utilizes a query-specific model to leverage these two aspects to explain the observed click patterns. Neither RankSVM nor GBRank deals with such query-specific multicriteria ranking. Besides, in previous studies [106,83], the relevance and freshness scores for each URL were already known, so the combination weight could be tuned directly for each type of query. In our problem, all those factors are latent and estimated automatically from the click logs.


FIGURE 2.2 Intuitive illustration of the proposed Joint Relevance and Freshness Learning (JRFL) model. The user-issued query and the corresponding returned URLs are represented by their features on the left part. Dashed-arrow lines in the middle indicate the assumed user’s judging process before she clicks. The check boxes on the right record the clicked URLs.

In our previous description, we did not specify the forms of the relevance model $f_r$, freshness model $f_f$, and query model $g$. Although many alternatives exist, we choose linear functions for all these models to simplify the exposition and derive efficient model estimation procedures. Other types of functions can also be employed, although numerical approximation is generally unavoidable when optimizing their model parameters.

Formally, we use three linear models:

$f_r(\mathbf{x}^r) = \mathbf{w}_r^{\top} \mathbf{x}^r$  (2.8)

$f_f(\mathbf{x}^f) = \mathbf{w}_f^{\top} \mathbf{x}^f$  (2.9)

$g(\mathbf{q}) = \mathbf{w}_q^{\top} \mathbf{q}$  (2.10)

where the bias factor $b$ of the linear functions is absorbed by introducing a constant dummy feature 1.

As a result, the proposed JRFL model defined in Eq. 2.6 can be instantiated as:

$\min_{\mathbf{w}_r, \mathbf{w}_f, \mathbf{w}_q} \; \frac{1}{2}\left(\|\mathbf{w}_r\|^2 + \|\mathbf{w}_f\|^2 + \|\mathbf{w}_q\|^2\right) + C \sum_{i=1}^{N} \sum_{j=1}^{m_i} \xi_{ij}$  (2.11)

$\text{s.t. } \mathbf{w}_q^{\top}\mathbf{q}_i \left(\mathbf{w}_f^{\top}\mathbf{x}^f_{ij_1} - \mathbf{w}_f^{\top}\mathbf{x}^f_{ij_2}\right) + \left(1 - \mathbf{w}_q^{\top}\mathbf{q}_i\right)\left(\mathbf{w}_r^{\top}\mathbf{x}^r_{ij_1} - \mathbf{w}_r^{\top}\mathbf{x}^r_{ij_2}\right) \geq 1 - \xi_{ij}, \quad \xi_{ij} \geq 0, \quad 0 \leq \mathbf{w}_q^{\top}\mathbf{q}_i \leq 1, \quad \forall i, j$

Thanks to the associative property of linear functions, the optimization problem defined in Eq. 2.11 can be divided into two subproblems: relevance/freshness model estimation and query model estimation. Each is a convex programming problem. (Note that Eq. 2.11 itself is nonconvex.) Therefore, we can utilize the coordinate descent algorithm to iteratively solve the two convex programs, as shown in Figure 2.3.
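The alternating structure of Figure 2.3 can be sketched as follows; the two solver callbacks are placeholders for the convex subproblems described above, and their exact implementations are not specified here:

```python
def coordinate_descent_jrfl(pairs, solve_rf_models, solve_query_model,
                            objective, max_iter=50, tol=1e-4):
    """Alternate between the two convex subproblems until the objective stabilizes.

    solve_rf_models(pairs, w_q)        -> (w_r, w_f)  given fixed query weights
    solve_query_model(pairs, w_r, w_f) -> w_q         given fixed relevance/freshness models
    objective(pairs, w_r, w_f, w_q)    -> float       value of Eq. 2.11
    """
    w_q = None                    # callers may interpret None as "initialize alpha in [0, 1]"
    prev = float("inf")
    for _ in range(max_iter):
        w_r, w_f = solve_rf_models(pairs, w_q)       # subproblem 1: relevance/freshness models
        w_q = solve_query_model(pairs, w_r, w_f)     # subproblem 2: query model
        cur = objective(pairs, w_r, w_f, w_q)
        if prev - cur < tol * abs(prev):             # relative convergence bound
            break
        prev = cur
    return w_r, w_f, w_q
```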


FIGURE 2.3 Coordinate descent for JRFL.

The interaction between the relevance/freshness models and the query model is made explicit in the optimization process: In relevance/freshness model estimation, the query-specific preference weight acts as a tuning factor that amplifies or attenuates the feature contributions of these two models to represent the search intention. Once we have the relevance/freshness models, we tune the query model to preserve as many click preferences as possible based on the current relevance/freshness predictions.

Since each update step is convex, our coordinate optimization is guaranteed to decrease the objective function in Eq. 2.11 monotonically and therefore converges to a local optimum.

2.2.2 Temporal Features

We use 95 basic text matching features, such as query terms matched in the title, matched positions in the document, and the source authority score, drawn from a subset of the features employed in the Yahoo! News search engine, as our URL relevance features. To capture the temporal properties of news documents and queries, we propose a set of novel time-sensitive features as our URL freshness features and query features, summarized in Table 2.1.

Table 2.1

Temporal features for URL freshness and query model.


2.2.2.1 URL Freshness Features

Publication age: The URL’s publication timestamp is used to identify the document’s freshness property.

However, for a news search, the freshness of news content is more important. Therefore, we propose to identify the given news document’s freshness quality from a content analysis perspective.

Story age: We use the regular expressions defined in [106] to extract the dates mentioned in the news content, calculate their distances to the given query within the document, and select the one with the minimal distance as the extracted story timestamp, from which we infer the corresponding story age.

Story coverage: Content coverage is an important characteristic of freshness for a news document. Newer stories should cover more content that has not been mentioned by previous reports. For a given query with a list of candidate news articles at a particular time, we first collect all the previous news articles associated with this query in our query log and build language models [199,400] for each of these documents. Then we separate those language models into three sets: models for documents published one day before, published two to five days before, and all the rest; and we treat the candidate URLs as queries to calculate the maximum generation probability given by all the models in each of these three sets.

Relative age t-dist(URL, Query): From a user’s perspective, since the news search engine has already displayed a ranked list of URLs, a document’s relative freshness within the returned list is more meaningful to the user. To capture this signal, we shift each URL’s publication age within the returned URL list by the mean value in this list and scale the result by the corresponding standard deviation.
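A small sketch of this within-list normalization (the function name and the toy ages are ours):

```python
import statistics

def relative_age(publication_ages):
    """Z-score each URL's publication age within the returned result list
    (the t-dist(URL, Query) feature described above)."""
    mean = statistics.mean(publication_ages)
    std = statistics.pstdev(publication_ages) or 1.0   # guard against a zero deviation
    return [(age - mean) / std for age in publication_ages]

# Ages (in hours) of the URLs returned for one query.
print(relative_age([2.0, 5.0, 26.0, 3.0]))
```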

2.2.2.2 Query Freshness Features

The query features are designed to capture the latent preference between relevance and freshness.

Query/User frequency q_prob(Query, t) and u_prob(User, t): The frequency of a query within a fixed time slot is a good indicator of a breaking-news query. We calculate the frequency of the query, and of the unique users who issued this query, within a time slot prior to the query time.

Frequency ratio q_ratio(Query, t) and u_ratio(User, t): The relative frequency ratio of a query within two consecutive time slots implies the change of users’ interest. A higher ratio indicates an increasing user interest in this news event.

Distribution entropy Ent(Query, t): The distribution of a query’s issuing time is a sign of breaking news; a burst of searching occurs when a particular news event happens. We utilize the entropy of the query issuing-time distribution to capture such burstiness. A multinomial distribution with a fixed bin size (e.g., 2 hours per bin) is employed to approximate the query’s temporal distribution within the day.

Average CTR CTR(Query, t): CTR is another signal representing the freshness preference of the query. When breaking news happens, people tend to click more returned URLs. We calculate the average CTR over all the associated URLs within a fixed time slot prior to the query time.

URL recency pub_mean(Query, d), pub_dev(Query, d), and pub_frq(Query, d): The recency of the URLs associated with the query can be treated as a good profile of this query’s freshness tendency. When the URLs associated with one particular query in a fixed period are mostly fresh, the query itself is highly likely to be a breaking-news query. We calculate the mean and standard deviation of the associated URLs’ publication age features and the frequency of the URLs created in that specific period.
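The sketch below illustrates a few of these query features (query frequency, frequency ratio, and issuing-time entropy); the 24-hour slot and 2-hour bins follow the settings mentioned in the text, while the function names and toy data are ours:

```python
import math
from collections import Counter

def query_frequency(issue_times, t, slot_hours=24):
    """q_prob(Query, t): number of times the query was issued in the slot before t."""
    return sum(1 for ts in issue_times if t - slot_hours <= ts < t)

def frequency_ratio(issue_times, t, slot_hours=24):
    """q_ratio(Query, t): frequency in the latest slot over the preceding slot."""
    current = query_frequency(issue_times, t, slot_hours)
    previous = query_frequency(issue_times, t - slot_hours, slot_hours)
    return current / previous if previous else float(current)

def issue_time_entropy(issue_times, bin_hours=2):
    """Ent(Query, t): entropy of the query's issuing-time distribution within a day;
    a low entropy (bursty distribution) suggests a breaking-news query."""
    bins = Counter(int((ts % 24) // bin_hours) for ts in issue_times)
    total = sum(bins.values())
    return -sum((c / total) * math.log2(c / total) for c in bins.values())

times = [9.1, 9.4, 9.5, 10.0, 10.2, 22.5]   # hours at which the query was issued
print(query_frequency(times, t=24), frequency_ratio(times, t=24), issue_time_entropy(times))
```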

2.2.3 Experiment Results

This section validates our JRFL model empirically with large-scale click datasets and editorial annotations. We begin by describing the datasets used.

2.2.3.1 Datasets

2.2.3.2 Click Datasets

We collected real search sessions from the Yahoo! News search engine over a two-month period, from late May to late July 2011. To unbiasedly compare different ranking algorithms, we also set up a random bucket to collect exploration clicks from a small portion of traffic at the same time. In this random bucket, the top four URLs were randomly shuffled and displayed to the real users. By doing such random shuffling, we were able to collect user click feedback on each document without positional bias; such feedback can be thought of as a reliable proxy on information utility of documents [218]. As a result, we collected only the top four URLs from this random bucket.

In addition, we also asked editors to annotate the relevance and freshness in the August 9, 2011, query log immediately one day after, according to the editorial guidance given by Dong et al. [106].

Simple preprocessing is applied on these click datasets: (1) filtering out the sessions without clicks, since they are useless for either training or testing in our experiments; (2) discarding the URLs whose publication time is after the query’s issuing time (caused by errors from news sources); and (3) discarding sessions with fewer than two URLs.

2.2.3.3 Preference Pair Selection

We decided to train our model on the normal click data because such clicks are easier to collect without hurting the search engine’s performance. However, this kind of click is known to be heavily position biased [8]. To reduce this bias for training, we followed Joachims et al.’s method to extract preferences from clicks [188]. In particular, we employed two click heuristics:

1. “Click > Skip Above”: For a ranked URL list $(u_1, u_2, \ldots, u_n)$ and a set $C$ containing the clicked URLs, extract a preference pair $u_i \succ u_j$ for all pairs with $u_i \in C$, $u_j \notin C$, and $j < i$.

2. “Click > Skip Next”: For a ranked URL list $(u_1, u_2, \ldots, u_n)$ and a set $C$ containing the clicked URLs, extract a preference pair $u_i \succ u_{i+1}$ for all $u_i \in C$ and $u_{i+1} \notin C$.

In addition, to filter out noisy and conflicting preferences, we defined three rules: (1) filter out the preference pairs appearing fewer than five times; (2) calculate Pearson’s $\chi^2$ value [65] on all the pairs and order them according to their $\chi^2$ value; and (3) if both $u_i \succ u_j$ and $u_j \succ u_i$ are extracted, discard the one with the smaller $\chi^2$ value.
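The two extraction heuristics above can be sketched as follows (the $\chi^2$-based filtering is omitted, and the function names and toy session are ours):

```python
def extract_preferences(ranked_urls, clicked):
    """Extract pairwise preferences from one search session.

    ranked_urls : URLs in the order they were displayed
    clicked     : set of URLs the user clicked
    Returns (preferred, skipped) pairs under the two heuristics described above.
    """
    pairs = set()
    for i, url in enumerate(ranked_urls):
        if url not in clicked:
            continue
        # "Click > Skip Above": clicked URL beats every unclicked URL ranked above it.
        for above in ranked_urls[:i]:
            if above not in clicked:
                pairs.add((url, above))
        # "Click > Skip Next": clicked URL beats the unclicked URL immediately below it.
        if i + 1 < len(ranked_urls) and ranked_urls[i + 1] not in clicked:
            pairs.add((url, ranked_urls[i + 1]))
    return pairs

session = ["u1", "u2", "u3", "u4"]
print(extract_preferences(session, clicked={"u3"}))   # {('u3','u1'), ('u3','u2'), ('u3','u4')}
```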

After these selection steps, we were able to keep the top 150,000 preference pairs from a portion of the normal clicks collected as described in Section 2.2.3.2. In addition, for testing purposes, we randomly selected 500,000 query-URL pairs from the original normal click set (not including the URLs used for generating the click preference pairs) and 500,000 from the random bucket clicks. To guarantee the quality of the freshness annotation, we asked the editors to finish the annotation in one day. As a result, we have only about 13,000 query-URL pairs annotated out of that day’s query log. As a summary, we list the datasets for our experiments in Table 2.2.

Table 2.2

Evaluation corpus.


2.2.3.4 Temporal Feature Implementation

For most of our temporal features, we had to specify a time slot for the implementation; for example, the Query/User frequencies q_prob(Query, t) and u_prob(User, t) are both calculated within a predefined time slot. In the following experiments, we set this time slot to 24 hours, and accordingly all the necessary statistics were collected from this time window. Once the features were generated, we linearly scaled each of them into the range [−1, 1] to normalize them.

2.2.3.5 Baselines and Evaluation Metrics

Since the proposed JRFL model works in a pairwise learning-to-rank manner, we employed two classic pairwise learning-to-rank algorithms, RankSVM [184] and GBRank [406], as our baseline methods. Because these two algorithms do not explicitly model the relevance and freshness aspects for ranking, we fed them the concatenation of all our URL relevance/freshness and query features. In addition, to compare the models trained on clicks with those trained on editorial judgments, we also used Dong et al.’s freshness-demotion-trained GBRank model [106] as a baseline, denoted “FreshDem.”

To quantitatively compare the retrieval performance of the different ranking algorithms, we employed a set of standard evaluation metrics in information retrieval. In the click data, we treat all the clicked URLs as relevant and calculate the corresponding Precision at 1 (P@1), Precision at 2 (P@2), Mean Average Precision at 3 (MAP@3), Mean Average Precision at 4 (MAP@4), and Mean Reciprocal Rank (MRR). Definitions of these metrics can be found in standard texts (e.g., [21]). On the editorial annotation dataset, we treated the grade “Good” and above as relevant for the precision-based metrics, and we used discounted cumulative gain (DCG) [176] as an evaluation metric.

2.2.4 Analysis of JRFL

2.2.4.1 Convergence

We first demonstrate the convergence of the coordinate descent algorithm for the JRFL model as described in Figure 2.3, which is a necessary condition for applying the proposed model to real ranking problems. We randomly divided the training preference pairs into two sets, one with 90,000 pairs for training and the rest with 60,000 for testing. We fixed the tradeoff parameter $C$ in JRFL (we also tried other settings for this parameter; a smaller $C$ requires fewer iterations to converge, but the tendency of convergence is the same), the relative convergence bound, and the maximum number of iterations of the coordinate descent algorithm. To study whether the algorithm’s convergence is sensitive to the initial state, we tried three starting points: (1) fixing the initial query weights to 1 (freshness only); (2) fixing them to 0 (relevance only); and (3) setting them uniformly at random between 0 and 1, by directly setting all the query weights accordingly at step 0. We visualize the training process by illustrating the updating trace of the objective function defined in Eq. 2.11, the pairwise error rate on both the training and testing sets, and the mean/standard deviation of the updated query weights during the iterative optimization in Figure 2.4.


FIGURE 2.4 Convergence of coordinate descent for JRFL. (a) Objective function value update. (b) Pairwise error rate (PER) update. (c) Query weight $\alpha$ update.

As demonstrated in Figure 2.4, the proposed model converges during the coordinate descent optimization process, and such convergence does not depend on the initial state. From Figure 2.4(c), we can observe that the optimal query weight setting for this training set is usually around 0.5. Hence, the random initialization converges fastest compared with the two other settings, since the initial state it provides is closest to this optimal setting.

In addition, it should be emphasized that, although the average of the converged query weights is less than 0.5, this does not necessarily indicate that freshness is less important than relevance in general for the news search task, because the scales of the outputs of our freshness and relevance models may not be comparable. Therefore, only the order among the queries given by such learned query weights represents their relative emphasis on the relevance and freshness aspects.

Another phenomenon we observe in Figure 2.4(a) and (b) is that even though the objective function decreased quickly after the first several iterations, the pairwise error rate needed more iterations to reach its optimal value. Furthermore, during these updates, there were some inconsistent updates between the objective function value and the pairwise error rate; whereas the objective function value always decreased with more iterations, the pairwise error rate did not show the same monotonic behavior. This inconsistency is expected because our objective function in Eq. 2.11 is a relaxed one; we do not directly minimize the pairwise error rate, which is computationally intractable, but instead try to reduce the prediction gap between the misordered pairs.

Table 2.3 gives the top three positive and negative features from the newly proposed URL freshness features and query features, ordered by the learned weights. The weights in the linear model reflect the features’ relative contribution to the final ranking decision. Because we have only six URL freshness features, and the learned weights for them are all negative, we only list the top three negative ones in this table.

Table 2.3

Feature weights learned by JRFL.


The weights learned by the corresponding models are reasonable and consistent with our design: For the URL freshness features, the smaller the values of Publication age, Story coverage, and Relative age t-dist(URL, Query) are, the more recent the news article is; and for the query features, the larger the values of query frequency q_prob(Query, t) and URL recency pub_frq(Query, d), and the smaller the values of Distribution entropy Ent(Query, t), URL recency pub_mean(Query, d), and pub_dev(Query, d) are, the more users and news reports start to focus on this event, and therefore the more important the freshness aspect becomes.

2.2.4.2 Relevance and Freshness Learning

Since our JRFL model does not rely on explicit relevance/freshness annotations, it is important to evaluate how well our relevance and freshness models can estimate each aspect separately. We used the relevance and freshness annotations on the August 9, 2011, query log separately as the test bed and utilized two GBRank models trained on Dong et al.’s relevance and freshness annotation datasets accordingly (44,641 query-URL pairs) [106]. Because the dataset of [106] does not contain the corresponding click information, those two GBRank models were trained without the new query features. Our JRFL was trained on all the extracted click preference pairs.

We observe mixed results in Table 2.4: The relevance and freshness modules inside JRFL have worse ranking performance than the purely relevance/freshness-trained GBRank models at the top positions (lower P@1 and MAP@3), but they have similar cumulative performance, i.e., DCG@5, for both aspects. The reason for this result is that the JRFL model has to account for the tradeoff between relevance and freshness imposed by the queries during training. The most relevant or recent documents might not be treated as good training examples if their other aspects were not desirable. However, the purely relevance/freshness-trained GBRank models do not have such constraints and can derive patterns to put the most relevant/recent documents at the top positions separately. As a result, those two GBRank models’ performance can be interpreted as upper bounds for each individual ranking criterion (freshness/relevance) on this dataset. When we reach the lower positions, those two types of ranking algorithms give users quite similar utility, i.e., under the DCG@5 metric. In addition, we want to emphasize that such a result is already very encouraging, since the JRFL model successfully infers relevance and freshness solely from the clicks, which confirms the soundness of our assumption in this work that users’ click behavior is the joint consequence of examining the relevance and freshness of a news article for the given query. By properly modeling such relationships, we can estimate relevance and freshness from the clickthroughs to a certain extent.

Table 2.4

Performance of individual relevance and freshness estimations.


2.2.4.3 Query Weight Analysis

There is no direct way for us to evaluate the correctness of the inferred query weights, since such information is not observable in the search log. Therefore, in this experiment, we investigate it indirectly. As discussed previously, the order among the queries given by such weights reflects each query’s relative emphasis on the freshness aspect. Therefore, we ranked the queries in our training set according to the learned weights and list the top 10 (freshness-driven) and bottom 10 (relevance-driven) queries in Table 2.5.

Table 2.5

Query intention analysis by the inferred query weight.

Freshness Driven

7-Jun-2011, china
6-Jul-2011, casey anthony trial
24-Jun-2011, nba draft 2011
28-Jun-2011, libya
9-Jun-2011, iran
6-Jun-2011, pakistan
13-Jun-2011, lebron james
29-Jun-2011, greece
27-May-2011, joplin missing
6-Jun-2011, sarah palin

Relevance Driven

5-Jul-2011, casey anthony trial summary
9-Jul-2011, nascar qualifying results
8-Jul-2011, burbank 100 years parade
10-Jul-2011, gas prices summer 2011
10-Jul-2011, bafta film awards 2011
2-Jul-2011, green lantern cast
9-Jul-2011, 2011 usga open leaderboard
3-Jul-2011, lake mead water level july 2011
5-Jul-2011, caylee anthony autopsy report
4-Jul-2011, aurora colorado fireworks 2011

At first glance, it may be surprising to note that most of the top-ranked freshness-driven queries are the names of countries and celebrities, e.g., “iran,” “libya,” and “lebron james.” But this is also quite reasonable: such queries are ambiguous, since there are many candidate news reports related to different aspects of these queries. When users issue such “ambiguous” queries in news search engines, they are most likely interested in the recent updates about these countries and celebrities. We went back to check the most-clicked URLs of these queries, and the clicks confirmed our assumption: The most-clicked URLs for the query “libya” were about the recent progress of the Libyan war; news articles covering the latest diplomatic affairs between the United States and Iran got the most clicks for the query “iran.”

Another interesting finding from the learned query weights is that the relevance-driven queries are much longer than the freshness-driven queries. To validate this observation, we selected the top 500 relevance-driven and the top 500 freshness-driven queries and calculated the corresponding query length distributions, compared to the length distribution of all the queries; Table 2.6 shows the results. These results are consistent with our intuition: When users are seeking specific information, they tend to put more constraints (i.e., longer queries) on their information need; in contrast, when users are making a recency search, they usually do not have a predetermined mindset about the events, so they often issue broad queries (i.e., shorter queries) about entities of their interest to see what is happening recently to those entities. Our query weight model is consistent with this intuition and is able to distinguish these two typical searching scenarios. In addition, this also suggests that query length is a good feature to indicate the preference for the freshness aspect.

Table 2.6

Query length distribution under different query categories.


2.2.5 Ranking Performance

To validate the effectiveness of the proposed JRFL model in real news search tasks, we quantitatively compare it with all our baseline methods on random bucket clicks, normal clicks, and editorial judgments. All the click-based learning algorithms are trained on all 150,000 click preferences. Since all these models have several parameters to be tuned (e.g., the tradeoff parameter $C$ in both RankSVM and JRFL), we report their best performance on the corresponding testing set according to the MAP@4 metric in the following results and perform a t-test to validate the significance of the improvement (against the second-best performance).

First, we performed the comparison on the random bucket clicks, because such clicks are more trustworthy than normal clicks due to the removal of positional biases.

From the results in Table 2.7, we find that the proposed JRFL model achieves encouraging improvement over the second-best GBRank model, especially on MAP@4, where the relative improvement is over 5.88%. This improvement confirms that properly integrating relevance and freshness can indeed improve the user’s search satisfaction.

Table 2.7

Comparison of random bucket clicks.


We then perform the same comparison on the normal click set, where we observe similar improvement of JRFL over the other ranking methods, as shown in Table 2.8. Besides, from the results on both click datasets, we can clearly observe that the click-preference-trained models generally outperform the freshness-demotion-trained GBRank model: JRFL’s improvement in P@1 against the freshness-demotion-trained GBRank model is 16.2% and 58.6% on the random bucket click set and the normal click set, respectively.

Table 2.8

Comparison of normal clicks.


On the editorial annotation dataset, to compare different models’ performance, we mapped the separately annotated relevance and freshness grades into one single grade by the freshness demotion strategy in Dong et al.’s work [106].

From this result, we find that the freshness-demotion-trained GBRank model did not achieve the best performance on such “grade-demoted” testing sets either. This might be caused by the time gap between the different annotations: the annotations of [106] were generated more than one year earlier (February to May 2009). The out-of-date annotations might contain query-document relations that are inconsistent with the new annotations. Besides, we also notice that JRFL’s margin of improvement becomes smaller compared to the click-based evaluations. In the following, we perform some case studies to find the reasons for the diminished improvement (see Table 2.9).

Table 2.9

Comparison of editorial annotations.


In Table 2.10, we illustrate one case of inferior ranking result for the query “afghanistan” from JRFL. We list the top four ranked results from JRFL, together with the editorial relevance and freshness grades. (We have had to truncate some of the URLs because they are too long to be displayed.)

Table 2.10

Case study: Degenerated ranking results by JRFL for query “afghanistan.”

URL | Relevance | Freshness

http://www.cbsnews.com/video/watch/?id=7376057nXXX | Good | Excellent

http://news.yahoo.com/afghanistan-helicopter-crash-why-army-used-chinook-half-180000528.html | Excellent | Excellent

http://news.yahoo.com/whatever-happened-civilian-surge-afghanistan-035607164.html | Excellent | Excellent

http://www.msnbc.msn.com/id/44055633/ns/world_news-south_and_central_asia/XXX | Perfect | Excellent

The freshness weight inferred by JRFL for this query is 0.7694, which is biased toward the freshness aspect. However, all these URLs’ freshness grades are “Excellent,” so that, in the demoted final grades, the “ground-truth” ranking depends only on the relevance aspect. The predicted relevance score differences get diminished by this biased freshness weight in JRFL: the JRFL-predicted relevance score difference between the best document in the “ground truth” (last row in the table) and in JRFL’s ordering (first row in the table) is 0.44, whereas the corresponding freshness score difference is -0.31. As a result, JRFL gives an arguably “bad” ranking result for this query.

In addition, we want to revisit the relationship between the predicted orders given by JRFL and CTR, as we did in Figure 2.1. This time we draw the scatter plot between the JRFL-predicted ranking scores and CTRs on the same set of URLs as in Figure 2.1; the result is shown in Figure 2.5.

The monotonic relationship between the predicted ranking and CTRs is much more evident than the one given by the demoted grades: URLs with lower CTRs concentrate more densely in the area with lower prediction scores, and the average Pearson correlation between the predicted ranking score and CTR across all the queries is 0.7163, with standard deviation 0.1673, compared to the average of 0.5764 and standard deviation of 0.6401 in the demoted grades (see Figure 2.5).


FIGURE 2.5 Scatter plot of CTR versus JRFL’s prediction.

In this work, we proposed a joint learning framework, Joint Relevance and Freshness Learning, for modeling topical relevance and freshness, as well as the query-specific relative preference between these two aspects, based on the clickthroughs for the news search task. Experiments on large-scale query logs and editorial annotations validate the effectiveness of the proposed learning method. We instantiated the proposed joint learning framework only with linear models, but many alternatives exist. It would be meaningful to employ other types of nonlinear functions to enhance JRFL. For example, using logistic functions can naturally avoid the range constraints over query weights in optimization. Besides, in our current setting, the preference between relevance and freshness is assumed to be only query-dependent. It would be interesting to extend this to user-dependent, i.e., personalized, search. By defining a proper set of user-related features, or profiles, the proposed JRFL can be easily applied in such user-centric retrieval environments. Moreover, the proposed model can also be flexibly extended to other retrieval scenarios where the usefulness judgment goes beyond pure topical relevance, such as opinions in blog searches and distance in location searches.

2.3 News Clustering

Clustering of search results provides a unified view of the search results by grouping together similar documents. This allows the user to examine all the related categories of the search results without having to go through hundreds of items. Clustering becomes more important in the context of a domain such as news because there could be thousands of related articles for a given query. Most of these news articles are related, however, and if we group the articles in terms of related stories, then the number quickly reduces to a handful. This makes it easier for the user to browse the search results in terms of news stories rather than individual news articles. Furthermore, news sites re-publish articles from different sources to provide comprehensive coverage of a particular topic. This further exacerbates the problem by increasing the redundancy in the search results. Instead of leaving the search result organization to the user, a clustered representation of search results provides an overview to explore a topic.

News search also has a recency component, which introduces additional complexities. Even if two articles are very related in terms of document content, if they are far apart on the time axis they might correspond to a repetitive event, such as a California earthquake. Even though such events occur multiple times, they correspond to different clusters, because they are totally independent events and the user may be interested in looking at them individually.

Most of the work on search result clustering has focused on salient feature extraction and on utilizing this information to cluster the search results [153,56]. However, salient features do not provide the complete picture of the document. Sometimes the most important information about a news article is buried deep in the document body. Later we provide experimental evidence showing that the body of the document provides a great boost in the performance of the clustering system. Other approaches [153,56] utilize standard clustering algorithms such as hierarchical agglomerative clustering (HAC) and partitional algorithms such as K-means. In [396], the clustering problem is modeled as phrase extraction from documents, followed by a suffix tree-based clustering algorithm that groups documents containing common phrases. However, we show that the query information is vital in clustering the search results, and we also show how to utilize this information in the clustering algorithm.

Other research related to clustering the search results focuses on utilizing various matrix factorization techniques [272,271] such as singular value decomposition, nonnegative matrix factorization [211], local nonnegative matrix factorization [220], and concept decomposition [97]. These clustering algorithms focus on extracting quality descriptions for the clusters by finding the labels with matrix factorization on the word features from snippets of search results. Although the descriptions may be important in Web search results clustering, they do not play a vital role in news search results, since the most representative document can be used to show the summary of the cluster. Another algorithm called DisCover [207] clusters search results by maximizing the coverage and distinctiveness of the clusters.

Another line of work is related to using named entities to browse the clusters of search results [349]. However, this work mainly focuses on extracting the named entities from search results and organizing the results into clusters using the named entities. Other related work [135] identifies discriminative, ambiguous, and common terms by constructing a relational graph of terms and using them to cluster Web search results.

It is clear that the named entities that occur in the body of the documents are valuable for producing quality clusters of search results. However, it is expensive to process entire documents to cluster the search results in real time, since doing so impacts the user experience. In this work we propose a scalable clustering technique that can provide fast execution time without sacrificing accuracy by utilizing the body features in the form of offline clusters, which provide a prior on the relatedness of the documents in the search results. The offline clusters are obtained by a process that is run offline. We also handle incremental clustering for documents that arrive continuously.

In fact, the offline clusters that are obtained by clustering the entire corpus provide a useful representation, and the search results can be organized into clusters by using this information alone. Some of the existing work exactly follows these methodologies. However, such organization would have a fixed granularity of clustering, which may not be desirable for all queries. Some queries can be broad and require a higher level of clustering hierarchy; other queries can be very specific and require a fine level of clustering hierarchy. If the offline clusters alone are used to organize the search results, the granularity of the clustering cannot be adjusted according to query. However, the solution we provide overcomes this problem by applying a clustering algorithm on the search results in real time by utilizing the offline clusters as features.

The contributions of our work are twofold:

• We provide a unified framework for scalable online clustering and detail the architecture of the system that describes each of the three components: offline clustering, incremental clustering, and real-time clustering.

• We propose novel techniques to cluster search results in real time by incorporating the query information in the clustering algorithm itself.

The section that follows presents the overall architecture of our system. Sections 2.3.2–2.3.4 discuss the various clustering algorithms corresponding to offline, incremental, and real-time clustering that we used in this work. Section 2.3.5 presents the experimental results for our algorithms.

2.3.1 Architecture of the System

To cluster the search results in real time, we need just the top-ranked search results. However, the user would like to browse all the related articles in relation to a particular news story, which might not be covered in the top-ranked news results. To address this issue, we rely on offline clustering that clusters the entire news corpus. We utilize the offline clusters as an additional source of information in real-time clustering, which helps its performance as well. However, this offline batch processing of documents, especially news articles, does not work efficiently, because news articles arrive continuously every second. To address this issue, we also incorporate incremental clustering solutions to address the streaming data clustering problem. Figure 2.6 shows the architecture of our system that describes individual components that address these needs.


FIGURE 2.6 The architecture of the news search result clustering.

The offline batch clustering algorithm will be run on a regular basis, multiple times a day, on the news corpus, which we limit to a full month of news articles. This clustering algorithm assigns a cluster ID to every document that is presented to the algorithm. The incremental clustering algorithm works in a streaming fashion, assigning cluster IDs to documents that arrive in the interim before the next offline batch clustering is run. This is a continuous process that assigns cluster IDs to documents based on the existing batch clustering output. Thus each document in the corpus at any given time will have at least a cluster ID from either the offline clustering algorithm or the incremental clustering algorithm. The real-time clustering algorithm, which is run when a query is issued, is executed at runtime utilizing these cluster IDs for each of the documents in the corpus. This real-time clustering groups the search results into clusters and provides the final presentation of clustered search results.

The following sections describe the offline clustering, incremental clustering, and real-time clustering components in detail.

2.3.2 Offline Clustering

Our offline clustering system draws from the work of [338]. In this method, the similarity between all pairs of articles is first computed via locality-sensitive hashing (LSH). While we also use the LSH method to determine the pairwise similarities, we then construct a similarity graph on the corpus, wherein a pair of articles has an edge if the similarity between the two meets a threshold. A correlation clustering algorithm is then run on top of the similarity graph to produce the final clustering. Note that it is essential to define a good similarity measure while constructing the graph. False edges cause different story clusters to be merged, whereas missing edges cause stories to be split. Figure 2.7 shows the main components of our clustering system. Let us now look in detail at the design of each component.


FIGURE 2.7 Overview of the offline clustering system.

2.3.2.1 Feature Vector Generation

The good performance of our system is heavily dependent on the underlying similarity function that is used to construct the similarity graph. As mentioned earlier, a poorly designed similarity function can either merge different stories or split a story into many clusters. Note that we use the term story to represent the set of news articles that are about the same news story and hence should belong to the same cluster. News articles have several important information sources associated with them that can be used to define a custom similarity function. The different types of features that were extracted are:

Term frequency-inverse document frequency (TF-IDF). A unigram-based feature vector was constructed using the TF-IDF values for the words in a news article after stop-word removal and stemming.

Wikipedia topics. Wikipedia topics were extracted from the news article using the technique described in [389]. The set of Wikipedia topics was then assigned an “aboutness score,” which represents how important that topic is to the article. The ranked list was then used as features, with the feature values corresponding to the aboutness scores.

Part-of-speech tagging. The news article was also tagged with a part-of-speech tagger, and unigrams were extracted from nouns, adjectives, and verbs and used as features. Term frequencies were used as feature values.

In addition to these feature vectors, we also made use of presentation cues associated with the article to emphasize certain phrases or unigrams, such as the fact that a phrase appears in the title or abstract or is italicized. Thus, the different features mentioned previously were assigned a score based on their presentation in the news article. The features from the three different channels were then combined through a simple aggregation of weights assigned to unigrams from each channel. The feature vector was then unit normalized before being used to compute the cosine similarity. Another important feature that was used is time. News articles typically have a timestamp associated with them. Given two articles published on days $t_1$ and $t_2$, the cosine similarity on the custom feature space was weighted by a decay factor that decreases as the gap $|t_1 - t_2|$ grows. The intuition behind this weighting function is that the closer the dates of publication of the two articles, the more likely they are to be similar. Since we believe that a story cluster typically should not contain any articles that are more than a week apart, we decrease the similarity between such pairs even more.
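As a rough illustration of this time weighting, the sketch below multiplies the content similarity by an exponential decay in the publication gap, with an extra penalty beyond seven days; the decay shape, the constant tau, and the penalty value are our illustrative assumptions, not the exact weighting used by the system:

```python
import math

def time_weighted_similarity(cosine_sim, day1, day2, tau=3.0, week_penalty=0.5):
    """Down-weight the content similarity of two articles by their publication gap.

    The exponential decay and the extra penalty for gaps longer than a week are
    illustrative choices mirroring the intuition described in the text.
    """
    gap = abs(day1 - day2)
    weight = math.exp(-gap / tau)
    if gap > 7:
        weight *= week_penalty     # articles more than a week apart rarely share a story
    return cosine_sim * weight

print(time_weighted_similarity(0.8, day1=0, day2=1))    # mild penalty
print(time_weighted_similarity(0.8, day1=0, day2=10))   # heavy penalty
```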

Although it is trivial to compute the feature spaces and compare all pairs of articles within a small set, such pairwise comparison becomes computationally expensive for larger corpora: a corpus with 100,000 articles requires on the order of 10 billion such comparisons. However, once an article has been mapped into its feature space, the chances of a pair of completely unrelated articles sharing any useful features are quite low, so it is unnecessary to explicitly compute the similarity between such pairs. We use LSH to eliminate these unnecessary similarity computations. An important component of the LSH method is the generation of Minhash signatures for each article.

2.3.2.2 Minhash Signature Generation

Min-wise independent permutations locality-sensitive hashing (Minhash) signatures are a succinct representation of each article, computed such that the probability that two articles agree on a given Minhash value is equal to the Jaccard similarity between the two [171]. To compute the Minhash signature for each article, we first represent it as a feature vector in the custom feature space described in Section 2.3.2.1. We then construct a length-100 Minhash signature using the technique detailed in [338]. However, the randomized nature of the Minhash generation method requires further checks to increase the chances of uncovering all pairs of related articles and of removing articles that were brought together by chance; we therefore resort to LSH to reduce chance pairings. Prior to performing LSH, the Minhash signatures can also be used to quickly detect exact duplicates.
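A minimal sketch of Minhash signature generation, assuming the common (a·x + b) mod p family of hash functions; the hash family and seeding used in the production system are not specified in the text.

```python
import random

def minhash_signature(features, num_hashes=100, seed=42, prime=2_147_483_647):
    """Length-100 Minhash signature over a set of article features.
    Each slot is the minimum of one random hash function applied to all
    features; two articles match on a slot with probability equal to the
    Jaccard similarity of their feature sets."""
    rng = random.Random(seed)
    hashers = [(rng.randrange(1, prime), rng.randrange(0, prime))
               for _ in range(num_hashes)]
    # Note: Python's built-in hash() of strings is process-dependent; a
    # stable hash would be used in a real system.
    return [min((a * hash(f) + b) % prime for f in features)
            for a, b in hashers]
```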

2.3.2.3 Duplicate Detection

Given each article and its length-100 Minhash signature, we use these signatures to identify articles that are duplicates of each other. If an article is a duplicate of another, it is unnecessary to include both in the clustering process; one article per group of duplicates is sufficient to determine the cluster memberships of the rest. Given that a typical news corpus has a number of duplicates, owing to different publishers reprinting the same story from a news agency such as the Associated Press, identifying such groups and using only the unique documents for clustering provides significant computational savings. If two articles have the same Minhash value in each of the 100 slots, they are highly likely to be duplicates and are marked as such. Thus, the articles are grouped by their Minhash signatures and a representative is chosen from each group to participate in the clustering process. Once the representative article has been assigned to a cluster, we propagate that cluster membership to all of its duplicates as a post-processing step after clustering.
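The grouping and label-propagation steps can be sketched as follows; grouping by the full signature in a dictionary is an illustrative implementation choice, not necessarily how the production system stores its duplicate groups.

```python
from collections import defaultdict

def group_duplicates(article_ids, signatures):
    """Group articles whose full Minhash signatures are identical and pick
    one representative per group to participate in clustering."""
    groups = defaultdict(list)
    for article_id, sig in zip(article_ids, signatures):
        groups[tuple(sig)].append(article_id)
    representatives = [members[0] for members in groups.values()]
    return representatives, groups

def propagate_labels(cluster_of, groups):
    """After clustering, copy each representative's cluster id to every
    duplicate in its group."""
    labels = {}
    for members in groups.values():
        rep = members[0]
        for article_id in members:
            labels[article_id] = cluster_of[rep]
    return labels
```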

2.3.2.4 Locality-Sensitive Hashing

Although the Minhash technique allows us to represent each article by a length-100 signature and enables duplicate detection, we would still need an O(n²) comparison over all pairs of articles to determine their similarities. However, as mentioned earlier, documents that are unrelated are likely to have a very low similarity, the value of which need not be explicitly computed. Locality-sensitive hashing can be used to quickly eliminate pairs of articles that share very few features from the pairwise similarity computation. We use the method detailed in [338] for LSH. For each article we construct a shorter LSH signature by concatenating a small set (2) of Minhash values; this process is repeated 200 times. Thus, documents that have at least a few words in common are likely to agree on at least one of the LSH signatures, and only those pairs of articles with at least one common LSH signature need to have their similarity computed. The particular settings of our LSH method, 200 LSH signatures of length 2, were obtained by offline experimentation on a smaller dataset; these settings enabled us to uncover image of all pairs of similar documents, compared with a full-blown pairwise comparison. Pairwise similarity is computed on all the documents sharing an LSH signature. Note that the cosine similarity is computed on the unit-normalized vectors represented in the custom feature space and not on the Minhash signatures. The cosine similarity thus computed is further weighted with the time information, as explained in Section 2.3.2.1. Only those pairs of articles whose similarity exceeds a user-defined threshold are recorded. A similarity graph is then constructed from the output of the LSH process: each article is represented as a node in the graph, and an edge exists between two nodes if their similarity exceeds the threshold. The edges are also weighted by the cosine similarity measure.
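A sketch of the banding step under the stated settings (200 LSH keys, each formed from 2 Minhash slots). How the slots are selected for each key is not specified in the chapter, so random sampling with a fixed seed is assumed here for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, num_bands=200, band_size=2, seed=0):
    """Emit article pairs that collide on at least one LSH key.
    `signatures` maps article_id -> length-100 Minhash signature."""
    rng = random.Random(seed)
    num_slots = len(next(iter(signatures.values())))
    # Each LSH key concatenates `band_size` Minhash slots; repeated 200 times.
    bands = [tuple(rng.sample(range(num_slots), band_size))
             for _ in range(num_bands)]
    candidates = set()
    for slots in bands:
        buckets = defaultdict(list)
        for article_id, sig in signatures.items():
            key = tuple(sig[s] for s in slots)
            buckets[key].append(article_id)
        for members in buckets.values():
            for a, b in combinations(sorted(members), 2):
                candidates.add((a, b))
    return candidates
```

Only the candidate pairs returned here would then have their time-weighted cosine similarity computed explicitly.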

2.3.2.5 Correlation Clustering

The similarity graph is then fed into a correlation clustering algorithm that partitions the graph into clusters. Correlation clustering is a randomized algorithm that attempts to minimize a cost function based on the number of dissimilar pairs in the same cluster and the number of similar pairs in different clusters. We modified the original algorithm so that the weights on the edges that are cut or formed as clustering proceeds participate in the cost function. The algorithm is sensitive to the choice of initial data point, so we run the correlation clustering algorithm multiple times with different random seeds and identify the run with the lowest cost as the final clustering solution. An important characteristic of the correlation clustering approach is that it does not require the number of clusters to be specified. This feature is important because it is not easy to guess the number of clusters in an evolving corpus, in which a major news event can trigger multiple stories over a few days.
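The following sketch illustrates a randomized pivot-style correlation clustering with weighted cut edges and multiple restarts, in the spirit of the description above. The exact cost function and pivoting rule of the production system are not given, so the unit penalty for dissimilar pairs placed in the same cluster is an assumption.

```python
import random

def correlation_cluster(nodes, weighted_edges, trials=10, seed=0):
    """Randomized pivot-based correlation clustering with restarts.
    `weighted_edges` maps frozenset({u, v}) -> similarity weight for pairs
    whose similarity exceeded the threshold."""
    def one_run(rng):
        unassigned = set(nodes)
        clusters = []
        while unassigned:
            pivot = rng.choice(sorted(unassigned))
            cluster = {pivot} | {u for u in unassigned
                                 if frozenset((pivot, u)) in weighted_edges}
            clusters.append(cluster)
            unassigned -= cluster
        return clusters

    def cost(clusters):
        member = {u: i for i, c in enumerate(clusters) for u in c}
        total = 0.0
        # Similar pairs split across clusters pay their edge weight.
        for pair, w in weighted_edges.items():
            u, v = tuple(pair)
            if member[u] != member[v]:
                total += w
        # Dissimilar pairs (no edge) inside a cluster pay a unit penalty (assumed).
        for c in clusters:
            items = sorted(c)
            for i in range(len(items)):
                for j in range(i + 1, len(items)):
                    if frozenset((items[i], items[j])) not in weighted_edges:
                        total += 1.0
        return total

    rng = random.Random(seed)
    runs = [one_run(rng) for _ in range(trials)]
    return min(runs, key=cost)
```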

2.3.2.6 Evaluation

To evaluate the performance of the offline clustering system in terms of the quality of the clusters produced, editorial tests were conducted. We first constructed a corpus of news articles collected over a week; the size of this corpus is approximately 700,000 articles. Pairwise similarities were then computed between all pairs of articles in the corpus. Roughly 5,000 article pairs were sampled by stratified random sampling, with the stratification done on the similarity measure. Pairs of articles were then presented to the editors and labeled as follows:

Must-link. Includes pairs of articles that are duplicates, pairs in which one article summarizes or paraphrases the content in the other, and pairs covering the same news event but with different text.

Maybe-linked. When the two articles are about the same set of entities but in two different news stories, the pair is labeled maybe-linked.

Cannot-link. When the two articles in the pair are about unrelated news stories, the pair is labeled cannot-link.

The clustering system was then run on the same corpus, and for each label we computed the fraction of times the pairs of articles appeared in the same cluster. Note that this number should be high for must-links and low for cannot-links. The maybe-linked label is the gray area where it is not entirely wrong to show the two articles in the same cluster, but for a better user experience one would want to keep this number low as well. The results of the clustering system are shown in Table 2.11. As we can see, the offline clustering system performs well.

Table 2.11

Performance of offline clustering system.

image

2.3.3 Incremental Clustering

The offline clustering phase produces a set of clusters of similar/relevant documents. These clusters are then taken as ground truth for constructing a classifier for incremental clustering. Incremental clustering refers to the task of assigning a new document that has just arrived at the system to the cluster it is most likely to be associated with.

Since offline clustering is usually scheduled to run at intervals of a couple of hours, it is likely that news breaking after the offline clustering phase does not belong to any of the existing clusters. Here we describe three simple classifiers with strategies to reduce the potential impact of this scenario:

Static. A standard classifier that must be retrained on the latest set of documents at a shorter time interval.

Semiadaptive. A classifier that is capable of creating a new class for news articles that are not “close enough” to any of the existing clusters. The closeness here is a value that requires careful tuning.

Fully adaptive. A classifier that is not only capable of creating new classes but also of updating itself to cope with the evolution of the news in the existing clusters. In addition, the classifier is able to remove classes corresponding to “submerging” stories. Compared to the semiadaptive classifier, this classifier is more flexible but also more likely to suffer from suboptimality because it is sensitive to the order in which news articles arrive at the system.2

The three types of classifiers we have described roughly cover the whole spectrum of incremental clustering. The choice of a specific classifier depends on the computational and time constraints we are confined to; as long as offline clustering can be carried out frequently enough, the static classifier is perhaps the best choice, since it is simple to implement, easy to deploy, and requires no maintenance effort. Otherwise, the semi- or fully adaptive classifiers, which are harder to maintain, can be used instead.
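As a concrete illustration, the semiadaptive strategy can be sketched as a nearest-centroid assigner that opens a new cluster whenever the best similarity falls below a closeness threshold. The threshold value and the centroid representation here are assumptions for illustration, not the production classifier.

```python
import numpy as np

class SemiAdaptiveAssigner:
    """Minimal sketch of a semiadaptive incremental classifier: assign an
    incoming article to the closest offline cluster centroid, or open a new
    cluster if no centroid is close enough."""

    def __init__(self, centroids, threshold=0.3):
        # `centroids` maps integer cluster_id -> unit-normalized feature vector.
        self.centroids = dict(centroids)
        self.threshold = threshold  # closeness value that requires careful tuning
        self._next_id = max(centroids, default=-1) + 1

    def assign(self, vector):
        vector = vector / (np.linalg.norm(vector) or 1.0)
        best_id, best_sim = None, -1.0
        for cid, centroid in self.centroids.items():
            sim = float(vector @ centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_sim >= self.threshold:
            return best_id
        # Not close enough to any existing story: create a new cluster.
        new_id = self._next_id
        self._next_id += 1
        self.centroids[new_id] = vector
        return new_id
```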

2.3.4 Real-Time Clustering

It is vital to adjust the granularity of clustering at query time. For example, the query “earthquake” in a news search engine returns clusters of news stories, where each cluster represents a news story discussing an earthquake that occurred in a particular location. Thus, all the results related to a Chile earthquake may be grouped into a single cluster. However, a related query such as “chile earthquake” might return detailed news stories that are all related to a particular Chile earthquake. In this case, the news stories may discuss the damage caused by the earthquake, donation-related information, and the incredible force of the earthquake, all of which depict various aspects of the “Chile earthquake.”

Thus, adjusting the granularity in real-time clustering is very important, and we propose three novel techniques to handle this task, described in the following sections. Each methodology shows how to modify the similarity measure in a standard clustering algorithm used to cluster the top documents retrieved by a news search engine. In our experiments, we used hierarchical agglomerative clustering (HAC) to incorporate these similarity measures and compared the proposed similarity measures against the standard cosine similarity.
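For reference, a minimal HAC sketch that accepts any precomputed similarity matrix, so that the measures proposed below can be plugged in. Average linkage and the cut level are illustrative choices, not the chapter's exact configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def hac_from_similarity(sim_matrix, cut=0.6):
    """Run hierarchical agglomerative clustering on a precomputed similarity
    matrix by converting it to distances and cutting the dendrogram."""
    dist = 1.0 - np.asarray(sim_matrix, dtype=float)
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method='average')
    return fcluster(tree, t=cut, criterion='distance')
```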

2.3.4.1 Meta Clustering and Textual Matching

This approach relies on the output of the offline clustering and on textual matching features between the query and the documents. The similarity measure in this clustering can be formulated as:

image

where

image is the number of offline clustering algorithms

image is the weight for the clustering algorithm image

image if the clustering algorithm image puts image and image in the same cluster

image if the clustering algorithm image puts image and image in different clusters

image is the overlap of documents image and image

The first term in this equation corresponds to the offline clustering weights. This is similar to ensemble cluster learning, or meta clustering, which is a simple combination of the various clustering solutions from the offline clustering assignments. The second term relies on textual matching features between the query and the documents. We use BM25 [292] as a proxy for textual matching between the query and the document. Given a query image and a document image, a BM25 score can be computed as:

image

where image is the usual relevance weight of the term image in the document, which is often the inverse document frequency; image is the frequency of the term image in the document image; and image is the weighted term frequency, which can be further broken into fields that can, in turn, have individual weights. The term frequency is normalized by the document length using image, the average document length in the corpus, so that long and short documents are treated in the same manner. image and image are parameters of this score, for which we used standard weights in our experimental setting.
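A sketch of the two ingredients, with illustrative symbol names: a standard Okapi BM25 scorer whose usual defaults k1 = 1.2 and b = 0.75 stand in for the “standard weights,” and a combined similarity in which the offline-clustering agreement term is added to a textual-matching term derived from the two documents' BM25 scores for the query. How the two BM25 scores are combined is not spelled out in the chapter, so the product used below is an assumption.

```python
def bm25(query_terms, doc_terms, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Okapi BM25 score of a document for a query. `doc_terms` maps term ->
    frequency in the document; `idf` maps term -> relevance weight."""
    score = 0.0
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    for term in query_terms:
        tf = doc_terms.get(term, 0)
        if tf:
            score += idf.get(term, 0.0) * tf * (k1 + 1.0) / (tf + norm)
    return score

def meta_similarity(d_i, d_j, query, offline_labels, weights,
                    bm25_scores, alpha=0.5):
    """Combined similarity: weighted agreement of the offline clusterings plus
    a textual-matching term. `offline_labels` is a list of dicts mapping
    document -> cluster id, one per offline clustering algorithm; the product
    of the two BM25 scores and the value of alpha are assumptions."""
    agree = sum(w * (labels[d_i] == labels[d_j])
                for labels, w in zip(offline_labels, weights))
    text = bm25_scores[(query, d_i)] * bm25_scores[(query, d_j)]
    return agree + alpha * text
```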

2.3.4.2 Contextual Query-Based Term Weighting

In the QrySim methodology, we modify the weights of the term vectors to account for the context around the query in the document. We want to weight terms that appear closer to the query more heavily than terms that appear farther from it.

Our hypothesis is that terms occurring close to the query terms are more important in real-time, query-dependent clustering than those occurring far from them. A piece of anecdotal evidence illustrates this. Consider the query “earthquake,” where the user might be interested in recent earthquakes; the important terms for clustering in this case would be locations such as “Haiti,” “Chile,” and “Japan,” all of which occur close to the term “earthquake.” However, if the query is “Chile earthquake,” the important terms for clustering might be different, such as “rescue,” “death toll,” and “donations.” Finding the representative terms for each query might be difficult, but a good approximation is to look at the context around the query.

We first construct the query position vector for each document, which lists the positions where the entire query occurs contiguously in the document. We experimented with various n-grams of the query (unigrams and bigrams) as a proxy for the entire query to increase coverage, but we found that using the entire query to build the position vector works well in practice. With this position vector, we parse each document and assign the weight for each term based on the distance of the term from the closest query occurrence, i.e.,

image

where image is the original frequency/term weight of the term t and image is the closest distance of the term t to the query occurrence in the document.

With these new term weights, we construct a new term vector for each document and use it in computing the similarity measure for clustering, giving higher weight to the terms that appear closer to the query terms.
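A minimal sketch of the QrySim reweighting, assuming a reciprocal decay 1/(1 + distance); the chapter states only that the new weight depends on the closest distance to a query occurrence, not the exact functional form.

```python
def query_positions(doc_tokens, query_tokens):
    """Positions where the entire query occurs contiguously in the document."""
    n = len(query_tokens)
    return [i for i in range(len(doc_tokens) - n + 1)
            if doc_tokens[i:i + n] == query_tokens]

def qrysim_weights(doc_tokens, query_tokens, base_weights):
    """Reweight each term by its distance to the nearest query occurrence;
    the 1 / (1 + distance) decay is an assumed form for illustration."""
    positions = query_positions(doc_tokens, query_tokens)
    if not positions:
        return dict(base_weights)
    reweighted = {}
    for i, term in enumerate(doc_tokens):
        dist = min(abs(i - p) for p in positions)
        boost = 1.0 / (1.0 + dist)
        reweighted[term] = max(reweighted.get(term, 0.0),
                               base_weights.get(term, 1.0) * boost)
    return reweighted
```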

2.3.4.3 Offline Clusters as Features

Although the body provides vital information to the clustering algorithm, the feature space increases dramatically if all the body features are used in the real-time clustering algorithm. Since the real-time clustering algorithm must be executed at runtime, after the query is issued, this poses latency issues: the clustering algorithm needs to compute a similarity measure between documents represented by this huge feature vector. Thus, it is expensive to use all the body features in the real-time clustering algorithm. To address this problem, we propose to utilize the offline cluster IDs as additional features for the online clustering algorithm. In computing the similarity measure for real-time clustering, these offline cluster IDs can be used to determine the closeness of two documents based on whether they have similar offline cluster IDs. For this purpose, a document can have multiple IDs, coming either from different algorithms or from the same algorithm with different granularity settings, such as coarse and fine. We use standard HAC for the real-time clustering algorithm with the standard cosine similarity on the regular term features, and we use the Jaccard similarity to compute the similarity between the offline cluster IDs.

Thus the final similarity measure between two documents in this setting is as follows:

image

where image is the cosine similarity computed on the bag of words, image is the Jaccard similarity measure between the vectors of offline cluster IDs for the two documents, and image is the tradeoff parameter between the two similarity measures, which we set to image in our experiments. The Jaccard similarity between two documents can be computed as

image

where image and image are the vectors of offline cluster IDs of the two documents, which correspond to output from various clustering algorithms or from the same clustering algorithm with various granularity settings.
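The combined measure can be sketched as follows; the tradeoff value lam = 0.5 is a placeholder, since the value used in the experiments is not reproduced here.

```python
def jaccard(ids_a, ids_b):
    """Jaccard similarity between two sets of offline cluster IDs."""
    a, b = set(ids_a), set(ids_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def combined_similarity(vec_a, vec_b, ids_a, ids_b, lam=0.5):
    """Cosine on the term bag-of-words blended with Jaccard over the offline
    cluster IDs; lam trades off the two components."""
    dot = sum(v * vec_b.get(t, 0.0) for t, v in vec_a.items())
    norm_a = sum(v * v for v in vec_a.values()) ** 0.5
    norm_b = sum(v * v for v in vec_b.values()) ** 0.5
    cos = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return lam * cos + (1.0 - lam) * jaccard(ids_a, ids_b)
```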

2.3.4.4 Performance Analysis

In this section, we provide a short analysis estimating the number of computational operations required when the entire set of body features is used, compared with using just the offline clusters as an approximation. We performed this analysis on an editorially evaluated dataset containing image queries and approximately image news documents. The clustering features contain words from three sections of the news documents: title, abstract, and body. The number of unique body features is image times more than the number of features contained in the title and abstract combined. We therefore observe a far greater number of unique features in the body, which increases the total number of operations required to compute the similarity between documents if the body features are included. Hence we gain image savings in the number of operations if we do not use the body features and instead use the offline clusters as a proxy for them.

2.3.5 Experiments

In this section we present the experimental evaluation of our techniques on the Yahoo! news corpus and search engine. The results evaluate the end product of the entire system described here, which is real-time clustering. Thus we present several experimental evaluations to determine the quality of the clustering output on the search results for a given query. In the following subsections, we first describe the experimental data we used, then present various evaluation metrics, and finally show results evaluating the techniques we described.

2.3.5.1 Experimental Setup

We scrape the Yahoo! News search engine with a list of random queries sampled from the query log and collect the top image news search results returned by the ranking algorithm. Each of these search results corresponds to an individual news article. Our goal is to group these search results into clusters that refer to related news stories. Since the news articles arrive through feeds from news sources, we have access to the individual fields, such as publication time, URL, title, abstract, and body of the news article. We extract several features from these news articles, including simple term features, unigram and bigram features, and named entity-based features. To extract named entities from the text of news articles, we use a dictionary-based maximum-entropy tagger [29].

All of the search results for a given query are editorially grouped into related clusters by going through them incrementally, one by one. For example, the first document is assigned to cluster A. If the second document is similar to cluster A, it too is assigned to cluster A; if it is not similar, it is assigned to a new cluster, B. In this manner, all the top-image results are grouped into clusters. We define a cluster to be a news story; that is, the documents in the same cluster refer to the same news story. If the story is a developing story, we require the related news articles to be in the same cluster unless they discuss a significantly different news story. With this criterion for clustering the news articles, all documents within the same cluster correspond to the same news story from various news sources, such as The New York Times and The Wall Street Journal.

The offline clustering algorithms were run on the entire corpus of news documents within a one-month timeframe, comprising millions of news articles that correspond to U.S./English documents. Offline clustering algorithms utilize image features comprising unigram and bigram features and named entity-based features. For real-time clustering evaluation, we editorially labeled top-image results into clusters for a set of image queries. The features used for real-time clustering algorithms were chosen to be a subset of the features used in offline clustering algorithms, comprising image features.

2.3.5.2 Evaluation Metrics

We utilize the following extrinsic metrics, which compare the performance of a given clustering solution to the editorial solution, to evaluate our clustering algorithms (a short computational sketch of the purity- and pair-based metrics follows this list):

Precision. If image is the number of clusters to be evaluated, image is the number of categories (from editorial judgments), and image is the total number of documents (image per query), precision can be computed as the weighted average of maximal precision values:

image

Recall. Recall, also referred to as inverse purity, focuses on the cluster with maximum recall for each category. Recall can be defined as:

image

F-measure. The standard F-measure can be computed by weighting the precision and recall values equally.

image This is an information-theoretic validity measure that characterizes how well the cluster labels assigned to the objects by a clustering algorithm agree with their (manually assigned) class labels, via the number of bits required to encode/compress the class labels of the objects given knowledge of their cluster labels. This code length corresponds to a special encoding scheme inspired by the minimum description length principle [288,144]. The form of the measure that gives the bits per object for this encoding is image. The measure form image is simply a normalized form of image, designed to vary between zero (no information about class labels contained in cluster labels) and one (perfect clustering).

image

where image denotes the multinomial coefficient

image

These measures are refinements of the measures image and image derived in [104]. These refinements remove a form of redundancy known as incompleteness from image and image. The details of the refinements are discussed in [105]. We chose to include image in our set of clustering-accuracy metrics because it satisfies a set of desirable properties, which are not all satisfied by more traditionally used metrics based on pairwise agreement/disagreement. These properties are listed in [104], and the fact that they are not all satisfied by traditional measures is demonstrated there as well.

Jaccard. This is a simple Jaccard coefficient computed on pairs of documents: the number of pairs of documents that are supposed to be together and that the algorithm actually put together, divided by the number of pairs that are grouped together by either the algorithm or the editors.

image

where SS is the number of pairs that belong to the same algorithmic cluster and editorial class, SD is the number of pairs that belong to the same algorithmic cluster and different editorial class, and DS is the number of pairs that belong to a different algorithmic cluster and the same editorial class.

Rand. This statistic relies on counting pairs of documents and their assignments into appropriate clusters and categories. Essentially, it is the ratio of the number of pairs of documents that were correctly assigned to the total number of pairs of documents.

image
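As referenced above, here is a short computational sketch of the purity-based and pair-counting metrics (precision, recall, F-measure, Jaccard, and Rand). The information-theoretic measure is omitted because its encoding scheme is specified in the cited references [104,105].

```python
from collections import Counter
from itertools import combinations

def extrinsic_metrics(clusters, categories):
    """Compare a clustering to editorial categories. Both arguments map each
    document id to a cluster id / category id, respectively."""
    docs = list(clusters)
    n = len(docs)
    by_cluster = Counter(clusters.values())
    by_category = Counter(categories.values())
    joint = Counter((clusters[d], categories[d]) for d in docs)

    # Purity-style precision and inverse-purity recall (weighted by size).
    precision = sum(max(c for (k, _), c in joint.items() if k == cid)
                    for cid in by_cluster) / n
    recall = sum(max(c for (_, g), c in joint.items() if g == cat)
                 for cat in by_category) / n
    f_measure = 2 * precision * recall / (precision + recall)

    # Pair counts: same/different algorithmic cluster vs. editorial class.
    ss = sd = ds = dd = 0
    for a, b in combinations(docs, 2):
        same_cluster = clusters[a] == clusters[b]
        same_category = categories[a] == categories[b]
        if same_cluster and same_category:
            ss += 1
        elif same_cluster:
            sd += 1
        elif same_category:
            ds += 1
        else:
            dd += 1
    jaccard = ss / (ss + sd + ds) if (ss + sd + ds) else 1.0
    rand = (ss + dd) / (ss + sd + ds + dd)
    return {"precision": precision, "recall": recall, "f": f_measure,
            "jaccard": jaccard, "rand": rand}
```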

2.3.5.3 Evaluating Meta Clustering and Textual Matching

Table 2.12 shows the performance of various offline clustering algorithms applied directly to the editorial dataset for real-time clustering. It can be seen that the best offline clustering algorithm is K-means with image clusters, which achieves an F-measure of image. A meta-clustering algorithm that simply combines the various clustering algorithms achieves the same performance as the best offline algorithm, whereas the variant that utilizes the query information improves on this result because it incorporates additional information in the form of the query.

Table 2.12

Results with various simple offline clustering algorithms and the real-time clustering algorithm, which includes the meta-clustering algorithm.

image

2.3.5.4 Results with QrySim

Table 2.13 shows results with the QrySim method described in Section 2.3.4.2. HAC is used as the clustering algorithm for these results. We compared the standard cosine similarity with the QrySim similarity measure, which emphasizes the weights of terms that are close to the query. The QrySim method clearly outperforms the standard similarity measure, as shown in the results.

Table 2.13

Real-time clustering results with the QrySim similarity measure, which boosts the weights of terms that occur close to the query, compared with the standard similarity measure (OrigSim), which weights all terms equally.

image

2.3.5.5 Results with Offline Clusters as Features

Next, we present experimental results evaluating the efficacy of using offline clusters as features in real-time clustering. In this setting, we used the Minhash offline clustering algorithm to cluster a large corpus of news documents and used HAC to cluster the top-100 search results, with the offline cluster IDs as features.

The baseline feature set includes simple features such as bag of words, and we experimented with three variations of the Minhash clustering algorithm with various numbers of hash functions for the offline clustering algorithm as additional features. We also utilize the structured fields in the news articles, such as title, abstract, and body, as individual features. The results are shown in Table 2.14.

Table 2.14

image values for the real-time clustering algorithm with various combinations of features. The baselines include features from the title and abstract plus a single offline clustering algorithm. Although the combined feature set with all the features is the best, the feature set combining the offline clusters with the title and abstract features is comparable to those that include body features.

image

We also experimented with various granularity settings of the offline clustering algorithm described in Section 2.3.2 and their usefulness as features in the real-time clustering algorithm. Table 2.15 shows the results for various combinations of these granularity settings as features in the real-time clustering algorithm. The results show that redundancy across the clustering configurations helps achieve maximum accuracy in real-time clustering. We also observe a trend that accuracy decreases as we move from coarse to fine granularity in the clustering algorithm.

Table 2.15

image values with the real-time clustering algorithm and various granularity settings of offline clusters as features. The baseline feature set includes just title and abstract features. The numbers 1, 2, 3 refer to different settings of the offline clustering algorithm at different granularity settings, specifically varying from coarse to fine representation of clusters. It can be observed that the best accuracy is obtained by combining all the configurations, and individual cluster IDs themselves provide inferior performance.

image

The experimental results indicate that meta clustering, which combines various offline clustering algorithms, is as good as the best single clustering algorithm, and that real-time query-based clustering can be further improved by utilizing the query information. We also showed how to use the offline cluster information in real-time clustering by treating the cluster IDs as features; this improves both the accuracy and the efficiency of the system.

2.4 Summary

This chapter introduced algorithms for both the ranking and clustering problems in news search engines. For the ranking problem, although we can follow the learning-to-rank framework, it is important to account for the freshness factor, which plays a critical role in user experience. We covered several unique aspects of this task, such as data collection, data sampling, editorial judging, and evaluation metrics. We also examined a real user search log in depth to gain a better understanding of user behavior and to improve modeling. For the clustering problem, we presented a system that clusters the entire news corpus with an offline clustering algorithm and handles incoming streaming data with an incremental clustering algorithm. The output from the offline and incremental clustering algorithms is then utilized to improve the real-time query-based clustering.


1 According to Yahoo!’s business rules, the reported metrics are normalized; therefore, only the relative improvements are meaningful.

2 This cause of suboptimality is also commonly seen in online learning.