# Relevance Ranking for Vertical Search Engines, FIRST EDITION (2014)

### Chapter 9. Cross-Vertical Search Ranking

**Abstract**

A traditional Web search engine conducts ranking mainly in a single domain, i.e., it focuses on one type of data source, and effective modeling relies on a sufficiently large number of labeled examples, which require an expensive and time-consuming labeling process. On the other side, it is very common for a vertical search engine to conduct ranking tasks in various verticals, thus presenting a more challenging ranking problem: cross-domain ranking. Although in this book our focus is on cross-vertical ranking, the proposed approaches can be applied to more general cases, such as cross-language ranking. Therefore, we use a more general term, cross-domain ranking, in this book. For cross-domain ranking, in some domains we may have a relatively large amount of training data, whereas in other domains we can only collect very little. Therefore finding ways to leverage labeled information from related heterogeneous domains to improve ranking in a target domain has become a problem of great interest. In this chapter, we propose a novel probabilistic model, the pairwise cross-domain factor (PCDF) model, to address this problem. The proposed model learns latent factors (features) for multidomain data in partially overlapped heterogeneous feature spaces. It is capable of learning homogeneous feature correlation, heterogeneous feature correlation, and pairwise preference correlation for cross-domain knowledge transfer. We also derive two PCDF variations to address two important special cases. Under the PCDF model, we derive a stochastic gradient-based algorithm, which facilitates distributed optimization and is flexible to adopt various loss functions and regularization functions to accommodate different data distributions.The extensive experiments on real-world datasets demonstrate the effectiveness of the proposed model and algorithm.

Keywords

**Cross-vertical ranking**

**homogeneous transfer ranking**

**heterogeneous transfer ranking**

**source domain**

**target domain**

**pairwise cross-domain factor model**

**stochastic gradient descent**

**Introduction**

Learning to rank represents an important class of supervised machine learning tasks with the goal of automatically constructing ranking functions from training data. As with many other supervised machine learning problems, the quality of a ranking function is highly correlated with the amount of labeled data used to train the function. Due to the complexity of many ranking problems, a large number of labeled training examples are usually required to learn a high-quality ranking function. In general, it is very expensive and time-consuming to acquire labeled data.

On the other hand, for modern vertical search engines, ranking over multiple related verticals or domains becomes a common situation. In some domains we may have a relatively large amount of training data, whereas in some other domains we can only collect very little. In those situations, making use of labeled data from related domains is a desirable direction to take to address the data scarcity in the target domain.

Besides ranking applications, this learning scenario is also popular for other applications; it has been studied as transfer learning in the literature. Existing transfer learning approaches mainly focus on knowledge transfer in the same feature space, i.e., the data from different domains are assumed in a common feature space. However, in practice, we often face the problem where the labeled data are scarce in their own feature space, whereas there may be a large amount of labeled heterogeneous data in another feature space. In fact, this problem arises frequently in vertical search systems. For example, a vertical search engine often conducts ranking learning tasks in various verticals (e.g., news search, blog search, product search, etc.); here, data from the *news* vertical may be helpful for blog search; these data usually exist in different feature spaces that are vertical-dependent. In international vertical search systems, there are also different language domains (e.g., English news search, Spanish news search, etc.). Here, data from an English language domain may be helpful for a Spanish language domain. However, the data usually exist in different feature spaces that are language dependent. Therefore, for a wide range of vertical search systems, it would be desirable to transfer the knowledge from heterogeneous domains to a target domain where we have relatively little training data available, since doing so provides a scalable and efficient solution for ranking learning for multiple verticals.

For homogeneous transfer learning, since the data are in a common feature space, the main challenge is to overcome the data distribution difference to learn domain correlation for knowledge transfer. On the other hand, for heterogeneous transfer learning, the domain difference is beyond distribution difference, since distributions from heterogeneous spaces are not even comparable. Therefore, in general, heterogeneous transfer learning is more challenging than homogenous transfer learning.

When it comes to ranking, the problem becomes *heterogeneous transfer ranking*, which is even more challenging due to the following facts. First, unlike in classification or regression, in ranking the labels (relevance scores) for different domains may not be comparable. For example, a domain can have five grade relevance scores; another domain can have binary relevance scores. In fact, since the relevance scores may be query-dependent, the absolute values are not important; the preference order between instances is more important. Therefore, heterogeneous transfer ranking needs to catch correlations between preference orders from different domains instead of the traditional label correlations in classification and regression. Second, in general, a ranking application needs thousands of (or millions of) training examples. It is important to develop a method that can scale well to large datasets.

In this chapter, we propose a general probabilistic model, the *pairwise cross-domain factor* (PCDF) model *[419]*, for heterogeneous transfer ranking. The PCDF model is a feature-based transfer ranking model that learns common latent factors (features) to transfer knowledge across multiple heterogeneous ranking domains. PCDF assumes that (1) homogeneous features, heterogeneous features, and hidden relevance scores are generated conditioned on latent factors, and (2) preference orders are generated conditioned on hidden relevance scores. Through direct and indirect parameter sharing for the generative processes in different domains, the latent factors catch different types of domain correlations to extract the common knowledge for different domains.

**9.1 The PCDF Model**

**9.1.1 Problem Formulation**

For ease of exposition and to avoid notational clutter, we use the terms *target domain* and *source domain* to distinguish two given domains in a transfer learning task, though the discussions in this study are applicable to the situation that two domains are exchangeable and mutually helpful and can be easily extended to multiple domains.

We begin with notations. We consider that the target domain data exist in a dimension space and the source domain data exist in a dimension space, where is the number of dimensions for their overlapped feature space (denoted as ), and and are the number of dimensions for their dedicated feature spaces (denoted as and ), respectively. For traditional homogeneous transfer learning, all data are in the same feature space, i.e., and . For totally heterogeneous transfer learning, the feature spaces for the different domains have no overlap, i.e., . In this study, we consider the most general case, partially overlapped heterogeneous feature spaces, which arises frequently in real applications.

We let and denote the numbers of instances in the target domain and the source domain, respectively. We let

denote the target domain data, where denote the target domain data in its dedicated feature space, and denote target domain data in the common feature space. Similarly,

denotes the source domain data, where denote the source domain data in its dedicated feature space, and denote the source domain data in the overlapped feature space. To denote the -th data instance in the target domain or source domain, we use or .

Furthermore, we let denote the preference value between -th instance and -th instance in the target domain, such that

**(9.1)**

where (however, it is not necessary that there is for any pair of and in the data). In general . In the special case that only the preference order matters, . Note that even though the data are given with relevance labels or lists of ordered instances, they can be easily converted to the pairwise preferences. Similarly, we let denote the preference value between the -th instance and the -th instance in the source domain.

In heterogeneous transfer ranking, given target domain data , and and source domain data , , and , there are three types of domain difference to impede us from directly applying target domain data to the target domain: different feature distributions in the shared feature space; different dedicated feature spaces; and different distributions for pairwise preference values.

On the other hand, we need to catch three types of domain correlations for knowledge transfer. The first one is homogeneous feature correlation hidden in the overlapped feature space, i.e., correlation between and , which is the focus of traditional homogeneous transfer learning. The second one is heterogeneous feature correlation hidden in the dedicated feature spaces, i.e., correlation between and . The third one is preference correlation, i.e., the correlation between and .

**9.1.2 Model Formulation**

In this study we consider a generative model, in which features and pairwise preferences are generated conditioned on the latent variables.

**9.1.2.1 Feature Generating**

First we assume that the two domains’ features are generated conditioned on the common latent factors with maximum domain correlations and minimum domain differences. In heterogenous transfer ranking, there are two types of feature correlations between the two domains, and intuitively it is difficult to catch them in one type of latent factor. Based on this observation, we propose the concept two-component latent factor. The two-component latent factors for the two domains are given as follows

**(9.2)**

and

**(9.3)**

where is the dimension of latent factors for the common features, and is the dimensions of latent factors for the dedicated features. In the target domain latent factor , the component is to catch heterogeneous feature correlation and the component is to catch homogeneous feature correlation.

Then the common features in the overlapped feature space are generated according to the following probabilities

**(9.4)**

**(9.5)**

where is a link function and is the function parameter. Through the shared parameter , the latent factors and catch common knowledge from the two domains’ common features.

However, for the dedicated features and , since they are in the different feature spaces, in general direct knowledge transfer by sharing the link function parameter is not feasible. For example, if and have different dimensions and we use the popular linear link function such that , then it is not feasible for a shared parameter matrix to make the latent factor and have the same dimension . In other words, the parameter sharing is too strong an assumption for the heterogeneous features. On the other hand, it is more reasonable to learn the heterogeneous feature correlation indirectly through the interaction between different types of latent factors and interaction between latent factors and pairwise preferences (discussed in more detail later). Therefore, we assume that the dedicated features are generated with their own link function parameters as follows:

**(9.6)**

**(9.7)**

where is a link function and and are the function parameters.

Furthermore, we assume prior distributions for , and to reduce overfitting,

**(9.8)**

**(9.9)**

**(9.10)**

**9.1.2.2 Generating Pairwise Preferences**

To generate another observed variable, the pairwise preference score , we propose the concept of latent relevance score. We assume that each instance has a latent relevance score such that comparing a pair of and will give thebetween instance and .

In other words, we assume that is generated conditioned on and for both domains

**(9.11)**

**(9.12)**

where is a link function. An intuitive choice for is the difference function, i.e., . For example, we can assume that the distribution of is the normal distribution with the difference of and as the mean such that

**(9.13)**

We further assume that the latent relevance score is generated conditioned on the latent factor:

**(9.14)**

**(9.15)**

where h (;) is a link function, is the function parameter, and and are defined as in Eqs. *9.2* and *9.3*. Therefore, through the latent relevance scores and the shared parameter , the latent factor is able to catch the pairwise preference correlation in addition to homogeneous feature correlation and heterogeneous feature correlation.

Similarly, we assume prior distributions for to reduce overfitting,

**(9.16)**

**9.1.2.3 PCDF and Its Variations**

*Figure 9.1* summarizes the PCDF mode as a Bayesian network. From *Figure 9.1*, we can observe that the two domains share two parameters and , which are bridges of knowledge transfer. Through the information propagation among the Bayesian network, the two-component latent factors catch the three types of domain correlations, i.e., the common knowledge crosses the two domains.

**FIGURE 9.1** A PCDF Bayesian network.

As a general model for heterogenous transfer ranking, PCDF can be easily modified to accommodate special cases.

First, it can be easily applied to homogeneous transfer ranking, i.e., two domains in a shared feature space. *Figure 9.2* shows the PCDF model for homogenous data. From *Figure 9.2*, we observe that without the heterogeneous features and heterogeneous latent factors, the PCDF model is reduced to a new homogeneous transfer ranking model.

**FIGURE 9.2** A PCDF model for homogeneous data.

Second, with a little modification, the PCDF model can also be applied to the situation of data with absolute relevance scores. In such situations, and become observed variables and pairwise preferences are omitted. Hence, the PCDF model is reduced to a pointwise cross-domain factor model, which is shown in *Figure 9.3*. More interestingly, pointwise cross-domain factor model is beyond ranking applications, i.e., it can be directly applied to general regression and classification applications.

**FIGURE 9.3** The pointwise cross-domain factor model.

**9.2 Algorithm Derivation**

In this section, we derive the algorithm to learn the parameters for the PCDF model.

**9.2.1 Objective Specification**

The likelihood function of the PCDF model is given in Eq. *9.17*, in which denotes observed data; denotes all parameters; denotes the set of observed pairwise preferences of instances and for domain .

**(9.17)**

Equation *9.17* is a general objective function. In this study, to instantiate Eq. *9.17*, we adopt popular linear functions for the link functions , , and , and we use an intuitive difference function for . Furthermore, It has been observed in the literature *[74]* that maximizing likelihood under a certain distribution corresponds to minimizing distance under the corresponding distortion measure. For example, the normal distribution, Bernoulli distribution, multinomial distribution, and exponential distribution correspond to Euclidean distance, logistic loss, KL-divergence, and Itakura-Saito distance, respectively. Therefore, the problem of minimizing the negative log posterior of PCDF boils down to the following objective:

**(9.18)**

where , and are tradeoff parameters; are the loss functions (for convenience, we use for all terms; in general, it could be different for different terms) corresponding to conditional distributions in Eq. *9.17*; is the regularization loss function corresponding to the prior distribution in Eq. *9.17*.

The motivations for a computational framework instead of direct probabilistic inference are mainly twofold: First, the two formulations are somewhat equivalent, i.e., the conditional distributions can be encoded precisely through the choice of loss functions; likewise, the prior distributions over parameters could also be readily translated into the regularization penalties. Second, computational models allow more scalable algorithms, e.g., via stochastic gradient descent, whereas probabilistic reasoning often requires Monte Carlo sampling or quite nontrivial variational approximations.

**9.2.2 Optimization and Implementation**

In general, minimizing Eq. *9.19* is a nonconvex problem regardless of the choice of the loss functions and regularizers. Although there are convex reformulations for some settings, they tend to be computationally inefficient for large-scale problems; the convex formulations require the manipulation of a full matrix, which is impractical for anything beyond thousands of instances.

We established algorithms for distributed optimization based on the Hadoop MapReduce framework. The basic idea is to decompose the objective in Eq. *9.19* by optimizing with respect to each pairwise preference and to combine the results for the parameters in the Reduce phase.

We briefly describe a stochastic gradient descent algorithm to solve the optimization of Eq. *9.19*. The algorithm is computationally efficient and decouples different pairwise preferences. For a detailed discussion, see *[416]*. The algorithm loops over all the observations and updates the parameters by moving in the direction defined by a negative gradient. Specifically, for each observation , the algorithm performs the following sequence of updating. First, the hidden variables related to instance are updated as follows:

**(9.19)**

**(9.20)**

**(9.21)**

where is the learning rate, denotes elementwise multiplication, denotes vector of 1s, and for convenience we write into .

Second, the latent variables related to instance are updated as follows:

**(9.22)**

**(9.23)**

**(9.24)**

Third, the parameters are updated as follows:

**(9.25)**

**(9.26)**

**(9.27)**

In summary, the algorithm loops over all instances of to perform updating rules (Eqs. (*9.21)* through (*9.27)*) until it converges. In practice, the algorithm may not need to update parameters for each , since the changes of parameters may not be significant for one observation. Instead it could be more efficient to perform updating rules (Eqs. (*9.25*) throught (*9.27*)) after performing updating rules (Eq. (*9.19)* through (*9.24)*) on a batch of observations. The general PCDF algorithm is summarized in *Algorithm 1*. Note that following the similar procedure, it is easy to derive the algorithms for the variations of themodel in *Figures 9.2* and *9.3*, PCDF for homogenous data and pointwise (regression-based) cross-domain factor model.

The proposed stochastic gradient descent-based algorithm has two desirable properties. First, it is ready for distributed optimization based on the Hadoop MapReduce framework, since it decouples different pairwise preference observations. Second, it is flexible to adopt different loss functions and regularization functions for different types of data with different distributions.

**9.3 Experimental Evaluation**

As a general heterogeneous ranking algorithm, PCDF can be applied to different ranking applications with different data distributions. In this section, we apply PCDF to Web search data to demonstrate the properties and effectiveness of PCDF.

Although the PCDF model is flexible in adopting different loss and regularization functions, in this study we evaluate PCDF with the most popular loss function, L2 loss, corresponding to normal distribution, i.e., we use L2 loss for all in minimization (Eq. *9.19*). For the regularization functions, we evaluate both L2 loss (normal distribution prior) and L1 loss (laplace distribution prior). We denote those two algorithms as*PCDF-n-he* and *PCDF-l-he* ( is for normal distribution prior; is for laplace distribution prior; *he* is for heterogeneous ranking).

To our best knowledge, there is no existing transfer learning algorithm that is directly applicable to partially overlapped heterogeneous feature spaces. (However, we thank an anonymous reviewer for pointing out very recent works, such as *[362]*, which may be modified to this learning situation.) In this study, we use two state-of-the-art homogeneous transfer learning algorithms, sparse coding (called *SC-ho*)[*285*,*212*]and multitask feature learning (called *MTFL-ho*) *[18]*, as comparisons. Furthermore, an intuitive way for heterogeneous transfer learning to occur is to treat one domain’s dedicated features in another domain as missing values and apply existing homogeneous transfer learning to the data with missing values; hence we modify the sparse coding algorithm (it is difficult to modify MFLF for this) to handle missing values (called *SC-he*) to evaluate this idea. Finally, the baseline is using the original target domain training data only (called *TD*) for ranking learning.

We also evaluate algorithms for two PCDF variations in *Figures 9.2* and *9.3* to gain deep understanding of PCDF itself. Similarly, we use L2 loss for all the loss functions, and L2 and L1 loss for the regularizer functions. For PCDF for homogeneous data in *Figure 9.2*, they are denoted as *PCDF-n-ho* and *PCDF-l-ho*, respectively; for the regression based cross-domain factor model in *Figure 9.3*, they are denoted *RCDF-n-he* and *RCDF-l-he*.

In summary, we compare 10 algorithms: TD, SC-ho, SC-he, MTFL-ho, PCDF-n-ho, PCDF-l-ho, RCDF-n-he, RCDF-n-he, PCDF-n-he, and PCDF-l-he.

**9.3.1 Data**

We use Web search data from a commercial search engine system, which conducts ranking learning tasks in various domains with different languages or different verticals. The source domain (denoted S0) is a general Web search for an English-speaking country, which has a relatively large amount of labeled data. The first target domain (denoted T1) is a general Web search for a Spanish-speaking country; the second target domain (denoted T2) is a news article search for the same country as S0; the third target domain (denoted T3) is an answer search (providing a text search for a knowledge-sharing community) for another non-English-speaking country.

In the data, each query-document example is represented by a feature vector. Those query-document examples in domain S and T1 are originally labeled using a five-grade labeling scheme, and those in domain T1 and T2 are originally labeled using a four-grade labeling scheme. We then transform them into pairwise preference data.

The features generally fall into the following three categories: Query features comprise features dependent on the query only and have constant values across all the documents—for example, the number of terms in the query, whether or not the query is a person’s name, and so on. Document features comprise features dependent on the document only and have constant values across all the queries—for example, the number of inbound links pointing to the document, the amount of anchor texts in bytes for the document, and the language identity of the document. Query-document features comprise features dependent on the relation of the query with respect to the document —for example, the number of times each term in the query appears in the document , the number of times each term in the query appears in the anchor texts of the document, a.

The four domains have an overlapped feature space consisting of text match features, which describes various aspects of text match between a query and a document, such as title match and body match. Each domain has its own dedicated feature space. For example, all domains have their own dedicated features related to term segmentation in a query that is language dependent; T2 has its dedicated features related to news articles, freshness. *Table 9.1* summarizes the data information for four domains. In our experiments, of data for each target domain are used as training data (i.e., labeled data are very scarce in target domains); of them are used as test data to make evaluation robust. All the source domain data are used as training data to help the target domains.

**Table 9.1**

**Data summary for one source domain and three target domains.**

**9.3.2 Experimental Setting**

To evaluate those transfer learning algorithms, first the new latent features and parameters are learned by the algorithms; then source domain and target domain training data with the new latent features are used by a base ranking learner to train a ranking model. Third, the ranking models are tested on the test data of the target domain, which are also projected into the new feature space.

For the base ranking learner, we use the gradient-boosting decision tree (GBDT) *[130]*. For performance measure of the ranking models, we adopt the widely used discounted cumulative gain (DCG). Specifically, we use DCG-k, since users of a search engine are only interested in the top results of a query rather than a sorted order of the entire document collection. In this study, we select as 5 and 1. For every experimental setting, 10 runs are repeated, and the average DCG of 10 runs is reported for each experiment setting.

The gradient of L1 regularization function is the discontinuous sign function. We approximate it by a steep soft-sign function , where is a positive number controlling the ramp of (we use ).

For the dimension of the latent factor, in our preliminary experiments we observe that the performance is not sensitive to it as long as it is not too small. In this study, we simply set and . Other parameters, such as the learning rate, are selected by cross-validation.

**9.3.3 Results and Discussions**

The relative weight of source domain data with respect to the target domain data controls how much the source domain affects the target domain. It is very important because it directly affects the efficiency of the knowledge transfer. For example, the low weight will impede the knowledge transfer no matter how close the two domains are.

We evaluate the effects of relative weights of the source domain data for PCDF-n-he and PCDF-l-he algorithms to illustrate this important aspect. *Figure 9.4* shows the effects of relative weights of source domain data on DCG5 performance for all three target domains. From *Figure 9.4*, we observe that the algorithms perform best at weight 0.6 for target domain T1, weight 0.5 for target domain T2, and weight 0.4 for target domain T3. The results imply that T1 is the closest to the source domain, T2 is the second closest, and T3 is the third closest. This is consistent with our domain knowledge about the four domains. Note that it is flexible for PCDF algorithms to adopt the optimal relative weights that can be decided by cross-validation.

**FIGURE 9.4** The effects of relative weights of source domain data on DCG5 performance.

The DCG5 and DCG1 comparisons of the 10 algorithms are shown in *Figures 9.5* and *9.6*. From these figures, we observe the following interesting facts:

• TD performs worst in all settings. This is due to insufficient training data for the target domains and confirms the basic motivation of transfer ranking: to improve poor ranking performance in domains with insufficient training data.

• Overall, PCDF-l-he performs best and PCDF-n-he performs second. This shows that PCDF can effectively catch domain correlations in both overlapped and heterogeneous feature spaces to improve learning in the target domain.

• PCDF-n-he and PCDF-l-he perform better than RCDF-n-he and RCDF-l-he. The possible reason is that PCDF catches correlations from preference orders, which matters more for ranking applications than absolute scores.

• PCDF-n-he and PCDF-l-he perform better than PCDF-n-ho and PCDF-l-ho, since PCDF-n-ho and PCDF-l-ho can use only features in the overlapped space and miss the knowledge transfer in the heterogenous feature spaces.

• PCDF-n-ho and PCDF-l-ho perform better than SC-ho and MTFL-ho. This shows that PCDF helps not only common knowledge learning across heterogeneous feature spaces but also in the same feature space, i.e., PCDF also provides an effective homogeneous transfer ranking model.

• The MTFL algorithm performs better than SC algorithms. The possible reason for this is that as a supervised learning algorithm, MTFL learns the more informative latent features by making use of label information.

• SC-he does not perform better than SC-ho, even though SC-he uses both overlapped features and heterogeneous features. This implies that simply treating one domain’s dedicated features in another domain as missing values cannot effectively catch the heterogeneous feature correlation.

• Overall, the L1 regularization algorithms perform better than L2 regularization algorithms. This shows that the choice of the regularization functions could have significant impact on performance, and hence it is desirable for an algorithm to be flexible to adopt different regularization functions.

**FIGURE 9.5** DCG5 comparisons of 10 algorithms on the three target domains.

**FIGURE 9.6** DCG1 comparisons of 10 algorithms on the three target domains.

**9.4 Related Work**

Cross-domain ranking is the overlapping field of ranking and transfer learning. Since ranking literature is extensitively covered in other chapters of this book, here we focus on transfer learning-related work.

Historically, special cases of transfer learning have been explored in the literature under different names, including sample selection bias *[155]*, class imbalance *[174]*, and covariate shift *[313]*. Those studies mainly address the difference between training and testing data distribution. For example, class imbalance assumes that the density of input variables conditioned on output variables is the same for train and test data, whereas the marginal density of output variables is different for training and test data.

Transfer learning approaches can be mainly categorized into three classes *[275]*: A popular class of transfer learning methods is instance-based [*31*,*85*,*224*,*34*,*167*,*90*,*332*], which assumes that certain parts of the data in the source domain can be reused for the target domain by reweighting. *[179]* proposed a heuristic method to remove “misleading” training instances from the source domain so as to include “good” instances from labeled source-domain instances and unlabeled target-domain instances. *[85]* introduced a boosting algorithm, TrAdaBoost, which assumes that the source and target domain data use exactly the same set of features and labels, but the distributions of the data in the two domains are different. In addition, TrAdaBoost assumes that due to the difference in distributions between the source and the target domains, some of them could be useful in learning for the target domain, but some of them may not and could even be harmful. Therefore, TrAdaBoost attempts to iteratively reweight the source domain data and target domain data to reduce the effect of the “bad” source data while encouraging the “good” source data to contribute more for the target domains. *[31]* proposed a framework to simultaneously reweight the source domain data and train models on the reweighted data with a kernel logistic regression classifier. The studies have shown the promising results of data reweighting for many applications, such as NLP *[179]*, and special learning tasks such as binary classification [*85*,*89*].

Another category can be viewed as model-based approaches [*306*,*210*,*116*,*39*]. These assume that the source tasks and the target tasks share some parameters or priors of their models. An efficient algorithm MT-IVM *[210]*, which is based on the Gaussian process (GP), was proposed to handle multidomain learning cases. MT-IVM tries to learn parameters of GP over multiple tasks by assigning the same GP prior to the tasks. Similarly, hierarchical Bayes (HB) is used with GP for multitask learning *[306]*. *[116]* borrowed the idea of *[306]* and used SVMs for multidomain learning. The parameters of SVMs for each domain are assumed to be separable into two terms: a common term across tasks and a task-specific term. *[244]* proposed a consensus regularization framework for transfer learning from multiple source domains to a target domain.

The third category of transfer learning approaches are feature based. [*35*,*285*,*84*,*11*,*18*,*19*,*213*], where a feature representation is learned for the target domain and used to transfer knowledge across domains. A structural correspondence learning (SCL) algorithm*[35]* is proposed to use unlabeled data from the target domain to extract features so as to reduce the difference between source and target domains. A simple kernel mapping function, introduced in *[88]*, maps the data from both domains to a high-dimensional feature space. *[285]* proposed to apply sparse coding, an unsupervised feature construction method, to learning higher-level features across domains. On the other hand, heterogeneous transfer learning starts to attract attention very recently. We notice that *[382]*extends PLSA to a specific application, using social Web data to help image clustering; *[362]* proposes a manifold alignment-based approach for heterogeneous domain adaptation; *[150]* formulates heterogeneous transfer learning as multitask and multiview learning and proposes a graph-based solution; and *[148]* focus on single-task learning with multiple outlooks, which is also related to heterogeneous transfer learning.

The fourth category of approaches is based on relational knowledge transfer [*258*,*91*,*257*]. *[257]* proposed the algorithm TAMAR, which transfers relational knowledge with Markov logic networks (MLNs) across relational domains. MLNs combines the compact expressiveness of first-order logic with the flexibility of probability. TAMAR was later extended to the single-entity-centered setting of transfer learning *[258]*, where only one entity in the target domain is available. *[91]* proposed an approach to transferring relational knowledge based on a form of second-order Markov logic. In summary, those approaches assume that some relationships among the data in the source and target domains are similar.

Another related field is semisupervised learning [*36*,*415*,*183*,*268*], which mainly addresses the problem that the labeled data are too few to build a good classification/regression function by making use of a large amount of unlabeled data and a small amount of labeled data. Co-training [*36*,*415*], which trains two learners for two views by iteratively including unlabeled data, is closely related to transfer learning. The major difference is that in co-training, we have one set of instances with features partitioned into two “views.” In transfer learning, we use different sets of data instances from different domains.

A few studies have been applied to the idea of transfer learning for the learning-to-rank problem. Zha et al. *[399]* uses multitask learning techniques to incorporate query difference, where each query is regarded as a task. However, the objective of this work is to learn a single ranking function instead of multiple functions for multiple tasks. TransRank *[60]* considers cross-domain information to attack transfer learning problems for ranking, which utilizes the labeled data from a source domain to enhance the learning-of-ranking function in the target domain with augmented features. However, this approach does not make use of unlabeled data. Gao et al. *[134]* explore several model adaptation methods for Web search ranking. They trained two ranking functions separately, then interpolated the two functions for the final result, and their experiments show that the simple model interpolation method achieves the best results. Similarly, heterogeneous transfer ranking is rarely touched in the literature. We notice that *[361]* proposes a regularized framework to addresses ranking cross-heterogeneous domains. It simultaneously minimizes two loss functions corresponding to two related domains by mapping each domain onto a shared latent space.

Among those transfer learning approaches generally instance-based and model-based approaches generally depend on the assumption of homogeneous features more strongly than do feature-based approaches. Another advantage of feature-based approaches is their flexibility of adopting different base ranking learners in real applications, i.e., after the common latent features learned from different domains, it is flexible to use any ranking learner on the new training data with common latent features to train ranking functions. Those motivate us to focus on deriving a feature-based model in this study.

**9.5 Conclusions**

In this chapter, we proposed a novel probabilistic model, PCDF, for cross-domain (vertical) ranking. The proposed model learns latent factors for multidomain data in partially overlapped heterogeneous feature spaces. The model is capable of learning homogeneous feature correlation, heterogeneous feature correlation, and pairwise preference correlation for cross-domain knowledge transfer. We also derive two PCDF variations to address two important special cases. Under the PCDF model, we derive a stochastic gradient-based algorithm, which facilitates distributed optimization and is flexible in adopting different loss functions and regularization functions to accommodate different data distributions. The extensive experiments on real Web search datasets demonstrate the effectiveness of the PCDF model and algorithms.