Relevance Ranking for Vertical Search Engines, FIRST EDITION (2014)
Chapter 1. Introduction
This book aims to present a systematic study of practices and theories for vertical search ranking. The studies in this book can be categorized into two major classes. One class is single-domain-related ranking, which focuses on ranking for a specific vertical, such as news search ranking, medical domain search ranking, visual search ranking, mobile search ranking, and entity search ranking. The other class is multidomain-related ranking, which focuses on ranking that involves multiple verticals, such as multiaspect ranking, aggregated vertical search ranking, and cross-vertical ranking. This chapter discusses the organization, audience, and further reading for this book.
Vertical search ranking
news search ranking
medical domain search ranking
visual search ranking
mobile search ranking
multiaspect relevance ranking
aggregated vertical search
cross-vertical search ranking
1.1 Defining the Area
In the past decade, the impact of general Web search capabilities has been stunning. However, with exponential information growth on the Internet, it becomes more and more difficult for a general Web search engine to address the particular informational and research needs of niche users. As a response to the great need for deeper, more specific, more relevant search results, vertical search engines have emerged in various domains. By leveraging domain knowledge and focusing on specific user tasks, vertical search has great potential to serve users highly relevant search results from specific domains.
The core component of vertical search is relevance ranking, which has attracted more and more attention from both industry and academia during the past few years. This book aims to present a systematic study of practices and theories for vertical search ranking. The studies in this book can be categorized into two major classes. One class is single-domain-related ranking, which focuses on ranking for a specific vertical, such as news search ranking and medical domain search ranking. However, in this book the term vertical has a more general meaning than topic. It refers to specific topics such as news and medical information, specific result types such as entities, and specific search interfaces such as mobile search. The second class of vertical search study covered in this book is multidomain-related ranking, which focuses on ranking involving multiple verticals, such as multiaspect ranking, aggregated vertical search ranking, and cross-vertical ranking.
1.2 The Content and Organization of This Book
This book aims to present an in-depth and systematic study of practices and theories related to vertical search ranking. The organization of this book is as follows.
Chapter 2 covers news vertical search ranking. Reading news is one of the most important online activities of Internet users. For a commercial news search engine, it is critical to provide users with the most relevant and fresh ranking results. Furthermore, it is necessary to group related news articles so that users can browse search results in terms of news stories rather than individual news articles. This chapter describes a few algorithms for news search engines, including ranking algorithms and clustering algorithms. For the ranking problem, the main challenge is achieving an appropriate balance between topical relevance and freshness. For the clustering problem, the main challenge is how to group related news articles into clusters in a scalable manner. Chapter 2 introduces a few news search ranking approaches, including a learning-to-rank approach and a joint learning approach from clickthroughs. The chapter then describes a scalable clustering approach to group news search results.
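To make the relevance-versus-freshness tradeoff concrete, the following is a minimal sketch, not the approach developed in Chapter 2. It assumes a hand-picked exponential decay with a hypothetical half-life parameter; the function name, the article identifiers, and all numeric values are illustrative only.

```python
def freshness_adjusted_score(relevance, age_hours, half_life_hours=24.0):
    """Discount a topical relevance score by the age of the article.

    Assumption for illustration: freshness halves every `half_life_hours`.
    A learned ranker would estimate this tradeoff from data instead.
    """
    decay = 0.5 ** (age_hours / half_life_hours)
    return relevance * decay


# Hypothetical articles: (identifier, topical relevance, age in hours).
articles = [
    ("old_exact_match", 0.9, 72.0),
    ("fresh_partial_match", 0.6, 2.0),
]

# Rank by the combined score, highest first.
ranked = sorted(
    articles,
    key=lambda a: freshness_adjusted_score(a[1], a[2]),
    reverse=True,
)
ranked_ids = [a[0] for a in ranked]
```

With these illustrative numbers, the fresher article outranks the older one despite its lower topical relevance, which is the kind of balance a news ranking function has to strike.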
Chapter 3 studies another important vertical search, the medical domain search. With the exponential growth of electronic health records (EHRs), it is imperative to identify effective means to help medical clinicians as well as administrators and researchers retrieve information from EHRs. Recent research advances in natural language processing (NLP) have provided improved capabilities for automatically extracting concepts from narrative clinical documents. However, before these NLP-based tools become widely available and versatile enough to handle vaguely defined information retrieval needs by EHR users, a convenient and cost-effective solution continues to be in great demand. In this chapter, we introduce the concept of medical information retrieval, which provides medical professionals a handy tool to search among unstructured clinical narratives via an interface similar to that of general-purpose Web search engines, e.g., Google. In the latter part of the chapter, we also introduce several advanced features, such as intelligent, ontology-driven medical search query recommendation services and a collaborative search feature that encourages sharing of medical search knowledge among end users of EHR search tools.
Chapter 4 introduces some fundamental and practical technologies as well as some major emerging trends in visual search ranking. The chapter first describes the generic visual search system, in which three categories of visual search are presented: text-based, query example-based, and concept-based visual search ranking. Then we describe the three categories in detail, including a review of various popular algorithms. To further improve the performance of initial search results, four paradigms of visual search reranking are presented: 1) self-reranking, which focuses on detecting relevant patterns from initial search results without any external knowledge; 2) example-based reranking, in which the query examples are provided by users so that the relevant patterns can be discovered from these examples; 3) crowd-reranking, which mines relevant patterns from crowd-sourcing information available on the Web; and 4) interactive reranking, which utilizes user interaction to guide the reranking process. In addition, we discuss the relationship between learning and visual search, since most recent visual search ranking frameworks are developed based on machine learning technologies. Last, we conclude with several promising directions for future research.
Chapter 5 introduces mobile search ranking. The wide availability of Internet access on mobile devices, such as phones and personal media players, has allowed users to search and access Web information while on the go. The availability of continuous fine-grained location information on these devices has enabled mobile local search, which employs user location as a key factor to search for local entities (e.g., a restaurant, store, gas station, or attraction), to overtake a significant part of the query volume. This is also evidenced by the rising popularity of location-based search engines on mobile devices, such as Bing Local, Google Local, Yahoo! Local, and Yelp. The quality of any mobile local search engine is mainly determined by its ranking function, which formally specifies how we retrieve and rank local entities in response to a user’s query. Acquiring effective ranking signals and heuristics to develop an effective ranking function is arguably the single most important research problem in mobile local search. This chapter first gives an overview of the ranking signals in mobile local search (e.g., distance and customer rating score of a business), which have been recognized to be quite different from those in general Web search. We next present a recent data analysis that studies the behavior of mobile local search ranking signals using a large-scale query log, which reveals interesting heuristics that can be used to guide the exploitation of different signals to develop effective ranking features. Finally, we discuss several interesting future research directions.
Chapter 6 is about entity ranking, which is a recent paradigm that refers to retrieving and ranking related objects and entities from different structured sources in various scenarios. Entities typically have associated categories and relationships with other entities. In this chapter, we introduce how to build a Web-scale entity ranking system based on machine-learned ranking models. Specifically, the entity ranking system usually takes advantage of structured knowledge bases, entity relationship graphs, and user data to derive useful features for facilitating semantic search with entities directly within the learning-to-rank framework. Similar to generic Web search ranking, entity pairwise preference can be leveraged to form the objective function of entity ranking. Beyond that, this chapter introduces ways to incorporate the categorization information and preference of related entities into the objective function for learning. This chapter further discusses how entity ranking is different from regular Web search in terms of presentation bias and the interaction of categories of query entities and result facets.
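As a rough illustration of how pairwise preferences can form an objective function, the sketch below uses a standard pairwise logistic loss. This is an assumed, generic formulation chosen for illustration; it is not necessarily the objective used in Chapter 6, and the entity names and scores are hypothetical.

```python
import math


def pairwise_logistic_loss(scores, preferences):
    """Average logistic loss over (preferred, other) entity pairs.

    `scores` maps an entity identifier to its model score.
    Each pair contributes log(1 + exp(-(s_preferred - s_other))),
    which is small when the preferred entity is scored higher.
    """
    total = 0.0
    for preferred, other in preferences:
        margin = scores[preferred] - scores[other]
        total += math.log1p(math.exp(-margin))
    return total / len(preferences)


# Hypothetical preference: entity_a should rank above entity_b.
prefs = [("entity_a", "entity_b")]

loss_good = pairwise_logistic_loss({"entity_a": 2.0, "entity_b": 0.5}, prefs)
loss_bad = pairwise_logistic_loss({"entity_a": 0.5, "entity_b": 2.0}, prefs)
```

A scoring model that agrees with the stated preference (`loss_good`) incurs a lower loss than one that reverses it (`loss_bad`), so minimizing this quantity pushes the model toward the preferred ordering.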
Chapter 7 presents learning to rank with multiaspect relevance for vertical searches. Many vertical searches, such as local search, focus on specific domains. The meaning of relevance in these verticals is domain-specific and usually consists of multiple well-defined aspects. For example, in local search, text matching and distance are two important aspects to assess relevance. Usually, the overall relevance between a query and a document is a tradeoff among multiple aspect relevancies. Given a single vertical, such a tradeoff can vary for different types of queries or in different contexts. In this chapter, we explore these vertical-specific aspects in the learning-to-rank setting. We propose a novel formulation in which the relevance between a query and a document is assessed with respect to each aspect, forming the multiaspect relevance. To compute a ranking function, we study two types of learning-based approaches to estimate the tradeoff among these aspect relevancies: a label aggregation method and a model aggregation method. Since there are only a few aspects, a minimal amount of training data is needed to learn the tradeoff. We conduct both offline and online bucket-test experiments on a local vertical search engine, and the experimental results show that our proposed multiaspect relevance formulation is very promising. The two types of aggregation methods perform more effectively than a set of baseline methods including a conventional learning-to-rank method.
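The idea of an overall relevance formed as a tradeoff among aspect relevancies can be sketched as a weighted combination. The example below is a simplified illustration under stated assumptions: the linear form, the aspect names, and the hand-set weights are all hypothetical, whereas the methods in Chapter 7 learn the tradeoff from training data.

```python
def aggregate_relevance(aspect_scores, weights):
    """Combine per-aspect relevance scores into one overall score.

    Assumption for illustration: a linear combination with fixed weights.
    """
    return sum(weights[aspect] * score for aspect, score in aspect_scores.items())


# Hypothetical weights for a local search vertical.
weights = {"text_match": 0.6, "distance": 0.4}

# Hypothetical per-aspect relevance of two businesses for one query
# (higher "distance" relevance means the business is closer).
docs = {
    "cafe_a": {"text_match": 0.9, "distance": 0.2},
    "cafe_b": {"text_match": 0.7, "distance": 0.8},
}

# Rank by overall relevance, highest first.
ranked = sorted(
    docs,
    key=lambda d: aggregate_relevance(docs[d], weights),
    reverse=True,
)
```

Changing the weights changes the ranking, which mirrors the observation that the tradeoff can vary across query types or contexts.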
Chapter 8 focuses on aggregated vertical search. Commercial information access providers increasingly incorporate content from a large number of specialized services created for particular information-seeking tasks. For example, an aggregated Web search page may include results from image databases and news collections in addition to the traditional Web search results; a news provider may dynamically arrange related articles, photos, comments, or videos on a given article page. These auxiliary services, known as verticals, include search engines that focus on a particular domain (e.g., news, travel, sports), search engines that focus on a particular type of media (e.g., images, video, audio), and application programming interfaces (APIs) to highly targeted information (e.g., weather forecasts, map directions, or stock prices). The goal of aggregated search is to provide integrated access to all verticals within a single information context. Although aggregated search is related to classic work in distributed information retrieval, it has unique signals, techniques, and evaluation methods in the context of the Web and other production information access systems. In this chapter, we present the core problems associated with aggregated search, which include sources of predictive evidence, relevance modeling, and evaluation.
Chapter 9 presents recent advances in cross-vertical ranking. A traditional Web search engine conducts ranking mainly in a single domain, i.e., it focuses on one type of data source, and effective modeling relies on a sufficiently large number of labeled examples, which require an expensive and time-consuming labeling process. On the other hand, it is very common for a vertical search engine to conduct ranking tasks in various verticals, which presents a more challenging ranking problem, that of cross-domain ranking. Although in this book our focus is on cross-vertical ranking, the proposed approaches can be applied to more general cases, such as cross-language ranking. Therefore, we use a more general term, cross-domain ranking, in this book. For cross-domain ranking, in some domains we may have a relatively large amount of training data, whereas in other domains we can only collect very little. Therefore, finding a way to leverage labeled information from related heterogeneous domains to improve ranking in a target domain has become a problem of great interest. In this chapter, we propose a novel probabilistic model, the pairwise cross-domain factor (PCDF) model, to address this problem. The proposed model learns latent factors (features) for multidomain data in partially overlapped heterogeneous feature spaces. It is capable of learning homogeneous feature correlation, heterogeneous feature correlation, and pairwise preference correlation for cross-domain knowledge transfer. We also derive two PCDF variations to address two important special cases. Under the PCDF model, we derive a stochastic gradient-based algorithm, which facilitates distributed optimization and is flexible enough to accommodate various loss functions and regularization functions for different data distributions. Extensive experiments on real-world data sets demonstrate the effectiveness of the proposed model and algorithm.
1.3 The Audience for This Book
The book covers major fields as well as recently emerging fields for vertical search. Therefore, the expected readership of this book includes all researchers and systems development engineers working in these areas, including, but not limited to, Web search, information retrieval, data mining, and specific application areas related to vertical search, such as various specific vertical search engines. Since this book is self-contained in its presentation of the material, it also serves as an ideal reference book for people who are new to the topic of vertical search ranking. The audience therefore also includes anyone who is interested in, or works in a field that requires, this reference material. Finally, this book can be used as a reference for a graduate course on advanced topics in information retrieval or data mining, since it provides a systematic introduction to this booming new subarea of information technology.
1.4 Further Reading
As a newly emerging area of information retrieval and data mining, vertical search ranking is still in its infancy; currently there is no dedicated, premier venue for the publication of research in this area. Consequently, related work in this area, as supplementary information to this book for further reading, may be found in the literature of the two parent areas.
In information retrieval, related work may be found in the premier conferences, such as the annual Association for Computing Machinery (ACM) Special Interest Group on Information Retrieval (SIGIR) conference, the International World Wide Web Conference (WWW), and the ACM International Conference on Information and Knowledge Management (CIKM). The premier journals in the information retrieval area, including Information Retrieval and Foundations and Trends in Information Retrieval (FTIR), may also contain related work in vertical search ranking.
In data mining, related work may be found in the premier conferences, such as the ACM International Conference on Knowledge Discovery and Data Mining (KDD), the Institute of Electrical and Electronics Engineers (IEEE) International Conference on Data Mining (ICDM), and the Society for Industrial and Applied Mathematics (SIAM) International Conference on Data Mining (SDM). In particular, related work may be found in workshops dedicated to the area of relational learning, such as the Statistical Relational Learning workshop. The premier journals in the data mining area, including IEEE Transactions on Knowledge and Data Engineering (TKDE), ACM Transactions on Data Mining (TDM), and Knowledge and Information Systems (KAIS), may contain related work in relational data clustering.