Appendix F. Using Sphinx with MySQL

Sphinx (http://www.sphinxsearch.com) is a free, open source, full-text search engine, designed from the ground up to integrate well with databases. It has DBMS-like features, is very fast, supports distributed searching, and scales well. It is also designed for efficient memory and disk I/O, which is important because they’re often the limiting factors for large operations.

Sphinx works well with MySQL. It can be used to accelerate a variety of queries, including full-text searches; you can also use it to perform fast grouping and sorting operations, among other applications. It speaks MySQL’s wire protocol and a mostly MySQL-compatible SQL-like dialect, so you can actually query it just like a MySQL database. Additionally, there is a pluggable storage engine that lets a programmer or administrator access Sphinx directly through MySQL. Sphinx is especially useful for certain queries that MySQL’s general-purpose architecture doesn’t optimize very well for large datasets in real-world settings. In short, Sphinx can enhance MySQL’s functionality and performance.
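For instance, assuming searchd is configured with a SphinxQL listener (port 9306 is the conventional choice; the index and column names here are illustrative), you can connect with the ordinary mysql client and issue a familiar-looking query:

$ mysql -h 127.0.0.1 -P 9306
mysql> SELECT id, price FROM products WHERE MATCH('ipod') ORDER BY price ASC LIMIT 10;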

The source of data for a Sphinx index is usually the result of a MySQL SELECT query, but you can build an index from an unlimited number of sources of varying types, and each instance of Sphinx can search an unlimited number of indexes. For example, you can pull some of the documents in an index from a MySQL instance running on one remote server, some from a PostgreSQL instance running on another server, and some from the output of a local script through an XML pipe mechanism.

In this appendix, we explore some use cases where Sphinx’s capabilities enable enhanced performance, summarize the steps needed to install and configure it, explain its features in detail, and discuss several examples of real-world implementations.

A Typical Sphinx Search

We start with a simple but complete Sphinx usage example to provide a starting point for further discussion. We use PHP because of its popularity, although APIs are available for a number of other languages, too.

Assume that we’re implementing full-text searching for a comparison-shopping engine. Our requirements are to:

§ Maintain a searchable full-text index on a product table stored in MySQL

§ Allow full-text searches over product titles and descriptions

§ Be able to narrow down searches to a given category if needed

§ Be able to sort the result not only by relevance, but by item price or submission date

We begin by setting up a data source and an index in the Sphinx configuration file:

source products
{
    type               = mysql

    sql_host           = localhost
    sql_user           = shopping
    sql_pass           = mysecretpassword
    sql_db             = shopping

    sql_query          = SELECT id, title, description, \
        cat_id, price, UNIX_TIMESTAMP(added_date) AS added_ts \
        FROM products

    sql_attr_uint      = cat_id
    sql_attr_float     = price
    sql_attr_timestamp = added_ts
}

index products
{
    source  = products
    path    = /usr/local/sphinx/var/data/products
    docinfo = extern
}

This example assumes that the MySQL shopping database contains a products table with the columns we request in our SELECT query to populate our Sphinx index. The Sphinx index is also named products. After creating a new source and index, we run the indexer program to create the initial full-text index data files and then (re)start the searchd daemon to pick up the changes:

$ cd /usr/local/sphinx/bin
$ ./indexer products
$ ./searchd --stop
$ ./searchd

The index is now ready to answer queries. We can test it with Sphinx’s bundled test.php sample script:

$ php -q test.php -i products ipod
Query 'ipod ' retrieved 3 of 3 matches in 0.010 sec.
Query stats:
    'ipod' found 3 times in 3 documents
Matches:
1. doc_id=123, weight=100, cat_id=100, price=159.99, added_ts=2008-01-03 22:38:26
2. doc_id=124, weight=100, cat_id=100, price=199.99, added_ts=2008-01-03 22:38:26
3. doc_id=125, weight=100, cat_id=100, price=249.99, added_ts=2008-01-03 22:38:26

The final step is to add searching to our web application. We need to set sorting and filtering options based on user input and format the output nicely. Also, because Sphinx returns only document IDs and configured attributes to the client—it doesn’t store any of the original text data—we need to pull additional row data from MySQL ourselves:

<?php
include ( "sphinxapi.php" );
// ... other includes, MySQL connection code,
// displaying page header and search form, etc. all go here

// set query options based on end-user input
$cl = new SphinxClient ();
$sortby = $_REQUEST["sortby"];
if ( !in_array ( $sortby, array ( "price", "added_ts" ) ) )
    $sortby = "price";
if ( $_REQUEST["sortorder"]=="asc" )
    $cl->SetSortMode ( SPH_SORT_ATTR_ASC, $sortby );
else
    $cl->SetSortMode ( SPH_SORT_ATTR_DESC, $sortby );
$offset = ($_REQUEST["page"]-1)*$rows_per_page;
$cl->SetLimits ( $offset, $rows_per_page );

// issue the query, get the results
$res = $cl->Query ( $_REQUEST["query"], "products" );

// handle search errors
if ( !$res )
{
    print "<b>Search error:</b>" . $cl->GetLastError ();
    die;
}

// fetch additional columns from MySQL
$ids = join ( ",", array_keys ( $res["matches"] ) );
$r = mysql_query ( "SELECT id, title FROM products WHERE id IN ($ids)" )
    or die ( "MySQL error: " . mysql_error() );
while ( $row = mysql_fetch_assoc($r) )
{
    $id = $row["id"];
    $res["matches"][$id]["sql"] = $row;
}

// display the results in the order returned from Sphinx
$n = 1 + $offset;
foreach ( $res["matches"] as $id=>$match )
{
    printf ( "%d. <a href=details.php?id=%d>%s</a>, USD %.2f<br>\n",
        $n++, $id, $match["sql"]["title"], $match["attrs"]["price"] );
}
?>

Even though the snippet just shown is pretty simple, there are a few things worth highlighting:

§ The SetLimits() call tells Sphinx to fetch only the number of rows that the client wants to display on a page. It’s cheap to impose this limit in Sphinx (unlike in MySQL’s built-in search facility), and the number of results that would have been returned without the limit is available in $res["total_found"] at no extra cost.

§ Because Sphinx only indexes the title column and doesn’t store it, we must fetch that data from MySQL.

§ We retrieve data from MySQL with a single combined query for the whole document batch using the clause WHERE id IN (...), instead of running one query for each match (which would be inefficient).

§ We inject the rows pulled from MySQL into our full-text search result set, to keep the original sorting order. We explain this more in a moment.

§ We display the rows using data pulled from both Sphinx and MySQL.

The row injection code, which is PHP-specific, deserves a more detailed explanation. We couldn’t simply iterate over the result set from the MySQL query, because the row order can (and in most cases will) differ from the order specified in the WHERE id IN (...) clause. PHP associative arrays, however, preserve the order in which elements were inserted, so iterating over $res["matches"] produces rows in the sort order returned by Sphinx. To keep that order (rather than the semirandom order returned from MySQL), we therefore inject the MySQL rows one by one into the hash of Sphinx matches.

There are also a few major implementation and performance differences between MySQL and Sphinx when it comes to counting matches and applying a LIMIT clause. First, LIMIT is cheap in Sphinx. Consider a LIMIT 500,10 clause. MySQL will retrieve 510 semirandom rows (which is slow) and throw away 500, whereas Sphinx will return only the IDs you need to retrieve the 10 rows you actually want from MySQL. Second, Sphinx will always return the exact number of matches it actually found in the result set, no matter what’s in the LIMIT clause. MySQL can’t do this efficiently, although MySQL 5.6 includes partial improvements for this limitation.
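As a brief, hedged illustration (reusing $cl, $offset, and $rows_per_page from the earlier snippet), pagination code can rely on total_found to render page links without a separate counting query:

<?php
// after $res = $cl->Query ( ... ) succeeds:
$total_pages = (int) ceil ( $res["total_found"] / $rows_per_page );
printf ( "Showing %d-%d of %d matches (%d pages)<br>\n",
    $offset + 1, $offset + count ( $res["matches"] ),
    $res["total_found"], $total_pages );
?>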

Why Use Sphinx?

Sphinx can complement a MySQL-based application in many ways, bolstering performance where MySQL is not a good solution and adding functionality MySQL can’t provide. Typical usage scenarios include:

§ Fast, efficient, scalable, relevant full-text searches

§ Optimizing WHERE conditions on low-selectivity indexes or columns without indexes

§ Optimizing ORDER BY ... LIMIT N queries and GROUP BY queries

§ Generating result sets in parallel

§ Scaling up and scaling out

§ Aggregating partitioned data

We explore each of these scenarios in the following sections. This list is not exhaustive, though, and Sphinx users find new applications regularly. For example, one of Sphinx’s most important uses—scanning and filtering records quickly—was a user innovation, not one of Sphinx’s original design goals.

Efficient and Scalable Full-Text Searching

MyISAM’s full-text search capability is fast for smaller datasets but performs badly when the data size grows. With millions of records and gigabytes of indexed text, query times can vary from a second to more than 10 minutes, which is unacceptable for a high-performance web application. Although it’s possible to scale MyISAM’s full-text searches by distributing the data in many locations, this requires you to perform searches in parallel and merge the results in your application.

Sphinx works significantly faster than MyISAM’s built-in full-text indexes. For instance, it can search over 1 GB of text within 10 to 100 milliseconds—and that scales well up to 10–100 GB per CPU. Sphinx also has the following advantages:

§ It can index data stored with InnoDB and other engines, not just MyISAM.

§ It can create indexes on data combined from many source tables, instead of being limited to columns in a single table.

§ It can dynamically combine search results from multiple indexes.

§ In addition to indexing textual columns, its indexes can contain an unlimited number of numeric attributes, which are analogous to “extra columns.” Sphinx attributes can be integers, floating-point numbers, and Unix timestamps.

§ It can optimize full-text searches with additional conditions on attributes.

§ Its phrase-based ranking algorithm helps it return more relevant results. For instance, if you search a table of song lyrics for “I love you, dear,” a song that contains that exact phrase will turn up at the top, before songs that just contain “love” or “dear” many times.

§ It makes scaling out much easier.

Applying WHERE Clauses Efficiently

Sometimes you’ll need to run SELECT queries against very large tables (containing millions of records), with several WHERE conditions on columns that have poor index selectivity (i.e., return too many rows for a given WHERE condition) or could not be indexed at all. Common examples include searching for users in a social network and searching for items on an auction site. Typical search interfaces let the user apply WHERE conditions to 10 or more columns, while requiring the results to be sorted by other columns. See the indexing case study in Chapter 5 for an example of such an application and the required indexing strategies.

With the proper schema and query optimizations, MySQL can work acceptably for such queries, as long as the WHERE clauses don’t contain too many columns. But as the number of columns grows, the number of indexes required to support all possible searches grows exponentially. Covering all the possible combinations for just four columns strains MySQL’s limits. It becomes very slow and expensive to maintain the indexes, too. This means it’s practically impossible to have all the required indexes for many WHERE conditions, and you have to run the queries without indexes.

More importantly, even if you can add indexes, they won’t give much benefit unless they’re selective. The classic example is a gender column, which isn’t much help because it typically selects around half of all rows. MySQL will generally revert to a full table scan when the index isn’t selective enough to help it.

Sphinx can perform such queries much faster than MySQL. You can build a Sphinx index with only the required columns from the data. Sphinx then allows two types of access to the data: an indexed search on a keyword or a full scan. In both cases, Sphinx applies filters, which are its equivalent of a WHERE clause. Unlike MySQL, which decides internally whether to use an index or a full scan, Sphinx lets you choose which access method to use.

To use a full scan with filters, specify an empty string as the search query. To use an indexed search, add pseudokeywords to your full-text fields while building the index and then search for those keywords. For example, if you wanted to search for items in category 123, you’d add a “category123” keyword to the document during indexing and then perform a full-text search for “category123.” You can either add keywords to one of the existing fields using the CONCAT() function, or create a special full-text field for the pseudokeywords for more flexibility. Normally, you should use filters for nonselective values that cover over 30% of the rows, and fake keywords for selective ones that select 10% or less. If the values are in the 10–30% gray zone, your mileage may vary, and you should use benchmarks to find the best solution.
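For example, a hedged sketch of the separate-field variant in the data source definition might look like the following (the column and field names are illustrative). The fake_keywords field then contains a term such as category123, and the application simply appends that term to the full-text query:

sql_query = SELECT id, title, description, price, \
    CONCAT('category', cat_id) AS fake_keywords \
    FROM products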

Sphinx will perform both indexed searches and scans faster than MySQL. Sometimes Sphinx actually performs a full scan faster than MySQL can perform an index read.

Finding the Top Results in Order

Web applications frequently need the top N results in order. As we discussed previously, this is hard to optimize in MySQL 5.5 and older versions.

The worst case is when the WHERE condition finds many rows (let’s say 1 million) and the ORDER BY columns aren’t indexed. MySQL uses the index to identify all the matching rows, reads the records one by one into the sort buffer with semirandom disk reads, sorts them all with a filesort, and then discards most of them. It will temporarily store and process the entire result, ignoring the LIMIT clause and churning RAM. And if the result set doesn’t fit in the sort buffer, it will need to go to disk, causing even more disk I/O.

This is an extreme case, and you might think it happens rarely in the real world, but in fact the problems it illustrates happen often. MySQL’s limitations on indexes for sorting—using only the leftmost part of the index, not supporting loose index scans, and allowing only a single range condition—mean many real-world queries can’t benefit from indexes. And even when they can, using semirandom disk I/O to retrieve rows is a performance killer.

Paginated result sets, which usually require queries of the form SELECT ... LIMIT N, M, are another performance problem in MySQL. They read N + M rows from disk, causing a large amount of random I/O and wasting memory resources. Sphinx can accelerate such queries significantly by eliminating the two biggest problems:

Memory usage

Sphinx’s RAM usage is always strictly limited, and the limit is configurable. Sphinx supports a result set offset and size similar to the MySQL LIMIT N, M syntax, but it also has a max_matches option. This controls the equivalent of the “sort buffer” size, on both a per-server and a per-query basis. Sphinx’s RAM footprint is guaranteed to be within the specified limits.

I/O

If attributes are stored in RAM, Sphinx does not do any I/O at all. And even if attributes are stored on disk, Sphinx will perform sequential I/O to read them, which is much faster than MySQL’s semirandom retrieval of rows from disks.

You can sort search results by a combination of relevance (weight), attribute values, and (when using GROUP BY) aggregate function values. The sorting clause syntax is similar to a SQL ORDER BY clause:

<?php
$cl = new SphinxClient ();
$cl->SetSortMode ( SPH_SORT_EXTENDED, 'price DESC, @weight ASC' );
// more code and Query() call here...
?>

In this example, price is a user-specified attribute stored in the index, and @weight is a special attribute, created at runtime, that contains each result’s computed relevance. You can also sort by an arithmetic expression involving attribute values, common math operators, and functions:

<?php
$cl = new SphinxClient ();
$cl->SetSortMode ( SPH_SORT_EXPR, '@weight + log(pageviews)*1.5' );
// more code and Query() call here...
?>

Optimizing GROUP BY Queries

Support for everyday SQL-like clauses would be incomplete without GROUP BY functionality, so Sphinx has that, too. But unlike MySQL’s general-purpose implementation, Sphinx specializes in solving a practical subset of GROUP BY tasks efficiently. This subset covers the generation of reports from big (1–100 million row) datasets when one of the following cases holds:

§ The result is only a “small” number of grouped rows (where “small” is on the order of 100,000 to 1 million rows).

§ Very fast execution is required and approximate COUNT(*) results are acceptable when many groups are retrieved from data distributed over a cluster of machines.

This is not as restrictive as it might sound. The first scenario covers practically all imaginable time-based reports. For example, a detailed per-hour report for a period of 10 years will return fewer than 90,000 records. The second scenario could be expressed in plain English as something like “as quickly and accurately as possible, find the 20 most important records in a 100-million-row sharded table.”

These two types of queries can accelerate general-purpose queries, but you can also use them for full-text search applications. Many applications need to display not only full-text matches, but some aggregate results as well. For example, many search result pages show how many matches were found in each product category, or display a graph of matching document counts over time. Another common requirement is to group the results and show the most relevant match from each category. Sphinx’s group-by support lets you combine grouping and full-text searching, eliminating the overhead of doing the grouping in your application or in MySQL.

As with sorting, grouping in Sphinx uses fixed memory. It is slightly (10% to 50%) more efficient than similar MySQL queries on datasets that fit in RAM. In this case, most of Sphinx’s power comes from its ability to distribute the load and greatly reduce the latency. For huge datasets that could never fit in RAM, you can build a special disk-based index for reporting, using inline attributes (defined later). Queries against such indexes execute about as fast as the disk can read the data—about 30–100 MB/sec on modern hardware. In this case, the performance can be many times better than MySQL’s, though the results will be approximate.

The most important difference from MySQL’s GROUP BY is that Sphinx may, under certain circumstances, yield approximate results. There are two reasons for this:

§ Grouping uses a fixed amount of memory. If there are too many groups to hold in RAM and the matches are in a certain “unfortunate” order, per-group counts might be smaller than the actual values.

§ A distributed search sends only the aggregate results, not the matches themselves, from node to node. If there are duplicate records in different nodes, per-group distinct counts might be greater than the actual values, because the information that can remove the duplicates is not transmitted between nodes.

In practice, it is often acceptable to have fast approximate group-by counts. If this isn’t acceptable, it’s often possible to get exact results by configuring the daemon and client application carefully.

You can generate the equivalent of COUNT(DISTINCT <attribute>), too. For example, you can use this to compute the number of distinct sellers per category in an auction site.

Finally, Sphinx lets you choose criteria to select the single “best” document within each group. For example, you can select the most relevant document from each domain, while grouping by domain and sorting the result set by per-domain match counts. This is not possible in MySQL without a complex query.
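A hedged sketch of both features using the PHP API (the domain_id and seller_id attribute names are illustrative): SetGroupBy() groups the matches, SetGroupDistinct() adds a per-group distinct count, and the sort mode picks the “best” representative document within each group:

<?php
$cl = new SphinxClient ();
// the most relevant match represents each group; groups ordered by match count
$cl->SetSortMode      ( SPH_SORT_RELEVANCE );
$cl->SetGroupBy       ( "domain_id", SPH_GROUPBY_ATTR, "@count desc" );
$cl->SetGroupDistinct ( "seller_id" );   // per-group COUNT(DISTINCT seller_id)
$res = $cl->Query ( "ipod", "products" );
// each match carries @groupby, @count, and @distinct attributes
?>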

Generating Parallel Result Sets

Sphinx lets you generate several results from the same data simultaneously, again using a fixed amount of memory. Compared to the traditional SQL approach of either running two queries (and hoping that some data stays in the cache between runs) or creating a temporary table for each search result set, this yields a noticeable improvement.

For example, assume you need per-day, per-week, and per-month reports over a period of time. To generate these with MySQL you’d have to run three queries with different GROUP BY clauses, processing the source data three times. Sphinx, however, can process the underlying data once and accumulate all three reports in parallel.

Sphinx does this with a multi-query mechanism. Instead of issuing queries one by one, you batch several queries and submit them in one request:

<?php
$cl = new SphinxClient ();
$cl->SetSortMode ( SPH_SORT_EXTENDED, "price desc" );
$cl->AddQuery ( "ipod" );
$cl->SetGroupBy ( "category_id", SPH_GROUPBY_ATTR, "@count desc" );
$cl->AddQuery ( "ipod" );
$cl->RunQueries ();
?>

Sphinx will analyze the request, identify query parts it can combine, and parallelize the queries where possible.

For example, Sphinx might notice that only the sorting and grouping modes differ, and that the queries are otherwise the same. This is the case in the sample code just shown, where the sorting is by price but the grouping is by category_id. Sphinx will create several sorting queues to process these queries. When it runs the queries, it will retrieve the rows once and submit them to all queues. Compared to running the queries one by one, this eliminates several redundant full-text search or full scan operations.

Note that generating parallel result sets, although it’s a common and important optimization, is only a particular case of the more generalized multi-query mechanism. It is not the only possible optimization. The rule of thumb is to combine queries in one request where possible, which generally allows Sphinx to apply internal optimizations. Even if Sphinx can’t parallelize the queries, it still saves network round-trips. And if Sphinx adds more optimizations in the future, your queries will use them automatically with no further changes.
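RunQueries() returns one result set per added query, in the order the queries were added; a minimal sketch of consuming the batched results from the example above might look like this (per-query errors are reported inside each result set):

<?php
$results = $cl->RunQueries ();
if ( !$results )
    die ( "Batch error: " . $cl->GetLastError () );
$plain   = $results[0]; // sorted by price
$grouped = $results[1]; // grouped by category_id
foreach ( $results as $i => $res )
    if ( $res["error"] )
        print "Query $i failed: " . $res["error"] . "<br>\n";
?>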

Scaling

Sphinx scales well both horizontally (scaling out) and vertically (scaling up).

Sphinx is fully distributable across many machines. All the use cases we’ve mentioned can benefit from distributing the work across several CPUs.

The Sphinx search daemon (searchd) supports special distributed indexes, which know which local and remote indexes should be queried and aggregated. This means scaling out is a trivial configuration change. You simply partition the data across the nodes, configure the master node to issue several remote queries in parallel with local ones, and that’s it.

You can also scale up, as in using more cores or CPUs on a single machine to improve latency. To accomplish this, you can just run several instances of searchd on a single machine and query them all from another machine via a distributed index. Alternatively, you can configure a single instance to communicate with itself so that the parallel “remote” queries actually run on a single machine, but on different CPUs or cores.

In other words, with Sphinx a single query can be made to use more than one CPU (multiple concurrent queries will use multiple CPUs automatically). This is a major difference from MySQL, where one query always gets one CPU, no matter how many are available. Also, Sphinx does not need any synchronization between concurrently running queries. That lets it avoid mutexes (a synchronization mechanism), which are a notorious MySQL performance bottleneck on multi-CPU systems.

Another important aspect of scaling up is scaling disk I/O. Different indexes (including parts of a larger distributed index) can easily be put on different physical disks or RAID volumes to improve latency and throughput. This approach has some of the same benefits as MySQL 5.1’s partitioned tables, which can also partition data into multiple locations. However, distributed indexes have some advantages over partitioned tables. Sphinx uses distributed indexes both to distribute the load and to process all parts of a query in parallel. In contrast, MySQL’s partitioning can optimize some queries (but not all) by pruning partitions, but the query processing will not be parallelized. And even though both Sphinx and MySQL partitioning will improve query throughput, if your queries are I/O-bound, you can expect linear latency improvement from Sphinx on all queries, whereas MySQL’s partitioning will improve latency only on those queries where the optimizer can prune entire partitions.

The distributed searching workflow is straightforward:

1. Issue remote queries on all remote servers.

2. Perform sequential local index searches.

3. Read the partial search results from the remote servers.

4. Merge all the partial results into the final result set, and return it to the client.

If your hardware resources permit it, you can search through several indexes on the same machine in parallel, too. If there are several physical disk drives and several CPU cores, the concurrent searches can run without interfering with each other. You can pretend that some of the indexes are remote and configure searchd to contact itself to launch a parallel query on the same machine:

index distributed_sample
{
    type  = distributed
    local = chunk1                # resides on HDD1
    agent = localhost:3312:chunk2 # resides on HDD2, searchd contacts itself
}

From the client’s point of view, distributed indexes are absolutely no different from local indexes. This lets you create “trees” of distributed indexes by using nodes as proxies for sets of other nodes. For example, the first-level node could proxy the queries to a number of the second-level nodes, which could in turn either search locally themselves or pass the queries to other nodes, to an arbitrary depth.

Aggregating Sharded Data

Building a scalable system often involves sharding (partitioning) the data across different physical MySQL servers. We discussed this in depth in Chapter 11.

When the data is sharded at a fine level of granularity, simply fetching a few rows with a selective WHERE (which should be fast) means contacting many servers, checking for errors, and merging the results together in the application. Sphinx alleviates this problem, because all the necessary functionality is already implemented inside the search daemon.

Consider an example where a 1 TB table with a billion blog posts is sharded by user ID over 10 physical MySQL servers, so a given user’s posts always go to the same server. As long as queries are restricted to a single user, everything is fine: we choose the server based on user ID and work with it as usual.

Now assume that we need to implement an archive page that shows the user’s friends’ posts. How are we going to display the archive page with entries 981 to 1000, sorted by post date? Most likely, the various friends’ data will be on different servers. With only 10 friends, there’s about a 90% chance that more than 8 servers will be used, and that probability increases to 99% if there are 20 friends. So, for most queries, we will need to contact all the servers. Worse, we’ll need to pull 1,000 posts from each server and sort them all in the application. Following the suggestions we’ve made previously in this book, we’d trim down the required data to the post ID and timestamp only, but that’s still 10,000 records to sort in the application. Most modern scripting languages consume a lot of CPU time for that sorting step alone. In addition, we’ll either have to fetch the records from each server sequentially (which will be slow) or write some code to juggle the parallel querying threads (which will be difficult to implement and maintain).

In such situations, it makes sense to use Sphinx instead of reinventing the wheel. All we’ll have to do in this case is set up several Sphinx instances, mirror the frequently accessed post attributes from each table—in this example, the post ID, user ID, and timestamp—and query the master Sphinx instance for entries 981 to 1000, sorted by post date, in approximately three lines of code. This is a much smarter way to scale.
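Assuming the post timestamp is mirrored as an added_ts attribute and the friends’ user IDs are held in $friend_ids, those “three lines” against the master (distributed) Sphinx index might look roughly like the following sketch; the index name is illustrative:

<?php
$cl = new SphinxClient ();
$cl->SetFilter   ( "user_id", $friend_ids );        // only the friends' posts
$cl->SetSortMode ( SPH_SORT_ATTR_DESC, "added_ts" ); // newest first
$cl->SetLimits   ( 980, 20 );                        // entries 981 to 1000
$res = $cl->Query ( "", "blog_posts_distributed" );  // empty query = full scan
?>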

Architectural Overview

Sphinx is a standalone set of programs. The two main programs are:

indexer

A program that fetches documents from specified sources (e.g., from MySQL query results) and creates a full-text index over them. This is a background batch job, which sites usually run regularly.

searchd

A daemon that serves search queries from the indexes indexer builds. This provides the runtime support for applications.

The Sphinx distribution also includes native searchd client APIs in a number of programming languages (PHP, Python, Perl, Ruby, and Java, at the time of this writing), and SphinxSE, which is a client implemented as a pluggable storage engine for MySQL 5.0 and newer. The APIs and SphinxSE allow a client application to connect to searchd, pass it the search query, and fetch back the search results.

Each Sphinx full-text index can be compared to a table in a database; in place of rows in a table, the Sphinx index consists of documents. (Sphinx also has a separate data structure called a multivalued attribute, discussed later.) Each document has a unique 32-bit or 64-bit integer identifier that should be drawn from the database table being indexed (for instance, from a primary key column). In addition, each document has one or more full-text fields (each corresponding to a text column from the database) and numerical attributes. Like a database table, the Sphinx index has the same fields and attributes for all of its documents. Table F-1 shows the analogy between a database table and a Sphinx index.

Table F-1. Database structure and corresponding Sphinx structure

Database structure:

    CREATE TABLE documents (
        id int(11) NOT NULL auto_increment,
        title varchar(255),
        content text,
        group_id int(11),
        added datetime,
        PRIMARY KEY (id)
    );

Sphinx structure:

    index documents
        document ID
        title      field, full-text indexed
        content    field, full-text indexed
        group_id   attribute, sql_attr_uint
        added      attribute, sql_attr_timestamp

Sphinx does not store the text fields from the database but just uses their contents to build a search index.

Installation Overview

Sphinx installation is straightforward and typically includes the following steps:

1. Building the programs from sources:

$ ./configure && make && make install

2. Creating a configuration file with definitions for data sources and full-text indexes

3. Initial indexing

4. Launching searchd

After that, the search functionality is immediately available for client programs:

<?php
include ( 'sphinxapi.php' );
$cl = new SphinxClient ();
$res = $cl->Query ( 'test query', 'myindex' );
// use $res search result here
?>

The only thing left to do is run indexer regularly to update the full-text index data. Indexes that searchd is currently serving will stay fully functional during reindexing: indexer will detect that they are in use, create a “shadow” index copy instead, and notify searchd to pick up that copy on completion.
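In practice, the periodic rebuilds are run with indexer’s --rotate switch, which builds the shadow copy and signals the running searchd to swap it in without downtime:

$ ./indexer --rotate products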

Full-text indexes are stored in the filesystem (at the location specified in the configuration file) and are in a special “monolithic” format, which is not well suited for incremental updates. The normal way to update the index data is to rebuild it from scratch. This is not as big a problem as it might seem, though, for the following reasons:

§ Indexing is fast. Sphinx can index plain text (without HTML markup) at a rate of 4–8 MB/sec on modern hardware.

§ You can partition the data in several indexes, as shown in the next section, and reindex only the updated part from scratch on each run of indexer.

§ There is no need to “defragment” the indexes—they are built for optimal I/O, which improves search speed.

§ Numeric attributes can be updated without a complete rebuild.

A future version will offer an additional index backend, which will support real-time index updates.

Typical Partition Use

Let’s discuss partitioning in a bit more detail. The simplest partitioning scheme is the main + delta approach, in which two indexes are created to index one document collection. main indexes the whole document set, while delta indexes only documents that have changed since the last time the main index was built.

This scheme matches many data modification patterns perfectly. Forums, blogs, email and news archives, and vertical search engines are all good examples. Most of the data in those repositories never changes once it is entered, and only a tiny fraction of documents are changed or added on a regular basis. This means the delta index is small and can be rebuilt as frequently as required (e.g., once every 1–15 minutes). This is equivalent to indexing just the newly inserted rows.
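A hedged configuration sketch of the main + delta scheme, assuming an auto-incrementing id column and a small helper table sph_counter(counter_id, max_id) that records the highest id covered by the last main build (the table and column names are illustrative):

source main
{
    # ... type, sql_host, sql_user, sql_pass, sql_db as usual ...
    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
    sql_query     = SELECT id, title, body FROM documents \
        WHERE id <= ( SELECT max_id FROM sph_counter WHERE counter_id=1 )
}

source delta : main
{
    # redefining sql_query_pre drops the inherited counter update,
    # so rebuilding delta does not move the high-water mark
    sql_query_pre = SET NAMES utf8
    sql_query     = SELECT id, title, body FROM documents \
        WHERE id > ( SELECT max_id FROM sph_counter WHERE counter_id=1 )
}

index main
{
    source = main
    path   = /usr/local/sphinx/var/data/main
}

index delta : main
{
    source = delta
    path   = /usr/local/sphinx/var/data/delta
}

With this layout, delta can be rebuilt every few minutes and searched together with main, while main is rebuilt (and the counter advanced) on a much longer schedule.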

You don’t need to rebuild the indexes to change attributes associated with documents—you can do this online via searchd. You can mark rows as deleted by simply setting a “deleted” attribute in the main index. Thus, you can handle updates by marking this attribute on documents in the main index, then rebuilding the delta index. Searching for all documents that are not marked as “deleted” will return the correct results.
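For instance, a sketch of marking document 123 as deleted at runtime through the PHP API (assuming the main index defines an integer deleted attribute) could look like this:

<?php
$cl = new SphinxClient ();
// set deleted=1 for document ID 123 in the "main" index
$cl->UpdateAttributes ( "main", array ( "deleted" ), array ( 123 => array ( 1 ) ) );
// searches then exclude deleted documents with a filter
$cl->SetFilter ( "deleted", array ( 0 ) );
?>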

Note that the indexed data can come from the results of any SELECT statement; it doesn’t have to come from just a single SQL table. There are no restrictions on the SELECT statements. That means you can preprocess the results in the database before they’re indexed. Common preprocessing examples include joins with other tables, creating additional fields on the fly, excluding some fields from indexing, and manipulating values.

Special Features

Besides “just” indexing and searching through database content, Sphinx offers several other special features. Here’s a partial list of the most important ones:

§ The searching and ranking algorithms take word positions and the query phrase’s proximity to the document content into account.

§ You can bind numeric attributes to documents, including multivalued attributes.

§ You can sort, filter, and group by attribute values.

§ You can create document snippets with search query keyword highlighting.

§ You can distribute searching across several machines.

§ You can optimize queries that generate several result sets from the same data.

§ You can access the search results from within MySQL using SphinxSE.

§ You can fine-tune the load Sphinx imposes on the server.

We covered some of these features earlier. This section covers a few of the remaining features.

Phrase Proximity Ranking

Sphinx remembers word positions within each document, as do other open source full-text search systems. But unlike most other ones, it uses the positions to rank matches and return more relevant results.

A number of factors might contribute to a document’s final rank. To compute the rank, most other systems use only keyword frequency: the number of times each keyword occurs. The classic BM25 weighting function[231] that virtually all full-text search systems use is built around giving more weight to words that either occur frequently in the particular document being searched or occur rarely in the whole collection. The BM25 result is usually returned as the final rank value.

In contrast, Sphinx also computes query phrase proximity, which is simply the length of the longest verbatim query subphrase contained in the document, counted in words. For instance, the phrase “John Doe Jr” queried against a document with the text “John Black, John White Jr, and Jane Dunne” will produce a phrase proximity of 1, because no two words in the query appear together in the query order. The same query against “Mr. John Doe Jr and friends” will yield a proximity of 3, because three query words occur in the document in the query order. The document “John Gray, Jane Doe Jr” will produce a proximity of 2, thanks to its “Doe Jr” query subphrase.

By default, Sphinx ranks matches using phrase proximity first and the classic BM25 weight second. This means that verbatim query quotes are guaranteed to be at the very top, quotes that are off by a single word will be right below those, and so on.

When and how does phrase proximity affect results? Consider searching 1,000,000 pages of text for the phrase “To be or not to be.” Sphinx will put the pages with verbatim quotes at the very top of the search results, whereas BM25-based systems will first return the pages with the most mentions of “to,” “be,” “or,” and “not”—pages with an exact quote match but only a few instances of “to” will be buried deep in the results.

Most major web search engines today rank results with keyword positions as well. Searching for a phrase on Google will likely result in pages with perfect or near-perfect phrase matches appearing at the very top of the search results, followed by the “bag of words” documents.

However, analyzing keyword positions requires additional CPU time, and sometimes you might need to skip it for performance reasons. There are also cases when phrase ranking produces undesired, unexpected results. For example, searching for tags in a cloud is better without keyword positions: it makes no difference whether the tags from the query are next to each other in the document.

To allow for flexibility, Sphinx offers a choice of ranking modes. Besides the default mode of proximity plus BM25, you can choose from a number of others that include BM25-only weighting, fully disabled weighting (which provides a nice optimization if you’re not sorting by rank), and more.
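A brief sketch of switching rankers through the PHP API (in the classic API, ranker selection is typically used together with the extended match mode; the constants shown exist in sphinxapi.php):

<?php
$cl = new SphinxClient ();
$cl->SetMatchMode   ( SPH_MATCH_EXTENDED2 );
$cl->SetRankingMode ( SPH_RANK_BM25 ); // skip phrase proximity analysis
// or SPH_RANK_NONE to disable weighting entirely when not sorting by rank
$res = $cl->Query ( "tag1 tag2 tag3", "tags" );
?>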

Support for Attributes

Each document might contain an unlimited number of numeric attributes. Attributes are user-specified and can contain any additional information required for a specific task. Examples include a blog post’s author ID, an inventory item’s price, a category ID, and so on.

Attributes enable efficient full-text searches with additional filtering, sorting, and grouping of the search results. In theory, they could be stored in MySQL and pulled from there every time a search is performed. But in practice, if a full-text search locates even hundreds or thousands of rows (which is not many), retrieving them from MySQL is unacceptably slow.

Sphinx supports two ways to store attributes: inline in the document lists or externally in a separate file. Inlining requires all attribute values to be stored in the index many times, once for each time a document ID is stored. This inflates the index size and increases I/O, but reduces use of RAM. Storing the attributes externally requires preloading them into RAM upon searchd startup.

Attributes normally fit in RAM, so the usual practice is to store them externally. This makes filtering, sorting, and grouping very fast, because accessing data is a matter of quick in-memory lookup. Also, only the externally stored attributes can be updated at runtime. Inline storage should be used only when there is not enough free RAM to hold the attribute data.

Sphinx also supports multivalued attributes (MVAs). MVA content consists of an arbitrarily long list of integer values associated with each document. Examples of good uses for MVAs are lists of tag IDs, product categories, and access control lists.
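A hedged configuration sketch of an MVA holding tag IDs, pulled with a separate query (the table and column names are illustrative):

sql_attr_multi = uint tag_id from query; SELECT post_id, tag_id FROM post_tags

A SetFilter() on tag_id then matches a document if any of its MVA values are in the filter set.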

Filtering

Having access to attribute values in the full-text engine allows Sphinx to filter and reject candidate matches as early as possible while searching. Technically, the filter check occurs after verification that the document contains all the required keywords, but before certain computationally intensive calculations (such as ranking) are done. Because of these optimizations, using Sphinx to combine full-text searching with filtering and sorting can be 10 to 100 times faster than using Sphinx for searching and then filtering results in MySQL.

Sphinx supports two types of filters, which are analogous to simple WHERE conditions in SQL:

§ An attribute value matches a specified range of values (analogous to a BETWEEN clause, or numeric comparisons).

§ An attribute value matches a specified set of values (analogous to an IN() list).
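Both filter types correspond to single calls in the PHP API; a minimal sketch (the attribute names are illustrative):

<?php
$cl = new SphinxClient ();
$cl->SetFilterRange ( "price", 100, 200 );            // price BETWEEN 100 AND 200
$cl->SetFilter      ( "cat_id", array ( 5, 7, 11 ) ); // cat_id IN (5, 7, 11)
$res = $cl->Query ( "ipod", "products" );
?>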

If the filters will have a fixed number of values (“set” filters instead of “range” filters), and if such values are selective, it makes sense to replace the integer values with “fake keywords” and index them as full-text content instead of attributes. This applies to both normal numeric attributes and MVAs. We’ll see some examples of how to do this later.

Sphinx can also use filters to optimize full scans. Sphinx remembers minimum and maximum attribute values for short continuous row blocks (128 rows, by default) and can quickly throw away whole blocks based on filtering conditions. Rows are stored in the order of ascending document IDs, so this optimization works best for columns that are correlated with the ID. For instance, if you have a row-insertion timestamp that grows along with the ID, a full scan with filtering on that timestamp will be very fast.

The SphinxSE Pluggable Storage Engine

Full-text search results received from Sphinx almost always require additional work involving MySQL—at the very least, to pull out the text column values that the Sphinx index does not store. As a result, you’ll frequently need to JOIN search results from Sphinx with other MySQL tables.

Although you can achieve this by sending the result’s document IDs to MySQL in a query, that strategy leads to neither the cleanest nor the fastest code. For high-volume situations, you should consider using SphinxSE, a pluggable storage engine that you can compile into MySQL 5.0 or newer, or load into MySQL 5.1 or newer as a plugin.

SphinxSE lets programmers query searchd and access search results from within MySQL. The usage is as simple as creating a special table with an ENGINE=SPHINX clause (and an optional CONNECTION clause to locate the Sphinx server if it’s at a nondefault location), and then running queries against that table:

mysql> CREATE TABLE search_table (
    ->   id       INTEGER NOT NULL,
    ->   weight   INTEGER NOT NULL,
    ->   query    VARCHAR(3072) NOT NULL,
    ->   group_id INTEGER,
    ->   INDEX(query)
    -> ) ENGINE=SPHINX CONNECTION="sphinx://localhost:3312/test";
Query OK, 0 rows affected (0.12 sec)

mysql> SELECT * FROM search_table WHERE query='test;mode=all' \G
*************************** 1. row ***************************
      id: 123
  weight: 1
   query: test;mode=all
group_id: 45
1 row in set (0.00 sec)

Each SELECT passes a Sphinx query as the query column in the WHERE clause. The Sphinx searchd server returns the results. The SphinxSE storage engine then translates these into MySQL results and returns them to the SELECT statement.

Queries might include JOINs with any other tables stored using any other storage engines.

The SphinxSE engine supports most searching options available via the API, too. You can specify options such as filtering and limits by plugging additional clauses into the query string:

mysql> SELECT * FROM search_table WHERE query='test;mode=all;
    ->   filter=group_id,5,7,11;maxmatches=3000';

Per-query and per-word statistics that are returned by the API are also accessible through SHOW STATUS:

mysql> SHOW ENGINE SPHINX STATUS \G
*************************** 1. row ***************************
  Type: SPHINX
  Name: stats
Status: total: 3, total found: 3, time: 8, words: 1
*************************** 2. row ***************************
  Type: SPHINX
  Name: words
Status: test:3:5
2 rows in set (0.00 sec)

Even when you’re using SphinxSE, the rule of thumb still is to allow searchd to perform sorting, filtering, and grouping—i.e., to add all the required clauses to the query string rather than use WHERE, ORDER BY, or GROUP BY. This is especially important for WHERE conditions. The reason is that SphinxSE is only a client to searchd, not a full-blown built-in search library. Thus, you need to pass everything that you can to the Sphinx engine to get the best performance.

Advanced Performance Control

Both indexing and searching operations could impose a significant additional load on either the search server or the database server. Fortunately, a number of settings let you limit the load coming from Sphinx.

An undesired database-side load can be caused by indexer queries that either stall MySQL completely with their locks or just occur too quickly and hog resources from other concurrent queries.

The first case is a notorious problem with MyISAM, where long-running reads lock the tables and stall other pending reads and writes—you can’t simply do SELECT * FROM big_table on a production server, because you risk disrupting all other operations. To work around that, Sphinx offers ranged queries. Instead of configuring a single huge query, you can specify one query that quickly computes the indexable row ranges and another query that pulls out the data step by step, in small chunks:

sql_query_range = SELECT MIN(id),MAX(id) FROM documents
sql_range_step  = 1000
sql_query       = SELECT id, title, body FROM documents \
    WHERE id>=$start AND id<=$end

This feature is extremely helpful for indexing MyISAM tables, but it should also be considered when using InnoDB tables. Although InnoDB won’t just lock the table and stall other queries when running a big SELECT *, it will still use significant machine resources because of its MVCC architecture. Multiversioning for a thousand transactions that cover a thousand rows each can be less expensive than a single long-running million-row transaction.

The second cause of excessive load happens when indexer is able to process the data more quickly than MySQL provides it. You should also use ranged queries in this case. The sql_ranged_throttle option forces indexer to sleep for a given time period (in milliseconds) between subsequent ranged query steps, increasing indexing time but easing the load on MySQL.

Interestingly enough, there’s a special case where you can configure Sphinx to achieve exactly the opposite effect: that is, you can improve indexing time by placing more load on MySQL. When the connection between the indexer box and the database box is 100 Mbps and the rows compress well (which is typical for text data), the MySQL compression protocol can improve overall indexing time. The improvement comes at the cost of extra CPU time spent on the MySQL and indexer sides to compress and uncompress, respectively, the rows transmitted over the network, but the overall indexing time could be up to 20–30% less because of greatly reduced network traffic.

Search clusters can suffer from occasional overload, too, so Sphinx provides a few ways to help avoid searchd going off on a spin.

First, a max_children option simply limits the total number of concurrently running queries and tells clients to retry when that limit is reached.

Then there are query-level limits. You can specify that query processing stop either at a given threshold of matches found or a given threshold of elapsed time, using the SetLimits() and SetMaxQueryTime() API calls, respectively. This is done on a per-query basis, so you can ensure that more important queries always complete fully.
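A sketch of both per-query thresholds via the PHP API: the third argument to SetLimits() caps how many matches to keep (the per-query max_matches), the fourth stops searching once that many matches have been found (the cutoff), and SetMaxQueryTime() caps elapsed time in milliseconds:

<?php
$cl = new SphinxClient ();
// keep at most 1,000 matches, stop searching after 10,000 are found
$cl->SetLimits ( 0, 20, 1000, 10000 );
// give up after 100 milliseconds of query processing
$cl->SetMaxQueryTime ( 100 );
$res = $cl->Query ( "ipod", "products" );
?>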

Finally, periodic indexer runs can cause bursts of additional I/O that will in turn cause intermittent searchd slowdowns. To prevent that, there are options that limit indexer disk I/O. max_iops enforces a minimum delay between I/O operations, ensuring that no more than the specified number of disk operations per second are performed. But even a single operation could be too much; consider a 100 MB read() call as an example. The max_iosize option takes care of that, guaranteeing that the length of every disk read or write stays under a given boundary. Larger operations are automatically split into smaller ones, and these smaller ones are then controlled by the max_iops setting.


[231] See http://en.wikipedia.org/wiki/Okapi_BM25 for details.

Practical Implementation Examples

Each of the features we’ve described can be found successfully deployed in production. The following sections review several of these real-world Sphinx deployments, briefly describing the sites and some implementation details.

Full-Text Searching on Mininova.org

A popular torrent search engine, Mininova (http://www.mininova.org), provides a clear example of how to optimize “just” full-text searching. Sphinx replaced several MySQL replicas that used MySQL’s built-in full-text indexes and were unable to handle the load. After the replacement, the search servers were underloaded; the current load average is now in the 0.3–0.4 range.

Here are the database size and load numbers:

§ The site has a small database, with about 300,000–500,000 records and about 300–500 MB of index.

§ The site load is quite high: about 8–10 million searches per day at the time of this writing.

The data mostly consists of user-supplied filenames, frequently without proper punctuation. For this reason, prefix indexing is used instead of whole-word indexing. The resulting index is several times larger than it would otherwise be, but it is still small enough that it can be built quickly and its data can be cached effectively.

Search results for the 1,000 most frequent queries are cached on the application side. About 20–30% of all queries are served from the cache. Because of the “long tail” query distribution, a larger cache would not help much more.

For high availability, the site uses two servers with complete full-text index replicas. The indexes are rebuilt from scratch every few minutes. Indexing takes less than one minute, so there’s no point in implementing more complex schemes.

The following are lessons learned from this example:

§ Caching search results in the application helps a lot.

§ There might be no need to have a huge cache, even for busy applications. A mere 1,000 to 10,000 entries can be enough.

§ For databases on the order of 1 GB in size, simple periodic reindexing instead of more complicated schemes is OK, even for busy sites.

Full-Text Searching on BoardReader.com

Mininova is an extreme high-load project case—there’s not that much data, but there are a lot of queries against that data. BoardReader (http://www.boardreader.com) is just the opposite: a forum search engine that performs many fewer searches on a much larger dataset. Sphinx replaced a commercial full-text search engine that took up to 10 seconds per query to search through a 1 GB collection. Sphinx allowed BoardReader to scale greatly, both in terms of data size and query throughput.

Here’s some general information:

§ There are more than 1 billion documents and 1.5 TB of text in the database.

§ There are about 500,000 page views and between 700,000 and 1 million searches per day.

At the time of this writing, the search cluster consists of six servers, each with four logical CPUs (two dual-core Xeons), 16 GB of RAM, and 0.5 TB of disk space. The database itself is stored on a separate cluster. The search cluster is used only for indexing and searching.

Each of the six servers runs four searchd instances, so all four cores are used. One of the four instances aggregates the results from the other three. That makes a total of 24 searchd instances. The data is distributed evenly across all of them. Every searchd copy carries several indexes over approximately 1/24 of the total data (about 60 GB).

The search results from the six “first-tier” searchd nodes are in turn aggregated by another searchd instance running on the frontend web server. This instance carries several purely distributed indexes, which reference the six search cluster servers but have no local data at all.

Why have four searchd instances per node? Why not have only one searchd instance per server, configure it to carry four index chunks, and make it contact itself as though it’s a remote server to utilize multiple CPUs, as we suggested earlier? Having four instances instead of just one has its benefits. First, it reduces startup time. There are several gigabytes of attribute data that need to be preloaded in RAM; starting several daemons at a time lets us parallelize that. Second, it improves availability. In the event of searchd failures or updates, only 1/24 of the whole index is inaccessible, instead of 1/6.

Within each of the 24 instances on the search cluster, we used time-based partitioning to reduce the load even further. Many queries need to be run only on the most recent data, so the data is divided into three disjoint index sets: data from the last week, from the last three months, and from all time. These indexes are distributed over several different physical disks on a per-instance basis. This way, each instance has its own CPU and physical disk drive and won’t interfere with the others.

Local cron jobs update the indexes periodically. They pull the data from MySQL over the network but create the index files locally.

Using several explicitly separated “raw” disks proved to be faster than a single RAID volume. Raw disks give control over which files go on which physical disk. That is not the case with RAID, where the controller decides which block goes on which physical disk. Raw disks also guarantee fully parallel I/O on different index chunks, but concurrent searches on RAID are subject to I/O stepping. We chose RAID 0, which has no redundancy, because we don’t care about disk failures; we can easily rebuild the indexes on the search nodes. We could also have used several RAID 1 (mirror) volumes to give the same throughput as raw disks while improving reliability.

Another interesting thing to learn from BoardReader is how Sphinx version updates are performed. Obviously, the whole cluster cannot be taken down. Therefore, backward compatibility is critical. Fortunately, Sphinx provides it—newer searchd versions usually can read older index files, and they are always able to communicate to older clients over the network. Note that the first-tier nodes that aggregate the search results look just like clients to the second-tier nodes, which do most of the actual searching. Thus, the second-tier nodes are updated first, then the first-tier ones, and finally the web frontend.

Lessons learned from this example are:

§ The Very Large Database Motto: partition, partition, partition, parallelize.

§ On big search farms, organize searchd in trees with several tiers.

§ Build optimized indexes with a fraction of the total data where possible.

§ Map files to disks explicitly rather than relying on the RAID controller.

Optimizing Selects on Sahibinden.com

Sahibinden (http://www.sahibinden.com), a leading Turkish online auction site, had a number of performance problems, including full-text search performance. After deploying Sphinx and profiling some queries, we found that Sphinx could perform many of the frequent application-specific queries with filters faster than MySQL—even when there was an index on one of the participating columns in MySQL. Besides, using Sphinx for non-full-text searches resulted in unified application code that was simpler to write and support.

MySQL was underperforming because the selectivity on each individual column was not enough to reduce the search space significantly. In fact, it was almost impossible to create and maintain all the required indexes, because too many columns required them. The product information tables had about 100 columns, each of which the web application could technically use for filtering or sorting.

Active insertion and updates to the “hot” products table slowed to a crawl, because of too many index updates.

For that reason, Sphinx was a natural choice for all the SELECT queries on the product information tables, not just the full-text search queries.

Here are the database size and load numbers for the site:

§ The database contains about 400,000 records and 500 MB of data.

§ The load is about 3 million queries per day.

To emulate normal SELECT queries with WHERE conditions, the Sphinx indexing process included special keywords in the full-text index. The keywords were of the form __CATN__, where N was replaced with the corresponding category ID. This replacement happened during indexing with the CONCAT() function in the MySQL query, so the source data was not altered.

The indexes needed to be rebuilt as frequently as possible. We settled on rebuilding them every minute. A full reindexing took 9–15 seconds on one of many CPUs, so the main + delta scheme discussed earlier was not necessary.

The PHP API turned out to spend a noticeable amount of time (7–9 milliseconds per query) parsing the result set when it had many attributes. Normally this overhead would not be an issue, because the full-text search itself, especially over a big collection, costs more than the parsing. But in this specific case, we also needed non-full-text queries against a small collection. To alleviate the issue, the index was split into a pair: a “lightweight” one with the 34 most frequently used attributes, and a “complete” one with all 99 attributes.
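
Configuration-wise, the split might look something like the following sketch (names are illustrative and the attribute lists are abbreviated); the application queries products_light unless it actually needs one of the rarely used attributes:

source products_light
{
    type = mysql
    # connection settings omitted
    sql_query = SELECT id, title, description, cat_id, price FROM products
    # ...declarations for the 34 most frequently used attributes...
    sql_attr_uint  = cat_id
    sql_attr_float = price
}

index products_light
{
    source = products_light
    path   = /usr/local/sphinx/var/data/products_light
}

# products_full is declared the same way, except that its sql_query and
# attribute declarations cover all 99 columns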

Other possible solutions would have been to use SphinxSE or to implement a feature to pull only the specified columns into Sphinx. However, the workaround with two indexes was by far the fastest to implement, and time was a concern.

The following are the lessons learned from this example:

§ Sometimes, a full scan in Sphinx performs better than an index read in MySQL.

§ For selective conditions, use a “fake keyword” instead of filtering on an attribute, so the full-text search engine can do more of the work.

§ APIs in scripting languages can be a bottleneck in certain extreme but real-world cases.

Optimizing GROUP BY on BoardReader.com

An improvement to the BoardReader service required counting hyperlinks and building various reports from the linking data. For instance, one of the reports needed to show the top N second-level domains linked to during the last week. Another counted the top N second- and third-level domains that linked to a given site, such as YouTube. The queries to build these reports had the following common characteristics:

§ They always group by domain.

§ They sort by count per group or by the count of distinct values per group.

§ They process a lot of data (up to millions of records), but the result set with the best groups is always small.

§ Approximate results are acceptable.

During the prototype-testing phase, MySQL took up to 300 seconds to execute these queries. In theory, by partitioning the data, splitting it across servers, and manually aggregating the results in the application, it would have been possible to optimize the queries to around 10 seconds. But this is a complicated architecture to build; even the partitioning implementation is far from straightforward.

Because we had successfully distributed the search load with Sphinx, we decided to implement an approximate distributed GROUP BY with Sphinx, too. This required preprocessing the data before indexing to convert all the interesting substrings into standalone “words.” Here’s a sample URL before and after preprocessing:

source_url    = http://my.blogger.com/my/best-post.php
processed_url = my$blogger$com, blogger$com, my$blogger$com$my,
                my$blogger$com$my$best, my$blogger$com$my$best$post.php

Dollar signs ($) are merely a unified replacement for URL separator characters so that searches can be conducted on any URL part, be it domain or path. This type of preprocessing extracts all “interesting” substrings into single keywords that are the fastest to search. Technically, we could have used phrase queries or prefix indexing, but that would have resulted in bigger indexes and slower performance.
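
The production implementation of this expansion is a MySQL UDF (described next), but the underlying logic is simple. Here is a rough PHP sketch of the idea; its exact splitting rules differ slightly from the production ones (which, for example, also split path components on dots and dashes):

<?php
// Expand a URL into standalone "words" so that any domain suffix or
// domain-plus-path prefix can be matched as a single keyword.
// Illustrative only; the real implementation is a MySQL UDF.
function preprocess_url($url)
{
    $url = preg_replace('{^[a-z]+://}i', '', $url);              // drop the scheme
    list($domain, $path) = array_pad(explode('/', $url, 2), 2, '');

    $words  = array();
    $labels = explode('.', $domain);

    // Domain suffixes of at least two labels: blogger$com, my$blogger$com
    for ($i = count($labels) - 2; $i >= 0; $i--) {
        $words[] = implode('$', array_slice($labels, $i));
    }

    // Path prefixes appended to the full domain:
    // my$blogger$com$my, my$blogger$com$my$best-post.php, ...
    $prefix = implode('$', $labels);
    foreach (explode('/', $path) as $part) {
        if ($part === '') {
            continue;                      // skip empty components, e.g. a trailing slash
        }
        $prefix .= '$' . $part;
        $words[] = $prefix;
    }
    return implode(' ', $words);
}

echo preprocess_url('http://my.blogger.com/my/best-post.php');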

Links are preprocessed at indexing time using a specially crafted MySQL UDF. We also enhanced Sphinx with the ability to count distinct values for this task. After that, we were able to move the queries completely to the search cluster, distribute them easily, and reduce query latency greatly.
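
With domains also stored as a numeric attribute for grouping (here a hypothetical domain_id computed during indexing), such a report maps directly onto the group-by calls of the client API. A sketch using the bundled PHP API, with the 'links' distributed index name being hypothetical:

<?php
require 'sphinxapi.php';                  // the PHP client shipped with Sphinx

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// Group matching link records by domain, sort the groups by the count of
// distinct values per group (the COUNT(DISTINCT) support mentioned above),
// and keep only the top groups needed for the report.
$cl->SetGroupBy('domain_id', SPH_GROUPBY_ATTR, '@distinct desc');
$cl->SetGroupDistinct('linking_domain_id');
$cl->SetLimits(0, 100);

// Find all records that mention youtube$com, one of the keywords produced
// by the URL preprocessing, across the distributed index.
$result = $cl->Query('youtube$com', 'links');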

Here are the database size and load numbers:

§ There are about 150–200 million records, which amount to about 50–100 GB of data after preprocessing.

§ The load is approximately 60,000–100,000 GROUP BY queries per day.

The indexes for the distributed GROUP BY were deployed on the same search cluster of 6 machines and 24 logical CPUs described previously. This adds only a minor load on top of the main search workload over the 1.5 TB text database.

Sphinx replaced MySQL’s exact, slow, single-CPU computations with approximate, fast, distributed computations. All of the factors that introduce approximation errors are present here: the incoming data frequently contains too many rows to fit in the “sort buffer” (we use a fixed RAM limit of 100K rows), we use COUNT(DISTINCT), and the result sets are aggregated over the network. Despite that, the results for the first 10 to 1000 groups—which are actually required for the reports—are from 99% to 100% correct.
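
That fixed 100K-row limit corresponds to Sphinx’s max_matches setting, which can also be set per query through the API; continuing the PHP sketch above, presumably something along these lines:

$cl->SetLimits(0, 1000, 100000);   // offset, limit, and a 100K-row match/group buffer (max_matches)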

The indexed data is very different from the data that would be used for an ordinary full-text search. There are a huge number of documents and keywords, even though the documents themselves are very small. The document numbering is nonsequential, because it follows a special convention (source server, source table, and primary key) that does not fit in 32 bits. The huge number of search “keywords” was also causing frequent CRC32 collisions (Sphinx uses CRC32 to map keywords to internal word IDs). For these reasons, we were forced to use 64-bit identifiers everywhere internally.
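
In the Sphinx versions of that era, 64-bit document and word IDs were a compile-time option; a sketch of the build from a source tree:

$ ./configure --enable-id64
$ make
$ make install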

The current performance is satisfactory. For the most complex domains, queries normally complete in 0.1 to 1.0 seconds.

The following are the lessons learned from this example:

§ For GROUP BY queries, some precision can be traded for speed.

§ With huge textual collections or moderately sized special collections, 64-bit identifiers might be required.

Optimizing Sharded JOIN Queries on Grouply.com

Grouply (http://www.grouply.com) built a Sphinx-based solution to search its multimillion-record database of tagged messages, using Sphinx’s MVA support. The database is split across many physical servers for massive scalability, so it might be necessary to query tables that are located on different servers. Arbitrary large-scale joins are impossible because there are too many participating servers, databases, and tables.

Grouply uses Sphinx’s MVA attributes to store message tags. The tag list is retrieved from a Sphinx cluster via the PHP API. This replaces multiple sequential SELECTs from several MySQL servers. To reduce the number of SQL queries as well, certain presentation-only data (for example, a small list of users who last read the message) is also stored in a separate MVA attribute and accessed through Sphinx.
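
As a sketch of the data source side (table and column names are hypothetical), the tags and the presentation-only lists are declared as multivalued attributes that Sphinx fetches with separate queries:

source messages
{
    type = mysql
    # connection settings omitted
    sql_query = SELECT id, subject, body FROM messages

    # message tags as a multivalued attribute
    sql_attr_multi = uint tag_ids from query; \
        SELECT message_id, tag_id FROM message_tags

    # presentation-only data, e.g. the users who last read each message
    sql_attr_multi = uint last_reader_ids from query; \
        SELECT message_id, user_id FROM message_last_reads
}

The PHP API then returns these lists in each match’s attributes, so a single Sphinx query replaces a chain of per-shard SELECTs.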

Two key innovations here are using Sphinx to prebuild JOIN results and using its distributed capabilities to merge data scattered over many shards. This would be next to impossible with MySQL alone. Efficient merging would require partitioning the data over as few physical servers and tables as possible, but that would hurt both scalability and extensibility.

Lessons learned from this example are:

§ Sphinx can be used to aggregate highly partitioned data efficiently.

§ MVAs can be used to store and optimize prebuilt JOIN results.

Summary

We’ve discussed the Sphinx full-text search system only briefly in this appendix. To keep it short, we intentionally omitted discussions of many other Sphinx features, such as HTML indexing support, ranged queries for better MyISAM support, morphology and synonym support, prefix and infix indexing, and CJK indexing. Nevertheless, this appendix should give you some idea of how Sphinx can solve many different real-world problems efficiently. It is not limited to full-text searching; it can solve a number of difficult problems that would traditionally be done in SQL.

Sphinx is neither a silver bullet nor a replacement for MySQL. However, in many cases (which are becoming common in modern web applications), it can be used as a very useful complement to MySQL. You can use it to simply offload some work, or even to create new possibilities for your application.

Download it at http://www.sphinxsearch.com—and don’t forget to share your own usage ideas!