Searching Data Using Sphinx - MariaDB Cookbook (2014)

MariaDB Cookbook (2014)

Chapter 9. Searching Data Using Sphinx

In this chapter, we will cover the following recipes:

· Installing SphinxSE in MariaDB

· Installing the Sphinx daemon on Linux

· Installing the Sphinx daemon on Windows

· Configuring the Sphinx daemon

· Searching using the Sphinx daemon and SphinxSE

Introduction

With any growing or evolving database, there comes a time when the limitations of MariaDB's built in, full text-searching functionality becomes more of a hindrance than its convenience is worth. At that point, another method is needed to efficiently index and search through our textual data. This is where Sphinx comes in.

There are actually two parts to Sphinx: an external daemon called Sphinx that does the work of building and maintaining the search index using that we use to search our data, and a storage engine component called SphinxSE that is part of MariaDB, which the Sphinx daemon uses to talk to MariaDB. The recipes in this chapter will cover setting up and using both these.

Installing SphinxSE in MariaDB

Before we can start using Sphinx, we need to enable SphinxSE in MariaDB.

How to do it...

1. Open the mysql command-line client and connect to our database server with a user that has the SUPER privilege.

2. Run the following command to install SphinxSE:

3. INSTALL SONAME 'ha_sphinx';

4. Run the following command, shown as follows, to check that the Sphinx Storage Engine (SphinxSE) is enabled (the Support column will say YES):

5. SHOW storage engines;

How to do it...

6. Run the following command to view the SphinxSE status variables (they will be empty):

7. SHOW STATUS LIKE 'sphinx_%';

How to do it...

How it works...

SphinxSE is included in MariaDB, but it is disabled by default. To enable it, we run the INSTALL SONAME command with the name of the plugin (ha_sphinx). This is a one-time operation.

Once SphinxSE is enabled in this manner and if we have the external Sphinx daemon installed and running, we can start using Sphinx to search our data. This is the topic of the following two recipes.

There's more...

The following are a couple of minor things to keep in mind when working with SphinxSE.

SphinxSE versus Sphinx

Despite its name, SphinxSE does not actually store data. It's called a storage engine because it uses the storage engine API to communicate with the rest of MariaDB. In reality, SphinxSE is a client that talks to an externally running indexing and searching program (or daemon) called Sphinx. SphinxSE's purpose is to allow us to talk to the Sphinx daemon from within MariaDB.

Getting SphinxSE's status

Another way to get the status of SphinxSE is to use the following command:

SHOW ENGINE SPHINX STATUS;

For the SHOW STATUS command, if we haven't actually started using Sphinx, there won't be any status to display. The difference is that with SHOW ENGINE SPHINX STATUS; if there is nothing to display, then there will be no output. The output of the preceding statement is shown as follows:

Getting SphinxSE's status

So, with the SHOW STATUS command, all the status variables are shown even if they have no values to display, and with the SHOW ENGINE SPHINX STATUS command, only status variables that have something to tell us will be shown.

See also

· The full documentation of SphinxSE is available at: https://mariadb.com/kb/en/sphinx-storage-engine/

Installing the Sphinx daemon on Linux

In order to use SphinxSE and Sphinx, we must install the daemon on our server. This recipe covers the process for Linux servers running Ubuntu, Debian, Red Hat, Fedora, or CentOS.

How to do it...

1. On Red Hat, CentOS, Ubuntu, or Debian, go to the Sphinx download site at http://sphinxsearch.com/downloads/release/ and download the latest Sphinx package for our Linux distribution.

2. On Debian or Ubuntu servers, run the following statement to install the Sphinx daemon:

3. sudo apt-get install unixodbc libpq5 mariadb-client

4. sudo dpkg -i sphinxsearch*.deb

5. On Red Hat and CentOS, run the following statement to install the Sphinx daemon:

6. sudo yum install postgresql-libs unixODBC

7. sudo rpm -Uhv sphinx*.rpm

8. On Fedora, run the following command to install Sphinx:

9. sudo yum install sphinx

10. On all server types, configure the Sphinx daemon as described in the Configuring the Sphinx daemon recipe in this chapter.

11. On Ubuntu and Debian, edit the /etc/default/sphinxsearch file and set START=yes. Then run the following command to start the Sphinx daemon:

12.sudo service sphinxsearch start

13. On Fedora, CentOS, or Red Hat, run the following command to start the Sphinx daemon:

14.sudo service searchd start

15. To stop the Sphinx daemon, run the service command again, this time with stop instead of start.

How it works...

The Sphinx daemon is in the package repositories for CentOS, RedHat, Fedora, Ubuntu, and Debian. However, it is usually older than the version available directly from the official Sphinx website. The versions included in the package repositories also do not include the API files that let us easily integrate searching using Sphinx into our applications. These files are sometimes available as separate packages, but not always, so it's better to get Sphinx straight from the source so that we have everything we may need.

In Debian and Ubuntu, the Sphinx package and daemon are called sphinxsearch. In Red Hat, Fedora, and CentOS, the package is called sphinx and the daemon is called searchd.

See also

· The complete documentation of installing and using Sphinx is available on the Sphinx website: http://sphinxsearch.com/docs/current.html.

Installing the Sphinx daemon on Windows

In order to use Sphinx, we must install the daemon on our server. This recipe is all about installing the Windows version of the Sphinx daemon.

How to do it...

1. Go to http://sphinxsearch.com/downloads/release/ and download the latest version of Sphinx for MySQL, either the 64-bit or 32-bit version depending on our version of Windows. For this recipe, we'll assume that we're running a 64-bit version of Windows.

2. Navigate to the Downloads folder and extract the ZIP file.

3. Extract the file to a location that is convenient. This recipe will assume that we extracted the files to C:\Sphinx (as the Sphinx docs recommend).

4. If the unzipping process creates a Sphinx subfolder (such as sphinx-2.1.3-release-win64) under C:\Sphinx move the contents of that subdirectory to the C:\Sphinx folder and then remove the empty directory using the following statements:

5. cd C:\Sphinx

6. mv .\sphinx-2.1.3-release-win64\* .\

7. rmdir sphinx-2.1.3-release-win64

8. Configure Sphinx as described in the Configuring the Sphinx daemon recipe in this chapter

9. Complete the Getting ready portion of the Searching with the Sphinx daemon and SphinxSE recipe in this chapter and then run the following command:

10.C:\Sphinx\bin\indexer --all

11. Open a PowerShell or terminal window and install the Sphinx searchd program as a Windows service with the following commands:

12.cd C:\Sphinx

13.C:\Sphinx\bin\searchd --install –-config C:\Sphinx\sphinx.conf --servicename SphinxSearch

14. The output of the previous step will be similar to the following screenshot:

How to do it...

15. Open the Windows Management Console (also called Computer Management under the Tools menu of the Server Manager), click on the Services and Applications toggle, and then click on Services.

16. Locate SphinxSearch in the list of services. Right-click on it and choose Start as shown in the following screenshot:

How to do it...

How it works...

The Sphinx developers do not provide an MSI install package for Windows like the MariaDB developers do. Instead, they simply offer a ZIP file that we can download and use to install Sphinx manually.

There are several different versions of Sphinx that we can download. Any of the ones that mention MySQL is fine. We can download the ones that also include PostgreSQL and PgSQL support if we want, but they are only needed if we are using those databases in addition to MariaDB.

When we unzip the files, it helps if we put them in an easy-to-remember and easy-to-use location. So in this recipe we use C:\Sphinx\. In order to successfully start the Sphinx service, we need to bootstrap the search database by creating and populating our documents table and then use the indexer program to create our initial search index. After that is done we can run the command to install the SphinxSearch service and then start it.

There's more...

We can name the Windows service whatever we want. The Sphinx documentation recommends using the name SphinxSearch. Another popular choice is to name it searchd just like it is named in CentOS, Red Hat, and Fedora Linux.

See also

· The complete documentation of the various Sphinx configuration options is available on the Sphinx website at http://sphinxsearch.com/docs/current.html

Configuring the Sphinx daemon

In order to use Sphinx, we need to add a user to our MariaDB database and configure the Sphinx daemon so that it indexes the content we want it to.

Getting ready

Install the SphinxSE as described in the Installing SphinxSE in MariaDB recipe in this chapter. Install the Sphinx daemon as described in either the Installing the Sphinx daemon on Linux recipe or the Installing the Sphinx daemon on Windows recipe, both in this chapter, depending on which operating system we are using.

How to do it...

1. Open the mysql command-line client and create a user with SELECT rights for the tables in the databases we want the Sphinx daemon to index, using the following statements:

2. CREATE USER 'sphinx'@localhost

3. IDENTIFIED BY 'sphinxsecretpassword';

4. GRANT SELECT on test.* to 'sphinx'@localhost;

5. Move the default sphinx.conf file out of the way; for example, the following will work on Linux:

6. sudo mv -vi sphinx.conf sphinx.conf.dist

7. Create a new sphinx.conf file in a text editor (such as vim, gedit, or pluma on Linux or Notepad on Windows) with the following statements:

8. #-----------------------------------------------------------------

9. # Example data source and index config

10.#-----------------------------------------------------------------

11.source docstbl {

12. type = mysql

13. sql_host = localhost

14. sql_user = sphinx

15. sql_pass = sphinxsecretpassword

16. sql_db = test

17. sql_port = 3306

18. sql_attr_timestamp = date_added

19. sql_query = \

20. SELECT id, UNIX_TIMESTAMP(date_added) AS date_added, \

21. title, content FROM documents

22. sql_query_info = SELECT * FROM documents WHERE id=$id

23.}

24.

25.index docsidx {

26. source = docstbl

27. path = /tmp/docsidx

28. dict = keywords

29.}

30.

31.indexer {

32. mem_limit = 32M

33.}

34.

35.searchd {

36. pid_file = /var/run/sphinxsearch/searchd.pid

37.}

38. On Windows, the pid_file and path lines will need to be changed to valid paths. They are as follows:

39.path = C:\Sphinx\docsidx

pid_file = C:\Sphinx\sphinx.pid

How it works...

The Sphinx daemon indexes our data using a database user that has the SELECT privilege on the tables we specify. We could use an existing user, but it's far better to create one specifically for the Sphinx daemon that only has the SELECT right on specific databases and tables we want it to have access to. For our recipe, we create a user named sphinx and grant it the SELECT right on all of the tables in the test database.

After creating the user, we need to inform the Sphinx daemon of the name and password of the user, the database to connect to, and the query to use when building our index, among other things. The Sphinx daemon's configuration file is named sphinx.conf, and it will be in one of the few different locations depending on our operating system. The actual configuration files are very similar; the main differences are the paths to various locations, which are different depending on where those locations are on our operating system.

On Windows, the sphinx.conf file is located wherever we unzipped the files. A common location is C:\Sphinx\sphinx.conf. On Red Hat, Fedora, or CentOS, the sphinx.conf file is located in /etc/sphinx/sphinx.conf. On Ubuntu and Debian, the sphinx.conf file is located in /etc/sphinxsearch/sphinx.conf.

The default configuration file contains examples of every possible option with short descriptions for each of them. Most are commented out in the file as they aren't needed.

The actual variables we need to set are quite minimal. To make it easy, the book's website has a file, 4399OS_09_sphinx.conf, with the configuration from the recipe. We can simply add the contents of this file to our local sphinx.conf file instead of manually typing it in.

One important note is that in the example, the path to the index is set to /tmp/docsidx. In reality, we would likely never want to store our index in the /tmp/ directory. This location is fine for our test index though. A better place on Linux would be under /var/lib/. For example, the default location for our index file on Ubuntu and Debian would be /var/lib/sphinxsearch/data/docsidx

The descriptions in the example configuration file are often enough to understand what the variable in question does, so there is no need to go through all of them here. The main purpose of the configuration file is to define our data sources in the source{} sections, define indexes attached or based on those data sources in the index{} sections, and set options for Sphinx's indexer and searchd, the two main parts of the Sphinx daemon, which do the work of indexing and searching through our data, respectively.

There's more...

There are a few things to keep in mind when configuring Sphinx. They are discussed in the following sections.

The Sphinx daemon and MariaDB on different hosts

First, the Sphinx daemon doesn't have to run on the server running MariaDB. It can talk to our database over a network connection just like any other MariaDB client. In that case, we just set the sql_host option to the hostname or IP address of our MariaDB server and configure everything else in the same manner as if we were running both on the same server. We just need to remember that the user we create must be able to login from the server on which Sphinx is running.

Sphinx queries

Another thing to study when looking at the example configuration file is that when we set up our data sources, one of the things we do is to define a query with the sql_query variable. This query could be something simple like the following code:

SELECT id, data FROM documents

Or it could be something more complex like what we used in the recipe:

sql_query = \

SELECT id, group_id, \

UNIX_TIMESTAMP(date_added) AS date_added, \

title, content \

FROM documents

In either case, the very first column must be an integer. This first column is the document ID in Sphinx and it is mandatory. The second thing is that any data we want to be searchable must be returned by this query (or in the query part of another data source section). If it isn't, Sphinx won't know about it, so it won't be able to help us search for it.

This goes both ways as there may be cases where we don't want some data to be searchable. In this case, just don't include it in the sql_query and Sphinx will not even see it.

Now that we have the Sphinx daemon configured, we can test it by searching for our data. That is the topic of the following recipe.

See also

· The complete, detailed documentation for configuring Sphinx is available on the Sphinx website at http://sphinxsearch.com/docs/current.html

Searching with the Sphinx daemon and SphinxSE

After completing the other recipes in this chapter, this is the recipe where we actually get to see Sphinx doing something.

Getting ready

This recipe requires that we install and configure SphinxSE and the Sphinx daemon. See the previous recipes in this chapter for instructions.

In the previous recipe, we configured the Sphinx daemon to index and search a table called documents in the test database. For the purposes of this recipe and to match the previous recipe, we need to create this table with the following CREATE TABLE command:

CREATE TABLE documents (

id SERIAL PRIMARY KEY,

date_added TIMESTAMP,

title VARCHAR(256),

content TEXT

);

We also need to add some example data to the table. The 4399OS_09_documents.sql file, available from this book's website, will create the table and populate it with some example data. We can load the file using the following statement:

mysql -u user -p test < 4399OS_09_documents.sql

We'll need to change the user to a valid username and provide a valid password when prompted.

How to do it...

1. In a terminal window, stop the Sphinx daemon if it is running and then start it again.

2. Run the following indexer command:

3. indexer --rotate –all

How to do it...

4. Run the following search command:

5. search -q nosql

How to do it...

6. Open the mysql command-line client, connect to the test database, and create a SphinxSE table connected to our local Sphinx daemon:

7. CREATE TABLE documents_search (

8. id BIGINT UNSIGNED NOT NULL,

9. weight INT NOT NULL,

10. query VARCHAR(3072) NOT NULL,

11. INDEX(query)

12.) ENGINE=SPHINX;

13. Test your documents_search table by running some queries, such as the following:

14.SELECT * FROM documents_search WHERE query='nosql';

15.SELECT * FROM documents_search WHERE query='sphinx';

How to do it...

16. Exit the mysql command-line client, stop MariaDB, relaunch the mysql command-line client by connecting directly to the Sphinx daemon, and run the following queries:

17.mysql -u root -h 0 -P 9306

18.SELECT * FROM docsidx WHERE MATCH('syntax diagrams');

19.SELECT * FROM docsidx WHERE MATCH('diving');

20.SELECT * FROM docsidx WHERE MATCH('tokudb|cassandra');

How to do it...

How it works...

There are several ways to search through our data using SphinxSE and the Sphinx daemon. However, before doing that, we need to index our data using the indexer command; so, that's what we do first.

The Sphinx daemon ships with a command-line utility called search, which we can use to search through our indexes directly. This is useful for testing to make sure our data has been indexed correctly and in shell scripts.

The second way of searching our data is with SphinxSE. As a storage engine, to use SphinxSE, we need to create a table. This table looks like a regular table for the most part, but when creating it, what we are really doing is defining our connection to the Sphinx daemon. The table has three columns and an index. We can name them what we want, but their datatypes need to match what SphinxSE and the Sphinx daemon expect. The first column is for the id of our documents; it must be defined as BIGINT and its natural name is id. The second column is the weight. This column will show us an integer. The higher the number, the better the document matches our query. The third column is for our queries. In our recipe, we name this column query, but another common name is q.

The last part of the SphinxSE table definition is to set the ENGINE for the table to SPHINX. We could also set a CONNECTION parameter, which has the following form:

CONNECTION="sphinx://HOST:PORT/INDEXNAME"

By leaving this parameter off in our table definition, SphinxSE goes with the default values, which are as follows:

CONNECTION="sphinx://localhost:9312/*"

After setting up our special SphinxSE table, we can search for data using a simple SELECT statement with our query in the WHERE clause. SphinxSE will reply with the ID of the documents that match along with three columns of information. The first will be an id column containing the IDs of the matching documents. The second column, named weight contains an integer. The higher this number, the better it matches our query. The third column contains our actual query.

Finally, in our recipe, we have some fun and connect directly to the Sphinx daemon. We can do this because it speaks the MariaDB binary network protocol. We're not actually connecting to MariaDB; we turned it off after all. Here, we use the mysql command-line client but we could use any MariaDB-compatible client.

The SQL we use in queries using this special mode is actually a Sphinx-specific variant called SphinxQL. It is a subset of the regular SQL and is used specifically for Sphinx queries from the command-line (or another) client.

There's more...

To get the most out of SphinxSE and the Sphinx daemon, we probably want to make use of the Sphinx API.

There isn't time here to go into how to use the Sphinx APIs for the various programming languages that Sphinx supports. Thankfully, there are example test programs included with the Sphinx daemon for all of them.

On Windows, they can be found under the C:\Sphinx\api folder (if we unzipped Sphinx to C:\Sphinx). On Linux, the examples are generally found under /usr/share/sphinxsearch/api/ or /usr/share/sphinx/api/ depending on whether we're using .deb or .rpm packages, respectively.

See also

· The full documentation of SphinxQL can be found at http://sphinxsearch.com/docs/current.html#sphinxql-reference

· The full documentation of rt indexes can be found at http://sphinxsearch.com/docs/current.html#rt-indexes

· Some examples showing the differences between using SphinxQL and the API can be found at http://sphinxsearch.com/blog/2013/07/23/from-api-to-sphinxql-and-back-again/