
PART 3. Advanced Database Internals

CHAPTER 13. Query Optimization

The query-tree class shown in Chapter 12 forms the starting point for building the experimental query optimization and execution engine for DBXP. In this chapter, I show you how to add the optimizer to the query-tree class. I begin by explaining the rationale for the heuristics (or rules) used in the optimizer and then jump into writing the code. Because the code for some of the functions is quite lengthy, the examples in this chapter are excerpts. If you are following along by coding the examples, download the source code for this chapter instead of typing in the code from scratch.

Types of Query Optimizers

The first query optimizers were designed for use in early database systems, such as System R1 and INGRES.2 These optimizers were developed for a particular implementation of the relational model, and they have stood the test of time as illustrations for how to implement optimizers. Many commercially available database systems are based on these works. Since then, optimizers have been created for extensions of the relational model to include object-oriented and distributed database systems.

One example is the Volcano optimizer, which uses a dynamic programming algorithm3 to generate query plans for cost-based optimization in object-oriented database systems. Another example is concerned with how to perform optimization in heterogeneous database systems (similar to distributed systems, but there is no commonly shared concept of organization). In these environments, it is possible to use statistical methods for deriving optimization strategies.

Another area in which the requirements for query optimization generate unique needs is that of in-memory database systems. These systems are designed to keep the entire system and all of the data in the computer’s main memory rather than on disk. While most of these applications are implemented as embedded systems, some larger distributed systems that consist of a collection of systems use in-memory databases to expedite information retrieval. Optimization in in-memory database systems requires efficient algorithms, because the need for optimizing retrieval is insignificant compared to the need for processing the query itself.4

All the research into traditional and nontraditional optimization is built on the foundation of the System R optimizer. The System R optimizer is a cost-based optimizer that uses information gathered about the database and the data, or statistics, to form cost estimates for how the query would perform. Additionally, the concept of arranging the internal representation of the query into different, but equivalent, internal representations (they generate the same answer) provides a mechanism to store the alternative forms. Each of these alternative forms is called a query plan. The plan with the least cost is chosen as the most efficient way to execute the query.

One key feature identified in the System R work was the concept of selectivity—the prediction of results based on the evaluation of an expression that contained references to attributes and their values. Selectivity is central to determining in what order the simple expressions in a conjunctive selection should be tested. The most selective expression (that is, the one with the smallest selectivity) will retrieve the smallest number of tuples (rows). Thus, that expression should be the basis for the first operation in a query. Conjunctive selections can be thought of as the “intersection” conditions. Conversely, disjunctive selections are the “union” conditions. Order has no effect among the disjunctive conditions.

Certain query optimizers, such as System R, do not process all possible join orders. Rather, they restrict the search to certain types of join orders that are known to produce more efficient execution. For example, multiway joins might be ordered so that the conditions that generate the fewest possible results are performed first. Similarly, the System R optimizer considers only those join orders in which the right operand of each join is one of the initial relations. Such join orders are called left-deep join orders. Left-deep join orders are particularly convenient for pipeline execution, because the right operand is normally a relation (versus an intermediate relation), and thus, only one input to each join is pipelined. The use of pipelining is a key element of the optimizer and execution engine for the database-experiment project.
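For illustration, a left-deep order over four relations \(r_1, r_2, r_3, r_4\) has the shape

\[ ((r_1 \bowtie r_2) \bowtie r_3) \bowtie r_4 , \]

where every right operand is a base relation and the left input of each join can be consumed as a pipelined stream, whereas a bushy order such as \((r_1 \bowtie r_2) \bowtie (r_3 \bowtie r_4)\) feeds an intermediate result into both inputs of the final join and cannot be pipelined in the same way.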

Cost-Based Optimizers

A cost-based optimizer generates a range of query-evaluation plans from the given query by using the equivalence rules, and it chooses the one with the lowest cost based on the metrics (or statistics) gathered about the relations and operations needed to execute the query. For a complex query, many equivalent plans are possible.

The goal of cost-based optimization is to arrange the query execution and table access utilizing indexes and statistics gathered from past queries. Systems such as Microsoft SQL Server and Oracle use cost-based optimizers.

The portion of the database system responsible for acquiring and processing statistics (and many other utility functions) is called the database catalog. The catalog maintains statistics about the referenced relations and the access paths available on each of them. These will be used later in access-path selection to select the most efficient plan (with the lowest cost). For example, System R maintains the following statistics for each table (a rough sketch of a structure that could hold these values appears after the list):

· The cardinality of each relation

· The number of pages in the segment that hold tuples of each relation

· The fraction of data pages in the segment that hold tuples of the relation (blocking factor, or fill)

· For each index:

· The number of distinct keys in each index

· The number of pages in each index
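The exact layout varies by system. The following C++ sketch is only a rough illustration of how a catalog entry might record the per-table and per-index statistics listed above; all names are illustrative and are not taken from System R or MySQL.

#include <string>
#include <vector>

/* Hypothetical catalog entry for one index. */
struct IndexStats
{
  std::string name;
  long distinct_keys;   // number of distinct keys in the index
  long pages;           // number of pages in the index
};

/* Hypothetical catalog entry for one relation. */
struct TableStats
{
  std::string table_name;
  long cardinality;     // number of tuples in the relation
  long pages;           // pages in the segment that hold its tuples
  double fill;          // fraction of segment pages holding its tuples
  std::vector<IndexStats> indexes;
};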

These statistics come from several sources within the system. The statistics are created when a relation is loaded and when an index is created. They are then updated periodically by a user command,5 which can be run by any user. System R does not update these statistics in real time because of the extra database operations and the locking bottleneck that this would create at the system catalogs. Dynamic updating of statistics would tend to serialize accesses that modify the relation contents and thus limit the ability of the system to process simultaneous queries in a multiuser environment.

The use of statistics in cost-based optimization is not very complex. Most database professionals interviewed seem to think that the gathering and application of statistics is a complex and vital element of query optimization. Although cost-based query optimization and even hybrid optimization schemes use statistics for cost and/or ranking, the optimization schemes are neither complex nor critical. Take, for instance, the concept of evenly distributed values among attributes. This concept alone is proof of the imprecise nature of the application of statistics. Statistical calculations are largely categorical in nature, and they are not designed to generate a precise value. They merely assist in determining whether one query execution plan is usually more costly than another.

Frequency distribution of attribute values is a common method for predicting the size of query results. By forming a distribution of possible (or actual6) values of an attribute, the database system can use the distribution to calculate a cost for a given query plan by predicting the number of tuples (or rows) that the plan must process. Modern database systems, however, deal with frequency distributions of individual attributes only, because considering all possible combinations of attributes is very expensive. This essentially corresponds to what is known as the attribute value independence assumption, and although it rarely is true, it is adopted by almost all relational database systems.

Gathering the distribution data requires either constant updating of the statistics or predictive analysis of the data. Another tactic is the use of uniform distributions, in which the distribution of the attribute values is assumed to be equal for all distinct values. For example, given 5,000 tuples and a possible 50 values for a given attribute, the uniform distribution assumes that each value is represented 100 times. This is rarely the case in practice; given the absence of any other statistics, however, it is still a reasonable approximation of reality in many cases.

The memory requirements and running time of dynamic programming grow exponentially with query size (i.e., number of joins) in the worst case, since all viable partial plans generated in each step must be stored to be used in the next one. In fact, many modern systems place a limit on the size of queries that can be submitted (usually around 15 joins), because for larger queries, the optimizer crashes due to very high memory requirements. Nevertheless, most queries seen in practice involve fewer than ten joins, and the algorithm has proved to be effective in such contexts. It is considered the standard in query-optimization search strategies. Some of the statistics gathered about rows (or tuples) in tables (or relations) for use in cost-based optimizers include the following (a sketch of how they might feed a simple cost estimate appears after the list):

· The number of tuples in the table

· The number of blocks containing rows (the block count)

· The size of the row in bytes

· The number of distinct values for each attribute (or column)

· The selection cardinality of each attribute (sometimes represented as evenly distributed)

· The fan-out of internal nodes of an index (the number of children, or subtrees, of each internal node)

· The height of the B-tree for an index

· The number of blocks at the leaf level of the index
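As a rough sketch only (these are simplified, textbook-style estimates under the uniform-distribution assumption, not the formulas of any particular system), the statistics above might be combined as follows to compare a full table scan against an index lookup for an equality predicate:

/* Simplified cost sketch using the statistics listed above.
   Costs are counted in block reads; all names are illustrative. */
struct ColumnStats
{
  long table_rows;       // number of tuples in the table
  long table_blocks;     // number of blocks containing rows
  long distinct_values;  // distinct values for the attribute
  long btree_height;     // height of the B-tree for the index
  long leaf_blocks;      // blocks at the leaf level of the index
};

/* estimated rows matching "column = constant" assuming an even spread */
long estimated_matches(const ColumnStats &s)
{
  return s.distinct_values ? s.table_rows / s.distinct_values : s.table_rows;
}

/* a full table scan reads every block that contains rows */
long full_scan_cost(const ColumnStats &s)
{
  return s.table_blocks;
}

/* an index lookup descends the B-tree and then fetches roughly one
   block per matching row (worst case for a nonclustered index) */
long index_lookup_cost(const ColumnStats &s)
{
  return s.btree_height + estimated_matches(s);
}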

The cost of writing the final result of an operation back to disk is ignored. Regardless of the query-evaluation plan used, this cost does not change; therefore, not including it in the calculations does not affect the choice of the plan.

Most database systems today use a form of dynamic programming to generate all possible query plans. While dynamic programming offers good performance for cost optimization, it is a complex algorithm that can require more resources for the more complex queries. While most database systems do not encounter these types of queries, researchers in the areas of distributed database systems and high-performance computing have explored alternatives and variants to dynamic programming techniques. The recent research by Kossmann and Stocker shows that we are beginning to see the limits of traditional approaches to query optimization.7 What are needed are more efficient optimization techniques that generate execution plans that follow good practices rather than exhaustive exploration. In other words, we need optimizers that perform well in a variety of general environments as well as optimizers that perform well in unique database environments.

Heuristic Optimizers

The goal of heuristic optimization is to apply rules that ensure good practices for query execution. Systems that use heuristic optimizers include INGRES and various academic variants. Most systems typically use heuristic optimization as a means of avoiding the really bad plans rather than as a primary means of optimization.

Heuristic optimizers use rules on how to shape the query into the most optimal form prior to choosing alternative implementations. The application of heuristics, or rules, can eliminate queries that are likely to be inefficient. Using heuristics as a basis to form the query plan ensures that the query plan is most likely (but not always) optimized prior to evaluation. Such heuristics include:

· Performing selection operations as early as possible. It is usually better to perform selections earlier than projections, because they reduce the number of tuples to be sent up the tree.

· Performing projections early.

· Determining which selection operations and join operations produce the smallest result set and using those first (left-most-deep).

· Replacing Cartesian products with join operations.

· Moving projection attributes as far down as possible in the tree.

· Identifying subtrees whose operations can be pipelined.

Heuristic optimizers are not new technologies. Researchers have created rule-based optimizers for various specialized purposes. One example is the Prairie rule-based query optimizer. This rule-based optimizer permits the creation of rules based on a given language notation. Queries are processed using the rules to govern how the optimizer performs. In this case, the Prairie optimizer is primarily a cost-based optimizer that uses rules to tune its behavior.

Aside from examples such as Prairie and early pioneers such as INGRES, no commercial database systems implement a purely heuristic optimizer. Those that do have a heuristic or rule-based optimization step usually implement it as an addition to, or a preprocessing step for, a classic cost-based optimizer.

Semantic Optimizers

The goal of semantic optimization is to form query-execution plans that use the semantics, or topography, of the database and the relationships and indexes within to form queries that ensure the best practice available for executing a query in the given database. Semantic query optimization uses knowledge of the schema (e.g., integrity constraints) for transforming a query into a form that may be answered more efficiently than the original version.

Although not yet implemented in commercial database systems as the primary optimization technique, semantic optimization is currently the focus of considerable research. Semantic optimization operates on the premise that the optimizer has a basic understanding of the actual database schema. When a query is submitted, the optimizer uses its knowledge of system constraints to simplify or to ignore a particular query if it is guaranteed to return an empty result set. This technique holds great promise for providing even more improvements to query-processing efficiency in future relational database systems.

Parametric Optimizers

Ioannidis, in his work on parametric query optimization, describes a query-optimization method that combines the application of heuristic methods with cost-based optimization. The resulting query optimizer provides a means to produce a smaller set of effective query plans from which cost can be estimated, and thus, the lowest-cost plan of the set can be executed.8 Query plans are generated using a randomized algorithm, called sipR. This permits systems that utilize parametric query optimization to account for the uncertainty of parameter changes (such as buffer sizes) and to choose optimal plans either formed on the fly or retrieved from storage.

It is interesting to note that in his work, Ioannidis suggests that the use of dynamic programming algorithms may not be needed, and thus the overhead of using these techniques can be avoided. Furthermore, he found that database systems that use heuristics to prune or shape the query before applying dynamic programming algorithms usually employ an enhanced version of the original System R algorithm. Ioannidis showed that for small queries (approximately up to 10 joins), dynamic programming is superior to randomized algorithms, whereas for large queries, the opposite holds true.

Heuristic Optimization Revisited

The heuristic-optimization process uses a set of rules that have been defined to guarantee good execution plans. Thus, the effectiveness of a heuristic optimizer to produce good plans is based solely on the effectiveness and completeness of its rules.

The following paragraphs describe the rules used to create the DBXP query optimizer. Although these rules are very basic, when they are applied to typical queries, the resulting execution is near optimal, with fast performance and accurate results.

Some basic strategies were used to construct the query tree initially. Specifically, all executions take place in the query-tree node. Restrictions and projections are processed on a branch and do not generate intermediate relations. Joins are always processed as an intersection of two paths. A multiway join would be formed using a series of two-way joins. The following rules represent the best practices for forming a set of heuristics to generate good execution plans. The DBXP optimizer is designed to apply these rules in order to transform the query tree into a form that ensures efficient execution.9

1. Split any nodes that contain a project and join or restrict and join. This step is necessary because some queries specify the join condition in the WHERE clause10 and thus can “fool” the optimizer into forming join nodes that have portions of the expressions that are not part of the join condition.

2. Push all restrictions down the tree to leaves. Expressions are grouped according to their respective relations into individual query-tree nodes. Although some complex expressions cannot be reduced, most can be easily reduced to a single relation. By placing the restrictions at the leaves, the number of resulting tuples that must be passed up the tree is reduced.

3. Place all projections at the lowest point in the tree. Projections should be placed in a node above restrictions, where they will further reduce the amount of data passed through the tree by eliminating unneeded attributes from the resulting tuples. It should be noted that the projections may be modified to include attributes that are needed for operations, such as joins, that reside in the parentage of the projection query-tree node.

4. Place all joins at intersections of projections or restrictions of the relations contained in the join clause.11 This ensures that the fewest tuples are evaluated for the most expensive operation: joins. Intermediate query-tree nodes that order the resulting tuples from the child nodes may be necessary. These intermediate nodes, called utility operations, may sort or group the tuples depending on the type of join, and they can greatly increase the performance of the join.

Note: Other heuristics can be used. The previous list contains those that generate the greatest gain in performance.

Lee, Shih, and Chen give an interesting counterargument to the practice of pushing selections and projections down the tree.12 They suggest that under some conditions, some selections and projections may be more costly than joins. Their argument presents a query optimizer based on graph theory that can more accurately predict the best plan in situations in which complex selections and projections are present. Nevertheless, the general case is that “efficient” execution plans can be constructed for the majority of queries using the rules I’ve listed.
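To see the effect of these heuristics on a concrete (hypothetical) query, consider a statement that returns one column from table a, filters on a.y, and joins a to b in the WHERE clause. Written in relational algebra, the naive tree on the left is transformed into the tree on the right:

\[
\pi_{a.x}\big(\sigma_{a.y = 5 \,\wedge\, a.k = b.k}(a \times b)\big)
\;\Longrightarrow\;
\pi_{a.x}\Big(\pi_{a.x,\,a.k}\big(\sigma_{a.y = 5}(a)\big) \bowtie_{a.k = b.k} \pi_{b.k}(b)\Big)
\]

The restriction on a.y is pushed to the leaf that reads table a, the projections are pushed below the join (retaining the join attributes a.k and b.k even though they do not appear in the final result), and the Cartesian product plus join predicate is replaced by a true join node.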

The DBXP Query Optimizer

Although these rules offer a complete set of operations for forming the optimal query tree, they do not address balancing multiway joins or applying indexes. These steps are considered cost-based optimizations. For this reason, most heuristic optimizers are implemented as a two-phase optimization, with the first pass generating an optimized query path and a second pass applying cost-optimization strategies.

Note: The DBXP optimizer is implemented as a two-pass operation. The first pass rearranges the tree for execution using a heuristic algorithm. The second pass walks the tree, changing the access method for nodes that have relations with indexes available on the attributes being operated on. I leave the implementation of the cost-optimization pass as an exercise for the reader.

Creating comprehensive tests for a heuristic optimizer would require writing SQL statements that cover all possible paths through the optimizer. In essence, you would need a test for all possible queries, both valid and invalid (invalid queries are normally captured in the SQL parser code). Implementing the heuristic optimizer is only the second part of the DBXP engine, however. In the previous chapter, we created the basic query-tree internal representation and stubbed the execution methods. In this chapter, we create the optimizer but will not yet be able to execute the queries. You can continue to use the stubbed execution to test the optimizer; rather than presenting the results of the queries, you can reuse the code from the previous chapter to show the query plan instead.

With this in mind, let’s design a few basic queries that exercise the optimizer to show it is processing the queries. We take care of query execution in the next chapter. Listing 13-1 shows a sample test that exercises the query optimizer.

Listing 13-1. Sample DBXP Query-optimizer Test (Ch13.test)

#
# Sample test to test the DBXP_SELECT optimizer
#

# Test 1:
DBXP_SELECT * FROM staff;

# Test 2:
DBXP_SELECT id FROM staff WHERE staff.id = '123456789';

# Test 3:
DBXP_SELECT id, dir_name FROM staff, directorate
WHERE staff.dno = directorate.dnumber;

# Test 4:
DBXP_SELECT * FROM staff JOIN tasking ON staff.id = tasking.id
WHERE staff.id = '123456789';

Tip: The database used in these examples is included in the Appendix.

You can use this test as a guide and add your own commands to explore the new code. Please refer to Chapter 4 for more details on how to create and run this test using the MySQL Test Suite.

Stubbing the DBXP_SELECT Command

Since there is no query-execution capability, the query commands can be optimized but not executed. The show plan mechanism (the EXPLAIN command) can serve as a means to demonstrate the optimizer. To add this functionality, you can open the sql_dbxp_parse.cc file and alter the DBXP_select_command() method as shown in Listing 13-2.

Listing 13-2. Stubbing the Query Optimizer for Testing

int DBXP_explain_select_command(THD *thd);

/*
Perform DBXP_SELECT Command

SYNOPSIS
DBXP_select_command()
THD *thd IN the current thread

DESCRIPTION
This method executes the SELECT command using the query tree and optimizer.

RETURN VALUE
Success = 0
Failed = 1
*/
int DBXP_select_command(THD *thd)
{
DBUG_ENTER("DBXP_select_command");
DBXP_explain_select_command(thd);
DBUG_RETURN(0);
}

These changes alter the code to call the EXPLAIN command code rather than executing the query. This allows the tests to return a valid result set (the query plan) so that we can test the optimizer without the query execution portion.

Note: I use a function declaration above the DBXP_select_command() method. This allows the code to call forward to the DBXP_explain_select_command() method without using a header file.

There is also a change necessary to the DBXP_explain_select_command() method. You need to add the call to the new optimize methods. This includes the heuristic_optimization() and cost_optimization() methods. I discuss the heuristic optimization in more detail in the following sections. Listing 13-3 shows the modifications to the EXPLAIN code.

Listing 13-3. Modifications to the EXPLAIN Command Code

/*
Perform EXPLAIN command.

SYNOPSIS
DBXP_explain_select_command()
THD *thd IN the current thread

DESCRIPTION
This method executes the EXPLAIN SELECT command.

RETURN VALUE
Success = 0
Failed = 1
*/
int DBXP_explain_select_command(THD *thd)
{
bool res= 0;

DBUG_ENTER("DBXP_explain_select_command");

/* Prepare the tables (check access, locks) */
res = check_table_access(thd, SELECT_ACL, thd->lex->query_tables, 0, 1, 1);
if (res)
DBUG_RETURN(1);
res = open_and_lock_tables(thd, thd->lex->query_tables, 0,
MYSQL_LOCK_IGNORE_TIMEOUT);
if (res)
DBUG_RETURN(1);

/* Create the query tree and optimize it */
Query_tree *qt = build_query_tree(thd, thd->lex,
(TABLE_LIST*) thd->lex->select_lex.table_list.first);
qt->heuristic_optimization();
qt->cost_optimization();

/* create a field list for returning the query plan */
List<Item> field_list;

/* use the protocol class to communicate to client */
Protocol *protocol= thd->protocol;

/* write the field to the client */
field_list.push_back(new Item_empty_string("Execution Path",NAME_LEN));
if (protocol->send_result_set_metadata(&field_list,
Protocol::SEND_NUM_ROWS | Protocol::SEND_EOF))
DBUG_RETURN(TRUE);
protocol->prepare_for_resend();

/* generate the query plan and send it to client */
show_plan(protocol, qt->root, qt->root, false);
my_eof(thd); /* end of file tells client no more data is coming */

/* unlock tables and cleanup memory */
mysql_unlock_read_tables(thd, thd->lock);
delete qt;
DBUG_RETURN(0);
}

Important MySQL Structures and Classes

There are a number of key structures and classes in the MySQL source code. You have already seen many of them in the examples thus far. Some of the more important ones are documented in the MySQL Internals manual. Unfortunately, no document lists them all. The following sections describe some of these structures and classes that you will encounter when working with the DBXP query optimizer (and later, the query execution code). These include the TABLE structure, the Field class, and a few of the common Item iterators (the Item class is discussed in Chapter 3).

TABLE Structure

The most important MySQL structure that you will work with when writing the optimizer is the TABLE structure.

This structure is important, because it contains all pertinent data for a table. It contains everything from a pointer to the appropriate storage-handler class to a list of the fields, keys, and temporary buffers for storing rows while executing the query.

While the structure is immense (like most important structures in MySQL), you will see a few key attributes over and over. Table 13-1 lists some of the more important attributes for the TABLE structure. For a detailed examination of the TABLE structure, see the handler.h file.

Table 13-1. TABLE Structure Overview

file: A reference to the storage-engine object.

field: An array of fields for the table.

fields: The number of fields in the field array.

next: A pointer to the next table in a list of tables.

prev: A pointer to the previous table in a list of tables.

The Field Class

The Field class contains all the attributes and methods for creating, assigning values to, and manipulating fields (or attributes) for the tables in the database. The Field class is defined in the /sql/field.h file and implemented in the /sql/field.cc file. The Field class is actually a base from which several types of fields are derived. These derived classes, named Field_XXX, can be found in several places within the MySQL source code.

Since it is only a base class,13 many methods in the class are intended to be overridden by the derived classes (they’re defined as virtual). Many of the derived classes have the same set of basic attributes and methods, however. Table 13-2 lists the attributes and methods that you will encounter in working with the DBXP source code. For a detailed examination of the Field class, see the field.h file.

Table 13-2. The Field Class

ptr: A pointer to the field within the record buffer.

null_ptr: A pointer to a byte (or bytes) in the record buffer that indicates which attributes can contain NULL.

table_name: The table name associated with this field.

field_name: The attribute name for this field.

field_length: The length of the field; indicates the number of bytes that can be stored.

is_null(): Checks to see if the field is NULL.

move_field(): Changes the pointer of the field in memory to point to a different location.

store(): A series of overloaded methods used to store values into the fields.

val_str(): Gets the value of the field as a string.

val_int(): Gets the value of the field as an integer.

result_type(): Gets the data type of the field.

cmp(): Returns the comparison result of the field with the value passed.
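As a hedged illustration of how the attributes in Tables 13-1 and 13-2 fit together (this is a sketch, not code from the server), a routine that walks the fields of an open table might look like the following. It assumes tbl is a valid TABLE pointer; the listings later in this chapter read the field count from the table share (tbl->s->fields) and the fields themselves from the field array.

#include <cstdio>

/* Sketch: print the name and NULL status of each field in a table. */
void print_fields(TABLE *tbl)
{
  for (uint i = 0; i < tbl->s->fields; i++)
  {
    Field *fld = tbl->field[i];                 /* the field array */
    printf("%s%s\n", fld->field_name,           /* the attribute name */
           fld->is_null() ? " (NULL)" : "");    /* NULL check */
  }
}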

Iterators

There are three types of iterators in the MySQL source code. You have already seen these iterators while working with the code in previous chapters. Iterators are special constructs that make it easy to create and navigate a list of objects, and they are typically presented as either linked lists or arrays. The iterators in MySQL are implemented as template classes, which take as a parameter the type of the data on which the list operates. The MySQL iterators are linked lists, but some behave more like queues and stacks. The following sections describe some of the available iterator classes in MySQL. These iterators are defined in the sql/sql_list.h header file.

template <> class List

The List template class is implemented as a queue or stack with methods for pushing items onto the back of the list using the push_back() method or on the front of the list using the push_front() method. Items can be retrieved using the pop() method or deleted by using the remove() method. You can loop through the list by using the next attribute of the data item, but the list is usually used to form a linked list of items (e.g., List<Item> item_list), and then one of the List_iterator classes is used to loop through the list quickly. This class is derived from the base_list class (also defined in /sql/sql_list.h).

template <> class List_iterator

The List_iterator class is implemented as a linked list with methods for moving through the list using the overloaded ++ operator. Items can be retrieved using the ref() method or deleted by using the remove() method. The list can be restarted from the front by issuing the rewind() method. This class is derived from the base_list class (also defined in /sql/sql_list.h).

template <> class List_iterator_fast

The List_iterator_fast class is essentially the same as the List_iterator class, but it is optimized for fast-forward traversal. It is implemented as a linked list with methods for moving through the list using the overloaded ++ operator. Items can be retrieved using the ref() method or deleted by using the remove() method. This class is derived from the base_list class (also defined in /sql/sql_list.h).

Examples

Using the iterators is easy. If you want to use a list to manipulate items, a simple list, such as List<Item_field>, would be the best choice. If you want to loop through a list of fields quickly, you can create a list iterator as either a List_iterator<Item_field> or a List_iterator_fast<Item_field>. Examples of loop structures are shown in Listing 13-4.

Listing 13-4. Example Iterators

/* create a list and populate with some items */
List<Item> item_list;
item_list.push_back(new Item_int((int32)
join->select_lex->select_number));
item_list.push_back(new Item_string(join->select_lex->type,
strlen(join->select_lex->type), cs));
item_list.push_back(new Item_string(message,strlen(message),cs));

/* start a basic list iterator to iterate through the item_list */
Item *curr_item;
List_iterator<Item> item_list_it(item_list);

/* control the iteration using an offset */
while ((curr_item= item_list_it++))
{
/* do something */
}

/* start a fast list iterator to iterate through a list of fields */
Item *item;
List_iterator_fast<Item_field> li(item_equal->fields);

/* control the iteration using an offset */
while ((item= li++))
{
/* do something */
}

The DBXP Helper Classes

I mentioned in Chapter 11 two additional classes used in the DBXP engine (Attribute and Expression). These classes are designed to make the optimizer easier to code and easier to understand. They are encapsulations of the existing MySQL classes (and structures), and they reuse many of the methods available in the MySQL code.

The first helper class is a class that encapsulates the attributes used in a query. These attributes are represented in the MySQL code as the Item class. The helper class, named Attribute, makes access to these classes easier by providing a common interface for accessing items. Listing 13-5 shows the header file for the Attribute class.

Listing 13-5. The Attribute Class Header

#include "sql_priv.h"
#include "sql_class.h"
#include "table.h"

class Attribute
{
public:
Attribute(void);
int remove_attribute(int num);
Item *get_attribute(int num);
int add_attribute(bool append, Item *new_item);
int num_attributes();
int index_of(char *table, char *value);
int hide_attribute(Item *item, bool hide);
char *to_string();
private:
List<Item> attr_list;
bool hidden[256];
};
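As a quick, hedged illustration of the interface above (not code from the DBXP sources), the following sketch collects two Item pointers and walks the resulting list. The first_field and second_field pointers are assumed to have been obtained elsewhere, for example from find_field_in_table_ref() as in Listing 13-9.

/* Sketch: collect and inspect attributes (first_field and
   second_field are assumed Item pointers from the parsed query). */
Attribute *attrs = new Attribute();
attrs->add_attribute(true, first_field);   /* append to the list */
attrs->add_attribute(true, second_field);

for (int i = 0; i < attrs->num_attributes(); i++)
{
  Item *item = attrs->get_attribute(i);
  /* ... use the item; hide_attribute() could mark it as needed
     only for a join higher in the tree ... */
}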

The second helper class encapsulates the expressions used in a query. Like the attributes, the expressions are represented in the MySQL code as Item class instances. The helper class, named Expression, provides a common (and simplified) interface to the Item classes. Listing 13-6 shows the header file for the Expression class.

Listing 13-6. The Expression Class Header

#include "sql_priv.h"
#include "sql_class.h"
#include "table.h"
#include <sql_string.h>

struct expr_node
{
Item *left_op;
Item *operation;
Item *right_op;
Item *junction;
expr_node *next;
};

class Expression
{
public:
Expression(void);
int remove_expression(int num, bool free);
expr_node *get_expression(int num);
int add_expression(bool append, expr_node *new_item);
int num_expressions();
int index_of(char *table, char *value);
int reduce_expressions(TABLE *table);
bool has_table(char *table);
int convert(THD *thd, Item *mysql_expr);
char *to_string();
bool evaluate(TABLE *table1);
int compare_join(expr_node *expr, TABLE *t1, TABLE *t2);
int get_join_expr(Expression *where_expr);
private:
expr_node *root;
Field *find_field(TABLE *tbl, char *name);
bool compare(expr_node *expr, TABLE *t1);
int num_expr;
};

I use a structure to contain the expressions in the form of left operand, operator, and right operand. This is a simpler approach than the expression tree that the MySQL classes represent, making it easier to read the optimizer code. The simpler approach also makes it easier to evaluate the conditions in an interactive debugger.
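To make the structure concrete, here is a hedged sketch of how an Expression object is typically filled and queried (the thd and lex pointers are assumed to come from the parsed query, as in Listing 13-9):

/* Sketch: convert the MySQL WHERE clause into the simplified
   expr_node list and inspect the first predicate. */
Expression *where_expr = new Expression();
where_expr->convert(thd, lex->select_lex.where);

if (where_expr->num_expressions() > 0)
{
  expr_node *node = where_expr->get_expression(0);
  /* node->left_op, node->operation, and node->right_op now hold
     the pieces of the first simple predicate */
}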

Note: I omit some of the details of these helper classes in the text, because they are very simple abstractions of calling the MySQL methods for the TABLE structure and the Item and Field classes. The files are included in the online chapter source code, however. The source code for this book is available for download at http://www.apress.com in the Source Code section.

These helper classes and their header files should be placed in the /sql directory and added to your CMakeLists.txt file. I show you how to do that in the “Compiling and Testing the Code” section.

Modifications to the Existing Code

There is one other minor modification necessary to implement the optimizer: we need to add the code to use the new Attribute and Expression classes. Open the query_tree.h header file and make the changes shown in Listing 13-7. As you can see, I’ve changed the where_expr and join_expr attributes to use the new Expression class. Likewise, I changed the attributes attribute to use the new Attribute class.

Listing 13-7. Changes to the Query Tree Class

#include "attribute.h"
#include "expression.h"
#include "sql_priv.h"
#include "sql_class.h"
#include "table.h"
#include "records.h"

const int MAXNODETABLES = 4;
const int LEFTCHILD = 0;
const int RIGHTCHILD = 1;

class Query_tree
{
public:
enum query_node_type //this enumeration lists the available
{ //query node (operations)
qntUndefined = 0,
qntRestrict = 1,
qntProject = 2,
qntJoin = 3,
qntSort = 4,
qntDistinct = 5
};

enum join_con_type //this enumeration lists the available
{ //join operations supported
jcUN = 0,
jcNA = 1,
jcON = 2,
jcUS = 3
};

enum type_join //this enumeration lists the available
{ //join types supported.
jnUNKNOWN = 0, //undefined
jnINNER = 1,
jnLEFTOUTER = 2,
jnRIGHTOUTER = 3,
jnFULLOUTER = 4,
jnCROSSPRODUCT = 5,
jnUNION = 6,
jnINTERSECT = 7
};

enum AggregateType //used to add aggregate functions
{
atNONE = 0,
atCOUNT = 1
};

/*
STRUCTURE query_node

DESCRIPTION
This structure contains all of the data for a query node:

NodeId -- the internal id number for a node
ParentNodeId -- the internal id for the parent node (used for insert)
SubQuery -- is this the start of a subquery?
Child -- is this a Left or Right child of the parent?
NodeType -- synonymous with operation type
JoinType -- if a join, this is the join operation
join_con_type -- if this is a join, this is the "on" condition
Expressions -- the expressions from the "where" clause for this node
Join Expressions -- the join expressions from the "join" clause(s)
Relations[] -- the relations for this operation (at most 2)
PreemptPipeline -- does the pipeline need to be halted for a sort?
Fields -- the attributes for the result set of this operation
Left -- a pointer to the left child node
Right -- a pointer to the right child node
*/
struct query_node
{
query_node();
~query_node();
int nodeid;
int parent_nodeid;
bool sub_query;
int child;
query_node_type node_type;
type_join join_type;
join_con_type join_cond;
Expression *where_expr;
Expression *join_expr;
TABLE_LIST *relations[MAXNODETABLES];
int eof[MAXNODETABLES];
int ndx[MAXNODETABLES];
bool preempt_pipeline;
Attribute *attributes;
query_node *left;
query_node *right;
};

struct record_buff
{
uchar *field_ptr;
long field_length;
record_buff *next;
record_buff *prev;
READ_RECORD *record;
};

A number of methods also need to be added to the query-tree class. Instead of describing the details of every method and its implementation, I have included the rest of the query-tree definition in Listing 13-8. This code is also added to the query_tree.h file.

Listing 13-8. New Methods for the Query-tree Class

query_node *root; //The ROOT node of the tree

Query_tree(void);
~Query_tree(void);
int init_node(query_node *qn);
int heuristic_optimization();
int cost_optimization();
int insert_attribute(query_node *qn, Item *c);
bool distinct;
int prepare(query_node *qn);
int cleanup(query_node *qn);
bool Eof(query_node *qn);
READ_RECORD *get_next(query_node *qn);
List <Item> result_fields;

private:
bool h_opt; //has query been optimized (rules)?
bool c_opt; //has query been optimized (cost)?
READ_RECORD *lbuff;
READ_RECORD *rbuff;
record_buff *left_record_buff;
record_buff *right_record_buff;
record_buff *left_record_buffer_ptr;
record_buff *right_record_buffer_ptr;

int push_projections(query_node *qn, query_node *pNode);
query_node *find_projection(query_node *qn);
bool is_leaf(query_node *qn);
bool has_relation(query_node *qn, char *Table);
bool has_attribute(query_node *qn, Item *a);
int del_attribute(query_node *qn, Item *a);
int push_restrictions(query_node *qn, query_node *pNode);
query_node *find_restriction(query_node *qn);
query_node *find_join(query_node *qn);
int push_joins(query_node *qn, query_node *pNode);
int prune_tree(query_node *prev, query_node *cur_node);
int balance_joins(query_node *qn);
int split_restrict_with_project(query_node *qn);
int split_restrict_with_join(query_node *qn);
int split_project_with_join(query_node *qn);
bool find_table_in_tree(query_node *qn, char *tbl);
bool find_table_in_expr(Expression *expr, char *tbl);
bool find_attr_in_expr(Expression *expr, char *tbl, char *value);
int apply_indexes(query_node *qn);
bool do_restrict(query_node *qn, READ_RECORD *t);
READ_RECORD *do_project(query_node *qn, READ_RECORD *t);
READ_RECORD *do_join(query_node *qn);
int find_index_in_expr(Expression *e, char *tbl);
TABLE *get_table(query_node *qn);
int insertion_sort(bool left, Field *field, READ_RECORD *rcd);
int check_rollback(record_buff *cur_left, record_buff *curr_left_prev,
record_buff *cur_right, record_buff *cur_right_prev);
};

Notice that there are several public methods, including heuristic_optimization() and cost_optimization(). I have also added a public attribute, distinct, that you can use to assist in implementing the distinct operation (see the exercises at the end of the chapter). The rest of the methods are the helper methods for the optimization code. I explain some of the more interesting ones and leave the mundane for you to explore.

Now that we have some helper classes to make the optimizer easier to implement, we need to incorporate them into the translation code that translates the MySQL internal-query representation to the DBXP query tree. Open the sql_dbxp_parse.cc file and locate the build_query_tree() method. Listing 13-9 shows the changes necessary to add the new Attribute and Expression classes.

Listing 13-9. Changes to the Build-query-tree Method


/*
Build Query Tree

SYNOPSIS
build_query_tree()
THD *thd IN the current thread
LEX *lex IN the pointer to the current parsed structure
TABLE_LIST *tables IN the list of tables identified in the query

DESCRIPTION
This method returns a converted MySQL internal representation (IR) of a
query as a query_tree.

RETURN VALUE
Success = Query_tree * -- the root of the new query tree.
Failed = NULL
*/
Query_tree *build_query_tree(THD *thd, LEX *lex, TABLE_LIST *tables)
{
DBUG_ENTER("build_query_tree");
Query_tree *qt = new Query_tree();
Query_tree::query_node *qn =
(Query_tree::query_node *)my_malloc(sizeof(Query_tree::query_node),
MYF(MY_ZEROFILL | MY_WME));
TABLE_LIST *table;
int i = 0;
Item *w;
int num_tables = 0;

/* create a new restrict node */
qn->parent_nodeid = -1;
qn->child = false;
qn->join_type = (Query_tree::type_join) 0;
qn->nodeid = 0;
qn->node_type = (Query_tree::query_node_type) 2;
qn->left = NULL;
qn->right = NULL;
qn->attributes = new Attribute();
qn->where_expr = new Expression();
qn->join_expr = new Expression();

/* Get the tables (relations) */
i = 0;
for(table = tables; table; table = table->next_local)
{
num_tables++;
qn->relations[i] = table;
i++;
}

/* prepare the fields (find associated tables) for query */
List <Item> all_fields;
Name_resolution_context context;
List_iterator <Item> it(thd->lex->select_lex.item_list);
it++;
if (lex->select_lex.with_wild)
{
bool found = FALSE;
Field_iterator_table_ref field_iterator;
for(table = tables; table; table = table->next_local)
{
field_iterator.set(table);
for (; !field_iterator.end_of_fields(); field_iterator.next())
{
Item *item= field_iterator.create_item(thd);
if (!found)
{
found= TRUE;
it.replace(item); /* Replace '*' with the first found item. */
}
else
{
it.after(item); /* Add 'item' to the SELECT list. */
}
}
}
}
if (setup_fields(thd, lex->select_lex.ref_pointer_array,
lex->select_lex.item_list, thd->mark_used_columns,
&all_fields, 1))
DBUG_RETURN(NULL);
qt->result_fields = lex->select_lex.item_list;

/* get the attributes from the raw query */
w = lex->select_lex.item_list.pop();
while (w != 0)
{
uint unused_field_idx= NO_CACHED_FIELD_INDEX;
TABLE_LIST *dummy;
Field *f = NULL;
for(table = tables; table; table = table->next_local)
{
f = find_field_in_table_ref(thd, table, ((Field *)w)->field_name,
strlen(((Field *)w)->field_name),
((Field *)w)->field_name, NULL, NULL, NULL,
FALSE, FALSE, &unused_field_idx, FALSE,
&dummy);
if (f)
{
qn->attributes->add_attribute(true, (Item *)f);
break;
}
}
w = lex->select_lex.item_list.pop();
}

/* get the joins from the raw query */
if (num_tables > 0) //indicates more than 1 table processed
for(table = tables; table; table = table->next_local)
{
if (table->join_cond() != 0)
qn->join_expr->convert(thd, (Item *)table->join_cond());
}

/* get the expressions for the where clause */
qn->where_expr->convert(thd, lex->select_lex.where);

/* get the join conditions for the joins */
qn->join_expr->get_join_expr(qn->where_expr);

/* if there is a where clause, set node to restrict */
if (qn->where_expr->num_expressions() > 0)
qn->node_type = (Query_tree::query_node_type) 1;

qt->root = qn;
DBUG_RETURN(qt);
}

At this point, the include files need adjusting to ensure that we’re including everything needed to compile the code. The query_tree.h header file includes the attribute and expression header files as well as the needed MySQL include files (as shown in Listing 13-7). The following statements appear at the top of the query_tree.cc file:

#include "query_tree.h"
#include "sql_base.h"
#include "sql_acl.h"
#include "sql_parse.h"
#include "lock.h"

Caution: If you encounter strange errors while compiling, check that you are not including the attribute, expression, and query_tree header files in your CMakeLists.txt file. The compiler will automatically include these files by following the include directives.

Details of the Heuristic Optimizer

The heuristic optimizer is implemented using the model of the rules described earlier. The methods used in the heuristic optimizer each implement some or all of the rules. These methods are listed in Table 13-3.

Table 13-3. Heuristic Methods in the Heuristic Optimizer

split_restrict_with_join(): Searches the tree for nodes that have a restriction (has expressions) and a join expression. It divides the node into two nodes: one for the restriction and one for the join.

split_project_with_join(): Searches the tree for nodes that have a projection (has attributes) and a join expression. It divides the node into two nodes: one for the projection and one for the join.

split_restrict_with_project(): Searches the tree for nodes that have a restriction (has expressions) and a projection (has attributes). It divides the node into two nodes: one for the restriction and one for the projection.

find_restriction(): Searches the tree for a restriction node that is not already at a leaf.

push_restrictions(): Pushes the restrictions down the tree to the lowest node possible. It looks for situations in which the restriction can reside at a leaf. This method is used with find_restriction() in a loop (the loop ends when no more restrictions are found that are not already at a leaf).

find_projection(): Searches the tree for a projection node that is not already at a leaf.

push_projections(): Pushes the projections down the tree to the lowest node possible. It looks for situations in which the projection can reside at a leaf or as a parent of a restriction. This method is used with find_projection() in a loop (the loop ends when no more projections are found that are not already at a leaf or the parent of a leaf that is a restriction).

find_join(): Searches the tree for a join node.

push_joins(): Pushes the joins down the tree to the nodes that are parents of qualifying restrictions and/or projections (those that operate on the tables in the join).

prune_tree(): Identifies nodes in the tree that have been optimized away and are no longer valid (no attributes or expressions and not a join or sort), and deletes them.

The implementation of the heuristic optimizer reads very easily. Listing 13-10 shows the source-code implementation for the heuristic_optimization() method.

Listing 13-10. The DBXP Heuristic-optimization Method14

/*
Perform heuristic optimization

SYNOPSIS
heuristic_optimization()

DESCRIPTION
This method performs heuristic optimization on the query tree. The
operation is destructive in that it rearranges the original tree.

RETURN VALUE
Success = 0
Failed = 1
*/
int Query_tree::heuristic_optimization()
{
DBUG_ENTER("heuristic_optimization");
query_node *pNode;
query_node *nNode;

h_opt = true;
/*
First, we have to correct the situation where restrict and
project are grouped together in the same node.
*/
split_restrict_with_join(root);
split_project_with_join(root);
split_restrict_with_project(root);

/*
Find a node with restrictions and push down the tree using
a recursive call. continue until you get the same node twice.
This means that the node cannot be pushed down any further.
*/
pNode = find_restriction(root);
while(pNode != 0)
{
push_restrictions(root, pNode);
nNode = find_restriction(root);
/*
If a node is found, save a reference to it unless it is
either the same node as the last node found or
it is a leaf node. This is done so that we can ensure we
continue searching down the tree visiting each node once.
*/
if(nNode != 0)
{
if(nNode->nodeid == pNode->nodeid)
pNode = 0;
else if(is_leaf(nNode))
pNode = 0;
else
pNode = nNode;
}
}

/*
Find a node with projections and push down the tree using
a recursive call. Continue until you get the same node twice.
This means that the node cannot be pushed down any further.
*/
pNode = find_projection(root);
while(pNode != 0)
{
push_projections(root, pNode);
nNode = find_projection(root);
/*
If a node is found, save a reference to it unless it is
either the same node as the last node found or
it is a leaf node. This is done so that we can ensure we
continue searching down the tree visiting each node once.
*/
if(nNode != 0)
{
if(nNode->nodeid == pNode->nodeid)
pNode = 0;
else if(is_leaf(nNode))
pNode = 0;
else
pNode = nNode;
}
}

/*
Find a join node and push it down the tree using
a recursive call. Continue until you get the same node twice.
This means that the node cannot be pushed down any further.
*/
pNode = find_join(root);
while(pNode != 0)
{
push_joins(root, pNode);
nNode = find_join(root);
/*
If a node is found, save a reference to it unless it is
either the same node as the last node found or
it is a leaf node. This is done so that we can ensure we
continue searching down the tree visiting each node once.
*/
if(nNode != 0)
{
if(nNode->nodeid == pNode->nodeid)
pNode = 0;
else if(is_leaf(nNode))
pNode = 0;
else
pNode = nNode;
}
else
pNode = nNode;
}

/*
Prune the tree of "blank" nodes
Blank Nodes are:
1) projections without attributes that have at least 1 child
2) restrictions without expressions
BUT...Can't delete a node that has TWO children!
*/
prune_tree(0, root);

/*
Lastly, check to see if this has the DISTINCT option.
If so, create a new node that is a DISTINCT operation.
*/
if(distinct && (root->node_type != qntDistinct))
{
int i;
pNode = (query_node*)my_malloc(sizeof(query_node),
MYF(MY_ZEROFILL | MY_WME));
init_node(pNode);
pNode->sub_query = 0;
pNode->attributes = 0;
pNode->join_cond = jcUN; /* (join_con_type) 0; */
pNode->join_type = jnUNKNOWN; /* (type_join) 0; */
pNode->left = root;
pNode->right = 0;
for(i = 0; i < MAXNODETABLES; i++)
pNode->relations[i] = NULL;
pNode->nodeid = 90125; // sentinel value to indicate node is not set
pNode->child = LEFTCHILD;
root->parent_nodeid = 90125; // sentinel value to indicate node is not set
root->child = LEFTCHILD;
pNode->parent_nodeid = -1;
pNode->node_type = qntDistinct;
pNode->attributes = new Attribute();
pNode->where_expr = new Expression();
pNode->join_expr = new Expression();
root = pNode;
}
DBUG_RETURN(0);
}

Notice the loops for locating restrictions, projections, and joins. The code is designed to walk through the tree using a preorder traversal, applying the rules until there are no more conditions that violate the rules (i.e., no “bad” node placements).

The following listings show some of the source code for the major methods in the heuristic_optimization() method as described earlier. To save space, I omitted listing the lesser helper methods, because they are simple abstractions of the MySQL structure and class methods. You should download the source code for this chapter and examine the other helper methods to see how they work.

The split_restrict_with_join() method searches the tree for joins that have where expressions (thus are both joins and restrictions) and breaks them into two nodes: a join and a restrict node. Listing 13-11 shows the source code for this method.

Listing 13-11. Split Restrict With Join

/*
Split restrictions that have joins.

SYNOPSIS
split_restrict_with_join()
query_node *QN IN the node to operate on

DESCRIPTION
This method looks for joins that have where expressions (thus are both
joins and restrictions) and breaks them into two nodes.

NOTES
This is a RECURSIVE method!

RETURN VALUE
Success = 0
Failed = 1
*/
int Query_tree::split_restrict_with_join(query_node *QN)
{
int j = 0;
int i = 0;

DBUG_ENTER("split_restrict_with_join");
if(QN != 0)
{
if(((QN->join_expr->num_expressions() > 0) &&
(QN->where_expr->num_expressions() > 0)) &&
((QN->node_type == qntJoin) || (QN->node_type == qntRestrict)))
{
bool isleft = true;
/*
Create a new node and:
1) Move the where expressions to the new node.
2) Set the new node's children = current node children
3) Set the new node's relations = current node relations.
4) Set current node's left or right child = new node;
5) Set new node's id = current id + 200;
6) set parent id, etc.
7) determine which table needs to be used for the
restrict node.
*/
query_node *new_node = (query_node*)my_malloc(sizeof(query_node),
MYF(MY_ZEROFILL | MY_WME));
init_node(new_node);
new_node->node_type = qntRestrict;
new_node->parent_nodeid = QN->nodeid;
new_node->nodeid = QN->nodeid + 200;
new_node->where_expr = QN->where_expr;
new_node->join_expr = new Expression();
QN->where_expr = new Expression();

/*
Loop through tables and move table that matches
to the new node
*/
for(i = 0; i < MAXNODETABLES; i++)
{
if (QN->relations[i] != NULL)
{
if (find_table_in_expr(new_node->where_expr,
QN->relations[i]->table_name))
{
new_node->relations[j] = QN->relations[i];
j++;
if (i != 0)
isleft = false;
QN->relations[i] = NULL;
}
}
}

/* set children to point to balance of tree */
new_node->right = 0;
if (isleft)
{
new_node->child = LEFTCHILD;
new_node->left = QN->left;
QN->left = new_node;
}
else
{
new_node->child = RIGHTCHILD;
new_node->left = QN->right;
QN->right = new_node;
}
if (new_node->left)
new_node->left->parent_nodeid = new_node->nodeid;
j = QN->attributes->num_attributes();
if ((QN->node_type == qntJoin) && (j > 0))
{
Attribute *attribs = 0;
Item * attr;
int ii = 0;
int jj = 0;
if ((QN->attributes->num_attributes() == 1) &&
(strcasecmp("*",
((Field *)QN->attributes->get_attribute(0))->field_name) == 0))
{
new_node->attributes = new Attribute();
new_node->attributes->add_attribute(j,
QN->attributes->get_attribute(0));
}
else
{
attribs = new Attribute();
for (i = 0; i < (int)new_node->relations[0]->table->s->fields; i++)
{
Item *f = (Item *)new_node->relations[0]->table->field[i];
attribs->add_attribute(true, (Item *)f);
}
j = attribs->num_attributes();
new_node->attributes = new Attribute();
for (i = 0; i < j; i++)
{
attr = attribs->get_attribute(i);
jj = QN->attributes->index_of(
(char *)((Field *)attr)->table->s->table_name.str,
(char *)((Field *)attr)->field_name);
if (jj > -1)
{
new_node->attributes->add_attribute(ii, attr);
ii++;
QN->attributes->remove_attribute(jj);
}
else if (find_attr_in_expr(QN->join_expr,
(char *)((Field *)attr)->table->s->table_name.str,
(char *)((Field *)attr)->field_name))
{
new_node->attributes->add_attribute(ii, attr);
new_node->attributes->hide_attribute(attr, true);
ii++;
}
}
}
}
else
{
QN->node_type = qntJoin;
QN->join_type = jnINNER;
new_node->attributes = new Attribute();
}
}
split_restrict_with_join(QN->left);
split_restrict_with_join(QN->right);
}
DBUG_RETURN(0);
}

The split_project_with_join() method searches the tree for joins that have attributes (and thus are both joins and projections) and breaks them into two nodes: a join and a project node. Listing 13-12 shows the source code for this method.

Listing 13-12. Split Project With Join

/*
Split projections that have joins.

SYNOPSIS
split_project_with_join()
query_node *QN IN the node to operate on

DESCRIPTION
This method looks for joins that have attributes (thus are both
joins and projections) and breaks them into two nodes.

NOTES
This is a RECURSIVE method!

RETURN VALUE
Success = 0
Failed = 1
*/
int Query_tree::split_project_with_join(query_node *QN)
{
int j = 0;
int i;

DBUG_ENTER("split_project_with_join");
if(QN != 0)
{
if((QN->join_expr->num_expressions() > 0) &&
((QN->node_type == qntJoin) || (QN->node_type == qntProject)))
{
/*
Create a new node and:
1) Move the where expressions to the new node.
2) Set the new node's children = current node children
3) Set the new node's relations = current node relations.
4) Set current node's left or right child = new node;
5) Set new node's id = current id + 300;
6) set parent id, etc.
*/
QN->node_type = qntJoin;
QN->join_type = jnINNER;
if (QN->left == 0)
{
query_node *new_node = (query_node*)my_malloc(sizeof(query_node),
MYF(MY_ZEROFILL | MY_WME));
init_node(new_node);
new_node->node_type = qntProject;
new_node->parent_nodeid = QN->nodeid;
new_node->nodeid = QN->nodeid + 300;
for(i = 0; i < MAXNODETABLES; i++)
new_node->relations[i] = 0;
new_node->relations[0] = QN->relations[0];
QN->relations[0] = 0;
new_node->left = QN->left;
QN->left = new_node;
new_node->right = 0;
new_node->child = LEFTCHILD;
if (new_node->left != 0)
new_node->left->parent_nodeid = new_node->nodeid;
j = QN->attributes->num_attributes();
new_node->attributes = new Attribute();
new_node->where_expr = new Expression();
new_node->join_expr = new Expression();
if ((j == 1) &&
(strcasecmp("*", ((Field *)QN->attributes->get_attribute(0))->field_name)==0))
{
new_node->attributes = new Attribute();
new_node->attributes->add_attribute(j, QN->attributes->get_attribute(0));
if (QN->right != 0)
QN->attributes->remove_attribute(0);
}
else if (j > 0)
{
Attribute *attribs = 0;
Item * attr;
int ii = 0;
int jj = 0;
attribs = new Attribute();
for (i = 0; i < (int)new_node->relations[0]->table->s->fields; i++)
{
Field *f = new_node->relations[0]->table->field[i];
attribs->add_attribute(true, (Item *)f);
}
j = attribs->num_attributes();
for (i = 0; i < j; i++)
{
attr = attribs->get_attribute(i);
jj = QN->attributes->index_of(
(char *)((Field *)attr)->table->s->table_name.str,
(char *)((Field *)attr)->field_name);
if (jj > -1)
{
new_node->attributes->add_attribute(ii, attr);
ii++;
QN->attributes->remove_attribute(jj);
}
else if (find_attr_in_expr(QN->join_expr,
(char *)((Field *)attr)->table->s->table_name.str,
(char *)((Field *)attr)->field_name))
{
new_node->attributes->add_attribute(ii, attr);
new_node->attributes->hide_attribute(attr, true);
ii++;
}
}
}
}
if (QN->right == 0)
{
query_node *new_node = (query_node*)my_malloc(sizeof(query_node),
MYF(MY_ZEROFILL | MY_WME));
init_node(new_node);
new_node->node_type = qntProject;
new_node->parent_nodeid = QN->nodeid;
new_node->nodeid = QN->nodeid + 400;
for(i = 0; i < MAXNODETABLES; i++)
new_node->relations[i] = 0;
new_node->relations[0] = QN->relations[1];
QN->relations[1] = 0;
new_node->left = QN->right;
QN->right = new_node;
new_node->right = 0;
new_node->child = RIGHTCHILD;
if (new_node->left != 0)
new_node->left->parent_nodeid = new_node->nodeid;
j = QN->attributes->num_attributes();
new_node->attributes = new Attribute();
new_node->where_expr = new Expression();
new_node->join_expr = new Expression();
if ((j == 1) &&
(strcasecmp("*", ((Field *)QN->attributes->get_attribute(0))->field_name)==0))
{
new_node->attributes = new Attribute();
new_node->attributes->add_attribute(j, QN->attributes->get_attribute(0));
QN->attributes->remove_attribute(0);
}
else
{
Attribute *attribs = 0;
Item * attr;
int ii = 0;
int jj = 0;
attribs = new Attribute();
for (i = 0; i < (int)new_node->relations[0]->table->s->fields; i++)
{
Field *f = new_node->relations[0]->table->field[i];
attribs->add_attribute(true, (Item *)f);
if (j == 0)
{
new_node->attributes->hide_attribute((Item *)f, true);
}
}
j = attribs->num_attributes();
new_node->attributes = new Attribute();
for (i = 0; i < j; i++)
{
attr = attribs->get_attribute(i);
jj = QN->attributes->index_of(
(char *)((Field *)attr)->table->s->table_name.str,
(char *)((Field *)attr)->field_name);
if (jj > -1)
{
new_node->attributes->add_attribute(ii, attr);
ii++;
QN->attributes->remove_attribute(jj);
}
else if (find_attr_in_expr(QN->join_expr,
(char *)((Field *)attr)->table->s->table_name.str,
(char *)((Field *)attr)->field_name))
{
new_node->attributes->add_attribute(ii, attr);
new_node->attributes->hide_attribute(attr, true);
ii++;
}
}
}
}
}
split_project_with_join(QN->left);
split_project_with_join(QN->right);
}
DBUG_RETURN(0);
}

The split_restrict_with_project() method searches the tree for restrictions that have attributes (and thus are both projections and restrictions) and breaks them into two nodes: a restrict and a project node. Listing 13-13 shows the source code for this method.

Listing 13-13. Split Restrict With Project

/*
Split restrictions that have attributes (projections).

SYNOPSIS
split_restrict_with_project()
query_node *QN IN the node to operate on

DESCRIPTION
This method looks for restrictions that have attributes (thus are both
projections and restrictions) and breaks them into two nodes.

NOTES
This is a RECURSIVE method!

RETURN VALUE
Success = 0
Failed = 1
*/
int Query_tree::split_restrict_with_project(query_node *QN)
{
DBUG_ENTER("split_restrict_with_project");
if(QN != 0)
{
if(((QN->attributes->num_attributes() > 0) &&
(QN->where_expr->num_expressions() > 0)) &&
((QN->node_type == qntProject) || (QN->node_type == qntRestrict)))
{
/*
Create a new node and:
1) Move the expressions to the new node.
2) Set the new node's children = current node children
3) Set the new node's relations = current node relations.
4) Set current node's left child = new node;
5) Set new node's id = current id + 1000;
6) set parent id, etc.
*/
query_node *new_node = (query_node*)my_malloc(sizeof(query_node),
MYF(MY_ZEROFILL | MY_WME));
init_node(new_node);
new_node->child = LEFTCHILD;
new_node->node_type = qntRestrict;
if(new_node->node_type == qntJoin)
{
new_node->join_cond = QN->join_cond;
new_node->join_type = QN->join_type;
}
QN->node_type = qntProject;
new_node->attributes = new Attribute();
new_node->where_expr = QN->where_expr;
new_node->join_expr = new Expression();
QN->where_expr = new Expression();
new_node->left = QN->left;
new_node->right = QN->right;
new_node->parent_nodeid = QN->nodeid;
new_node->nodeid = QN->nodeid + 1000;
if(new_node->left)
new_node->left->parent_nodeid = new_node->nodeid;
if(new_node->right)
new_node->right->parent_nodeid = new_node->nodeid;
for(int i = 0; i < MAXNODETABLES; i++)
{
new_node->relations[i] = QN->relations[i];
QN->relations[i] = NULL;
}
QN->left = new_node;
QN->right = 0;
}
split_restrict_with_project(QN->left);
split_restrict_with_project(QN->right);
}
DBUG_RETURN(0);
}

The find_restriction() method searches the tree from the starting node (QN) for the next restriction in the tree. If a restriction is found, a pointer to the node is returned; otherwise, the method returns NULL. Listing 13-14 shows the source code for this method.

Listing 13-14. Find Restriction

/*
Find a restriction in the subtree.
SYNOPSIS
find_restriction()
query_node *QN IN the node to operate on

DESCRIPTION
This method looks for a node containing a restriction and returns the node
pointer.

NOTES
This is a RECURSIVE method!
This finds the first restriction and is biased to the left tree.

RETURN VALUE
Success = query_node * the node located
Failed = NULL
*/
Query_tree::query_node *Query_tree::find_restriction(query_node *QN)
{
DBUG_ENTER("find_restriction");
query_node *N;

N = 0;
if(QN != 0)
{
/*
A restriction is a node marked as restrict and
has at least one expression
*/
if (QN->where_expr->num_expressions() > 0)
N = QN;
else
{
N = find_restriction(QN->left);
if(N == 0)
N = find_restriction(QN->right);
}
}
DBUG_RETURN(N);
}

The push_restrictions() method searches the tree from the starting node (QN) and pushes the restriction node (pNode) down the tree to nodes that contain the relations specified in the restriction. Listing 13-15 shows the source code for this method.

Listing 13-15. Push Restrictions

/*
Push restrictions down the tree.

SYNOPSIS
push_restrictions()
query_node *QN IN the node to operate on
query_node *pNode IN the node containing the restriction attributes

DESCRIPTION
This method looks for restrictions and pushes them down the tree to nodes
that contain the relations specified.

NOTES
This is a RECURSIVE method!
This finds the first restriction and is biased to the left tree.

RETURN VALUE
Success = 0
Failed = 1
*/
int Query_tree::push_restrictions(query_node *QN, query_node *pNode)
{
query_node *NewQN=0;

DBUG_ENTER("push_restrictions");
if((QN != 0) && (pNode != 0) && (pNode->left != 0))
{
/*
Conditions:
1) QN is a join node
2) QN is a project node
3) QN is a restrict node
4) All other nodes types are ignored.

Methods:
1) if join or project and the children are not already restrictions
add a new node and put where clause in new node else
see if you can combine the child node and this one
2) if the node has the table and it is a join,
create a new node below it and push the restriction
to that node.
3) if the node is a restriction and has the table,
just add the expression to the node's expression list
*/

/* if projection, move node down tree */
if((QN->nodeid != pNode->nodeid) && (QN->node_type == qntProject))
{
if (QN->left == 0)
{
QN->left = (query_node*)my_malloc(sizeof(query_node),
MYF(MY_ZEROFILL | MY_WME));
init_node(QN->left);
NewQN = QN->left;
NewQN->left = 0;
}
else
{
NewQN = QN->left;
QN->left = (query_node*)my_malloc(sizeof(query_node),
MYF(MY_ZEROFILL | MY_WME));
QN->left->left = NewQN;
NewQN = QN->left;
}
NewQN->sub_query = 0;
NewQN->join_cond = jcUN; /* (join_con_type) 0; */
NewQN->join_type = jnUNKNOWN; /* (type_join) 0; */
NewQN->right = 0;
for(long i = 0; i < MAXNODETABLES; i++)
NewQN->relations[i] = 0;
NewQN->nodeid = QN->nodeid + 1;
NewQN->parent_nodeid = QN->nodeid;
NewQN->node_type = qntRestrict;
NewQN->attributes = new Attribute();
NewQN->where_expr = new Expression();
NewQN->join_expr = new Expression();
if (pNode->relations[0])
NewQN->where_expr->reduce_expressions(pNode->relations[0]->table);
if ((QN->relations[0] != NULL) && (QN->relations[0] == pNode->relations[0]))
{
if (QN->relations[0])
{
if (find_table_in_expr(pNode->where_expr, QN->relations[0]->table_name))
{
NewQN->relations[0] = QN->relations[0];
QN->relations[0] = 0;
}
}
}
else
{
if (pNode->relations[0])
if (find_table_in_tree(QN->left, pNode->relations[0]->table_name))
NewQN->relations[0] = 0;
pNode->where_expr = NULL;
pNode->relations[0] = 0;
}
}
/* if join, move restrict node down tree */
else if((QN->nodeid != pNode->nodeid) &&
((QN->left == 0) || (QN->right == 0)) &&
(QN->node_type == qntJoin))
{
if(QN->relations[0] != 0)
{
QN->left = (query_node*)my_malloc(sizeof(query_node),
MYF(MY_ZEROFILL | MY_WME));
NewQN = QN->left;
NewQN->sub_query = 0;
NewQN->join_cond = jcUN; /* (join_con_type) 0; */
NewQN->join_type = jnUNKNOWN; /* (type_join) 0; */
NewQN->left = 0;
NewQN->right = 0;
for(long i = 0; i < MAXNODETABLES; i++)
NewQN->relations[i] = 0;
NewQN->nodeid = QN->nodeid + 1;
NewQN->parent_nodeid = QN->nodeid;
NewQN->node_type = qntRestrict;
NewQN->attributes = new Attribute();
NewQN->where_expr = new Expression();
NewQN->join_expr = new Expression();
NewQN->relations[0] = QN->relations[0];
QN->relations[0] = 0;
if (pNode->relations[0])
NewQN->where_expr->reduce_expressions(pNode->relations[0]->table);
}
else if(QN->relations[1] != 0)
{
QN->right = (query_node*)my_malloc(sizeof(query_node),
MYF(MY_ZEROFILL | MY_WME));
NewQN = QN->right;
NewQN->sub_query = 0;
NewQN->join_cond = jcUN; /* (join_con_type) 0; */
NewQN->join_type = jnUNKNOWN; /* (type_join) 0; */
NewQN->left = 0;
NewQN->right = 0;
for(long i = 0; i < MAXNODETABLES; i++)
NewQN->relations[i] = 0;
}
NewQN->nodeid = QN->nodeid + 1;
NewQN->parent_nodeid = QN->nodeid;
NewQN->node_type = qntRestrict;
NewQN->attributes = new Attribute();
NewQN->where_expr = new Expression();
NewQN->join_expr = new Expression();
NewQN->relations[0] = QN->relations[1];
QN->relations[1] = 0;
NewQN->where_expr->reduce_expressions(pNode->relations[0]->table);
}
push_restrictions(QN->left, pNode);
push_restrictions(QN->right, pNode);
}
DBUG_RETURN(0);
}

The find_projection() method searches the tree from the starting node (QN) for the next projection in the tree. If a projection is found, a pointer to the node is returned; otherwise, the method returns NULL. Listing 13-16 shows the source code for this method.

Listing 13-16. Find Projection

/*
Find a projection in the tree

SYNOPSIS
find_projection()
query_node *QN IN the node to operate on

DESCRIPTION
This method looks for a node containing a projection and returns the node
pointer.

NOTES
This finds the first projection and is biased to the left tree.
This is a RECURSIVE method!

RETURN VALUE
Success = query_node * the node located or NULL for not found
Failed = NULL
*/
Query_tree::query_node *Query_tree::find_projection(query_node *QN)
{
DBUG_ENTER("find_projection");
query_node *N;

N = 0;
if(QN != 0)
{
/*
A projection is a node marked as project and
has at least one attribute
*/
if((QN->node_type == qntProject) &&
(QN->attributes != 0))
N = QN;
else
{
N = find_projection(QN->left);
if(N == 0)
N = find_projection(QN->right);
}
}
DBUG_RETURN(N);
}

The push_projections() method searches the tree from the starting node (QN) and pushes the projection node (pNode) down the tree to nodes that contain the relations specified in the projection. Listing 13-17 shows the source code for this method.

Listing 13-17. Push Projections

/*
Push projections down the tree.

SYNOPSIS
push_projections()
query_node *QN IN the node to operate on
query_node *pNode IN the node containing the projection attributes

DESCRIPTION
This method looks for projections and pushes them down the tree to nodes
that contain the relations specified.

NOTES
This is a RECURSIVE method!

RETURN VALUE
Success = 0
Failed = 1
*/
int Query_tree::push_projections(query_node *QN, query_node *pNode)
{
DBUG_ENTER("push_projections");
Item * a;
int i;
int j;

if((QN != 0) && (pNode != 0))
{
if((QN->nodeid != pNode->nodeid) &&
(QN->node_type == qntProject))
{
i = 0;
j = QN->attributes->num_attributes();
/* move attributes to new node */
while(i < j)
{
a = QN->attributes->get_attribute(i);
if(has_relation(QN,
(char *)((Field *)a)->table->s->table_name.str))
{
if(!has_attribute(QN, a))
insert_attribute(QN, a);
del_attribute(pNode, a);
}
i++;
}
}
if(pNode->attributes->num_attributes() != 0)
{
push_projections(QN->left, pNode);
push_projections(QN->right, pNode);
}
}
DBUG_RETURN(0);
}

The find_join() method searches the tree from the starting node (QN) for the next join in the tree. If a join is found, a pointer to the node is returned; otherwise, the method returns NULL. Listing 13-18 shows the source code for this method.

Listing 13-18. Find Join

/*
Find a join in the subtree.
SYNOPSIS
find_join()
query_node *QN IN the node to operate on

DESCRIPTION
This method looks for a node containing a join and returns the
node pointer.

NOTES
This is a RECURSIVE method!
This finds the first join and is biased to the left tree.

RETURN VALUE
Success = query_node * the node located
Failed = NULL
*/
Query_tree::query_node *Query_tree::find_join(query_node *QN)
{
DBUG_ENTER("find_join");
query_node *N;
N = 0;

if(QN != 0)
{
/*
if this is a join node or a restrict node with
at least one join expression, it could be an unprocessed join
because the default node type is restrict
*/
if(((QN->node_type == qntJoin) ||
(QN->node_type == qntRestrict)) && (QN->join_expr->num_expressions() > 0))
N = QN;
else
{
N = find_join(QN->left);
if(N == 0)
N = find_join(QN->right);
}
}
DBUG_RETURN(N);
}

The push_joins() method searches the tree from the starting node (QN) and pushes the join node (pNode) down the tree to a position at which the join becomes the parent of the two nodes that contain the relations referenced in the join condition. Listing 13-19 shows the source code for this method.

Listing 13-19. Push Joins

/*
Push joins down the tree.

SYNOPSIS
push_joins()
query_node *QN IN the node to operate on
query_node *pNode IN the node containing the join

DESCRIPTION
This method looks for theta joins and pushes them down the tree to the
parent of two nodes that contain the relations specified.

NOTES
This is a RECURSIVE method!

RETURN VALUE
Success = 0
Failed = 1
*/
int Query_tree::push_joins(query_node *QN, query_node *pNode)
{
DBUG_ENTER("push_joins");
Item *lField;
Item *rField;
expr_node *node;

if(!pNode->join_expr)
DBUG_RETURN(0);
node = pNode->join_expr->get_expression(0);
if (!node)
DBUG_RETURN(0);
lField = node->left_op;
rField = node->right_op;

/* Node must have expressions and not be null */
if((QN != NULL) && (pNode != NULL) &&
(pNode->join_expr->num_expressions() > 0))
{
/* check to see if tables in join condition exist */
if((QN->nodeid != pNode->nodeid) &&
(QN->node_type == qntJoin) &&
QN->join_expr->num_expressions() == 0 &&
((has_relation(QN->left,
(char *)((Field *)lField)->table->s->table_name.str) &&
has_relation(QN->right,
(char *)((Field *)rField)->table->s->table_name.str)) ||
(has_relation(QN->left,
(char *)((Field *)rField)->table->s->table_name.str) &&
has_relation(QN->right,
(char *)((Field *)lField)->table->s->table_name.str))))
{
/* move the expression */
QN->join_expr = pNode->join_expr;
pNode->join_expr = new Expression();
QN->join_type = jnINNER;
QN->join_cond = jcON;
}
push_joins(QN->left, pNode);
push_joins(QN->right, pNode);
}
DBUG_RETURN(0);
}

The prune_tree() method searches the tree for blank nodes (nodes that no longer have any operation or function) left behind by the heuristic optimization and deletes them. Listing 13-20 shows the source code for this method.

Listing 13-20. Prune Tree

/*
Prune the tree of dead limbs.

SYNOPSIS
prune_tree()
query_node *prev IN the previous node (parent)
query_node *cur_node IN the current node pointer (used to delete).

DESCRIPTION
This method looks for blank nodes that are a result of performing
heuristic optimization on the tree and deletes them.

NOTES
This is a RECURSIVE method!

RETURN VALUE
Success = 0
Failed = 1
*/
int Query_tree::prune_tree(query_node *prev, query_node *cur_node)
{
DBUG_ENTER("prune_tree");
if(cur_node != 0)
{
/*
Blank Nodes are 1) projections without attributes
that have at least 1 child, or 2) restrictions
without expressions
*/
if((((cur_node->node_type == qntProject) &&
(cur_node->attributes->num_attributes() == 0)) ||
((cur_node->node_type == qntRestrict) &&
(cur_node->where_expr->num_expressions() == 0))) &&
((cur_node->left == 0) || (cur_node->right == 0)))
{
/*
Redirect the pointers for the nodes above and
below this node in the tree.
*/
if(prev == 0)
{
if(cur_node->left == 0)
{
cur_node->right->parent_nodeid = -1;
root = cur_node->right;
}
else
{
cur_node->left->parent_nodeid = -1;
root = cur_node->left;
}
my_free(cur_node);
cur_node = root;
}
else
{
if(prev->left == cur_node)
{
if(cur_node->left == 0)
{
prev->left = cur_node->right;
if (cur_node->right != NULL)
cur_node->right->parent_nodeid = prev->nodeid;
}
else
{
prev->left = cur_node->left;
if (cur_node->left != NULL)
cur_node->left->parent_nodeid = prev->nodeid;
}
my_free(cur_node);
cur_node = prev->left;
}
else
{
if(cur_node->left == 0)
{
prev->right = cur_node->right;
if (cur_node->right != NULL)
cur_node->right->parent_nodeid = prev->nodeid;
}
else
{
prev->right = cur_node->left;
if (cur_node->left != NULL)
cur_node->left->parent_nodeid = prev->nodeid;
}
my_free(cur_node);
cur_node = prev->right;
}
}
prune_tree(prev, cur_node);
}
else
{
prune_tree(cur_node, cur_node->left);
prune_tree(cur_node, cur_node->right);
}
}
DBUG_RETURN(0);
}

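With the helper methods in place, it may help to see how they are typically strung together. The following is a rough sketch, written in the style of the chapter's listings, of how the split, push, and prune phases could be driven from a single method. The method name, the phase ordering, and the single find-and-push pass per phase are assumptions for illustration (the sketch would also need a matching declaration in the class); refer to the downloaded source for the chapter's actual heuristic_optimization() method, which may iterate these phases.

/*
  Sketch only -- not the chapter's heuristic_optimization() source.
  It shows one plausible ordering of the phases described above.
*/
int Query_tree::heuristic_optimization_sketch()
{
  query_node *p;

  DBUG_ENTER("heuristic_optimization_sketch");
  /* 1) Separate combined operations into discrete nodes. */
  split_restrict_with_join(root);
  split_project_with_join(root);
  split_restrict_with_project(root);

  /* 2) Push restrictions toward the relations they reference. */
  p = find_restriction(root);
  if (p != 0)
    push_restrictions(root, p);

  /* 3) Push projections down so fewer attributes flow up the tree. */
  p = find_projection(root);
  if (p != 0)
    push_projections(root, p);

  /* 4) Anchor joins above the nodes that supply their relations. */
  p = find_join(root);
  if (p != 0)
    push_joins(root, p);

  /* 5) Remove any nodes left empty by the transformations. */
  prune_tree(0, root);
  DBUG_RETURN(0);
}
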
Compiling and Testing the Code

If you haven’t already, download the source code for this chapter and place the files in the /sql directory off the root of your source tree. In the example code, you will also find a difference file (ch13.diff) that you can use to apply the changes to the server source-code files (e.g., mysqld.cc, sql_cmd.h, etc.). Alternatively, you can reuse the changes from Chapter 12, because the changes to the server code are the same.

Spend a few moments looking through the source code so that you are familiar with the methods. Taking the time to look through the code now will help if you need to debug it to work with your configuration, or if you want to add other enhancements or work the exercises.

Once you have all the source-code files downloaded and have examined the code, add the files to the CMakeLists.txt file. See “Adding the Files to the CMakeLists.txt file” in Chapter 12 for the details. You will be adding the attribute and expression helper source files (attribute.cc and expression.cc). Once you have the files added to the project, navigate to the root of the source tree and run the cmake and make commands as shown. Make sure the code compiles without errors.

cmake .
make

Once you have the new code installed and compiled, run the server and perform the tests. You can run the test you created earlier or you can enter the commands in a MySQL client utility. Listing 13-21 shows the expected output of running the commands listed in the test.

Listing 13-21. Example Test Runs

DBXP_SELECT * FROM staff;
+--------------------------+
| Execution Path |
+--------------------------+
| expert_mysql.staff |
| | |
| | |
| | |
| V |
| ------------------- |
| | RESTRICT | |
| ------------------- |
| | Access Method: | |
| | iterator | |
| ------------------- |
| | |
| | |
| | |
| V |
| ------------------- |
| | PROJECT | |
| ------------------- |
| | Access Method: | |
| | iterator | |
| ------------------- |
| | |
| | |
| V |
| Result Set |
+--------------------------+
25 rows in set (0.00 sec)

+----------------------------------------------------+
| Execution Path |
+----------------------------------------------------+
| expert_mysql.staff |
| | |
| | |
| | |
| V |
| ------------------- |
| | PROJECT | |
| ------------------- |
| | Access Method: | |
| | iterator | |
| ------------------- |
| | expert_mysql.directorate |
| | | |
| | | |
| | | |
| | V |
| | ------------------- |
| | | PROJECT | |
| | ------------------- |
| | | Access Method: | |
| | | iterator | |
| | ------------------- |
| | | |
| | ---------------------------- |
| | | |
| V V |
| ------------------- |
| | JOIN | |
| ------------------- |
| | Access Method: | |
| | iterator | |
| ------------------- |
| | |
| | |
| V |
| Result Set |
+----------------------------------------------------+
36 rows in set (0.00 sec)

+------------------------------------------------+
| Execution Path |
+------------------------------------------------+
| expert_mysql.staff |
| | |
| | |
| | |
| V |
| ------------------- |
| | RESTRICT | |
| ------------------- |
| | Access Method: | |
| | iterator | |
| ------------------- |
| | expert_mysql.tasking |
| | | |
| | | |
| | | |
| | V |
| | ------------------- |
| | | PROJECT | |
| | ------------------- |
| | | Access Method: | |
| | | iterator | |
| | ------------------- |
| | | |
| | ---------------------------- |
| | | |
| V V |
| ------------------- |
| | JOIN | |
| ------------------- |
| | Access Method: | |
| | iterator | |
| ------------------- |
| | |
| | |
| V |
| Result Set |
+------------------------------------------------+
36 rows in set (0.00 sec)

Query OK, 4 rows affected (0.00 sec)
mysql>

Notice how the query plans differ for each of the statements entered. Take some time to explore other query statements to see how the optimizer optimizes other forms of queries.

Note The output for the DBXP_SELECT commands may look a little odd at first (they are query plans), but recall from earlier that we stubbed the DBXP_select_command() method in the sql_dbxp_parser.cc file to redirect to the DBXP_explain_select_command() method. We will add the execution of the queries in the next chapter.
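For reference, here is a minimal sketch of what that stub might look like; the function signatures are assumptions based on Chapter 12, so treat it as an illustration rather than the chapter's actual source.

/*
  Sketch only -- assumed signatures, not the actual Chapter 12 stub.
  Forwarding DBXP_SELECT to the EXPLAIN handler is why the test runs
  above print query plans instead of result sets.
*/
int DBXP_select_command(THD *thd)
{
  DBUG_ENTER("DBXP_select_command");
  /* Query execution is added in the next chapter; show the plan for now. */
  DBUG_RETURN(DBXP_explain_select_command(thd));
}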

Summary

I presented in this chapter the most complex database internal technology—an optimizer. You learned how to expand the concept of the query tree to incorporate a query optimizer that uses the tree structure in the optimization process. More important, you discovered how to construct a heuristic query optimizer. The knowledge of heuristic optimizers should provide you with a greater understanding of the DBXP engine and how it can be used to study database technologies in more depth. It doesn’t get any deeper than an optimizer!

In the next chapter, I show you more about query execution through an example implementation of a query-tree optimization strategy. The next chapter will complete the DBXP engine by linking the heuristic query optimizer, using the query-tree class, to an execution process that—surprise—also uses the query-tree structure.

EXERCISES

The following lists several areas for further exploration. They represent the types of activities you might want to conduct as experiments (or as a class assignment) to explore relational database technologies.

1. Complete the code for the balance_joins() method. Hint: You will need to create an algorithm that can move conjunctive joins around so that the join that is most restrictive is executed first (is lowest in the tree).

2. Complete the code for the cost_optimization() method. Hint: You will need to walk the tree and indicate nodes that can use indexes.

3. Examine the code for the heuristic optimizer. Does it cover all possible queries? If not, are there any other rules that can be used to complete the coverage?

4. Examine the code for the query tree and heuristic optimizer. How can you implement the distinct node type as listed in the query-tree class? Hint: See the code that follows the prune_tree() method in the heuristic_optimization() method.

5. How can you change the code to recognize invalid queries? What are the conditions that determine that a query is invalid and how would you test for them?

6. (advanced) MySQL does not currently support the intersect operation (as defined by Date). Change the MySQL parser to recognize the new keyword and process queries, such as SELECT * FROM A INTERSECT B. Are there any limitations of this operation, and are they reflected in the optimizer?

7. (advanced) How would you implement the GROUP BY, ORDER BY, and HAVING clauses? Make the changes to the optimizer to enable these clauses.

1 P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. “Access Path Selection in a Relational Database Management System.” Proceedings of the ACM SIGMOD International Conference on the Management of Data, Aberdeen, Scotland: 23–34. Considered by some to be the “Bible of Query Optimization.”

2 M. Stonebraker, E. Wong, P. Kreps. 1976. “The Design and Implementation of INGRES.” ACM Transactions on Database Systems 1(3): 189–222.

3 http://en.wikipedia.org/wiki/Dynamic_programming

4 Query execution in traditional systems includes not only processing the query but also accessing the data from physical media. In-memory systems, however, do not have the long access times associated with retrieval from physical media.

5 This practice is still used today by most commercial database systems.

6 The accumulation of statistics in real time is called piggyback statistic generation.

7 D. Kossman and K. Stocker. 2000. “Iterative Dynamic Programming: A New Class of Query Optimization Algorithms.” ACM Transactions on Database Systems 25(1): 43–82.

8 Y. E. Ioannidis, R. T. Ng, K. Shim, and T. Sellis. 1997. “Parametric Query Optimization.” VLDB Journal 6:132–151.

9 In this case, efficient execution may not be the optimal solution.

10 A common technique practiced by novice database users.

11 May disallow the use of indexes for the join operation.

12 C. Lee, C. Shih, and Y. Chen. 2001. “A Graph-Theoretic Model for Optimizing Queries Involving Methods.” VLDB Journal 9:327–343.

13 It isn’t a true abstract class, because it contains some methods that are defined in the source code. A true abstract class defines all of its methods as virtual and is therefore used as an interface rather than a base class.

14 The sentinel value in this code is derived from a classic rock album. Do you know the name of the band?