Expert MySQL, Second Edition (2012)
PART 4. APPENDIX
Chapter Exercise Notes
This section contains some hints and helpful direction for the exercises included in Chapters 12, 13, and 14. Some of the exercises are practical exercises whereby the solutions would be too long to include in an appendix. For those exercises that require programming to solve I include some hints as to how to write the code for the solution. In other cases, I include additional information that should assist you in completing the exercise.
The following questions are from Chapter 12, “Internal Query Representation.”
Question 1. The query in figure 12-1 exposes a design flaw in one of the tables. What is it? Does the flaw violate any of the normal forms? If so, which one?
Look at the semester attribute. How many values does the data represent? Packing data like this makes for some really poor performing queries if you need to access part of the attribute (field). For example, to query for all of the semesters in 2001, you would have to use a WHERE clause and use the LIKE operator: WHERE semester LIKE '%2001'. This practice of packing fields (also called multi-valued fields) violates First Normal Form.
Question 2. Explore the TABLE structure and change the SELECT DBXP stub to return information about the table and its fields
Change the code to return information like we did in Chapter 8 when we explored the show_disk_usage_command() method. Only this time, include the metadata about the table. Hint: see the table class.
Question 3. Change the EXPLAIN SELECT DBXP command to produce an output similar to the MySQL EXPLAIN SELECT command
Change the code to produce the information in a table like that of the MySQL EXPLAIN command. Note that you will need additional methods in the Query_tree class to gather information about the optimized query.
Question 4. Modify the build_query_tree function to identify and process the LIMIT clause
The changes to the code require you to identify when a query has the LIMIT clause and to abbreviate the results accordingly. By way of a hint, here is the code to capture the value of the LIMIT clause. You will need to modify the code in the DBXP_select_command() method to handle the rest of the operation.
SELECT_LEX_UNIT *unit= &lex->unit;
Question 5. How can the query tree query_node structure be changed to accommodate HAVING, GROUP BY, and ORDER clauses?
The best design is one that stays true to the query tree concept. That is, consider a design where each of these clauses is a separate node in the tree. Consider also if there are any heuristics that may apply to these operations. Hint: would it not be more efficient to process the HAVING clause nearest the leaf nodes? Lastly, consider rules that govern how many of each of these nodes can exist in the tree.
The following questions are from Chapter 13, “Query Optimization.”
Question 1. Complete the code for the balance_joins() method. Hints: you will need to create an algorithm that can move conjunctive joins around so that the join that is most restrictive is executed first (is lowest in the tree)
This exercise is all about how to move joins around in the tree to push the most restrictive joins down. The tricky part is using the statistics of the tables to determine which joins will produce the fewest results. Look to the handler and table classes for information about accessing this data. Beyond that, you will need helper methods to traverse the tree and get information about the tables. This is necessary because it is possible (and likely) that the joins will be higher in the tree and may not contain direct reference to the table.
Question 2. Complete the code for the cost_optimization() method. Hints: you will need to walk the tree and indicate nodes that can use indexes
This exercise requires you to interrogate the handler and table classes to determine which tables have indexes and what those columns are.
Question 3. Examine the code for the heuristic optimizer. Does it cover all possible queries? If not, are there any other rules (heuristics) that can be used to complete the coverage?
You should discover that there are many such heuristics and that this optimizer covers only the most effective of the heuristics. For example, you could implement heuristics that take into account the GROUP BY and HAVING operations treating them in a similar fashion as project or restrict pushing the nodes down the tree for greater optimization.
Question 4. Examine the code for the query tree and heuristic optimizer. How can you implement the distinct node type as listed in the query tree class? Hint: see the code that follows the prune_tree() method in the heuristic_optimization() method
Most of the hints for this exercise are in the sample code. The following excerpt shows how you can identify when a DISTINCT option is specified on the query.
Question 5. How can you change the code to recognize invalid queries? What are the conditions that determine a query is invalid and how would you test for them?
Part of the solution for this exercise is done for you. For example, a query statement that is syntactically incorrect will be detected by the parser and an appropriate error displayed. However, for those queries that are syntactically correct by semantically meaningless, you will need to add additional error handling code to detect. Try a query that is syntactically correct but references the wrong fields for a query. Create tests of this nature and trace (debug) the code as you do. You should see places in the code where additional error handling can be placed. Lastly, you could also create a method in the Query_tree class that validates the query tree itself. This could be particularly handy if you attempt to create additional node types or implement other heuristic methods.
Question 6. (advanced) MySQL does not currently support the INTERSECT operation (as defined by Date). Change the MySQL parser to recognize the new keyword and process queries like SELECT * FROM A INTERSECT B. Are there any limitations of this operation and are they reflected in the optimizer?
What sounds like a very difficult problem has a very straight-forward solution. Consider adding a new node type named “intersect” that has two children. The operation merely returns those rows that are in both tables. Hint: use one of the many merge sort variants to accomplish this.
Question 7. (advanced) How would you implement the GROUP BY, ORDER BY, and HAVING clauses? Make the changes to the optimizer to enable these clauses.
There are many ways to accomplish this. In keeping with the design of the Query_tree class, each of these operations can be represented as another node type. You can build a method to handle each of these just as we did with restrict, project, and join. Note however that the HAVINGclause is used with the GROUP BY clause and the ORDER BY clause is usually processed last.
The following questions are from Chapter 14, “Query Execution.”
Question 1. Complete the code for the do_join() method to support all of the join types supported in MySQL. Hint: you need to be able to identify the type of join before you begin optimization. Look to the parser for details
To complete this exercise, you may want to restructure the code in the do_join() method. The example I used keeps all of the code together, but a more elegant solution would be one where the select-case statement in the do_join() method called helper methods for each type of join and possibly other helper methods for common operations (i.e., see the preempt_pipeline code). The code for the other forms of joins is going to be very similar to the join implemented in the example code.
Question 2. Examine the code for the check_rewind() method in the Query_tree class. Change the implementation to use temporary tables to avoid high memory usage when joining large tables
This exercise also has a straight-forward solution. See the MySQL code in the sql_select.cc file for details on how to create a temporary table. Hint: it’s very much the same as create table and insert. You could also use the base Spartan classes and create a temporary table that stores the record buffers.
Question 3. Evaluate the performance of the DBXP query engine. Run multiple test runs and record execution times. Compare these results to the same queries using the native MySQL query engine. How does the DBXP engine compare to MySQL?
There are many ways to record execution time. You could use a simple stopwatch and record the time based on observation or you could add code that captures the system time. This later method is perhaps the quickest and most reliable way to determine relative speed. I say relative because there are many factors concerning the environment and what is running at the time of the execution that could affect performance. When you conduct your test runs, be sure to use multiple test runs and perform statistical analysis on the results. This will give you a normalized set of data to compare.
Question 4. Why is the remove duplicates operation not necessary for the intersect operation? Are there any conditions where this is false? If so, what are they?
Let us consider what an intersect operation is. It is simply the rows that appear in each of the tables involved (you can intersect on more than two tables). Duplicates in this case are not possible if the tables themselves do not have duplicates. However, if the tables are the result of operations performed in the tree below and have not had the duplicates removed and the DISTINCT operation is included in the query, you will need to remove duplicates. Basically, this is an “it depends” answer.
Question 5. (advanced) MySQL does not currently support a CROSS PRODUCT or INTERSECT operation (as defined by Date). Change the MySQL parser to recognize these new keywords and process queries like SELECT * FROM A CROSS B and SELECT * FROM A INTERSECT B and add these functions to the execution engine. Hint: see the do_join() method
The files you need to change are the same files we changed when adding the DBXP keyword. These include lex.h and sql_yacc.yy. You may need to extend the sql_lex structure to include provisions for recording the operation type.
Question 6. (advanced) Form a more complete list of test queries and examine the limitations of the DBXP engine. What modifications are necessary to broaden the capabilities of the DBXP engine?
First, the query tree should be expanded to include the HAVING, GROUP BY, and ORDER BY clauses. You should also consider adding the capabilities for processing aggregate functions. These aggregate functions (e.g., max(), min(), etc.) could be fit into the Expression class whereby new methods are created to parse and evaluate the aggregate functions.