Expert Oracle SQL: Optimization, Deployment, and Statistics (2014)

PART 4. Optimization

CHAPTER 16. Rewriting Queries

At the beginning of Chapter 1 I explained the naiveté of assuming that the CBO will come up with an optimal execution plan regardless of how you write your SQL statement. So far in this book you have already seen numerous cases where an improperly constructed SQL statement can cause the CBO to pick an inappropriate execution plan, there being quite a few examples in the last chapter alone. This chapter and the next explore the topic of rewriting SQL statements in more detail, but I won’t pretend to be giving you a full list of problematic scenarios.

In fact, this chapter is a lot shorter than it would have needed to be if it had been written a few years ago; as the number of optimizer transformations increases, the number of SQL constructs that cause performance problems decreases. Nevertheless, there are still many cases where a SQL statement looks perfectly sensible to the casual reader but will need to be rewritten to achieve acceptable performance.

These days one of the most common reasons for rewriting a SQL statement is to address the problem of poorly performing sorts, and the next chapter is dedicated entirely to that topic. However, before we get into that complex topic let us look at a few simple examples of SQL rewrites that we haven’t discussed earlier in the book.

Use of Expressions in Predicates

Generally speaking, you should try hard to avoid using expressions on columns used in predicates. We have already seen in Listing 14-1 that the use of a function in a predicate can cause cardinality errors in the CBO, but that isn’t the only issue. Listing 16-1 shows how partition elimination can be affected by expressions.

Listing 16-1. Partition elimination and expressions in predicates

SELECT COUNT (DISTINCT amount_sold)
FROM sh.sales
WHERE EXTRACT (YEAR FROM time_id) = 1998;

------------------------------------------------------------------------
| Id | Operation | Name | Cost (%CPU)| Pstart| Pstop |
------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 528 (4)| | |
| 1 | SORT AGGREGATE | | | | |
| 2 | VIEW | VW_DAG_0 | 528 (4)| | |
| 3 | HASH GROUP BY | | 528 (4)| | |
| 4 | PARTITION RANGE ALL| | 528 (4)| 1 | 28 |
| 5 | TABLE ACCESS FULL | SALES | 528 (4)| 1 | 28 |
------------------------------------------------------------------------

SELECT COUNT (DISTINCT amount_sold)
FROM sh.sales
WHERE time_id >= DATE '1998-01-01' AND time_id < DATE '1999-01-01';

-----------------------------------------------------------------------------
| Id | Operation | Name | Cost (%CPU)| Pstart| Pstop |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 118 (6)| | |
| 1 | SORT AGGREGATE | | | | |
| 2 | VIEW | VW_DAG_0 | 118 (6)| | |
| 3 | HASH GROUP BY | | 118 (6)| | |
| 4 | PARTITION RANGE ITERATOR| | 113 (2)| 5 | 8 |
| 5 | TABLE ACCESS FULL | SALES | 113 (2)| 5 | 8 |
-----------------------------------------------------------------------------

Both queries in Listing 16-1 produce the same result: the number of distinct values of AMOUNT_SOLD in the SH.SALES table for 1998. The first query uses the EXTRACT function to identify the year of each sale. The second query uses two predicates to define a range of dates. We can see that in the latter case we have been able to perform partition elimination and thus only accessed the partitions for 1998. In the former case we have had to access all the partitions in the table because the application of a function to the column value precludes partition elimination.

If there is no straightforward way to avoid using a function on a column used in a predicate, you might need to create a virtual column or add a function-based index, as was demonstrated in Listing 15-2; a sketch of the virtual-column option appears below, and Listing 16-2 then shows another approach.
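
Here is that sketch. It is hedged: the virtual column and index names are my own inventions rather than part of this chapter's listings, and the syntax assumes Oracle 11g or later.

ALTER TABLE sh.sales
ADD (sale_year AS (EXTRACT (YEAR FROM time_id)));

CREATE BITMAP INDEX sales_year_bix
ON sh.sales (sale_year)
LOCAL;

SELECT COUNT (DISTINCT amount_sold)
FROM sh.sales
WHERE sale_year = 1998;

The virtual column gives the CBO a column on which to gather statistics and an index to use. Note, though, that partition elimination on TIME_ID is still unavailable, so the date-range rewrite in Listing 16-1 remains preferable whenever it can be applied.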

Listing 16-2. Using a join when a function is applied to a column

SELECT COUNT (DISTINCT amount_sold)
FROM sh.sales
WHERE EXTRACT (MONTH FROM time_id) = 10;

------------------------------------------------------------------------
| Id | Operation | Name | Cost (%CPU)| Pstart| Pstop |
------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 528 (4)| | |
| 1 | SORT AGGREGATE | | | | |
| 2 | VIEW | VW_DAG_0 | 528 (4)| | |
| 3 | HASH GROUP BY | | 528 (4)| | |
| 4 | PARTITION RANGE ALL| | 528 (4)| 1 | 28 |
| 5 | TABLE ACCESS FULL | SALES | 528 (4)| 1 | 28 |
------------------------------------------------------------------------

SELECT COUNT (DISTINCT amount_sold)
FROM sh.sales s, sh.times t
WHERE EXTRACT (MONTH FROM t.time_id) = 10 AND t.time_id = s.time_id;

---------------------------------------------------------------------------------------------
| Id | Operation | Name | Cost (%CPU)| Pstart| Pstop |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 102 (1)| | |
| 1 | SORT AGGREGATE | | | | |
| 2 | VIEW | VW_DAG_0 | 102 (1)| | |
| 3 | HASH GROUP BY | | 102 (1)| | |
| 4 | NESTED LOOPS | | | | |
| 5 | NESTED LOOPS | | 101 (0)| | |
| 6 | INDEX FULL SCAN | TIMES_PK | 0 (0)| | |
| 7 | PARTITION RANGE ITERATOR | | | KEY | KEY |
| 8 | BITMAP CONVERSION TO ROWIDS | | | | |
| 9 | BITMAP INDEX SINGLE VALUE | SALES_TIME_BIX | | KEY | KEY |
| 10 | TABLE ACCESS BY LOCAL INDEX ROWID| SALES | 101 (0)| 1 | 1 |
---------------------------------------------------------------------------------------------

Both queries in Listing 16-2 count the number of distinct values of AMOUNT_SOLD for sales in SH.SALES made in the month of October (of any year). The execution plan for the first query accesses all the rows in the table because, as with the first query in Listing 16-1, no partition elimination is possible. The second query in Listing 16-2 performs an INDEX FULL SCAN on the TIMES_PK index of the SH.TIMES table. This extra step adds overhead, but the overhead (a single-block access) is recouped with interest because only the matching rows from the SH.SALES table are accessed. The net benefit may not be obvious, but you can see that on my laptop the CBO estimates a cost of 528 for the first query and 102 for the second. The second query actually ran faster as well!

Equality Versus Inequality Predicates

The use of an inequality operator in a predicate is sometimes unavoidable, but its misuse is so frequent that the SQL Tuning Advisor throws up a warning. Listing 16-3 shows a typical example.

Listing 16-3. Inappropriate use of an inequality operator

CREATE TABLE mostly_boring
(
primary_key_id INTEGER PRIMARY KEY
,special_flag CHAR (1)
,boring_field CHAR (100) DEFAULT RPAD ('BORING', 100)
)
PCTFREE 0;

INSERT INTO mostly_boring (primary_key_id, special_flag)
SELECT ROWNUM, DECODE (MOD (ROWNUM, 10000), 0, 'Y', 'N')
FROM DUAL
CONNECT BY LEVEL <= 100000;

BEGIN
DBMS_STATS.gather_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'MOSTLY_BORING'
,method_opt => 'FOR COLUMNS SPECIAL_FLAG SIZE 2');
END;
/

CREATE INDEX special_index
ON mostly_boring (special_flag);

SELECT *
FROM mostly_boring
WHERE special_flag != 'N';

----------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 384 (1)|
| 1 | TABLE ACCESS FULL| MOSTLY_BORING | 10 | 384 (1)|
----------------------------------------------------------------

SELECT *
FROM mostly_boring
WHERE special_flag = 'Y';

----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 2 (0)|
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MOSTLY_BORING | 10 | 2 (0)|
| 2 | INDEX RANGE SCAN | SPECIAL_INDEX | 10 | 1 (0)|
----------------------------------------------------------------------------------

SELECT *
FROM mostly_boring
WHERE special_flag = 'N';

----------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 99990 | 384 (1)|
| 1 | TABLE ACCESS FULL| MOSTLY_BORING | 99990 | 384 (1)|
----------------------------------------------------------------

I have created a table called MOSTLY_BORING that contains 100,000 rows that are, you guessed it, mostly boring. There are, however, ten rows that have SPECIAL_FLAG set to 'Y'; the remaining 99,990 rows have a value of 'N' for SPECIAL_FLAG. When gathering statistics I specified a histogram for SPECIAL_FLAG so that the skew in the two values is recorded. When we query the table using the predicate SPECIAL_FLAG != 'N' we use a full table scan; we can't take advantage of the index that we have set up. When we use the predicate SPECIAL_FLAG = 'Y' we can see that the index is used and a dramatic estimated cost saving is the result.

When we request all the boring rows with the predicate SPECIAL_FLAG = 'N' the CBO correctly selects a full table scan. There is no point in going through the index to read almost all the rows in the table. This means that the entries in the index for boring rows (SPECIAL_FLAG = 'N') will never be used. We can avoid storing these entries and make our index smaller by a technique that some database theoreticians will brand as heretical: we can use the absence of a value (NULL) to indicate a boring row, taking advantage of the fact that rows whose indexed columns are all NULL do not appear in a B-tree index at all. Let us reload the table data this way.

Listing 16-4 reloads the data using a “value” of NULL instead of N to represent the 99,990 boring rows. We no longer need a histogram on SPECIAL_FLAG, as there is now only one non-null value. The estimated cost of the modified query that selects the boring rows has dropped slightly. This is expected: removing the value of N from the boring rows makes the table slightly smaller, and the cost of the full table scan falls correspondingly. But what do we see with the query that selects the special rows? The estimated cost has increased from 2 to 11! In fact, the increased estimate is correct, and the estimated cost of 2 in Listing 16-3 was too low. The original value of the clustering factor for SPECIAL_INDEX suggested that most of the time consecutive index entries would be in the same block. Whereas this was true for the majority of index entries (those with SPECIAL_FLAG = 'N'), it wasn't the case for our ten special rows, no two of which were in the same block! As it happens, the BLEVEL of SPECIAL_INDEX has dropped from 1 to 0 as a result of the data change, so we do get a small performance improvement with the change regardless of what the CBO estimates.

Listing 16-4. Use of NULL to indicate a common value

TRUNCATE TABLE mostly_boring;

INSERT INTO mostly_boring (primary_key_id, special_flag)
SELECT ROWNUM, DECODE (MOD (ROWNUM, 10000), 0, 'Y', NULL)
FROM DUAL
CONNECT BY LEVEL <= 100000;

BEGIN
DBMS_STATS.gather_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'MOSTLY_BORING');
END;
/

SELECT *
FROM mostly_boring
WHERE special_flag = 'Y';

--------------------------------------------------------------------------
| Id | Operation | Name | Cost (%CPU)|
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 11 (0)|
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MOSTLY_BORING | 11 (0)|
| 2 | INDEX RANGE SCAN | SPECIAL_INDEX | 1 (0)|
--------------------------------------------------------------------------

SELECT *
FROM mostly_boring
WHERE special_flag IS NULL;

----------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 99990 | 379 (1)|
| 1 | TABLE ACCESS FULL| MOSTLY_BORING | 99990 | 379 (1)|
----------------------------------------------------------------
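
As a hedged aside (this query is mine, not one of the chapter's listings), you can confirm the index statistics discussed above from the data dictionary:

SELECT blevel, clustering_factor, num_rows
FROM user_indexes
WHERE index_name = 'SPECIAL_INDEX';

Checked before and after the reload in Listing 16-4, this shows NUM_ROWS dropping from 100,000 to 10 and BLEVEL from 1 to 0 as the entries for the boring rows vanish from the index.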

Implicit Data-Type Conversions

The fact that the Oracle database automatically converts data from one type to another can be very convenient, particularly for business users throwing together an important ad-hoc query where development time is at a premium. However, this silent conversion may mask serious performance issues. Listing 16-5 shows one way that implicit data-type conversions can mask performance problems.

Listing 16-5. Implicit data-type conversion

CREATE TABLE date_table
(
mydate DATE
,filler_1 CHAR (2000)
)
PCTFREE 0;

INSERT INTO date_table (mydate, filler_1)
SELECT SYSDATE, RPAD ('x', 2000)
FROM DUAL
CONNECT BY LEVEL <= 1000;

BEGIN
DBMS_STATS.gather_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'DATE_TABLE');
END;
/

CREATE INDEX date_index
ON date_table (mydate);

SELECT mydate
FROM date_table
WHERE mydate = SYSTIMESTAMP;

-------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
-------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 70 (0)|
|* 1 | TABLE ACCESS FULL| DATE_TABLE | 10 | 70 (0)|
-------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(SYS_EXTRACT_UTC(INTERNAL_FUNCTION("MYDATE"))=SYS_EXTRACT_U
TC(SYSTIMESTAMP(6)))

SELECT mydate
FROM date_table
WHERE mydate = SYSDATE;

------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 1 (0)|
|* 1 | INDEX RANGE SCAN| DATE_INDEX | 1 | 1 (0)|
------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("MYDATE"=SYSDATE@!)

Listing 16-5 creates a table DATE_TABLE that includes an indexed column of type DATE. When we try to compare the column with a value of type TIMESTAMP WITH TIME ZONE, the column is implicitly converted to TIMESTAMP WITH TIME ZONE, thus precluding the use of the index. When our predicate supplies a value of type DATE, the index can be used.

This sort of coding problem may seem obscure; who uses SYSTIMESTAMP rather than SYSDATE? However, this sort of problem actually occurs all the time when bind variables are used, particularly when the same PL/SQL variable is used for multiple SQL statements. Also bear in mind that the SQL*Plus VARIABLE command only supports a limited number of types, and DATE is not one of them.
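
As a hedged sketch of the workaround (the bind variable name and format mask are my own), convert the bind variable rather than letting Oracle convert the column, so that the comparison is DATE against DATE:

VARIABLE b1 VARCHAR2(20)
EXEC :b1 := TO_CHAR (SYSDATE, 'YYYY-MM-DD HH24:MI:SS')

SELECT mydate
FROM date_table
WHERE mydate = TO_DATE (:b1, 'YYYY-MM-DD HH24:MI:SS');

Because the conversion function is applied to the bind variable and not to MYDATE, the access predicate on DATE_INDEX is preserved.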

Bind Variables

Although Oracle has introduced adaptive cursor sharing as a damage limitation exercise after having unleashed the wrath of bind variable peeking on us, I still recommend caution when using bind variables in predicates that involve columns with histograms. Let us put our MOSTLY_BORING table back the way it was in Listing 16-3 and see what happens when we use a bind variable for SPECIAL_FLAG.

Listing 16-6. Inappropriate use of bind variables

/* SEE Listing 16-3 for setup of MOSTLY_BORING table */

SELECT *
FROM mostly_boring
WHERE special_flag = :b1;

----------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 50000 | 384 (1)|
|* 1 | TABLE ACCESS FULL| MOSTLY_BORING | 50000 | 384 (1)|
----------------------------------------------------------------

There are 100,000 rows in the table and two possible values of SPECIAL_FLAG, so when we use EXPLAIN PLAN the CBO assumes that 50,000 rows will be returned by the query in Listing 16-6. In fact, of course, if we run the statement, both the cardinality estimate and the execution plan will depend on the bind variable value supplied when the statement is first parsed, and we will then end up using the same plan for repeated executions of our statement regardless of the value supplied on each occasion. There are several ways to address this problem. You can replicate code, as I did in Listing 6-7, or you can use dynamic SQL to generate code with literal values. Both approaches result in the creation of two different SQL statements with independent plans. Since there are just two different execution plans, you could instead write one statement with a UNION ALL. Hold that thought for a short while.
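
Here is a rough sketch of the dynamic SQL option (the PL/SQL block is mine, not one of this book's listings, and production code would need to consider SQL injection when concatenating literals):

DECLARE
   v_flag   mostly_boring.special_flag%TYPE := 'Y';
   v_cur    SYS_REFCURSOR;
BEGIN
   OPEN v_cur FOR
         'SELECT * FROM mostly_boring WHERE special_flag = '''
      || v_flag
      || '''';
   -- ... fetch from v_cur and process the rows ...
   CLOSE v_cur;
END;
/

Each distinct literal value generates a distinct SQL statement, and therefore an independently optimized plan. Now let us focus on UNION ALL.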

UNION, UNION ALL, and OR

As a reader of a book entitled Expert Oracle SQL, you will, I am sure, know the difference between the semantics of UNION, UNION ALL, and OR. Fortunately or unfortunately, depending on your perspective, a lot of SQL is written by business users with relatively little understanding of SQL performance issues, and the difference between these constructs is often unclear to them. Listing 16-7 shows three statements that return identical results but have quite different execution plans.

Listing 16-7. Three equivalent statements with different plans

SELECT *
FROM sh.customers
WHERE cust_id = 3228
UNION ALL
SELECT *
FROM sh.customers
WHERE cust_id = 6783;

--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 4 (50)|
| 1 | UNION-ALL | | | |
| 2 | TABLE ACCESS BY INDEX ROWID| CUSTOMERS | 1 | 2 (0)|
|* 3 | INDEX UNIQUE SCAN | CUSTOMERS_PK | 1 | 1 (0)|
| 4 | TABLE ACCESS BY INDEX ROWID| CUSTOMERS | 1 | 2 (0)|
|* 5 | INDEX UNIQUE SCAN | CUSTOMERS_PK | 1 | 1 (0)|
--------------------------------------------------------------------------

SELECT *
FROM sh.customers
WHERE cust_id = 3228
UNION
SELECT *
FROM sh.customers
WHERE cust_id = 6783;

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 4 (50)|
| 1 | SORT UNIQUE | | 2 | 4 (50)|
| 2 | UNION-ALL | | | |
| 3 | TABLE ACCESS BY INDEX ROWID| CUSTOMERS | 1 | 2 (0)|
|* 4 | INDEX UNIQUE SCAN | CUSTOMERS_PK | 1 | 1 (0)|
| 5 | TABLE ACCESS BY INDEX ROWID| CUSTOMERS | 1 | 2 (0)|
|* 6 | INDEX UNIQUE SCAN | CUSTOMERS_PK | 1 | 1 (0)|
---------------------------------------------------------------------------

SELECT *
FROM sh.customers
WHERE cust_id = 3228 OR cust_id = 6783;

--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 5 (0)|
| 1 | INLIST ITERATOR | | | |
| 2 | TABLE ACCESS BY INDEX ROWID| CUSTOMERS | 2 | 5 (0)|
|* 3 | INDEX UNIQUE SCAN | CUSTOMERS_PK | 2 | 3 (0)|
--------------------------------------------------------------------------

The three statements in Listing 16-7 all return two rows—the same two rows in each case. However, these statements all have different execution plans. To understand why this is, take a look at the three queries in Listing 16-8.

Listing 16-8. UNION, UNION ALL, and OR producing different results

SELECT cust_first_name, cust_last_name
FROM sh.customers
WHERE cust_first_name = 'Abner'
UNION ALL
SELECT cust_first_name, cust_last_name
FROM sh.customers
WHERE cust_last_name = 'Everett'; -- Returns 144 rows

-------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
-------------------------------------------------------------
| 0 | SELECT STATEMENT | | 173 | 845 (51)|
| 1 | UNION-ALL | | | |
|* 2 | TABLE ACCESS FULL| CUSTOMERS | 43 | 423 (1)|
|* 3 | TABLE ACCESS FULL| CUSTOMERS | 130 | 423 (1)|
-------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("CUST_FIRST_NAME"='Abner')
3 - filter("CUST_LAST_NAME"='Everett')

SELECT cust_first_name, cust_last_name
FROM sh.customers
WHERE cust_first_name = 'Abner'
UNION
SELECT cust_first_name, cust_last_name
FROM sh.customers
WHERE cust_last_name = 'Everett'; -- Returns 10 rows

--------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 173 | 845 (51)|
| 1 | SORT UNIQUE | | 173 | 845 (51)|
| 2 | UNION-ALL | | | |
|* 3 | TABLE ACCESS FULL| CUSTOMERS | 43 | 423 (1)|
|* 4 | TABLE ACCESS FULL| CUSTOMERS | 130 | 423 (1)|
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - filter("CUST_FIRST_NAME"='Abner')
4 - filter("CUST_LAST_NAME"='Everett')

SELECT cust_first_name, cust_last_name
FROM sh.customers
WHERE cust_first_name = 'Abner' OR cust_last_name = 'Everett'; -- Returns 128 rows

------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 423 (1)|
|* 1 | TABLE ACCESS FULL| CUSTOMERS | 1 | 423 (1)|
------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("CUST_FIRST_NAME"='Abner' AND "CUST_LAST_NAME"='Everett')

All three queries in Listing 16-8 combine two subsets of rows from SH.CUSTOMERS. The first query in Listing 16-8 returns the 80 rows that match the first name Abner followed by the 64 rows that match the surname Everett. The second query returns only ten rows. The UNION operator removes all duplicate values from the select list, and it turns out that there are only ten distinct combinations of first and last name that match our predicates. The third query returns yet another number of rows. The OR condition does not remove duplicates, but it does ensure that a row matching both predicates is returned only once. Since there are 16 rows in the SH.CUSTOMERS table that have a first name of Abner and a last name of Everett, these 16 rows are returned by both halves of the UNION ALL query but only once by the OR query, so there are 16 more rows in the result set of the UNION ALL query than in that of the query that uses the OR operator.

It is never a good idea to use UNION when UNION ALL is known to produce identical results: UNION simply performs a UNION ALL followed by a SORT UNIQUE operation, and if the removal of duplicates is not required then that sort is an unnecessary overhead, sometimes a significant one. As you can see in Listing 16-8, the OR operator used just one full table scan where the other constructs required two, and generally speaking you should use an OR operator instead of UNION or UNION ALL when the results are known to be identical. But let me show you a possible exception. Listing 16-9 returns to our MOSTLY_BORING table.

Listing 16-9. Avoiding dynamic SQL with UNION ALL

SELECT *
FROM mostly_boring
WHERE special_flag = 'Y' AND :b1 = 'Y'
UNION ALL
SELECT *
FROM mostly_boring
WHERE special_flag = 'N' AND :b1 = 'N';

----------------------------------------------------------------------------
| Id | Operation | Name | Cost (%CPU)|
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 386 (100)|
| 1 | UNION-ALL | | |
| 2 | FILTER | | |
| 3 | TABLE ACCESS BY INDEX ROWID BATCHED| MOSTLY_BORING | 2 (0)|
| 4 | INDEX RANGE SCAN | SPECIAL_INDEX | 1 (0)|
| 5 | FILTER | | |
| 6 | TABLE ACCESS FULL | MOSTLY_BORING | 384 (1)|
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(:B1='Y')
4 - access("SPECIAL_FLAG"='Y')
5 - filter(:B1='N')
6 - filter("SPECIAL_FLAG"='N')

SELECT /*+ use_concat */
*
FROM mostly_boring
WHERE (special_flag = 'Y' AND :b1 = 'Y')
OR (special_flag = 'N' AND :b1 = 'N');

----------------------------------------------------------------------------
| Id | Operation | Name | Cost (%CPU)|
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 387 (1)|
| 1 | CONCATENATION | | |
|* 2 | FILTER | | |
| 3 | TABLE ACCESS BY INDEX ROWID BATCHED| MOSTLY_BORING | 2 (0)|
|* 4 | INDEX RANGE SCAN | SPECIAL_INDEX | 1 (0)|
|* 5 | FILTER | | |
|* 6 | TABLE ACCESS FULL | MOSTLY_BORING | 385 (1)|
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(:B1='Y')
4 - access("SPECIAL_FLAG"='Y')
5 - filter(:B1='N')
6 - filter("SPECIAL_FLAG"='N' AND (LNNVL(:B1='Y') OR
LNNVL("SPECIAL_FLAG"='Y')))

Listing 16-9 shows how two different access paths can be combined in a single statement without dynamic SQL. The first query in Listing 16-9 uses a UNION ALL set operator. The first half of the query only selects rows if the value of the bind variable is 'Y', so an indexed access path is selected. The second half of the query will only return rows if the value of the bind variable is 'N', and a full table scan is therefore appropriate. The FILTER operations at Ids 2 and 5 of the plan ensure that only one of the two branches is actually executed.

The second query in Listing 16-9 accomplishes the same thing with an OR operator, but a USE_CONCAT hint is required; this is because the CBO doesn’t realize that only one of the two halves of the statement will be executed and adds the cost of the full table scan and the cost of the indexed access path to arrive at the total statement cost. We know that only one half of the statement will be run and understand why the CBO costing is wrong. We can, therefore, provide the hint with confidence.

Issues with General Purpose Views

General-purpose views are very useful, and the more complex the definition the more useful the view may be. Apart from the gains in productivity and the improvement in the readability of the queries that use the complex view, there is the potential to avoid changing code in multiple places when requirements change.

Unfortunately, the more complex the view the more likely it is that inefficiencies may creep in. This risk is much reduced in recent releases of Oracle database, as transformations such as join elimination have been introduced, but problems still exist. Listing 16-10 shows a typical example.

Listing 16-10. Inefficiencies with complex views

CREATE OR REPLACE VIEW sales_data
AS
SELECT *
FROM sh.sales
JOIN sh.customers USING (cust_id)
JOIN sh.products USING (prod_id);

SELECT prod_name, SUM (amount_sold)
FROM sales_data
GROUP BY prod_name;

---------------------------------------------------------------
| Id | Operation | Name | Cost (%CPU)|
---------------------------------------------------------------
| 0 | SELECT STATEMENT | | 576 (6)|
| 1 | HASH GROUP BY | | 576 (6)|
|* 2 | HASH JOIN | | 576 (6)|
| 3 | VIEW | VW_GBC_9 | 573 (6)|
| 4 | HASH GROUP BY | | 573 (6)|
|* 5 | HASH JOIN | | 552 (2)|
| 6 | INDEX FAST FULL SCAN| CUSTOMERS_PK | 33 (0)|
| 7 | PARTITION RANGE ALL | | 517 (2)|
| 8 | TABLE ACCESS FULL | SALES | 517 (2)|
| 9 | TABLE ACCESS FULL | PRODUCTS | 3 (0)|
---------------------------------------------------------------

Listing 16-10 begins by defining a view that joins SH.SALES, SH.CUSTOMERS, and SH.PRODUCTS. Suppose you want to join SH.SALES and SH.PRODUCTS and decide to use the view provided. After all, you know about the fancy join elimination transformation that the CBO has, and you are confident that your use of the view will not be inefficient. However, we can see that the execution plan still includes the join with the SH.CUSTOMERS table. The reason is that the referential integrity constraint that ensures that every row in SH.SALES has a corresponding row in SH.CUSTOMERS is not validated, so the join elimination isn't legal. To avoid this extra step you would need to join SH.SALES and SH.PRODUCTS yourself, as in the sketch below.
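
The direct join is straightforward (my sketch, not one of the chapter's numbered listings):

SELECT prod_name, SUM (amount_sold)
FROM sh.sales JOIN sh.products USING (prod_id)
GROUP BY prod_name;

With SH.CUSTOMERS out of the statement altogether there is no join for the CBO to eliminate, and the CUSTOMERS_PK scan and its HASH JOIN disappear from the plan.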

How to Use Temporary Tables

I showed in Listing 1-15 that you could, and generally should, use factored subqueries to avoid the use of temporary tables that are only required by a single SQL statement. It is theoretically possible for a temporary table to provide better performance than a factored subquery when an index on the temporary table can be used. However, this theoretical possibility is very rare because the cost of building and maintaining the index generally outweighs any benefit the index might bring to a single statement.

However, if multiple independent SQL statements in the same session use similar constructs then a temporary table might be a good idea to avoid repeating work. Listing 16-11 shows a procedure that is crying out for optimization.

Listing 16-11. Repeated construct in multiple SQL statements

CREATE TABLE key_electronics_customers
(
cust_id NUMBER PRIMARY KEY
,latest_sale_month DATE
,total_electronics_sold NUMBER (10, 2)
);

CREATE TABLE electronics_promotion_summary
(
sales_month DATE
,promo_id NUMBER
,total_electronics_sold NUMBER (10, 2)
,PRIMARY KEY (sales_month, promo_id)
);

CREATE OR REPLACE PROCEDURE get_electronics_stats_v1 (p_sales_month DATE)
IS
v_sales_month CONSTANT DATE := TRUNC (p_sales_month, 'MM'); -- Sanity check
v_next_sales_month CONSTANT DATE := ADD_MONTHS (v_sales_month, 1);
BEGIN
--
-- Identify key electronics customers from this month
-- that spent more than 1000 on Electronics
--
MERGE INTO key_electronics_customers c
USING ( SELECT cust_id, SUM (amount_sold) amount_sold
FROM sh.sales s JOIN sh.products p USING (prod_id)
WHERE time_id >= v_sales_month
AND time_id < v_next_sales_month
AND prod_category = 'Electronics'
GROUP BY cust_id
HAVING SUM (amount_sold) > 1000) t
ON (c.cust_id = t.cust_id)
WHEN MATCHED
THEN
UPDATE SET
c.latest_sale_month = v_sales_month
,c.total_electronics_sold = t.amount_sold
WHEN NOT MATCHED
THEN
INSERT (cust_id, latest_sale_month, total_electronics_sold)
VALUES (t.cust_id, v_sales_month, t.amount_sold);

--
-- Remove customers with little activity recently
--
DELETE FROM key_electronics_customers
WHERE latest_sale_month < ADD_MONTHS (v_sales_month, -3);

--
-- Now generate statistics for promotions for sales in Electronics
--
MERGE INTO electronics_promotion_summary p
USING ( SELECT promo_id, SUM (amount_sold) amount_sold
FROM sh.sales s JOIN sh.products p USING (prod_id)
WHERE time_id >= v_sales_month
AND time_id < v_next_sales_month
AND prod_category = 'Electronics'
GROUP BY promo_id) t
ON (p.promo_id = t.promo_id AND p.sales_month = v_sales_month)
WHEN MATCHED
THEN
UPDATE SET p.total_electronics_sold = t.amount_sold
WHEN NOT MATCHED
THEN
INSERT (sales_month, promo_id, total_electronics_sold)
VALUES (v_sales_month, t.promo_id, t.amount_sold);
END get_electronics_stats_v1;
/

Listing 16-11 creates two tables and a procedure. KEY_ELECTRONICS_CUSTOMERS identifies the customers that have spent more than 1000 units of currency on electronics in one or more of the last three months. ELECTRONICS_PROMOTION_SUMMARY summarizes the total sales for electronics for each promotion in each month. These tables are maintained by a procedure, GET_ELECTRONICS_STATS_V1, that takes data from the SH.SALES and SH.PRODUCTS tables and merges it into the two tables.

We can see that the two MERGE statements access the SH.SALES and SH.PRODUCTS tables in remarkably similar ways: they both join the same way, they both select only rows for electronics for a particular month, and they both perform some kind of aggregation. We can optimize this code by performing the common work once and saving the results in a temporary table. Listing 16-12 shows the way.

Listing 16-12. Avoiding duplication of effort by using a temporary table

CREATE GLOBAL TEMPORARY TABLE electronics_analysis_gtt
(
cust_id NUMBER NOT NULL
,promo_id NUMBER NOT NULL
,time_id DATE
,amount_sold NUMBER
) ON COMMIT DELETE ROWS;

CREATE OR REPLACE PROCEDURE get_electronics_stats_v2 (p_sales_month DATE)
IS
v_sales_month CONSTANT DATE := TRUNC (p_sales_month, 'MM'); -- Sanity check
v_next_sales_month CONSTANT DATE := ADD_MONTHS (v_sales_month, 1);
BEGIN
--
-- create semi-aggregated data for later use
--


DELETE FROM electronics_analysis_gtt; -- just in case

INSERT INTO electronics_analysis_gtt (cust_id
,promo_id
,time_id
,amount_sold)
SELECT cust_id
,promo_id
,MAX (time_id) time_id
,SUM (amount_sold) amount_sold
FROM sh.sales JOIN sh.products p USING (prod_id)
WHERE time_id >= v_sales_month
AND time_id < v_next_sales_month
AND prod_category = 'Electronics'
GROUP BY cust_id, promo_id;

--
-- Identify key electronics customers from this month
-- that spent more than 1000 on Electronics
--
MERGE INTO key_electronics_customers c
USING ( SELECT cust_id, SUM (amount_sold) amount_sold
FROM electronics_analysis_gtt
GROUP BY cust_id
HAVING SUM (amount_sold) > 1000) t
ON (c.cust_id = t.cust_id)
WHEN MATCHED
THEN
UPDATE SET
c.latest_sale_month = v_sales_month
,c.total_electronics_sold = t.amount_sold
WHEN NOT MATCHED
THEN
INSERT (cust_id, latest_sale_month, total_electronics_sold)
VALUES (t.cust_id, v_sales_month, t.amount_sold);

--
-- Remove customers with little activity recently
--
DELETE FROM key_electronics_customers
WHERE latest_sale_month < ADD_MONTHS (v_sales_month, -3);

--
-- Now generate statistics for promotions for sales in Electronics
--
MERGE INTO electronics_promotion_summary p
USING ( SELECT promo_id, SUM (amount_sold) amount_sold
FROM electronics_analysis_gtt
GROUP BY promo_id) t
ON (p.promo_id = t.promo_id AND p.sales_month = v_sales_month)
WHEN MATCHED
THEN
UPDATE SET p.total_electronics_sold = t.amount_sold
WHEN NOT MATCHED
THEN
INSERT (sales_month, promo_id, total_electronics_sold)
VALUES (v_sales_month, t.promo_id, t.amount_sold);
END get_electronics_stats_v2;
/

Listing 16-12 begins by creating a global temporary table ELECTRONICS_ANALYSIS_GTT that will be used to store semi-aggregated data for use by our upgraded procedure GET_ELECTRONICS_STATS_V2. The upgraded procedure begins by generating data aggregated for each combination of CUST_ID and PROMO_ID for electronics in the specified month. The two MERGE statements then further aggregate the data in the temporary table.

If you run the procedure specifying 1st December 2001, the temporary table contains 1,071 rows. Because we start out with only 4,132 matching rows in the SH.SALES table, the performance benefits of this optimization are not easy to measure here, but in real life the amount of data reduction achieved by semi-aggregation can be much larger and the benefits quite substantial.
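
For reference, the test invocation looks like this (a trivial sketch; the DATE literal corresponds to the month quoted above):

BEGIN
   get_electronics_stats_v2 (DATE '2001-12-01');
END;
/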

Avoiding Multiple Similar Subqueries

Just like the multiple similar aggregations in Listing 16-11, many SQL statements contain multiple similar subqueries. Listing 16-13 shows a typical example.

Listing 16-13. Multiple similar subqueries

SELECT p.prod_id
,p.prod_name
,p.prod_category
, (SELECT SUM (amount_sold)
FROM sh.sales s
WHERE s.prod_id = p.prod_id)
sum_amount_sold
, (SELECT SUM (quantity_sold)
FROM sh.sales s
WHERE s.prod_id = p.prod_id)
sum_quantity_sold
FROM sh.products p;

---------------------------------------------------------
| Id | Operation | Name | Cost (%CPU)|
---------------------------------------------------------
| 0 | SELECT STATEMENT | | 1078 (6)|
| 1 | HASH JOIN OUTER | | 1078 (6)|
| 2 | HASH JOIN OUTER | | 541 (6)|
| 3 | TABLE ACCESS FULL | PRODUCTS | 3 (0)|
| 4 | VIEW | VW_SSQ_2 | 538 (6)|
| 5 | HASH GROUP BY | | 538 (6)|
| 6 | PARTITION RANGE ALL| | 517 (2)|
| 7 | TABLE ACCESS FULL | SALES | 517 (2)|
| 8 | VIEW | VW_SSQ_1 | 537 (6)|
| 9 | HASH GROUP BY | | 537 (6)|
| 10 | PARTITION RANGE ALL | | 516 (2)|
| 11 | TABLE ACCESS FULL | SALES | 516 (2)|
---------------------------------------------------------

The query in Listing 16-13 calculates the total amount and total quantity of sales for each product using two similar subqueries in the select list. I have shown the execution plan from a 12cR1 database, and we can see that the CBO has managed to unnest the two subqueries. Unfortunately, as of the time of writing, the CBO is unable to coalesce these subqueries (I wouldn’t be surprised if the CBO is able to do this at some point in the future), and so we are left with two full table scans on the SH.SALES table. We can see that the majority of the estimated cost comes from these two repeated full table scans.

There are actually several different ways to address the problem of replicated subqueries, but in this case we have a simple solution. Listing 16-14 shows how we can use a join to avoid duplicating work.

Listing 16-14. Use of a join to avoid repeated subqueries

SELECT p.prod_id
,p.prod_name
,p.prod_category
,SUM (amount_sold) sum_amount_sold
,SUM (quantity_sold) sum_quantity_sold
FROM sh.sales s, sh.products p
WHERE s.prod_id = p.prod_id
GROUP BY p.prod_id, p.prod_name, p.prod_category;

---------------------------------------------------------
| Id | Operation | Name | Cost (%CPU)|
---------------------------------------------------------
| 0 | SELECT STATEMENT | | 541 (6)|
| 1 | HASH GROUP BY | | 541 (6)|
| 2 | HASH JOIN | | 541 (6)|
| 3 | VIEW | VW_GBC_5 | 538 (6)|
| 4 | HASH GROUP BY | | 538 (6)|
| 5 | PARTITION RANGE ALL| | 517 (2)|
| 6 | TABLE ACCESS FULL | SALES | 517 (2)|
| 7 | TABLE ACCESS FULL | PRODUCTS | 3 (0)|
---------------------------------------------------------

We can see that by simply joining the two tables and performing a GROUP BY we can get our aggregated values without the need to replicate our full table scan.
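
Another of the several possible rewrites mentioned above uses a factored subquery. The following sketch is mine rather than one of the chapter's listings; note that, like the original scalar subqueries, the LEFT JOIN preserves any product with no sales at all, which the inner join in Listing 16-14 would drop:

WITH sales_totals
AS (  SELECT prod_id
            ,SUM (amount_sold) sum_amount_sold
            ,SUM (quantity_sold) sum_quantity_sold
        FROM sh.sales
    GROUP BY prod_id)
SELECT p.prod_id
      ,p.prod_name
      ,p.prod_category
      ,t.sum_amount_sold
      ,t.sum_quantity_sold
FROM sh.products p LEFT JOIN sales_totals t ON t.prod_id = p.prod_id;

Like Listing 16-14, this version scans SH.SALES just once.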

Summary

This chapter has provided examples of what many people refer to as badly written SQL. However, we need to bear in mind that code that is easy to read but performs badly today may perform perfectly well tomorrow, when the CBO implements its next query transformation. Be that as it may, poorly performing SQL almost always either does work that it doesn't need to do, throwing the results away, or repeats work that it does need to do. In this chapter I have provided only a taster of the more common scenarios in which the CBO is unable to generate an optimal plan until the query is rewritten. If you see an important SQL statement that seems to do redundant or repeated work, you should think about rewriting it.

None of the examples in this chapter have looked at sorting data. I have found that a substantial amount of my time in the last few years has been spent trying to avoid the performance problems of pesky sorts. The next installment of our tale focuses entirely on the subject of sorts.