
PART 1. Basic Concepts

CHAPTER 6. Object Statistics and Deployment

This chapter introduces the second of the two important themes of the book: how to manage the performance of SQL statements in a production environment. This is undoubtedly the most controversial chapter in this book as some experts may disagree with the views I express here, particularly as they fly in the face of Oracle’s stated strategy. Nevertheless, I am very far from alone in reaching the conclusions that I have, and the techniques I describe have been used by a number of people to stabilize the performance of many Oracle databases. Let us begin with some generic observations about performance management.

The Principle of Performance Management

There is an overriding principle that I believe applies to the performance of all types of services. It doesn’t have to be a database-related service, and it doesn’t even have to be in IT. Let me give you a couple of examples that are not IT related to make the point clear.

The Royal Mail Example

About 20 years ago I was listening to an after-dinner talk by an employee of the Royal Mail. The Royal Mail is the postal service in the United Kingdom, and the speaker started to discuss a survey that his employer had made of its business customers. The Royal Mail wanted to know if their customers would be happy with a 2:00 p.m. delivery time or whether they would pay extra for a 9:00 a.m. delivery. The response from their customer base was that the Royal Mail was asking the wrong question!

At the time, the reliability of the Royal Mail was not particularly great. My own experience of what is known as “First Class Mail” was that correspondence would arrive the following day about 90% of the time. Ten percent of the time the mail wasn’t delivered for at least two days. This sort of service is no good at all to a business consumer. Suppose a law firm sends legal documents to their employee John at his office in London. John wants to discuss these documents with Jim, a client of the law firm, also based in London. Do you imagine that John can call Jim and say, “Hey, Jim, why not pop over for a chat about these legal documents? There is a 90% chance that I will have them by then!” That is not how business functions. John needs a guarantee of a delivery time. If that guaranteed time is 2:00 p.m. then John might be able to arrange his meeting with Jim for 3:00 p.m. If Jim is only available in the morning then another carrier might be needed, but whichever service provider John uses, that service provider must (to paraphrase a famous advertising slogan from a U.S. parcel service) ensure that the correspondence absolutely, positively is there by whatever time was guaranteed.

In real life, of course, there is no such thing as an absolute guarantee. There are always factors beyond our control that introduce small levels of risk. But these exceptions need to be very few if consumers are to tolerate them.

The Airport Example

I was driving my adult son Thomas to the airport and told him about the performance management principle that I was trying to explain. I asked if he had any ideas for a second example that most people could relate to. He struggled a bit to come up with a suitable example as we sat over a meal. After we had finished eating, Thomas headed through airport security and I made my way home. About an hour later I got a call from Thomas. He had some good news and some bad news. The bad news was that he had missed his flight, but the good news was that I had a second example for my book!

I used to be a frequent flyer, and I absolutely hate wasting time at an airport. I left Thomas to go through security about 30 minutes before his flight was due to depart. That should have given him enough time, and he wouldn’t be stuck in the departures lounge for too long. I thought I was being a good “service provider”. Unfortunately, there was an issue with the image of Thomas’s electronic boarding pass on his smartphone and he had to go and get a paper copy printed off. By the time Thomas was back at security his flight had closed. He explained to me that optimizing the airport experience to minimize the wait was all very laudable but that next time I would have to ensure that he absolutely, positively did not miss the flight!

Service Level Agreements in IT

Consumers of IT services are no different from consumers of any other type of service. They want guarantees. These guarantees are embodied in Service Level Agreements, or SLAs, that are either formal documents or informal understandings. These SLAs aren’t written with words like “the batch must finish in one hour 90% of the time” or even “the batch must finish in an average time of less than one hour over the course of a month.” These sorts of performance measures are of no use to any service consumer. No, SLAs are always stated to reflect the sentiment that the batch “absolutely, positively must be finished in an hour.”

In the vast majority of cases, the consumer will not care whether the one-hour batch finishes in 15 minutes or 59 minutes any more than you would care if your letters arrived at 4:30 in the morning. It would be a rare event for a business consumer of an IT service to call staff at home and say, “Hey, the overnight batch is running ahead of schedule, why not grab an early train and start work early?” Consumers of services make plans based on the guaranteed performance level and often aren’t in a position to take advantage of higher levels of performance if they occur.

Of course, the service provider of a one-hour batch is very much interested in the difference between a 15 minute and a 59 minute run. In the former case there is plenty of contingency time to deal with unexpected events, while there is none in the latter case. Nevertheless, the consumer’s needs come first, and efforts to reduce the elapsed time of a batch can only be made after a predictable performance profile has been established.

Non-database Deployment Strategies

Given the requirements of service consumers, the IT industry has established techniques for ensuring that both the functionality and performance levels of IT services are predictable. There are a number of software development methodologies in existence, but they all share a couple of common attributes:

· New features are tested in some way to provide confidence that the software functions correctly and that performance levels are satisfactory before being delivered for general-purpose consumption.

· Changes, either to software or critical configuration data, involving a production service are controlled in some way to minimize risk. Typically, these controls involve some combination of testing, impact analysis, authorization, and auditing.

All these processes and procedures are there to do one basic thing; they are there so that the consumers absolutely, positively will get the service they have been promised. Of course, we don’t always succeed, but that is at least our goal.

The Strategic Direction for the CBO

Up until now I have been talking about the delivery of services to consumers in general. I have done so at length and deliberately avoided any mention of database performance. All this preparation is, as you may have guessed, because in my opinion the strategy that Oracle, and the CBO development team in particular, are following does not reflect the basic requirements for predictable service delivery.

The History of Strategic Features

The CBO was introduced in Oracle 7 as an optional feature. You didn’t have to use it, and most of us didn’t. The slow take-up was for two reasons. Some customers were concerned that, like any major new feature, the CBO might not work well at first. They had a point. The second reason for the slow take-up was that the CBO was unpredictable by design. You might get one execution plan one day and then, after gathering object statistics, you might get a completely different execution plan that may or may not work as well.

One of the key things about the old Rule-Based Optimizer (RBO) was that the same statement resulted in the same execution plan every time. The CBO, on the other hand, is designed to be adaptable. The CBO tries to get the optimal execution plan each and every time, and if the object statistics suggest that the data has changed then the CBO will quite happily consider a new execution plan.

Despite the concerns about predictability that customers started raising from the outset and continue to raise today, the strategic direction of the CBO continues to be based on adaptability. Features like dynamic sampling and bind variable peeking started to appear. When 11g arrived it became very clear that the CBO team was looking towards ever more adaptable and ever less predictable behaviors. The signals came in the form of Adaptive Cursor Sharing in 11gR1 and Cardinality Feedback in 11gR2. Both Adaptive Cursor Sharing and Cardinality Feedback can result in the execution plan selected for the first execution of a statement being “upgraded” for the second and subsequent executions, even with no change to object statistics! Usually, these changes are for the better, but there are no guarantees.

As if this wasn’t bad enough, 12cR1 introduced Adaptive Execution Plans. This feature might result in the execution plan of a statement changing in mid-flight! We don’t want these features!

· On a test system we don’t want these features because they might mask problems that will impact our production service.

· On a production system we don’t want these features because the execution plans that we have tested are satisfactory, whatever the CBO thinks; we don’t want to risk using an untested plan that might cause performance to degrade.

Implications of the CBO Strategy

Now that we know what the strategy of the CBO development team is, let us look at what it means for us. I’ll approach this by looking at comments from two well-known Oracle gurus: Connor McDonald and Dave Ensor.

Connor McDonald’s view

If you ask people who the most knowledgeable experts in Oracle database technology are you might soon find yourself in the middle of a disagreement. Sometimes quite a heated disagreement! However, the most entertaining Oracle speaker in the world is surely Connor McDonald. If you come up with another name, you probably haven’t heard Connor speak! Oh, and just in case you are wondering, Connor knows a thing or two about Oracle database technology as well!

I have only had the privilege of hearing Connor speak twice, and on the first of those occasions he discussed a topic that is right at the center of the issues we are discussing in this chapter. Connor began his presentation by proposing a very simple change to a software program. A dropdown list had two entries: “MALE” and “Female”. The “MALE” is all capital letters but “Female” only has the first letter capitalized. A developer would like to correct this by changing “MALE” to “Male”. What needs to be done? Well Connor ran through a long list of tasks that the fictitious developer might have to do: write a business case, have the business case approved, generate a test plan, run through unit testing, subsystem testing, and system testing, document test results, go through a Quality Assurance review. . . the list went on and on and on. All the while Connor’s laptop was throwing images of meetings, documents, and various other things that I won’t mention in case you see his presentation and I spoil the fun! By the time Connor got to the end of his list I was, along with most of the rest of the audience, chuckling quite loudly. After all, most of us have at some point experienced something similar!

Connor’s presentation then took a different tack. Connor proposed a fictitious service-impacting incident and the ensuing post-mortem. The incident was caused by a changed execution plan that in turn was caused by a change in object statistics. What! A change that caused an incident! Was this change authorized? Was there a change record? Who is to blame?

Of course, the fictitious execution plan change was caused by an explicitly run job to gather statistics. No authorization, no impact analysis, no testing, nothing. Connor asked the perfectly reasonable question as to why IT organizations routinely precipitate untested execution plan changes on their database systems while at the same time placing most other types of changes under the highest scrutiny. It seemed to Connor that the world had gone mad. I agree!

The Dave Ensor Paradox

If the award for the most entertaining Oracle speaker in the world goes to Connor McDonald, I would give Dave Ensor the award for most memorable sound bite! I don’t know when or where Dave first stated that “the only time that it is safe to gather statistics is when to do so would make no difference.” Wherever or whenever he spoke those now famous, or infamous, words, they are now known world-wide as “The Dave Ensor Paradox.”

I should point out right away that Dave’s assertion is utterly untrue, a fact that I am sure Dave would be the first to acknowledge. However, there are two thought-provoking aspects of his “paradox” that warrant discussion. The first of these is the basis for the original assertion.

When you gather statistics, one of two things might happen to the execution plans of your SQL statements. The first thing that might happen is nothing; the execution plan based on the new statistics is precisely the same as the execution plan that was used the last time the statement ran. Since nothing has changed you obviously haven’t done any harm. The second thing that might happen is that the execution plan has changed. We don’t know for sure how the new plan will behave in practice, so it isn’t safe to gather statistics when that is the outcome. A compelling argument.

The second interesting aspect of Dave Ensor’s paradox is the fatal flaw in his argument. We will come to that very soon.

Why We Need to Gather Statistics

If gathering statistics is such a dangerous activity, as both Connor and Dave have suggested, then why on earth do we do it? Listing 6-1 sheds some light on this:

Listing 6-1. How time changes an execution plan, Part 1

SET LINES 2000 PAGES 0

CREATE TABLE t3
(
c1
,c2
,c3
)
PCTFREE 99
PCTUSED 1
AS
SELECT ROWNUM, DATE '2012-04-01' + MOD (ROWNUM, 2), RPAD ('X', 2000) c3
FROM DUAL
CONNECT BY LEVEL <= 1000;

CREATE INDEX t3_i1
ON t3 (c2);

EXPLAIN PLAN
FOR
SELECT *
FROM t3
WHERE t3.c2 = TO_DATE ( :b1, 'DD-MON-YYYY');

SELECT * FROM TABLE (DBMS_XPLAN.display);

VARIABLE b1 VARCHAR2(11)
EXEC :b1 := '01-APR-2012';

BEGIN
FOR r IN (SELECT *
FROM t3
WHERE t3.c2 = TO_DATE ( :b1, 'DD-MON-YYYY'))
LOOP
NULL;
END LOOP;
END;
/

SELECT * FROM TABLE (DBMS_XPLAN.display_cursor (sql_id => 'dgcvn46zatdqr'));

Listing 6-1 begins by creating table T3, which includes a column of type DATE. Table T3 is populated with dates from April 2012, and from 12cR1 onwards statistics will be implicitly gathered for both the table and the index. We then use DBMS_XPLAN.DISPLAY to see the results of an EXPLAIN PLAN statement and DBMS_XPLAN.DISPLAY_CURSOR to see the actual execution plan used at runtime. Listing 6-2 shows the results of the experiment.

Listing 6-2. How time changes an execution plan, Part 2

Plan hash value: 4161002650

--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 272 (100)| |
|* 1 | TABLE ACCESS FULL| T3 | 500 | 982K| 272 (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("T3"."C2"=TO_DATE(:B1,'DD-MON-YYYY'))

SQL_ID dgcvn46zatdqr, child number 0
-------------------------------------
SELECT * FROM T3 WHERE T3.C2 = TO_DATE ( :B1, 'DD-MON-YYYY')

Plan hash value: 4161002650
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 272 (100)| |
|* 1 | TABLE ACCESS FULL| T3 | 500 | 982K| 272 (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("T3"."C2"=TO_DATE(:B1,'DD-MON-YYYY'))

The outputs of the two DBMS_XPLAN calls show precisely the same execution plan. That is good. Now let us assume some time passes. Listing 6-3 is my attempt to simulate the passage of time from April 2012 until April 2014.

Listing 6-3. How time changes an execution plan, Part 3

ALTER SYSTEM FLUSH SHARED_POOL;

DELETE FROM t3;

INSERT INTO t3 (c1, c2,c3)
SELECT ROWNUM, DATE '2014-04-01' + MOD (ROWNUM, 2),rpad('X',2000)
FROM DUAL
CONNECT BY LEVEL <= 1000;

EXPLAIN PLAN
FOR
SELECT *
FROM t3
WHERE t3.c2 = TO_DATE ( :b1, 'DD-MON-YYYY');

SELECT * FROM TABLE (DBMS_XPLAN.display);

VARIABLE b1 VARCHAR2(11)
EXEC :b1 := '01-APR-2014';

BEGIN
FOR r IN (SELECT *
FROM t3
WHERE t3.c2 = TO_DATE ( :b1, 'DD-MON-YYYY'))
LOOP
NULL;
END LOOP;
END;
/

SELECT * FROM TABLE (DBMS_XPLAN.display_cursor (sql_id => 'dgcvn46zatdqr'));

What I have done here is to:

· Flush the shared pool to simulate the aging of the statement from the cursor cache

· Although not required, replace the data in the table with data from April 2014

· Re-execute the EXPLAIN PLAN statement

· Re-execute the DBMS_XPLAN.DISPLAY call

· Re-execute the statement, retrieving data from April 2014

· Re-execute the DBMS_XPLAN.DISPLAY_CURSOR statement

Listing 6-4 shows the results of this second experiment.

Listing 6-4. How time changes an execution plan, Part 4

Plan hash value: 4161002650

--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 500 | 982K| 272 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T3 | 500 | 982K| 272 (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("T3"."C2"=TO_DATE(:B1,'DD-MON-YYYY'))

SQL_ID dgcvn46zatdqr, child number 0
-------------------------------------
SELECT * FROM T3 WHERE T3.C2 = TO_DATE ( :B1, 'DD-MON-YYYY')

Plan hash value: 3225921897

---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 2 (100)| |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| T3 | 1 | 2013 | 2 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T3_I1 | 1 | | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("T3"."C2"=TO_DATE(:B1,'DD-MON-YYYY'))

We can see that although the output of EXPLAIN PLAN is unchanged, the execution plan used at runtime has changed completely! April fool’s! The reason the change in execution plan has occurred is because the object statistics, and in particular the column statistics for column T3.C2, include what I call time-sensitive data. The LOW_VALUE and HIGH_VALUE statistics record the minimum and maximum value of a column at the time statistics are gathered. When we execute our statement two years later, the CBO goes through the following metaphorical thought process: “If the maximum value of T3.C2 was April 30th 2012 the last time statistics were gathered, and the value specified for the bind variable B1 is now April 1st 2014, then the number of rows selected from T3 is almost certainly zero”. What is the basis for this logic? The CBO assumes that we have specified a date in the future and that the statistics are current rather than the surely more likely case that the date is in the recent past and that the statistics are out of date. Let us see what happens when we re-gather statistics.

Listing 6-5. How time changes an execution plan, Part 5

BEGIN
DBMS_STATS.gather_table_stats (SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,'T3'
,no_invalidate => FALSE);
END;
/

BEGIN
FOR r IN (SELECT *
FROM t3
WHERE t3.c2 = TO_DATE ( :b1, 'DD-MON-YYYY'))
LOOP
NULL;
END LOOP;
END;
/

SELECT * FROM TABLE (DBMS_XPLAN.display_cursor (sql_id => 'dgcvn46zatdqr'));

SQL_ID dgcvn46zatdqr, child number 0
-------------------------------------
SELECT * FROM T3 WHERE T3.C2 = TO_DATE ( :B1, 'DD-MON-YYYY')

Plan hash value: 3225921897

---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 1 (100)| |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| T3 | 1 | 2013 | 1 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T3_I1 | 1 | | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("T3"."C2"=TO_DATE(:B1,'DD-MON-YYYY'))

We can see that after we re-gather statistics, thereby updating the LOW_VALUE and HIGH_VALUE column statistics, the execution plan used at runtime for our statement suddenly reverts back to the original.

This experiment leads us to an astonishing conclusion. The following sentence is the single most important in this book.

The reason we gather statistics on a production system is to prevent execution plans from changing, not to ensure that execution plans do change.

Let us pause for a second and consider the enormous implications of this statement. It is now possible to see the flaw in Dave Ensor’s paradox. Sometimes the “difference” that gathering statistics makes is to prevent an execution plan change that would otherwise have taken place! So the correct version of Dave Ensor’s paradox should be:

The only time that it is safe to gather statistics is when to do so results in no execution plan changes from the last time the SQL statements ran.

It doesn’t quite have the same ring to it does it? Our position now seems hopeless. On the one hand, if we don’t gather statistics on our production system then sooner or later, as sure as eggs are eggs, execution plans will change and almost certainly for the worse. On the other hand, if we do gather statistics then we are playing a game of Russian Roulette; sooner or later some change in some table, column, or index statistic will result in an undesirable execution plan change that will create a service-impacting incident. What are we to do? We need to sort out one or two more things before addressing that question.

How Often Do We Need to Change Execution Plans?

The implicit assumption made by both Connor McDonald and Dave Ensor is that predictable performance comes from unchanged execution plans. Is that really true? If our database is growing and the sizes of the tables change don’t we need, at some point, to change the execution plans of our statements? Wolfgang Breitling can help us answer that question.

Wolfgang Breitling’s Tuning by Cardinality Feedback

In 2006 Wolfgang Breitling published a paper entitled “Tuning by Cardinality Feedback (TCF).” You can find his paper here: http://www.centrexcc.com/Tuning%20by%20Cardinality%20Feedback.pdf. This paper may well be the single biggest contribution to the science of Oracle SQL tuning to date.

Note: I should point out that the 11gR2 cardinality feedback feature that I mentioned earlier has absolutely nothing to do with Wolfgang’s paper. However, I think that Wolfgang is quietly pleased that Oracle seems to have named a feature after his invention!

Briefly, Wolfgang’s argument is as follows:

· If the CBO is able to correctly determine the cardinality of operations in an execution plan, then the CBO will come up with the optimal execution plan.

· When the CBO comes up with an execution plan that runs substantially slower than the best possible plan, it is because one or more cardinality estimates are out by multiple orders of magnitude; cardinality estimates that are out by a factor of two or three may result in the optimal execution plan being missed, but the selected plan will perform almost equivalently to the optimal plan.

· All we need to do to determine the optimal execution plan for a statement is to give the CBO the correct cardinality estimates, and the CBO will do the rest.

Whilst these statements aren’t always true, they are true the majority of the time! Now, Wolfgang’s paper was about SQL tuning so it may seem odd that I am bringing it up in a chapter that is dedicated to object statistics and deployment. The reason I have done so is because of the TCF corollary.

The TCF Corollary

Let us assume that an operation in an execution plan actually returns 10,000 rows. What TCF says is that if the CBO estimate is 5,000 or 20,000 then there is no need to worry; the plan selected by the CBO will be reasonable. It is only when the CBO estimate is lower than 100 or higher than 1,000,000 that the CBO is likely to come up with a plan that performs really badly.

The corollary to this statement is that if a table doubles or triples in size then no execution plan change is needed. Suppose that statistics are gathered on our 10,000-row table, and the CBO comes up with a plan based on an accurate cardinality estimate of 10,000. Suppose now that the table grows to 20,000 rows. TCF says that when the CBO uses a plan based on the (now) incorrect cardinality estimate of 10,000 it will work fine! That is the theory. Theory doesn’t always work in practice, but on this occasion I am happy to report that it does!

Let me be clear about something. Assume we have a SQL statement that joins, say, T1 and T2 and then aggregates the results. On the one hand, if T1 and T2 both double in size then the SQL statement will almost certainly run slower. On the other hand, it is unlikely that the elapsed time increase can be avoided by changing the execution plan; whatever the original optimal execution plan was is still likely to be optimal, or close to it, after the size increase.

In the overwhelming majority of real-life cases database tables will not often grow (or shrink) by factors of 100 or 1,000; we can live with one set of execution plans for years on end with no execution-plan related performance issues whatsoever! The main exception to this principle is the deployment of an empty table to production that grows rapidly, such as the AUDIT_TRAIL table that I mentioned in Chapter 5. We’ll deal with this exception a little later in the chapter once we have the main principles worked out.

Concurrent Execution Plans

We have now established that if an execution plan works well one day then it will likely work well the next, but can we always get away with just one execution plan at a time for each SQL statement? Let us look at a couple of cases that seem to suggest that we can’t.

Skewed Data and Histograms

There are many database tables that contain skewed data. By that I mean that a limited number of values within a column occur very frequently while others occur much less frequently. Suppose that your company does 60% of its business with a company called AJAX and the other 40% of its business with 100 other customers including ACME. A query that selects all transactions for AJAX from a table might well be best suited to a full table scan. On the other hand, a query that selects all transactions for ACME might be better off using an index, as only about 0.4% of the transactions will be selected. To enable the CBO to distinguish the AJAX customer from all others, a histogram can be used.

The main point to realize here is that bind variables and histograms should be considered mutually exclusive. If you parse a statement using one value for a bind variable, such as AJAX in my example, then that execution plan may be used when other values of the bind variable, such as ACME, are used. To prevent this, you should ensure that you have different SQL_IDs for each execution plan that you need. Listing 6-6 shows how this might be done:

Listing 6-6. Using multiple SQL_IDs to allow for multiple execution plans

CREATE OR REPLACE PROCEDURE listing_6_6 (p_cust_last_name VARCHAR2)
IS
cnt NUMBER;
BEGIN
IF p_cust_last_name ='AJAX'
THEN
SELECT COUNT (*)
INTO cnt
FROM sh.sales s JOIN sh.customers c USING (cust_id)
WHERE c.cust_last_name = 'AJAX'; -- One plan for AJAX
ELSE
SELECT COUNT (*)
INTO cnt
FROM sh.sales s JOIN sh.customers c USING (cust_id)
WHERE c.cust_last_name = p_cust_last_name; -- Another for everybody else
END IF;
END listing_6_6;
/

The above procedure doesn’t do anything useful, but rather shows the correct way to code with histograms. The two SQL statements in the PL/SQL procedure will have different SQL_IDs and, therefore, two different execution plans may simultaneously be kept in the cursor cache.

In 11gR1 Oracle introduced Adaptive Cursor Sharing, which allows two or more execution plans to be kept in the cursor cache with the same SQL_ID. This might suggest to you that the complexity shown in Listing 6-6 is no longer needed. The problem with Adaptive Cursor Sharing is that a SQL statement has to perform badly at least once before an alternative execution plan is considered. If you want high-performing SQL each and every time then you still need to use separate SQL statements with separate SQL_IDs.

Workload Variations

There is one other case that I have come across where multiple execution plans were apparently needed for the same SQL statement. Let me explain this extraordinarily rare scenario.

I was working with an ETL process that loaded data from a table R1 in a remote database into a global temporary table TEMP1. An INSERT statement was then used to join TEMP1 with a permanent table T1 and then load the joined data into T2. The ETL process had two operating modes, and this is where things got interesting.

The normal operating mode of the ETL process was referred to as INCREMENTAL. Only the changed data from R1 was loaded into TEMP1. The INCREMENTAL load ran once every 15 minutes and the number of rows loaded into TEMP1 ranged from zero to a few dozen. The second operating mode was FULL. A FULL load was only done under exceptional conditions and involved loading all rows from R1 into TEMP1. A FULL load added millions of rows to TEMP1, but the same INSERT statement was used for both INCREMENTAL and FULL loads to insert data into T2.

There were two possible execution plans the CBO might pick depending on the statistics set for TEMP1 or the outcome of dynamic sampling. The first plan involved a NESTED LOOPS join driven by TEMP1 and this worked fine for the INCREMENTAL load. Unfortunately, when the NESTED LOOPS plan was used with a FULL load the ETL process took about two days! The alternative execution plan was to use a HASH JOIN. When a FULL load was done with this plan it finished in a couple of hours, which was more than satisfactory given the infrequent use. Unfortunately, when the HASH JOIN was used for the INCREMENTAL load the process took about one hour, quite unsuitable for a process that was designed to run every 15 minutes!

Now, as you can imagine, a FULL load and an INCREMENTAL load never ran at the same time, but they did occasionally run within a few minutes of each other and so we occasionally observed the same execution plan being used for different modes. We might have used some combination of shared pool flushing, dynamic sampling, and statistics manipulation to solve this problem, but relying on operational staff to follow documentation is a risk that should be avoided when possible. Listing 6-7 gives you an indication of what we did in practice.

Listing 6-7. Using hints to create multiple SQL_IDs

CREATE OR REPLACE PROCEDURE listing_6_7 (load_mode VARCHAR2)
IS
BEGIN
IF load_mode = 'INCREMENTAL'
THEN
INSERT INTO t3 (c1, c2)
SELECT /*+ leading(temp1) use_nl(t3) */
temp1.c3, t3.c2
FROM temp1, t3
WHERE temp1.c1 = t3.c1;
ELSE -- FULL
INSERT INTO t3 (c1, c2)
SELECT /*+ leading(t3) use_hash(temp1) no_swap_join_inputs(temp1) */
temp1.c3, t3.c2
FROM temp1, t3
WHERE temp1.c1 = t3.c1;
END IF;
END listing_6_7;
/

As you can see from the hints embedded in the two SELECT statements, we used optimizer hints. Don’t worry about the specifics of the hints just now. The key point at this stage is that the hints served three purposes:

· Because the SQL_IDs of the two statements are now different, an execution plan in the cursor cache from a FULL load would not be used for an INCREMENTAL load or vice-versa.

· The hints instructed the CBO which execution plan to use. By the time we had worked out the rest of the details of the problem we knew what plans were needed in each case, so we didn’t need to risk the possibility that the CBO would get things wrong.

· The hints provided useful documentation so that those that followed had some kind of a clue as to why such obscure logic was used. A few paragraphs of explanatory comments were also added to the code, of course.

I have worked with a significant number of applications based on Oracle technology over the last several years, each of which probably has several tens of thousands of SQL statements. In all that time I have come across exactly one SQL statement that needed this sort of treatment! It is, however, a very nice educational scenario.

Concurrent Execution Plans Wrap Up

There do seem to be some exceptional cases where different execution plans are needed for the same SQL statement to take care of variations in the data being accessed. Fortunately or unfortunately, the only way we can ensure that different execution plans are used in different scenarios is to recode our application so that different SQL statements with different SQL_IDs are used for each scenario. The good news is that once this reality is accepted we have a self-fulfilling prophecy upon which we can base our strategy: We need just one execution plan for each SQL statement!

Oracle’s Plan Stability Features

Although the CBO strategy is all about adaptability, Oracle, like any other company, has to do something to address the demands of its largest and most vocal customers. Here are the tactical plan stability measures that Oracle has taken to address customer concerns about predictability.

Stored Outlines

The first response to the customer demand for stable execution plans came in release 8i with the introduction of stored outlines. Loosely speaking, stored outlines involve a repository of execution plans, one for each SQL statement. The repository is held in tables in the OUTLN schema. Stored outlines have a number of drawbacks, but the two key ones are:

· There is no way to determine which stored outlines are no longer required because they refer to SQL statements that have changed or been deleted.

· Even more fatally, there is no way to determine which statements in your application are lacking stored outlines.

SQL Profiles

SQL profiles appeared in 10g and are intimately linked to the SQL Tuning Advisor. Although the example I gave in Listing 5-7 involved the creation of a new index, most SQL Tuning Advisor recommendations relate to cardinality errors and simply involve a different execution plan. To arrange for the new execution plan to be used you are required to “accept” the SQL profile that the SQL Tuning Advisor creates for you. Since the SQL profile overrides the execution plan that the CBO otherwise would have used, most of us initially assumed that SQL profiles were some sort of plan stability offering. They aren’t! Even after you have accepted a SQL profile, the next time you gather statistics the original, poorly performing plan may reappear! You might even get a third plan with unknown performance.

SQL profiles are actually an attempt to improve the CBO’s guesses. They are not an attempt to remove the guesswork. SQL profiles aren’t part of the CBO strategic direction and they don’t provide plan stability. SQL profiles fall between two stools, and the CBO team has assured us that SQL profiles won’t see a great deal of investment in the future.
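
For completeness, accepting a profile recommended by the SQL Tuning Advisor is done with DBMS_SQLTUNE.ACCEPT_SQL_PROFILE. The following is only a sketch; the tuning task name and the profile name are hypothetical:

BEGIN
   DBMS_SQLTUNE.accept_sql_profile (
      task_name   => 'my_tuning_task'     -- hypothetical task created by the SQL Tuning Advisor
     ,name        => 'my_sql_profile'     -- arbitrary name for the accepted profile
     ,force_match => TRUE);               -- also match statements that differ only in literals
END;
/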

SQL Plan Baselines

SQL plan baselines appeared in 11gR1 in a wave of confusion. Very simply put, SQL plan baselines are a direct replacement for stored outlines. One noticeable improvement, at least from a security point of view, is that the repository of execution plans is now held in the data dictionary rather than in a schema with a password that might be compromised.

SQL plan baselines are reasonably easy to use and many customers are using them successfully. However, they suffer the same major drawbacks as stored outlines in that you cannot reconcile the repository of execution plans with the statements in your application. In most other important respects SQL plan baselines have the same advantages and disadvantages as stored outlines.

The confusion arose with respect to licensing. Almost up to the point of release SQL plan baselines were going to be a chargeable feature. In fact many students of the Oracle University who attended an 11g new features course, including me, were told, incorrectly, that SQL plan baselines required the Tuning Pack. In fact, the only part of the SQL plan baselines feature that requires a license is something called plan evolvement, a feature that allows execution plans to change! Even then the license is only required when plan evolvement is performed by the SQL Tuning Advisor. So we get the stuff we want for free (the stability part) and have to pay for the stuff we don’t want (the adaptability part). It is no wonder that Oracle staff were internally confused about all of this!
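
If you do decide to experiment with SQL plan baselines, capturing the plan currently in the cursor cache is straightforward. Here is a minimal sketch that reuses the SQL_ID from Listing 6-1 purely for illustration:

DECLARE
   l_plans PLS_INTEGER;
BEGIN
   -- Capture the plan currently in the cursor cache as an accepted baseline plan
   l_plans := DBMS_SPM.load_plans_from_cursor_cache (sql_id => 'dgcvn46zatdqr');
END;
/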

Introducing TSTATS

TSTATS is a deployment methodology for managing performance that allows us to take full advantage of the best aspects of the CBO while preventing it from changing execution plans on a whim. No repository of execution plans is involved, so there are no reconciliation issues, and no use of hidden initialization parameters or other unsupported features is involved.

Acknowledgements

The name TSTATS stands for “Tony’s Statistics,” a name invented by my client to refer to the methodology that I was successfully persuading them to adopt. Although the name appeals to my sense of vanity, it would be wrong of me to suggest that TSTATS is entirely my invention. In the preface I give some details of the history of TSTATS and give credit to both Jonathan Lewis and Wolfgang Breitling. However, the biggest contribution is from Adrian Billington, whose ideas for deleting time-based data from object statistics are at the core of TSTATS. Let us get to that now.

Adjusting Column Statistics

If the CBO’s lack of predictability is based on time-sensitive data, in other words, statistics that are only valid for a limited period of time, then why don’t we just delete those statistics? It turns out that, although there are a couple of reasons why deleting time-sensitive data from your object statistics might not be a good idea (we will discuss these in Chapter 20), most of the time we can do so with impunity. You can’t update column statistics to remove the LOW_VALUE and HIGH_VALUE statistics, and for reasons that I cannot fathom this created a mental block for me for several years. The solution to this problem, obvious in retrospect, is to delete the column statistics and re-insert them with the time-sensitive bits removed! Listing 6-8 shows how to do this.

Listing 6-8. Updating column statistics with DBMS_STATS

DECLARE
distcnt NUMBER;
density NUMBER;
nullcnt NUMBER;
srec DBMS_STATS.statrec;
avgclen NUMBER;
BEGIN
DBMS_STATS.get_column_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'T3'
,colname => 'C2'
,distcnt => distcnt
,density => density
,nullcnt => nullcnt
,srec => srec
,avgclen => avgclen);

DBMS_STATS.delete_column_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'T3'
,colname => 'C2');

DBMS_STATS.set_column_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'T3'
,colname => 'C2'
,distcnt => distcnt
,density => density
,nullcnt => nullcnt
,srec => NULL
,avgclen => avgclen);
END;
/

Listing 6-8 manipulates T3.C2, and the key point is the srec => NULL argument in the call to DBMS_STATS.SET_COLUMN_STATS. The SREC structure holds the data that is destined for the LOW_VALUE and HIGH_VALUE statistics, and this structure is omitted when the statistics are re-inserted.

Once these changes have been made and the shared pool flushed, the example statement used in Listings 6-1 through 6-5 will use a full table scan no matter what date you give! In addition, the cost and cardinality calculations will be the same after the deletion of the time-sensitive statistics as they were for contemporary dates before the deletion.
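
If you want to verify the effect of Listing 6-8, a quick check against the data dictionary should show that LOW_VALUE and HIGH_VALUE are now NULL for T3.C2 while the remaining column statistics are preserved:

SELECT column_name
      ,low_value
      ,high_value
      ,num_distinct
      ,density
  FROM user_tab_col_statistics
 WHERE table_name = 'T3' AND column_name = 'C2';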

TSTATS in a Nutshell

The removal of time-sensitive data from object statistics is the main idea behind TSTATS. Here is the essence of a process that can be used on the back of that idea:

1. Gather statistics on a selected “master” test system.

2. Fabricate statistics for all global temporary tables.

3. Remove time-sensitive data from object statistics.

4. Perform initial performance testing on the master test system and make adjustments as necessary.

5. Copy the statistics to all other test systems and ultimately to production.

6. Lock the object statistics of your application schemas on all systems.

7. Drop all statistics-gathering jobs for application schemas on all your systems.

8. TSTATS only applies to application schemas, so any jobs that gather dictionary statistics are unaffected.

By following these steps and taking a few other precautions, you can ensure that the execution plans on all your test systems match those of your production systems and that the same SQL statement will have the same execution plan forever, or at least until you decide otherwise!

Does this approach seem familiar? It should; it is the standard approach for deploying software, DML, and configuration changes. Whatever change-control procedures you have in place for software and DML should be applied to object statistics. Typically you would deliver software, DML, and object statistics at the same time as parts of a release.
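
Steps 5 and 6 are normally implemented with the DBMS_STATS export, import, and locking routines. The sketch below assumes an application schema called APP_OWNER and a statistics staging table called APP_STATS; both names are illustrative only:

-- On the master test system: stage the adjusted statistics in a transportable table
BEGIN
   DBMS_STATS.create_stat_table   (ownname => 'APP_OWNER', stattab => 'APP_STATS');
   DBMS_STATS.export_schema_stats (ownname => 'APP_OWNER', stattab => 'APP_STATS');
END;
/

-- On each target system, after copying APP_STATS across (for example with Data Pump):
BEGIN
   DBMS_STATS.import_schema_stats (ownname => 'APP_OWNER', stattab => 'APP_STATS');
   DBMS_STATS.lock_schema_stats   (ownname => 'APP_OWNER');
END;
/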

An Alternative to TSTATS

Stabilizing your application performance by the use of the TSTATS methodology is not straightforward, and some investment of time and effort is required. As you will see in Chapter 20 there may be ways to simplify TSTATS to address specific point issues so that you can get a resolution to your key stability issues quickly and easily. Even so, if you are managing an estate of hundreds of databases then adopting any variant of TSTATS as a standard will probably be impractical. That situation will change only in the unlikely event that the CBO team delivers some changes that make a TSTATS-like approach more straightforward.

In 2012 the Real World Performance team from Oracle did a round-the-world tour to try to dispel some myths about Oracle performance and to educate people about performance management. I attended the Edinburgh seminar and found it very stimulating. One of the most interesting suggestions came from Graham Wood. Graham’s observation was that many of Oracle’s customers gather statistics far too frequently. I want to make it clear that Graham did not say that gathering statistics was only required to prevent execution plan changes, and he most definitely didn’t liken gathering statistics to a game of Russian Roulette! However, Graham’s recommendation does suggest, quite correctly, that there is some risk to gathering statistics and that by doing it less often you can reduce that risk considerably.

HOW OFTEN DO YOU NEED TO GATHER STATISTICS?

The wider the date range of the data in your tables, the more slowly the CBO’s cardinality estimates will fall as the statistics age, and it is only when the cardinality estimates fall to very low numbers that execution plans are likely to change. So if you hold a month’s worth of data and collect statistics once per week you should be OK. If you hold one year’s worth of data then you shouldn’t have to gather statistics more than once every three months!

Deployment Options for Tuned SQL

In Chapter 5 I described a systematic process for tuning a SQL statement that involved running a series of experiments. I made it clear that the experimental change and the production deployment change might be different, and it is the process for identifying the latter that I want to talk about now. The considerations involved are the same whether you are using traditional object statistics management techniques or some variant of TSTATS; the primary consideration is whether you want your production change to affect other statements as well.

When Just One SQL Statement Needs to Change

There are several reasons why you might want your production change to be restricted to one specific SQL statement. It may be that your performance problem relates to a construct that is only found in one place and you know that no other SQL statements need to be fixed. You may even suspect that other SQL statements might be suffering from the same issue but these statements are not critical and the time and effort to test the impact of a change cannot be justified. Finally, you might have discovered an issue that affects a number of critical SQL statements but you need to deploy the fix to just one SQL statement now with the fix for other statements being deferred to the next release.

Whatever the reason for deciding to restrict your changes to one SQL statement, that decision is key to determining your deployment strategy.

Transforming SQL Code

This is an easy one. If the successful optimization experiment involved rewriting the SQL statement in some way, then most of the time making that same code change in production is only going to affect that one SQL statement. Since that is the desired effect in this case you can just deploy the experimental code change to production.

To limit the scope of your change you should be reluctant to alter a view definition. It would be better to either create a new view or use a factored subquery instead.

Adding Hints to the SQL Code

If your successful optimization experiment involved adding hints to your SQL statement and you want a deployment option that restricts change to that one SQL statement then you should have no hesitation in deploying hints to production. This recommendation may surprise you, as there is a lot of material in books, conference papers, and blogs that suggests that you should be very reluctant to deploy hints in production code. I disagree strongly with such sentiments and I will return to this topic with a lengthier discussion in Chapter 18.

Just because you used one set of hints to get the desired execution plan does not mean that you should use the same set of hints for production use. For example, you may have taken Wolfgang Breitling’s Tuning by Cardinality Feedback message to heart and applied one or more CARDINALITY hints to your code to correct some CBO cardinality errors. If you aren’t using TSTATS then these hints will not reliably prevent execution plan changes when statistics are next gathered. It would be best to identify what changes resulted from the CARDINALITY hints and then use other hints to specify the access methods, join orders, and so on directly. In Chapter 8 I will show you how to determine what these hints are, and we will be discussing the meaning of most of them in Part 3.
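
As a preview of Chapter 8, one convenient way to see the full set of hints that reproduce a plan is the OUTLINE format option of DBMS_XPLAN.DISPLAY_CURSOR. This sketch reuses the SQL_ID from the earlier listings purely as an example:

SELECT *
  FROM TABLE (DBMS_XPLAN.display_cursor (sql_id => 'dgcvn46zatdqr', format => '+OUTLINE'));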

Physical Database Changes

You should be very careful about making physical database changes just to make one SQL statement run faster. I already explained in the last chapter that adding an index results in extra overhead for all DML. Furthermore, when you add an index or make some other physical database change you can’t be sure of the impact on the execution plans of other SQL statements. You should consider making physical database changes to optimize one SQL statement only as a last resort.

When Multiple SQL Statements Need to Change

When you are in the early stages of a major release and still have a lot of test cycles ahead of you, you might want to make changes that affect a number of SQL statements. The changes you make might even result in sweeping performance improvements across your application.

I was recently tasked with taking a six-hour overnight batch and reducing the elapsed time to two hours. The batch involved hundreds, possibly thousands, of statements and tuning even the top 10% one by one would have been a hopeless task. Even in situations like these I begin by looking at the longest running SQL statement first. But once I understand the issue with that statement I try to find ways of fixing it that might benefit other statements as well. This is often possible when the performance problem is associated with a common construct.

Remember that if you make a change that improves the performance of one hundred statements and causes the performance of two other statements to degrade you can always hint the two that have degraded!

Transforming SQL Code and Adding Hints

I have already said that altering a view definition is a bad idea when you are trying to limit the scope of a change. The corollary is that altering a view definition is a good idea when sweeping changes are needed.

With this exception, changing a single SQL statement, and that includes hinting, is not likely to make a massive difference to the application as a whole unless this particular statement runs very often or very long.

Physical Database Changes

Physical database changes have a huge potential for improving the performance of an application as a whole, and it is the application as a whole that needs to be considered when making such changes. It is not uncommon to discover that an application is performing poorly because of the huge number of indexes that have been added over the years in order to solve point issues with individual queries. Perhaps the way to improve the performance of your application is to drop a number of those indexes! Adding an index is not the only way to address a long-running full table scan, and we shall discuss many of the alternatives in Chapter 15.

Altering Statistics

I have made a big fuss about the fact that when object statistics change execution plans might change. In the context of a production system execution plan changes are bad things, but in the context of tuning an entire application on a test system such changes are potentially very good things.

If your successful optimization experiment involved adding hints to your SQL statement and you suspect that a number of other SQL statements are going to be similarly affected, you should try to find a different solution that will fix the issue everywhere. Altering statistics can help.

Extended Statistics

As Wolfgang Breitling’s TCF paper explains, one of the most common reasons that the CBO comes up with a poor execution plan is that it has come up with a poor cardinality estimate. And one of the main reasons that the CBO comes up with a poor cardinality estimate is because it is unaware of the correlation between two columns. So if you provide the CBO with the correlation information then you might help the performance of multiple SQL statements.

Imagine that you have a table T1 with 40,000 rows. T1 has two columns C1 and C2. There are 20 different values of C1, 2,000 rows for each value. There are also 20 different values of C2, 2,000 rows for each value. The CBO might need to know how many rows will match a specific value of C1 and also a specific value of C2. The CBO knows that there are 400 possible combinations of C1 and C2, and will normally assume that each of these 400 possible combinations is equally likely. So the CBO will divide 40,000 by 400 to get a cardinality estimate of 100.

Now suppose that the real name of T1 is ORDERS and that the real names of C1 and C2 are ORDER_DATE and a DELIVERY_DATE respectively. The values of ORDER_DATE vary from 1st March to 20th March and the delivery dates vary from 2nd March to 21st March. You run a tight ship and every single DELIVERY_DATE is the day after ORDER_DATE! This means that in reality there are only 20 combinations of ORDER_DATE and DELIVERY_DATE in the table—2,000 rows for each combination. The ORDER_DATE and the DELIVERY_DATE are correlated.

Issues like these are addressed by the use of extended statistics, and I will make a detailed walkthrough of extended statistics in Chapter 9. By creating extended statistics rather than hinting you might improve the performance of other SQL statements that use the same combination of columns.
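
To give a flavor of what Chapter 9 will cover, here is a minimal sketch of creating a column group for the ORDERS example above; the table and column names are, of course, just those of our hypothetical example:

DECLARE
   l_ext_name VARCHAR2 (128);
BEGIN
   -- Create a column group so that the CBO learns that the two columns are correlated
   l_ext_name :=
      DBMS_STATS.create_extended_stats (
         ownname   => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
        ,tabname   => 'ORDERS'
        ,extension => '(ORDER_DATE,DELIVERY_DATE)');

   -- Re-gather statistics so that the new column group is populated
   DBMS_STATS.gather_table_stats (
      ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname => 'ORDERS');
END;
/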

Hand-crafted Histograms

Another reason that the CBO comes up with bad cardinality estimates is data skew. I introduced the concept of data skew a little earlier in this chapter: it is a situation where a small number of column values occur more frequently than others. Although the normal TSTATS approach to column statistics is to remove all information about specific values, this isn’t the correct approach for columns requiring histograms. Histograms are a perfectly reasonable thing to deploy as long as the application doesn’t use bind variables for values of the column that has data skew.

In my experience the number of columns in a fully normalized schema that require histograms is very low. Of course, in real life some level of denormalization is common, and particular columns requiring a histogram will be replicated across multiple tables. In these cases the replicated columns will usually have the same column name, be easy to identify, and require identical histograms. Whether you are using TSTATS or not, my recommendation would always be to analyze your data and manually set such histograms. We will cover this in detail in Chapter 20. At this point, however, the key message is, as always, that statistic changes can be used to improve the performance of multiple SQL statements at once.
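
To show the mechanics, here is a simplified sketch of hand-crafting a frequency histogram with DBMS_STATS.PREPARE_COLUMN_VALUES and SET_COLUMN_STATS. It assumes a cut-down version of the AJAX example in which the column holds only two values; the table name, row counts, density, and average column length are all made up for illustration:

DECLARE
   srec     DBMS_STATS.statrec;
   charvals DBMS_STATS.chararray;
BEGIN
   -- Two distinct values, supplied in ascending order:
   -- 'ACME' appears 40,000 times and 'AJAX' 60,000 times (assumed figures)
   charvals    := DBMS_STATS.chararray ('ACME', 'AJAX');
   srec.epc    := 2;
   srec.bkvals := DBMS_STATS.numarray (40000, 60000);
   DBMS_STATS.prepare_column_values (srec, charvals);

   DBMS_STATS.set_column_stats (
      ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname => 'CUSTOMERS'          -- hypothetical table
     ,colname => 'CUST_LAST_NAME'
     ,distcnt => 2
     ,density => 1 / 200000           -- assumed density for values outside the histogram
     ,nullcnt => 0
     ,srec    => srec
     ,avgclen => 5);
END;
/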

Generating False Object Statistics

Extended statistics and histograms are features that Oracle employees frequently discuss and encourage you to use. However, there is far less talk about explicitly setting or overriding other types of statistics with particular values. I would presume that the lack of discussion is because setting statistics to specific values requires a considerably higher level of knowledge and expertise than setting up either extended statistics or histograms.

What I want to emphasize at this point is that setting statistics to particular values is fully supported, and the techniques for doing so are documented with the DBMS_STATS package. There are a few special cases that require setting object statistics to values that do not reflect reality, and we will discuss these in Chapter 20.

Note: The reason object statistics exist is to help the CBO arrive at correct decisions. If accurate statistics result in bad CBO decisions and inaccurate statistics result in good decisions, you want to deploy inaccurate statistics.

As with any statistics manipulation you should consider the impact of your change on other SQL statements. Typically knowledge of what types of query prevail in your application and what types rarely if ever occur is critical to determining whether falsifying object statistics will be beneficial or not.
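
As an indication of what this looks like in practice, the following sketch sets table statistics for the rapidly growing AUDIT_TRAIL table mentioned earlier so that it appears permanently large, and then locks them; the row and block counts are invented purely for illustration:

BEGIN
   -- Pretend the table is already large so that the CBO never picks a plan
   -- that is only suitable while the table is nearly empty
   DBMS_STATS.set_table_stats (
      ownname       => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname       => 'AUDIT_TRAIL'
     ,numrows       => 10000000       -- assumed steady-state row count
     ,numblks       => 500000         -- assumed steady-state block count
     ,no_invalidate => FALSE);

   DBMS_STATS.lock_table_stats (
      ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname => 'AUDIT_TRAIL');
END;
/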

Changing PL/SQL Code and Altering the Environment

I have often seen SQL statements that perform poorly because of repeated function calls that involve recursive SQL. One way to improve matters is to recode the calling logic so that the function is called only once. When a particular function, or set of functions, is known to be called repeatedly by large numbers of SQL statements in an application, it might be cost effective to make those functions DETERMINISTIC or to use the function cache that I described in Chapter 1.
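
As an illustration of the caching approach, one option is the PL/SQL function result cache. The sketch below uses a hypothetical lookup function and table, and assumes 11g or later:

CREATE OR REPLACE FUNCTION get_exchange_rate (p_currency VARCHAR2)
   RETURN NUMBER
   RESULT_CACHE                       -- cache results across calls and sessions
IS
   l_rate NUMBER;
BEGIN
   SELECT rate
     INTO l_rate
     FROM exchange_rates              -- hypothetical lookup table
    WHERE currency_code = p_currency;

   RETURN l_rate;
END get_exchange_rate;
/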

If a sort spills to disk, you might parallelize the query. But if you suspect other sorts may spill to disk as well, you might consider increasing PGA_AGGREGATE_TARGET to help improve those other queries as well. You should consider other environmental changes in the same light.
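
A couple of quick checks and the corresponding parameter change might look like the sketch below; whether 4GB is an appropriate target is, of course, entirely system dependent:

-- How often have work areas spilled to disk since instance startup?
SELECT name, value
  FROM v$sysstat
 WHERE name IN ('workarea executions - optimal'
               ,'workarea executions - onepass'
               ,'workarea executions - multipass');

-- PGA_AGGREGATE_TARGET is a dynamic parameter
ALTER SYSTEM SET pga_aggregate_target = 4G;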

Summary

This chapter has introduced a deployment methodology known as TSTATS that addresses the requirement for execution plan stability in a production environment without the need for a repository of execution plans. The methodology is complex to set up and will be discussed in detail in Chapter 20 of this book, so alternative approaches are sometimes needed. You might use a simplified form of TSTATS, but for the bulk of well-behaved applications you should probably just keep the number of times object statistics are gathered to a minimum.

Together with physical database design, object statistics are an important consideration when deploying optimized SQL. In the early stages of a release, adjusting statistics is a great way to improve the performance of multiple parts of an application at once. However, at later stages of the testing lifecycle when the goal is to localize the impact of change, changes to object statistics should be minimized, and a sensible alternative is to add hints to your SQL statements.