
PART 5. Managing Statistics with TSTATS

CHAPTER 20. Managing Statistics with TSTATS

This chapter explains a radical approach one can take to the management of object statistics on production systems. This approach has been named TSTATS, for “Tony’s Statistics,” by one of my clients, and I must say that I like the name. It appeals to my sense of vanity! However, as I have already stated, I am not the only one to have considered the issues addressed in this chapter, and others have developed similar solutions on their own, Adrian Billington being one such person.

I explained the problem that TSTATS addresses in Chapter 6: gathering statistics is a risky activity on production systems, because the freshly gathered statistics may result in poor-performing execution plans. When you use the traditional approaches to managing object statistics recommended by Oracle, you cannot avoid that risk. TSTATS allows you to eliminate the risky activity of gathering statistics on your production system and allows you to sleep soundly in the knowledge that your execution plans will not change for the worse during your slumbers.

There is good news and bad news when it comes to TSTATS. The bad news is that TSTATS is not a quick “plug and play” solution. You will need to spend weeks, possibly months, preparing for the deployment of TSTATS for a complex application. You will need the services of an Oracle performance expert to get TSTATS going. Notice I said expert, not merely experienced. If such an expert isn’t available full-time, you will need to arrange for one to review plans and make sure that they are okay. Furthermore, you may well run into opposition from your colleagues and management; there are those who will claim, quite incorrectly, that TSTATS will render your database unsupportable by Oracle.

TSTATS AND ORACLE SUPPORT

The procedures in the DBMS_STATS package for manually creating or changing statistics are fully documented and supported. The use of these supplied procedures is quite different from, say, manually updating tables in the data dictionary: use of the supplied procedures is supported, but directly updating rows in the data dictionary is not.
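To make the distinction concrete, here is a minimal sketch of my own (the table name SOME_TABLE and the statistic values are purely illustrative and do not come from the book): the first block uses the documented, supported DBMS_STATS API, while the commented-out statement below it is the kind of direct dictionary update that is not supported.

-- Supported: setting statistics through the documented DBMS_STATS API.
-- SOME_TABLE and the numeric values are hypothetical.
BEGIN
   DBMS_STATS.set_table_stats (
      ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname => 'SOME_TABLE'
     ,numrows => 100000
     ,numblks => 2000);
END;
/

-- Not supported: direct updates to dictionary tables, for example
-- UPDATE sys.tab$ SET rowcnt = 100000 WHERE ...;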

It is true that if you make changes to object statistics that result in worse execution plans than the unaltered statistics would have produced, Oracle support is unlikely to be particularly sympathetic. However, if you run into any other type of issue (such as an ORA-00600 error during execution), you should have the same rights to support from Oracle whether you are using TSTATS or not.

The good news about TSTATS is that when it is developed correctly, it works! After the initial setup, you may never need to gather statistics on any development, test, or production system again, except, of course, to generate statistics for new tables or indexes as they are added to your application.

Caution The biggest single mistake I made with TSTATS was to re-gather statistics at the time of a 10g to 11g upgrade. This was totally unnecessary and just created a lot of extra work. Be very cautious about re-gathering statistics on tables after a successful TSTATS deployment.

The two keys to a successful TSTATS deployment are buy-in from management and a commitment to do the job properly—a poorly thought-out implementation will just lead the naysayers to cheer “I told you so,” and you will never get a second chance.

With these thoughts in mind, you need to consider your answers to the following questions before proceeding:

· Is your problem with performance stability bad enough to warrant the investment in TSTATS?

· Is there a simpler solution, such as gathering statistics less frequently? How about gathering statistics as part of a quarterly release?

· Is there a way that you can safely simplify the TSTATS approach or do you really need all the bells and whistles described in this chapter?

· Is your problem application-wide or can you limit the scope of TSTATS in some way?

· Does this chapter explain all that is needed to solve your problems or is there something special about your application that needs extra consideration? For example, do data volumes vary by multiple orders of magnitude under certain special circumstances?

When you answer these questions, bear in mind that there are no right or wrong answers. If, for example, you decide to reduce the scope of TSTATS to a few key tables, decide not to consider histograms, and hint all queries that use range-based predicates, you may have underestimated what needs to be done. Like all the issues that are raised in this book, and elsewhere in life, decisions are best made with as much information as possible, so speak to your colleagues and take your time.

Now that I have built up the tension, I need to tell you what this wonderful TSTATS technology consists of. There are several elements, as follows:

· The customization of column statistics. This is the most important part of TSTATS, and is a concept I introduced in Chapter 6.

· The management of statistics on partitioned tables. Obviously, if you don’t have partitioned tables you don’t need to worry about this part of TSTATS.

· The management of statistics for temporary tables.

· Deploying TSTATS.

Let us get started and revisit the topic of column statistics, which I introduced in Chapter 6.

Managing Column Statistics

When you stop gathering statistics on your production system, the most important step towards stabilizing execution plans is to eliminate the high and low values for column statistics in the way shown in Listing 6-8. However, there are several other issues to consider, such as:

· Columns with only one value

· Columns with skewed data values

· Correlated columns and expressions

· Columns used in range predicates

It is important to realize that most columns in most tables will not require any special consideration: the statistics associated with columns not used in WHERE or HAVING clause predicates won’t be used by the CBO, and even if a column is used in a predicate, it may not be time based, may not have any skewed data, may not be used in predicates in conjunction with any correlated columns, and may not be used with range predicates. It is important not to underestimate the task at hand, but it is also important not to overestimate the task.

Time-based columns

As we discussed in Chapter 6, time-based columns, such as dates or sequence numbers, can cause unwanted changes to execution plans when statistics are not gathered periodically. You can prevent these unwanted execution plan changes by removing the HIGH_VALUE and LOW_VALUE statistic columns. In fact, with a few exceptions, which we will discuss shortly, it generally does no harm to remove the HIGH_VALUE and LOW_VALUE statistic columns, even if the column isn’t time based.

Caution While finalizing this book I discovered that a bug was introduced in release 12.1.0.1 that causes join cardinalities to be calculated incorrectly when HIGH_VALUE and LOW_VALUE are missing from the joined columns of both tables. A workaround to this problem is to set the HIGH_VALUE very high and the LOW_VALUE very low. This workaround is included in the downloadable materials, but to avoid unnecessary complexity the book assumes that there is no issue setting HIGH_VALUE and LOW_VALUE to NULL.

My general recommendation is to begin the setup of TSTATS by eliminating the HIGH_VALUE and LOW_VALUE statistic columns for all object columns in your schema unless you know a priori that an exception, several of which I will cover shortly, applies. If there are exceptions that you are unaware of, don’t worry. You will discover them during your testing.

Listing 20-1 shows a procedure, ADJUST_COLUMN_STATS_V1, within a package called TSTATS, to remove the HIGH_VALUE and LOW_VALUE statistic columns for a specified table.

Listing 20-1. Removing HIGH_VALUE and LOW_VALUE statistic columns

CREATE OR REPLACE PACKAGE tstats
AS
-- Other procedures not shown

PROCEDURE adjust_column_stats_v1 (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE);

-- Other procedures not shown
END tstats;

CREATE OR REPLACE PACKAGE BODY tstats
AS
-- Other procedures not shown

PROCEDURE adjust_column_stats_v1 (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE)
AS
CURSOR c1
IS
SELECT *
FROM all_tab_col_statistics
WHERE owner = p_owner
AND table_name = p_table_name
AND last_analyzed IS NOT NULL;
BEGIN
FOR r IN c1
LOOP
DBMS_STATS.delete_column_stats (ownname => r.owner
,tabname => r.table_name
,colname => r.column_name
,cascade_parts => TRUE
,no_invalidate => TRUE
,force => TRUE);
DBMS_STATS.set_column_stats (ownname => r.owner
,tabname => r.table_name
,colname => r.column_name
,distcnt => r.num_distinct
,density => r.density
,nullcnt => r.num_nulls
,srec => NULL -- No HIGH_VALUE/LOW_VALUE
,avgclen => r.avg_col_len
,no_invalidate => FALSE
,force => TRUE);
END LOOP;
END adjust_column_stats_v1;
-- Other procedures not shown
END tstats;

The procedure shown in Listing 20-1 loops through all columns in the specified table that have column statistics and then deletes and re-adds the column statistics without the HIGH_VALUE and LOW_VALUE bits.

Columns with NUM_DISTINCT=1

I have discovered a rather peculiar CBO behavior relating to columns that, according to the object statistics, have the same non-NULL value for every row in which the column is populated. Let me give you an example. Listing 16-4 showed how to minimize index size by using NULL to represent the most common value of a column. The example used a column called SPECIAL_FLAG. If the row was special, then SPECIAL_FLAG was set to Y; if not, SPECIAL_FLAG was NULL. In this case, the number of distinct values of SPECIAL_FLAG was 1, as NULL doesn’t count as a value in the eyes of the CBO (or database theoreticians, for that matter).

Immediately after gathering statistics, the HIGH_VALUE and LOW_VALUE of SPECIAL_FLAG would both be Y and the NUM_DISTINCT and DENSITY statistic columns would both be 1. Not unreasonably, the CBO gets somewhat confused when NUM_DISTINCT is 1 and HIGH_VALUE and LOW_VALUE are different! Under these circumstances, the CBO assumes that no rows match any equality predicate that you may supply. Prior to 12cR1, the same behavior is seen when the HIGH_VALUE and LOW_VALUE statistics are removed, as Listing 20-2 demonstrates.

Listing 20-2. Special treatment by the CBO when NUM_DISTINCT=1

CREATE TABLE payments
(
payment_id INTEGER
,employee_id INTEGER
,special_flag CHAR (1)
,paygrade INTEGER
,payment_date DATE
,job_description VARCHAR2 (50)
)
PCTFREE 0;

INSERT INTO payments (payment_id
,employee_id
,special_flag
,paygrade
,payment_date
,job_description)
WITH standard_payment_dates
AS ( SELECT ADD_MONTHS (DATE '2014-01-20', ROWNUM - 1)
standard_paydate
,LAST_DAY (ADD_MONTHS (DATE '2014-01-20', ROWNUM - 1))
last_day_month
FROM DUAL
CONNECT BY LEVEL <= 12)
,employees
AS ( SELECT ROWNUM employee_id, TRUNC (LOG (2.6, ROWNUM)) + 1 paygrade
FROM DUAL
CONNECT BY LEVEL <= 10000)
,q1
AS (SELECT ROWNUM payment_id
,employee_id
,DECODE (MOD (ROWNUM, 100), 0, 'Y', NULL) special_flag
,paygrade
--
-- The calculation in the next few lines to determine what day of the week
-- the 20th of the month falls on does not use the TO_CHAR function
-- as the results of this function depend on NLS settings!
--
,CASE
WHEN MOD (ROWNUM, 100) = 0
THEN
standard_paydate + MOD (ROWNUM, 7)
WHEN paygrade = 1
AND MOD (last_day_month - DATE '1001-01-06', 7) =
5
THEN
last_day_month - 1
WHEN paygrade = 1
AND MOD (last_day_month - DATE '1001-01-06', 7) =
6
THEN
last_day_month - 2
WHEN paygrade = 1
THEN
last_day_month
WHEN MOD (standard_paydate - DATE '1001-01-06', 7) = 5
THEN
standard_paydate - 1
WHEN MOD (standard_paydate - DATE '1001-01-06', 7) = 6
THEN
standard_paydate - 2
ELSE
standard_paydate
END
paydate
,DECODE (
paygrade
,1, 'SENIOR EXECUTIVE'
,2, 'JUNIOR EXECUTIVE'
,3, 'SENIOR DEPARTMENT HEAD'
,4, 'JUNIOR DEPARTMENT HEAD'
,5, 'SENIOR MANAGER'
,6, DECODE (MOD (ROWNUM, 3)
,0, 'JUNIOR MANAGER'
,1, 'SENIOR TECHNICIAN'
,'SENIOR SUPERVISOR')
,7, DECODE (MOD (ROWNUM, 2)
,0, 'SENIOR TECHNICIAN'
,'SENIOR SUPERVISOR')
,8, DECODE (MOD (ROWNUM, 2)
,0, 'JUNIOR TECHNICIAN'
,'JUNIOR SUPERVISOR')
,9, 'ANCILLARY STAFF'
,10, DECODE (MOD (ROWNUM, 2)
,0, 'INTERN'
,'CASUAL WORKER'))
job_description
FROM standard_payment_dates, employees)
SELECT *
FROM q1
ORDER BY paydate;

BEGIN
DBMS_STATS.gather_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'PAYMENTS'
,method_opt => 'FOR ALL COLUMNS SIZE 1');
END;
/

CREATE UNIQUE INDEX payments_pk
ON payments (payment_id);

ALTER TABLE payments
ADD CONSTRAINT payments_pk PRIMARY KEY (payment_id);

CREATE INDEX payments_ix1
ON payments (paygrade, job_description);

SELECT *
FROM payments
WHERE special_flag = 'Y';

-----------------------------------------------
| Id | Operation | Name | Rows |
-----------------------------------------------
| 0 | SELECT STATEMENT | | 1200 |
| 1 | TABLE ACCESS FULL| PAYMENTS | 1200 |
-----------------------------------------------

EXEC tstats.adjust_column_stats_v1(p_table_name=>'PAYMENTS');

SELECT *
FROM payments
WHERE special_flag = 'Y';

----------------------------------------------
| Id | Operation | Name | Rows |
----------------------------------------------
| 0 | SELECT STATEMENT | | 1| -- Releases prior to 12cR1
| 1 | TABLE ACCESS FULL| PAYMENTS | 1|
----------------------------------------------

Listing 20-2 creates a table called PAYMENTS that I will use to make several points in this chapter. The PAYMENTS table includes a SPECIAL_FLAG column that is either Y or NULL. Immediately after gathering statistics we can see that the estimated cardinality of a query that includes an equality predicate on SPECIAL_FLAG is absolutely accurate. However, once the procedure TSTATS.ADJUST_COLUMN_STATS_V1 is called to remove the HIGH_VALUE and LOW_VALUE statistic columns, we see that the CBO gets confused and reduces the cardinality to 0 (rounded up in the display to 1, as usual). This may be a bug, as the behavior cannot be observed in 12.1.0.1.

One way to address this issue would be for the statistic adjustment procedure to avoid processing columns where NUM_DISTINCT is 1. That would certainly work in the example shown in Listing 20-2. However, it is possible for a column with one distinct value to be a time-based column and, if so, leaving HIGH_VALUE and LOW_VALUE in place will create problems, as I will show when we come to look at DBMS_STATS.COPY_TABLE_STATS in a short while.

An alternative approach to dealing with NUM_DISTINCT being 1 is to change the value of NUM_DISTINCT! Listing 20-3 shows the procedure TSTATS.ADJUST_COLUMN_STATS_V2, which solves the problem.

Listing 20-3. Adjusting NUM_DISTINCT and DENSITY to fix cardinality issues

CREATE OR REPLACE PACKAGE tstats
AS
-- Other procedures not shown
PROCEDURE adjust_column_stats_v2 (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE);
-- Other procedures not shown
END tstats;
/

CREATE OR REPLACE PACKAGE BODY tstats
AS
-- Other procedures not shown
PROCEDURE adjust_column_stats_v2 (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE)
AS
CURSOR c1
IS
SELECT *
FROM all_tab_col_statistics
WHERE owner = p_owner
AND table_name = p_table_name
AND last_analyzed IS NOT NULL;

v_num_distinct all_tab_col_statistics.num_distinct%TYPE;
BEGIN
FOR r IN c1
LOOP
DBMS_STATS.delete_column_stats (ownname => r.owner
,tabname => r.table_name
,colname => r.column_name
,cascade_parts => TRUE
,no_invalidate => TRUE
,force => TRUE);

IF r.num_distinct = 1
THEN
v_num_distinct := 1 + 1e-14;
ELSE
v_num_distinct := r.num_distinct;
END IF;

DBMS_STATS.set_column_stats (ownname => r.owner
,tabname => r.table_name
,colname => r.column_name
,distcnt => v_num_distinct
,density => 1 / v_num_distinct
,nullcnt => r.num_nulls
,srec => NULL -- No HIGH_VALUE/LOW_VALUE
,avgclen => r.avg_col_len
,no_invalidate => FALSE
,force => TRUE);
END LOOP;
END adjust_column_stats_v2;
-- Other procedures not shown
END tstats;
/

EXEC tstats.adjust_column_stats_v2(p_table_name=>'PAYMENTS');
SELECT *
FROM payments
WHERE special_flag = 'Y';

-----------------------------------------------
| Id | Operation | Name | Rows |
-----------------------------------------------
| 0 | SELECT STATEMENT | | 1200 |
| 1 | TABLE ACCESS FULL| PAYMENTS | 1200 |
-----------------------------------------------

You can see that TSTATS.ADJUST_COLUMN_STATS_V2 now sets NUM_DISTINCT for SPECIAL_FLAG to a non-integer value slightly greater than 1 and the DENSITY to 1/NUM_DISTINCT. Now, it is not logical for the number of distinct values for a column to be anything other than an integer value, but the CBO seems to be unconcerned by this and plugs the non-integer value into its calculations. As we can see by the cardinality estimate of our query, we have avoided the undesirable special code path through the CBO code, and our cardinality estimate has returned to a sensible (and in this case absolutely correct) value.
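If you want to confirm what the procedure has done, a quick check against ALL_TAB_COL_STATISTICS (a verification query I have added here; it is not one of the book's listings) shows the fabricated values:

-- Confirm the adjusted statistics for SPECIAL_FLAG: NUM_DISTINCT should be
-- fractionally above 1, DENSITY fractionally below 1, and LOW_VALUE and
-- HIGH_VALUE should both be NULL.
SELECT num_distinct, density, low_value, high_value
  FROM all_tab_col_statistics
 WHERE owner = SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
   AND table_name = 'PAYMENTS'
   AND column_name = 'SPECIAL_FLAG';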

Let me move on to two more complications with time-based column statistics: skewed column values and range predicates.

Skewed column values and range predicates

Listing 9-8 showed how to set up histograms for a table in order to deal with skewed data that requires different execution plans for the different values specified in equality predicates. I also mentioned in Chapter 9 that you may have to specify realistic values for the high and low endpoints when range predicates are involved (rather than the absurd ones in Listing 9-8). But what if you have time-based skewed data and/or time-based range predicates?

Consider a payroll system that holds 12 months of payroll payments. The vast majority of transactions occur on the 20th day of the month (unless the 20th is a weekend) because that is your company’s payday. However, a tiny number of transactions occur on other days to deal with executive pay, errors, adjustments, or other exceptional events.

In this payroll application the popular values will change over time: each month a new standard payday is added and an old one is removed. Personally, I have not worked on an application with time-based histograms like this, but working out how to address such a requirement seems to be a useful exercise. I would propose manually updating the histogram as new payment dates are added and old ones are removed.

Notice that manually updating a histogram with new values is quite different from gathering statistics; in the former case the changes are tightly controlled and there is no risk that undesirable changes to execution plans will creep in, but in the latter case we are back to our game of Russian Roulette. The maintenance procedures for our mythical payroll application would need to call a custom procedure in the TSTATS package at least once per month to keep the histogram current.
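The book does not show such a procedure, but the following sketch illustrates the general shape it might take, using the documented DBMS_STATS.PREPARE_COLUMN_VALUES and DBMS_STATS.SET_COLUMN_STATS calls to install a frequency histogram on PAYMENT_DATE that lists the 12 most recent standard paydays. The procedure name, the bucket weights, the density value, and the omission of the weekend adjustment are all simplifying assumptions of mine.

-- Hypothetical sketch: refresh the PAYMENT_DATE frequency histogram so that
-- it always lists the 12 most recent standard paydays (20th of each month).
CREATE OR REPLACE PROCEDURE refresh_payment_date_histogram (
   effective_date DATE DEFAULT SYSDATE)
IS
   srec       DBMS_STATS.statrec;
   datevals   DBMS_STATS.datearray := DBMS_STATS.datearray ();
   bkvals     DBMS_STATS.numarray := DBMS_STATS.numarray ();
BEGIN
   FOR i IN 1 .. 12
   LOOP
      datevals.EXTEND;
      bkvals.EXTEND;
      -- Paydays in ascending order, oldest first, as required by
      -- PREPARE_COLUMN_VALUES. The weekend adjustment is omitted here.
      datevals (i) := ADD_MONTHS (TRUNC (effective_date, 'MM') + 19, i - 12);
      bkvals (i) := 1000;   -- assumed relative frequency of a popular payday
   END LOOP;

   srec.epc := datevals.COUNT;   -- number of histogram endpoints
   srec.bkvals := bkvals;        -- occurrences of each endpoint value
   DBMS_STATS.prepare_column_values (srec, datevals);

   DBMS_STATS.set_column_stats (
      ownname        => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname        => 'PAYMENTS'
     ,colname        => 'PAYMENT_DATE'
     ,distcnt        => 12
     ,density        => 1 / 24000  -- assumed selectivity for unpopular dates
     ,nullcnt        => 0
     ,srec           => srec
     ,avgclen        => 8
     ,no_invalidate  => FALSE
     ,force          => TRUE);
END refresh_payment_date_histogram;
/

A monthly maintenance job would then simply call this procedure with the current date.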

This type of application-specific customization of TSTATS is typical of the type of thing you might need to consider and demonstrates why TSTATS isn’t usually a plug-and-play solution. Let us now look at the final complication of time-based column statistics: correlation.

Correlated columns and expressions

Listings 9-11 and 9-12 showed how to set up extended statistics in order to handle correlated columns and expressions in predicates. It is worth noting that you can also create histograms on extended statistics and that extended statistics may be time based. Manually setting histograms on column-group statistics is far from straightforward. Theoretically, one should be able to use the SYS_OP_COMBINED_HASH function to set up values for the histogram, but I have been unable to get this to work. Furthermore, the CBO doesn’t use extended statistics at all when values are missing from the histogram.

It is easy to imagine more complex scenarios where extended statistics would be even less practical. I recently worked on an application that had so many correlated columns that hundreds of extended statistics would have been needed. However, careful analysis of the poorly performing queries led me to an alternative solution. That solution was to falsify the statistics for some columns.

To illustrate the point, you can see that the PAYMENTS table created by Listing 20-2 includes a column called JOB_DESCRIPTION and that each job description is highly correlated to PAYGRADE. We might know that all the queries in our application that include a predicate on JOB_DESCRIPTION also include a predicate on PAYGRADE. That being the case, it might be practical to falsify the value of NUM_DISTINCT for JOB_DESCRIPTION. By setting NUM_DISTINCT to a very low value we can minimize the extent to which predicates involving JOB_DESCRIPTION cause the CBO to incorrectly reduce cardinality estimates.
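The adjustment itself needs only a single call to a documented procedure. The following sketch is my own illustration and is not part of the TSTATS package; the value 2 is an arbitrary "white lie":

-- Pretend that JOB_DESCRIPTION has only two distinct values so that an extra
-- predicate on it barely reduces the CBO's cardinality estimate.
BEGIN
   DBMS_STATS.set_column_stats (
      ownname       => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname       => 'PAYMENTS'
     ,colname       => 'JOB_DESCRIPTION'
     ,distcnt       => 2        -- falsified value
     ,density       => 1 / 2
     ,no_invalidate => FALSE
     ,force         => TRUE);
END;
/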

But let us take a breath. Rather than using vast amounts of code to manipulate column statistics to solve complex issues we can carefully construct sample data. Let me demonstrate.

Use of sample data for complex statistical issues

The problem with real-life data is that it is unpredictable. One month you may have 2% of exceptions in your payroll and the next month you may have 1% and the month after that 3%. But if you manually create sample data and collect statistics on that sample data you can eliminate this sort of variation. At some point it becomes easier to gather statistics on fabricated data than to try and fabricate the statistics directly. Once gathered, the column statistics from the sample data can be copied to the real-life table.

Whether we are fabricating statistics or fabricating sample data we must begin by analyzing our requirements. So let me propose the following results from an analysis of SQL statements that use the PAYMENTS table created in Listing 20-2.

PAYGRADE column analysis

We discover that the code for processing the payments to the two employees on PAYGRADE 1 is different from the code for processing payments to the 9,998 employees with PAYGRADE 2 to PAYGRADE 10. We need a histogram that shows a small number of payments to those on PAYGRADE 1, but we don’t want the histogram to reflect the real-life differences between the remaining pay grades because the same SQL is used to process all pay grades other than 1. We decide that our histogram should show PAYGRADE 1 as having 100 times fewer rows than other pay grades. This means:

· A selectivity of 1/901 for PAYGRADE 1

· A selectivity of 100/901 each for pay grades 2 to 10

Remember, we want a histogram that results in effective and stable execution plans. It doesn’t matter that the histogram doesn’t accurately reflect the data distribution in the PAYMENTS table.

PAYMENT_DATE column analysis

The vast majority of payroll payments are made on the 20th day of each month (or the preceding Friday if the 20th falls on a weekend). There are a few exceptions for errors, adjustments, and executive pay that might fall later in the month, but these are minimal. If we had no histogram on PAYMENT_DATE then the selectivity for normal pay days would be too low. Since the PAYMENTS table holds one year’s data, our histogram should show that almost 1/12th of all payments are made on each of the 12 standard paydays in our PAYMENTS table and it should also show a very small selectivity for other dates in the year.

PAYGRADE and PAYMENT_DATE correlation

It turns out that the two most senior employees of our mythical company (those on PAYGRADE 1) get paid on the last weekday of the month even though everybody else gets paid on or around the 20th of the month. If the special code for processing those two executives on PAYGRADE 1 includes SQL statements with predicates on both PAYGRADE and PAYMENT_DATE, the CBO is unlikely to pick a suitable execution plan if it assumes that most payments are made on the standard payday.

We can set up column-group statistics for the PAYGRADE and PAYMENT_DATE columns that show two payments are made to the PAYGRADE 1 employees on each of their special paydays. We might also set up non-zero selectivities for days when exceptions might be made so that the column-group statistics are still used.

JOB_DESCRIPTION column analysis

There are 13 job descriptions and 10 pay grades, but there are not 130 combinations of both. We might think of setting up column-group statistics on PAYGRADE and JOB_DESCRIPTION, but as long as the application never uses a predicate on JOB_DESCRIPTION independently of PAYGRADE there is a potentially simpler approach: we can simply tell the CBO that there are just two values of JOB_DESCRIPTION. This type of white lie is useful when you have a large number of correlated columns and just want to stop the CBO reducing cardinality estimates too much. You will see how this works once we have fabricated our sample data.

Fabricating sample data

Now that we have identified our problematic columns and decided how we want the CBO to treat them we are ready to fabricate some data that fits our model. Listing 20-4 creates a table, SAMPLE_PAYMENTS, that includes the subset of columns from PAYMENTS that we are interested in, namely PAYGRADE, PAYMENT_DATE, and JOB_DESCRIPTION. Listing 20-4 also shows a procedure TSTATS.AMEND_TIME_BASED_STATISTICS that populates SAMPLE_PAYMENTS. Once SAMPLE_PAYMENTS is populated with data we gather statistics and then copy the column statistics from SAMPLE_PAYMENTS to the real-life PAYMENTS table.

Listing 20-4. Creating and populating the SAMPLE_PAYMENTS table

CREATE TABLE sample_payments
(
paygrade INTEGER
,payment_date DATE
,job_description CHAR (20)
);

CREATE OR REPLACE PACKAGE tstats
AS
-- Other procedures cut
PROCEDURE amend_time_based_statistics (
effective_date DATE DEFAULT SYSDATE);
-- Other procedures cut
END tstats;
/

CREATE OR REPLACE PACKAGE BODY tstats AS
PROCEDURE amend_time_based_statistics (
effective_date DATE DEFAULT SYSDATE)
IS
distcnt NUMBER;
density NUMBER;
nullcnt NUMBER;
srec DBMS_STATS.statrec;

avgclen NUMBER;
BEGIN
--
-- Step 1: Remove data from previous run
--
DELETE FROM sample_payments;

--
-- Step 2: Add data for standard pay for standard employees
--
INSERT INTO sample_payments (paygrade, payment_date, job_description)
WITH payment_dates
AS ( SELECT ADD_MONTHS (TRUNC (effective_date, 'MM') + 19
,1 - ROWNUM)
standard_paydate
FROM DUAL
CONNECT BY LEVEL <= 12)
,paygrades
AS ( SELECT ROWNUM + 1 paygrade
FROM DUAL
CONNECT BY LEVEL <= 9)
,multiplier
AS ( SELECT ROWNUM rid
FROM DUAL
CONNECT BY LEVEL <= 100)
SELECT paygrade
,CASE MOD (standard_paydate - DATE '1001-01-06', 7)
WHEN 5 THEN standard_paydate - 1
WHEN 6 THEN standard_paydate - 2
ELSE standard_paydate
END
payment_date
,'AAA' job_description
FROM paygrades, payment_dates, multiplier;

--
-- Step 3: Add data for paygrade 1
--
INSERT INTO sample_payments (paygrade, payment_date, job_description)
WITH payment_dates
AS ( SELECT ADD_MONTHS (LAST_DAY (TRUNC (effective_date))
,1 - ROWNUM)
standard_paydate
FROM DUAL
CONNECT BY LEVEL <= 12)
SELECT 1 paygrade
,CASE MOD (standard_paydate - DATE '1001-01-06', 7)
WHEN 5 THEN standard_paydate - 1
WHEN 6 THEN standard_paydate - 2
ELSE standard_paydate
END
payment_date
,'zzz' job_description
FROM payment_dates;

--
-- Step 4: Add rows for exceptions.
--
INSERT INTO sample_payments (paygrade, payment_date, job_description)
WITH payment_dates
AS ( SELECT ADD_MONTHS (TRUNC (effective_date, 'MM') + 19
,1 - ROWNUM)
standard_paydate
FROM DUAL
CONNECT BY LEVEL <= 12)
,paygrades
AS ( SELECT ROWNUM + 1 paygrade
FROM DUAL
CONNECT BY LEVEL <= 7)
SELECT paygrade
,CASE MOD (standard_paydate - DATE '1001-01-06', 7)
WHEN 5 THEN standard_paydate - 2 + paygrade
WHEN 6 THEN standard_paydate - 3 + paygrade
ELSE standard_paydate - 1 + paygrade
END
payment_date
,'AAA' job_description
FROM paygrades, payment_dates;

--
-- Step 5: Gather statistics for SAMPLE_PAYMENTS
--
DBMS_STATS.gather_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'SAMPLE_PAYMENTS'
,method_opt => 'FOR COLUMNS SIZE 1 JOB_DESCRIPTION '
|| 'FOR COLUMNS SIZE 254 PAYGRADE,PAYMENT_DATE, '
|| '(PAYGRADE,PAYMENT_DATE)');

--
-- Step 6: Copy column statistics from SAMPLE_PAYMENTS to PAYMENTS
--
FOR r IN (SELECT column_name, histogram
FROM all_tab_cols
WHERE table_name = 'SAMPLE_PAYMENTS')
LOOP
DBMS_STATS.get_column_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'SAMPLE_PAYMENTS'
,colname => r.column_name
,distcnt => distcnt
,density => density
,nullcnt => nullcnt
,srec => srec
,avgclen => avgclen);

DBMS_STATS.set_column_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'PAYMENTS'
,colname => r.column_name
,distcnt => distcnt
,density => density
,nullcnt => nullcnt
,srec => srec
,avgclen => avgclen);
END LOOP;
END amend_time_based_statistics;
-- Other procedures cut
END tstats;
/

There are several things to note about the code in Listing 20-4:

· We don’t need as many rows in SAMPLE_PAYMENTS as there are in PAYMENTS. We just need enough rows to demonstrate the data distribution we want. In fact, SAMPLE_PAYMENTS holds 10,896 rows whereas PAYMENTS currently holds 120,000.

· The table SAMPLE_PAYMENTS doesn’t include all columns in PAYMENTS, just the columns that require special treatment.

· After deleting historic data from SAMPLE_PAYMENTS we add 100 rows for each combination of pay grades from 2 to 10 and the 12 standard pay dates. We specify a low value for JOB_DESCRIPTION.

· We then add rows for executive pay—just one row per executive payday. We specify a second, high, value for JOB_DESCRIPTION.

· The last set of rows to be added is for the exceptional entries—one for each day in the week after the standard payday. We specify an arbitrary value of PAYGRADE (not 1).

· When we gather statistics it is important that we do not gather histograms for JOB_DESCRIPTION, as the values we have specified for the columns aren’t real. We gather histograms on the remaining columns. Notice an interesting feature of the METHOD_OPT syntax: we can implicitly create extended statistics this way!

· We then copy the column statistics gathered for SAMPLE_PAYMENTS into PAYMENTS.

You may be wondering what we have achieved here. The point of all this is that each month we can regenerate the data in SAMPLE_PAYMENTS and can be 100% certain that the only differences in data are the values of PAYMENT_DATE. Everything else will be the same. This means that once we copy the statistics to PAYMENTS, the execution plans will remain the same. Listing 20-5 shows how this works in practice.

Listing 20-5. Demonstrating statistics from fabricated data in action

DECLARE
extension_name all_stat_extensions.extension_name%TYPE;
extension_exists EXCEPTION;
PRAGMA EXCEPTION_INIT (extension_exists, -20007);
BEGIN
extension_name :=
DBMS_STATS.create_extended_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'PAYMENTS'
,extension => '(PAYGRADE,PAYMENT_DATE)');
EXCEPTION
WHEN extension_exists
THEN
NULL;
END;
/

EXEC tstats.amend_time_based_statistics(date '2014-05-01');

ALTER SESSION SET statistics_level='ALL';

SELECT COUNT (*)
FROM payments
WHERE paygrade = 1 AND payment_date = DATE '2014-04-30';

SELECT *
FROM TABLE (DBMS_XPLAN.display_cursor (format => 'BASIC IOSTATS LAST'));

----------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | TABLE ACCESS BY INDEX ROWID BATCHED| PAYMENTS | 1 | 11 | 2 |
|* 3 | INDEX RANGE SCAN | PAYMENTS_IX1 | 1 | 132 | 24 |
----------------------------------------------------------------------------------------

SELECT COUNT (*)
FROM payments
WHERE paygrade = 10 AND payment_date = DATE '2014-04-18';

SELECT *
FROM TABLE (DBMS_XPLAN.display_cursor (format => 'BASIC IOSTATS LAST'));

------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | TABLE ACCESS FULL| PAYMENTS | 1 | 1101 | 4525 |
------------------------------------------------------------------

SELECT COUNT (*)
FROM payments
WHERE paygrade = 4 AND payment_date = DATE '2014-04-18';

SELECT *
FROM TABLE (DBMS_XPLAN.display_cursor (format => 'BASIC IOSTATS LAST'));

------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | TABLE ACCESS FULL| PAYMENTS | 1 | 1101 | 28 |
------------------------------------------------------------------

SELECT COUNT (*)
FROM payments
WHERE paygrade = 8 AND job_description = 'JUNIOR TECHNICIAN';

SELECT *
FROM TABLE (DBMS_XPLAN.display_cursor (format => 'BASIC IOSTATS LAST'));

---------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | INDEX RANGE SCAN| PAYMENTS_IX1 | 1 | 6674 | 7716 |
---------------------------------------------------------------------

SELECT COUNT (*)
FROM payments
WHERE employee_id = 101;

SELECT *
FROM TABLE (DBMS_XPLAN.display_cursor (format => 'BASIC IOSTATS LAST'));

------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | TABLE ACCESS FULL| PAYMENTS | 1 | 12 | 12 |
------------------------------------------------------------------

--------------------------------------------------------------------
EXEC tstats.amend_time_based_statistics(date '2015-05-01');

SELECT COUNT (*)
FROM payments
WHERE paygrade = 1 AND payment_date = DATE '2015-04-30';

SELECT * FROM TABLE (DBMS_XPLAN.display_cursor (format => 'BASIC +ROWS'));

---------------------------------------------------------------------
| Id | Operation | Name | Rows |
---------------------------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| PAYMENTS | 11 |
| 3 | INDEX RANGE SCAN | PAYMENTS_IX1 | 132 |
---------------------------------------------------------------------

SELECT COUNT (*)
FROM payments
WHERE paygrade = 10 AND payment_date = DATE '2015-04-20';

SELECT * FROM TABLE (DBMS_XPLAN.display_cursor (format => 'BASIC +ROWS'));

-----------------------------------------------
| Id | Operation | Name | Rows |
-----------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | TABLE ACCESS FULL| PAYMENTS | 1101 |
-----------------------------------------------

SELECT COUNT (*)
FROM payments
WHERE paygrade = 4 AND payment_date = DATE '2015-04-20';

SELECT * FROM TABLE (DBMS_XPLAN.display_cursor (format => 'BASIC +ROWS'));

-----------------------------------------------
| Id | Operation | Name | Rows |
-----------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | TABLE ACCESS FULL| PAYMENTS | 1101 |
-----------------------------------------------

SELECT COUNT (*)
FROM payments
WHERE paygrade = 8 AND job_description = 'JUNIOR TECHNICIAN';

SELECT * FROM TABLE (DBMS_XPLAN.display_cursor (format => 'BASIC +ROWS'));

--------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | INDEX RANGE SCAN| PAYMENTS_IX1 | 6674 |
--------------------------------------------------

SELECT COUNT (*)
FROM payments
WHERE employee_id = 101;

SELECT * FROM TABLE (DBMS_XPLAN.display_cursor (format => 'BASIC +ROWS'));

-----------------------------------------------
| Id | Operation | Name | Rows |
-----------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | TABLE ACCESS FULL| PAYMENTS | 12 |
-----------------------------------------------

Listing 20-5 begins by showing the creation of column-group extended statistics on PAYMENTS for the two columns PAYGRADE and PAYMENT_DATE. This is a one-off task and, unlike the call to TSTATS.AMEND_TIME_BASED_STATISTICS, would not be performed every month.

Normally TSTATS.AMEND_TIME_BASED_STATISTICS would use the current date to generate sample data, but for demonstration purposes Listing 20-5 begins by specifying May 1, 2014. The first query in Listing 20-5 involves PAYGRADE 1. The index range scan returns 24 rows, but the estimated cardinality of 132 is too high. Nevertheless, the correct index access method is selected. If there are other issues with the high selectivity, some further adjustment to the sample data might be warranted.

The second and third queries in Listing 20-5 involve standard pay grades and the standard payday of April 18 (April 20, 2014, is a Sunday). The CBO provides the same estimate of 1,101 rows for the two queries. A full table scan is used, which seems reasonable. In the case of PAYGRADE 10 the cardinality estimate is too low, but for PAYGRADE 4 it is too high. However, we have the consistent estimate and the consistent execution plan that we wanted.

The fourth query in Listing 20-5 involves PAYGRADE and JOB_DESCRIPTION. The estimate of 6,674 rows is reasonably close to the actual 7,716 matching rows in the PAYMENTS table, despite the fact that we have not created column-group statistics showing that there are only two values of JOB_DESCRIPTION for employees on PAYGRADE 8. This is because the statistics copied from SAMPLE_PAYMENTS lead the CBO to believe that there are only two values of JOB_DESCRIPTION in the entire table! If a number of columns are all correlated to each other, changing column statistics in this way to reflect a significantly reduced number of distinct values can often prevent the CBO from underestimating cardinalities.

The next query in Listing 20-5 involves a predicate on EMPLOYEE_ID. Our sample data doesn’t include this column, so the statistics for the EMPLOYEE_ID column in PAYMENTS were not replaced. The CBO based its cardinality estimate on statistics gathered from the PAYMENTS table.

Listing 20-5 then simulates the passage of time by making a second call to TSTATS.AMEND_TIME_BASED_STATISTICS for May 1, 2015. We can see that the cardinality estimates for the new dates in 2015 are identical to those in 2014; I haven’t shown actual row counts, as the PAYMENTS table hasn’t been updated as it would have been in real life.

Managing column statistics wrap up

There are numerous reasons why we might wish to change gathered column statistics to help the CBO produce more accurate cardinality estimates and/or to stabilize those cardinality estimates over time. We might start off with the gathered column statistics and just remove the HIGH_VALUE and LOW_VALUE bits. This is the most common approach. To deal with more complex issues, we can fabricate statistics or even gather statistics on fabricated data. The point is that whatever issues we come across, we can find a way to avoid repeatedly gathering statistics on tables used by our application.

Statistics and Partitions

In Chapter 9, I explained some of the reasons why partition-level statistics are used and stated that, in my opinion, partition-level statistics are often more trouble than they are worth. I realize that many of you will still need some convincing on this point. I am also sure that some of you will be concerned, quite rightly, that a change from using partition-level statistics to global statistics is a large change that might involve an unnecessary level of risk for some established applications. Let us take a look at the much overrated procedure, DBMS_STATS.COPY_TABLE_STATS, with the intent of making these issues clearer.

The DBMS_STATS.COPY_TABLE_STATS myth

DBMS_STATS.COPY_TABLE_STATS was first documented in 11gR1 but was actually supported earlier, in 10.2.0.4 and later releases of 10gR2. The procedure is somewhat inappropriately named because it is not used to copy table statistics as such, at least not global table statistics. It is in fact used to copy partition statistics from one partition of a table to another partition of the same table, or to copy subpartition statistics from one subpartition to another subpartition of the same table. Let me begin my coverage of this procedure with an explanation of how things are supposed to work before going on to discuss the drawbacks.

How DBMS_STATS.COPY_TABLE_STATS is supposed to work

There are three key advantages to using DBMS_STATS.COPY_TABLE_STATS over gathering partition-level statistics. The first advantage is that it is far faster to copy statistics than to gather them. The second advantage is that statistics for a partition can only be gathered once the partition has been populated with data, and you might need to query the partition before all the data has been loaded. The third advantage, and the one that most appealed to me initially, is the fact that replicating statistics from one partition to the next minimizes the risk of changes to execution plans. Listing 20-6 shows the basic principle behind DBMS_STATS.COPY_TABLE_STATS.

Listing 20-6. Initial attempt at using DBMS_STATS.COPY_TABLE_STATS

CREATE TABLE statement_part
(
transaction_date_time
,transaction_date
,posting_date
,description
,transaction_amount
,product_category
,customer_category
)
PARTITION BY RANGE
(transaction_date)
(
PARTITION p1 VALUES LESS THAN (DATE '2013-01-05')
,PARTITION p2 VALUES LESS THAN (DATE '2013-01-11')
,PARTITION p3 VALUES LESS THAN (maxvalue))
PCTFREE 99
PCTUSED 1
AS
SELECT TIMESTAMP '2013-01-01 12:00:00.00 -05:00'
+ NUMTODSINTERVAL (TRUNC ( (ROWNUM - 1) / 50), 'DAY')
,DATE '2013-01-01' + TRUNC ( (ROWNUM - 1) / 50)
,DATE '2013-01-01' + TRUNC ( (ROWNUM - 1) / 50) + MOD (ROWNUM, 3)
,DECODE (MOD (ROWNUM, 4)
,0, 'Flight'
,1, 'Meal'
,2, 'Taxi'
,'Deliveries')
,DECODE (MOD (ROWNUM, 4)
,0, 200 + (30 * ROWNUM)
,1, 20 + ROWNUM
,2, 5 + MOD (ROWNUM, 30)
,8)
,TRUNC ( (ROWNUM - 1) / 50) + 1
,MOD ( (ROWNUM - 1), 50) + 1
FROM DUAL
CONNECT BY LEVEL <= 500;

CREATE INDEX statement_part_ix1
ON statement_part (transaction_date)
LOCAL
PCTFREE 99;

BEGIN
DBMS_STATS.delete_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'STATEMENT_PART'
,cascade_parts => TRUE
,cascade_indexes => TRUE
,cascade_columns => TRUE);

DBMS_STATS.gather_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'STATEMENT_PART'
,partname => 'P1'
,granularity => 'PARTITION'
,method_opt => 'FOR ALL COLUMNS SIZE 1');

DBMS_STATS.copy_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'STATEMENT_PART'
,srcpartname => 'P1'
,dstpartname => 'P2');

DBMS_STATS.copy_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'STATEMENT_PART'
,srcpartname => 'P1'
,dstpartname => 'P3');
END;
/

--
-- We can see that the NUM_ROWS statistic has been copied
-- from P1 to other partitions and aggregated to global stats.
--

SELECT partition_name, num_rows
FROM all_tab_statistics
WHERE owner = SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
AND table_name = 'STATEMENT_PART';

PARTITION_NAME       NUM_ROWS
-------------------- --------
                          600
P1                        200
P2                        200
P3                        200

--
-- The column statistics have also been copied.
--
SELECT partition_name, num_distinct
FROM all_part_col_statistics
WHERE owner = SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
AND table_name = 'STATEMENT_PART'
AND column_name = 'DESCRIPTION'
UNION ALL
SELECT 'TABLE' partition_name, num_distinct
FROM all_tab_col_statistics
WHERE owner = SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
AND table_name = 'STATEMENT_PART'
AND column_name = 'DESCRIPTION';

PARTITION_NAME       NUM_DISTINCT
-------------------- ------------
P1                              4
P2                              4
P3                              4
TABLE                           4

--
-- We can now get reasonable cardinality estimates from
-- queries against partition P2.
--

SELECT COUNT (*)
FROM statement_part PARTITION (p2)
WHERE description = 'Flight';

--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Pstart| Pstop |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | PARTITION RANGE SINGLE| | 50 | 2 | 2 |
|* 3 | TABLE ACCESS FULL | STATEMENT_PART | 50 | 2 | 2 |
--------------------------------------------------------------------------

Listing 20-6 recreates the table STATEMENT_PART that we created in Chapter 9, this time with three partitions. I begin by deleting any existing statistics and gathering statistics on partition P1. I then use DBMS_STATS.COPY_TABLE_STATS to copy the statistics from partition P1 to partitions P2 and P3 and show that we now not only have a usable set of statistics for P2 but that global statistics for STATEMENT_PART have also been generated. This all seems pretty cool. However, all is not quite as simple and straightforward as it first seems.

Drawbacks of DBMS_STATS.COPY_TABLE_STATS

Listing 20-7 highlights the main complications that can arise with DBMS_STATS.COPY_TABLE_STATS.

Listing 20-7. Issues with DBMS_STATS.COPY_TABLE_STATS

DELETE FROM statement_part
WHERE transaction_date NOT IN (DATE '2013-01-01', DATE '2013-01-06');

BEGIN
DBMS_STATS.delete_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'STATEMENT_PART'
,cascade_parts => TRUE
,cascade_indexes => TRUE
,cascade_columns => TRUE);

DBMS_STATS.gather_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'STATEMENT_PART'
,partname => 'P1'
,granularity => 'PARTITION'
,method_opt => 'FOR ALL COLUMNS SIZE 1');

DBMS_STATS.copy_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'STATEMENT_PART'
,srcpartname => 'P1'
,dstpartname => 'P2');

DBMS_STATS.copy_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'STATEMENT_PART'
,srcpartname => 'P1'
,dstpartname => 'P3');
END;
/

CREATE OR REPLACE FUNCTION convert_date_stat (raw_value RAW)
RETURN DATE
IS
date_value DATE;
BEGIN
DBMS_STATS.convert_raw_value (rawval => raw_value, resval => date_value);
RETURN date_value;
END convert_date_stat;
/

SELECT column_name
,partition_name
,convert_date_stat (low_value) low_value
,convert_date_stat (high_value) high_value
,num_distinct
FROM all_part_col_statistics
WHERE owner = SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
AND table_name = 'STATEMENT_PART'
AND column_name IN ('TRANSACTION_DATE', 'POSTING_DATE')
ORDER BY column_name DESC, partition_name;

COLUMN_NAME          PARTITION_NAME       LOW_VALUE HIGH_VALU NUM_DISTINCT
-------------------- -------------------- --------- --------- ------------
TRANSACTION_DATE     P1                   01-JAN-13 01-JAN-13            1
TRANSACTION_DATE     P2                   05-JAN-13 05-JAN-13            1
TRANSACTION_DATE     P3                   11-JAN-13 11-JAN-13            1
POSTING_DATE         P1                   01-JAN-13 03-JAN-13            3
POSTING_DATE         P2                   01-JAN-13 03-JAN-13            3
POSTING_DATE         P3                   01-JAN-13 03-JAN-13            3

SELECT COUNT (*)
FROM statement_part
WHERE transaction_date = DATE '2013-01-06';

--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | PARTITION RANGE SINGLE| | 1 |
| 3 | INDEX RANGE SCAN | STATEMENT_PART_IX1 | 1 |
--------------------------------------------------------------

Listing 20-7 begins by deleting rows from the table STATEMENT_PART so that each partition has only one TRANSACTION_DATE. Although arguably it makes more sense to partition a table by list when it is guaranteed that there is just one value of the partitioning column per partition, it is not at all uncommon to use range-based partitions in such cases. It is also not uncommon for such ranges to cover more than 24 hours, particularly for businesses that have no data for weekends.

Once statistics for partition P1 have been re-gathered and copied to P2 and P3, we see the first issue: the DBMS_STATS.COPY_TABLE_STATS procedure hasn’t just blindly copied the statistics from P1 to P2. It has made what it believes to be a sensible adjustment: the HIGH_VALUE and LOW_VALUE column statistics for column TRANSACTION_DATE in partition P2 have both been set to January 5. In releases 11.2.0.2 and 11.2.0.3, the HIGH_VALUE would have been set to January 11, reflecting the range of the partition, but this caused a problem (bug 14607573), as the NUM_DISTINCT column statistic for TRANSACTION_DATE in partition P2 has a value of 1 and yet HIGH_VALUE and LOW_VALUE are different. In any event, since the value of TRANSACTION_DATE for rows in partition P2 is actually January 6, HIGH_VALUE and LOW_VALUE are still misleading, even in 12cR1.

Although DBMS_STATS.COPY_TABLE_STATS attempts to adjust the column statistics for the partitioning column, it makes no attempt to adjust the statistics for other columns, which leads to a second problem. The POSTING_DATE column is correlated to TRANSACTION_DATE, and any time you adjust the statistics for one you should logically adjust the other; if you don’t use TSTATS then I would recommend manually adjusting the HIGH_VALUE and LOW_VALUE of all time-based columns when you use DBMS_STATS.COPY_TABLE_STATS on tables partitioned by date.
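To make that recommendation concrete, the following sketch (my own; not one of the book's listings) shows how the POSTING_DATE statistics for partition P2 of STATEMENT_PART could be corrected after the copy. The endpoint dates are illustrative; in practice they would be derived from the partition's date range.

-- Hypothetical adjustment: after copying P1's statistics to P2, overwrite the
-- LOW_VALUE/HIGH_VALUE of the correlated POSTING_DATE column in partition P2.
DECLARE
   distcnt    NUMBER;
   density    NUMBER;
   nullcnt    NUMBER;
   srec       DBMS_STATS.statrec;
   avgclen    NUMBER;
   datevals   DBMS_STATS.datearray
                 := DBMS_STATS.datearray (DATE '2013-01-05', DATE '2013-01-12');
BEGIN
   -- Start from the column statistics that were copied to P2.
   DBMS_STATS.get_column_stats (
      ownname  => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname  => 'STATEMENT_PART'
     ,colname  => 'POSTING_DATE'
     ,partname => 'P2'
     ,distcnt  => distcnt
     ,density  => density
     ,nullcnt  => nullcnt
     ,srec     => srec
     ,avgclen  => avgclen);

   -- Replace just the minimum and maximum endpoints; keep everything else.
   srec.epc := 2;
   srec.bkvals := NULL;
   DBMS_STATS.prepare_column_values (srec, datevals);

   DBMS_STATS.set_column_stats (
      ownname  => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname  => 'STATEMENT_PART'
     ,colname  => 'POSTING_DATE'
     ,partname => 'P2'
     ,distcnt  => distcnt
     ,density  => density
     ,nullcnt  => nullcnt
     ,srec     => srec
     ,avgclen  => avgclen);
END;
/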

The logical contradiction with DBMS_STATS.COPY_TABLE_STATS

My first implementation of TSTATS was in 2009. That first application, which I shall call AJAX, included tables partitioned on a column that I shall call BUSINESS_DATE. The programmers for AJAX adhered to a particular coding standard: all SQL statements must include an equality predicate on BUSINESS_DATE for every partitioned table accessed. The intent of this standard was to avoid any dependency on global statistics. Partition-level statistics were used exclusively, and as new partitions were created, statistics for these new partitions were generated using the DBMS_STATS.COPY_TABLE_STATS procedure. Column statistics for BUSINESS_DATE were adjusted manually.

The TSTATS implementation was successful and paved the way for other applications to be converted to TSTATS. I was happy as a clam and thanked my lucky stars for the wonderful DBMS_STATS.COPY_TABLE_STATS procedure.

A couple of years later, for no reason I can think of, I awoke in my bed with a surprising thought: the only legitimate reason to use partition-level statistics would be because partitions are significantly different. But DBMS_STATS.COPY_TABLE_STATS only works when all partitions are similar. So DBMS_STATS.COPY_TABLE_STATS is only usable when there is no need for it!

I often have these sorts of thoughts when I wake up: the whole world seems to have missed something obvious and gone mad. Usually, after I have wiped the sleep from my eyes, had a cup of tea, and eaten my breakfast cereal, the fatal flaw in my sleepy thoughts becomes clear. However, on this occasion no such flaw came to mind. I still believe that despite its widespread use there are few valid use cases for DBMS_STATS.COPY_TABLE_STATS.

Whether you adopt a TSTATS approach or not, the problems you face with global statistics on partitioned tables aren’t very different from those you face with statistics on unpartitioned tables: time-based columns cause changes to execution plans, correlated columns and expressions cause cardinality estimate errors, and skewed values for a column may require multiple execution plans for the same statement. Partition-level statistics don’t properly address all these issues, as I showed in Listings 9-17 to 9-21 in Chapter 9.

Unfortunately, we can’t always treat partitioned tables just like unpartitioned tables. We can’t always just delete the partition and subpartition statistics and get on with our lives. Let me stop talking about DBMS_STATS.COPY_TABLE_STATS now and take a more detailed look at global statistics on partitioned tables so that we understand what problems global statistics do and do not pose.

Cardinality estimates with global statistics

Let us try another approach to managing object statistics on STATEMENT_PART. Listing 20-8 gets rid of the partition-level statistics altogether and sets up global statistics alone.

Listing 20-8. Global statistics on STATEMENT_PART

BEGIN
DBMS_STATS.delete_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'STATEMENT_PART'
,cascade_parts => TRUE
,cascade_indexes => TRUE
,cascade_columns => TRUE);

DBMS_STATS.gather_table_stats (
ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,tabname => 'STATEMENT_PART'
,granularity => 'GLOBAL'
,method_opt => 'FOR ALL COLUMNS SIZE 1');

tstats.adjust_column_stats_v2 (
p_owner => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,p_table_name => 'STATEMENT_PART');
END;
/

SELECT COUNT (*)
FROM statement_part t
WHERE transaction_date = DATE '2013-01-06';

-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Pstart| Pstop |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 29 (0)| | |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | PARTITION RANGE SINGLE| | 50 | 29 (0)| 2 | 2 |
|* 3 | INDEX FAST FULL SCAN | STATEMENT_PART_IX1 | 50 | 29 (0)| 2 | 2 |
-------------------------------------------------------------------------------------------

SELECT /*+ full(t) */
COUNT (*)
FROM statement_part t
WHERE transaction_date = DATE '2013-01-06';

---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Pstart| Pstop |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 47 (0)| | |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | PARTITION RANGE SINGLE| | 50 | 47 (0)| 2 | 2 |
|* 3 | TABLE ACCESS FULL | STATEMENT_PART | 50 | 47 (0)| 2 | 2 |
---------------------------------------------------------------------------------------

The global statistics for STATEMENT_PART have been adjusted using the TSTATS.ADJUST_COLUMN_STATS_V2 procedure introduced in Listing 20-3. The cardinality estimates produced by queries against STATEMENT_PART are perfectly reasonable, just as they would be for an unpartitioned table. Notice that our query uses the index we created and that the cost of the plan based on a hinted full table scan is higher. But what happens if we add a few more partitions? Listing 20-9 shows us just that.

Listing 20-9. Adding partitions to STATEMENT_PART

ALTER TABLE statement_part
SPLIT PARTITION p3
AT (DATE '2013-01-12')
INTO (PARTITION p3, PARTITION p4);

ALTER TABLE statement_part
SPLIT PARTITION p4
AT (DATE '2013-01-13')
INTO (PARTITION p4, PARTITION p5);

ALTER TABLE statement_part
SPLIT PARTITION p5
AT (DATE '2013-01-14')
INTO (PARTITION p5, PARTITION p6);

SELECT COUNT (*)
FROM statement_part t
WHERE transaction_date = DATE '2013-01-06';

---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Pstart| Pstop |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 25 (0)| | |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | PARTITION RANGE SINGLE| | 50 | 25 (0)| 2 | 2 |
|* 3 | TABLE ACCESS FULL | STATEMENT_PART | 50 | 25 (0)| 2 | 2 |
---------------------------------------------------------------------------------------

SELECT COUNT (*)
FROM statement_part t
WHERE transaction_date = DATE '2013-01-13';

---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Pstart| Pstop |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 25 (0)| | |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | PARTITION RANGE SINGLE| | 50 | 25 (0)| 5 | 5 |
|* 3 | TABLE ACCESS FULL | STATEMENT_PART | 50 | 25 (0)| 5 | 5 |
---------------------------------------------------------------------------------------

After we add partitions to STATEMENT_PART we can see that our sample query has the same cardinality estimate of 50 both before and after partition maintenance, and that a similar query against a newly created partition also has a cardinality estimate of 50. Notice that the value of NUM_DISTINCT for TRANSACTION_DATE is only 2, even though there are partitions for many more values of TRANSACTION_DATE. I highlight this because many people have spoken to me concerned about such discrepancies. As long as the value of NUM_DISTINCT for the column TRANSACTION_DATE and the value of NUM_ROWS for the table are consistent, the estimated number of rows per TRANSACTION_DATE will be reasonable.
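
If you want to reassure yourself, or a concerned colleague, that the statistics are consistent, a quick dictionary query is enough. The following sketch uses the standard ALL_TAB_STATISTICS and ALL_TAB_COL_STATISTICS views and the STATEMENT_PART table from the earlier listings; the ratio NUM_ROWS/NUM_DISTINCT is the rows-per-key figure the CBO will use for an equality predicate on TRANSACTION_DATE.

-- Sanity check: rows per key implied by the adjusted statistics
SELECT t.num_rows
      ,c.num_distinct
      ,ROUND (t.num_rows / c.num_distinct) AS est_rows_per_key
  FROM all_tab_statistics t
       JOIN all_tab_col_statistics c
          ON c.owner = t.owner AND c.table_name = t.table_name
 WHERE t.owner = SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
   AND t.table_name = 'STATEMENT_PART'
   AND t.object_type = 'TABLE'          -- global statistics only
   AND c.column_name = 'TRANSACTION_DATE';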

If partition maintenance hasn’t changed the cardinality estimates, why are the execution plans in Listings 20-8 and 20-9 different? The answer has nothing to do with cardinality and everything to do with cost.

Costing full table scans of table partitions

There is one significant complication that arises when the total number of partitions in a partitioned table changes while the object statistics for the table aren’t changed. Take another look at the execution plans in Listings 20-8 and 20-9. The unhinted query in Listing 20-8 uses an index. The cost associated with the variant that uses a FULL hint is higher—no problem so far.

However, once we add a few more partitions we suddenly find that the unhinted query uses a full table scan! We can see that the cost of the full table scan has been reduced as a result of the partition-maintenance operation. Now the cost of the full table scan is less than that of the index scan, and the execution plan for the unhinted query has changed. If the newly added partitions were dropped, the index access would once again become more attractive.

What has happened here is very unusual. The CBO has estimated the cost of the full table scan of the partition by dividing the value of the NUM_BLOCKS statistic by the current number of partitions obtained from the data dictionary. Normally, the inputs to the CBO’s estimating process are the object statistics, system statistics, and initialization parameters. One exception to this rule is the degree of parallelism specified in the data dictionary for a table or index. The only other case that I know of where the CBO uses data dictionary information, other than statistics, is when it costs full table scans of partitions.
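
To see the arithmetic for yourself, you can compare the table-level BLOCKS statistic with the number of partitions the data dictionary currently reports. This is only a diagnostic sketch using the standard ALL_TAB_STATISTICS and ALL_TAB_PARTITIONS views; the final column approximates the block count the CBO will use when it costs a full scan of a single partition.

-- Global BLOCKS statistic divided by the current number of partitions
SELECT s.blocks                                     AS global_blocks
      ,COUNT (p.partition_name)                     AS current_partitions
      ,ROUND (s.blocks / COUNT (p.partition_name))  AS blocks_per_partition
  FROM all_tab_statistics s
       JOIN all_tab_partitions p
          ON p.table_owner = s.owner AND p.table_name = s.table_name
 WHERE s.owner = SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
   AND s.table_name = 'STATEMENT_PART'
   AND s.object_type = 'TABLE'
 GROUP BY s.blocks;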

PARTITION-ELIMINATION ANOMALIES

The CBO realizes that performing a full table scan on one partition will take less time than scanning all partitions in a table. However, the same logic is not applied to index scans of any type: the cost of an index scan on one partition of a local index is identical to that of scanning all partitions; this is true in all database releases up to and including 12.1.0.1.

Furthermore, if you use composite partitioning you will find that elimination of subpartitions within a partition will not reduce the cost of a full table scan either. These anomalies may, of course, be corrected at some point in the future.

It seems to me that the CBO development team has missed a trick here. In my opinion, it would be better if there were an object statistic, perhaps called NUM_PARTITIONS, that reflected the number of partitions that existed at the time the object statistics for that table were gathered. But that is just wishful thinking. We need to deal with reality. We can do so simply by increasing the value of the NUM_BLOCKS statistic for the table by the same factor as the increase in the number of partitions. The easiest way to do this is to capture the average number of blocks per partition at the same time as statistics are gathered and then to add a TSTATS hook to the partition-maintenance operations. Listing 20-10 shows one way to do this.

Listing 20-10. Managing NUM_BLOCKS with partition-maintenance operations

CREATE OR REPLACE PACKAGE tstats
AS
-- Other procedures cut

PROCEDURE adjust_global_stats (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE
,p_mode VARCHAR2 DEFAULT 'PMOP');

PROCEDURE gather_table_stats (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE);

-- Other procedures cut
END tstats;
/

CREATE OR REPLACE PACKAGE BODY tstats
AS
-- Other procedures cut

PROCEDURE adjust_global_stats (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE
,p_mode VARCHAR2 DEFAULT 'PMOP')
IS
-- This helper procedure updates the statistic for the number of blocks in the
-- table so that the average size of a partition is unaltered. We sneak
-- the average segment size away in the unused CACHEDBLK statistic.
--
numblks NUMBER;
numrows NUMBER;
avgrlen NUMBER;
cachedblk NUMBER;
cachehit NUMBER;
BEGIN
DBMS_STATS.get_table_stats (ownname => p_owner
,tabname => p_table_name
,numrows => numrows
,avgrlen => avgrlen
,numblks => numblks
,cachedblk => cachedblk
,cachehit => cachehit);

IF p_mode = 'PMOP'
THEN
--
-- Reset NUMBLKS based on the saved average
-- segment size (CACHEDBLK) and the current
-- number of partitions.
--
IF cachedblk IS NULL
THEN
RETURN; -- No saved value
END IF;

--
-- Recalculate the number of blocks based on
-- the current number of partitions and the
-- saved average segment size.
-- Avoid reference to DBA_SEGMENTS in case
-- there is no privilege.
--
SELECT cachedblk * COUNT (*)
INTO numblks
FROM all_objects
WHERE owner = p_owner
AND object_name = p_table_name
AND object_type = 'TABLE PARTITION';
ELSIF p_mode = 'GATHER'
THEN
--
-- Save average segment size in CACHEDBLK based on NUMBLKS
-- and current number of partitions.
--
SELECT numblks / COUNT (*), TRUNC (numblks / COUNT (*)) * COUNT (*)
INTO cachedblk, numblks
FROM all_objects
WHERE owner = p_owner
AND object_name = p_table_name
AND object_type = 'TABLE PARTITION';
ELSE
RAISE PROGRAM_ERROR;
-- Only gets here if p_mode not set to PMOP or GATHER
END IF;

DBMS_STATS.set_table_stats (ownname => p_owner
,tabname => p_table_name
,numblks => numblks
,cachedblk => cachedblk
,force => TRUE);
END adjust_global_stats;

PROCEDURE gather_table_stats (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE)
IS
BEGIN
DBMS_STATS.unlock_table_stats (ownname => p_owner
,tabname => p_table_name);

FOR r IN (SELECT *
FROM all_tables
WHERE owner = p_owner AND table_name = p_table_name)
LOOP
DBMS_STATS.gather_table_stats (
ownname => p_owner
,tabname => p_table_name
,granularity => CASE r.partitioned
WHEN 'YES' THEN 'GLOBAL'
ELSE 'ALL'
END
,method_opt => 'FOR ALL COLUMNS SIZE 1');

adjust_column_stats_v2 (p_owner => p_owner
,p_table_name => p_table_name);

IF r.partitioned = 'YES'
THEN
adjust_global_stats (p_owner => p_owner
,p_table_name => p_table_name
,p_mode => 'GATHER');
END IF;
END LOOP;

DBMS_STATS.lock_table_stats (ownname => p_owner, tabname => p_table_name);
END gather_table_stats;

-- Other procedures cut
END tstats;
/

Listing 20-10 shows more procedures from the TSTATS package. TSTATS.GATHER_TABLE_STATS is a wrapper routine that gathers statistics for a table and adjusts the column statistics. In the case of partitioned tables, only global statistics are gathered, and the average number of blocks per partition is saved in the unused CACHEDBLK statistic using the TSTATS.ADJUST_GLOBAL_STATS procedure.

image Note Using a table statistic reserved by Oracle for future use is very, very naughty. Some of you may be shocked to see such practice shown in a book. Strictly speaking, I should have saved my private data in a legitimate place, such as a regular table, outside of the data dictionary. However, in real life I found the temptation to use the unused statistic irresistible. The fact that the statistic is preserved during calls to DBMS_STATS.EXPORT_TABLE_STATS and DBMS_STATS.IMPORT_TABLE_STATS seems to make the risk of the procedure breaking at some future point in time worth taking. The statistic column is unused in all releases up to and including 12.1.0.1.

The routine TSTATS.ADJUST_GLOBAL_STATS operates in two modes. When called as part of a statistics-gathering process, the average number of blocks per partition is calculated and saved. When partition-maintenance operations are performed, the TSTATS.ADJUST_GLOBAL_STATS procedure is called with the default mode of "PMOP," which causes the NUM_BLOCKS statistic to be recalculated from the current number of partitions and the saved average number of blocks per partition.

Listing 20-11 shows how we can incorporate TSTATS.ADJUST_GLOBAL_STATS into our partition-maintenance procedure so that partition-maintenance operations do not destabilize our execution plans.

Listing 20-11. Adjusting NUM_BLOCKS as part of partition maintenance

--
-- First drop the empty partitions
--
ALTER TABLE statement_part DROP PARTITION p3;
ALTER TABLE statement_part DROP PARTITION p4;
ALTER TABLE statement_part DROP PARTITION p5;
ALTER TABLE statement_part DROP PARTITION p6;
--
-- Now gather statistics on full partitions
--

BEGIN
tstats.gather_table_stats (
p_owner => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,p_table_name => 'STATEMENT_PART');
END;
/

--
-- Check plans and cost before partition maintenance
--
SELECT COUNT (*)
FROM statement_part t
WHERE transaction_date = DATE '2013-01-06';

-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Pstart| Pstop |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 29 (0)| | |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | PARTITION RANGE SINGLE| | 50 | 29 (0)| 2 | 2 |
|* 3 | INDEX FAST FULL SCAN | STATEMENT_PART_IX1 | 50 | 29 (0)| 2 | 2 |
-------------------------------------------------------------------------------------------

SELECT /*+ full(t) */
COUNT (*)
FROM statement_part t
WHERE transaction_date = DATE '2013-01-06';

---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Pstart| Pstop |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 69 (0)| | |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | PARTITION RANGE SINGLE| | 50 | 69 (0)| 2 | 2 |
|* 3 | TABLE ACCESS FULL | STATEMENT_PART | 50 | 69 (0)| 2 | 2 |
---------------------------------------------------------------------------------------

--
-- Now recreate the empty partitions

ALTER TABLE statement_part
ADD PARTITION p3 VALUES LESS THAN (DATE '2013-01-12');

ALTER TABLE statement_part
ADD PARTITION p4 VALUES LESS THAN (DATE '2013-01-13');

ALTER TABLE statement_part
ADD PARTITION p5 VALUES LESS THAN (DATE '2013-01-14');

ALTER TABLE statement_part
ADD PARTITION p6 VALUES LESS THAN (maxvalue);

--
-- Finally call TSTATS hook to adjust NUM_BLOCKS statistic
--

BEGIN
tstats.adjust_global_stats (
p_owner => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
,p_table_name => 'STATEMENT_PART');
END;
/

--
-- Now re-check plans and costs
--

SELECT COUNT (*)
FROM statement_part t
WHERE transaction_date = DATE '2013-01-06';

-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Pstart| Pstop |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 29 (0)| | |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | PARTITION RANGE SINGLE| | 50 | 29 (0)| 2 | 2 |
|* 3 | INDEX FAST FULL SCAN | STATEMENT_PART_IX1 | 50 | 29 (0)| 2 | 2 |
-------------------------------------------------------------------------------------------

SELECT COUNT (*)
FROM statement_part t
WHERE transaction_date = DATE '2013-01-13';

-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Pstart| Pstop |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 29 (0)| | |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | PARTITION RANGE SINGLE| | 50 | 29 (0)| 5 | 5 |
|* 3 | INDEX FAST FULL SCAN | STATEMENT_PART_IX1 | 50 | 29 (0)| 5 | 5 |
-------------------------------------------------------------------------------------------

SELECT /*+ full(t) */ COUNT (*)
FROM statement_part t
WHERE transaction_date = DATE '2013-01-06';

---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Pstart| Pstop |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 69 (0)| | |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | PARTITION RANGE SINGLE| | 50 | 69 (0)| 2 | 2 |
|* 3 | TABLE ACCESS FULL | STATEMENT_PART | 50 | 69 (0)| 2 | 2 |
---------------------------------------------------------------------------------------

The cost of the full table scan after gathering and adjusting statistics differs somewhat from that in Listing 20-8 as a result of rounding, but we can see that by incorporating TSTATS.ADJUST_GLOBAL_STATS into our partition-maintenance procedure the cost of a full table scan remains unchanged across partition maintenance and the unhinted query continues to use the index.
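
If you want to confirm that the hook has done its job, you can read the table-level statistics back with DBMS_STATS.GET_TABLE_STATS, just as the package body does. The following anonymous block is a sketch along those lines: it prints the current NUM_BLOCKS value together with the per-partition average that TSTATS stashed in CACHEDBLK.

SET SERVEROUTPUT ON

DECLARE
   numrows     NUMBER;
   numblks     NUMBER;
   avgrlen     NUMBER;
   cachedblk   NUMBER;
   cachehit    NUMBER;
BEGIN
   DBMS_STATS.get_table_stats (
      ownname   => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname   => 'STATEMENT_PART'
     ,numrows   => numrows
     ,numblks   => numblks
     ,avgrlen   => avgrlen
     ,cachedblk => cachedblk
     ,cachehit  => cachehit);

   DBMS_OUTPUT.put_line ('NUM_BLOCKS = ' || numblks);
   DBMS_OUTPUT.put_line ('Saved blocks per partition (CACHEDBLK) = ' || cachedblk);
END;
/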

Temporary Tables

I will be honest: temporary tables are a challenging problem for the TSTATS method for two reasons. They are as follows:

· You can’t easily gather statistics for a temporary table by an independent job because the temporary table will be empty at the time.

· Whereas it is uncommon for permanent tables to vary in size by orders of magnitude, it is quite common for temporary tables to hold 1,000,000 rows on one occasion and 3 rows shortly after.

When temporary tables are used as staging tables in extract, transform, and load (ETL) processes, these difficulties are often apparent. Indeed, permanent tables are sometimes used as staging tables in ETL processes, and the issues discussed in this section apply equally to such tables.

The pros and cons of dynamic sampling

Although it is completely contrary to the philosophy of TSTATS, one pragmatic approach to dealing with wildly varying cardinalities is to rely on dynamic sampling. Unless you explicitly disable it, dynamic sampling will occur automatically on any table that has no object statistics, and that includes temporary tables.

The good thing about dynamic sampling is that it is quite smart and understands all about column correlation and so on, so you don’t have to worry about extended statistics or even know what they are! In fact, you are quite likely to get a fairly decent execution plan when your statement is parsed. And if you reparse your statement the next week with wildly different contents in your temporary table then you are likely to get a wildly different plan that is quite appropriate for your modified workload.
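
If you do decide to rely on dynamic sampling for a particular query, you can at least control the sample level explicitly with the documented DYNAMIC_SAMPLING hint rather than accepting the default. Here is a minimal sketch that uses the PAYMENTS_TEMP table created later in Listing 20-12, with a level of 4 chosen purely for illustration; the Note section of the execution plan will tell you whether dynamic sampling was actually used.

SELECT /*+ dynamic_sampling(t 4) */
       COUNT (*)
  FROM payments_temp t
 WHERE t.paygrade = 10;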

image Tip If you are running 12cR1 or later and you have multiple queries against the same temporary table with the same contents, you can avoid repeated dynamic sampling by gathering session-specific statistics on the temporary table. For details see the specification of the DBMS_STATS.SET_TABLE_PREFS procedure in the PL/SQL Packages and Types Reference manual and look for the GLOBAL_TEMP_TABLE_STATS preference. However, if you use session-specific statistics as an alternative to dynamic sampling you may have to set up extended statistics!
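
To make the tip concrete, the following 12c-only sketch switches the PAYMENTS_TEMP table (created in Listing 20-12 below) to session-private statistics and then gathers them for the current session. The GLOBAL_TEMP_TABLE_STATS preference and its SESSION value are documented; treat the rest as an outline to adapt.

BEGIN
   -- 12cR1 or later only
   DBMS_STATS.set_table_prefs (
      ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname => 'PAYMENTS_TEMP'
     ,pname   => 'GLOBAL_TEMP_TABLE_STATS'
     ,pvalue  => 'SESSION');

   -- Populate the temporary table first, then gather; the statistics
   -- gathered here are visible only to this session.
   DBMS_STATS.gather_table_stats (
      ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname => 'PAYMENTS_TEMP');
END;
/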

All this is very good, but I am sure you can see the potential downside of dynamic sampling. We still have the same risks of poor performance that inspired TSTATS in the first place, as follows:

· An execution plan resulting from dynamic sampling cannot be predicted with certainty and neither can the resulting performance.

· The execution plan created by one session may be used by the same or a different session when the temporary table contents are quite different than those at the time the statement was parsed.

I must confess that when I first implemented TSTATS for the AJAX application I turned a blind eye to these concerns and hoped that stabilizing the execution plans for all statements not involving temporary tables and relying on dynamic sampling for the remaining statements would be good enough for AJAX to provide a stable service. Unfortunately, I was wrong and had to come up with a way to stabilize plans for queries involving temporary tables. I ended up abandoning dynamic sampling and fabricating statistics for temporary tables.

Fabricating statistics for temporary tables

The idea is simple: use DBMS_STATS.SET_TABLE_STATS, DBMS_STATS.SET_COLUMN_STATS, and, if appropriate, DBMS_STATS.SET_INDEX_STATS to set statistics for global temporary tables and their associated columns and indexes. That way you can ensure that the execution plans that involve your global temporary tables are fixed and testable like all your other queries. The problem is, of course, working out which statistics to fabricate.
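
Before looking at a concrete example, here is the bare pattern in its simplest form. It assumes the PAYMENTS_TEMP table created in Listing 20-12 below, and the numbers are placeholders rather than recommendations; Listing 20-14 later packages a more considered version of this idea as a TSTATS procedure.

-- Bare-bones fabrication sketch; the values are placeholders only
BEGIN
   DBMS_STATS.set_table_stats (
      ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname => 'PAYMENTS_TEMP'
     ,numrows => 10000
     ,numblks => 500
     ,avgrlen => 400);

   DBMS_STATS.set_column_stats (
      ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,tabname => 'PAYMENTS_TEMP'
     ,colname => 'PAYMENT_ID'
     ,distcnt => 10000
     ,density => 1 / 10000);

   -- Repeat DBMS_STATS.set_column_stats for the remaining columns and,
   -- if the temporary table has indexes, call DBMS_STATS.set_index_stats too.
END;
/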

To set the stage, let us have a look at Listing 20-12, which performs a merge of a temporary table with a permanent table.

Listing 20-12. Merging a temporary table into a permanent table

CREATE GLOBAL TEMPORARY TABLE payments_temp
AS
SELECT *
FROM payments
WHERE 1 = 0;

MERGE /*+ cardinality(t 3 ) */
INTO payments p
USING payments_temp t
ON (p.payment_id = t.payment_id)
WHEN MATCHED
THEN
UPDATE SET p.employee_id = t.employee_id
,p.special_flag = t.special_flag
,p.paygrade = t.paygrade
,p.payment_date = t.payment_date
,p.job_description = t.job_description
WHEN NOT MATCHED
THEN
INSERT (payment_id
,special_flag
,paygrade
,payment_date
,job_description)
VALUES (t.payment_id
,t.special_flag
,t.paygrade
,t.payment_date
,t.job_description);

-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
-----------------------------------------------------------------------------
| 0 | MERGE STATEMENT | | 3 | 5 (0)|
| 1 | MERGE | PAYMENTS | | |
| 2 | VIEW | | | |
| 3 | NESTED LOOPS OUTER | | 3 | 5 (0)|
| 4 | TABLE ACCESS FULL | PAYMENTS_TEMP | 3 | 2 (0)|
| 5 | TABLE ACCESS BY INDEX ROWID| PAYMENTS | 1 | 1 (0)|
| 6 | INDEX UNIQUE SCAN | PAYMENTS_PK | 1 | 0 (0)|
-----------------------------------------------------------------------------

MERGE /*+ cardinality(t 10000 ) */
INTO payments p
USING payments_temp t
ON (p.payment_id = t.payment_id)
WHEN MATCHED
THEN
UPDATE SET p.employee_id = t.employee_id
,p.special_flag = t.special_flag
,p.paygrade = t.paygrade
,p.payment_date = t.payment_date
,p.job_description = t.job_description
WHEN NOT MATCHED
THEN
INSERT (payment_id
,special_flag
,paygrade
,payment_date
,job_description)
VALUES (t.payment_id
,t.special_flag
,t.paygrade
,t.payment_date
,t.job_description);
-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
-------------------------------------------------------------------
| 0 | MERGE STATEMENT | | 10000 | 176 (1)|
| 1 | MERGE | PAYMENTS | | |
| 2 | VIEW | | | |
| 3 | HASH JOIN OUTER | | 10000 | 176 (1)|
| 4 | TABLE ACCESS FULL| PAYMENTS_TEMP | 10000 | 2 (0)|
| 5 | TABLE ACCESS FULL| PAYMENTS | 120K| 174 (1)|
-------------------------------------------------------------------

Listing 20-12 shows two MERGE statements and their associated execution plans. The only difference between the two statements is that the argument to the CARDINALITY hint is 3 in one case and 10,000 in the other. We can see that in both cases the CBO elects to use the temporary table as the driving table in the join. In the case where the cardinality is 3 the CBO would like to use a nested loops join, but in the case where the cardinality estimate is 10,000 a hash join is preferred.

We would like to fabricate our statistics so that a reasonable execution plan is selected by the CBO without the need to hint our code. But the optimal execution plan seems to vary depending on the number of rows in our temporary table, so what are we to do?

The key to answering this question lies in the answer to two other questions:

· How bad would it be if we used a nested loops join when there are 10,000 rows in the temporary table?

· How bad would it be if we used a hash join when there are only 3 rows in the table?

Listing 20-13 shows how we might begin our thought process.

Listing 20-13. Assessing the impact of incorrect cardinalities

MERGE /*+ cardinality(t 3) leading(t) use_hash(p) no_swap_join_inputs(p) */
INTO payments p
USING payments_temp t
ON (p.payment_id = t.payment_id)
WHEN MATCHED
THEN
UPDATE SET p.employee_id = t.employee_id
,p.special_flag = t.special_flag
,p.paygrade = t.paygrade
,p.payment_date = t.payment_date
,p.job_description = t.job_description
WHEN NOT MATCHED
THEN
INSERT (payment_id
,special_flag
,paygrade
,payment_date
,job_description)
VALUES (t.payment_id
,t.special_flag
,t.paygrade
,t.payment_date
,t.job_description);

-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
-------------------------------------------------------------------
| 0 | MERGE STATEMENT | | 3 | 176 (1)|
| 1 | MERGE | PAYMENTS | | |
| 2 | VIEW | | | |
| 3 | HASH JOIN OUTER | | 3 | 176 (1)|
| 4 | TABLE ACCESS FULL| PAYMENTS_TEMP | 3 | 2 (0)|
| 5 | TABLE ACCESS FULL| PAYMENTS | 120K| 174 (1)|
-------------------------------------------------------------------

MERGE /*+ cardinality(t 10000) leading(t) use_nl(p) */
INTO payments p
USING payments_temp t
ON (p.payment_id = t.payment_id)
WHEN MATCHED
THEN
UPDATE SET p.employee_id = t.employee_id
,p.special_flag = t.special_flag
,p.paygrade = t.paygrade
,p.payment_date = t.payment_date
,p.job_description = t.job_description
WHEN NOT MATCHED
THEN
INSERT (payment_id
,special_flag
,paygrade
,payment_date
,job_description)
VALUES (t.payment_id
,t.special_flag
,t.paygrade
,t.payment_date
,t.job_description);
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
-----------------------------------------------------------------------------
| 0 | MERGE STATEMENT | | 10000 | 10004 (1)|
| 1 | MERGE | PAYMENTS | | |
| 2 | VIEW | | | |
| 3 | NESTED LOOPS OUTER | | 10000 | 10004 (1)|
| 4 | TABLE ACCESS FULL | PAYMENTS_TEMP | 10000 | 2 (0)|
| 5 | TABLE ACCESS BY INDEX ROWID| PAYMENTS | 1 | 1 (0)|
| 6 | INDEX UNIQUE SCAN | PAYMENTS_PK | 1 | 0 (0)|
-----------------------------------------------------------------------------

Listing 20-13 shows the CBO cost estimate when we force the optimal plan for a 10,000-row temporary table onto a 3-row temporary table, as well as the cost estimate when we force the optimal plan for a 3-row temporary table onto a 10,000-row temporary table. What we can see in this specific example is that a nested loops join applied to a 10,000-row table would be disastrous: the cost jumps from 176 to over 10,000. However, the use of a hash join when only 3 rows are selected from our temporary table is far less serious: the cost jumps from 5 to 176, a big factor, but one that probably adds only a second or two of elapsed time. So we can fabricate our temporary table statistics to indicate 10,000 rows in our table and all should be well, at least in this example.

If your queries apply predicates to specific columns of the temporary table, you might want to prevent those predicates from reducing the CBO’s cardinality estimates to the point where a nested loops join is chosen. Listing 20-14 shows that by setting the value of NUM_DISTINCT to 2 for each column and bumping up NUM_ROWS for the table to, say, 20,000 we can arrange for a hash join to be always, or almost always, selected.

Listing 20-14. Setting statistics to increase the chances of a hash join

CREATE OR REPLACE PACKAGE tstats
AS
-- Other procedures omitted
PROCEDURE set_temp_table_stats (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE
,p_numrows INTEGER DEFAULT 20000
,p_numblks INTEGER DEFAULT 1000
,p_avgrlen INTEGER DEFAULT 400);
-- Other procedures omitted
END tstats;
/

CREATE OR REPLACE PACKAGE BODY tstats
AS
-- Other procedures omitted
PROCEDURE set_temp_table_stats (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE
,p_numrows INTEGER DEFAULT 20000
,p_numblks INTEGER DEFAULT 1000
,p_avgrlen INTEGER DEFAULT 400)
IS
distcnt NUMBER;
BEGIN
DBMS_STATS.unlock_table_stats (ownname => p_owner
,tabname => p_table_name);
$IF dbms_db_version.version >=12
$then
DBMS_STATS.set_table_prefs (ownname => p_owner
,tabname => p_table_name
,pname => 'GLOBAL_TEMP_TABLE_STATS'
,pvalue => 'SHARED');
$END
DBMS_STATS.delete_table_stats (ownname => p_owner
,tabname => p_table_name);
DBMS_STATS.set_table_stats (ownname => p_owner
,tabname => p_table_name
,numrows => p_numrows
,numblks => p_numblks
,avgrlen => p_avgrlen
,no_invalidate => FALSE);
/*

We must now set column statistics to limit the effect of predicates on cardinality
calculations; by default cardinality is reduced by a factor of 100 for each predicate.

We use a value of 2 for the number of distinct values to reduce this factor to 2. We
do not use 1 because predicates of the type "column_1 <> 'VALUE_1'" would reduce the
cardinality to 1.

*/
distcnt := 2;

FOR r IN (SELECT *
FROM all_tab_columns
WHERE owner = p_owner AND table_name = p_table_name)
LOOP
DBMS_STATS.set_column_stats (ownname => p_owner
,tabname => r.table_name
,colname => r.column_name
,distcnt => distcnt
,density => 1 / distcnt
,srec => NULL
,no_invalidate => FALSE);
END LOOP;

DBMS_STATS.lock_table_stats (ownname => p_owner, tabname => p_table_name);
END set_temp_table_stats;
-- Other procedures omitted
END tstats;
/

EXEC tstats.set_temp_table_stats(p_table_name=>'PAYMENTS_TEMP');

SELECT *
FROM payments_temp JOIN payments USING (payment_id);

-----------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
-----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 20000 | 845 (1)|
| 1 | HASH JOIN | | 20000 | 845 (1)|
| 2 | TABLE ACCESS FULL| PAYMENTS_TEMP | 20000 | 272 (0)|
| 3 | TABLE ACCESS FULL| PAYMENTS | 120K| 174 (1)|
-----------------------------------------------------------------

SELECT *
FROM payments_temp t JOIN payments p USING (payment_id)
WHERE t.paygrade = 10 AND t.job_description='INTERN';

-----------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
-----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5000 | 447 (1)|
| 1 | HASH JOIN | | 5000 | 447 (1)|
| 2 | TABLE ACCESS FULL| PAYMENTS_TEMP | 5000 | 272 (0)|
| 3 | TABLE ACCESS FULL| PAYMENTS | 120K| 174 (1)|
-----------------------------------------------------------------

SELECT *
FROM payments_temp t JOIN payments p USING (payment_id)
WHERE t.paygrade != 10 AND t.payment_date > SYSDATE - 31;

-----------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)|
-----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6697 | 447 (1)|
| 1 | HASH JOIN | | 6697 | 447 (1)|
| 2 | TABLE ACCESS FULL| PAYMENTS_TEMP | 6697 | 272 (0)|
| 3 | TABLE ACCESS FULL| PAYMENTS | 120K| 174 (1)|
-----------------------------------------------------------------

Listing 20-14 shows another procedure in the TSTATS package. This procedure fabricates statistics for temporary tables and includes a conditional call to set a 12c-specific table preference. After fabricating statistics for PAYMENTS_TEMP, Listing 20-14 shows the execution plans for several two-table joins that apply different combinations of predicates to our temporary table. Because the cardinality estimates remain quite high, a hash join is used in all cases and the cost estimate remains reasonable. Of course, in real life you wouldn’t rely on the CBO’s estimates; you would run some real workloads and measure actual elapsed times.
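
If you want to see exactly what the procedure has left behind, the fabricated values are visible in the standard dictionary views. A quick check along these lines should show NUM_ROWS at 20,000 for the table and NUM_DISTINCT of 2 with a density of 0.5 for every column.

SELECT table_name, num_rows, blocks
  FROM all_tab_statistics
 WHERE owner = SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
   AND table_name = 'PAYMENTS_TEMP';

SELECT column_name, num_distinct, density
  FROM all_tab_col_statistics
 WHERE owner = SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
   AND table_name = 'PAYMENTS_TEMP';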

THE RIGHT VALUES FOR COLUMN STATISTICS

In the examples shown in Listing 20-14 the cardinality estimate is kept high by having a low value of NUM_DISTINCT for all columns used in predicates. However, a low value of NUM_DISTINCT will also produce a low cardinality estimate when such a column is used in a GROUP BY clause. Some experimentation may be required, and you may well need to resort to hinting in some cases.

The listings have only looked at one SQL statement, and this statement may or may not be typical of the SQL statements in your application. Nevertheless, hash joins often represent the best compromise when the number of rows in a table varies a great deal.

You shouldn’t take this discussion of temporary table statistics as a prescriptive solution. Rather, you should consider it as a template for the type of analysis you need to do. If you do your analysis and find that you really must have different execution plans for different sizes of temporary tables, the types of solutions shown in Listings 6-6 and 6-7 are always available as last resorts.

Once you have gathered or set all of the statistics for all of your tables, including the temporary tables, you should have stable execution plans for all the statements in your application on your test system. It is now time to look at how to replicate that stability to other test systems as well as to production.

How to Deploy TSTATS

As I explained in Chapter 6, the TSTATS philosophy is to treat object statistics like application code. And just like application code, object statistics should be placed under version control and be deployed to production and formal test systems using automated deployment scripts according to change control rules. How is this best done?

The process begins on the test system where your object statistics are gathered. The first step is to export the object statistics for the tables in your application to an export table created with the DBMS_STATS.CREATE_STAT_TABLE procedure. For the purposes of discussion, let us assume that your application is called AJAX and that your export table is called TSTATS_AJAX. You now need to use TSTATS_AJAX as the basis for a deployment script that is placed under version control. You can do this by converting the rows in TSTATS_AJAX to insert statements or by creating a SQL*Loader file. Tools such as SQL Developer from Oracle and TOAD from Dell include features for generating such scripts. Your deployment script can create TSTATS_AJAX on your target system and then load the data directly into it using SQL*Plus or SQL*Loader as appropriate.
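
A minimal sketch of the export step might look like the following. It assumes that the application schema is the current schema and that every table in it should be included, which may well not be true for your system; CASCADE defaults to TRUE, so column and index statistics are exported along with the table statistics.

BEGIN
   -- Create the statistics export table
   DBMS_STATS.create_stat_table (
      ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
     ,stattab => 'TSTATS_AJAX');

   -- Export statistics for every application table
   FOR r IN (SELECT table_name
               FROM user_tables
              WHERE table_name <> 'TSTATS_AJAX')
   LOOP
      DBMS_STATS.export_table_stats (
         ownname => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
        ,tabname => r.table_name
        ,stattab => 'TSTATS_AJAX');
   END LOOP;
END;
/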

image Note You may think that the size of your deployment script will be huge if it includes an insert statement for every table, index, and table column in your application schema. However, bear in mind that you will not have statistics for table partitions, so things probably won’t be as bad as you might think.

The final part of your deployment script should import the data from TSTATS_AJAX into the data dictionary on production or on your target test system. Listing 20-15 shows a procedure in the TSTATS package that performs this import.

Listing 20-15. Importing object statistics into the data dictionary

CREATE OR REPLACE PACKAGE tstats
AS
-- Other procedures cut
PROCEDURE import_table_stats (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE
,p_statown all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_stat_table all_tab_col_statistics.table_name%TYPE);
-- Other procedures cut
END tstats;
/

CREATE OR REPLACE PACKAGE BODY tstats
AS
-- Other procedures cut
PROCEDURE import_table_stats (
p_owner all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_table_name all_tab_col_statistics.table_name%TYPE
,p_statown all_tab_col_statistics.owner%TYPE DEFAULT SYS_CONTEXT (
'USERENV'
,'CURRENT_SCHEMA')
,p_stat_table all_tab_col_statistics.table_name%TYPE)
IS
BEGIN
DECLARE
already_up_to_date EXCEPTION;
PRAGMA EXCEPTION_INIT (already_up_to_date, -20000);
BEGIN
DBMS_STATS.upgrade_stat_table (ownname => p_statown
,stattab => p_stat_table);
EXCEPTION
WHEN already_up_to_date
THEN
NULL;
END;

DBMS_STATS.unlock_table_stats (ownname => p_owner
,tabname => p_table_name);
DBMS_STATS.delete_table_stats (ownname => p_owner
,tabname => p_table_name
,no_invalidate => FALSE);
DBMS_STATS.import_table_stats (ownname => p_owner
,tabname => p_table_name
,statown => p_statown
,stattab => p_stat_table
,no_invalidate => FALSE);

-- For partitioned tables it may be that the number of (sub)partitions on
-- the target systems do not match those on the source system.
FOR r
IN (SELECT *
FROM all_tables
WHERE owner = p_owner
AND table_name = p_table_name
AND partitioned = 'YES')
LOOP
adjust_global_stats (p_owner, p_table_name, 'PMOP');
END LOOP;

DBMS_STATS.lock_table_stats (ownname => p_owner, tabname => p_table_name);
END import_table_stats;
-- Other procedures cut
END tstats;
/

In case the target system is running a later version of the Oracle database software than the source system, a call is made to DBMS_STATS.UPGRADE_STAT_TABLE to bring TSTATS_AJAX up to date, ignoring any error generated when TSTATS_AJAX is already up to date. The following actions are then performed for each table referenced by TSTATS_AJAX:

· Remove any existing statistics. This is necessary to prevent a mixture of statistics appearing.

· Import the object statistics into the data dictionary using a call to DBMS_STATS.IMPORT_TABLE_STATS.

· Adjust the global statistics for partitioned tables, as the number of partitions on the target system may not match those on the source system.

· Lock the imported statistics.
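
On the target system, the tail of the deployment script can then loop over the application tables and call the import routine. The sketch below drives the loop from USER_TABLES purely for illustration; in practice you would drive it from whatever list of tables your deployment process maintains.

BEGIN
   FOR r IN (SELECT table_name
               FROM user_tables
              WHERE table_name <> 'TSTATS_AJAX')
   LOOP
      tstats.import_table_stats (
         p_owner      => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
        ,p_table_name => r.table_name
        ,p_statown    => SYS_CONTEXT ('USERENV', 'CURRENT_SCHEMA')
        ,p_stat_table => 'TSTATS_AJAX');
   END LOOP;
END;
/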

If in a later release of your application software you wish to add new tables or indexes, you can simply create a new version of your deployment script that includes the additional objects, or you can create a new script specific to those objects. You might also include the TSTATS.IMPORT_TABLE_STATS call in the DDL script that creates the new table or index. It is entirely a question of the procedures your organization uses to deploy new objects.

Summary

TSTATS is an approach for managing object statistics that is radically different from the standard approach suggested by Oracle. Nevertheless, TSTATS uses no unsupported features of the Oracle database and effectively allows changes to execution plans to be placed under change control processes without the need for repositories of SQL statements that are difficult, if not impossible, to reconcile with application code.

This chapter has provided some ideas for how you might set up TSTATS. However, TSTATS is not a “plug-and-play” tool, and a considerable amount of analysis and customization will usually be required before TSTATS can be successfully deployed for a complex commercial application. Hopefully this chapter has shown the way.

I hope you have found reading this book an educational and enjoyable experience. Good luck!