Table of Contents
Understanding the Informix Optimizer
What Is the Optimizer?
The optimizer is the part of the Informix engine that anticipates the best route to take before running a given query. The output from SET EXPLAIN ON explains how the optimizer chose to access the data. With this information available, you might discover that the optimizer chose a path that is time consuming, and you can take steps to restructure the query.
How Does the Optimizer Make Its Decision?
The main aim of the optimizer is to reduce I/O by limiting the amount of data to search in order to obtain the requested data in the most efficient way. The optimizer makes its decisions based on information in the system catalogues. This information consists of:
- The number of rows in each table used in the query (systables.nrows).
- How many pages are used for data and how many pages are used for indexes (systables.npused).
- Whether columns values are unique (sysconstraints).
- What indexes exist (sysindexes).
- Whether the indexes are ascending or descending (sysindexes). (No longer required in version 7 databases.)
- Whether the data is in the same order as the index—that is, clustered (sysindexes.clust).
- How many levels are in an index (sysindexes.levels).
- What the second largest and second lowest values are for each column. This gives the optimizer a rough idea of the range of values (syscolumns.colmin and colmax). Version 7 can obtain more detail about the distribution of data (sysdistrib).
Using all this information, the optimizer determines all possible routes and then weighs each method with an estimated cost. After the optimizer selects a plan that it believes is the most cost effective, it passes the query for processing, and if SET EXPLAIN ON is in effect, the chosen method is recorded in a file.
How Accurate Are Your Statistics?
It is very important to remember that the information in the system catalogues used by the optimizer is only updated when the UPDATE STATISTICS command is executed. The optimizer is only as good as the information it is provided, so, rule number one is to execute UPDATE STATISTICS as regularly as possible. You should execute UPDATE STATISTICS more often for very dynamic tables than for tables that rarely change.
When Does the Engine Perform the Optimization?
The optimization occurs whenever an SQL statement is prepared, providing there are no host variables. If there are host variables, the optimizer does not have all the information required until the variables are passed (when the cursor is opened), and in this case, the optimization occurs on the OPEN statement.
Even after running UPDATE STATISTICS regularly in production, you may notice one day that a query that previously took 10 minutes suddenly takes one hour. This change can be the result of the optimizer choosing a new path based on new information in the system catalogues. This may mean the query needs restructuring to influence the optimizer to make a better decision. Because of this circumstance, you should have the capability to use SET EXPLAIN ON for any program in production instead of being forced to recompile the program to find out the query plan. Consider building a simple function into all programs that enables you to either pass a parameter or set an environment variable to turn SET EXPLAIN ON.
Development Versus Production
If you test a query in development on a small database, the statistics may be quite different from what is in production. A query might appear to perform well in development, but once it is in production, the optimizer might choose (rightly or wrongly) to take a totally different route. The only way to get around this problem in the quality assurance process is to run the query against a test database that is a similar size to the production database with the same data. Alternatively (providing the database schemas are the same), one might consider manually updating some of the columns in the system catalogue tables after running UPDATE STATISTICS on the test database so the optimizer makes the same decisions.
Controlling the Optimizer
The optimizer can be influenced by using the SET OPTIMIZATION command or by altering the default setting of OPTCOMPIND.
Obtaining Data Distribution for the Optimizer
Data distribution analysis gives the optimizer a better understanding of the values contained in a column, such as how many unique values are in each area of the table. This information is provided by sampling the data in a column and storing information about sections of the table in various bins. This information can be extremely valuable to the optimizer when dealing with large tables. To generate distribution for a column, you use the UPDATE STATISTICS command.
An UPDATE STATISTICS Strategy for OnLine DSA Version 7
The recommended UPDATE STATISTICS strategy for Informix DSA version 7 is to perform the following steps (in the same order):
- Run UPDATE STATISTICS MEDIUM on all tables using DISTRIBUTIONS ONLY without listing columns and using the default RESOLUTION parameters. If you use a version before 7.10.UD1, it is better to actually list the columns that do not head an index for better performance.
- Run UPDATE STATISTICS HIGH for all columns that head an index or the columns that are definitely part of a query. Execute a separate command for each column.
- Run UPDATE STATISTICS LOW for all other columns in composite indexes.
Examining the Optimizer's Choice
When you are comfortable knowing that you supplied the optimizer with enough information, you can see what query plan the Informix optimizer chose. To do this, use SET EXPLAIN ON within the query. When this is set, the optimizer writes output for all queries (for the same process) to a file called sqexplain.out in the current directory.
Understanding the SET EXPLAIN Output
This section discusses each line of the preceding code in detail. Each code line is shown in bold, with a detailed explanation following it.
This section of the output shows the actual query that was optimized. LOW is displayed if SET OPTIMIZATION is set to LOW. Note that sqexplain.out is appended to if the file already exists.
This value is simply a number the optimizer assigned to the chosen access method. The value is not meaningful except to the optimizer because it bears no relationship to real time. It cannot be compared to the estimated cost of other queries and is best ignored. You can use it, however, to compare changes made for the same query (such as an index change).
Estimated # of Rows Returned:
This is the optimizer's estimate based on information in the system catalogue tables. Remember that the catalogue information is fairly limited (especially before version 7), so this value is often inaccurate (more so if the query involves a join). In OnLine DSA version 7, you can obtain distribution information for the data, which helps the optimizer estimate the number of rows more accurately.
Temporary Files Required For: Order By | Group By
When this is displayed, there is a GROUP BY or an ORDER BY statement in the query, and the optimizer determined that there is no corresponding index available to obtain the data in the required order. A temporary file will be created to order the result set. This file could be very large (depending on the size of tables), so check available disk space and be aware of the effect this sort could have on performance. You cannot use indexes when the columns to be sorted come from more than one table. Note that in version 7, the optimizer can choose to traverse an index in the direction of the ORDER BY regardless of whether the INDEX is in the same order as the ORDER BY. Before version 7, the capability of the optimizer to use the index for an ORDER BY depended on whether the ASCENDING and DESCENDING values on the index and the ORDER BY matched.
1) owner.table: INDEX PATH (Key-Only)
This is the table that the optimizer chose to read first (indicated by the 1). Subsequent table accesses (for a nested loop join, for example) are displayed further down in the explain output and are indicated by a higher number. For each row returned at this level, the engine will query the tables at a lower level. INDEX PATH indicates an index will be used to access this table.
The (Key-Only) notation (with OnLine only) indicates that only the index will be read and the actual data value (row) will not be read from this table. Key-only access is generally very efficient (before version DSA 7.2 ) due to the smaller size of the index compared to the row. Not only is the read for the data row eliminated, but also more index key values are likely to fit on the one page, which in turn reduces I/O. This type of access is achieved only if no columns are selected from the same table. Avoid using SELECT * if possible and select only the required columns. Note that with OnLine DSA 7.2, key-only reads are in fact slower in most cases due to the read-ahead capabilities.
(1)Indexkeys: column_name Lower Index Filter: owner.table.column > x Upper Index Filter: owner.table.column < y
column_name is the name of the column to be used in the INDEX PATH read.
Lower Index Filter shows the first key value (x) where the index read will begin.
Upper Index Filter shows the key value (y) where the index read will stop.
1) owner.table: SEQUENTIAL SCAN (Serial, fragments: ALL)
In the preceding case, all rows will be read from this table using a sequential scan.
The section in parentheses relates to version 7. If Parallel is displayed instead of Serial, the engine will perform a parallel scan. (This behavior is influenced by the PDQPRIORITY setting.) The ALL notation indicates that all fragments must be scanned because the optimizer cannot eliminate fragments after examining the WHERE predicates. NONE indicates the opposite; that is, the optimizer eliminated all fragments and therefore none must be examined. A number (or list of numbers) indicates that the engine will examine only the fragments listed. (Numbers are relative to the order in the sysfragments table.)
Pay special attention if the sequential scan is performed at a lower level in the query plan (indicated by a higher number) because this could mean the whole table is scanned for each row returned in a previous step. Often, this is one of the warning bells when optimizing or performing quality assurance on a query. Sequential scans are not so bad when they are for small tables or when they are in the first step of a query plan, providing that the engine does not have to scan a large table to retrieve a fraction of the table.
AUTOINDEX PATH: owner.table.column
This statement is used more in version 4. To avoid sequential access, a temporary index is built on the owner.table.column to perform a join. You see this statement if an index does not exist on the join column, and it is generally an indication that you need a permanent index.
SORT SCAN: owner.table.column
This statement is used in combination with a sequential scan when no index is available on the join column. The owner.table.column will be sorted for later use with a join.
MERGE JOIN Merge Filters: owner.table.column = owner.table.column
A merge join is used to join the results of the two previous selections sets, which were prepared for a join. After the join columns are obtained in the appropriate order (possibly via a SORT SCAN if an index does not exist), the server sequentially reads both result sets and merges them before accessing the rows. A merge join is considered faster than a nested loop join in many cases.
DYNAMIC HASH JOIN (Build Outer) Dynamic Hash Filters: owner.tab1.col = owner.tab2.column …
In version 7 only, a hash join is used to join the two preceding tables in the explain output. The Build Outer notation tells you which table is used first. The filter shows how the tables will be joined. When some complex queries cannot use an index, the hash join takes over. A hash join is also used instead of a sort-merge join and is considered more efficient. Whereas a sort-merge join sorts both tables, a hash join typically sorts only one. Hash joins are favored with large amounts of data, especially for parallel database queries (PDQ) with fragmentation. Rows are placed in a hash table after using an internal hash algorithm. The cost of a hash join can be lower than using an index, especially when more than around 15 percent of data from a large table must be scanned. When the data is not clustered (in a different order to the index), the cost of traversing the index in addition to retrieving the actual rows (in a different order) is quite high compared to a table scan with a hash join. OPTCOMPIND=2 (which is the default) causes the optimizer to consider hash joins instead of an index. Note that OPTCOMPIND should be set to 1 when REPEATABLE READ is used (and arguably should be the default).