Generating Candidate Execution Plans
The basic purpose of the Query Optimizer is to find an
efficient execution plan for your query. Even for relatively simple
queries, there may be a large number of different ways to access the
data to produce the same end result. As such, the Query Optimizer has to
select the best possible plan from what may be a very large number of
candidate execution plans, and it’s important that it makes a wise
choice, as the time it takes to return the results to the user can vary
wildly, depending on which plan is selected.
The job of the Query Optimizer is to create and assess as many
candidate execution plans as possible, within certain criteria, in order
to arrive at the best possible plan. We define the search space
for a given query as the set of all possible execution plans for
that query; every plan in this search space returns the same
results. Theoretically, in order to find the optimum execution plan for a
query, a cost-based query optimizer should generate all possible
execution plans that exist in that search space and correctly estimate
the cost of each plan. However, some complex queries may have thousands
or even millions of possible execution plans and, while the SQL Server
Query Optimizer can typically consider a large number of candidate
execution plans, it cannot perform an exhaustive search of all the
possible plans for every query. If it did, then the time taken to assess
all of the plans would be unacceptably long, and could start to have a
major impact on the overall query execution time.
The Query Optimizer must strike a balance between optimization time
and plan quality. For example, if the Query Optimizer spends one second
finding a good enough plan that executes in one minute, then it doesn’t
make sense to spend five minutes of optimization time searching for the
perfect plan, as the total time taken would only increase. So SQL
Server does not do an exhaustive search, but instead tries
to find a suitably efficient plan as quickly as possible. As the Query
Optimizer is working within a time constraint, the plan it selects may
happen to be the optimal plan, but it is more likely to be merely
close to optimal.
In order to explore the search space, the Query Optimizer uses
transformation rules and heuristics. The generation of candidate
execution plans is performed inside the Query Optimizer using
transformation rules, and the use of heuristics limits the number of
choices considered in order to keep the optimization time reasonable.
Candidate plans are stored in memory during the optimization, in a
component called the Memo.
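To make the idea of the Memo concrete, here is a minimal conceptual sketch in Python. This is an illustration of the general technique, not SQL Server's actual implementation: each group collects logically equivalent expressions, so alternative plans can share common subplans instead of duplicating them.

```python
# Conceptual sketch of a Memo-like structure: each group holds a set of
# logically equivalent expressions (e.g. alternative join orders or
# physical operator choices), deduplicated so candidate plans can share
# common subexpressions. The group ids and expression tuples used below
# are purely hypothetical.

class Memo:
    def __init__(self):
        self.groups = {}  # group id -> list of equivalent expressions

    def add(self, group_id, expression):
        # Store an expression in its equivalence group, skipping duplicates.
        exprs = self.groups.setdefault(group_id, [])
        if expression not in exprs:
            exprs.append(expression)
        return expression

memo = Memo()
# Two logically equivalent alternatives land in the same group:
memo.add("join(A,B)", ("NestedLoops", "A", "B"))
memo.add("join(A,B)", ("HashJoin", "B", "A"))
# Adding a duplicate changes nothing:
memo.add("join(A,B)", ("NestedLoops", "A", "B"))
print(len(memo.groups["join(A,B)"]))  # 2 alternatives, one group
```

The point of the structure is that transformation rules can keep generating alternatives without the stored set of candidate plans growing as fast as the raw number of plan trees.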
Assessing the Cost of Each Plan
Searching, or enumerating, candidate plans is just one part of the
optimization process. The Query Optimizer still needs to estimate the
cost of these plans and select the least expensive one. To estimate the
cost of a plan, it estimates the cost of each physical operator in that
plan using costing formulas that consider the use of resources such as
I/O, CPU, and memory. This cost estimation depends mostly on the
algorithm used by the physical operator, as well as the estimated number
of records that will need to be processed; this estimate of the number
of records is known as the cardinality estimation.
To help with this cardinality estimation, SQL Server uses and
maintains optimizer statistics, which contain statistical information
describing the distribution of values in one or more columns of a table.
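A toy sketch can show how such statistics feed a cardinality estimate. This is only an illustration of the idea, not SQL Server's actual statistics format: a histogram records how many rows fall in each range of values, and an equality predicate is estimated from the bucket containing the value, assuming values are evenly spread within the bucket.

```python
# Toy histogram-based cardinality estimation. The histogram below is
# hypothetical: each step is (upper_bound, rows_in_range, distinct_values).

histogram = [(100, 1000, 50), (200, 3000, 100), (300, 500, 25)]

def estimate_equals(value):
    """Estimate rows matching `column = value`, assuming a uniform
    distribution of distinct values inside each histogram step."""
    for upper, rows, distinct in histogram:
        if value <= upper:
            return rows / distinct  # average rows per distinct value
    return 0  # value falls outside the histogram

print(estimate_equals(150))  # 3000 rows / 100 distinct values = 30.0
```

A predicate on a value in a dense bucket gets a higher row estimate than one in a sparse bucket, which is exactly the information the optimizer needs to choose, say, an index seek over a scan.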
Once the cost for each operator is estimated using estimations of
cardinality and resource demands, the Query Optimizer will add up all of
these costs to estimate the cost for the entire plan.
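The summation step can be sketched as follows. The per-row cost figures and the three-operator plan are invented for illustration; SQL Server's real cost model is far richer, but the shape of the computation, per-operator costs driven by estimated row counts and then summed, is the same.

```python
# Minimal sketch of costing a plan by adding up per-operator estimates.
# Each operator's cost combines an I/O and a CPU component, both scaled
# by the estimated number of rows it processes (the cardinality estimate).

def operator_cost(rows, io_cost_per_row, cpu_cost_per_row):
    return rows * (io_cost_per_row + cpu_cost_per_row)

# A hypothetical three-operator plan: (name, estimated rows, I/O, CPU).
plan = [
    ("Table Scan", 10_000, 0.001, 0.0001),
    ("Filter",     10_000, 0.0,   0.0001),
    ("Sort",        2_000, 0.0,   0.0005),
]

total = sum(operator_cost(rows, io, cpu) for _, rows, io, cpu in plan)
print(round(total, 2))  # 11.0 + 1.0 + 1.0 = 13.0
```

Note how an error in the row estimates propagates directly into the plan's total cost, which is why cardinality estimation quality matters so much.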
Query Execution and Plan Caching
Once the query is optimized, the resulting plan is used by the
Execution Engine to retrieve the desired data. The generated execution
plan may be stored in memory, in the plan cache (known as the procedure
cache in previous versions of SQL Server) in order that it might be
reused if the same query is executed again. If a valid plan is available
in the plan cache, then the optimization process can be skipped and the
associated cost of this step, in terms of optimization time, CPU
resources, and so on, can be avoided.
However, reuse of an existing plan may not always be the best
solution for a given query. Depending on the distribution of data within
a table, the optimal execution plan for a given query may differ
greatly depending on the parameter values provided with that query, and a
behavior known as parameter sniffing may result in a suboptimal plan
being chosen.
Even when an execution plan is available in the plan cache, some
metadata changes, such as removing an index or a constraint, or
significant enough changes made to the contents of the database, may
render an existing plan invalid or suboptimal, and thus cause it to be
discarded from the plan cache and a new optimization to be performed. As
a trivial example, removing an index will make a plan invalid if the
index is used by that plan. Likewise, the creation of a new index could
make a plan suboptimal, if this index could be used to create a more
efficient alternative plan, and enough changes to the database contents
may trigger an automatic update of statistics, with the same effect on
the existing plan.
Plans may also be removed from the plan cache when SQL Server is
under memory pressure or when certain statements are executed. Changing
some configuration options, for example max degree of parallelism, will
clear the entire plan cache. Alternatively, some statements, like
altering a database with certain ALTER DATABASE options will clear all the plans associated with that particular database.
Hinting
Most of the time, the Query Optimizer does a great job at choosing
highly efficient execution plans. However, there may be cases when the
selected execution plan does not perform as expected. It is vitally
important to differentiate between cases that arise because you are not
providing the Query Optimizer with all the information it needs to do a
good job, and those caused by a Query Optimizer limitation.
The reality is that, even after more than 30 years of research, query
optimizers are highly complex pieces of software which still face some
technical challenges, some of which will be mentioned in the next
section. As a result, there may be cases when, even after you’ve
provided the Query Optimizer with all the information it needs and there
doesn’t seem to be any apparent problem, you are still not getting an
efficient plan; in these cases you may want to resort to hints. However,
since hints let you override the operations of the Query Optimizer,
they need to be used with caution, and only as a last resort when no
other option is available. Hints are instructions that you can send to
the Query Optimizer to influence a particular area of an execution plan.
For example, you can use hints to direct the Query Optimizer to use a
particular index or a specific join algorithm. You can even ask the
Query Optimizer to use a specific execution plan, provided that you
specify one in XML format.
Ongoing Query Optimizer Challenges
Query optimization is an inherently complex problem, not only in SQL
Server but in any other relational database system. Despite the fact
that query optimization research dates back to the early seventies,
challenges in some fundamental areas are still being addressed today.
The first major impediment to a query optimizer finding an optimal plan
is the fact that, for many queries, it is just not possible to explore
the entire search space. An effect known as combinatorial explosion
makes this exhaustive enumeration impossible, as the number of possible
plans grows very rapidly depending on the number of tables joined in the
query. To make the search a manageable process, heuristics are used to
limit the search space. However, if a query optimizer is not able to
explore the entire search space, there is no way to prove that you can
get an absolutely optimal plan, or even that the best plan is among the
candidates being considered. As a result, it is clearly extremely
important that the set of plans which a query optimizer considers
contains plans with low costs.
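The growth rate behind this combinatorial explosion is easy to see in numbers. As a simple illustration, counting only left-deep join orders (one well-known subset of the search space), the number of orderings of n tables is n!:

```python
# Even restricted to left-deep join trees, the number of join orders for
# n tables is n! (n factorial), so exhaustive enumeration quickly becomes
# infeasible as tables are added to the query.

from math import factorial

for n in (2, 4, 6, 8, 10, 12):
    print(n, factorial(n))  # tables joined -> left-deep join orders
```

By 12 tables there are already over 479 million left-deep orderings alone, before even considering bushy join trees, join algorithms, or index choices; hence the need for heuristics to prune the search.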
This leads us to another major technical challenge for the Query
Optimizer: accurate cost and cardinality estimation. Since a cost-based
optimizer selects the execution plan with the lowest cost, the quality
of the plan selection is only as good as the accuracy of the optimizer’s
cost and cardinality estimations. Even supposing that time is not a
concern and that the query optimizer can analyze the entire search space
without a problem, cardinality and cost estimation errors can still
make a query optimizer select the wrong plan. Cost estimation models
are inherently inexact, as they do not consider all of the hardware
conditions, and must necessarily make certain assumptions about the
environment. For example, the costing model assumes that every query
starts with a cold cache; that is, its data is read from disk and not
from memory, and this assumption could lead to costing estimation errors
in some cases. In addition, cost estimation relies on cardinality
estimation, which is also inexact and has some known limitations,
especially when it comes to the estimation of the intermediate results
in a plan. On top of all that, there are some operations which are not
covered by the mathematical model of the cardinality estimation
component, which has to resort to guess logic or heuristics to deal with
these situations.
A Historical Perspective
We’ve seen some of the challenges query optimizers still face today,
but these imperfections are not for want of time or research. One of
these earliest works describing a cost-based query optimizer was “Access
Path Selection in a Relational Database Management System”, published
in 1979 by Pat Selinger et al to describe the query optimizer
for an experimental database management system developed in 1975 at what
is now the IBM Almaden Research Center. This "System-R" management
system advanced the field of Query Optimization by introducing the use
of cost-based query optimization, the use of statistics, an efficient
method of determining join orders, and the addition of CPU cost to the
optimizer's cost estimation formulae.
Yet despite being an enormous influence in the field of query
optimization research, it suffered a major drawback: its framework could
not be easily extended to include additional transformations. This led
to the development of more extensible optimization architectures, which
facilitated the gradual addition of new functionality to query
optimizers. The trailblazers in this field were the Exodus Extensible
DBMS Project, and later the Volcano Optimizer generator, the latter of
which was defined by Goetz Graefe (who was also involved in the Exodus
Project) and William McKenna. Goetz Graefe then went on to define the
Cascades Framework, resolving errors which were present in his previous
two endeavors.
While this is interesting, what's most relevant for you and me is
that SQL Server implemented its own cost-based Query Optimizer based on
the Cascades Framework in 1999, when its database engine was
re-architected for the release of SQL Server 7.0. The extensible
architecture of the Cascades Framework has made it much easier for new
functionality, such as new transformation rules or physical operators,
to be implemented in the Query Optimizer. This allows the performance of
the Query Optimizer to be constantly tuned and improved.
