OLE DB Programmer's Reference

Special-Purpose Rowsets

OLE DB uses rowset objects not only as a means to represent consumer data, but as a tool to represent information such as the following:

Table Statistics

The OLE DB TABLE_STATISTICS schema rowset and histogram rowset, along with associated properties, enables a specialized audience of consumers to write efficient distributed query processors by giving providers the means to expose useful statistics on base tables. All statistics are returned in the form of rowsets.

Using table statistics, the consumer can estimate the number of rows that will be returned by a query. Three kinds of statistics are provided: column cardinalities, tuple cardinalities, and histograms.

Column and Tuple Cardinality

In general, cardinality is the number of elements in a given mathematical set, and a tuple is a set with a specified number of ordered elements.

Column cardinality is the number of unique values of a column in a table. Tuple cardinality is the number of unique values of a tuple in a table. Column and tuple cardinality is used with other information to estimate the number of rows that would match a given predicate (for example, the WHERE clause of a SELECT statement).

Histogram

In general, a histogram consists of a sorted set of ranges versus the number of elements within each range.

Histogram rowsets contain information regarding the distribution of values in a column. In the absence of histograms, query processors have to rely on a uniform distribution of values. With histograms, a query processor can make more accurate cardinality estimates.

Specific Features

The following features support the ability to obtain statistical information:

A provider may choose to support both cardinalities and histograms, only cardinalities, or only histograms.

The provider may choose to optimize its implementation of histogram and cardinality information by sampling only a subset of the rows in the base table. In this case, the information reported in the histogram and TABLE_STATISTICS rowsets should be scaled up to the total number of rows in the base table and not merely the sample size.

Whether information returned in the TABLE_STATISTICS schema and histogram rowsets is up-to-date with the latest table contents is provider-specific. Consumers should not assume the statistics are current. The provider may choose its own strategy in how often statistics are updated.

Example Table

The following table will be used to illustrate concepts in subsequent discussions. The columns are arbitrarily named k1, k2, and k3, and their values have no special meaning other than being easy to work with.

k1 k2 k3
10 10 10
10 20 40
10 20 40
20 30 40
30 40 50

TABLE_STATISTICS Schema Rowset

The TABLE_STATISTICS schema rowset describes the available set of statistics on tables in the provider. Each statistic is identified by a fully qualified name (catalog/schema/name), like other database objects, and corresponds to statistics data on one or more columns of a table.

Consider a statistic defined over n ordered set of columns, c1, c2, ... cn. A provider can support one or more of the following statistics:

Column cardinalities

Column cardinality values are defined as follows:

di = (number of distinct values of the column ci in the table)

The column cardinality for our example table is as follows:

d1= 3 (for values 10, 20, and 30 in column k1)
d2 = 4 (for values 10, 20, 30, and 40 in column k2)
d3 = 3 (for values 10, 40, and 50 for column k3)

Tuple cardinalities

Tuple cardinality values are defined as follows:

di = (number of distinct values of tuples <c1, c2, ..., ci> where ci is a column in the table).

In this definition, tuples all start from the first column (c1). The tuple cardinality of the example table is as follows:

<k1> = 3 (for values 10, 20, and 30 in k1)
<k1, k2> = 4 (for values 10, 20, 30, and 40 in k1 and k2)
<k1, k2, k3> = 5 (for values 10, 20, 30, 40, and 50 in k1, k2 and k3)

The provider can choose not to support some of the i cardinality values. If a particular cardinality value is not available, the provider should return NULL as the value.

Histogram Rowset

The IOpenRowset::OpenRowset method can return a histogram rowset for a specified table. A histogram rowset consists of a row for each range of values of the first column in the table. The histogram rowset is sorted on the RANGE_HI_KEY column in ascending order.

Histogram rowsets have the columns described in the following table.

Column name Type Description
RANGE_HI_KEY (depends on column type) The highest key value corresponding to this histogram range.
RANGE_ROWS DBTYPE_R8 (Optional) The fraction of the number of rows in the base table that have a value that falls in this histogram range.
EQ_ROWS DBTYPE_R8 (Optional) The fraction of the number of rows in the base table that have a value equal to RANGE_HI_KEY.
DISTINCT_RANGE_ROWS DBTYPE_I8 (Optional) Number of distinct values in the base table that fall in this histogram range.

The provider must return at least one of RANGE_ROWS or EQ_ROWS for each histogram row. The provider should return NULL for unsupported optional columns.

Each range is represented by RANGE_HI_KEY, the highest value in the first column of the table in that range. The main piece of information associated with each range is RANGE_ROWS, the fraction of rows in the base table that have a value that falls in the histogram range.

RANGE_ROWS = (number of rows in the base table that have a value x that falls in the range defined by Ki-1 < x <= Ki, where Ki is the RANGE_HI_KEY for the i-th histogram range) / (number of rows in the table).

The provider can also return EQ_ROWS, which is the number of rows in the table that exactly match RANGE_HI_KEY.

EQ_ROWS = (number of rows in the first column of the base table that contain a value equal to RANGE_HI_KEY) / (number of rows in the table).

A possible histogram rowset for the example table is illustrated by the following. (The real numbers, 0.6 and 0.2, are the equivalents of the fractions 3/5 and 1/5.)

range_hi_key range_rows eq_rows distinct_range_ rows
10 0.6 0.6 1
20 0.2 0.2 1
30 0.2 0.2 1

Providers can also provide information about frequent values in the table. This can be done by adding a histogram row corresponding to the frequent value and setting EQ_ROWS to the fraction of the rows in the table that match this frequent value.