Special-Purpose Rowsets

OLE DB uses rowset objects not only as a means to represent consumer data, but as a tool to represent information such as the following:

Metadata—Schema rowsets contain self-descriptive information (metadata) about the data store or about a rowset derived from the data store. Schema rowsets may contain logical descriptions of the data, such as the columns in a table, or physical descriptions of the data store, such as the names and structures of indexes in a database. Some OLE DB providers support schema rowsets. For more information, see Appendix B: Schema Rowsets.
Indexes—An index provides access to data in the rows of a rowset based on key values and references to those rows. Index rowsets may be separate from the rowset they reference (index rowsets) or part of the referenced rowset (integrated index rowsets). Providers over ISAM data typically support index rowsets. For more information, see Chapter 10: Index Rowsets, and Chapter 11: Integrated Indexes.
Views—A view rowset defines a subset of the rows and columns from a rowset as specified by some criteria. Views contain no data of their own, and they cannot be used to combine data from multiple rowsets. Views are useful for sorting and filtering.
Chapters—Chapters are rowsets whose rows are grouped together as specified by some common criteria, and each group of rows can be treated as if it were a separate rowset. Additional columns, such as those created from calculations on the same column in several rows (such as sums and averages) or different columns in a single row, may be appended to one or more rows in a chapter.
References to a chapter may be set as the value of columns in another rowset. The effect is that the chapter appears to be contained in a column of the other rowset. That is, the rowsets appear to be hierarchical (hierarchical rowsets). For more information, see Chapter 4: Hierarchical Rowsets.
Table statistics—A new feature as of OLE DB 2.6, histogram and metadata rowsets contain statistical information about the rowset that would result from a query request. These rowsets are of value mostly to developers writing query processors. For more detailed information, see the topics "Table Statistics" and "Histogram Rowset" in this section.

Table Statistics

The OLE DB TABLE_STATISTICS schema rowset and histogram rowset, along with associated properties, enables a specialized audience of consumers to write efficient distributed query processors by giving providers the means to expose useful statistics on base tables. All statistics are returned in the form of rowsets.

Using table statistics, the consumer can estimate the number of rows that will be returned by a query. Three kinds of statistics are provided: column cardinalities, tuple cardinalities, and histograms.

Column and Tuple Cardinality

In general, cardinality is the number of elements in a given mathematical set, and a tuple is a set with a specified number of ordered elements.

Column cardinality is the number of unique values of a column in a table. Tuple cardinality is the number of unique values of a tuple in a table. Column and tuple cardinality is used with other information to estimate the number of rows that would match a given predicate (for example, the WHERE clause of a SELECT statement).

Histogram

In general, a histogram consists of a sorted set of ranges versus the number of elements within each range.

Histogram rowsets contain information regarding the distribution of values in a column. In the absence of histograms, query processors have to rely on a uniform distribution of values. With histograms, a query processor can make more accurate cardinality estimates.

Specific Features

The following features support the ability to obtain statistical information:

DBPROP_TABLESTATISTICS—A data source information property that indicates whether the provider supports one or more of column cardinality, tuple cardinality, or histogram statistics on a specific base table.
DBPROP_OPENROWSETSUPPORT property—A new bit is added which indicates whether the provider supports histogram rowsets. Any provider that supports histograms should also set DBPROP_TABLESTATISTICS.
TABLE_STATISTICS—A schema rowset obtained with the IDBSchemaRowset interface. This schema rowset lists column and tuple cardinality information as well as other information for a given base table.
IOpenRowset::OpenRowset—This method accepts a TableID and special DBID structure arguments and returns a histogram rowset corresponding to the specified table.

A provider may choose to support both cardinalities and histograms, only cardinalities, or only histograms.

The provider may choose to optimize its implementation of histogram and cardinality information by sampling only a subset of the rows in the base table. In this case, the information reported in the histogram and TABLE_STATISTICS rowsets should be scaled up to the total number of rows in the base table and not merely the sample size.

Whether information returned in the TABLE_STATISTICS schema and histogram rowsets is up-to-date with the latest table contents is provider-specific. Consumers should not assume the statistics are current. The provider may choose its own strategy in how often statistics are updated.

Example Table

The following table will be used to illustrate concepts in subsequent discussions. The columns are arbitrarily named k1, k2, and k3, and their values have no special meaning other than being easy to work with.

k1	k2	k3
10	10	10
10	20	40
10	20	40
20	30	40
30	40	50

TABLE_STATISTICS Schema Rowset

The TABLE_STATISTICS schema rowset describes the available set of statistics on tables in the provider. Each statistic is identified by a fully qualified name (catalog/schema/name), like other database objects, and corresponds to statistics data on one or more columns of a table.

Consider a statistic defined over n ordered set of columns, c1, c2, ... cn. A provider can support one or more of the following statistics:

Column cardinalities

Column cardinality values are defined as follows:

di = (number of distinct values of the column ci in the table)

The column cardinality for our example table is as follows:

d1= 3 (for values 10, 20, and 30 in column k1)

d2 = 4 (for values 10, 20, 30, and 40 in column k2)

d3 = 3 (for values 10, 40, and 50 for column k3)

Tuple cardinalities

Tuple cardinality values are defined as follows:

di = (number of distinct values of tuples <c1, c2, ..., ci> where ci is a column in the table).

In this definition, tuples all start from the first column (c1). The tuple cardinality of the example table is as follows:

<k1> = 3 (for values 10, 20, and 30 in k1)

<k1, k2> = 4 (for values 10, 20, 30, and 40 in k1 and k2)

<k1, k2, k3> = 5 (for values 10, 20, 30, 40, and 50 in k1, k2 and k3)

The provider can choose not to support some of the i cardinality values. If a particular cardinality value is not available, the provider should return NULL as the value.

Histogram Rowset

The IOpenRowset::OpenRowset method can return a histogram rowset for a specified table. A histogram rowset consists of a row for each range of values of the first column in the table. The histogram rowset is sorted on the RANGE_HI_KEY column in ascending order.

Histogram rowsets have the columns described in the following table.

Column name	Type	Description
RANGE_HI_KEY	(depends on column type)	The highest key value corresponding to this histogram range.
RANGE_ROWS	DBTYPE_R8	(Optional) The fraction of the number of rows in the base table that have a value that falls in this histogram range.
EQ_ROWS	DBTYPE_R8	(Optional) The fraction of the number of rows in the base table that have a value equal to RANGE_HI_KEY.
DISTINCT_RANGE_ROWS	DBTYPE_I8	(Optional) Number of distinct values in the base table that fall in this histogram range.

The provider must return at least one of RANGE_ROWS or EQ_ROWS for each histogram row. The provider should return NULL for unsupported optional columns.

Each range is represented by RANGE_HI_KEY, the highest value in the first column of the table in that range. The main piece of information associated with each range is RANGE_ROWS, the fraction of rows in the base table that have a value that falls in the histogram range.

RANGE_ROWS = (number of rows in the base table that have a value x that falls in the range defined by Ki-1 < x <= Ki, where Ki is the RANGE_HI_KEY for the i-th histogram range) / (number of rows in the table).

The provider can also return EQ_ROWS, which is the number of rows in the table that exactly match RANGE_HI_KEY.

EQ_ROWS = (number of rows in the first column of the base table that contain a value equal to RANGE_HI_KEY) / (number of rows in the table).

A possible histogram rowset for the example table is illustrated by the following. (The real numbers, 0.6 and 0.2, are the equivalents of the fractions 3/5 and 1/5.)

range_hi_key	range_rows	eq_rows	distinct_range_ rows
10	0.6	0.6	1
20	0.2	0.2	1
30	0.2	0.2	1

Providers can also provide information about frequent values in the table. This can be done by adding a histogram row corresponding to the frequent value and setting EQ_ROWS to the fraction of the rows in the table that match this frequent value.