Statistics and Selectivity

Statistics and Selectivity
Prev	Chapter 10. Temporal Types: Aggregation and Indexing	Next

Statistics Collection

The PostgreSQL planner relies on statistical information about the contents of tables in order to generate the most efficient execution plan for queries. These statistics include a list of some of the most common values in each column and a histogram showing the approximate data distribution in each column. For large tables, a random sample of the table contents is taken, rather than examining every row. This enables large tables to be analyzed in a small amount of time. The statistical information is gathered by the ANALYZE command and stored in the pg_statistic catalog table. Since different kinds of statistics may be appropriate for different kinds of data, the table only stores very general statistics (such as number of null values) in dedicated columns. Everything else is stored in five “slots”, which are couples of array columns that store the statistics for a column of an arbitrary type.

The statistics collected for time types and temporal types are based on those collected by PostgreSQL for scalar types and span types. For scalar types, such as float, the following statistics are collected:

fraction of null values,
average width, in bytes, of non-null values,
number of different non-null values,
array of most common values and array of their frequencies,
histogram of values, where the most common values are excluded,
correlation between physical and logical row ordering.

For span types, like tstzspan, three additional histograms are collected:

length histogram of non-empty spans,
histograms of lower and upper bounds.

For geometries, in addition to (1)–(3), the following statistics are collected:

number of dimensions of the values, N-dimensional bounding box, number of rows in the table, number of rows in the sample, number of non-null values,
N-dimensional histogram that divides the bounding box into a number of cells and keeps the proportion of values that intersects with each cell.

The statistics collected for columns of the time and span types tstzset, tstzspan, tstzspanset, intspan, and floatspan replicate those collected by PostgreSQL for the tstzrange. This is clear for the span types in MobilityDB, which are more efficient versions of the range types in PostgreSQL. For the tstzset and the tstzspanset types, a value is converted into its bounding period, then the statistics for the tstzspan type are collected.

The statistics collected for columns of temporal types depend on their subtype and their base type. In addition to statistics (1)–(3) that are collected for all temporal types, statistics are collected for the time and the value dimensions independently. More precisely, the following statistics are collected for the time dimension:

For columns of instant subtype, the statistics (4)–(6) are collected for the timestamps.
For columns of other subtype, the statistics (7)–(8) are collected for the (bounding box) periods.

The following statistics are collected for the value dimension:

For columns of temporal types with step interpolation (that is, tbool, ttext, or tint):
- For the instant subtype, the statistics (4)–(6) are collected for the values.
- For all other subtypes, the statistics (7)–(8) are collected for the values.
For columns of the temporal float type (that is, tfloat):
- For the instant subtype, the statistics (4)–(6) are collected for the values.
- For all other subtype, the statistics (7)–(8) are collected for the (bounding) value spans.
For columns of temporal point types (that is, tgeompoint and tgeogpoint) the statistics (9)–(10) are collected for the points.

Selectivity Estimation

Boolean operators in PostgreSQL can be associated with two selectivity functions, which compute how likely a value of a given type will match a given criterion. These selectivity functions rely on the statistics collected. There are two types of selectivity functions. The restriction selectivity functions try to estimate the percentage of the rows in a table that satisfy a WHERE-clause condition of the form column OP constant. On the other hand, the join selectivity functions try to estimate the percentage of the rows in a table that satisfy a WHERE-clause condition of the form table1.column1 OP table2.column2.

MobilityDB defines 23 classes of Boolean operators (such as =, <, &&, <<, etc.), each of which can have as left or right arguments a PostgreSQL type (such as integer, timestamptz, etc.) or a MobilityDB type (such as tstzspan, tint, etc.). As a consequence, there is a very high number of operators with different arguments to be considered for the selectivity functions. The approach taken was to group these combinations into classes corresponding to the value and time dimensions. The classes correspond to the type of statistics collected as explained in the previous section.

MobilityDB estimates both restriction and join selectivity for time, span, and temporal types.