Cost-based SQL Optimization

Hive supports cost-based optimization (CBO) of SQL queries. This uses statistics about the Hive tables, table partitions, and columns within a table to produce good query execution plans. More efficient query plans can better utilize cluster resources and improve query latency. CBO is especially useful for complex queries containing multiple JOIN statements and for queries on very large tables.

Tables are not required to have partitions to generate CBO statistics.

CBO currently generates the following statistics:

Statistics Granularity

Description

Table-level

* Uncompressed size of table * Number of rows * Number of files

Column-level

* Number of distinct values * Number of NULL values * Minimum value * Maximum value

CBO requires column-level statistics to generate the best query execution plans. Column-level CBO statistics can be generated by both partitioned and un-partitioned tables.

Generating Statistics

Use the ANALYZE TABLE command to generate statistics for tables and columns. Use the optional NO SCAN clause to improve query performance by preventing a scan of files on HDFS. This option gathers only the following statistics:

· Number of files

· Size of files in bytes

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [NO SCAN];

The following example views statistics for all partitions in the employees table. The query also uses the NO SCAN clause to improve performance:

ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS

Generating Column-level Statistics to generate statistics for columns in the employee table:

ANALYZE TABLE tablename [PARTITION(partcol1[1=val1], partcol2[=val2], ...)] COMPUTE STATISTICS FOR COLUMNS [NO SCAN];

The following example generates statistics for all column in the employees table:

ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;

Viewing Table details

Hive has a very nice feature that allows you to see details about a table, such as columns, data types, storage location of the table, size, etc. To view such information, use describe formatted with the table name.

DESCRIBE FORMATTED

Use the DESCRIBE statement to view statistics generated by CBO.

DESCRIBE [EXTENDED] tablename;

Viewing Generated Column Statistics

Use the following syntax to generate column statistics:

DESCRIBE FORMATTED [dbname.]tablename.columnname;

Last updated