Cost-based SQL Optimization
Hive supports cost-based optimization (CBO) of SQL queries. This uses statistics about the Hive tables, table partitions, and columns within a table to produce good query execution plans. More efficient query plans can better utilize cluster resources and improve query latency. CBO is especially useful for complex queries containing multiple JOIN statements and for queries on very large tables.
Tables are not required to have partitions to generate CBO statistics.
CBO currently generates the following statistics:
Statistics Granularity
Description
Table-level
* Uncompressed size of table * Number of rows * Number of files
Column-level
* Number of distinct values * Number of NULL values * Minimum value * Maximum value
CBO requires column-level statistics to generate the best query execution plans. Column-level CBO statistics can be generated by both partitioned and un-partitioned tables.
Generating Statistics
Use the ANALYZE TABLE command to generate statistics for tables and columns. Use the optional NO SCAN clause to improve query performance by preventing a scan of files on HDFS. This option gathers only the following statistics:
· Number of files
· Size of files in bytes
ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [NO SCAN];
The following example views statistics for all partitions in the employees table. The query also uses the NO SCAN clause to improve performance:
ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS
Generating Column-level Statistics to generate statistics for columns in the employee table:
ANALYZE TABLE tablename [PARTITION(partcol1[1=val1], partcol2[=val2], ...)] COMPUTE STATISTICS FOR COLUMNS [NO SCAN];
The following example generates statistics for all column in the employees table:
ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;
Viewing Table details
Hive has a very nice feature that allows you to see details about a table, such as columns, data types, storage location of the table, size, etc. To view such information, use describe formatted with the table name.
DESCRIBE FORMATTED
Use the DESCRIBE statement to view statistics generated by CBO.
DESCRIBE [EXTENDED] tablename;
Viewing Generated Column Statistics
Use the following syntax to generate column statistics:
DESCRIBE FORMATTED [dbname.]tablename.columnname;
Last updated