Documents
  • Invariant Documents
  • Platform
    • Data Platform
      • Install Overview
      • System Requirement
      • Software Requirement
      • Prepare the Environment
      • Installing Ambari Server
      • Setup Ambari Server
      • Start Ambari Server
      • Single Node Install
      • Multi-Node Cluster Install
      • Cluster Install from Ambari
      • Run and monitor HDFS
    • Apache Hadoop
      • Compatible Hadoop Versions
      • HDFS
        • HDFS Architecture
        • Name Node
        • Data Node
        • File Organization
        • Storage Format
          • ORC
          • Parquet
        • Schema Design
      • Hive
        • Data Organization
        • Data Types
        • Data Definition
        • Data Manipulation
          • CRUD Statement
            • Views, Indexes, Temporary Tables
        • Cost-based SQL Optimization
        • Subqueries
        • Common Table Expression
        • Transactions
        • SerDe
          • XML
          • JSON
        • UDF
      • Oozie
      • Sqoop
        • Commands
        • Import
      • YARN
        • Overview
        • Accessing YARN Logs
    • Apache Kafka
      • Compatible Kafka Versions
      • Installation
    • Elasticsearch
      • Compatible Elasticsearch Versions
      • Installation
  • Discovery
    • Introduction
      • Release Notes
    • Methodology
    • Discovery Pipeline
      • Installation
      • DB Event Listener
      • Pipeline Configuration
      • Error Handling
      • Security
    • Inventory Manager
      • Installation
      • Metadata Management
      • Column Mapping
      • Service Configuration
      • Metadata Configuration
      • Metadata Changes and Versioning
        • Generating Artifacts
      • Reconciliation, Merging Current View
        • Running daily reconciliation and merge
      • Data Inventory Reports
    • Schema Registry
  • Process Insight
    • Process Insight
      • Overview
    • Process Pipeline
      • Data Ingestion
      • Data Storage
    • Process Dashboards
      • Panels
      • Templating
      • Alerts
        • Rules
        • Notifications
  • Content Insight
    • Content Insight
      • Release Notes
      • Configuration
      • Content Indexing Pipeline
    • Management API
    • Query DSL
    • Configuration
  • Document Flow
    • Overview
  • Polyglot Data Manager
    • Polyglot Data Manager
      • Release Notes
    • Data Store
      • Concepts
      • Sharding
    • Shippers
      • Filerelay Container
    • Processors
    • Search
    • User Interface
  • Operational Insight
    • Operational Insight
      • Release Notes
    • Data Store
      • Concepts
      • Sharding
    • Shippers
      • Filerelay Container
    • Processors
    • Search
    • User Interface
  • Data Science
    • Data Science Notebook
      • Setup JupyterLab
      • Configuration
        • Configuration Settings
        • Libraries
    • Spark DataHub
      • Concepts
      • Cluster Setup
      • Spark with YARN
      • PySpark Setup
        • DataFrame API
      • Reference
  • Product Roadmap
    • Roadmap
  • TIPS
    • Service Troubleshooting
    • Service Startup Errors
    • Debugging YARN Applications
      • YARN CLI
    • Hadoop Credentials
    • Sqoop Troubleshooting
    • Log4j Vulnerability Fix
Powered by GitBook
On this page
  1. Platform
  2. Apache Hadoop
  3. Hive

Cost-based SQL Optimization

Hive supports cost-based optimization (CBO) of SQL queries. This uses statistics about the Hive tables, table partitions, and columns within a table to produce good query execution plans. More efficient query plans can better utilize cluster resources and improve query latency. CBO is especially useful for complex queries containing multiple JOIN statements and for queries on very large tables.

Tables are not required to have partitions to generate CBO statistics.

CBO currently generates the following statistics:

Statistics Granularity

Description

Table-level

* Uncompressed size of table * Number of rows * Number of files

Column-level

* Number of distinct values * Number of NULL values * Minimum value * Maximum value

CBO requires column-level statistics to generate the best query execution plans. Column-level CBO statistics can be generated by both partitioned and un-partitioned tables.

Generating Statistics

Use the ANALYZE TABLE command to generate statistics for tables and columns. Use the optional NO SCAN clause to improve query performance by preventing a scan of files on HDFS. This option gathers only the following statistics:

· Number of files

· Size of files in bytes

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [NO SCAN];

The following example views statistics for all partitions in the employees table. The query also uses the NO SCAN clause to improve performance:

ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS

Generating Column-level Statistics to generate statistics for columns in the employee table:

ANALYZE TABLE tablename [PARTITION(partcol1[1=val1], partcol2[=val2], ...)] COMPUTE STATISTICS FOR COLUMNS [NO SCAN];

The following example generates statistics for all column in the employees table:

ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;

Viewing Table details

Hive has a very nice feature that allows you to see details about a table, such as columns, data types, storage location of the table, size, etc. To view such information, use describe formatted with the table name.

DESCRIBE FORMATTED

Use the DESCRIBE statement to view statistics generated by CBO.

DESCRIBE [EXTENDED] tablename;

Viewing Generated Column Statistics

Use the following syntax to generate column statistics:

DESCRIBE FORMATTED [dbname.]tablename.columnname;

PreviousViews, Indexes, Temporary TablesNextSubqueries

Last updated 6 years ago