Data Organization

Hive data is organized into:

  • Databases: Namespaces that prevent naming conflicts for tables, views, partitions, columns, and so on. Databases can also be used to enforce security for a user or group of users. The terms database and schema are used interchangeably.

  • Tables: Homogeneous units of data that share the same schema.

  • Partitions: Each Table can have one or more partition Keys, which determine how the data is stored. Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy a specified criterion. Partition columns are virtual columns: they are not part of the data itself but are derived on load.

  • Buckets (or Clusters): Data in each partition can further be divided into Buckets based on the value of a hash function of some column of the Table. These can be used to efficiently sample the data.
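
As a rough illustration of the partition and bucket concepts above, the following Python sketch models where a row might land. It is not Hive code: the function names, the bucket count, the sample row, and the use of CRC32 as the hash are all assumptions made for this example, and Hive's own hash function differs. The `key=value` directory naming mirrors the convention Hive uses for partition directories.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count chosen for this sketch


def partition_path(partition_spec):
    """Build a key=value directory path from ordered (key, value) pairs."""
    return "/".join(f"{key}={value}" for key, value in partition_spec)


def bucket_index(value, num_buckets=NUM_BUCKETS):
    """Map a column value to a bucket number.

    CRC32 is a deterministic stand-in hash, not the function Hive uses.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def place_row(row, partition_spec, bucket_column):
    """Return (partition directory, bucket number) for one row.

    The partition values come from the load step, not from the row itself,
    which is why partition columns are described as virtual.
    """
    return partition_path(partition_spec), bucket_index(row[bucket_column])


# The row carries only real data columns; ds and country are supplied on load.
row = {"userid": 42, "page_url": "/home"}
path, bucket = place_row(row, [("ds", "2008-06-08"), ("country", "US")], "userid")
print(path)  # ds=2008-06-08/country=US
print(0 <= bucket < NUM_BUCKETS)  # True
```

Because the bucket number depends only on the hashed column value, rows with the same `userid` always fall into the same bucket, which is what makes reading a single bucket a usable sample of the data.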

It is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.
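
The pruning benefit can be sketched with a toy model in Python. This is only an illustration under stated assumptions: the table is a plain dictionary keyed by a hypothetical `ds` partition value, and `scan` is an invented helper, not a Hive API. It counts how many rows are read with and without a predicate on the partition key.

```python
# Toy table: each partition (keyed by its ds value) holds its own rows.
table = {
    "2008-06-07": [{"userid": 1, "page_url": "/a"}, {"userid": 2, "page_url": "/b"}],
    "2008-06-08": [{"userid": 1, "page_url": "/c"}],
    "2008-06-09": [{"userid": 3, "page_url": "/d"}, {"userid": 4, "page_url": "/e"}],
}


def scan(table, ds=None):
    """Return (result rows, number of rows read).

    When a ds value is given, only that partition is read; every other
    partition is skipped without touching its rows.
    """
    selected = table if ds is None else {ds: table.get(ds, [])}
    result = []
    rows_read = 0
    for part_value, rows in selected.items():
        for row in rows:
            rows_read += 1
            # The partition value is attached at read time, not stored in the row.
            result.append({**row, "ds": part_value})
    return result, rows_read


all_rows, read_all = scan(table)
one_day, read_pruned = scan(table, ds="2008-06-08")
print(read_all, read_pruned)  # 5 1
```

In this model the filtered query reads one row instead of five; on a real table the same idea lets the system skip entire partitions that cannot match the predicate.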
