ORC
Last updated
Apache ORC (Optimized Row Columnar) is a self-describing, type-aware columnar file format designed for Hadoop workloads. The format is optimized for large streaming reads, but it also supports finding required rows quickly. Storing data in columnar form is popular in analytics because it lets the reader read, decompress, and process only the values that a specific query requires. Because ORC files are type-aware, the writer chooses the most appropriate encoding for each type and builds an internal index as the file is written. Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query, and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions. (source: https://orc.apache.org/specification/ORCv1/)
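The stripe-skipping idea above can be sketched in plain Python. This is an illustrative model, not the ORC API: the `Stripe` class and `prune_stripes` function are hypothetical names standing in for the min/max statistics an ORC writer records and the pruning a reader performs during predicate pushdown.

```python
# Hypothetical sketch of predicate pushdown using stripe-level
# min/max statistics. Stripe and prune_stripes are illustrative
# names, not part of any real ORC library.
from dataclasses import dataclass, field

@dataclass
class Stripe:
    # Min/max statistics for one column, recorded as the file is written.
    min_val: int
    max_val: int
    rows: list = field(default_factory=list)

def prune_stripes(stripes, lo, hi):
    """Keep only stripes whose [min, max] range can overlap the
    predicate range [lo, hi]; all others are never read from disk."""
    return [s for s in stripes if s.max_val >= lo and s.min_val <= hi]

stripes = [
    Stripe(0, 99, list(range(0, 100))),
    Stripe(100, 199, list(range(100, 200))),
    Stripe(200, 299, list(range(200, 300))),
]

# Query: rows with value between 150 and 160. Only the middle stripe
# can contain matches, so two of the three stripes are skipped.
candidates = prune_stripes(stripes, 150, 160)
print(len(candidates))  # 1
```

Real ORC readers apply the same check at finer granularity too: row-group indexes let the reader skip within a stripe in units of 10,000 rows.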
If you issue a query in one place to run against a lot of data in another place, you end up with a lot of network traffic, which is slow and costly. However, if you can "push down" parts of the query to where the data is stored, filtering out most of the data at the source, you reduce network traffic and get faster results. Read more in the ORC spec at https://orc.apache.org
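The traffic savings can be made concrete with a small sketch. The functions below are hypothetical, simulating a remote store: one path ships every row to the query engine, the other runs the filter where the data lives and ships only matches.

```python
# Illustrative comparison of querying with and without pushdown.
# remote_rows stands in for data living on a storage node.
remote_rows = [{"id": i, "status": "ok" if i % 10 else "err"}
               for i in range(1000)]

def fetch_all():
    # No pushdown: every row crosses the network; the query engine
    # filters after transfer.
    return list(remote_rows)

def fetch_filtered(predicate):
    # Pushdown: the filter runs on the storage side, so only
    # matching rows are transferred.
    return [r for r in remote_rows if predicate(r)]

shipped_without = len(fetch_all())
shipped_with = len(fetch_filtered(lambda r: r["status"] == "err"))
print(shipped_without, shipped_with)  # 1000 100
```

Here pushdown cuts transferred rows from 1000 to 100; with ORC, stripe statistics let the reader skip most of the file before any bytes move at all.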