Storage Format

Hadoop supports a variety of file formats and compression algorithms. HDFS provides the raw storage layer, but applications layered on top of it, such as Hive, are optimized for ad-hoc queries and may impose additional data format requirements.

Data from other systems usually arrives as plain text or in a structured text format (see the reading example after this list):

· Text Data – Log files, CSV

· Structured Text Data – XML, JSON
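
As a rough sketch of how such data might be ingested, the PySpark snippet below reads a CSV log file and a JSON file into DataFrames. PySpark, the paths, and the application name are illustrative assumptions rather than part of any particular cluster setup; XML typically requires an external package such as spark-xml and is omitted here.

```python
from pyspark.sql import SparkSession

# Illustrative sketch: reading raw text-based data with Spark.
# All paths below are hypothetical placeholders.
spark = SparkSession.builder.appName("ingest-example").getOrCreate()

# CSV log data: treat the first line as a header and infer column types.
logs = spark.read.csv("hdfs:///data/raw/logs.csv", header=True, inferSchema=True)

# JSON data: Spark expects one JSON record per line by default.
events = spark.read.json("hdfs:///data/raw/events.json")

logs.printSchema()
events.printSchema()
```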

However, for efficient storage and queries, the data should be converted to one of the following file formats (a conversion example follows the list):

· ORC – The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. Using ORC files improves performance when Hive is reading, writing, and processing data.

· Parquet – Parquet is a columnar storage format built with complex nested data structures in mind. It can also be used with Hive and other query engines.
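
As a minimal sketch of the conversion step (again assuming PySpark; the paths and table name are hypothetical), a DataFrame read from text data can be written back out as ORC or Parquet:

```python
from pyspark.sql import SparkSession

# Illustrative sketch: converting text data to columnar formats.
# Paths and the table name are hypothetical placeholders.
spark = SparkSession.builder.appName("convert-example").getOrCreate()

df = spark.read.csv("hdfs:///data/raw/logs.csv", header=True, inferSchema=True)

# ORC: well suited to Hive reads, writes, and processing.
df.write.mode("overwrite").orc("hdfs:///data/warehouse/logs_orc")

# Parquet: readable by Hive and many other engines.
df.write.mode("overwrite").parquet("hdfs:///data/warehouse/logs_parquet")

# Optionally register the converted data as a table; persisting it in the
# Hive metastore additionally requires a session built with .enableHiveSupport().
df.write.mode("overwrite").format("orc").saveAsTable("logs_orc")
```

Because both formats store each column's values together, queries that reference only a few columns can skip the rest, and per-column data usually compresses better than row-oriented text.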
