Monday, 15 April 2019

ORC, Parquet, and Avro formats

While ORC and Parquet are both columnar storage formats supported in Hadoop, is there guidance on when to use one over the other, or things to consider before choosing which format to use?


1. Many of the performance improvements delivered by the Stinger initiative depend on features of the ORC format, including a block-level index for each column. These indexes make I/O potentially more efficient, allowing Hive to skip reading entire blocks of data when it determines that the predicate values cannot be present there. The cost-based optimizer (CBO) can also use the column-level metadata stored in ORC files to generate the most efficient query plan.
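As a rough sketch, these ORC-specific optimizations are typically switched on with session settings like the ones below (the table and column names are made up for illustration, and the defaults vary by Hive version):

```sql
-- Enable the cost-based optimizer and predicate pushdown
-- (often on by default in newer Hive releases; shown explicitly here).
SET hive.cbo.enable=true;
SET hive.optimize.ppd=true;
-- Use ORC's block-level indexes to skip row groups at read time.
SET hive.optimize.index.filter=true;

-- An illustrative ORC-backed table; with the settings above, Hive can
-- skip entire blocks whose index shows `amount > 1000` cannot match.
CREATE TABLE sales_orc (
  id        BIGINT,
  amount    DOUBLE,
  sale_date STRING
)
STORED AS ORC;

SELECT id, amount FROM sales_orc WHERE amount > 1000;
```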

2. ACID transactions in Hive are only possible when ORC is used as the file format.
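Concretely, a transactional Hive table must be stored as ORC and, in older Hive releases, bucketed. A minimal sketch (the table name is illustrative, and the exact session settings and requirements differ between Hive versions):

```sql
-- Session settings commonly required for ACID in Hive 1.x/2.x setups.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- ACID tables must use ORC; older Hive versions also require bucketing.
CREATE TABLE events_acid (
  id      BIGINT,
  payload STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- With the table above, row-level DML becomes possible:
UPDATE events_acid SET payload = 'fixed' WHERE id = 42;
```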


Avro is a row-based storage format for Hadoop that is widely used as a serialization platform. Avro stores the data definition (schema) in JSON format, making it easy for any program to read and interpret. The data itself is stored in a compact, efficient binary format.
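An Avro schema is itself just a JSON document. A small illustrative example (the record and field names here are made up):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

The schema is embedded in the Avro data file itself, so any Avro-aware reader can deserialize the binary records without needing out-of-band metadata.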

