Hadoop File Formats decision tree

Hadoop, one of the most popular distributed systems on the market, has a ton of features. Among them is its support for multiple file formats. Most organizations and engineers use various file formats but fail to understand their exact use and underlying architecture.

The Problem?
There is a significant performance drop if the wrong file format is chosen. Query execution time can increase by 3x. Storage can balloon, so that a project pays for gigabytes instead of hundreds of megabytes. The complete data set may need re-processing whenever fields are added or removed, because the architects did not consider schema evolution. And no single file format on the market handles all of these problems.

The flexible structure!
Hadoop is designed to handle unstructured data, so why do file formats matter? Unstructured data includes videos, binary data, images and so on. But most organizations don’t have to deal with that kind of data; they simply say “unstructured data” when they mean a changing or flexible structure (fields added to or removed from the data).

In this article, we’ll look at the key factors for choosing the correct file format (for a querying environment) and leveraging its power for your use case. Keep the factors below in mind when making the decision; the list is not exhaustive.

1. Distribution
• Both CDH and CDP support most of the Hadoop file formats, but if you are using Impala, keep in mind that it doesn’t work with ORC. Likewise, Parquet is not supported with Hive-Stinger.
• CDH provides support for Parquet, whereas HDP provides support for ORC.
• It’s a good habit to check format support for your distribution and version before adopting a format. Otherwise, you will incur avoidable overhead work later.

2. Query processing Tools
• As stated in the previous point, Impala doesn’t support ORC. Hence, if your data lake is in ORC, choose your query processing tools accordingly.
• If you are using a data processing language, make sure it has readers and writers for the file format. Avro doesn’t have native Scala readers/writers, but it does provide them for Java. The Databricks library does provide Avro support for Scala.

3. Schema Evolution
• Get a tentative idea of how frequently, and at what scale, schema changes will occur.
• Choose an appropriate file format based on this. Don’t plan to change the file format later to accommodate schema changes, as that will involve a large overhead.
• Categorize your data: is it really unstructured, or does it just have a flexible schema?
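
To make the idea of a flexible schema concrete, here is a minimal Python sketch of how a reader might resolve older records against a newer schema: fields that were added later get a default value, and fields that were removed are dropped. It only mimics the spirit of schema resolution in formats such as Avro and does not use any Avro library; the names `READER_SCHEMA` and `resolve`, and the sample record, are hypothetical.

```python
# Hypothetical reader schema: field name -> default value used when a
# record written with an older schema lacks that field.
READER_SCHEMA = {
    "id": None,
    "name": "",
    "country": "unknown",  # assumed to be added after the first records were written
}


def resolve(record, reader_schema=READER_SCHEMA):
    """Project a stored record onto the reader schema.

    Fields missing from the record get the schema default; fields the
    reader schema does not list are dropped.
    """
    return {field: record.get(field, default)
            for field, default in reader_schema.items()}


# A record written before "country" existed, still carrying a removed field.
old_record = {"id": 1, "name": "Asha", "legacy_flag": True}
print(resolve(old_record))
```

With this kind of resolution the old data never has to be rewritten; without it, every schema change means re-processing the stored records.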

4. Size
• With large size come large-scale problems: updating data or changing the schema will be a nightmare.
• If your data is small, then updating or changing the schema in text files, where data is read from the beginning, won’t be an issue.
• But if your data is huge, reading it sequentially will take a lot of time. For an update-heavy environment use ORC; for a changing schema use Avro.

5. Performance
• Decide in the initial phase itself whether you want fast reads or fast writes, and select the file format based on that.
• Your query execution time and storage also depend on this factor.

6. Requirement
• Is your purpose only to store data as an archival system? Even then, you need to think about how your data will be compressed so that you can store more of it.
• Interacting with BI tools? Then only a few columns will be used, so why not go for ORC?
• Frequent changes in business regulations, with schema changes and fast data reads? Ah, Avro is there for you!
• In the end, the final decision is based on your requirements.
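
The BI point above rests on ORC being a columnar format: when a query needs only a few columns, a columnar layout lets the reader skip the rest. The following Python sketch illustrates that reasoning with plain lists and dicts rather than real ORC files; the sample data and both function names are hypothetical, and counting "values touched" is a deliberate simplification of actual I/O cost.

```python
# Hypothetical sample: 3 records with 4 fields each.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0, "notes": "a"},
    {"id": 2, "region": "US", "amount": 25.5, "notes": "b"},
    {"id": 3, "region": "EU", "amount": 7.5, "notes": "c"},
]

# Columnar layout of the same data: one list per field.
columns = {field: [row[field] for row in rows] for field in rows[0]}


def values_touched_row_layout(rows, wanted):
    """Row layout: each record is read whole, whatever fields are wanted."""
    return sum(len(row) for row in rows)


def values_touched_column_layout(columns, wanted):
    """Columnar layout: only the requested columns are read."""
    return sum(len(columns[field]) for field in wanted)


wanted = ["amount"]
print(values_touched_row_layout(rows, wanted))        # 12
print(values_touched_column_layout(columns, wanted))  # 3
```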

The decision tree shown here helps you select an appropriate file format!
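
As a rough companion, the rules of thumb from the six factors can also be written as a small Python function. This is a sketch of one possible reading of the bullet points, not a reproduction of the decision tree itself: the function name `suggest_format`, its parameters, the order in which the rules are checked, and the Parquet fallback for Impala are all assumptions.

```python
def suggest_format(engine, frequent_schema_changes=False,
                   update_heavy=False, reads_few_columns=False,
                   small_data=False):
    """Suggest a file format from the article's rules of thumb.

    Returns None when no rule gives a clear answer, in which case the
    decision has to come from your own requirements.
    """
    # From the article: Impala does not work with ORC.
    orc_usable = engine.lower() != "impala"

    if frequent_schema_changes:
        return "Avro"  # article: changing schema -> Avro
    if update_heavy:
        return "ORC" if orc_usable else None  # article: updating -> ORC
    if reads_few_columns:
        # Article suggests ORC for BI-style reads; falling back to Parquet
        # on Impala is an assumption of this sketch.
        return "ORC" if orc_usable else "Parquet"
    if small_data:
        return "Text"  # article: small data in text files is not an issue
    return None


print(suggest_format("hive", reads_few_columns=True))
print(suggest_format("impala", reads_few_columns=True))
print(suggest_format("spark", frequent_schema_changes=True))
```

Checking schema changes first reflects the article's warning that changing the file format later is the costliest mistake; reorder the checks if your priorities differ.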

This was just an introduction. In the next article of this series, we’ll look at each of the file formats, its architecture and its technical details.

