Hive Challenges: Partitioning, Performance and More

In this series of articles, I will talk about the challenges of working with Hive tables. Like any other Apache toolkit, Hive has a lot of configurations to play around with, and you need to know a few of them well. Luckily, many of the partitioning concepts are similar to the RDBMS world, so if you have some idea about table partitioning in databases, this should not be very complex for you.

Q: How do you decide the number of partitions for a very large table?

There are multiple aspects to look out for:

Cardinality: Find out the cardinality of the important attributes, i.e. how many distinct values each attribute holds. If you partition on an attribute, it should have as few distinct values as possible (though a few hundred partitions for a table are also fine).
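As a quick illustration, you can check cardinality before committing to a partition column; the table and column names here are made up for the example:

-- Low cardinality (e.g. country, ~200 values): a reasonable partition key.
-- High cardinality (e.g. user_id, millions of values): a bad partition key.
SELECT COUNT(DISTINCT country) AS country_cardinality,
       COUNT(DISTINCT user_id) AS user_id_cardinality
FROM web_logs;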

Sorting: If you can compromise on write speed, sort the data while writing. This allows formats like ORC to capture the minimum and maximum values of each column in the file metadata. At read time, Hive can skip any file whose metadata shows it cannot contain the values you are looking for, which reduces the amount of data read. Sorting can be done either at table-creation time as part of the CREATE TABLE command [e.g. CLUSTERED BY (userid) SORTED BY (age)] or at insertion time using ORDER BY or SORT BY in the INSERT ... SELECT statement.
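A sketch of both options; the table names and bucket count are illustrative, not prescriptive:

-- Option 1: declare the sort order in the DDL; Hive keeps each bucket sorted.
CREATE TABLE user_events (
  userid BIGINT,
  age INT,
  event STRING
)
CLUSTERED BY (userid) SORTED BY (age) INTO 32 BUCKETS
STORED AS ORC;

-- Option 2: sort at insert time into a plain ORC table, so each file's
-- metadata carries tight min/max ranges for userid.
INSERT OVERWRITE TABLE user_events_flat
SELECT userid, age, event
FROM staging_events
SORT BY userid;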

Skew: Another thing to watch out for is skew. If some attribute values have very few corresponding records in the table, you can end up with some very small partitions alongside other huge ones. There are a few ways to deal with skew, each with its own use case. One of them is to use a skewed table: if you expect a lot of data for a few values of columns C1 and C2, you use the following command to let Hive know. Hive will create separate directories to store such values so that they can be used directly in the WHERE clause (just like partition values). The only problem with this approach is that you need to know beforehand which values (and for which columns) will create the skew in your data.

CREATE TABLE <T> (SCHEMA)
SKEWED BY (c1, c2)
ON (('C1_Value1', 'C2_Value1'), ('C1_Value2', 'C2_Value2'))
[STORED AS DIRECTORIES];

Small file problem: Make sure that the partition scheme you come up with does not result in a lot of small files. In HDFS, the Namenode keeps an in-memory record for every file and block in the cluster. If your partitioning produces a huge number of small files, the Namenode's memory footprint grows and its metadata operations slow down, which in turn slows every query that has to list those files. The Hive metastore also tracks every partition, so an explosion of partitions makes metastore lookups slower as well. All these factors degrade performance as a whole. Also, in clusters without Namenode high availability the Namenode is a single point of failure, so an overloaded Namenode presents a larger risk of an unhealthy cluster going down entirely.
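If small files do creep in, Hive can merge them at write time. A sketch with commonly used merge settings; the values are illustrative and should be tuned for your cluster:

SET hive.merge.mapfiles = true;        -- merge small files from map-only jobs
SET hive.merge.mapredfiles = true;     -- merge small files from map-reduce jobs
SET hive.merge.smallfiles.avgsize = 134217728;  -- merge when avg file size < 128 MB
SET hive.merge.size.per.task = 268435456;       -- aim for ~256 MB merged files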

Q: E.g. Facebook wants to track the number of active users every month based on a user_activity table which logs every successful login. How would you partition this data to support decent read performance?

A simple approach is to have a table which stores (Cust_Id, Date, Current_status), partitioned at the day level. Your query is then just a count of cust_id with Current_status = True for any given date.
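A minimal sketch of the snapshot table and the daily count; the names (user_daily_status, status_date) and the ORC format are my assumptions, not from the original design:

-- Hypothetical daily-snapshot table: one row per user per day.
CREATE TABLE user_daily_status (
  cust_id BIGINT,
  current_status BOOLEAN
)
PARTITIONED BY (status_date STRING)
STORED AS ORC;

-- Active-user count for one day touches exactly one partition.
SELECT COUNT(cust_id)
FROM user_daily_status
WHERE status_date = '2024-01-15'
  AND current_status = TRUE;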

To capture changes in user status (active->inactive or inactive->active), keep a second table which just stores the incoming (Cust_id, Status_changes) events. From this table we can generate the next daily partition for the first table at the end of every day, as sketched below.
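A hedged sketch of that end-of-day job, assuming the table names above; the change-log schema and dates are placeholders:

-- Hypothetical change-log table: one row per status flip.
CREATE TABLE user_status_changes (
  cust_id BIGINT,
  new_status BOOLEAN
)
PARTITIONED BY (change_date STRING)
STORED AS ORC;

-- End-of-day job: carry yesterday's snapshot forward, overriding the
-- status of any user who flipped today and picking up new users.
INSERT OVERWRITE TABLE user_daily_status PARTITION (status_date = '2024-01-16')
SELECT COALESCE(s.cust_id, c.cust_id)           AS cust_id,
       COALESCE(c.new_status, s.current_status) AS current_status
FROM (SELECT cust_id, current_status
      FROM user_daily_status
      WHERE status_date = '2024-01-15') s
FULL OUTER JOIN
     (SELECT cust_id, new_status
      FROM user_status_changes
      WHERE change_date = '2024-01-16') c
ON s.cust_id = c.cust_id;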

In the next post, I will cover some more aspects of designing Hive tables which impact query performance, and how to avoid some common pitfalls.

