Hive Challenges: Partitioning, Performance and More
In this series of articles, I will talk about the challenges of working with Hive tables. As with any other Apache tool, there are many configuration options to play with, and you need to know at least a few of them. Luckily, many partitioning concepts are similar to those in the RDBMS world, so if you have some idea of how table partitioning works in databases, this should not be very complex for you.

Q: How do you decide the number of partitions for a very large table?

There are multiple aspects to look out for:

Cardinality: Find out the cardinality of the important attributes, that is, how many distinct values each attribute holds. If you partition on an attribute, it should have as few distinct values as possible (though a few hundred partitions for a table is also fine).

Sorting: If you can compromise on write speed, sort the data while writing. This allows formats like ORC to capture the starting and ending (minimum and maximum) values in the metadata of files. Whil