Hive Challenges: Bucketing, Bloom Filters and More

In the second post of this series, we will learn about few more aspects of table design in Hive. Now that you know about partitioning challenges, you will be able to appreciate these features which will help you to further tune your Hive tables.

Bucketing

In addition to Partitioning the tables, you can enable another layer of bucketing of data based on some attribute value by using the Clustering method. Clustering can be used with partitioned or non-partitioned hive tables. One particular use case where Clustering becomes useful when your partitions might have unequal number of records (e.g. users logging in per country, US partition might be a lot bigger than New Zealand). In such cases, you can define the number of buckets and the clustered by field (like user Id), so that all the buckets have equal records. In partitioning each partition gets a directory while in Clustering, each bucket gets a file. Records with the same bucketed column will always be stored in the same bucket.

Bloom Filters

Bloom filters are relatively new addition to Hive but are really powerful in some use cases where the queries are using the = operator or when joining large tables. A bloom filter is a hash value for the data in a column in a given block of data. This means that you can ask a bloom filter if it contains a certain value (e.g. country = US or gender = female), without the need to read the block at all. This increases the read performance substantially as it can just skip the blocks where the value is not present. 

You can specify a bloom filter when using the create or alter statement of the table by setting the TBL property ‘orc.bloom.filter.columns’ to the columns for which you want to create the bloom filter. Currently bloom filters only work for the ORC format. You can make bloom filters more effective by sorting the data (by the columns which are part of the filter) while inserting. 

Predicate Pushdown

Predicate Pushdown has been part of a lot of RDBMS systems for a long time now. It allows the where clause to be executed before the join, resulting in reduced amount of data sent over the network. For example, in the following query, the WHERE clause will get executed before the join. It will result in lesser data being sent to the reduce phase.


SELECT a.*
FROM table1 a
JOIN table2 b ON a.id = b.id

WHERE a.city = 'New York'

Now, Hive also supports predicate pushdown. To enable Predicate pushdown, use this property:

set hive.optimize.ppd = true;


I hope that you will be able to apply these principles in your Hive systems and see the results yourself. Do share your feedback or topics you would like me to cover.


Previous post: Hive Partitioning Challenges

Comments

Popular posts from this blog

Uber Data Model

Data Engineer Interview Questions: SQL

Cracking Data Engineering Interviews

UBER Data Architecture