Posts

Showing posts from June, 2019

Hive Challenges: Partitioning, Performance and More

In this series of articles, I will talk about the challenges of working with Hive tables. As with any other Apache tool, there are a lot of configurations to play around with, and you need to know a few of them. Luckily, many partitioning concepts are similar to those in the RDBMS world, so if you have some idea about table partitioning in databases, this should not be very complex for you.

Q: How do you decide the number of partitions for a very large table?

There are multiple aspects to look out for:

Cardinality: Find the cardinality of the important attributes, that is, how many distinct values each one holds. An attribute used as a partition column should have as few distinct values as possible (though a few hundred partitions for a table is also fine).

Sorting: If you can compromise on write speed, sort the data while writing. This allows formats like ORC to capture the starting and ending values in the metadata of files. Whil
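The cardinality check described in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the column names, the helper names `column_cardinality` and `partition_candidates`, and the `max_partitions` threshold are all hypothetical, and on a real Hive table you would more likely compute the distinct counts with a query rather than in memory.

```python
from collections import defaultdict


def column_cardinality(rows, columns):
    """Count the distinct values of each column in an iterable of dict rows."""
    distinct = defaultdict(set)
    for row in rows:
        for col in columns:
            distinct[col].add(row[col])
    return {col: len(distinct[col]) for col in columns}


def partition_candidates(cardinality, max_partitions=500):
    """Return columns whose distinct-value count fits the limit, lowest first.

    max_partitions is an assumed threshold standing in for the
    "few hundred partitions" rule of thumb; tune it for your cluster.
    """
    within_limit = [c for c, n in cardinality.items() if n <= max_partitions]
    return sorted(within_limit, key=lambda c: cardinality[c])


# Hypothetical sample data: user_id is unique per row, so it is a poor
# partition column, while country and event_date repeat across rows.
rows = [
    {"country": "IN", "event_date": "2019-06-01", "user_id": 1},
    {"country": "IN", "event_date": "2019-06-01", "user_id": 2},
    {"country": "US", "event_date": "2019-06-02", "user_id": 3},
    {"country": "US", "event_date": "2019-06-02", "user_id": 4},
]

cardinality = column_cardinality(rows, ["country", "event_date", "user_id"])
candidates = partition_candidates(cardinality, max_partitions=3)
print(cardinality)
print(candidates)
```

With this sample, `user_id` is excluded because every row carries a different value, which is exactly the kind of attribute that would produce far too many partitions.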

Cracking Data Engineering Interviews

Of late, a lot of people have asked me for tips on how to crack Data Engineering interviews at FAANG (Facebook, Amazon, Apple, Netflix, Google) or similar companies. I have not worked at all of these companies, so I can't share tips that will necessarily apply to all of them, but I will share tips that generalize to most big companies. First of all, the field of Data Engineering has expanded a lot in the last few years and has become one of the core functions of any big technology company. The obvious reason for this expansion is the amount of data being generated by devices and the data-centric economy of the internet age. Each company is focussed on making the best use of the data it owns by making data-driven decisions. If you compare this to the Data Engineering roles that existed a decade ago, you will see a huge change. In the past, Data Engineering was invariably focussed on databases and SQL. Even now, these two form some part of most Data Engin