Hive vs Spark vs Presto: Which Query Engine to Choose?

Now that we have covered setting up our machines for learning big data, the next question to answer is which query engine to choose. The best thing about the Apache big data toolkit is the number of query engines on offer. In this post, I will compare the three most popular ones: Hive, Spark and Presto. All three are available either as open-source projects or as part of proprietary platforms like AWS EMR.

Hive

Hive is one of the original query engines that shipped with Apache Hadoop. Over the course of time, Hive has seen its popularity rise and fall. It serves two major functions in any big data setup.

Hive Metastore

One of the constants in almost any big data implementation nowadays is the Hive Metastore. Hive ships with the metastore service (also exposed via the HCatalog service), which lets you manage your metadata like any other database. You can host the metastore on any popular RDBMS (e.g. MySQL, PostgreSQL), which means you can query it with plain SQL and use standard backup and disaster-recovery tooling. If your metastore starts growing, you can simply scale up the database instance instead of touching your Hadoop setup.
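To make the "metastore is just a database" point concrete, here is a small sketch in Python. The real metastore schema records databases in a `DBS` table and tables in a `TBLS` table joined on `DB_ID`; the snippet recreates a simplified version of those two tables in SQLite so it can run anywhere, but the same `SELECT` works against a MySQL- or PostgreSQL-backed metastore (the real schema has many more columns than shown here).

```python
import sqlite3

# Simplified stand-ins for the real Hive metastore tables DBS and TBLS
# (the actual schema has many more columns and tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DBS  (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER,
                       TBL_NAME TEXT, TBL_TYPE TEXT);
    INSERT INTO DBS  VALUES (1, 'default'), (2, 'sales');
    INSERT INTO TBLS VALUES (10, 2, 'orders',    'MANAGED_TABLE'),
                            (11, 2, 'customers', 'EXTERNAL_TABLE');
""")

# Plain SQL against the metastore: list every table per database.
rows = conn.execute("""
    SELECT d.NAME, t.TBL_NAME, t.TBL_TYPE
    FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID
    ORDER BY d.NAME, t.TBL_NAME
""").fetchall()
for db, table, tbl_type in rows:
    print(f"{db}.{table} ({tbl_type})")
# sales.customers (EXTERNAL_TABLE)
# sales.orders (MANAGED_TABLE)
```

This is exactly the kind of housekeeping query (which tables live in which database, which are external) that becomes trivial once your metadata sits in an ordinary RDBMS.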

Hive Engine

The Hive query engine lets you query your HDFS tables using HiveQL (HQL), an almost-SQL syntax. This familiarity is a massive factor in Hive's usage and popularity.

When to use Hive?

Unless you have a strong reason not to use the Hive Metastore, you should always use it. As for the Hive engine: although its performance has improved considerably over the last few years, Spark and Presto are better options in terms of both capabilities and performance. Another use case where I have seen people use Hive is the ELT process on their Hadoop setup. Because Hive supports DDL operations on HDFS tables, it remains a popular choice for building data-processing pipelines.
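The ELT pattern described above usually boils down to a pair of HQL statements: a `CREATE EXTERNAL TABLE` over raw files, then an `INSERT OVERWRITE` into a curated, partitioned table. Here is a minimal sketch that assembles those two statements in Python; every table name, path and column below is a hypothetical example, not a fixed convention.

```python
def hql_elt_steps(raw_path: str, db: str = "staging") -> list[str]:
    """Build two HQL statements for a simple load-then-transform step.
    All identifiers here are hypothetical examples."""
    # Step 1: expose the raw files on HDFS as an external table.
    create = f"""
        CREATE EXTERNAL TABLE IF NOT EXISTS {db}.raw_events (
            event_id STRING, user_id STRING, ts STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '{raw_path}'
    """
    # Step 2: transform into a curated, date-partitioned table.
    transform = f"""
        INSERT OVERWRITE TABLE {db}.daily_events PARTITION (dt)
        SELECT event_id, user_id, to_date(ts) AS dt
        FROM {db}.raw_events
    """
    return [create.strip(), transform.strip()]

for stmt in hql_elt_steps("/data/raw/events"):
    print(stmt, end="\n\n")
```

In a real pipeline you would submit these statements through Beeline or a scheduler such as Oozie or Airflow rather than printing them.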

Spark

Spark is the new poster boy of the big data world, and it deserves the fame. It is far faster than Hive and offers a robust library collection with first-class Python support. It processes data in memory, and optimizations such as lazy evaluation and DAG-based dependency management make it the de facto choice for many teams. It also offers ANSI SQL support via the SparkSQL shell.

When to use Spark?

Spark excels at almost every facet of a processing engine. With it you can build data pipelines, perform DDL operations on HDFS, write batch or streaming applications, and run SQL on HDFS. The only real reason not to run Spark is a lack of expertise on your team. A minor issue with SparkSQL is that its performance deteriorates as concurrency increases.

Presto

Presto is a peculiar product: it does only one thing, but it does that one thing really well. It was built to support ANSI SQL on HDFS, and it excels at that. It is also an in-memory compute engine, and as a result it is blazing fast.

When to use Presto?

Presto is without doubt the best option for SQL support on HDFS. It scales well with growing data and supports high concurrency on the cluster, and its workload management has improved over time. Another great feature of Presto is its support for multiple data stores via its catalogs. That means you can join data in a Hadoop cluster with another dataset in MySQL (or Redshift, Teradata, etc.) in a single SQL query. Isn't that amazing?
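To illustrate the catalog feature, here is the shape of such a federated query. In Presto a fully qualified table name is `catalog.schema.table`, so `hive` and `mysql` below refer to two catalogs configured on the same cluster; all the schema, table and column names are hypothetical examples. The small Python helper just assembles the statement you would submit through the Presto CLI or a client library.

```python
def federated_join(hive_table: str, mysql_table: str) -> str:
    """Join a Hive-catalog table with a MySQL-catalog table in one
    Presto query. Table names are catalog.schema.table; every
    identifier below is a hypothetical example."""
    return (
        f"SELECT o.order_id, o.amount, c.segment\n"
        f"FROM {hive_table} o\n"
        f"JOIN {mysql_table} c ON o.customer_id = c.customer_id"
    )

sql = federated_join("hive.sales.orders", "mysql.crm.customers")
print(sql)
```

Presto resolves each catalog prefix to its configured connector, so the join across the Hadoop cluster and the MySQL instance happens inside a single query, with no ETL step to copy the data first.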

In the next post, I will share the results of performance benchmarking Hive, Spark and Presto.
