Posts

How Many Spark Executors Do I Need?

After covering the key components of a Spark job in the last post, in this post I will explain how to come up with the configuration that controls these components. Do note that these are just best practices and guidelines widely used in the industry; they are not set in stone. You can always revisit them if your system's behaviour calls for slightly different configurations.
Where to Start?
In my experience, most people get stuck at this step because of the overwhelming number of settings and configs they can tweak. So, in this post I will try to give a simplified way to come up with these configurations. These guidelines come from Cloudera and are quite popular within the big data community. Then, I will walk you through a sample configuration, which should help you understand the principles and see them in practice.
Number of cores per executor: Most people stick to one of two extremes when assigning the number of cores to their executors. First extreme
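
For a concrete feel of how these guidelines translate into numbers, here is a minimal Python sketch of the widely cited Cloudera-style heuristic (5 cores per executor, one core and ~1 GB per node reserved for the OS and Hadoop daemons, and one executor set aside for the YARN application master). The 8-node, 16-core, 64 GB cluster below is a made-up example, not the configuration discussed in the post.

    # Hypothetical cluster, purely for illustration
    nodes = 8
    cores_per_node = 16
    memory_per_node_gb = 64

    # Leave 1 core and ~1 GB per node for the OS and Hadoop/YARN daemons
    usable_cores_per_node = cores_per_node - 1
    usable_memory_per_node_gb = memory_per_node_gb - 1

    executor_cores = 5  # the widely quoted "5 cores per executor" guideline
    executors_per_node = usable_cores_per_node // executor_cores   # 3
    total_executors = nodes * executors_per_node - 1               # minus 1 for the YARN AM -> 23

    # Split each node's usable memory across its executors, then subtract
    # roughly 7% for spark.executor.memoryOverhead
    memory_per_executor_gb = usable_memory_per_node_gb / executors_per_node
    executor_memory_gb = int(memory_per_executor_gb * 0.93)        # ~19 GB

    print(f"--num-executors {total_executors} "
          f"--executor-cores {executor_cores} "
          f"--executor-memory {executor_memory_gb}G")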

Hive vs Spark vs Presto: SQL Performance Benchmarking

In my previous post, we went over the qualitative comparison of Hive, Spark and Presto. In this post, we will do a more detailed analysis through a series of performance benchmarking tests on these three query engines.
Benchmarking Data Set
For this benchmarking, we have two tables. Some of the key points of the setup are:
- All the query engines use the Hive metastore for table definitions, as Presto and Spark both natively support Hive tables
- All the tables are external Hive tables with data stored in S3
- All the tables use Parquet and ORC as storage formats
Tables:
1. product_sales: ~6 billion records
2. product_item: ~589k records
Hardware
Tests were done on the following EMR cluster configuration:
EMR Version: 5.8
Spark: 2.2.0
Hive: 2.3.0
Presto: 0.170
Nodes:
Master node: 1x r4.16xlarge
Task nodes: 8x r4.8xlarge
Query Types
There are three types of queries which were tested,
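
As a rough illustration of how one of these benchmark queries might be driven, here is a hedged PySpark sketch. Only the table names product_sales and product_item come from the post; the join columns (item_id, sale_amount) are hypothetical.

    from pyspark.sql import SparkSession

    # Spark picks up the table definitions from the shared Hive metastore,
    # which is why enableHiveSupport() is all that is needed here.
    spark = (
        SparkSession.builder
        .appName("benchmark-sample")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Hypothetical aggregation join between the two benchmark tables
    result = spark.sql("""
        SELECT i.item_id, SUM(s.sale_amount) AS total_sales
        FROM product_sales s
        JOIN product_item i ON s.item_id = i.item_id
        GROUP BY i.item_id
    """)
    result.show(10)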

Amazon Price Tracker: A Simple Python Web Crawler

Have you ever wanted something but couldn't afford it at the regular price? Or did you ever think that something was a good-to-have but not really worth it at the regular price? Have you ended up checking Amazon every few days for a price drop on the item? Well, now you don't need to. You can build a very simple Python program that does this for you, and that's what we are going to do today, in this post. We will build a very basic web crawler, scrape the price of the item from its page on Amazon, and check whether the price has dropped, increased or stayed the same.
Prerequisites: You need to install a couple of things to get started. You will need Python, and once you have installed it, you need a couple of libraries. If you are using a Mac, you can install Python using Homebrew:
>brew install python
>brew install python@2
Once you have installed Python, you can use pip to install requests and bs4. >
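
Below is a minimal sketch of the kind of crawler the post builds, using requests and BeautifulSoup. The product URL, the priceblock_ourprice element id and the target price are assumptions for illustration only; Amazon's markup changes frequently, so the selector will likely need adjusting.

    import requests
    from bs4 import BeautifulSoup

    URL = "https://www.amazon.com/dp/EXAMPLE_ASIN"   # placeholder product page
    HEADERS = {"User-Agent": "Mozilla/5.0"}          # Amazon tends to block the default requests UA
    TARGET_PRICE = 100.00

    def get_price(url: str) -> float:
        page = requests.get(url, headers=HEADERS, timeout=10)
        soup = BeautifulSoup(page.text, "html.parser")
        price_tag = soup.find(id="priceblock_ourprice")  # assumed element id
        price_text = price_tag.get_text().strip().lstrip("$").replace(",", "")
        return float(price_text)

    price = get_price(URL)
    if price < TARGET_PRICE:
        print(f"Price dropped to ${price:.2f} -- time to buy!")
    else:
        print(f"Still ${price:.2f}, keep waiting.")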

Hive vs Spark vs Presto: Which Query Engine to Choose?

Now that we have covered setting up our machines to learn big data, the next question we need to answer is which query engine to choose. The best thing about the Apache toolkit for big data is the number of choices for query engines. In this post, I will compare the three most popular such engines, namely Hive, Presto and Spark. These choices are available either as open source options or as part of proprietary solutions like AWS EMR.
Hive
Hive is one of the original query engines, and it shipped with Apache Hadoop. Over the course of time, Hive has seen a lot of ups and downs in popularity. There are two major functions of Hive in any big data setup.
Hive Metastore
One of the constants in any big data implementation nowadays is the use of the Hive Metastore. Hive ships with the metastore service (or the HCatalog service). This service allows you to manage your metastore like any other database. You can host this service on any of the popular RDBMS (e.g.

Your Next Gen Data Architecture: Data Lakes

Software Engineering as a field thrives on ever-changing keywords. In the last decade it was "big data" and "NoSQL": every engineer was a big data engineer and everyone was working on NoSQL. Similarly, now the world is crazy about Data Science and Machine Learning; half of my connections on LinkedIn have become "data scientists" overnight. Somewhere in between, there was another keyword which made a splash (no pun intended!) in the data warehousing world. That keyword was "Data Lakes". In this post, we will talk about what a data lake is, the reasons to have one, and the things you should keep in mind while designing a data lake for your data architecture.
Why Data Lakes?
The two biggest reasons to have a data lake in your DWH architecture are scalability and support for unstructured data. A data lake allows your data to grow enormously, without any major change in the storage strategy. In this age of big data,

Setting Up Presto On Your Machine

This is the fourth post in this series geared towards making the reader self-sufficient in learning big data applications. In the previous posts of the series, we installed Hadoop, Hive on the same namenode, and a MySQL metastore for Hive. In this post, we will build on the same setup and install Presto to use the same HDFS and Hive metastore. Let's get started then.
Prerequisites
Setup your machine for learning big data
Setup Apache Hadoop on your machine
Setup Apache Hive on your machine
Installing Presto
Again, we will use Homebrew to install Presto on our machine. Run brew install presto in your terminal.
Configuring Presto
You will need to edit the following files (at location etc/):
node.properties
config.properties
log.properties
jvm.config
Export Environment Variables
Add the following to your .bash_profile file and restart the terminal (or source it):
export PRESTO_VERSION=0.221
export PRESTO_HOME=/usr/local/Cellar/presto/0.221/l
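
Once Presto is running, a quick way to confirm it can see the shared Hive metastore is to query it from Python. This is a hedged sketch, not part of the original setup steps: it assumes PyHive is installed (pip install 'pyhive[presto]') and that config.properties exposes the coordinator on port 8080.

    from pyhive import presto

    # Connect to the local coordinator and list the Hive tables it can see
    conn = presto.connect(host="localhost", port=8080, catalog="hive", schema="default")
    cursor = conn.cursor()
    cursor.execute("SHOW TABLES")
    print(cursor.fetchall())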

Connecting AWS Redshift to SSAS Cubes

Though it is a rare combination, there are cases where you would want to connect an MPP database like Redshift to an OLAP solution for analytics. There might be scenarios where you want a cube to power your reports without the BI server hitting your Redshift cluster. In this post I will show you how to connect to a Redshift instance from SQL Server Analysis Services 2014.
Environment Setup
In my setup, the Redshift instance is in a VPC while the SSAS server is hosted on an EC2 machine in the same VPC. Access to the Redshift instance and the SSAS host machine is controlled by two different security groups. The security group attached to the Redshift cluster has an ingress rule set up for the security group attached to the EC2 machine. I have tried to keep the environment as close to real-life setups as possible; in most cases, your environment will be similar to this.
Steps to Connect Redshift to SSAS 2014
Step 1: Download the PGOLEDB driver for y
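
As an optional sanity check before configuring the OLE DB driver (this is not one of the post's steps), you can verify the security-group rules from the EC2 host with a plain Python connection; Redshift speaks the PostgreSQL wire protocol, so psycopg2 works. The endpoint and credentials below are placeholders.

    import psycopg2  # Redshift is reachable over the PostgreSQL protocol

    # Placeholder cluster endpoint and credentials
    conn = psycopg2.connect(
        host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="ssas_reader",
        password="********",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT current_database(), version()")
        print(cur.fetchone())
    conn.close()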

Redshift to Snowflake Migration: SQL Function Mapping

Gather around, kids! We have a new hero in town: Snowflake. Snowflake is killing other databases like Thanos killed (spoiler alert!) in Avengers. They started with Teradata and Vertica, and are now aiming their guns at AWS Redshift. And they have earned this right: they have really addressed the pain points people face while using these other databases. For example, scaling up and down in Redshift is a painful process, whereas in Snowflake you can spin up clusters on the fly without much hassle or effort and with almost zero impact on running pipelines. While this post is not about deciding which of Snowflake and Redshift is better, you should still read it if you are one of those thinking about making the move from Redshift to Snowflake. This post lists the most common functions used in Redshift and points out their Snowflake counterparts. The list is not exhaustive by any means, and I will keep updating it as I find new ones. Function

Hive Challenges: Bucketing, Bloom Filters and More

In the second post of this series, we will learn about a few more aspects of table design in Hive. Now that you know about partitioning challenges, you will be able to appreciate these features, which will help you further tune your Hive tables.
Bucketing
In addition to partitioning tables, you can enable another layer of bucketing of data based on some attribute value by using the Clustering method. Clustering can be used with partitioned or non-partitioned Hive tables. One particular use case where Clustering becomes useful is when your partitions have unequal numbers of records (e.g. users logging in per country: the US partition might be a lot bigger than the New Zealand one). In such cases, you can define the number of buckets and the clustered-by field (like user id), so that all the buckets have roughly equal numbers of records. In partitioning, each partition gets a directory, while in Clustering, each bucket gets a file. Records with the same bucketed column will always be stored in the same bucke
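
To make the bucketing idea concrete, here is a hedged sketch that issues Hive DDL through PyHive; the table, columns and bucket count are hypothetical, and it assumes a local HiveServer2 listening on port 10000.

    from pyhive import hive  # pip install 'pyhive[hive]'

    # Hypothetical table: partitioned by country and bucketed on user_id, so a
    # skewed partition (e.g. US) is still split into files of comparable size.
    ddl = """
    CREATE TABLE user_logins (
        user_id  BIGINT,
        login_ts TIMESTAMP
    )
    PARTITIONED BY (country STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC
    """

    cursor = hive.connect(host="localhost", port=10000, username="hive").cursor()
    cursor.execute(ddl)
    cursor.execute("SHOW CREATE TABLE user_logins")
    print("\n".join(row[0] for row in cursor.fetchall()))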

Setting Up Apache Hive on Your Machine

This is the third post of this series where we are setting up our machines to get started with learning big data. In the previous posts, we installed and configured Hadoop on our MacBook, and now we will install Apache Hive and use the same Hadoop cluster to save data in HDFS. Let's get started then.
Prerequisites
1. Setup your machine for learning big data
2. Setup Apache Hadoop on your machine
Installation
We will be installing Hive 1.2.2, so we need to update the brew formula (as we did for Hadoop in the previous article). Use brew edit hive to open the formula and edit it to have these values:
url "https://www.apache.org/dyn/closer.cgi?path=hive/hive-1.2.2/apache-hive-1.2.2-bin.tar.gz"
sha256 "763b246a1a1ceeb815493d1e5e1d71836b0c5b9be1c4cd9c8d685565113771d1"
Once you have updated the formula, install Hive using brew install hive
Modify Hive Configurations
First things first, we need to edit the .bash_profile fil