Your Next Gen Data Architecture: Data Lakes
Software engineering as a field thrives on ever-changing keywords. In the last decade it was "big data" and "NoSQL": every engineer was a big data engineer and everyone was working on NoSQL. Similarly, the world is now crazy about data science and machine learning. Half of my connections on LinkedIn have become "data scientists" overnight. Somewhere in between, another keyword made a splash (no pun intended!) in the data warehousing world: "data lakes". In this post, we will talk about what a data lake is, why you would want one, and what you should keep in mind while designing a data lake for your data architecture.
Why Data Lakes?
The two biggest reasons to have a data lake in your data warehouse (DWH) architecture are scalability and support for unstructured data.
A data lake allows your data to grow enormously without any major change in the storage strategy. In this age of big data, you can never really predict the size of your data, so you need a place that can grow by orders of magnitude if needed. A data lake fits that bill perfectly.
A data lake also lets you bring together disparate data sources and unstructured or semi-structured data. This supports a loosely coupled processing architecture: you do not have to decide up front how to support every data structure you have not yet seen. You can pick a query engine now, knowing full well that tomorrow you might need a new engine just to process one particular data source that does not exist today. Having a lake to store such new types of data gives you freedom for the future.
Design Principles
Design for scale: Your data lake will grow, and it will grow fast. Make sure that you have rules and policies to control who is pushing data into your lake. Your data governance rules should control when and how data gets loaded.
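To make the governance point a little more concrete, here is a minimal sketch of what a "who can load, and when" rule might look like in Python. Everything in it is an assumption made for illustration: `IngestionPolicy`, `can_load`, the team names, the load window and the size limit are hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class IngestionPolicy:
    """Hypothetical rule: who may load data into a zone, when, and how much."""
    zone: str                 # e.g. "raw"
    allowed_teams: frozenset  # teams permitted to write to this zone
    window_start: time        # start of the permitted load window
    window_end: time          # end of the permitted load window
    max_batch_gb: float       # upper bound on the size of a single load


def can_load(policy, team, at, batch_gb):
    """Return (allowed, reason) for a proposed load."""
    if team not in policy.allowed_teams:
        return False, f"team '{team}' may not write to zone '{policy.zone}'"
    if not (policy.window_start <= at <= policy.window_end):
        return False, "outside the permitted load window"
    if batch_gb > policy.max_batch_gb:
        return False, "batch exceeds the size limit"
    return True, "ok"


# Illustrative values only.
raw_policy = IngestionPolicy(
    zone="raw",
    allowed_teams=frozenset({"payments", "clickstream"}),
    window_start=time(1, 0),
    window_end=time(5, 0),
    max_batch_gb=500.0,
)

print(can_load(raw_policy, "payments", time(2, 30), 120.0))
print(can_load(raw_policy, "marketing", time(2, 30), 120.0))
```

In a real lake these checks would more likely live in your ingestion framework or access-control layer. The point is only that the rules are written down and enforced, rather than left to convention.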
Modularity and extensibility: Your data lake should allow your consumer teams to plug and play. It should not restrict which technologies your downstream teams can use.
(Big) Data lake: Unless you have very strong reasons not to, go for an HDFS-based data lake. In almost all cases, having a Hive metastore is also a good idea. This gives you a lot of choices in the ecosystem: Hive, Presto, Spark, Impala, Kafka etc. Your storage might be S3 or any other cloud store, but still have HDFS and a Hive metastore. The Hive metastore keeps its catalog in a relational database that you can query, and you can host it on a supported RDBMS (for example MySQL, possibly on a managed service such as AWS RDS).
Raw Data: Store all the raw data in your lake. You may not realize it now, but you will need it. This way, your lake becomes the single source of truth for all your lineage analysis. It looks like a waste of storage until the day you need to trace the source of bad data in your production pipelines.
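One way to keep raw data traceable is to land every file under a predictable, date-partitioned path and to record which raw file each derived table came from. The sketch below assumes a `raw/<source>/<dataset>/load_date=<date>/` layout; that layout and the names `raw_path` and `lineage_record` are illustrative choices, not a standard.

```python
from datetime import date
from pathlib import PurePosixPath


def raw_path(source, dataset, load_date, filename):
    """Build a date-partitioned location for a raw file (hypothetical layout)."""
    return PurePosixPath(
        "raw",
        source,
        dataset,
        f"load_date={load_date.isoformat()}",
        filename,
    )


def lineage_record(raw_file, derived_table):
    """Link a derived table back to the raw file it was built from."""
    return {"derived_table": derived_table, "raw_file": str(raw_file)}


p = raw_path("crm", "customers", date(2020, 1, 15), "export_001.json")
print(p)  # raw/crm/customers/load_date=2020-01-15/export_001.json
print(lineage_record(p, "analytics.dim_customer"))
```

When bad data shows up downstream, a record like this lets you walk back from the derived table to the exact raw file and load date.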
All Inclusive Lake: Force your teams to adapt to the new architecture. A data lake is only fruitful if all your data platforms contribute to it. If half of your data is not in the lake, you are left with the same problem that plagued most data warehouses: siloed data stores. A data lake strategy is an all-in strategy, so make sure you have management buy-in for it.
Invest in data discovery and self service: As your lake grows, you will start having trouble discovering what data resides where, and understanding the structure of the data and the location of files. So, start investing early in a self-service architecture. You will need a data discovery mechanism that consumers of the lake can use to figure things out by themselves.
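To illustrate what a discovery mechanism needs to answer ("what data do we have, where is it, and what does it look like?"), here is a toy in-memory catalog in Python. It is only a sketch: `DatasetEntry`, `Catalog`, the dataset names and the locations are all made up for the example, and a real deployment would more likely build on an existing metadata store than on a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical catalog record for one dataset in the lake."""
    name: str
    location: str
    file_format: str
    owner: str
    columns: dict = field(default_factory=dict)  # column name -> type
    description: str = ""


class Catalog:
    """Toy in-memory catalog that consumers can search by keyword."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return sorted names of datasets whose name, description,
        or column names contain the keyword (case-insensitive)."""
        kw = keyword.lower()
        hits = []
        for e in self._entries.values():
            haystack = [e.name, e.description, *e.columns]
            if any(kw in text.lower() for text in haystack):
                hits.append(e.name)
        return sorted(hits)

    def describe(self, name):
        return self._entries[name]


catalog = Catalog()
catalog.register(DatasetEntry(
    name="crm.customers",
    location="raw/crm/customers/",
    file_format="json",
    owner="crm-team",
    columns={"customer_id": "string", "email": "string"},
    description="One row per customer",
))
catalog.register(DatasetEntry(
    name="sales.orders",
    location="raw/sales/orders/",
    file_format="parquet",
    owner="sales-team",
    columns={"order_id": "string", "customer_id": "string", "amount": "double"},
    description="One row per order",
))

print(catalog.search("customer"))  # ['crm.customers', 'sales.orders']
print(catalog.search("email"))     # ['crm.customers']
print(catalog.describe("sales.orders").location)
```

Even something this small shows the fields consumers tend to ask for first: location, format, owner and columns.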
Of all the keywords floating around the engineering communities, data lake is one of those that has stuck around for a long time. There is very clear merit in the idea of lakes in the world of big data. But whether an implementation succeeds depends completely on the strategy. In this post, I have tried to summarize the key principles that need to be kept in mind. Do share your feedback in the comments below.