July 24, 2019
Modern Data Platform and the Cloud
Data Lake Platform Owner, Raiffeisenbank
This is the second article in my series on building a Modern Data Platform. The first article was about understanding the modern data platform and its objectives. This article covers the cloud as it pertains to the modern data platform. It came out a bit longer than I expected, so I decided to split it into two parts.
Part 1
One of the very first decisions that you are going to face is whether to build your data platform on-prem or in the cloud. The cloud provides infrastructure as a service, which allows you to instantiate a piece of infrastructure on demand with a simple API call. This drastically reduces time to market and enables a self-service, autonomous environment. It's not a surprise, then, that analytical agencies promote a "cloud first" approach. However, you have probably heard that the cloud is very expensive and can become a 'money pit' that is difficult to manage.
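Before going further, here is what "a simple API call" can look like in practice: a minimal sketch in Python with boto3, the AWS SDK. The region, AMI ID and instance type are placeholder values, not recommendations.

```python
import boto3

# A single API call provisions a virtual machine on demand.
ec2 = boto3.client("ec2", region_name="eu-central-1")  # placeholder region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI; pick one for your region
    InstanceType="m5.xlarge",         # placeholder size
    MinCount=1,
    MaxCount=1,
)
print("Launched:", response["Instances"][0]["InstanceId"])
```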
While it's true that the cloud is not cheap, with the right approach it can provide a better TCO (Total Cost of Ownership) and ROI than on-prem solutions, and it can be a real enabler for your data platform. So, what is the secret sauce of success in the cloud? It's actually very simple. In a few words: decoupling data and compute will enable you to utilize cloud elasticity efficiently. Executing on this principle, however, may not be as easy as it sounds.
If you already have an on-prem data lake, it is most likely a Hadoop cluster that stores the data on HDFS (Hadoop Distributed File System) and runs a variety of workloads, from production ETL processes to unpredictable ad-hoc SQL queries issued by your BI and data science teams.
Sometimes the ad-hoc queries significantly slow down the Hadoop cluster and may even bring down SLA-bound ETL jobs. It would be great to have more than one Hadoop cluster for the various use cases, but it would take many months to put in place, assuming your organization can afford it at all. And even then, how do you manage the data across two or three Hadoop clusters? The workload is volatile and growing, and your cluster sometimes runs at its peak capacity. The upcoming holidays will likely bring a lot of business to your company, but also an extremely high workload and a potential outage, as depicted in the plot below.
As discussed in the previous article, the modern data platform faces the challenge of the three Vs: Volume, Variety and Velocity. This requirement creates a perfect storm for on-prem data platforms and is driving many companies to migrate their on-prem data lakes and data warehouses to the cloud. The cloud is designed for exactly these needs. In theory, by making your cluster elastic, you can support an unpredictable, volatile and virtually unlimited workload and pay only for the resources that you actually use (plots below), without throwing away money on maintaining a large static infrastructure day in and day out.
So, the cloud seems to be elastic and to have unlimited resources. Why not just lift and shift the Hadoop cluster to the cloud and make it as big as needed, or maybe migrate the AS400/DB2 workload to the AS400 Cloud? This is one of the biggest pitfalls that organizations fall into. Generally speaking, this approach will increase the TCO as well as the technical debt. Moving the Hadoop cluster to the cloud does not change its nature: it is still bound by the need to keep the entire Hadoop cluster running in order to serve HDFS in a consistent and reliable way.
The way to solve this puzzle is to decouple data and compute. That means we cannot use HDFS for the data lake. Instead, we need to use cloud object storage such as AWS S3 or Azure ADLS, as depicted in the diagram below.
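As a rough illustration of how small the change is on the job side, here is a minimal PySpark sketch; the bucket name, paths and the "region" column are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("decoupled-data-lake").getOrCreate()

# Before: the table lives on the cluster's own HDFS, so the cluster has to stay up.
# df = spark.read.parquet("hdfs:///data/lake/sales")

# After: the table lives in object storage, independent of any cluster's lifetime.
# ("s3a://" for open-source Spark with hadoop-aws; "s3://" via EMRFS on Amazon EMR.)
df = spark.read.parquet("s3a://my-data-lake/sales/")

df.groupBy("region").count().show()  # "region" is a hypothetical column
```

The job itself barely changes; only the storage URI does, and that is exactly what makes the cluster disposable.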
This simple but important conceptual shift allows us to bring the cluster down without any impact on data accessibility. It makes the cluster an ephemeral resource. Now we can have many ephemeral clusters running on top of the data lake. The clusters can be of different types, running a variety of tasks: Hadoop jobs, Spark sessions, Dremio queries, etc. Finally, it's possible to dedicate clusters to specific teams and use cases, such as data engineering, BI, and data science. The clusters are isolated from each other, making it possible to run jobs within SLA without the risk of being impacted by other jobs, as depicted in the diagram below.
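One possible way to implement such an ephemeral, single-purpose cluster on AWS is with boto3 and Amazon EMR. The sketch below is an example under assumptions, not a prescription: the cluster name, EMR release, instance sizes, IAM roles and S3 paths are placeholders. The cluster runs one Spark step and then terminates itself.

```python
import boto3

emr = boto3.client("emr", region_name="eu-central-1")

# Launch a cluster for a single team/use case; it terminates when its step finishes.
response = emr.run_job_flow(
    Name="bi-adhoc-cluster",             # hypothetical: one cluster per use case
    ReleaseLabel="emr-5.25.0",           # placeholder EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 5,
        "KeepJobFlowAliveWhenNoSteps": False,  # this is what makes it ephemeral
    },
    Steps=[{
        "Name": "nightly-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-data-lake/jobs/etl_job.py"],  # hypothetical job script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster:", response["JobFlowId"])
```

Because the data stays in S3, several clusters like this can run at the same time, one per team or use case, without stepping on each other.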
The ephemeral clusters can also be elastic. There is no longer a fixed cost for virtual machines kept around just to support HDFS, only to struggle with limited compute resources under high workload. With ephemeral clusters, it's possible to utilize cloud elasticity and scale the cluster out and in based on the actual workload (see plots below). This is not achievable with lift-and-shift; it is only possible because data and compute are decoupled. It is a very simple yet very effective best practice!
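On AWS, one concrete way to get this scale-out/scale-in behavior is an EMR managed scaling policy. The sketch below is illustrative only; the cluster ID and the capacity limits are assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="eu-central-1")

# Let the cluster grow under load and shrink back when the load drops,
# within the limits below, instead of sizing it for the holiday peak all year.
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 3,    # baseline capacity
            "MaximumCapacityUnits": 50,   # peak capacity, e.g. holiday traffic
        }
    },
)
```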
However, don't make elasticity the ultimate goal. While evaluating various options, keep in mind that elasticity is a means to a specific objective, such as cost management. There are other high-impact factors that must be considered. For example, you might find that a tool that seems to utilize elasticity less efficiently still outperforms other tools on TCO.
This concludes Part 1. As always, I hope this article was informative. Whether or not your experience matches mine, please share your thoughts and ask questions. In Part 2, I will cover many other challenges of implementing a data platform in the cloud. Stay tuned!