4 minute read · November 10, 2020

Separation of Compute and Data: A Profound Shift in Data Architecture

Billy Bosworth · CEO, Dremio

For many years now, the industry has talked about the separation of compute and storage, and for good reason – it was a critical step forward for efficiency. When we were able to separate the compute tier from the storage tier, at least three important things happened:

  1. Raw storage costs became so cheap that they were practically "free" on an IT budget spreadsheet.
  2. Compute costs were isolated, meaning customers paid only for what they needed when processing data, which further lowered overall costs.
  3. Independent scaling of storage and compute allowed for on-demand, elastic fine-tuning of resources, bringing flexibility to architectural designs.

But these benefits didn’t arrive right away. Large, expensive SANs and cheaper, but often complicated-in-their-own-way, NAS systems have been with us for quite some time. The limiting factor for both of those storage models was administration and procurement overhead. Mass adoption of separated compute and storage became practical only with public cloud computing, where the two are simple to administer and relatively low cost. In addition, these cloud compute and storage services are, for all intents and purposes, infinitely scalable, which eliminates the hardware procurement problems of old. They also provide very high availability and performance.

Today, another paradigm shift in how cloud infrastructure resources are used is underway: one that puts data, not a vendor, at the center of the architecture. Just as applications have moved to microservice architectures, data itself is now able to follow suit, fully exploiting cloud capabilities in the process. Imagine the model shifting something like this:

Figure: Separation of compute and data, before and after.

Let’s take a cloud data warehouse as an example. From a pure cost standpoint, if a vendor charges you separately for your storage and compute utilization, you are in a better position than if those were inextricably linked. While that is progress, it brings some further challenges.

Let’s grant that a cloud data warehouse separates your compute and storage costs. So far, so good. But is your data itself separated from that vendor’s compute? Can you freely (in every sense of the word) access that data without paying the data warehouse vendor? You cannot. Are you paying the data warehouse vendor to get the data into their system? Or to get it out of their system? Yes, you are. Is your data stored independently, so that myriad other cloud services can access it through industry-standard formats? No, it is not. That state of affairs exists because that’s just not how data warehouses were designed some 30 years ago, and the same design principle persists in cloud data warehouses. The design imperative was to keep the data completely under the control of the data warehouse itself.

One hardly needs to make the point that data is central to an organization’s future. The question then becomes, what is the best architectural way to unlock that centrality? By separating compute from data, three immediate benefits are realized:

  1. Extreme reduction in complex and costly data copies and movement: instead of treating the data warehouse as the single source of truth, one accesses the data in open formats in the data lake, which also eliminates data silos.
  2. Open data standards and formats allow for universal data access from unlimited services and applications, creating freedom to choose best-of-breed solutions.
  3. An open architecture means future cloud services will be able to access the data directly instead of going through a data warehouse vendor’s proprietary format or moving/copying the data from the data warehouse for access.
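
As a rough illustration of the idea behind these three points, here is a minimal Python sketch. It rests on loud assumptions: a plain CSV file in a temporary directory stands in for an open format in a data lake (in practice this would more likely be something like Parquet or Apache Iceberg), and the two "engines" are hypothetical functions invented for this example, not any vendor's API. The point is only that the data is stored once, and independent compute reads it directly without copying it into a system that owns it.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical sample data, invented for illustration only.
ROWS = [
    {"region": "east", "amount": "120.0"},
    {"region": "west", "amount": "80.0"},
    {"region": "east", "amount": "30.0"},
]


def write_to_lake(lake_dir: Path) -> Path:
    """Store the data once, in an open format, outside any engine."""
    path = lake_dir / "orders.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["region", "amount"])
        writer.writeheader()
        writer.writerows(ROWS)
    return path


def revenue_by_region(path: Path) -> dict:
    """Engine A (hypothetical): aggregates directly from the shared file."""
    totals = defaultdict(float)
    with path.open(newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["amount"])
    return dict(totals)


def largest_order(path: Path) -> float:
    """Engine B (hypothetical): a separate service reading the same file."""
    with path.open(newline="") as f:
        return max(float(row["amount"]) for row in csv.DictReader(f))


def run_demo():
    """Write the data once, then let both engines read it in place."""
    with tempfile.TemporaryDirectory() as tmp:
        path = write_to_lake(Path(tmp))
        return revenue_by_region(path), largest_order(path)
```

Calling `run_demo()` returns the per-region totals and the largest order amount. Neither function depends on the other, and a third service could be added later without exporting or duplicating the data, which is the property the list above describes.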

Application architectures have proved that a services approach allows for maximum scale, flexibility, and agility. Separating compute and storage was an important first step in lowering costs for analytics, but it does not provide the kind of advantages found in modern application architectures. By separating compute and data, the benefits of modern application design can now be realized for data analytics. And given the critical nature of data for all businesses, that can’t happen fast enough.
