5 minute read · September 23, 2021
3 Key Trends Shaping the Future of Data Infrastructure
Tomer Shiran · Founder & Chief Product Officer, Dremio
Organizations today are looking beyond traditional ways to stand up analytics and are implementing modern data architectures that both accelerate analytics and keep costs under control. This includes challenging the 30-year-old paradigm of extracting and loading data into costly, proprietary data warehouses for BI and analytics.
Three major trends have emerged in the world of analytics, and IT leaders are actively acting on them today.
Trend #1: Shifting Toward Open Data Architectures
Data architectures prior to 2015 were closed and proprietary. Databases such as Teradata and Oracle co-located storage, compute and data in a single system, so all three were tethered to it. There was no separation of compute and storage.
Between 2015 and 2020, the widespread adoption of public cloud changed this landscape and separation of compute and storage became a reality. Cloud data vendors such as AWS and Snowflake made it possible to separate storage from compute in cloud warehouses, providing better scalability and efficiency. But data still needed to be ingested, loaded and copied into one proprietary system, which was in turn tied to a single query engine. Using multiple databases or data warehouses meant storing multiple copies of data. And companies still had to pay to get their data in and out of the proprietary system, which led to astronomical costs.
Recently we’ve witnessed the rise of a more modern and open data architecture, one in which the data is its own independent layer. There is a distinct separation between data and compute. Data is stored in open source file formats (such as Apache Parquet) and open source table formats (such as Apache Iceberg), and accessed by decoupled and elastic compute engines such as Apache Spark (batch), Dremio (SQL) and Apache Kafka (streaming). Hence, the same data is accessed by different engines in a loosely coupled architecture.
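As a minimal sketch of what "the same data, many engines" looks like in practice, the snippet below reads an Apache Iceberg table stored as Parquet files in object storage using PySpark. The catalog name, bucket and table are hypothetical, and it assumes the Iceberg Spark runtime package is on the classpath; the same table could be queried concurrently by a SQL engine such as Dremio.

```python
from pyspark.sql import SparkSession

# Hypothetical names: the "lake" catalog, warehouse bucket and table are examples only.
spark = (
    SparkSession.builder
    .appName("open-lakehouse-example")
    # Register an Iceberg catalog backed by files in object storage.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# The table's data files are open Parquet; its metadata is the open Iceberg table format.
# Any engine that understands Iceberg (Spark, Dremio, Flink, ...) can read the same table.
events = spark.read.format("iceberg").load("lake.db.events")
events.filter("event_date = '2021-09-01'").groupBy("country").count().show()
```

The key point is that no engine owns the data: the table lives in the company's own storage account, and each engine is just another client of the same open files and metadata.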
With these architectures, data is stored as its own independent tier in open formats in the company’s own cloud account, and made accessible to downstream consumers through a variety of services. This is akin to how applications have transitioned from monolithic architectures to microservices. A similar transition is now happening in the world of data analytics, with companies moving from proprietary data warehouses and endless ETL processes to open data architectures such as cloud data lakes and lakehouses. This evolution is inevitable, and the data and analytics space is already moving in this direction.
Trend #2: Making Infrastructure Easier
Organizations get the maximum benefit from open data architectures when the infrastructure is easy to use. That means leveraging SaaS services to talk to the data tier, and having those services communicate with each other.
SaaS makes it easier, just the way Gmail works today — no need to install, upgrade, configure or monitor software, and no need to collect log files or SSH into machines. And the services use open data formats, as discussed in the previous trend.
Dremio Cloud combines what people love about Dremio with the benefits of SaaS. It offers a unified and consistent semantic layer with security and governance, a multi-engine architecture for workload isolation, and seamless integration with popular BI tools. Above all, it offers ease of use: no upgrades, no versions, and infinite scalability (run as many queries as you want, or just one).

The global control plane provides a single pane of glass with complete visibility into users, security and integrations, regardless of cloud or region. The control plane’s microservices architecture auto-scales so that it can plan and optimize millions of queries in parallel, and the engines in the execution plane auto-replicate to execute whatever workload is routed to them. The result is that you no longer have to worry about capacity or perform sizing exercises. And the data never leaves the data plane: it stays in the customer’s own S3 bucket, within their VPC.
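To give a feel for what "no infrastructure to manage" means on the client side, here is a rough sketch of querying a SaaS SQL service over Apache Arrow Flight with pyarrow. Dremio exposes Arrow Flight endpoints, but the hostname, authentication header and table name below are assumptions for illustration rather than exact connection details.

```python
import pyarrow.flight as flight

# Assumed endpoint and personal access token; check the service's docs for real values.
client = flight.FlightClient("grpc+tls://data.dremio.cloud:443")
options = flight.FlightCallOptions(
    headers=[(b"authorization", b"Bearer <personal-access-token>")]
)

# Submit a SQL query; results stream back as Arrow record batches.
query = "SELECT country, COUNT(*) AS cnt FROM lake.db.events GROUP BY country"
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)
reader = client.do_get(info.endpoints[0].ticket, options)
print(reader.read_all().to_pandas())
```

There is nothing to provision or size on the client side; the control plane routes the query to an engine, and the results come back over a standard protocol.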
Trend #3: Making Data Engineering and Data Management Easier
As open data architectures are adopted by organizations, data engineering and data management on data lake storage have to become easier as well. Apache Iceberg and Delta Lake enable transactional tables within the lake, so data engineers can use simple DML statements at the record level to insert, update and delete data. These table formats also provide several other key benefits, such as powerful schema evolution, table versioning/partitioning and querying snapshots.
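As an illustration, continuing the hypothetical Spark session and Iceberg table from the earlier sketch (and assuming Iceberg's SQL extensions are enabled on Spark 3), record-level DML, schema evolution and snapshot queries look roughly like this:

```python
# Record-level DML on an Iceberg table (requires Spark 3 with Iceberg's SQL extensions).
spark.sql("UPDATE lake.db.events SET status = 'closed' WHERE event_id = 42")
spark.sql("DELETE FROM lake.db.events WHERE event_date < '2020-01-01'")

# Schema evolution is a metadata-only operation; no data files are rewritten.
spark.sql("ALTER TABLE lake.db.events ADD COLUMN referrer STRING")

# Query a historical snapshot (time travel) by timestamp in milliseconds.
old_snapshot = (
    spark.read
    .option("as-of-timestamp", "1609459200000")  # 2021-01-01 00:00:00 UTC
    .format("iceberg")
    .load("lake.db.events")
)
old_snapshot.show()
```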
The introduction of Project Nessie, a modern metastore for data lakes and lakehouses, simplifies the life of a data engineer. Nessie borrows concepts from the world of Git, namely data branching and version control. In the traditional database world, transactional operations are performed in one session, by one user, entirely in SQL. With Nessie, inserts into a table can be performed on an ETL branch, and that branch supports multiple clients: the data might be ingested from Kafka, then transformed through a series of Spark jobs and Dremio queries before being merged into the main branch. The merge is an atomic operation, and all data consumers see the changes at the same time.
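To make the branching workflow concrete, here is a hedged sketch using a Nessie-backed Iceberg catalog in Spark. The configuration keys follow Nessie's documented Spark integration, but the endpoint URI, branch names and table are hypothetical, and the merge statement requires Nessie's Spark SQL extensions (its exact syntax may vary by version).

```python
from pyspark.sql import SparkSession

# Hypothetical Nessie endpoint and branch names; real deployments will differ.
spark = (
    SparkSession.builder
    .appName("nessie-branching-example")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "https://nessie.example.com/api/v1")
    .config("spark.sql.catalog.nessie.ref", "etl")   # work happens on the "etl" branch
    .config("spark.sql.catalog.nessie.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Writes on the "etl" branch are invisible to consumers reading "main".
spark.sql("INSERT INTO nessie.db.events VALUES (42, 'click', DATE '2021-09-01')")

# Merging the branch (via Nessie's SQL extensions, CLI or REST API) makes all of its
# commits visible to every consumer atomically.
spark.sql("MERGE BRANCH etl INTO main IN nessie")
```

The transformation steps can come from different engines (Kafka ingestion, Spark jobs, Dremio queries) and still land as a single atomic change on the main branch.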
Learn More
To learn more about the key trends, please watch this keynote with Dremio co-founder and Chief Product Officer Tomer Shiran.
You can also sign up for Dremio Cloud here.