15 minute read · July 1, 2021
What is a Data Lake?
A data lake is a centralized repository that allows you to store all of your structured and unstructured data at any scale. In the past, when disk storage was expensive, and data was costly and time-consuming to gather, enterprises needed to be discerning about what data to collect and store. Organizations would carefully design databases and data warehouses to capture the information viewed as essential to the operation of their business. Today, with changes in the economics of data storage and improved analytic tools, how organizations view data collection and storage has changed dramatically.
Data Lakes – The New Normal
At the raw component level, a 1TB hard disk now costs approximately USD 50. The cost of data storage has plummeted from an estimated USD 500K per GB in 1980 to less than USD 0.03 per GB today. Storing a terabyte of data is a staggering 10M+ times less expensive than it was 40 years ago. Just as storage costs have plummeted, so too has the cost of data acquisition. In our modern age of mobile computing and big data, the cost of capturing data has dropped to almost zero, with nearly all data originating from electronic sources. Most transactions now leave a digital trail. These include in-store purchases, in-app e-commerce orders, and recorded customer service interactions via phone or chat. Whereas enterprises once needed to carefully consider what data to keep, the economics have shifted dramatically. Today it is often cheaper to retain all data in relatively inexpensive data lakes just in case the information is needed in the future.
About Data Lakes
Organizations increasingly see value in storing all the data that they collect in vast data repositories, referred to as data lakes. Like a real lake, a data lake stores large amounts of unrefined data, in its natural state, arriving from various streams and tributaries. Also like a real lake, the sources that feed the lake can change over time.

At the physical level, data lakes leverage big data storage technologies that support the storage of data in multiple formats. Popular storage solutions for data lakes include legacy HDFS (Hadoop Distributed File System), Amazon S3, Azure Blob storage, Azure Data Lake Storage (ADLS) and various other on-premises and cloud-based object stores. What these technologies have in common is that they rely on distributed systems where data is spread across multiple low-cost hosts or cloud instances. Data is typically written to three or more physical drives spread across hosts, networks and, in some cases, different physical locations to ensure continuous data availability.

Unlike traditional relational databases, data lakes can easily store any kind of data. Data can be structured, semi-structured or unstructured. Examples of structured and semi-structured data include CSV files, JSON text, OS or website logs, and time-series telemetry originating from a wearable device or equipment on a factory floor. Less structured data may include photos related to customer insurance claims, audio recordings of customer service interactions, or email archives containing raw text and encoded attachments.

Data lakes are only as useful as their metadata. Each item placed in a data lake is assigned a unique identifier and tagged with a set of extended metadata attributes. These attributes help ensure that items of interest can be recalled in the future.
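To make the metadata point concrete, the sketch below writes an object to an S3-based data lake and tags it with custom metadata attributes using the boto3 SDK. The bucket name, key and attribute names are hypothetical placeholders, not a prescribed layout.

```python
import boto3

s3 = boto3.client("s3")

# Write a raw insurance-claim photo to the lake, tagging it with searchable
# metadata. The bucket, key and attributes are illustrative placeholders.
with open("photo-001.jpg", "rb") as f:
    s3.put_object(
        Bucket="example-data-lake",
        Key="claims/2021/07/claim-12345/photo-001.jpg",
        Body=f,
        Metadata={
            "claim-id": "12345",
            "customer-id": "98765",
            "source": "mobile-app",
        },
    )

# Later, the metadata can be inspected without downloading the object itself.
head = s3.head_object(
    Bucket="example-data-lake",
    Key="claims/2021/07/claim-12345/photo-001.jpg",
)
print(head["Metadata"])
```

In practice, richer metadata often lives in a separate catalog, but even object-level tags like these make raw files far easier to find later.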
Why Use a Data Lake?
The motivation for data lakes is simple. With access to more and better information, organizations can make better decisions. They can gain valuable insights that can benefit the business in multiple ways.
Opportunities include identifying new upsell and cross-sell openings, avoiding customer attrition, and spotting efficiency gains that can reduce costs and boost profitability. More broadly, data lakes help organizations:
- Improve customer service
- Enhance product quality
- Increase operational efficiencies
- Improve competitiveness
- Make more informed decisions
"According to 2020 research from MarketsandMarkets, the global data lake market is expected to grow at a CAGR of 20.6% to US 20.1B by 2024."
With the advent of modern analytic tools, it has become easier to analyze data in its natural state, often in near real time. Enterprises are also increasingly leveraging new analytic techniques, including machine learning (ML), to make better predictions from data.

Training an ML model depends on having access to vast amounts of data. ML techniques such as deep neural networks (deep learning) can identify features in data with predictive value that human analysts or traditional analytic techniques would almost certainly miss. Better predictive models can help with everything from managing inventory and supply chains, to improving customer retention and loyalty, to catching product defects early.
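As a rough sketch of what ML on data drawn from a lake can look like, the example below reads a Parquet extract with pandas and trains a simple churn classifier with scikit-learn. The file name, feature columns and label are invented for illustration.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Historical data exported from the data lake (path and columns are illustrative).
df = pd.read_parquet("customer_activity.parquet")

features = df[["tenure_months", "support_tickets", "monthly_spend"]]
labels = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Fit a simple gradient-boosted model and report accuracy on held-out data.
model = GradientBoostingClassifier().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```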
Data Lake Use Cases
Data architecture modernization
Avoid reliance on proprietary data warehouse infrastructure and the need to manage cubes, extracts and aggregation tables. Run operational data warehouse queries on lower-cost data lakes, offloading the data warehouse at your own pace.
Business intelligence on data lake storage
Dramatically improve speed for ad hoc queries, dashboards and reports. Run existing BI tools on lower-cost data lakes without compromising performance or data quality. Avoid costly delays adding new data sources and reports.
Data science on data lake storage
Accelerate data science on data lake storage with simplified data exploration and feature engineering. Dramatically improve performance, making data scientists and engineers more efficient and resulting in higher-quality analytic models.
Cloud data lake migration
Deploy new applications to the cloud using data lake storage such as S3 or ADLS. Migrate away from older on-prem data lake environments that are expensive and difficult to maintain while preserving agility and flexibility.
The Evolution of Data Lakes
First-generation data lakes were typically implemented on premises using Hadoop, an open source technology originally developed by Doug Cutting at Yahoo and later released as a top-level Apache project. In Hadoop, data is stored in HDFS, a distributed file system built on clusters of storage-dense commodity servers. Hadoop initially provided a Java-based framework and a variety of tools to process and manage very large datasets in parallel.
The Early Days of Hadoop
In the early days, Hadoop’s Java-based MapReduce programming model was used to process and query large datasets residing in HDFS. While difficult to use, for organizations building large data lakes, Hadoop was the only game in town. Hadoop evolved rapidly and saw the introduction of new tools such as Pig (a high-level scripting language), Hive (providing SQL-like functionality) and Apache HBase (a columnar store). These tools made it possible to manipulate and query the large datasets in HDFS without writing Java programs directly. The Hadoop platform continued to evolve, adding YARN (a generalized resource manager) and data engines such as Impala that sidestepped Hadoop’s MapReduce roots.
Apache Spark
The need for faster data manipulation gave rise to Apache Spark, an in-memory data engine able to connect directly to data assets in Hadoop along with other data sources. Although Spark was developed independently of Hadoop, it became part of leading Hadoop distributions. Spark offered dramatically better performance than the older MapReduce technology and quickly overtook MapReduce in popularity. Spark also brought a variety of other capabilities to the Hadoop platform, including Spark SQL, Spark Streaming, MLlib (a machine learning library) and GraphX.
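As a rough illustration of how Spark can query data sitting in a lake directly, the sketch below reads a Parquet dataset from object storage and runs a Spark SQL aggregation. The s3a:// path, view name and columns are hypothetical.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("data-lake-example").getOrCreate()

# Read a Parquet dataset directly from data lake storage.
# The path and column names below are illustrative placeholders.
orders = spark.read.parquet("s3a://example-data-lake/orders/")

# Expose the DataFrame as a temporary view and query it with Spark SQL.
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""").show()
```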
Cloud-Based Object Stores
Around this same time, cloud providers were beginning to offer inexpensive object storage in the cloud. Examples include S3 and Azure Blobs (binary large objects). Like HDFS, these cloud object stores could store any type of enterprise data, but they were easier to deploy, manage and use. Cloud providers also began to introduce their own cloud services, including in-memory analytic tools, enabling data lake functionality directly in the cloud. These second-generation, cloud-based data stores represented another step forward: they supported near-real-time query performance, they decoupled storage from compute (allowing the two to scale independently), and customers could leverage the cloud provider's identity and access management frameworks to support multi-tenancy and ensure workload isolation.
Data Lake Architecture
A high-level architecture of a data lake is pictured below. The data lake is fed by a variety of sources pictured at the left of the diagram. Sources may include relational databases, NoSQL databases, Hadoop clusters, video or images, or data from various streaming sources.
Data in the data lake may be queried directly from various client tools via a modern data lake engine. Data may be extracted from the data lake to feed an existing data warehouse using ETL tools.
The data lake storage layer is where data is physically stored. In modern data lakes, data is frequently stored in cloud-based object stores such as Amazon S3 or ADLS but data may reside on premises as well. The data lake storage layer is not necessarily monolithic. Data in the logical data lake may span multiple physical data stores.
Data stored in data lake storage can exist in a variety of file formats, from text to various binary formats to specialized query-optimized formats. Some open source file formats, such as Apache Parquet, have their origins in Hadoop. Parquet is designed to support large, complex datasets with efficient compression and encoding while supporting column-oriented queries against large data tables. JSON (JavaScript Object Notation) is a popular format with developers. It is a lightweight, human-readable, text-based data-interchange format that can represent arbitrarily complex data. JSON is popular because it is easy to parse and generate in virtually any programming language, and it is frequently the basis of messages passed via modern RESTful APIs. Other tabular data may be stored in simple text files containing comma-separated values (CSV) or tab-separated values (TSV).
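To make the column-oriented point concrete, here is a small sketch using the pyarrow library that writes a table to Parquet with compression and then reads back only a single column. The table contents are invented for illustration.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table (the data is purely illustrative).
table = pa.table({
    "order_id": [1001, 1002, 1003],
    "customer_id": [7, 12, 7],
    "amount": [29.99, 120.00, 4.50],
})

# Write it as Parquet with Snappy compression.
pq.write_table(table, "orders.parquet", compression="snappy")

# Because Parquet is columnar, a reader (or query engine) can scan only
# the columns it needs rather than the whole file.
amounts = pq.read_table("orders.parquet", columns=["amount"])
print(amounts.to_pydict())
```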
Table formats are metadata constructs built on top of the various physical file formats described above to make collections of files behave like SQL tables. Examples include Apache Iceberg, Delta Lake and Hive tables, with catalogs such as AWS Glue and the Hive Metastore (HMS) tracking the table metadata.
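As a sketch of what working with one of these table formats can look like, the example below creates and queries an Apache Iceberg table from PySpark. The catalog name, warehouse path and package version are assumptions loosely based on the Iceberg Spark quickstart; exact configuration depends on your environment.

```python
from pyspark.sql import SparkSession

# Minimal local Iceberg setup. The package version, catalog name ("demo")
# and warehouse path are assumptions, not required values.
spark = (
    SparkSession.builder.appName("iceberg-example")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create an Iceberg table over data lake storage and query it with plain SQL.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id BIGINT,
        event_type STRING,
        event_ts TIMESTAMP
    ) USING iceberg
""")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'click', current_timestamp())")
spark.sql("SELECT event_type, COUNT(*) AS n FROM demo.db.events GROUP BY event_type").show()
```

The table format tracks schema, partitions and snapshots in metadata files, which is what lets an engine treat a directory of Parquet files as a transactional SQL table.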
The table repository layer provides a uniform view and optimized access to the various table formats in the diagram above. Project Nessie is an open source project that works with the table formats described above, including Iceberg, Delta Lake and Hive tables.
Analytic and data science client tools typically access data in the data lake through a data lake engine. A variety of standard protocols efficiently encode and transmit queries and return results to client tools. These protocols include ODBC, JDBC, RESTful APIs and Apache Arrow Flight.
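As an example of the client side of this arrangement, the sketch below connects to a data lake engine over ODBC using the pyodbc library and issues an ordinary SQL query. The DSN, credentials and table name are placeholders; any engine exposing an ODBC endpoint would work similarly.

```python
import pyodbc

# The DSN, credentials and table are illustrative placeholders pointing at
# whatever data lake engine exposes an ODBC endpoint in your environment.
conn = pyodbc.connect("DSN=datalake;UID=analyst;PWD=secret")
cursor = conn.cursor()

cursor.execute("SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region")
for region, revenue in cursor.fetchall():
    print(region, revenue)

conn.close()
```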
The data lake engine is an application or service that efficiently queries and processes the vast sets of data stored in a data lake via the various standardized software layers described above. Examples of data lake engines include Apache Spark and the Dremio data lake engine; streaming platforms such as Apache Kafka are commonly used to feed data into the lake that these engines query.
Data Warehouses vs. Data Lakes
Most enterprises operate both data warehouses and data lakes. Data warehouses have been around in various forms for decades, and variations include enterprise data warehouses (EDW), data marts and operational data stores (ODS). Data warehouses typically have carefully crafted schemas designed to answer predetermined queries quickly and efficiently.

Data lakes, by comparison, are a little over a decade old. They rely on highly scalable, but comparatively low-cost, object stores built across commodity servers and storage components. Data lakes can store a wide variety of data types, including binaries, files, images, video files, documents and text.
Challenges with Data Lakes
While analytic and query tools for data lakes have improved, the sheer diversity of data formats, storage technologies and use cases has led to a variety of different tools. This diversity of tooling has made data lakes harder to use for most business analysts. Also, despite improvements in file formats (Parquet, ORC, etc.) and increased use of in-memory analytic frameworks, queries made against data lakes are frequently slow compared to carefully optimized data warehouses. Data lakes have also lagged behind data warehouses in essential areas such as security, data governance and controls over data provenance and lineage.

For these reasons, many enterprises choose to operate both data lakes and data warehouses. Data is often stored in a data lake, and data valuable to business analysts is extracted to the data warehouse where it is easily accessible. This replication of data results in added expense, complexity and development-related delays when new reports or business dashboards are required.

Ideally, enterprise users would like to store their data in a single low-cost repository and use it to support both operational BI applications as well as data science and analytic applications. However, until recently, it was impractical in most instances to query the data lake directly while achieving acceptable performance and ensuring proper business controls over the integrity of data.
A Data Lake Engine Provides the Best of Both Worlds
A data lake engine is a software solution or cloud service that provides critical capabilities across a wide range of data sources for analytic workloads. Data lake engines provide a unified data model and set of APIs, enabling traditional BI and analytic tools to directly query data lake cloud storage without compromising on performance, functionality, security or data integrity. Data lake engines provide this functionality even if data is distributed across multiple relational or non-relational data stores such as Microsoft ADLS, Amazon S3, Hadoop or NoSQL databases.

Data lake engines provide a single point of access for enterprise data. They directly support essential enterprise requirements such as simplifying data access, accelerating analytic processing, securing and masking data, curating datasets, and providing a unified catalog of data across all sources. They do all of this while avoiding the cost and complexity of traditional data warehouses and the associated workflows to extract, transform and load data into a separate relational database. Data lake engines enable enterprises to leave data where it is already managed, providing fast access to all data consumers, regardless of the tools they use.