What Is a Data Lake?
A data lake is a centralized repository that allows you to store all of your data, at any scale, in one place.
Like a real lake, a data lake stores large amounts of unrefined data in its natural state, fed by various streams and tributaries. And, as with a real lake, the sources that feed it can change over time.
Data Lakes Are the New Normal
In the old days, the cost of storage and the complexity of software meant that organizations had to be picky about how much data they kept. Organizations would carefully design databases but use them only to store business-critical information. Today, storage is cheaper and the software for analyzing it is far more capable, so the way organizations view data collection and storage has changed dramatically.
Storing a terabyte of data is a staggering 10M+ times less expensive than it was 40 years ago. It's now a lot more viable to keep all the data your business generates. Sometimes it can even be cheaper to collect all the data you can in a data lake as it comes in, and sort it out later.
Just as storage costs have plummeted, so too has the cost of data acquisition. Thanks to the computers, laptops, tablets, and phones we use today, the cost of capturing data has dropped to almost zero. Nearly every interaction leaves a digital trail: in-store purchases, in-app e-commerce orders, even recorded customer service interactions via phone or chat.
Tools for Data Lake Storage
There are a variety of tools for data lake storage. Popular solutions include:
- Legacy HDFS (Hadoop Distributed File System)
- Amazon S3
- Azure Blob Storage
- Azure Data Lake Storage (ADLS)
Those are just a few examples, but lots of other on-premises and cloud solutions exist.
Whichever tool you choose, they all work in similar ways: data is spread across multiple low-cost hosts or cloud instances in a distributed system, and it is usually stored in multiple places simultaneously to provide a backup if something goes wrong.
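To make this concrete, here is a minimal sketch of landing a raw file in an S3-backed lake using the boto3 library. The bucket name and object key are hypothetical placeholders; the same pattern applies to Azure Blob Storage or ADLS with their respective SDKs.

```python
# Minimal sketch: land a raw file in an S3 data lake, unchanged.
# The bucket name and key below are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")  # credentials come from your AWS environment

# Upload the file as-is into a "raw" zone -- no transformation yet.
# The object store replicates it across hosts behind the scenes.
s3.upload_file(
    Filename="events-2024-01-01.json",
    Bucket="example-data-lake",
    Key="raw/events/2024/01/01/events.json",
)
```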
Data Lake File and Table Formats
Traditional databases had to store data in very specific, organized ways, but a data lake can easily store any kind of data, whether it arrives fully structured or completely unstructured.
File Formats
A data lake can store files in a variety of formats. Common file formats for data storage include:
- Comma-separated values, or CSV
- JavaScript Object Notation, or JSON
- Query-optimized open source formats, such as Apache Parquet
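Moving between these formats is straightforward with common Python tooling. The sketch below, with illustrative file names, reads a CSV file with pandas and rewrites it as query-optimized Parquet (assuming the pyarrow package is installed).

```python
# Illustrative sketch: convert row-oriented CSV into columnar Parquet.
import pandas as pd

df = pd.read_csv("orders.csv")  # loosely structured source data

# Parquet is compressed and columnar, so analytical engines can read
# only the columns a query actually needs.
df.to_parquet("orders.parquet", engine="pyarrow", compression="snappy")

# JSON is just as easy to produce for semi-structured use cases.
df.to_json("orders.json", orient="records", lines=True)
```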
Table Formats
Data lakes are only as useful as their metadata. Table formats are metadata constructs that let you interact with collections of files as tables: they help you understand what data you have in your data lake and make that data easier to use. Common table formats include:
- Apache Iceberg (open source)
- Delta Lake (Databricks)
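As one way to see what a table format adds, here is a hedged PySpark sketch that creates an Apache Iceberg table. It assumes a Spark session already configured with the Iceberg runtime and a catalog named `lake`; the catalog, database, and table names are all hypothetical.

```python
# Hedged sketch: create an Iceberg table with PySpark. Assumes the
# Iceberg runtime jars and a catalog named "lake" are configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# The table format layers schema, partitioning, and snapshot metadata
# on top of plain data files sitting in the lake.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(10, 2),
        order_ts    TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(order_ts))
""")
```

Because the partitioning lives in metadata (Iceberg calls this hidden partitioning), queries that filter on `order_ts` can skip irrelevant files without users having to spell out partition columns.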
A metastore stores metadata about all the tables in your data lake and how they are structured, essentially acting as a catalog for everything in your lake. Data lake metastores include:
- Dremio Arctic
- AWS Glue
- Hive Metastore
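For example, with the AWS Glue Data Catalog you can browse the lake's tables programmatically via boto3; the database name below is hypothetical.

```python
# Minimal sketch: list tables registered in the AWS Glue Data Catalog.
import boto3

glue = boto3.client("glue")

# Ask the metastore which tables exist and where their files live.
response = glue.get_tables(DatabaseName="sales")  # hypothetical database
for table in response["TableList"]:
    print(table["Name"], "->", table["StorageDescriptor"]["Location"])
```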
Data Lake vs. Data Warehouse vs. Data Lakehouse
Before the smartphone, we had to carry around lots of single-function devices: a diary, a camera, a phone. The smartphone brought the best parts of each together in one device, and in the same way, data lakehouses combine the best of both data warehouses and data lakes.
Data warehouses typically have carefully crafted schemas designed to answer predetermined queries quickly and efficiently. Data lakes store all your data, but historically they have been harder to query because the data is not rigorously structured and formatted for analysis.
Because a data lakehouse combines the features of a data lake and a data warehouse, it can be greater than the sum of its parts. It separates compute from storage, and it reduces the overall compute power needed to run queries by accessing standardized source data directly, whether or not that data has been fully structured.
A cloud-based lakehouse supports a wide range of schemas, data governance protocols, and end-to-end streaming. It can also read and write data simultaneously, making it a more stable platform for concurrent users.
Dremio and Data Lakehouses
Dremio helps companies get more value from their data, faster. Dremio’s forever-free lakehouse platform delivers high-performing BI dashboards and interactive analytics directly on the data lake.
Ready to go deeper? Read a more technical article on data lakes.