11 minute read · August 12, 2019
Cloud Data Lakes – What You Need to Know
Director of Technical Marketing, Dremio
A cloud data lake is a cloud-hosted centralized repository that allows you to store all your structured and unstructured data at any scale, typically using an object store such as Amazon S3 or Azure Data Lake Store. Its placement in the cloud means the data can be worked with as needed, whether for processing, analytics, or reporting. A cloud data lake can be used to store all of an organization’s data, including data generated from internal and external actions and interactions.
The broad term data lake is traditionally associated with Hadoop-oriented storage: an organization’s data is loaded into the Hadoop platform and then analyzed where it resides, on cluster nodes of commodity computers. While traditional data lakes have been built on HDFS clusters on-premises, the current trend is to move and maintain data lakes in the cloud as infrastructure-as-a-service.
A data lake can include structured data from relational databases (rows and columns), semi-structured data such as CSV and JSON, unstructured data (documents, for example), and binary data such as images or video. The primary utility of this shared storage is in providing a unified source for all of a company’s data, where each of these data types can be collectively transformed and analyzed; a minimal sketch of that idea follows.
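For instance, once structured and semi-structured files sit side by side in the lake, they can be combined at analysis time. Here is a minimal sketch using pandas, with hypothetical local file names standing in for lake objects:

```python
import pandas as pd

# Structured data: a relational-style CSV export (hypothetical file).
orders = pd.read_csv("orders.csv")  # e.g., columns: order_id, customer_id, amount

# Semi-structured data: newline-delimited JSON events (hypothetical file).
events = pd.read_json("events.json", lines=True)  # e.g., columns: customer_id, event_type

# Combine the two data types into a single analysis.
combined = orders.merge(events, on="customer_id", how="inner")
print(combined.groupby("event_type")["amount"].sum())
```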
How Do Cloud Data Lakes Work?
The data journey is designed to take advantage of the separation of compute and storage, so that each element can scale when necessary without slowing down the other; this elasticity is a key benefit of putting a data lake in the cloud. Additionally, because of its centralized location, cloud data lake infrastructure provides self-service access to users and developers, in contrast to on-premises solutions, which tend to silo information.
The Data Journey
To understand the structure and logic of cloud data lakes, it helps to follow the path that data takes, from ingestion through to analytics and reporting (a brief code sketch of the full journey follows the list):
1. Ingestion: The first step in the data journey, ingestion involves the uptake of structured and unstructured data. Data is collected and collated from multiple sources and transferred into the data lake in its original format. A major benefit of data lakes is that scaling can occur without the need to reconsider schemas, transformations, or data structures (as you would need to with a traditional data warehouse). Despite the ease of transfer and storage, companies will usually maintain multiple data lakes, separated to avoid any issues with data privacy or internal access privileges.
2. Storage: The second step in the data journey, storage is the controlled repository for all ingested data, prior to any transformations; this means all data can maintain its original state, whether it’s structured or unstructured. This simplified storage system allows businesses to collect and retain vast amounts of data, and the major cloud object stores (ADLS, S3, GCS) provide high availability, scalability, affordability, and security.
3. Processing: The third step in the data journey, where data is converted from its raw state into a consistent, queryable form so that different data types can be combined through aggregation, joins, and more. Once the data has been processed, it’s returned to the data lake, where it can be analyzed.
4. Analytics: The final step in the data journey, where stored, processed data is made available for analysis by data scientists, BI users, and more. This is ultimately the end goal of storing the data.
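To make the journey concrete, here is a minimal end-to-end sketch in Python using boto3 and pandas. The bucket name, key prefixes, and column names are hypothetical, and AWS credentials are assumed to be configured already:

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name

# 1. Ingestion: land a raw CSV export in the lake, unchanged.
with open("orders_export.csv", "rb") as f:
    s3.put_object(Bucket=BUCKET, Key="raw/orders/orders_export.csv", Body=f)

# 2. Storage: the object now sits in the raw zone in its original format.

# 3. Processing: read the raw file, clean it, and write it back as Parquet.
obj = s3.get_object(Bucket=BUCKET, Key="raw/orders/orders_export.csv")
orders = pd.read_csv(obj["Body"])
orders = orders.dropna(subset=["order_id"])  # example cleanup step

buf = io.BytesIO()
orders.to_parquet(buf, index=False)  # requires pyarrow or fastparquet
s3.put_object(Bucket=BUCKET, Key="processed/orders/orders.parquet", Body=buf.getvalue())

# 4. Analytics: the processed Parquet file can now be queried by engines
# such as Athena, Spark, or Dremio.
```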
Cloud Data Lake Platforms
The main cloud providers, Microsoft, Amazon, and Google, all offer cloud data lake solutions:
Microsoft Azure Cloud
Microsoft’s data lake offering, Azure Data Lake Store (ADLS), is a hyper-scale repository for cloud storage. Compatible with the Hadoop Distributed File System, ADLS is capable of managing trillions of files, including individual files that are petabytes in size. With high availability, ADLS was built with the express purpose of running and maintaining large-scale data analytics in the cloud.
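As a brief illustration, landing a raw file in ADLS Gen2 could look like the following sketch, which uses the azure-storage-file-datalake Python package; the connection string, filesystem name, and path are hypothetical:

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical connection string; in practice this comes from your storage account.
conn_str = "DefaultEndpointsProtocol=https;AccountName=mylake;AccountKey=<key>"

service = DataLakeServiceClient.from_connection_string(conn_str)
filesystem = service.get_file_system_client("raw-zone")  # hypothetical filesystem

# Upload a local file into the lake in its original format.
file_client = filesystem.get_file_client("events/2019/08/12/events.json")
with open("events.json", "rb") as f:
    file_client.upload_data(f, overwrite=True)
```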
Amazon Web Services
Amazon Web Services (AWS) offers a number of data lake solutions, the main offering being Amazon Simple Storage Service (Amazon S3). S3 is a highly scalable, industry-standard object store, capable of storing all data types. It is designed to be both secure and durable, and its standardized APIs allow for the use of external analytics tools.
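Because S3 exposes a standard API with SDKs in most languages, external tools can enumerate and fetch lake data directly. A minimal sketch with boto3 (bucket and prefix hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# List the processed files an external analytics tool could read.
resp = s3.list_objects_v2(Bucket="example-data-lake", Prefix="processed/orders/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Generate a time-limited URL so a tool outside AWS can fetch one object.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-data-lake", "Key": "processed/orders/orders.parquet"},
    ExpiresIn=3600,
)
print(url)
```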
Google Cloud Services
While its offerings aren’t as established as Microsoft’s or Amazon’s, Google does provide its own cloud data lake solution. Google Cloud Storage is a lower-cost cloud data lake that gives users access to Google’s own suite of ingestion, processing, and analytics tools.
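Reading raw data back out of Google Cloud Storage follows the same pattern as the other object stores. A minimal sketch with the google-cloud-storage Python package (bucket and object names hypothetical, credentials assumed configured):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-data-lake")  # hypothetical bucket

# Download a raw object from the lake for local processing.
blob = bucket.blob("raw/orders/orders_export.csv")
blob.download_to_filename("orders_export.csv")
```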
Cloud Data Lake Comparison Table
| Cloud Service | Ingestion | Storage | Processing | Analytics |
| --- | --- | --- | --- | --- |
| Microsoft Azure | Azure Data Factory, Azure Stream Analytics, Apache Sqoop, Azure PowerShell, Azure Portal, AdlCopy, DistCp | ADLS, Blob Storage, ADLS Gen2 | HDInsight (including Storm), Azure SQL Data Warehouse | Data Lake Analytics |
| Amazon Web Services | Amazon Kinesis, AWS Snowball, AWS Storage Gateway | Amazon S3, Amazon Glacier | AWS Glue | Amazon Athena, Amazon EMR, Amazon Redshift |
| Google Cloud Platform | Cloud Pub/Sub, Cloud Dataflow, Storage Transfer Service | Google Cloud Storage | Cloud Datalab, Cloud Dataprep | BigQuery, Cloud Dataproc, Cloud Bigtable |
Data Lakes: On-Premises, in the Cloud, & More
Data lakes can be built either in the cloud or on-premises, with the trend currently pointing to the cloud because of the elastic capacity and managed services that can be leveraged there.
For organizations that already maintain an on-premises data lake but are considering a transition to a cloud-based solution, the migration process can be very daunting: how can they transfer vast quantities of data and adapt customized technology to a universal cloud provider? And for businesses that maintain some combination of cloud and on-premises data lake solutions, why do they do it, and how can they transition away?
On-Premises: Data lakes maintained on-premises differ from their cloud counterparts in that they require the combined management of both hardware and software. This double duty demands greater engineering resources and expertise, and it also locks companies into a static scaling model, where they must maintain capacity overhead to avoid downtime as they expand storage.
Hybrid Data Lake: Maintaining both on-premises and cloud data lakes concurrently introduces its own benefits and challenges. Managing an on-premises operation requires additional engineering expertise, as does constantly migrating data between on-premises systems and the cloud. On the other hand, this two-pronged approach allows companies to keep less relevant data on-premises while placing more important data in the cloud, thereby benefiting from the speed of cloud services.
Cloud Data Lake: With a standard cloud data lake, the major benefits are availability, speed, and lower engineering and IT costs. This option allows businesses to operate swiftly, without having to measure every decision against available hardware and in-house expertise. The downside is that cloud services are paid for on a subscription model; over time this can cost more than the “buy once” model of local storage.
Multi-Cloud Data Lake: The final type of data lake, wherein multiple cloud offerings are combined, e.g., a business that uses both AWS and Azure to manage and maintain its data lakes. Maintaining multiple data lakes means benefiting from the advantages of each platform, but it also requires greater expertise, since getting disparate platforms to communicate isn’t always easy.
The Benefits of Building Data Lakes in the Cloud
Moving data storage to the cloud has become feasible for companies of all sizes; the scaling and centralized functionality allow for greater operational simplicity, more immediate data-driven insights, and more:
Capacity: With cloud storage, you can start with a few small files and grow your data lake to exabytes in size, without the worries that come with expanding storage and maintaining data internally. This frees your engineers to focus on more important things.
Cost efficiency: Cloud storage providers offer multiple storage classes and pricing options, which helps companies pay for exactly what they need, instead of planning around an assumed cost and capacity as is necessary when building a data lake locally.
Central repository: A centralized location for all object stores and data access means the setup is the same for every team in an organization. This reduces operational complexity and frees up time for engineers to focus on more pressing matters.
Data Security: All companies have a responsibility to protect their data, and with data lakes designed to store all types of data, including sensitive information like financial records or customer details, security becomes even more important. Under the shared responsibility model, cloud providers secure the underlying infrastructure, while customers remain responsible for securing their data and access controls.
Auto-scaling: Modern cloud services are designed to provide immediate scaling functionality, so businesses don’t have to worry about expanding capacity when necessary, or paying for hardware that they don’t need.
The Challenges of Data Lakes in the Cloud
The migration of data and infrastructure to the cloud has been a long time coming, and it reduces many operational burdens for businesses. That doesn’t mean it’s a perfect solution:
Migration: The biggest challenge for cloud data lakes is actually getting data into the cloud; the migration process can be incredibly daunting. It is not only complex but can also be expensive, especially when it has to happen repeatedly.
Data management: One of the benefits of a data lake can also be a challenge – data management. Because data lakes are capable of supporting all types of data (structured, unstructured, etc.), the management and cleanliness of data lakes can be an intensive process. When things get out of hand, data swamps can occur. A data swamp, full of poorly formed data, holds very little value to a business, and requires a lot of effort to fix.
Storage costs: While on-premises storage requires a significant upfront investment, cloud providers charge for storage on an ongoing basis, so costs accumulate over time. A business has to weigh its existing engineering and IT costs against the recurring “rental” of cloud services.
Self-service analytics: The main benefit of setting up a data lake in the first place is analytics. The ability to combine, transform, and organize disparate data sources is a huge benefit, but it requires an equally robust analytics solution. While most cloud providers offer analytics solutions, effectively utilizing and hooking into these analytics platforms isn’t always easy.
How Dremio Can Help
Dremio provides an integrated, self-service interface for data lakes. Designed for BI users and data scientists, Dremio incorporates capabilities for data acceleration, data curation, data cataloging, and data lineage, across any data source, all delivered as a self-service platform.
Run SQL on any data source, including optimized pushdowns and parallel connectivity to non-relational systems like Elasticsearch, S3, and HDFS.
Accelerate data using Data Reflections, a highly optimized representation of source data that is managed as columnar, compressed Apache Arrow for efficient in-memory analytical processing, and as Apache Parquet for persistence.
Integrated data curation. Easy for business users, yet sufficiently powerful for data engineers, and fully integrated into Dremio.
Cross-Data Source Joins. Execute high-performance joins across multiple disparate systems and technologies, between relational and NoSQL sources, S3, HDFS, and more (a query sketch follows this list).
Data Lineage. Full visibility into data lineage, from data sources, through transformations, joining with other data sources, and sharing with other users.
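As a rough illustration of a cross-source join, a client could submit SQL to Dremio over ODBC. This is a minimal sketch using pyodbc; the DSN, source names, and table paths are all hypothetical:

```python
import pyodbc

# Hypothetical DSN pointing at a Dremio coordinator via the Dremio ODBC driver.
conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cursor = conn.cursor()

# Join a Parquet dataset in S3 with a collection in Elasticsearch.
# The source and table names below are made up for illustration.
cursor.execute("""
    SELECT o.order_id, o.amount, c.segment
    FROM s3_lake.processed.orders AS o
    JOIN elastic.crm.customers AS c
      ON o.customer_id = c.customer_id
    WHERE o.amount > 100
""")

for row in cursor.fetchall():
    print(row)
```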
Visit our tutorials and resources to learn more about how you can gain insights from your data stored in ADLS, AWS, and more, using Dremio.