Dipankar is currently a Developer Advocate at Dremio, where his primary focus is educating data practitioners, such as engineers, architects, and scientists, on Dremio’s lakehouse platform and on open-source projects such as Apache Iceberg and Apache Arrow that help data teams apply and scale analytics. In his past roles, he worked at the intersection of machine learning and data visualization. Dipankar holds a Master’s in Computer Science, and his research area is Explainable AI.
Apache Flink is an open source data processing framework for handling batch and real-time data. While it supports building diverse applications, including event-driven and batch analytical workloads, Flink stands out particularly for streaming analytical applications. What gives it a solid edge with real-time data are features such as event-time processing, exactly-once semantics, high throughput, […]
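For a flavor of what a Flink streaming job looks like, here is a minimal PyFlink sketch; the bounded collection and event names are illustrative stand-ins for a real source such as Kafka, and a production job would also configure watermarks for event-time processing.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Illustrative bounded collection standing in for a real streaming source.
events = env.from_collection([("clicks", 1), ("views", 1), ("clicks", 1)])

# Key by event type and maintain a running count per key.
counts = events.key_by(lambda e: e[0]).reduce(lambda a, b: (a[0], a[1] + b[1]))

counts.print()
env.execute("event_counts")
```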
Data quality is a pivotal aspect of any data engineering workflow, as it directly impacts downstream analytical workloads such as business intelligence and machine learning. For instance, you may have an ETL job that extracts customer data from an operational source and loads it into your warehouse. What if the source contains inconsistent […]
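As a small illustration of the kind of checks such a pipeline might run before loading, here is a pandas sketch; the column names and validation rules are hypothetical.

```python
import pandas as pd

# Hypothetical customer extract; column names and values are illustrative.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "email": ["a@example.com", "b@example.com", "b@example.com", "not-an-email"],
})

issues = {
    "null_ids": int(df["customer_id"].isna().sum()),
    "duplicate_ids": int(df["customer_id"].duplicated().sum()),
    "bad_emails": int((~df["email"].str.contains("@", na=False)).sum()),
}

# Fail the pipeline before loading if any check trips.
if any(count > 0 for count in issues.values()):
    raise ValueError(f"Data quality checks failed: {issues}")
```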
Catalogs in Apache Iceberg
In the Apache Iceberg world, a catalog is a logical namespace that contains the information needed to fetch metadata about tables. A catalog acts as a centralized repository for managing tables and their versions, facilitating operations such as creating, updating, and deleting tables. Most importantly, the catalog holds the reference to […]
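To make this concrete, here is a minimal PyIceberg sketch of connecting to a catalog and resolving a table; the REST endpoint, warehouse path, and table name are assumptions for illustration.

```python
from pyiceberg.catalog import load_catalog

# Illustrative connection details; point these at your own catalog.
catalog = load_catalog(
    "default",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",
        "warehouse": "s3://my-warehouse/",
    },
)

# The catalog resolves a table name to its current metadata file.
table = catalog.load_table("db.customers")
print(table.metadata_location)
```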
The Apache Iceberg 1.2.0 release brings a range of exciting new features and bug fixes. The release centers on changes to the core Iceberg library and compute engines, together with a couple of vendor integrations, making the ecosystem of tools and technologies around the ‘open’ table format extremely robust. Among the noteworthy features is […]
Imagine you are a data engineer on your company’s platform engineering team. Your responsibilities include building data pipelines and infrastructure to make data available and to support analytical workflows such as business intelligence (BI) and machine learning (ML) across your organization. In the past, your analytical workloads used to run on […]
What Is a Data Lakehouse?
A data lakehouse combines the performance, functionality, and governance of a data warehouse with the scalability and cost advantages of a data lake. With a data lakehouse, engines can access and manipulate data directly from data lake storage without copying data into expensive proprietary systems using ETL pipelines. Learn more […]
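As a small sketch of that idea, here is how a client library such as PyIceberg can query a table in place on the lake; the catalog settings, table name, and filter are illustrative assumptions.

```python
from pyiceberg.catalog import load_catalog

# Illustrative catalog settings; any engine with Iceberg support can read
# the same files directly from object storage.
catalog = load_catalog("default", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("db.sales")

# Query the table in place on the data lake; no copy into a warehouse.
df = table.scan(row_filter="region = 'EMEA'").to_pandas()
print(df.head())
```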
Every organization considers dashboards a key asset in its decision-making process. Now, as organizations invest more and more in their data strategy, they are increasingly focused on making dashboards self-serviceable. The idea is to let users at any level, irrespective of their technical expertise, access these reports and answer critical […]
Experimentation in Machine Learning
Unlike software engineering, which is usually backed by established theoretical concepts, the world of machine learning (ML) takes a slightly different approach to productionizing a data product (a model). Like any new scientific discipline, machine learning leans a bit more toward the empirical side to determine […]
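One common way to manage that empirical loop is to track each trial’s parameters and metrics; here is a minimal sketch using MLflow, with hypothetical experiment names and values.

```python
import mlflow

# Hypothetical experiment; parameter and metric values are illustrative.
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 200)
    # ... train and evaluate the model here ...
    mlflow.log_metric("auc", 0.87)
```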
Trying out a new project that has dependencies and integrates a couple of technologies can be a bit daunting at first. However, it doesn’t have to be that way. Developer experience is super critical to everything we do here in the Dremio Tech Advocacy team. So, through this notebook, the idea is to simplify configurations, etc., […]
As co-creators of Apache Arrow, we at Dremio have been really excited over the past several years to see its tremendous growth, bringing more usage, ecosystem adoption, capabilities, and users to the project. Today, Apache Arrow is the de facto standard for efficient in-memory columnar analytics, providing high performance when processing and transporting large […]
Puffin is here in Apache Iceberg
The Apache Iceberg community recently introduced a new file format called Puffin. Hold on. We already have Parquet and ORC. Do we really need another file format, and does it give us additional benefits? The short answer is yes! Until now, we had two ways of gathering statistics for efficient query […]
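For context, newer Apache Iceberg releases expose a Spark procedure that computes column NDV sketches and writes them out as a Puffin statistics file; here is a hedged PySpark sketch, where the catalog and table names are assumptions and the session is presumed to be Iceberg-enabled.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog
# named "demo"; the table name is illustrative.
spark = SparkSession.builder.appName("table-stats").getOrCreate()

# Computes NDV sketches and attaches them to the table's current
# snapshot as a Puffin statistics file.
spark.sql("CALL demo.system.compute_table_stats(table => 'db.customers')")
```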
This tutorial introduces the Z-order clustering algorithm in Apache Iceberg and explains how it adds value to the file optimization strategy.
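The tutorial walks through this in Dremio; for reference, Iceberg also exposes Z-ordering through its Spark rewrite_data_files procedure, sketched below with an assumed catalog, table, and pair of columns.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session configured with an Iceberg catalog named "demo";
# the table and column names are illustrative.
spark = SparkSession.builder.appName("zorder-rewrite").getOrCreate()

# Rewrite data files, clustering rows along a Z-order curve over two
# columns so queries filtering on either column can prune more files.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.customers',
        strategy => 'sort',
        sort_order => 'zorder(city, signup_date)'
    )
""")
```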
A hands-on tutorial for building a Tableau dashboard directly on the data lake using Dremio.
This tutorial provides a practical deep dive into the internals of Apache Iceberg using Dremio Sonar as the engine.
Over the past few years, more and more enterprises have wanted to democratize their data to make it more accessible and usable for critical business decision-making throughout the entire organization. This created a significant focus on making data centrally available and led to the popularization of monolithic data architectures. In theory, with monolithic data architectures […]
Data mesh is a decentralized approach to data management that focuses on domain-driven design (DDD). It aims to bring data closer to business units or domains, where people are responsible for generating and governing the data and treating it as a product. As an architectural approach to designing data-driven applications, data mesh provides a way […]
This blog post covers the history behind Apache Arrow and how it addresses challenges in today’s data landscape.
Enable the business to create and consume data products powered by Apache Iceberg, accelerating AI and analytics initiatives and dramatically reducing costs.