Gnarly Data Waves



Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Alex Merced:

Getting Started With Hadoop Migration

Hey everybody, this is Alex Merced and you're listening to another episode of Gnarly Data Waves, presented by Dremio. My name is Alex Merced, Developer Advocate here at Dremio, and I will be your host this evening. In this week's episode we'll be talking about getting started with Hadoop modernization. Okay? This will be an exciting topic about how to take your existing data setup and modernize it to get more bang for your buck, so that you can move towards some of today's architectures that will bring you more ease of use, lower cost, better performance, and a lot of other benefits. Okay? Now, in that, we're talking about moving towards a data lakehouse type of infrastructure. And the best way to see whether this is what you need is to get hands-on with the data lakehouse.

Information on Dremio Subsurface Live

So one way you can do that is you can head over to Dremio.com and try out the Dremio test drive. Within a few clicks you can get hands-on and see what it's like to experience the data lakehouse, how it is to connect and live query data from Tableau, straight from your data lake storage, and lots of other really cool features. So, head over to Dremio.com, try out the test drive. No commitment, no cost, just a way for you to evaluate whether this is the architecture that you need. Also, on March 1st and March 2nd is going to be our annual Subsurface conference, the data lakehouse conference. Okay? And again, this year is going to be special because we're not just doing it virtually. We're also going to be having some live locations in San Francisco, New York, and London.

So again, you can go register at Dremio.com/subsurface, and keep in mind that when you're registering, if you're interested in being in one of those live locations, it'll be based on capacity, but you can mark off your request to be in one of those locations on your registration form. So go register now because space is limited. And again, that's going to be on March 1st and March 2nd, with locations in San Francisco, New York, and London, with talks on Apache Iceberg, data lakehouse implementations, and all sorts of other really great topics. Head over to Dremio.com/subsurface to register today. Okay. And again, this week's episode is going to be on getting started with Hadoop modernization, and our presenters today are going to be Kamran Hussain, Field Solution Architect for Dremio, and Tony Truong, Senior Product Marketing Manager here at Dremio. So without further ado, Kamran, Tony, this stage is yours.

What Will You Learn in This Video?

Tony Truong:

Thank you, Alex, for the introductions. Hey everybody, welcome back to Gnarly Data Waves. And hello, if this is your first time joining. Today, we'll be going over getting started with Hadoop migration and modernization. Before we get started, I'll be going over what you'll be learning today. So in this episode, you'll be learning about some of the challenges with Hadoop and why organizations are migrating off of it. And then we'll talk about some of the options for migrating off of Hadoop. And third, Kamran will be going over the phased approach to Hadoop migration and modernization. Finally, you'll be seeing a demo of exactly how this works. All right, now let's talk about some of the challenges with Hadoop. If you and your team have inherited legacy Hadoop systems and you're trying to migrate off of the platform, then this episode is for you, right?

You may be maintaining and troubleshooting clusters that require resources with deep expertise in the Hadoop ecosystem. And if you look on the screen here, you see there are many components to Hadoop, and we often see organizations that have teams that are dedicated just to maintenance. And there's a reason why, because having subject matter expertise in each of these components of Hadoop requires somebody that really knows what they're doing. The second challenge is that you're probably dealing with the high cost of scalability as your data grows. And one of the reasons why you're dealing with the high cost is that you're not able to separate storage from compute, right? You need to add storage capacity to store the increasing amount of data, which is usually never deleted. But at the same time, you don't need additional compute. But by design, the Hadoop architecture requires you to add both, which leads to high costs.

The third and fourth points here go hand in hand, right? You have high latency overhead from your query engines like Hive, which requires a lot of query performance tuning. And fourth, you're not able to enable governed self-service analytics. You could technically access Impala or Hive from a BI tool; it's more that the performance and efficiency weren't there. What ends up happening is that IT will end up locking down the environments and only provide curated data sets after your end users go through the ticketing process to get additional data sets and changes to their data. And so the data engineers are the ones who end up having self-service access, because they have the skills to write their own, and usually more efficient, queries to do exploratory analysis and build their own data sets. And so your end users end up creating their own data sets, which creates data sprawl.

Who Is at Dremio and What Do They Do?

And this becomes a data governance nightmare. To give you a short overview of who we are at Dremio, we are the easy and open data lakehouse. And fun fact, our co-founder Tomer was one of the founding members of MapR. And so he's seen a lot of the challenges that came with the Hadoop ecosystem, and that's what led him to start Dremio to solve some of these challenges. And on the right-hand side here, you'll see some customers who we have helped on their data lakehouse journey. So now let's go over some of your options for migrating off of Hadoop. One of your options is to migrate Hadoop to a cloud-managed Hadoop platform, right? And moving to a cloud-managed Hadoop platform really poses some challenges around performance, data governance, and security. These cloud-based Hadoop systems typically use clusters that are over-provisioned and run continuously to handle these workload requirements.

What Problems Does Data Migration Usually Come With?

However, we've seen customers come to realize that the challenges that they face in their on-prem environments, such as reliability and scalability issues, are now carried over to their cloud-based Hadoop platform. For instance, it takes a considerable amount of time to provision and autoscale clusters during peak hours. And what ends up happening is that they opt to maintain long-running and over-provisioned clusters to accommodate these workload demands. In addition, they spend a lot of time dealing with troubleshooting, infrastructure, and resource management, and end up maintaining a lot of pipelines to integrate these managed services. Your second option is to use a lakehouse query engine. There are only a few players in this space, and most of them require you to have data in cloud object storage. So if you have data that needs to stay on-prem due to security and compliance reasons, or if your organization is not ready for the cloud, then cloud-based lakehouse query engines probably won't be a great solution for you.

And then finally, we have the cloud data warehouse. One complexity of migrating Hadoop to a cloud data warehouse is that it requires a deep understanding of both architectures. You'll also find that it involves significant ETL work to reformat and restructure the data for the data warehouse, particularly for organizations with large and complex data sets. Another problem with cloud data warehouses is vendor lock-in, right? To take advantage of cloud data warehouses, your data would need to be in a proprietary data format, meaning it locks you into that specific engine. And in the real world, enterprise data platform teams have more than one data warehouse across multiple clouds. We've seen organizations with Redshift, Snowflake, and BigQuery, and what they end up doing is they have to move that data from the warehouse into cloud object storage, because by design, these data warehouse vendors don't talk to each other.

And so these industry solutions are great for analytics and offer one way to migrate off of Hadoop, but they all share a universal shortcoming, which is that it's hard for organizations to enable self-service analytics. And like I talked about before, most organizations have data that is sitting in the cloud and on-prem across databases, data warehouses, and data lakes. The data silos make it difficult to provide end users with access to data, and there's never a single source of truth. It is possible to decommission your on-prem Hadoop workloads entirely using the various options that we talked about today, such as cloud-managed Hadoop platforms, lakehouse query engines, and data warehouses. However, due to the high cost and complexity of copying data across managed platforms, most data teams can't realize the total business value of migrating off of Hadoop. A core concept here at Dremio is that your data architecture should be easy and open, and allow you to get self-service analytics across all of your data. All right, now I'll hand things over to Kamran, who'll be walking you through the phased approach to migrating off of Hadoop with Dremio. Take it away, my friend.

How to Start Hadoop Migration With Dremio

Kamran Hussain:

Thanks Tony. So let's get into the details of the Hadoop migration approach. To see immediate results, it would be best to first switch the query engine to Dremio. Not only would you see sub-second response times, you would also reduce the complexity and provide self-service to business users. Next, you can move the data from HDFS to modern data storage like object storage in the cloud or on-prem, depending on your company's cloud adoption. With these two components off of the Hadoop ecosystem, you'd be in a great position to implement a data lakehouse. A data lakehouse gives you the performance and functionality of a data warehouse at the cost of a data lake. Let's take a look at each one of these stages in a little more detail. You can simply connect Dremio to your existing Hadoop clusters. Dremio can be deployed on YARN if you have the capacity in your Hadoop environment, or you can run Dremio on its own instances on Kubernetes, on-premises or in the cloud.

And we offer many other deployment methods as well. This is a very low-risk approach, having minimal impact on your production environment. You would immediately see sub-second latency performance compared to the other engines. We have many customers who have implemented Dremio for exactly this use case and saw drastic performance improvements over their existing Hadoop technology. Now, one of the biggest complaints we hear from business users is not being able to access all the data with one tool. By implementing Dremio's query engine, your business users would be able to unify all of your data for self-service analytics. Dremio allows you to natively connect to your Hadoop environment and object storage. We can also federate queries across relational and NoSQL sources like Snowflake, SQL Server, Oracle, MongoDB, Postgres, and many others. Now, during the demo, I'll show you how Dremio's semantic layer allows you to easily build virtual data models in staging, business, and application layers.

Dremio's Interactive SQL UI

Now, we've designed our UI to be very interactive for SQL users. Once your queries are running in seconds and you have empowered your business users with self-service analytics, you can start to migrate the data off of HDFS to object storage in the cloud, or on-prem to S3-compatible technologies like MinIO, ECS, and others. The data movement would be done with an ETL tool. And this process is a very low-risk function, because you'll be moving data in stages and have both data sets available to Dremio for testing before turning off the data that's stored in Hadoop. By turning off your Hadoop environment, you would not only reduce hardware and license costs, but you would also reduce the complexity in your architecture. Now, in order to fully implement a data lakehouse, you would need to migrate to an open table format like Apache Iceberg.

This enables DML, schema evolution, time travel, and other data warehouse functionality. And by implementing an open table format, you will not get locked into a proprietary table format, which is what's required by most of the vendors like Teradata or Oracle, and so on and so forth. This also means that data is available not only to Dremio, but to other engines as well, for example Spark for doing ETL-type workloads. Now, remember this slide that Tony showed earlier; hopefully now you have an idea of how you can decommission all of these Hadoop components and start to use Dremio. So let's bring it all together and see what a modern data lakehouse architecture would look like with Dremio. You have your applications and devices generating and storing data in many different formats and locations on your left side. Then looking at the right side of this architecture.

What Tools Does Dremio Provide for Data?

Dremio supports many different use cases for data science and BI data consumers. Dremio provides access to any tool through JDBC, REST API, and Arrow Flight, which is designed for high-speed data transfer and is especially useful for data science use cases. Now looking at Dremio in the middle, Dremio's query engine enables access to all the data and provides very fast performance for interactive and ad hoc analysis. Dremio offers a browser-based UI, which I'll show you during the demo, which allows data curation without knowing any SQL, or, for power users, very powerful SQL functionality as well. Now with our semantic layer, not only do you get a unified, business-friendly view of all the data, you can also assign role-based access and fine-grained access control to those users.

And lastly, we're delivering some very innovative functionality for lakehouse management: just like you can branch and merge code for applications, for example with Git, you can do the same with data. This service is a catalog for Iceberg that also provides data optimization functionality; more on this topic in future Gnarly Data Waves episodes. And lastly, Dremio's easy and open lakehouse platform is the easiest way to implement a data mesh or analytics workloads as well. More specifically, Dremio provides four fundamental capabilities that are required to support a data mesh: a semantic layer and intuitive user experience that gives domains a self-service experience, a lightning-fast query engine that supports all SQL workloads, and lastly, a metastore service. Let's quickly summarize what differentiates Dremio from other technologies. First is the unified semantic layer that empowers business users to do self-service analytics with our modern, SQL-friendly UI.

Second is our open platform based on Apache technologies like Arrow, which, by the way, is our in-memory columnar format, file formats like Parquet, and the Apache Iceberg table format. And third is our sub-second performance at 1/10th the cost. Dremio was built from the ground up to deliver interactive query performance. What makes that possible is our columnar cloud cache, which enables us to deliver NVMe-grade performance directly on data lake storage. Then we have data reflections that intelligently pre-compute various aggregations on data, and of course, the use of Apache Arrow. And Dremio has done a lot of work in workload management. We offer a multi-engine architecture to isolate workloads. And lastly, our auto-scaling capability, which helps reduce infrastructure cost. And finally, before we get to the demo, let's take a look at two customers who have moved to Dremio from Hadoop.

TransUnion was one of our early customers who saw value in Dremio right away when they started testing it against Apache Drill. They were experiencing slow performance with SQL-on-Hadoop with Apache Drill on a lot of data. As you can see, they saw immediate performance gains using our reflections and the self-service ability to explore data. So Dremio empowered analysts and customers with interactive dashboards. Now, NCR saw a 30x performance improvement when they moved to Dremio from Hadoop. We have many more customers who have seen similar performance improvements with Dremio, and business users love the self-service capability as well when you compare it to a complex Hadoop ecosystem. All right, I'm excited to show you Dremio in action now.

How To Navigate Dremio's SQL UI

So let's take a look at Dremio's UI. So here I'm connected to a cluster which has a lot of the sources that we just talked about. The UI is made up of the sources, which you're seeing here at the bottom. I already have a few of the sources for our demo. It's very easy to add multiple sources. You just click on this Add Source button, and you can see here the metastores that are available, the object storage that's available, and the different relational sources that we talked about are also available. So I'll be showing you a few of these. So first I've got Postgres. I've connected that to an existing Postgres environment. With version 24, which just came out, we also introduced a Dremio-to-Dremio connector. So I've got that configured as well.

Then I've got two different sources of S3. And if we take a look at the settings here, I'm using an access key, and I'm also providing a specific bucket here. So you can do that, and you can do many different configurations for exactly what you want available. And you can see here the default table format is Iceberg. You can change that to Parquet here as well, but out of the box it's Iceberg. And then I've got an HDFS source set up as well. I'll show you that I'm using this path here, so all the data that's in the user demo path will be available to Dremio. And then I've also got Hive configured here. So those are all the sources. Next, before we get into the spaces, on the left-hand side here we've got the Datasets view, which is what I was just showing you.

Then we've got the SQL Runner, which is the query environment. So you can actually access all of the data, because SQL is what's most useful for users and most people are used to it. You can directly come in here and start to see what files are available through S3 and start to actually consume that data. You can see what's available in Dremio, what's available in HDFS, and start to actually write queries against all of those, Hive as well, and so on and so forth. So we'll be doing some of that. And then lastly, here on the left side, you've got the Jobs view as well. So we'll be running some jobs and we'll take a look into each one of those. So we talked a lot about semantic layers.

How Does Dremio Organize Semantic Layers?

So that's how Dremio organizes the semantic layer: it's called spaces. You can think of a space as a schema in a relational context. So these are basically spaces that you create, and you can do that just by clicking on Add Space. And then you can see here, you can grant access to them as well. So in Staging, I've got a number of different data sets here. What I'll point out here is that the green icons are what we call virtual data sets, or VDSs. So you'll see here there's Shippers_PG, which is Postgres. I've just named it like that so we know where it's coming from. So if I come in here and look at Postgres, we'll see that there's that Shippers data set, right? And you can see here, this is a purple icon.

So this is a physical data set. Whether I'm on the relational database here or on object storage here, the purple icons are the actual physical data sets, whereas the green icons are the virtual data sets. So in Dremio, we never want you to copy any data. We never want you to build cubes or any type of pre-built aggregations that copy the data. Everything within Dremio is virtual, so we're saving on the storage cost as well, while making it extremely fast. So these are all the different virtual data sets that I have available as part of the demo. We'll take a look at Orders, for example. So it's as simple as coming in here, clicking on that. And if you do a preview, you'll see here the data is pulled over, just as if this was a relational table.

Dremio SQL UI Live Demo

So this is something that most of the customers, most of the users, are used to. So even though the file is a Parquet file, it's orders data stored in S3, as we can see right here. So this graph view shows you the lineage of what I'm looking at. I've called it S3 just for this demo so that we can follow along, because we've got many different sources. But you can see here, by simply going to the graph view, you can see where it's coming from. So this virtual data set, again, is a green icon. It's called Orders S3, and it's coming from the orders table in AWS S3 right here. So we'll go back to Staging. So again, staging is the one-to-one, or raw, data. And then I'll show you an example from HDFS.

So this is a CSV file that's stored in HDFS. We can come in here, and again, this is the virtual data set. This is a physical file, which is stored on HDFS, and then we're going to go into the business and the application layers as well here in a minute. Alright, so before we look at the business and application spaces, let's go ahead and build out a virtual data set. So we're going to start with a data set that's available in S3. These two S3 sources are just pointing to different buckets. So I'm going to use this one here. I'm going to come into CSV files, and we want to build a virtual data set for the US states. So we simply go here to format the file. It automatically recognizes that it's a text file, and we'll extract the field names and hit save. So now Dremio has stored that as metadata, and it's saying, okay, well, you can go ahead and start querying. So right there, very simply, that data set that's in S3 is now available for querying. And we'll go ahead and save that as a VDS, or virtual data set, in Staging, and we'll call it US States.
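For reference, the SQL equivalent of promoting that file and saving it as a VDS looks roughly like this; the S3 path and file name are illustrative guesses based on the demo, not the exact paths used:

```sql
-- Hedged sketch: the S3 path and file name are illustrative.
-- Once the CSV has been formatted (promoted), it can be queried directly:
SELECT * FROM S3."csv-files"."us_states.csv";

-- Saving that query as a virtual data set (VDS) in the Staging space:
CREATE VDS Staging."US States" AS
SELECT * FROM S3."csv-files"."us_states.csv";
```

Newer Dremio versions also accept CREATE VIEW for the same purpose; the UI steps in the demo accomplish the same thing without writing any SQL.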

So you can see here now, in the Staging space, I've got US States. So earlier, if you looked at the CSV files in S3, US states had a CSV extension on it; we want that to be a little bit more business friendly. So now it's just called US States. If you look at it and look at the graph, you can see here it's pointing to that CSV file in S3. So that's how easy it is to create all of these, and that's what I've done for our demo today. So now let's jump into the Business space. The Business space is where you start to bring data sets together so that they're more useful for analysts and consumers. So first, we'll take a look at Order Info. What I'm doing here is I'm actually joining the Orders S3 table and the Order Details table.

And again, we can simply come in here and take a look at it. This is the Order Info virtual data set. It's joining the two data sets in S3, and we can start to run that, right? So now you can see that we've got a data set that's joining two tables within S3. So if we go back to our Business layer, let's move on to the next level, right, by bringing in more data sets. So as you're moving from one stage to another, which we covered in our approach, you can bring in other data sets as well. So in this case here, before we run it, let's take a look at it. So now I'm joining S3 as well as Postgres. So my product information here is, as you can see, in Postgres.

So we'll go ahead and run that. And that runs and brings in the product name and the category ID as well. So earlier we just had order information; now we have product information as well. And then lastly, we'll look at the category information. So we now want to join not only what we were joining earlier, but we also want to consume some data from Hive, right? So as you can see, Dremio makes it very easy to connect to multiple sources and join that data, and from a business user perspective, they think it's just a data set that's available to them. They don't know where it's coming from, but all of the data is available to them at sub-second latency. So we'll go ahead and run this. And now we've got our order information, we've got our product information, and now we've also got our category information.
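To make that concrete, a virtual data set that federates those three sources would be defined with SQL along these lines; the source paths and column names are illustrative assumptions, not the demo's exact query:

```sql
-- Hedged sketch: source paths and column names are illustrative.
-- One VDS joining orders in S3, product info in Postgres, and categories in Hive.
SELECT
  o.order_id,
  o.order_date,
  p.product_name,
  p.category_id,
  c.category_name
FROM Business."Order Info"      o  -- orders + order details, from S3
JOIN Postgres.public.products   p ON o.product_id  = p.product_id
JOIN Hive."default".categories  c ON p.category_id = c.category_id;
```

From the business user's point of view this is just one data set; Dremio handles the federation underneath.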

So the product information was in Postgres, the category information was in Hive, and now that's all available through Dremio as if it was in a single data warehouse or database that users can use, right? So that's that view again. So now let's take a look at the Application layer. We need to look at shipping information, and we'll do some data curation in there. So as you can see here, I'm doing some data curation. I'm doing a TO_CHAR and a CASE so that I can get the day of the week from the ship date, and shipping happens on the second day. So if that happens to fall on a Saturday or Sunday, then we want to push the ship date out so that it lands on a weekday, right?

So you can see here I'm using a CASE statement, TO_CHAR, typical functions that are available to business users. So if I run that, what you're looking at here is the order information, but I also have an additional column, a virtual column, which is a ship date that I've just set to the order date plus two, right? So you can see the order date is 7/4, and the ship date is going to be two days later. And if that happens to fall on a Saturday, which in this case it is, then I want the ship date to be a Monday, which is the aim, right? So that's the type of curation that business users do, which is very, very easy to do in Dremio by using our function capability, which is right here, right?
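As a rough sketch of that curation logic, the expression would look something like the following; the table and column names, the 'DY' day-abbreviation pattern, and the DATE_ADD offsets are assumptions rather than the demo's actual SQL:

```sql
-- Hedged sketch: names, the 'DY' day pattern, and offsets are illustrative.
-- Ship two days after the order date, pushing weekend ship dates to the following Monday.
SELECT
  o.order_id,
  o.order_date,
  CASE TO_CHAR(DATE_ADD(o.order_date, 2), 'DY')
    WHEN 'Sat' THEN DATE_ADD(o.order_date, 4)   -- Saturday -> Monday
    WHEN 'Sun' THEN DATE_ADD(o.order_date, 3)   -- Sunday   -> Monday
    ELSE DATE_ADD(o.order_date, 2)
  END AS ship_date
FROM Business."Order Info" o;
```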

So, for example, DATE_ADD; all the functions are available here. So it's readily available and very easy for you to look at. Alright, now let's take a look at some governance capability. So what I have here in the Application layer, I've got another data set, Customer Detail. And you can see here what I've done is I've said, if the query user is a shipping agent, then I want to concatenate and mask some data. I don't want them to see the whole phone number, and I also don't want them to see a number of different columns, right? So this virtual data set is pointing to the customers data, which happens to be in HDFS, so I want to only select a few of the columns, and I also want to then mask the data depending on who the user is.
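Here's a rough sketch of what that kind of user-based masking can look like in a VDS definition; QUERY_USER() is a real Dremio function, but the user name, column names, and masking expression here are illustrative assumptions:

```sql
-- Hedged sketch: user name, column names, and the masking expression are illustrative.
-- Limit the columns and mask the phone number when a shipping agent runs the query.
SELECT
  c.customer_id,
  c.company_name,
  c.city,
  CASE
    WHEN QUERY_USER() = 'shippingagent'
      THEN CONCAT('XXX-XXX-', SUBSTR(c.phone, 9, 4))  -- expose only the trailing digits
    ELSE c.phone
  END AS phone
FROM HDFS."user"."demo"."customers" c;
```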

Can Dremio Mask SQL Data For Specific Users?

So if I run that right now, you can see I've got the whole phone number showing up, and a bunch of other information showing up as well. Now, if we log out of this user and log in as a user called shipping agent, you can see here I've only granted them the Application layer and only these virtual data sets. So they don't have the ability to look at any other sources, relational or object storage, and they don't have the ability to look at the other spaces either. So this is a very, very easy way to enable governance, and you know exactly what you want to grant a specific user access to. So in this case, a shipping agent would only need access to certain information. So if I look at Customer Detail in this case and run it, you can see here the phone number has been masked, because Dremio recognized that it's the shipping agent user.

They shouldn't be able to see all of the information, so we only have a few of the columns available here. If we go back and look at the Customers view, that view does not have masking enabled. So you can see here, we can see all of the information, we can see the phone numbers and all of that type of detail as well. So it's very, very simple to do masking, and I'm just showing you a couple of the capabilities. We offer a lot more, including role-based access control and fine-grained access control at the row and column level as well. So we're going to go back into the Dremio user, which has access to everything. So I'm going to show you two more things here. One, let's take a look at the SQL Runner. One capability that we just added very recently: if we look at the SQL that I'm pulling up, so we do have querying ability, right?

Setting Up and Using Scripts With Dremio

So you can access all the data. And we also have a Scripts tab here, so you can create scripts, save them, and execute them. So I've got this script here. One of the capabilities we added was to format it very quickly. So you'll see here, by using a shortcut on my keyboard, Dremio automatically formats that. So that's a really, really nice feature. So now I can go ahead and save it, and if I were to come back into that, you can see here that it's formatted and saved like that. So that's a new feature that we've just enabled. Actually, before I do that, let me go back into the Dremio source. As you can see, it's one of the sources, and we can start to actually consume data from another Dremio environment.

So this one here is pointing to our Dremio.org environment, which is our demo environment. So I'm connected to that, and I can see all of the data. This is my data set in that environment. You can see here, I can start accessing that data. All right, so next I'm going to show you that it's not only easy to consume and curate the data from our Dremio UI. Now I'm going to go ahead and switch my screen to an IDE tool to show you that we can do the same thing from any IDE tool, like DbVisualizer or DBeaver or Toad, and so on and so forth. All right, so here we're looking at DbVisualizer. It's an IDE tool; you can use other similar tools. I've connected to the same Dremio environment that we're using for our demo as the shipping agent.

So by going into that shipping agent, you can see here this user only has access to that Application space. So that's also what's available in this tool. And then we were looking at the customer detail information. So if I come in here and look at the data, you can see here the masking rules have also been passed through to this tool. So it's not only the Dremio UI that restricts that, but any tool that's connecting as that user will also have that restriction. So in this case, this user only has access to the Application space and to these three virtual data sets, or VDSs, in Dremio, and that's what shows up here as well. Alright, the last thing we'll do for our demo today is take a look at the data lakehouse. A data lakehouse brings the data warehouse capability together with the economies of scale and low cost of a data lake, which allows you to actually do data manipulation on the data.

So I'm going to go into our SQL Runner, into Scripts, and I've got a few scripts saved here. So the first thing we'll do is go ahead and create an Iceberg table using our staging table here. So that's actually going to be created within our S3 bucket. So now I've got a table called Customers Iceberg, and it's storing this data in Iceberg format as Parquet files, and it's got the metadata here as well, right? So one of the options is to store the metadata within S3, and it can be many other sources as well. And then let's take a look at what we created, right? So we created this Customers Iceberg table, which you can see here is in Iceberg format, and it was just created. We can go ahead and do a quick select against that, right?
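For reference, the sequence of statements this part of the demo walks through looks roughly like the following; the S3 path, table name, column names, and sample values are illustrative assumptions:

```sql
-- Hedged sketch: paths, names, and values are illustrative.
-- Create an Iceberg table in S3 from the staging data set (CTAS).
CREATE TABLE S3.demo.customers_iceberg AS
SELECT * FROM Staging.Customers;

-- Query it like any other table.
SELECT * FROM S3.demo.customers_iceberg;

-- DML works directly against the Iceberg table.
INSERT INTO S3.demo.customers_iceberg (customer_id, company_name, city)
VALUES ('DREMIO1', 'Dremio', 'Santa Clara');

UPDATE S3.demo.customers_iceberg
SET city = 'San Jose'
WHERE customer_id = 'DREMIO1';

DELETE FROM S3.demo.customers_iceberg
WHERE customer_id = 'DREMIO1';

-- And finally, drop the table.
DROP TABLE S3.demo.customers_iceberg;
```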

Editing and Viewing Table Data

So, the Customers Iceberg table in S3, and we've got all the data that we've been looking at. We can go ahead and insert a row; so I'm just inserting one row here. And if we go back to the S3 bucket, you can see here more files are being created as we're making those changes. And once this data is inserted, we can go ahead and select again. So we'll see here that this is the row that I just added, right? So the customer ID, the Dremio name, and the city that I gave it, right? So that's now available in the Iceberg table. And similarly, the update works very much the same way as what we're seeing here. And then we'll go ahead and delete some data. So now, what's happening with deletes is that obviously these files are still kept, right?

But Dremio is managing all of that, and all of that data is being tracked in these metadata files and in the manifest files. And that's what makes the performance really good, because all of the older files will not be referenced in the metadata. So now we've gone ahead and deleted that row, and if we take a look at it, you'll see here that that row is gone now, right? And then we'll go ahead and drop that table. So you have the ability to not only create, but also drop the table. So you can see here, it's been dropped. If we come here and refresh, that Customers Iceberg table is no longer there in S3; there's no sign of it. So if we do a select against it, we're going to get an error that the table doesn't exist, right? Table not found. Yep. So that's it for our demo today. Thank you for attending. Hopefully you've learned about Dremio's capabilities and how we can help you modernize your Hadoop environment to a modern lakehouse engine like Dremio. Now I'll turn it back over to Tony. Thank you.

Tony Truong:

Thank you, Kamran, for the demo; I hope everybody enjoyed it. And as you can see, with Dremio, users that are on Hadoop can take a phased migration approach to modernize through the cloud. Yeah, and so if you want to try this at home, we do have the Dremio test drive. You can do this in an environment provisioned by Dremio; just go in and create a free account. You go in and do sub-second queries on a million rows of data across S3 and a Postgres database. So everything's hosted by Dremio. You can find out more and sign up at dremio.com/testdrive. And with that being said, thank you everyone for your time. I'll hand things back over to Alex before we do the Q&A.

Alex Merced:

Hey everybody, welcome back. Welcome back, welcome back. Now it is time for Q&A. So that means it's time for you to leave your questions in that Q&A box down below. So let's take a minute if you have any questions about anything that was presented this evening, we have the speakers here to answer your questions. So put those questions in that Q&A box below, and that way we can address them. But let's kick off with our first question. Let me just bring that up, okay? And again, remember, we'll be doing this generally every week here, and you can also subscribe to Gnarly Data Waves by subscribing on Spotify, iTunes, any way you listen to podcasts. Now for our first question, okay, what are my options to deploy Dremio if I have HDFS on-prem?

What Are The Options to Deploy Dremio On-prem With HDFS?

Kamran Hussain:

Yeah, so like we covered, right? Dremio allows you to not only consume data from object storage, but also from relational sources and existing Hadoop technologies, right? So the idea is that initially, when you deploy Dremio, you want to be able to consume the data without having to move it, right? So stage one, like we said, was just to implement Dremio as a query engine and connect back to HDFS or any other technologies. And hopefully you'll have some data in a data lake, right? So S3 or ADLS, or on-premises, right? A lot of our customers initially, several years ago, started to move off of Hadoop, and Dremio was the only choice, right? So Dremio made it very fast, and they still may have had some data in HDFS. So slowly now they're moving to cloud storage. But initially they moved from HDFS to something like MinIO or some other object storage on-premises, because they have a requirement to keep the data on-prem. So in summary, yes, you can have the data in HDFS and use Dremio to connect to HDFS, Hive, and relational sources like we showed you; now we even have a connector for Snowflake, and the Dremio-to-Dremio connector as well. So hopefully that answers the question.

Alex Merced:

Okay, I have another question. This webinar sums up what we are doing. My ETL pipelines are hard to keep up with because data is copied everywhere. Would I be able to join data in Postgres with object storage? I'm hoping this will reduce the amount of pipeline maintenance.

Can You Join Data in Postgres with Object Storage?

Kamran Hussain:

Yeah, that's the whole concept with a modern data platform. So I've been in the database world, the data warehouse world, for a few decades, and we've always been used to having tools like Informatica to move data from your OLTP environment to your data warehouse, and then build marts and have multiple pipelines that are doing copies and cubes and all of that, right? So the idea is, with so much more data, a high volume of data coming from everywhere and being stored everywhere, you don't want to have to make copies, because copies actually introduce challenges with data correctness and issues with the data, right? Because people make copies, they'll have their own set of copies, they do aggregations and use that in a report, and it doesn't match somebody else's, right?

So you want to be able to minimize the data pipelines and let Dremio connect to multiple sources, Postgres being one of them. So yes, you can very easily join data across different technologies, and it makes things very simple. And typically, if you look at a data virtualization tool, that tool usually has challenges because you're depending on the performance of the underlying sources. So Dremio's got some accelerators; we touched on reflections, for example, which make it extremely fast to consume data from other sources as well while you're joining.
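As a rough illustration of how a reflection is defined, a statement along these lines asks Dremio to maintain a pre-aggregated acceleration on a data set; the data set path, reflection name, and columns are illustrative assumptions, and the exact syntax can vary by Dremio version:

```sql
-- Hedged sketch: data set path, reflection name, and columns are illustrative.
-- An aggregate reflection pre-computes common GROUP BY results, so joins and
-- dashboards against this virtual data set can be served from the reflection.
ALTER DATASET Business."Category Info"
CREATE AGGREGATE REFLECTION agg_by_category
USING
  DIMENSIONS (category_name, order_date)
  MEASURES (quantity (SUM), unit_price (SUM));
```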

Alex Merced:

Awesome. Okay. And basically, okay, just another question. This one coming from me on some of the new features, because I know we just came out with that sort of a Dremio to Dremio connector, which really unlocks some pretty interesting possibilities. I don't know if you want to talk about any sort of interesting use cases or possibilities that, when you think about that Dremio to Dremio connector, come up.

What Are The Possibilities That Come With a Dremio to Dremio Connector?

Kamran Hussain:

Yeah, I can actually give you one that comes to mind. We have one customer that has seven different clusters of Dremio running; they're a large auto manufacturer. And a lot of the data, for example supply chain data, is specific to one business unit, but there are other data sets that are common across different business units, right? So for example, in manufacturing, one team may need supply chain data, and another team may need, for example, the aftermarket supplies and things like that. So there are certain data sets that are shared. So now with the Dremio-to-Dremio connector, if one of those clusters has the sources defined, you can simply just connect to that, right? So I think that's going to make deploying Dremio in enterprises, having separate clusters, and still being able to share data amongst the different clusters very, very easy and simple.

Alex Merced:

Awesome. Very cool. I think that looks to be all the questions. So first off, I want to say thank you very much, Kamran and Tony, for coming on the show today. Always great to have you on; that was an absolutely fabulous presentation. And then I also want to make sure to remind everybody: next week, on March 1st and March 2nd, it's Subsurface, it's that time of year again, okay? And it's going to be full of great presentations. Presentations like the one you saw today, but also presentations on Apache Iceberg, Apache Arrow, on Web3. Everything you'd care about in the data space will be covered over there at Subsurface. So make sure, if you're not registered already, to register, okay? And again, if you're listening to this after all this has happened, because you can also listen to this on Spotify, iTunes, YouTube, et cetera, still make sure to go check out all those recordings of those Subsurface presentations and be ready to register again for next year's Subsurface. But with that, again, thank you to Kamran and Tony. We'll see you all next time here on Gnarly Data Waves; again, same bat time, same bat place every week. And again, thank you very much. Thanks Alex, take care.
