May 2, 2024

A Case Study of Quebec Blue Cross Achieving Full CI/CD in Dremio with DBT

Deploying virtual data sets (VDS) to Dremio manually by copying and pasting can be a tedious and error-prone process. It can compromise the quality and consistency of the data, and waste valuable time and resources. How can we automate and streamline the deployment of VDS to Dremio using best practices and tools?

In this session, we will share how Quebec Blue Cross leveraged Bitbucket, Airflow, and DBT to achieve full CI/CD in Dremio. We will demonstrate the main components of the solution, and how they work together to enable fast and reliable deployment of VDS to Dremio. We will also provide some tips and lessons learned from our experience, and how you can apply them to your organization.

Topics Covered

Dremio Use Cases


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Alain Barrette:

I am Alain Barrette, and some of you may call me an old-timer. In fact, I date back to the IBM XT, which was introduced in 1983. Besides my role as an architect at Quebec Blue Cross, I also act as a data engineer. My other duties include absolutely everything, from infrastructure and monitoring to performance and cost control. I really love bugs: hunting them down and taking on the challenge of destroying them is quite fun for me. 

What is Quebec Blue Cross?

But, of course, many of you are probably asking: what is Quebec Blue Cross? At its core, Quebec Blue Cross is a non-profit travel insurance organization. We try to provide the best possible services to our customers, in Canada and around the world, during and after their travel adventures. Just imagine working for an organization that, from time to time, saves the lives of its customers. But how do we do that? Let’s see. We take care of you wherever you are; that’s our mantra at Blue Cross. Just imagine you’re on vacation at an all-inclusive resort with your two kids. That’s nice. You’re relaxing on the beach with a drink when suddenly a big beach umbrella falls onto your head. That results in a broken neck, with two small kids beside you. This is a very painful situation, of course. But since the hospital in the country you are in is subpar, as we call it, we evacuate you by air ambulance to the U.S., where you can get a proper medical follow-up. We also include a transport helper to take care of you, because you’re lying down in the ambulance and your kids need to be taken care of. Things like that. That was the physical part. 

Now, let’s see what we can do for a customer on the mental side. Your daughter is traveling in Nepal. When the bus she is riding falls into a river, she loses everything: all of her baggage, passport, papers, money, everything. She gets bruises everywhere. We take care of her by transporting her to a much better hospital, in Nepal this time. We also take care of her by arranging things with the Canadian government for the retrieval of her passport. But we also take care of you, as a mother or father, because you are quite stressed having your daughter in such a situation in Nepal. That’s the kind of thing we do. Those situations may not make the news, but they occur far too frequently. The two examples I just gave happened in the last two months, and I have many more to share. In summary, we take care of you wherever you are. 

In this session today, we will be presenting the technical architecture of what we have. The first thing to say is that we are a very small data analytics team at Quebec Blue Cross: only six people plus a DevOps. And we have a lot of objects to maintain. We are adding more and more objects each week to address new use cases, planned projects, unplanned projects, strategic initiatives, things like that. The project I’m presenting here took us about 18 months to complete. So the most important thing, I think, is that there is no one-size-fits-all. All the tools you will see, we use each of them for one and only one purpose. We use the core feature that best matches our use case and that we can manage with such a small team. We prefer to restrict each tool to what it does best rather than treating any of them as a one-size-fits-all solution. In this session, I will cover the details of how we achieved full CI/CD with just a small team, by presenting the overall architecture and a recap of the main components, including Airflow, DBT, and Dremio. 

DataOps Vision

And this is our DataOps vision. There is a small “as of now” because, well, since the beginning of computing there has been a lot of change. Change is permanent; it’s the best thing that can happen. Twelve months ago, we did not have this. Today, we’re there. Tomorrow, who knows? With the advent of AI, we will be including AI in the transformation and use of data. So let’s start on the left with the data producers. We have Oracle databases, SQL databases, things like that, the old-school databases, if we can say that. We also have REST APIs, secure FTP, SharePoint sites, Salesforce, things like that. This is where we get the data from. 

Next, we have the ingestion engine. This is how we get the data from the data sources. We are currently using two things. We use Qlik Data Integration to incrementally capture data, whether from Salesforce, Oracle, or SQL Server. And we also use custom Python code running directly in Airflow to capture data from FTP and from SharePoint, in and out. For storage, we use Azure. Our main cloud provider is Azure, so we store everything in blob storage. And we use Databricks in between to transform the data. We use Python, Spark, and PySpark to transform the data and write it back to the lake in the standard bronze, silver, and gold layers. 
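
To give an idea of what that custom Python ingestion path can look like, here is a minimal sketch of an Airflow DAG that pulls one file from an SFTP server and lands it in the bronze zone of Azure Blob Storage. The host names, credentials, container, and file paths are purely illustrative assumptions, not the production code; in practice the secrets would come from the vault.

```python
# Hypothetical sketch of the "custom Python in Airflow" ingestion path:
# pull one file from a partner SFTP server and land it in the bronze zone
# of Azure Blob Storage. All names and paths are illustrative.
from datetime import datetime

import paramiko
from airflow.decorators import dag, task
from azure.storage.blob import BlobServiceClient


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def sftp_to_bronze():
    @task
    def land_file(remote_path: str = "/outbound/claims.csv") -> None:
        # 1. Fetch the file from the SFTP server (credentials would come from the vault).
        transport = paramiko.Transport(("sftp.example.com", 22))
        transport.connect(username="svc_ingest", password="***")
        sftp = paramiko.SFTPClient.from_transport(transport)
        local_path = "/tmp/claims.csv"
        sftp.get(remote_path, local_path)
        sftp.close()
        transport.close()

        # 2. Land the file as-is in the bronze container of the lake.
        blobs = BlobServiceClient.from_connection_string("<azure-storage-connection-string>")
        blob = blobs.get_blob_client(container="bronze", blob="claims/claims.csv")
        with open(local_path, "rb") as fh:
            blob.upload_blob(fh, overwrite=True)

    land_file()


sftp_to_bronze()
```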

In between, we expose those datasets in the lake through Dremio, as virtual datasets deployed using DBT Core. Everything runs on premises in that case. Then we do a bit of testing using DBT and Dremio. At the end, we have our consumers, who query the data using Power BI. The idea is to limit the use of each tool to its primary function, the one we deem fit for our organization and for our very small team. This keeps the number of tools to a minimum, and that way we can maintain and evolve the infrastructure without too much pain. Let’s now dive into the details of this marvelous infrastructure. 

Elements of the Infrastructure

So, the elements of the infrastructure. There are a lot of other components not presented here or on the previous slide, but basically we run everything in Azure using Kubernetes. Infrastructure as code is one of our goals. We use a vault to store sensitive information, passwords, configuration files, things like that. We use Power BI Desktop and Bitbucket. We use Datadog to monitor the infrastructure and the execution of the pipelines, those kinds of things. In the middle, we have Airflow, which is our orchestrator. Even though scheduling is available in Qlik Data Integration, we don’t use it; we start the pipelines from Airflow. Even though there is a scheduler available in Databricks, we don’t use those features either. We strictly use one feature per component. This greatly helps us achieve what we need for our customers. 

After Airflow, we have Qlik Data Integration, which ingests data from Oracle, Salesforce, and other applications. Again, Qlik is capable of scheduling, and Qlik is capable of transforming data on the fly using SQL statements, but we don’t use that. We only use Qlik as point-to-point transport. Quite frankly, it has been running for the last 12 months without much of an issue. An out-of-memory error from time to time, but that’s about it. That brings us to the other components in the list. On the left, we have Azure Databricks. It’s integrated with the Azure cloud and offers us great on-demand infrastructure. They are making progress toward serverless configurations, so things will speed up and be a bit less costly. We have continuous integration and delivery through Airflow, Bitbucket, things like that. This is the transformation engine. We don’t schedule anything there; we call everything exclusively from Airflow. And that’s about it. 
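
As an illustration of that “one feature per component” principle, this is roughly what calling a Databricks transformation from Airflow can look like, so that no scheduling lives in Databricks itself. It is only a sketch: the connection ID and job ID are placeholders, not our actual configuration.

```python
# Minimal sketch: Airflow owns the schedule, Databricks only runs the job it is told to.
# The connection ID and job ID are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="silver_transformations",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Trigger an existing Databricks job (the bronze-to-silver transformation)
    # instead of letting Databricks schedule it on its own.
    run_transform = DatabricksRunNowOperator(
        task_id="run_bronze_to_silver",
        databricks_conn_id="databricks_default",
        job_id=12345,
    )
```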

In the middle, we have DBT. We use DBT Core to define the main project; that’s what we are talking about today. We create the VDS in Dremio and test them, and once we’ve got the correct result, we simply transform the syntax into DBT and deploy them to dev, QA, and production. This is our SQL deploy engine. And last, but not least, we have Dremio, which acts as a massively parallel SQL server on top of the data in the lake. This is where our customers query the data, mostly from Power BI but also a bit from the Dremio UI, where they do self-service research and investigation. That’s the main component of the solution here. 
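
To make the “SQL deploy engine” idea concrete, here is a minimal sketch of how a DBT Core deployment to a given target could be invoked from Airflow with a plain BashOperator. The project path and target names are assumptions for illustration, not our exact setup.

```python
# Sketch: deploying the VDS definitions to an environment is just a dbt run
# against the corresponding target. Paths and target names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="deploy_vds_to_qa",
    schedule=None,          # triggered on demand, not on a fixed schedule
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    dbt_run_qa = BashOperator(
        task_id="dbt_run_qa",
        # 'qa' is one of the targets (dev, qa, prod) defined in profiles.yml
        bash_command="dbt run --project-dir /opt/dbt/dremio_project --target qa",
    )
```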

Quebec Blue Cross Wish List

Next, okay, quite nice, but why did we want to do that? Did I mention we are a very small team and that we need to automate things? We need to bridge the gap between data tasks, software engineering, DevOps, DevSecOps, GitOps, whatever the ops is, including human ops from time to time. We want to enable CI/CD everywhere. We’re not there yet, but for DBT and for Dremio, for the VDS, we are there. And we also want to be ready to implement a data catalog when we get to that point. 

Why? Well, we need to increase the quality of the data projects we launch. We need to make sure that they are of real quality and that they provide the correct information to the end users, the consumers. We need to decrease time to market. We need to enforce security. We need to support the decision-making process for our stakeholders: we want to help the organization make the best decisions on the freshest datasets it can have. And we need to secure everything, from ingestion to distribution. Let’s now see how we did that. 

DBT. I don’t know if all of you are aware of what it is as an engine, but for us, it’s our savior. Initially, we were deploying hundreds and hundreds of SQL statements by hand. Three years ago, the team was going crazy doing deployments like that, because cut and paste is not the way to do things. Absolutely not. But at that time, DBT was not ready for us, or we were not ready for DBT. Since then, it has helped us immensely. 

Airflow: DataOps Master

So, our first implementation, through Airflow. We have our DataOps master presented here. Airflow is driving the show, literally. We have a DAG, the DataOps master, that does lots of stuff. In the beginning, we added a backup of the current objects before the deploy, just in case. Well, it served a purpose, but in the last 12 months we have not used it anymore, so the backup is still there, just in case. We synchronize the definitions with Bitbucket, and then we run a simple DBT run. We synchronize groups and users with Azure Active Directory. We validate the physical datasets and the virtual datasets by doing a SELECT * LIMIT 1 on every object in Dremio. That assures us that the physical datasets are valid and that the virtual datasets are also valid. The only step that is still a bit manual is that if the result is greater than 95%, we are good to go with the deployment. But we are working on automating that, too. 
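
The validation step itself boils down to firing a cheap probe query at every object. Here is a rough sketch of that logic, assuming an ODBC connection to Dremio and a pre-built list of object paths (in reality the list would come from the catalog or the DBT manifest); the DSN, credentials, and object names are illustrative.

```python
# Sketch of the PDS/VDS validation idea: run SELECT * ... LIMIT 1 against every
# object and require a success rate above 95% before the deployment is accepted.
# The DSN, credentials, and object list are illustrative placeholders.
import pyodbc

objects_to_check = [
    '"lake"."silver"."claims"',
    '"business"."travel"."active_policies"',
]

conn = pyodbc.connect("DSN=Dremio;UID=svc_validation;PWD=***", autocommit=True)
cursor = conn.cursor()

ok = 0
for obj in objects_to_check:
    try:
        # A LIMIT 1 probe is enough to prove the dataset resolves and is queryable.
        cursor.execute(f"SELECT * FROM {obj} LIMIT 1")
        cursor.fetchone()
        ok += 1
    except pyodbc.Error as exc:
        print(f"validation failed for {obj}: {exc}")

success_rate = ok / len(objects_to_check)
print(f"validation success rate: {success_rate:.0%}")
if success_rate < 0.95:
    raise RuntimeError("less than 95% of objects are valid, aborting the deployment")
```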

Let’s now see a sample of a DBT run against our target QA environment. We can see that we have about 1,200 objects, and that it simply does a start, a run, and a finish in about 4 minutes and 7 seconds, which is astonishing compared to the cut-and-paste method. But how does that show up in Dremio? At this stage, in the upper left, we see a bunch of jobs that have been submitted by the DBT run launched through Airflow. We see the submitted jobs, and we see a very simple definition of a SQL statement. At the start, we did not add anything to the VDS definitions by default, because we did not have many users generating their own SQL in Dremio. But as time passed, we encountered situations where we needed to distinguish the objects under our DBT control from the objects generated by users. So we simply added a “managed by DBT” comment to the VDS definitions by default, and then we are good to go. We can tell end-user objects from our objects, and change analysis is simplified by doing that. 

Here we see the PDS validation step that we run in Airflow. In July last year, we had only 200 PDS under control, and we are now at 1,200. That’s a testament to how effective this way of doing things is. We could not have such growth if we were still deploying manually. Airflow also does VDS validation, and the same goes there: we had about 950 VDS in July last year, and we now have 1,400 VDS in Dremio. Again, we are a team of six plus a DevOps: two data engineers, two BI analysts, a data analyst, and a product owner for the whole organization. We are deploying dozens of objects every week, fully validated, fully tested, and that’s quite interesting. 

Now, for DBT, what we like to do is use the simplest features of the project. What we have on the left is a recap of the project definition: it’s simply a list of the different folders and the definitions of the objects. In the middle, we have a YAML file with the schema definition for a directory under that project. Again, we have a very simple “managed by DBT” object: a SQL statement that corresponds to the dbt-dremio syntax. This is the simplest expression of DBT that we could use. We don’t do fancy stuff anywhere; we don’t have enough people on the team for that. We keep it simple, and it’s working great. 

The Road So Far

So, the road so far: we deploy 1,228 VDS in four minutes. Okay, nice. Deploying this by cut and paste was a nightmare; I was going crazy trying to find a solution for the team. Lots of PDS and VDS are automagically validated through Airflow. There has been a very big increase in deployment quality using Airflow and the DBT project. Our users are much, much happier thanks to valid objects in Dremio. Our data analytics team is also much happier because we automated a lot of stuff. Happy end users, happy data teams, fewer manual operations. Hey, quite nice. But, as an architect, I’m not so happy yet. Why? Because everything is still done in Airflow, and the goal is to remove a lot of that and give power to our engineers and our BI analyst team, so they can deploy stuff directly from Git. What they do is develop and test their SQL statements in Dremio, transform them to DBT syntax, and test that again and again in the dev environment. Once they are confident that everything is good to go, they simply create a pull request and merge into develop. The merge into develop simply triggers a compile in source control to make sure that, syntactically, everything is okay. Then, when everything is okay, they create a pull request to deploy from develop to QA, and that’s where we run the DBT run that deploys to QA and then to production. So, a team of two can simply develop something, open a PR, get it approved by the other person on the team, and they are good to go. It’s a somewhat shortened process, but now I am quite happy. 
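
A hedged sketch of what that Git-driven flow can look like on the CI side: a small script the pipeline could run, compiling on a merge to develop and only running DBT against QA or production on the corresponding deployment branches. The branch names, paths, and environment variable are assumptions, not our exact Bitbucket configuration.

```python
# Sketch of the branch-driven gate described above: merges to develop only get a
# syntax check (dbt compile); deployment branches trigger a dbt run against the
# matching target. Branch names, paths, and the env var are illustrative.
import os
import subprocess
import sys

PROJECT_DIR = "/opt/dbt/dremio_project"
TARGET_BY_BRANCH = {"develop": None, "deploy-qa": "qa", "deploy-prod": "prod"}

branch = os.environ.get("CI_BRANCH", "develop")
target = TARGET_BY_BRANCH.get(branch)

if target is None:
    # Pull request merged into develop: just make sure everything compiles.
    cmd = ["dbt", "compile", "--project-dir", PROJECT_DIR]
else:
    # Pull request from develop to a deployment branch: actually deploy the VDS.
    cmd = ["dbt", "run", "--project-dir", PROJECT_DIR, "--target", target]

result = subprocess.run(cmd)
sys.exit(result.returncode)
```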

Best Practices and Tips

Let’s now look at best practices, what we learned, and some tips from the whole project. The first thing is: update your stuff. Update it. Why? Look at the few lines shown in black: before, a DBT run for the project took about 3 hours, and we did not have a lot of objects at that time. We found the bug, we asked the dbt-dremio maintainers to correct it, they applied the fix, and now the same run takes 4 minutes with 4 to 5 times the number of objects. Salesforce is exactly the same: by upgrading Airflow from 2.8.3 to 2.9.0, which is quite fresh, we never hit the Salesforce report limit of 500 per hour anymore. 

Before, we had to throttle the code because we were hitting those limits; now we don’t, just by updating things. Of course, you don’t update blindly. You need to read the changelogs and the new features, because we need to embrace change and evolve quickly. If we don’t read them, we don’t learn about new ways of doing things and we don’t know what’s coming down the pipeline. AI is a great example: if you stay blind, you won’t see it coming. And don’t be afraid to fail. Fail fast, learn fast. Maybe not in production, but fail fast, learn fast. That’s the best way to learn and improve rapidly. And, last but not least, everything runs on Kubernetes and Azure, so get the best DevOps you can. Buy them a beer if you need to. But make friends with the best DevOps you can have, because without DevOps, you won’t have anything running there. So, that’s about it for today. Thank you.