May 2, 2024

Policy as Code: Automating Compliance with Data Mesh

Explore the application of Data Governance Policy as Code (PaC) in the context of Data Mesh, focusing on automating compliance for reliable and agile data management. This discussion examines PaC implementation strategies, optimal tooling, best practices, and real-world examples to highlight the benefits and address the challenges of automating compliance within a decentralized data architecture.

Topics Covered

Data Mesh and Fabric
DataOps and ELT/ETL
Governance and Management


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Shawn Kyzer:

Welcome, everyone, and welcome to my session, Policy as Code: Automating Compliance with Data Mesh. Before we dive into it, I’ll just tell you a little bit about myself. I am the Associate Director of Data Engineering here at AstraZeneca in Barcelona, Spain. We focus specifically on early science, so a lot of research and drug discovery. My partner and I, and two adorable pups, moved here from Washington, D.C., some years ago. I love to tinker with tech, explore different corners of the world, and travel. In fact, I’m going to Madagascar in a week or so. I love chatting at conferences and sharing useful information that hopefully you all will be able to apply in your day-to-day. And I am definitely a foodie. As you can see, I’ve taken my dogs and turned them into kind of 8-bit characters there at the bottom, Gigi and Fritz.

Great. All right. Let’s just dive right into it. The way this is going to work today is I’m going to start with a little bit of theory about Data Mesh and Policy as Code, and then we’ll get into some practical use cases. If we have enough time, this session is also a live coding session. Unfortunately, that typically takes an hour, but I will leave you with a GitHub repo so you can try out some of the things you’re going to learn today.

The Challenges of Data Governance and Compliance

Great. So one of the things that I noticed when I started implementing Data Mesh several years ago is that one of the major challenges is this notion of federated data governance. On my very first project, we had to set up a governance board from scratch, and there was always this disconnect between the creation of the policies and the implementation of the policies. There would be a written document, and then there would be the developers’ interpretation of what that written document actually meant. In real life, there are increasingly complex data landscapes with different data types and different sources, and there’s a changing regulatory environment. So updating these policies and ensuring they’re updated at the implementation level is a huge challenge. We have things right around the corner, such as AI regulation, especially at the EU level. And the cost of getting this wrong is high: reputational damage, or it’s just really, really expensive.

So we need a way to be agile, but also stay true to the federated computational governance philosophy that is part of Data Mesh. And this is what happens, right? There’s this disconnect between policy creation, enforcement, and implementation. If you ask the developers and engineers to implement governance policies that are written by data governance professionals, sometimes the interpretation is a little bit messy. So this is where Policy as Code comes in. The idea is that we can express rules and regulations as machine-readable code. We get automation and consistency, integration with CI/CD pipelines, improved accuracy, and reduced errors.

Elements of Policy as Code

So let’s talk about what Policy as Code is before we get into Data Mesh, and then we’ll talk about how the two fit together, along with some useful examples. This was actually inspired by a book called Policy as Code. You can find it on Safari; it’s an excellent book. I think it’s in the rough cuts. Essentially, the inputs are a policy, some data, and a query against the policy and the data. Then there’s a decision point: does it match? If it does, evaluate and validate the response and provide an outcome. If it doesn’t, issue a validation response and also provide an outcome.
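As a minimal sketch of that loop in Rego, OPA’s policy language (the package name, fields, and values below are illustrative assumptions, not taken from the talk):

    package example

    # Deny by default; the decision is "allow" only when a rule body matches.
    default allow = false

    # The query "data.example.allow" evaluates this rule against the input data.
    allow {
        input.method == "GET"
        input.resource == "reports"
        input.user.role == "analyst"
    }

You could evaluate this locally with something like opa eval -d policy.rego -i input.json "data.example.allow": the Rego module is the policy, the input JSON is the data, and data.example.allow is the query.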

So where can this be applied? Again, very theoretical right now: almost everywhere. You can use this in many different places, from APIs to access control to validating configuration files. It is really very versatile. If you just want to play around with some examples as we go along, you can hop on over to openpolicyagent.org, but there are also many other tools that are very Policy as Code oriented, and we’ll talk about those too.

Data Mesh and Data Products

So now let’s talk about Data Mesh. Data Mesh, as you probably already know if you’re here, is an architectural paradigm that embraces decentralized ownership and governance. And data products, which make up the data mesh, are self-contained, domain-oriented data assets that encapsulate data, metadata, and policies. The key thing there is that data products encapsulate the policies. That’s where Policy as Code is really a perfect fit, because the rest of a data product, we’ll discover, consists of things like orchestration pipelines, which are already expressed as code. So why not have the policy packaged as code with the data product, or as a checkpoint outside the data product? I’ll show you what that looks like.

So the principles of Data Mesh are distributed domain-driven architecture, data as a product, self-service data infrastructure, and federated data governance. The last two are where Policy as Code really shines, because self-service data infrastructure allows you to both create data products via self-service and consume from data products via self-service as a consumer. Both of those things can be intersected with policy validations: whether or not a data product is valid can be checked with Policy as Code, and whether a user has access to consume the data of a data product can also be represented as a validation or authorization step in Policy as Code.
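A hedged sketch of what that consumer-side authorization step might look like in Rego; the attribute names, classifications, and the data.products catalog document are illustrative assumptions:

    package datamesh.access

    # Hypothetical consumer-access check for a data product output port.
    # The product catalog (data.products) would be loaded alongside the policy.
    default allow_consume = false

    allow_consume {
        # Restricted products require a matching clearance and an approved purpose.
        product := data.products[input.product_id]
        product.classification == "restricted"
        input.consumer.clearance == "restricted"
        input.consumer.purpose == product.approved_purpose
    }

    allow_consume {
        # Public products are open to any authenticated consumer.
        data.products[input.product_id].classification == "public"
        input.consumer.authenticated == true
    }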

So this is about enabling the domain teams. We talked about the self-service capability: making sure that teams are able to produce and maintain their data products, and building checkpoints into the CI/CD pipelines that provision the infrastructure. You also want to put different checkpoints in there, such as Terratest, or you could use OPA with Conftest and similar tools as part of your CI/CD infrastructure pipeline.
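For example, a Conftest-style checkpoint might look something like the sketch below; assume the input is a Terraform plan exported as JSON, and treat the resource types and field paths as illustrative:

    package main

    # Conftest evaluates deny rules against the configuration under test.
    deny[msg] {
        rc := input.resource_changes[_]
        rc.type == "aws_s3_bucket"
        not rc.change.after.server_side_encryption_configuration
        msg := sprintf("S3 bucket %s must define server-side encryption", [rc.address])
    }

    deny[msg] {
        rc := input.resource_changes[_]
        rc.type == "aws_s3_bucket"
        not startswith(rc.change.after.bucket, "dataproduct-")
        msg := sprintf("S3 bucket %s must follow the dataproduct-* naming convention", [rc.address])
    }

Running something like conftest test plan.json in the pipeline would then fail the build on any violation.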

So Policy as Code in a Data Mesh architecture means decentralized governance with centralized standards. There’s a balance between autonomy and consistency across the organization: each domain team can take a policy as code and extend it, adding to it as needed. It’s computational governance. And if we have it as code, we know we can automate it across the mesh. For example, the different policies can live in a shared GitHub repo, so you can version them and make sure other people update the policies in their own data products when they want to bump a version. The policies are embedded in the data products as a single governance unit, and there’s lifecycle integration, so we’re able to enforce them at build, deployment, and runtime.

Collaboration, regardless of what type of governance you’re doing, is extremely important, because you do need buy-in from everyone. And you can prioritize if you want to implement this: don’t start by boiling the ocean and trying to do everything as Policy as Code. Instead, decide on the high-priority policies that you really care about and start with those. A lot of people start with security, so if they’re doing policy-based access control built on attribute-based access control, maybe that’s what you start with for the control ports that govern access to the data products. You can version-control policies for auditing and rollbacks, continuously test and monitor, and iterate and refine the policies based on feedback. Other people can even open pull requests against the policy as code, just like regular code.

Separation of Concerns and Policy Centralization

So a lot of people have come to me and said, “Why not just use a data orchestration tool for this? What is really the difference between using something like OPA and embedding the policy in Python code in a task in a DAG in an orchestration tool?” One thing I think is really important is that you do need to separate the concerns, and you do need to centralize the policies in some way. Even though Data Mesh is distributed by domains, there is also a shared understanding at the very top for federated computational governance, and there does need to be some centralized control there, just as there’s probably some shared centralized control at the base, at the platform level. If you’re able to centralize your policies as well, you can maintain that control. The policies become more modular, reusable, and maintainable, and it decouples the policy enforcement logic from the pipeline implementation. So even if you change something in a pipeline, your policy will still remain the same, and vice versa.

Here are some tools and technologies that you can use. I use OPA as an example because it’s open source, it’s very easy for everyone to access, and there’s a wonderful sandbox. But in real life we’re using different tools, such as Immuta, for example, for our policy engine. There are policy languages: Rego, which is inspired by Datalog, is the policy language for OPA, but you could just as well use JSON or the Cedar policy language, which I believe comes from Amazon, AWS, or even WebAssembly; you could consider that too. Then there’s infrastructure as code, and you can incorporate this in your policy as code: if you have certain encryption standards or certain ways of naming S3 buckets, these tools can help you write those rules as policy as code. And I also like to think that data quality is very important, because you might have a data governance policy that is all about data quality and which dimensions of data quality you measure. In order to measure those dimensions and codify them, you can use things like Great Expectations or Soda; I think we use Experian, and sometimes Talend Data Quality. So there’s a broad range of things you can use for data quality as well, which I think is very unique to the data space in terms of Policy as Code.
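As a hedged illustration of codifying a quality dimension, here is a sketch that checks a quality report (say, a summary exported from a tool like Great Expectations or Soda) against governance thresholds; the metric names and thresholds are assumptions:

    package quality

    # Fail the check when any measured dimension misses its threshold.
    default meets_policy = false

    violations[msg] {
        input.metrics.completeness < 0.95
        msg := sprintf("completeness %v is below the required 0.95", [input.metrics.completeness])
    }

    violations[msg] {
        input.metrics.freshness_hours > 24
        msg := sprintf("data is %v hours old; the policy allows 24", [input.metrics.freshness_hours])
    }

    meets_policy {
        count(violations) == 0
    }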

Practical Use Cases

So I see a really great question in the channel, and it’s perfect timing to talk about this. It asks: is OPA simple and intuitive enough for non-technical data stewards to write policies, or is there a UI layer on top that can help non-technical data product owners or stewards write policies using OPA? OPA out of the box does not, to my knowledge, come with a UI. You can use certain LLMs to help facilitate writing these policies, but if you don’t already know the language, it’s hard to decide whether the output is doing the right thing; you have to do some tests. However, I will say that Immuta does have very intuitive, what-you-see-is-what-you-get ways of writing policies. Of course, you do have to pay for it, but it is very visual and very low code, no code. So it’s definitely worth looking into if a friendly UI is going to make more sense for your organization.

So the first use case here is validating the self-service data product configuration. One of the things we discovered when we first started creating data products was that data products all share the same patterns, and therefore they can be configured: data products as code, or data products as configuration in this case. When we did a pull request to create a new data product, which then provisioned the infrastructure and created the GitHub repo, there were some basic validations that needed to happen against the configuration, and those were just much easier to have a machine do. So upon a pull request, we execute a CI/CD process, utilize OPA for configuration validation, and display the response. This is an example of a data product: maybe you have an input port, which is the SA results stream, and some output ports, such as a GraphQL API, plus some general descriptions. As you can see here, we also have the data product policies, such as data retention and IP protection, with some expressions there as well. And you can see we have the OPA server URL, so this is the server it knows to hit to validate the data product, or rather to validate the control port in the data product.

And this is what that policy as code would look like. This one’s not really polished, but it’s OK; it’s just to give you a little taste of how you would implement this in OPA. You would allow certain things where, say, a policy exists whose name is data retention, and you might also validate certain parts of the URL and different things about the user, the method, and the resource. Or, if there is no policy, that would be your response. So here we have result true with no errors, or result false saying there’s a missing or invalid data product policy, or invalid policy enforcement configuration. This is something you just want to happen on the fly; you don’t want humans to constantly have to validate this. And at the data governance level, if the elements of what makes a good data product change, you want to catch them in real time, as soon as you commit or open a pull request for the data product, before it gets provisioned as infrastructure.
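A hedged reconstruction of that validation in Rego; the required policy names, descriptor fields, and error strings are modeled on the slide’s description, not copied from the actual code:

    package dataproduct.validation

    # Policies every data product descriptor must declare (illustrative).
    required_policies := {"data_retention", "ip_protection"}

    has_policy(name) {
        input.data_product.policies[_].name == name
    }

    errors[msg] {
        name := required_policies[_]
        not has_policy(name)
        msg := sprintf("missing or invalid data product policy: %s", [name])
    }

    errors["invalid policy enforcement configuration"] {
        # The control port must point at an OPA server over HTTPS.
        not startswith(input.data_product.control_port.opa_server_url, "https://")
    }

    # Decision document returned to the CI/CD pipeline.
    result = {"result": true, "errors": []} {
        count(errors) == 0
    }

    result = {"result": false, "errors": errors} {
        count(errors) > 0
    }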

So the second use case is validating the security policies within data products. This is just ensuring that the security policies for the output ports are configured correctly. The previous use case was the actual self-service platform for creating a new data product, and the next thing we’re getting into is actually validating security policies. This is from the Rego playground, and you can see here, just as a quick example, attribute-based access control. You can also do policy-based access control, which I recommend: a policy can be many attributes, and it’s almost like a nested doll, you can just keep going with this.
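In the same spirit as the playground example, here is a hedged sketch of policy-based access control built from attributes; all the policy and attribute names are illustrative:

    package dataproduct.abac

    default allow = false

    allow {
        clinical_read_access
    }

    # One named "policy" bundling several attribute checks plus a nested policy.
    clinical_read_access {
        input.user.department == "early-science"
        input.user.training["gdpr"] == true
        baseline_access   # nested policy: the "nested doll" pattern
    }

    baseline_access {
        input.user.employment_status == "active"
        input.request.action == "read"
    }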

AI Oversight and the Role of PaC

And finally, right now we need to be doing AI oversight. One of the things everyone’s looking into now is whether we have other LLMs that monitor the responses or the prompts. Next year, will I be talking about policies as prompts instead of policies as code? I don’t know. That’s something we really need to think about, because it will be impossible for us to manually think of all the different permutations that are valid or invalid coming out of artificial intelligence, especially generative AI. So we really need to think about how we’re going to audit that through code, or whatever mechanism, through agents.

The Challenge

Very quickly, this was the code challenge. I’ll move through this really fast so that you all have it. This is policy-based access control. It looks really complex, but there is actually what I would call a simple answer in code that you can use to enforce such a thing. You can hop onto my GitHub and play around with some of the examples; I can probably put that in the channel, and maybe you’re able to click on this, I’m not sure. I would also like to invite you, if you’re interested, to my talk at Biotech X in Switzerland. We’ll be talking all about advanced data engineering and AI reshaping ingestion pipelines in early science: some of the fine-tuned models we’ve used, some of the prompt engineering, and the different things we’re leveraging in Data Mesh for data pipelines, specifically how we’re observing different ways to leverage these models so that we can easily query or chat with our data products. And also, we are hiring in Barcelona. So if you’re interested in data science, data engineering, or, because it’s AstraZeneca, even science careers, please feel free to use this QR code, visit us at careers.AstraZeneca.com, or find me on LinkedIn, and I’m super happy to chat about what we have there.
