Summary
A significant portion of the time spent by data engineering teams goes to managing the workflows and operations of their pipelines. DataOps has arisen as a set of practices, parallel to those of DevOps teams, aimed at reducing that wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, while also providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Tevje Olin about Agile Data Engine, a platform that combines data modeling, transformations, continuous delivery and workload orchestration to help you manage your data products and the whole lifecycle of your warehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Agile Data Engine is and the story behind it?
- What are some of the tools and architectures that an organization might be able to replace with Agile Data Engine?
- How does the unified experience of Agile Data Engine change the way that teams think about the lifecycle of their data?
- What are some of the types of experiments that are enabled by reduced operational overhead?
- What does CI/CD look like for a data warehouse?
- How is it different from CI/CD for software applications?
- Can you describe how Agile Data Engine is architected?
- How have the design and goals of the system changed since you first started working on it?
- What are the components that you needed to develop in-house to enable your platform goals?
- What are the changes in the broader data ecosystem that have had the most influence on your product goals and customer adoption?
- Can you describe the workflow for a team that is using Agile Data Engine to power their business analytics?
- What are some of the insights that you generate to help your customers understand how to improve their processes or identify new opportunities?
- In your "about" page it mentions the unique approaches that you take for warehouse automation. How do your practices differ from the rest of the industry?
- How have changes in the adoption/implementation of ML and AI impacted the ways that your customers exercise your platform?
- What are the most interesting, innovative, or unexpected ways that you have seen the Agile Data Engine platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Agile Data Engine?
- When is Agile Data Engine the wrong choice?
- What do you have planned for the future of Agile Data Engine?
Guest Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
About Agile Data Engine
Agile Data Engine unlocks the potential of your data to drive business value - in a rapidly changing world.
Agile Data Engine is a DataOps Management platform for designing, deploying, operating and managing data products, and managing the whole lifecycle of a data warehouse. It combines data modeling, transformations, continuous delivery and workload orchestration into the same platform.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state-of-the-art pipelines to collect all of your data, build a complete view of your customer, and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudderstack today. Your host is Tobias Macey, and today I'm interviewing Tevje Olin about Agile Data Engine, a platform that combines data modeling, transformations, continuous delivery, and workload orchestration to help you manage your data products and the whole lifecycle of your warehouse. So, can you start by introducing yourself?
[00:00:50] Unknown:
Yes. Hi, everyone. I'm Tevje Olin, a solutions architect at Agile Data Engine, the DataOps company. Thank you for having me here, Tobias, and greetings to all the listeners.
[00:01:02] Unknown:
And do you remember how you first got started working in data?
[00:01:06] Unknown:
That's a tricky question. If I remember correctly, it was when I was studying computer science at university, and I kind of, by accident, ended up working at an IT consultancy that was mainly focused on BI. So I started working as a BI and data warehouse consultant back in 2010. So I have my background in traditional data warehousing, and I think back then one of the biggest questions was whether to go Inmon or Kimball style. And I think it was somewhere around 2015 when I got more involved in public cloud data solutions, and I've been working with cloud data platforms mostly ever since, as a data engineer or data architect, before jumping full time into the product business with Agile Data Engine.
[00:02:01] Unknown:
And in terms of Agile Data Engine itself, can you give a bit of an overview about what it is that you're building there, some of the story behind how it came to be, and your interest in working in that space?
[00:02:14] Unknown:
Yeah, sure. Well, Agile Data Engine, how I would describe it, is basically a cloud-native DataOps platform. What that means is that we combine the tooling aspects of DataOps under one look and feel: a user interface and some back-end services. I have to be honest here: I don't believe that you can achieve the benefits of DataOps just by bringing in tools. The human and methodological factors play the most important role, but the correct tools can enable the transition or even make it tremendously more efficient.
Agile Data Engine combines data modeling and transformations with version control, CI/CD, orchestration, and monitoring. We also have test automation and data lineage out of the box. And we want to offer something to manage and improve data teams as well. So, basically, we have team-level reporting based on the development metadata, and the possibility to get benchmarks against similar-sized data teams, with performance insights on the development and operating costs of your data products. I think the story started back around 2015 or 2016.
It didn't really take that long to realize that when we moved to the public cloud, the ecosystems and environments where we were working changed drastically. Even though it's quite easy to take new cloud services into use, or maybe because of that, the architecture and the management of development started to become more complex. Development, and especially changing something that was already in production, became person-dependent. Also, back then there were no schema comparison tools for AWS Redshift or Snowflake, for example, and there was no support for procedural SQL.
This, and also the fact that we had quite a stack of self-developed scripts for automating parts of the data warehouse development in different environments, led to the conclusion that we needed something to manage all of it. So Agile Data Engine was basically born around 2016, when we got a really, really talented guy, Harry Kallio, to lead the product development. He had long experience in product development on analytical databases from the telco world with big data, and he got us talking about DevOps for data already back then. We shifted to the DataOps term.
I think it was in, like, 2017. We got our first customer in January 2017, running on AWS and Redshift. Although our product has been built on cloud from the beginning, it wasn't a SaaS product at first; it ran on the customers' own cloud accounts. Then we had the Azure version published in 2018 and started the SaaS program in 2019. Currently, we have around 40 customers in the Nordics, and we're expanding, hopefully, quite fast.
[00:05:39] Unknown:
And so in terms of the overall space of DataOps, that is an industry term that has been repurposed by a number of different vendors and tools, and everybody has a different conception of what it really means to them. And I'm wondering if you can just talk to some of the categories of tools that typically fall under that DataOps umbrella and some of the ways that Agile Data Engine might overlap with or supplant some of those different tools or categories of operational aspects of working with data? Yeah. Well, first of all,
[00:06:16] Unknown:
I have to briefly define DataOps, because there are so many different definitions for it. Personally, I see DataOps as basically DevOps for data development. Of course, it means slightly different things in the data world, because you need to understand the different principles and the different things impacting data development compared to software development. How do you say? In software development, you're developing the functionality of the business applications that actually create the data. In data development, we are constantly doing development with changing data derived from those actual business applications.
So we're always downstream-dependent on the business applications, and that's, I would say, one of the biggest differences in data development compared to software development. So the skill set for working on data is, of course, because of that, a bit different than for traditional software development and so forth. But in my opinion, if we look at the tool stack in data development, there's, of course, the ELT or ETL or whatever you want to call it. In that sense, Agile Data Engine does the load and transform. We don't do the extraction.
Customers usually have several technologies for extracting and loading the data from their sources to their cloud storage, and we catch the data from there. Then there is, of course, the common denominator for DataOps and DevOps, which is CI/CD, and we have fully automated, automatically generated pipelines for that. And the data-specific development thing is that in Agile Data Engine we have full schema control, a full schema management and versioning tool, out of the box. What it does is that when we deploy something from Agile Data Engine, it resolves the state of the database at deployment time, just in time.
When we're working with data, we don't want to do the deployments serially when we change, for example, a data type in a production table column, because we don't want to interfere with the production loads that much. So we need to be smooth and make only the necessary changes in production. Of course, then we have the load orchestration and data modeling there as well. So I would say that these are the kinds of tools in the architecture you can be thinking of replacing with Agile Data Engine.
So, basically, we manage the data warehouse development, deployment, and orchestration,
[00:09:35] Unknown:
and things related to that. And for the overall space of operational concerns around data in a warehouse context, there are a number of different ways that teams might be managing the data ingest and managing the different data transformations. I'm wondering if you can talk to some of the opinionated architectural aspects of Agile Data Engine, some of the tools that might be replaced by teams who are moving into Agile Data Engine, and some of the things they might be using before they adopt your solution?
[00:10:07] Unknown:
Well, basically, how Agile Data Engine works is that the SaaS service pushes down all the transformations to the target database. All the transformations are done with SQL, so anything you can run SQL on can work with Agile Data Engine. What you need is Agile Data Engine and your SQL platform. We have support for Snowflake, AWS Redshift, Azure Synapse, Azure SQL Database, and BigQuery, and then we have support for Databricks coming, under construction, as well.
[00:10:57] Unknown:
And in terms of the data ingest and data transformation tooling, it seems that Agile Data Engine aims to be that all-in-one solution. So for teams who might be using a Fivetran or an Airbyte, and then using something like dbt or custom SQL scripts to manage those transformations in that workflow, is that something where Agile Data Engine will own the entire workflow?
[00:11:22] Unknown:
Mhmm. Maybe with Fivetran not completely; with dbt, I would say yes. It depends a bit, because what we don't do is extract from the source databases. So we don't have that part of the pipeline. That was actually one decision we made when we moved to being a SaaS vendor: all the customers have their own things going on for how they extract the data, or get the data, or push the data from their source systems or on-premise systems to the public cloud. There were several different methods and ways of doing that, and there are endless possibilities for implementing different kinds of connectors to different source systems.
But one thing we wanted when going to a SaaS service was that we didn't want to move any of the customer data through our service. So, basically, when we do something, we push the transformations, push the copy commands, push the loads to the target database. The target database handles everything. That's the one thing: it's a metadata-driven approach that pushes everything to the target database, and the target database does the workload. The beauty of that is, of course, that we can be a SaaS service, we can scale, but we don't have to move any customer data through our SaaS service.
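To picture that pushdown pattern, here is a minimal sketch of a load whose compute happens entirely inside the warehouse: the orchestrating service only renders and submits SQL, so no customer rows ever pass through it. All of the names are hypothetical, and the COPY syntax assumes a Snowflake-style external stage; this is a generic illustration of the approach, not Agile Data Engine's actual implementation.

```python
# A minimal sketch of SQL pushdown: the service generates and submits SQL,
# and the warehouse itself moves the data, so no customer rows transit the
# orchestrating service. Names and syntax are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class LoadDefinition:
    target_table: str   # fully qualified target entity
    stage_path: str     # external stage holding the source files
    file_format: str    # named file format registered in the warehouse

def render_copy(load: LoadDefinition) -> str:
    # The warehouse reads the files from cloud storage directly.
    return (
        f"COPY INTO {load.target_table} "
        f"FROM {load.stage_path} "
        f"FILE_FORMAT = (FORMAT_NAME = {load.file_format})"
    )

def run_load(conn, load: LoadDefinition) -> None:
    # `conn` is any database connection whose cursor supports the
    # context-manager protocol (most modern drivers do).
    with conn.cursor() as cur:
        cur.execute(render_copy(load))

# Example (hypothetical connection and objects):
# run_load(warehouse_conn,
#          LoadDefinition("analytics.stg_orders", "@raw_stage/orders/", "csv_fmt"))
```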
[00:13:13] Unknown:
For teams who have a complex operational workload for managing the lifecycle of their data, it can end up taking all of their time just to do the care and feeding of the system, so that the time that they're supposed to be spending on the actual data transformations and data analytics can be a fraction of the overall time that they have. And I'm wondering if you can talk to some of the ways that having a unified experience in the form of Agile Data Engine, and reducing some of that operational overhead, can influence the way that teams think about the overall lifecycle of their data and the types of experiments that they're able to iterate towards without having that operational burden to weigh them down.
[00:13:58] Unknown:
Basically, how we see it, in Agile Data Engine there's a modular way of dividing the development into packages and entities and so forth. It reduces the individual data engineer's signature on the work, so it makes it easier and faster to maintain other people's work and to rotate tasks among the data engineers, for example. One thing, of course, is that there are a lot of automation possibilities in Agile Data Engine. We have quite a lot of out-of-the-box SQL generation in Agile Data Engine, and support for all these databases, as I mentioned.
So, basically, we can build more robust solutions. We can also create templates and so forth. And all the data loads are tied to the data entities, or data products. So when we have these data products and something goes wrong, we're always able to see where the problem actually lies. There is no scattered development environment where all the different transformations targeting the same entity, for example, are scattered around and it's impossible to find the root cause of the error.
Of course, there's always the human factor, and following DataOps principles and trying to improve the data quality, or the development quality, all the time also reduces the time spent on operations. We have examples of DataOps teams succeeding with less operations work, spending less than 5% of their time on operations.
[00:16:30] Unknown:
And another capability that you mentioned in the introduction to Agile Data Engine is CI/CD, and that is obviously a workflow that has become popularized with application development teams being able to build and deploy their change sets rapidly. CI/CD in the data context is something that a lot of people have attempted, and there are different levels of capability across different toolchains, platforms, and deliverables. And I'm wondering if you could just talk through what CI/CD looks like in the context of a data warehouse and some of the reasons that it can pose unique challenges as compared to a purely software delivery chain.
[00:17:16] Unknown:
The CI/CD pipelines in Agile Data Engine are automatically generated. They are coded from scratch by our development team. What Agile Data Engine does is resolve the dependencies in the deployments automatically. And like I mentioned earlier, there's the just-in-time schema control. We don't want to deploy each DDL step in sequence; we want to jump from the current state to the desired state with minimal interruptions and disruptions in production. So Agile Data Engine, or the CI/CD built into Agile Data Engine, resolves the required ALTER or CREATE commands during the deployment.
It also has a feature to roll back in, for example, Snowflake and BigQuery, meaning that when the deployment contains grants and changes in the table structure, for example, we can roll back if some part of the deployment fails, so that the database entity doesn't end up in an unwanted state. It also generates the workflows automatically. It looks for the dependencies and puts the load transformations in the correct place in the DAGs for when the data is loaded in.
One of the nicest features in Agile Data Engine is the automated schema control in the deployments.
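As a concrete illustration of that state-based, just-in-time approach, the sketch below diffs a desired table model against the live information schema and emits only the statements needed to reach the end state, rather than replaying every committed DDL change. The function names and the exact ALTER syntax (which varies by warehouse) are assumptions for illustration, not Agile Data Engine's internals.

```python
# A rough sketch of state-based schema deployment: diff the desired model
# against the live database and emit only the DDL needed to reach the end
# state, skipping intermediate commits. Illustrative only; ALTER syntax and
# the DB-API parameter style vary by warehouse and driver.

def fetch_current_columns(cur, schema: str, table: str) -> dict[str, str]:
    # information_schema.columns exists in all the warehouses discussed here.
    cur.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE table_schema = %s AND table_name = %s",
        (schema, table),
    )
    return {name.lower(): dtype.lower() for name, dtype in cur.fetchall()}

def plan_schema_changes(desired: dict[str, str], current: dict[str, str],
                        qualified_table: str) -> list[str]:
    statements = []
    for column, dtype in desired.items():
        if column not in current:
            statements.append(
                f"ALTER TABLE {qualified_table} ADD COLUMN {column} {dtype}")
        elif current[column] != dtype.lower():
            # Jump straight to the final type, however many commits changed it.
            statements.append(
                f"ALTER TABLE {qualified_table} "
                f"ALTER COLUMN {column} SET DATA TYPE {dtype}")
    # A real tool would also handle dropped columns, new tables, and grants,
    # and wrap the statements in a transaction where the warehouse allows it.
    return statements
```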
[00:18:56] Unknown:
In that context of CI/CD, one of the challenges that often comes up is being able to have a nonproduction environment with enough representative data to work with effectively and understand whether your changes are going to have the desired impact, or whether they're going to introduce bugs or errors in the underlying data that you're trying to operate on, without having to run directly against production. Obviously, some warehouses, Snowflake in particular, I know, have the option of effectively copy-on-write tables, where you can take an existing table, write a copy of it to a new destination to test out your transformations, and then blow that away when you're done. I'm wondering if you can just talk to some of the ways that you have worked to simplify the conceptual aspects of building out these nonproduction or staging workflows to validate changes before they actually become live in production and start impacting real-world data?
[00:20:00] Unknown:
Well, that depends a bit. I would say that's not out of the box in Agile Data Engine, because, like you mentioned, in Snowflake we have usually, in projects, been cloning the data to the QA environment. In Agile Data Engine we, of course, usually have the development environment, then the QA or test environment, or both, and then in the end the production environment. First we automatically deploy to dev, then to test, and then, if we are happy, we promote that to production.
Well, that's the age-old problem in data development: you always need to test with the production data. Even if there's good-quality test data available, when you're integrating over several sources there's no integrated test data across those sources. So you still need to go with the production data. That's usually done, for example, in Snowflake by cloning the production data to the QA or test environment.
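For reference, Snowflake's zero-copy cloning makes that pattern cheap, since a clone shares the underlying storage with its source until either side is modified. A minimal sketch, with hypothetical database and connection names:

```python
# Zero-copy clone of production into a QA database; the clone shares storage
# with the source until data is modified, so creating it is near-instant.
# `warehouse_conn` and the database names are hypothetical.
with warehouse_conn.cursor() as cur:
    cur.execute("CREATE OR REPLACE DATABASE qa_db CLONE prod_db")
    # ...run and validate candidate transformations against qa_db...
    cur.execute("DROP DATABASE qa_db")  # discard the clone when finished
```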
[00:21:14] Unknown:
And another aspect of that is the question of data governance and data proliferation, where you want to reduce the number of copies you have of any given piece of data, particularly if any of it has PII. You don't necessarily want to be running test code on PII data, because of the potential for compliance issues or data leakage. And I'm wondering if there are any controls within Agile Data Engine to help with identifying and mitigating some of those risks early in the development cycle.
[00:21:46] Unknown:
We have best practices for handling PII data in the development sense, and we have ways of implementing different kinds of GDPR rules. But automating that completely, I would say, is a bit of a tricky question. If you go to our website, you can find some blog posts about GDPR and about handling and modeling PII data with Agile Data Engine.
[00:22:25] Unknown:
And digging into Agile Data Engine itself, can you talk through some of the architectural elements of how it's implemented and some of the ways that the design and goals of the system have changed since you first started working on it? Oh, sure.
[00:22:39] Unknown:
First of all, Agile Data Engine is SaaS software running on cloud. We have the multitenant SaaS service layer, which consists of the SaaS management, user management, federated AD, VPN connectivity, monitoring, and an external API, for example. And underneath, or besides that, there's the customer-specific design and deploy layer, with centralized metadata, the designer user interface, and the CI/CD pipeline builds. And then at the core there are the runtimes, which are tenant- and environment-specific and connect to the customer database over a private link or VPN.
As for the design goals of the system, I would say the major change was that we realized we had to change the approach to scale better and start to build the SaaS foundation from the private edition. The first version of Agile Data Engine was cloud native, but it was installed in the customers' own cloud accounts. The SaaS version is a bit different, and we had to do a few tricks to not run the data through our SaaS accounts. Basically, we used to have the E part, the extract, in Agile Data Engine as well.
We let it go when we moved to the SaaS service. One thing, of course, is that we are also seeing the metadata on the development, on the database structures and such, and the SQL itself, become much more valuable now for the customers
[00:24:58] Unknown:
than it used to be. In terms of the design and implementation of the system, I'm curious what pieces you were able to pull off the shelf to contribute to the overall design and implementation, and what aspects of the platform, core to the functionality you're trying to deliver, needed to be engineered in-house to build custom solutions and address some of the shortcomings in the overall ecosystem of tooling that's out there?
[00:25:30] Unknown:
Basically, I would say that almost all components have been coded by ourselves. Our development team has done wonderful work there. We used to have at least a couple of open source packages as a base for our tool. We had, for example, Jenkins for the CI/CD pipelines, but we realized that it wasn't efficient enough, and there were quite a lot of security vulnerabilities. So we let it go and coded that from scratch as well. And it actually works really, really well.
One thing we still have is Airflow. Currently, it's the only open source software that is part of our tool, and we call it Airflow on steroids, because we have our own libraries there and we generate all the DAGs automatically. We can also run some parts completely without the Airflow overhead, so that shouldn't be a showstopper either. It's not some sort of vanilla Airflow out of the box.
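Generating DAGs programmatically from metadata is a well-established Airflow pattern; the sketch below shows the general shape of it, with a hypothetical metadata dictionary standing in for a real repository. This illustrates the pattern in vanilla Airflow terms, not Agile Data Engine's own generator or libraries.

```python
# A minimal sketch of metadata-driven DAG generation in Airflow (2.4+ for
# the `schedule` argument). The WORKFLOWS metadata and load logic are
# hypothetical stand-ins for a real metadata repository.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

WORKFLOWS = {
    "orders_hourly": {"schedule": "@hourly", "loads": ["stg_orders", "dw_orders"]},
    "customers_daily": {"schedule": "@daily", "loads": ["stg_customers"]},
}

def run_load(load_name: str) -> None:
    # In a real system this would submit the generated SQL to the warehouse.
    print(f"executing load {load_name}")

for workflow_name, spec in WORKFLOWS.items():
    dag = DAG(
        dag_id=workflow_name,
        start_date=datetime(2023, 1, 1),
        schedule=spec["schedule"],
        catchup=False,
    )
    previous = None
    for load_name in spec["loads"]:
        task = PythonOperator(
            task_id=load_name,
            python_callable=run_load,
            op_args=[load_name],
            dag=dag,
        )
        if previous is not None:
            previous >> task  # chain loads in dependency order
        previous = task
    # Airflow discovers DAGs via module globals, so register each one there.
    globals()[workflow_name] = dag
```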
[00:27:08] Unknown:
So for DataOps in general, there are definitely some very coarse-grained, obvious things that you need to solve for: CI/CD, being able to manage the lifecycle of transformations, ensuring that they run on a regular basis, implementing data quality checks. I'm just wondering, of the overall iceberg that is DataOps, what are the things below the waterline that most people don't think about until they really start to get into the flow of solving their major problems and start to realize, oh, these are also things that I need to be thinking about and worrying about as I work towards a very reliable and stable analytical solution?
[00:27:42] Unknown:
That's quite a question. I'm used to working with Agile Data Engine, so the CI/CD and the schema control are everyday life for me, so that part is quite easy. But I would say that if you want to succeed with DataOps, you need to take all the aspects under control. First, you have to have the agile methodologies; you have to get the ways of working correct and the tooling correct. You have lean and total quality management and things like that in place. So most of the aspects are, I would say, the human factor.
But then there's the tooling and automation. I would say that one of the main principles in lean is that you want to get rid of all the waste, prioritize your backlog, and then also automate as much as possible. This comes with total quality management as well: you want to take development quality and product quality under control in all aspects and all parts of the work. Test automation is one thing which is, I would say, really easily forgotten in development.
So you don't test things, or you test things before going to production, but you don't have regular tests on your data pipelines, on the quality of the data, and so forth. I would say that's one of the major challenges in adopting DataOps. We have good ways of addressing data quality and data pipeline operations. We have out-of-the-box monitoring for the data pipelines. We can send notifications about which data pipelines have failed. We can give you reports on the failure rate of certain workflows, for example. And we can create data quality tests, running daily with your data loads, which can end up stopping the data loads or just giving warnings and so forth.
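The stop-versus-warn behavior described here can be pictured with a small sketch like the one below, where each test is a SQL query returning a count of offending rows and a severity decides whether a failure aborts downstream loads or only notifies the team. The names and checks are hypothetical; a real platform would drive this from metadata rather than hard-coded calls.

```python
# A small sketch of the stop-versus-warn pattern for scheduled data quality
# tests. Generic illustration with hypothetical table and column names.
import logging

logger = logging.getLogger("dq")

def run_quality_test(cur, name: str, check_sql: str, severity: str) -> None:
    """check_sql should return the number of offending rows (0 means pass)."""
    cur.execute(check_sql)
    bad_rows = cur.fetchone()[0]
    if bad_rows == 0:
        return
    message = f"quality test '{name}' failed: {bad_rows} offending rows"
    if severity == "error":
        # Raising aborts the workflow, which stops downstream loads.
        raise RuntimeError(message)
    # Warning-level tests let the load continue but notify the team.
    logger.warning(message)

# Example checks that might run after each daily load:
# run_quality_test(cur, "orders_not_null_key",
#                  "SELECT COUNT(*) FROM dw_orders WHERE order_id IS NULL",
#                  severity="error")
# run_quality_test(cur, "orders_unique_key",
#                  "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM dw_orders",
#                  severity="warn")
```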
[00:30:36] Unknown:
Did that answer all the parts of the question? Yeah. No, it's just interesting to get some perspective from somebody who's working in the space day to day about some of the more granular or nuanced aspects of the workflow that are only obvious once you start getting into the weeds with it. And to that point as well, the overall ecosystem of data management and data engineering, and the fundamental platform capabilities that are available, have been shifting and evolving quite rapidly over recent years. I'm wondering if you can talk to some of the ways that those changes in the broader data ecosystem have influenced the design thinking and product goals of Agile Data Engine, and some of the ways that they have driven customer adoption as the reality of operationalizing data has become a broader concern for more people?
[00:31:32] Unknown:
I would say that one change in the data ecosystem that has emerged within the last couple of years, after going to the public cloud, is that now we're starting to see customers transferring from one cloud vendor to another, or from one analytical database to another. Five years ago, all that stuff was quite new, so companies spent a tremendous amount of time and effort analyzing their end game: okay, we'll go to AWS, and they were thinking that this was some eternal, lifelong thing they were going for. But now I think more and more customers are becoming aware. I don't know why they're changing, whether because of price, or security, or some services available on other cloud platforms, or different reasons, but that's one thing I think we've been seeing lately more and more.
And, actually, that's quite funny, because with Agile Data Engine, because we're metadata-driven, of course there's a lot of custom SQL; almost all implementations on Agile Data Engine include at least some portions of custom SQL. But usually what it is, it's standard SQL, and standard SQL works quite well from database to database. So, basically, we're able to help customers migrate their data solutions or data products from one cloud vendor to another. A couple of years back, I wouldn't have believed that it would be at all efficient to do. And that's a really nice thing, because it also extends the lifecycle of the data products quite a bit.
[00:33:55] Unknown:
And for teams who are using Agile Data Engine for managing that DataOps workflow, I'm wondering if you can just talk to what that looks like for teams who are using it to manage their day-to-day engineering efforts.
[00:34:10] Unknown:
Well, I don't know. It doesn't necessarily differ that much from other teams. There are the same tasks: you need to do the requirement analysis, you need to get the data, look at the data, and run the data into the database. But for creating the loads, you can import metadata directly into Agile Data Engine. You can export it from the source database, so you can create the entities quite easily in Agile Data Engine. Of course, you can do that manually as well, and you can propagate data flows forward into the different layers of the database. And one thing is that when you get used to easy and trustworthy deployment, you no longer have to wait for Monday for the production deployment. So the teams can just deploy to production.
We have a customer deploying 250 times per month to production. So where it used to take 3 months to get changes to production, now we do production deployments on a daily basis, and we don't need to skip deployments because, well, it's Friday, and it might get difficult, and, you know, what if loads fail during the weekend, and so forth? It also releases time and effort from fiddling with, and keeping the lights on in, the data warehouse, from DataOps tool stack management, to actually deriving insight from the data for the end users.
[00:35:58] Unknown:
And another aspect of DevOps that seems to be getting carried into DataOps, and that you are enabling through your tool, is the idea of continuous improvement, which obviously requires some level of insight into the way the work is being done, where the bottlenecks are, and what some of the things are that slow people down. And I'm wondering if you can talk to some of the insights that you generate to help your customers manage that feedback loop and understand how they can improve their processes or identify new opportunities for how to use their data?
[00:36:30] Unknown:
Yeah. Of course, there are the agile methodologies that give you some insights on that as well. But we can actually integrate Jira tickets directly into Agile Data Engine, so we can track what happens where. For the solution itself, we have some insights on the implementation, how well it fits the best practices, and there are also database performance considerations and so on. Then we also have a well-designed review that combines the insights from the solution.
We have questionnaires and interviews run by our professional services team to improve customer development and operation processes. Also, there are insights on basic metrics for the overall solution and system. On delivery flow, for example: what the lead times from design to production are, and the number of commits and deployments to production. On time to value: how many data products and how many changes to existing things are being released, what the reusability index of the data products is, and what the dependencies on a certain data product are, and so forth. And then, of course, on activity and quality: how many workflow executions there are, what the failure rate is, and the same, basically, for the data quality tests.
So, basically, these insights can be used to manage the productivity and quality of the data teams across different types of changes, like person rotations, consultancy or vendor changes, service migrations, and target database or cloud migrations.
[00:38:15] Unknown:
The about page for Agile Data Engine mentions that you have a unique approach to warehouse automation. And I'm curious if you can just give some color to that statement and help us understand some of the ways that your practices around data warehouse automation differ from the ways the rest of the industry is thinking about it. One thing really special,
[00:38:39] Unknown:
that I don't think anyone else has, is the just-in-time deployment of the schema changes, meaning that if we do development and commit, for example, 20 changes to a table, we don't want to apply those 20 changes in sequence to production; we want to get directly to the end state. So that's one thing. Of course, our metadata-driven approach, where the data product definitions are packages that include both the data model and the load transformation definitions, makes our CI/CD pipelines really flexible and automatic, but it also enables us to run these automatic schema changes together with our automatic orchestration of deployments and workflows.
So, basically, we can run a large number of commits to production every day, and it's the way the deployments are done. Let's say we have a business real-time data product that refreshes every 2 minutes. We can deploy changes to that without stopping the loads in between. Our CI/CD takes care of everything: it tries to do the deployment in between the loads, stops the load from executing if the deployment is not done yet, and then continues where it left off after the deployment is done. And I would say that makes our approach, our solution, really unique in the context of DataOps tooling.
[00:40:33] Unknown:
Another aspect of DataOps, and some of the ways that it has been evolving in recent years, is the growth in adoption of machine learning and AI. And I'm curious how that has manifested in the ways that customers are thinking about the capabilities that your platform provides, and some of the challenges it poses to people who are trying to manage the overall lifecycle of their data, maybe beyond the warehouse.
[00:40:56] Unknown:
I don't know if that has really had such a huge impact, but I think the data scientists are still doing their model analysis and training with their tools of preference. If you go to a data scientist and say, oh, now you need to use these tools, they'd probably stop working. But one thing, of course, is that when they want to put things in production, when they want production-grade data pipelines to feed their models and so forth, that's where Agile Data Engine comes in. Then there are some examples. For example, one customer, this was a couple of years back, still when Redshift wasn't scaling that well, used Agile Data Engine to export data out of Redshift, load the data into a Spark cluster, do some heavy calculations there, and import the data back into Redshift.
Yeah, these are some examples, but I think data scientists are still working how they want to work. And when they want to get production-grade data pipelines running and get their data in top quality every day, then they turn to us.
[00:42:24] Unknown:
And in terms of your work with your customers and some of the ways that you're seeing them apply Agile Data Engine's capabilities to their problems and workloads, I'm curious what are some of the most interesting or innovative or unexpected ways that you've seen the platform used?
[00:42:41] Unknown:
Well, one case is where three different healthcare projects that were not competing with each other are sharing a common data model between each other, since they have some set of common data sources. Basically, what they're doing is exporting their metadata out of Agile Data Engine; we want to avoid vendor lock for ourselves as well, so everything you generate or create in Agile Data Engine can be exported out. So they're exporting their data models and sharing them with each other to speed up their data development.
That's quite a curious way. We don't hand out ready-made data models for SAP, for example, but some partner vendor might do that. Then we have a financial customer using our external APIs to get the data product definitions, the metadata, and the custom SQL out of Agile Data Engine, implementing a disaster-twin capability, but also running various automatic tests on the metadata. Then there are some customers that actually run their data warehouses in parallel in the cloud and on premise with Agile Data Engine. They're deploying the same packages to two completely different environments and running their data loads on premise as well. One big customer migrated a separate business real-time implementation and a Teradata data warehouse into one data warehouse with Agile Data Engine and Snowflake, with huge amounts of data.
And there are, like, a thousand business users relying on data published every 3 minutes to make everyday business decisions, 24/7. The funny thing was that the business users didn't know at all that the whole back end had been replaced, only that the data is always on time and there are no issues with the data quality anymore. It's been in production now for one and a half years already, and maintenance work is only a small portion, less than 3 percent, of the overall development budget.
[00:45:12] Unknown:
And in your own experience of working on Agile Data Engine, working with customers, and just operating in the overall space of DataOps as a service, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:45:32] Unknown:
There are, of course, a lot of different integration services and integration patterns among our customers for getting the data into the data warehouse. One thing is that when we started working on Agile Data Engine, we had the extract part as well, as part of Agile Data Engine. But then we realized that the customers don't want that. They want to keep control, with different teams in different places and different methods for different source systems, over feeding the data to their cloud platform.
That was quite surprising at the time, and that's also what we decided to give up when we moved to building the SaaS service. There's also one thing, I don't know whether it's a bad or a good thing, but a lot of customers want, or have an eagerness, to implement things with plain cloud-native services using very database- or vendor-specific features. In some cases that might be more efficient architecturally. But then again, how to say it nicely, it quite effectively creates a vendor lock on the cloud platform or the database platform. Because if you use standard SQL, around 90% of it is transferable directly to another database.
But when you start to use the really specific features of a database, then you get these migration problems. I would say those are the main ones. That's about it.
[00:47:58] Unknown:
And for teams who are looking to address some of the operational challenges of managing their data life cycle, what are the cases where Agile Data Engine is the wrong choice?
[00:48:09] Unknown:
Well, of course, if you want to make an informed decision to run with a spaghetti architecture, where you accept that, okay, people are doing their stuff how they want, they're building one-off solutions, and when something breaks down you redevelop it, then, of course, in that sense, I wouldn't go with Agile Data Engine. But that said, usually in our customer cases there is that kind of data warehouse where all the production-grade data pipelines run and so forth. With Agile Data Engine, you can also do this experimentation.
There is no statement that with Agile Data Engine you need to do a three-layer data warehouse or whatever; you can do whatever you want. But we still have a lot of customers that use Agile Data Engine and, on the side, do these experimentations with other tools. I think that's okay. There is no one tool to fit all purposes. Also, maybe if you're a startup in the exploration phase, you don't want to use Agile Data Engine. But, yeah, I would say it depends, of course, on the requirements you have for your data platform.
But Agile Data Engine is scalable, and it supports multi-team development and so forth. So it's a good choice.
[00:49:40] Unknown:
And as you continue to build and iterate on Agile Data Engine and look forward to some of the upcoming challenges or evolutions in data workflows and data platforms, what are some of the things you have planned for the near to medium term, or any particular projects that you're excited to dig into?
[00:49:57] Unknown:
Well, of course, our vision is to lead data management, delivering innovation that allows our customers to maximize data warehouse productivity, to continuously build on investments in data products, and to manage operations with dashboards delivering real-time insights and benchmarks to key stakeholders. So the data management related features will be more and more in our focus, and we will improve features on data products and the in-database metrics layer. In short, we want to enable transparency into the investments in data products.
Of course, we are looking to extend the supported database services, currently working on Databricks support, and we will add more features for the business-critical use cases and bring new regions to be supported. That's, I would say, currently our goal.
[00:50:53] Unknown:
Are there any other aspects of the work that you're doing at Agile Data Engine, or the space of DataOps, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:51:04] Unknown:
At the moment, I'm happy.
[00:51:08] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:24] Unknown:
One thing that's getting more and more attention, and I would say it's a matter of time, is using ML and AI to automate generating structured metadata from unstructured data, like videos and free text and so forth, for analysis. And also, maybe generating, not glossaries, but, how to say it, concept diagrams for organizations based on their unstructured data would be nice. One thing, I don't know if you've noticed, but there are a lot of rumors on social media about the death of SQL, and I would state that those rumors are vastly exaggerated. The reasoning behind them is that AI will replace the need for SQL, but there's a fundamental flaw in that thinking.
It might be that it makes writing or generating the queries faster and easier, but we still need a mathematically exact way of defining the requirements for the datasets, and one of the most commonly known methods for that is SQL. So that's my perspective on the tooling: it's not going away.
[00:52:52] Unknown:
Alright. Well, thank you very much for taking the time to join me and share the work that you're doing at Agile Data Engine. It's definitely a very interesting platform, and it's great to see more options out there for reducing the operational overhead of managing data workflows, because there's already enough complexity going on without having to add more to it. So I appreciate all of the time and energy that you and the other folks at Agile Data Engine are putting into that, and I hope you enjoy the rest of your day. Thanks.
[00:53:24] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Background
Overview of Agile Data Engine
Defining DataOps and Tool Categories
Opinionated Architectural Aspects
Reducing Operational Overhead
CI/CD in Data Warehousing
Nonproduction and Staging Workflows
Architectural Elements of Agile Data Engine
Challenges in DataOps
Evolving Data Ecosystem
Day-to-Day Engineering with Agile Data Engine
Continuous Improvement and Insights
Unique Approach to Warehouse Automation
Impact of Machine Learning and AI
Interesting Use Cases
Lessons Learned in DataOps
When Agile Data Engine is the Wrong Choice
Future Plans and Projects
Closing Remarks and Final Thoughts