Summary
Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing and agile software development, along with the cross-functional collaboration, feedback loops, and focus on automation of the DevOps movement. In this episode Christopher Bergh discusses ways that you can start adding reliability and speed to your workflow to deliver results with confidence and consistency.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Christopher Bergh about DataKitchen and the rise of DataOps
Interview
- Introduction
- How did you get involved in the area of data management?
- How do you define DataOps?
- How does it compare to the practices encouraged by the DevOps movement?
- How does it relate to or influence the role of a data engineer?
- How does a DataOps oriented workflow differ from other existing approaches for building data platforms?
- One of the aspects of DataOps that you call out is the practice of providing multiple environments to provide a platform for testing the various aspects of the analytics workflow in a non-production context. What are some of the techniques that are available for managing data in appropriate volumes across those deployments?
- The practice of testing logic as code is fairly well understood and has a large set of existing tools. What have you found to be some of the most effective methods for testing data as it flows through a system?
- One of the practices of DevOps is to create feedback loops that can be used to ensure that business needs are being met. What are the metrics that you track in your platform to define the value that is being created and how the various steps in the workflow are proceeding toward that goal?
- In order to keep feedback loops fast it is necessary for tests to run quickly. How do you balance the need for larger quantities of data to be used for verifying scalability/performance against optimizing for cost and speed in non-production environments?
- How does the DataKitchen platform simplify the process of operationalizing a data analytics workflow?
- As the need for rapid iteration and deployment of systems to capture, store, process, and analyze data becomes more prevalent how do you foresee that feeding back into the ways that the landscape of data tools are designed and developed?
Contact Info
- @ChrisBergh on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- DataOps Manifesto
- DataKitchen
- 2017: The Year Of DataOps
- Air Traffic Control
- Chief Data Officer (CDO)
- Gartner
- W. Edwards Deming
- DevOps
- Total Quality Management (TQM)
- Informatica
- Talend
- Agile Development
- Cattle Not Pets
- IDE (Integrated Development Environment)
- Tableau
- Delphix
- Dremio
- Pachyderm
- Continuous Delivery by Jez Humble and Dave Farley
- SLAs (Service Level Agreements)
- XKCD Image Recognition Comic
- Airflow
- Luigi
- DataKitchen Documentation
- Continuous Integration
- Continuous Delivery
- Docker
- Version Control
- Git
- Looker
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And for complete visibility into the health of your pipeline, including deployment tracking and powerful alerting driven by machine learning, Datadog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you'll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new t-shirt. And go to dataengineeringpodcast.com
[00:01:05] Unknown:
to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey. And today, I'm interviewing Chris Bergh about DataKitchen and the rise of DataOps. So, Chris, could you start by introducing yourself?
[00:01:17] Unknown:
Hi. I'm Chris Bergh. I'm head chef of a Cambridge, Massachusetts company called DataKitchen. And, I guess our purpose for being is to help companies do data
[00:01:26] Unknown:
ops. And how did you first get involved in the area of data management?
[00:01:30] Unknown:
Well, yeah, I've had a long career. So I started as a working class kid from Wisconsin, went to the Peace Corps, taught math for 2 years, then went to Columbia, and then did a bunch of years on a project to automate air traffic control with MIT and NASA. And so, you know, I've been a very technical guy, wrote a lot of code, and did AI back when AI wasn't cool. And then for a bunch of years I got into both consumer Internet and then enterprise software companies, left the technical domain, and started to be a manager. And about a dozen years ago, I joined an analytics company, a bootstrapped company that did everything you could think of for health care analytics.
And so we did data integration and data management on a daily, weekly, monthly basis. We did visualization and charts and graphs. We had data scientists before they were called data scientists. And we even made the mistake of building our entire stack of analytics all metadata driven with software. I was the chief operating officer, so my life was pretty much structured around 3 things. 1 is the guy who founded the company was a very bright, not too technical Harvard Medical School doctor who really knew the domain. And he would go off and talk to customers and have a bright idea and then come back to me as COO and say, well, here, take this bright idea and run with it. And so I'd get people together in the room. I'd have some data scientists and data engineers, we called them ETL engineers back then, and maybe some software engineers, and we'd hash it all out and say, okay, this cool idea is gonna take us 2 weeks to do. And I'd walk back into my boss's office and say, it's gonna take us 2 weeks, and be very proud. And he would sort of look down his glasses at me and say, wow, Chris. That takes 2 weeks? I thought that should take 2 hours. And I'd feel, you know, very embarrassed and sort of walk out of his office, and then, you know, kinda get a call on my phone from 1 of our customers, and they would say, Chris, there's a problem in the data. If you don't fix it, you're out. And I'd have some more gray hairs sprout then, and then I'd walk a little farther. And we had a lot of bright people that we hired, and they'd say, oh, there's this cool open source tool. Can I try it out on our next project? And so my life was kind of making the trains run on time in analytics. As COO of a moderately sized technology and consulting company, the question was, you know, how do you deliver analytics in the broadest definition, whether it's data or a model or a visualization? How do you deliver all those things to a business customer fast?
And how do you deliver high quality so you don't have errors? And then how do you let your team innovate? And so that was my life for a bunch of years. And the company grew, and then we sold it to a West Coast company called Model N. And I was knocking around to figure out what my next thing would be. And we talked to a couple hundred people and did customer interviews, my cofounders and I, who also worked at the same company, and we realized that a lot of people had the same problem we had. You know, there's a new role called a chief data officer or a chief analytics officer, and they sometimes have data engineers and data scientists and people who do data vis underneath them or working with them. And they have this sort of duality to them. 1 is they're getting beat up by their business customers the same way I got beat up by my boss, because they're just not going fast enough or things break. And the other is this sort of great desire for innovation. I was at the Gartner conference a few weeks ago, and there was, you know, a mess of sort of chief data officer types there. And they really do wanna prove their value to the organization, because they really do believe that analytics and data has the power to actually help organizations make fact based and reasonable decisions. And so they're caught with this desire to innovate, and this challenge of, you know, going too slow and just making the trains run on time. And so the way that we touch the world is we think, you know, the ability to innovate really comes from iteration. And so that's where the idea of data ops comes from, and that's what our company is focused on.
That's sort of a long answer. Did that answer your question?
[00:05:30] Unknown:
Yeah. No. That was great. So it's always interesting to get a sort of detailed history from people who have been in the space for a while, as all of the old things become new again simply because they've been rebranded. So, you know, there was the business intelligence analyst, and then there was the ETL developer, and now they're data engineers, with data ops as sort of an umbrella term to try and unify all these different roles that are getting new terminology
[00:05:54] Unknown:
simply because they're doing the same thing but with new tools. So Yeah. Yeah. And it's the tech industry. Right? So all these terms, data ops, data engineering, I think of them as a gas. They sort of expand to fill the available space. Like, I remember when big data just meant, you know, clusters of parallel machines with dumb storage. And now big data is everything having to do with analytics. And so, you know, I think that's just an aspect of the tech industry we're not particularly fond of. But, you know, we have an idea of what we think DataOps is and why it's relevant to people. And, you know, we've authored a manifesto. We've talked about it at conferences. And we really do believe that it's a better way for people who are doing data analytics to work. And it's a much more satisfying, much happier way to live your life instead of sort of being browbeaten by business or frustrated that things are breaking or killing yourself with, you know, sort of heroic deeds and then waking up, you know, Saturday morning when you should be at soccer with your kids with a panic that something is wrong in the database, and then having to skulk off the soccer field because you just got an email saying it's broken and having to go to your car and fix something. And so, I don't know if you feel like you can identify with that, but, like, I'm sort of done with, you know, making changes and having things break.
It's just not a fun way to live.
[00:07:15] Unknown:
Yeah. I think anybody who has ever served any time on call can relate to something at least similar to that of getting called out in the middle of some event where you then have to go and sit and hammer away on your keyboard to try and fix something that shouldn't have broken in the first place or something that was just missed due to oversight or a missing process. So
[00:07:33] Unknown:
Yeah. Yeah. And you can identify. Right? And so, you know, I think it's that look that my wife gives me. It's just like, as you've been married for a while, it's like, oh, you're an idiot. And it's harder too because I've been both the person on the keyboard who's done the work and made the mistakes and the person who's managed those people. I've been fortunate to see both sides of the equation. And, you know, I'm a big believer in Deming, this guy behind, you know, sort of industrial process control. And a lot of the problems, I think, emanate not from an individual. I mean, almost all the problems don't have to do with an individual screwing up, and sort of yelling at someone or blaming someone for a problem is not the right attitude. It's really about the system in which people work. And if that system is right, people can do great work. And, yeah, they're gonna make errors, and that's fine. You just make sure they don't make the same 1 over and over again. And so loving your errors, iterating and improving, iterating and innovating, I think, are what data analytic teams, data engineers, data scientists really, really need to focus on in order to be successful.
[00:08:48] Unknown:
Yeah. And that borrows 1 of the ideas from the DevOps movement of being able to learn from your failure, where you use those as input to your feedback loops for improving your overall process and capability, and making sure that you have automated checks in place to guard against those failures and prevent regressions from putting you in the same situation where you have that failure again for preventable reasons.
[00:09:17] Unknown:
Yeah. Yeah. Exactly. And I think, you know, DevOps kinda goes back to the Agile Manifesto, goes back to total quality management, goes back to this crazy guy Deming. And that sort of idea of loving your errors and trying to improve on them and automating the testing and automating the deployment has really gone through the software industry. And, you know, back when I was a development manager in the late nineties and early 2000s, I thought I was a pretty cool guy if I could get my team to ship software every 3 months. And now if I would try to get a job as a development manager saying I could ship software every 3 months, I wouldn't get a job. Right? Because the expectation is that you should be able to ship software every 3 minutes or 3 seconds.
And, you know, the question is why does that expectation exist in software now? And why doesn't that expectation exist, at least broadly, in the realm of people who do data and analytics? Because I really do think it should.
[00:10:22] Unknown:
And that brings us to the question of the way that the concepts and principles of data ops are manifested in the way that roles are defined for people on an analytics team and, in particular, how a data engineer can implement and benefit from some of the principles
[00:10:43] Unknown:
of how you define data ops. Yeah. So, you know, having been a data engineer and hired data engineers, I think of data engineering as sort of a software developerization of what people called DBAs or ETL engineers. And what do I mean by that? I think a lot of companies have hired people who know how to use tools like Informatica and Talend, and they're very comfortable that this tool is it, and they exist in this sort of very slow but successful waterfall process to deliver things. And I think the idea of data engineering is that it is a way to think of their work more as code than visual designs, more from an agile way of doing things than, you know, a waterfall way of doing things. And I think it's actually really good for, you know, people out there to think of themselves as data engineers because I think that is actually the way the field's going. And companies that we talk to don't want to have to put a ticket in a system and then have a 6 month project to update their data warehouse or data lake or whatever they call their data infrastructure. It's just not operating at the speed at which their businesses or their customers need. And so, whether it's a data engineer doing the process of data ops or a data scientist or whether there's actually gonna be a role called a data ops engineer, I think data engineering is a really important field and not just something that you should cost minimize and send to India and work from tickets. I actually think data engineering, given that it's actually the majority of the work in the analytics cycle, is hugely valuable. And so I'm happy to see it.
My impression of the term data engineering is that it's really an upskilling or up-statusing of the field, and I think it's just long overdue.
[00:12:30] Unknown:
And 1 of the trends that has led to the introduction of the concept of data engineering is the idea of having all kinds of easily deployable resources through virtualization and cloud resources, as well as managed services such as data warehouses or data pipelines. And a similar transformation has happened in terms of systems administration and infrastructure operations with the cloud and those similar capabilities and things like configuration management. So I'm wondering how the workflow for somebody who is practicing some of the principles of data engineering and data ops would differ from more traditional methods for building data platforms for being able to process analytic workflows from ingestion through to analysis and delivery.
[00:13:25] Unknown:
Yeah. And I hope I never have to go back to, like, buying servers and binding your database code to your database and then having to clone it to back it up. You know, at the job I talked about before, to do the analytics, we had hundreds of servers that we had to buy and provision. And it was very difficult for us to sort of lift your data and your processing from 1 server to the other. We had to jump through hoops to figure out how to do that. And so I think the idea of cloud resources, the ephemeralization of resources, the idea of coding to create or shut down resources, containers, all those things really give you the ability to deploy quickly. They give you the ability to sort of reproduce what you've done, and they take, you know, that DevOps term cattle, not pets. In some ways, they take that from being sort of a black art.
And, oh, we've gotta have that single point of failure because John over here in the corner knows where all the servers are and all the passwords are, and you can't touch them, to being something easily deployable. So I think cloud virtualization makes a big difference in how you do it. And the other way I think it makes a big difference is sort of like cutting and pasting of a database. And I think 1 of the secrets of going fast in analytics is being able to have versions in development that are based on what is in production.
And so sometimes you wanna do that with all of your data. You know, if your database is reasonably sized, tens of terabytes, hundreds of terabytes, sort of cloning it in a Redshift or an EMR in AWS is not an unreasonably slow thing to do. If you've got a petabyte scale database, then sort of sample a subset. But really, the idea is can I have a place where I do my work? Where I've cut and pasted all the data, cut and pasted all the exact versions and environments that I have, whether it's versions of the database or versions of Python libraries or that version of, you know, R that I'm using, and then all the code I'm acting on it with. And so I've got my place to do the work. Or like with some of our customers, they have variations of what's in production. They'll have production that's got 2,000 people on it and another copy of production that's slightly different that's got 3 people on it. And so cutting and pasting of data and infrastructure really is enabled by virtualization. And that was just super hard before, because you'd have to go and talk to your database salesman, who always seemed to be an ex football player and wanted to take license fees from you, and I never really enjoyed talking with those types of people. But now it's much nicer to, like, spin stuff up, try it out, shut it down, and follow the sort of principles that were laid out in DevOps, applied to data. Yeah. And your comment about the sort of database vendor Oh, yeah. I just never liked the Oracle salesmen. Man, they were always, I don't know, they always, like, went to the gym before it, and it's like, oh, you've gotta get 4 more CPU licenses to try something out. And, like, why is that? I just wanna try something out to see if it works. Yeah. The explosion of
[00:16:20] Unknown:
open source tools for being able to store and process and analyze data has also led into the capability of having this degree of automation, because you don't have to make sure that you're constraining your environments by the number of licenses that you have or ensuring that, you know, the license file is in place on every system so that you can actually do anything with it. So I think that also plays into the ability to have those multiple environments for being able to test your different versions of code and data before you actually release it into a production flow, and ensure that you have those feedback cycles before you introduce an error into a situation where you are going to get paged on the weekend in the middle of an event. Yeah. Yeah. And I think, fundamentally, people who do data engineering think of their work as code. And
[00:17:06] Unknown:
I think that's the right way to think of it. And I think for someone who does data science, their work is code too. So whether it's a model or a model that has a bunch of random input parameters, it's still code. And even if you look at a Tableau workbook, you open it up, it's code. Maybe it's XML code, but there's a bunch of nested if then else statements in there. Now you can even embed Python into Tableau. Fundamentally, the thought inversion is that a lot of the tools out there are IDEs from a software development standpoint. Tableau is an IDE. RStudio is an IDE. Informatica is an IDE, an integrated development environment, that produces code. And so you should think of your job in data analytics as a code producer, and that word code has a lot of weight for someone who's managed software teams. If you grew up in software, you think, I do code, and then there's a whole bunch of thoughts that come after that. It's like, you know, code is complexity. Code is communication. Code can create this big hairball that's gonna cause me problems later on. And so the culture of how to treat and how to deal with the combinatorial complexity of a codebase environment, I think a lot of data and analytic teams are a bit in denial of that. And I think the vendors are sort of feeding into that. The vendors are like, hey, you do it in my system, everything's great. And they all sort of say the same thing, like, oh, you'll be fast. It'll be great. But the reality is it's code and complexity. And so how do you deal with that code and complexity? How do you change it? How do you modify it? Software's got a lot of really good ideas from the DevOps movement. And, also, there is more to it than software, because data pipelines are these sort of living things, and deploying them, monitoring them, testing them, is different. And also the front end, sort of creating environments for people to work in, is similar to but different from the DevOps approach that most software engineers have. Yeah.
And so how do you deal with that code and complexity? How do you change it? How do you modify it? And so software's got a lot of really good ideas from the DevOps movement. And, also, there is more to it than software does because data pipelines are these sort of living things, and sort of deploying them, monitoring them, testing them, is different. And so also the front end sort of creating environments for people to work in is is is different than the it's similar but different than the DevOps app that most software engineers have. Yeah.
[00:19:00] Unknown:
And 1 of the differentiations between a data oriented platform for being able to perform these analytics on these different workloads, and the way that you would approach a traditional software application, where it's largely going to be stateless except for whatever customer input is put into the database via the transactional system, is that in these more analytic oriented workloads there's generally a larger volume of data, which provides a certain amount of mass and gravity that makes it difficult to make it traverse these different environments, because of how slow and expensive it can be to make multiple copies of the same data if you're trying to replicate what you have in production, which in a lot of cases is necessary. What are some of the techniques that are available for being able to perform that management of taking a representative sampling of production data and bringing it down to prior environments so that you can have an appropriate representation when you're doing that testing and iteration?
[00:20:11] Unknown:
Yeah. And there's different ways to handle it. So 1 is, you know, if your data is in the tens or hundreds of terabytes, just back it up to a bucket store like S3 or a file system and then restore it, because usually the networks can handle having a full copy of production, if your domain allows the ability to do that with security and, you know, sort of personal information concerns. That's 1 way to do it. And then the other way is to take it and do a sampling or a filtering of that dataset if it's too large or there's security concerns. And that could be another sort of process that someone does, create a sampling ETL to give you your test data. And then a third technique is there's companies like Delphix or others that do data virtualization, and they allow you to sort of point your database at different datasets.
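To make the sampling option concrete, here is a minimal sketch of that kind of sampling ETL, assuming a relational source. SQLite stands in for the real warehouse, and the table name and 1% sample rate are illustrative assumptions rather than anything specific to DataKitchen.

```python
# A minimal sketch of a "sampling ETL": build a small, representative copy of
# a production table so a development environment can run quickly. SQLite is
# used only so the example is self-contained; a real warehouse would use its
# own sampling syntax (for example TABLESAMPLE or a hash of a key column).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod_orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO prod_orders VALUES (?, ?)",
    [(i, float(i % 50)) for i in range(100_000)],  # stand-in for production data
)

# Keep roughly 1% of production rows in the development copy.
conn.execute(
    """
    CREATE TABLE dev_orders AS
    SELECT * FROM prod_orders
    WHERE (random() % 100) = 0
    """
)

prod_count = conn.execute("SELECT count(*) FROM prod_orders").fetchone()[0]
dev_count = conn.execute("SELECT count(*) FROM dev_orders").fetchone()[0]
print(f"production rows: {prod_count}, dev sample rows: {dev_count}")
```

The same step is also a natural place to address the security concern mentioned above, since filtering or masking personal information can happen before the sample ever lands in a development environment.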
And so, you know, in general, I think for a lot of people, if you've got tens of terabytes of data, just copy and paste it, because the network can handle it. But, yeah, I think for us, it's also not just a data problem. You've gotta have the data, but you've also gotta have a representative hardware and software environment that matches production, which means the same tools, the same versions of libraries. You've also got to have a copy or a branch of the code that you're working on, whether that code is SQL code or Informatica code or whatever. You've got to have that in your environment. And then you've got to have sort of a dial that says I'm going to run part of the process or all of the process. And in order to do that, you've got to also be able to say, I need to test each step of the process.
So if I'm going to make a change, I take a copy of what's in production, the data, the code, the scripted environments. I make my own place for it. And then I run, in essence, a set of regression tests. I've got a representative sample, I make a change in something, I add another test, and then I run the whole thing again to make sure I haven't broken anything. And when you have your data environment, excuse me, having thousands of tests, or think of 20% of your code as test code, whether that's testing as monitoring production, testing as regression testing, functional testing, system testing, all the same concepts that exist in the software world, then you have a really high confidence that you're not gonna get that call on Saturday morning, because the tests will tell you if you've broken anything.
And that will also tell you that you're not the hero anymore. And that's also a cultural problem: there's a lot of, I think, heroism in tech where, if you are really good at something, you can produce a lot of code and a lot of technical debt, and then you get to be the person who's called in to fix it. And while that can be really emotionally satisfying for a while, it ends up not helping the team as a whole. And if you're the hero, and I've done that role, you end up being pulled in lots of directions and then you end up quitting. And so there's a lot of reasons to have a development environment that's representative of production, that has your code, and to do all the work to make these tests, in the broadest terms, that allow you to make changes to production fast enough. And I think testing in data and analytics is an underappreciated field.
[00:23:49] Unknown:
And the testing of the logic being employed in these analysis pipelines is fairly well understood from the principles of software engineering. But testing the data, particularly in the volumes and shapes that are more prevalent in analytic workloads, is something that is underappreciated and often misunderstood. And it also doesn't necessarily have the same quantity and quality of tooling available for performing those tests. So I'm wondering what are some of the most effective methods and practices for being able to create and run those tests in a repeatable, manageable, and fast enough way.
[00:24:31] Unknown:
Yeah. Yeah. Well, we could talk an hour about this. Right? Because we had to build a framework in our software to do that. Because if you think conceptually, you're testing 2 things. Right? You're testing the data as the data flows, whether it's streamed in or it's a kill and fill refresh or it's another set of transactions being poured into your analytic data warehouse, data lake, whatever you call it. That set of processing, you need to monitor it and make sure that the data or the artifacts are all being produced, for the purpose of having the business people or customers that you're delivering to not find problems when you've delivered it. And so I think of that as testing, but in reality, it's sort of monitoring, in some way. And our view of the world is that pipelines aren't just data pipelines. They're data pipelines that also have predictive models and visualizations in them. And all those things need to be tested, because someone could have changed them or some data could have come in from somewhere that's crap and broken something. So that's 1 aspect of testing, right, monitoring what's in production live. The second conceptual aspect is that if the work that data engineers and data scientists do is really code, well, then all those ideas that exist in software development of regression testing need to take place. And so how you do those things exactly, and the logic of doing those things and keeping track of the test results, is important. Because in a lot of ways, with the monitoring of what goes on, think of it as a Toyota factory, your production process where data comes in on the left side and, you know, whether it's Python or SQL, pick your language, it operates on it, and then there's some R and some predictive modeling, and then there's some visualization.
It's a Toyota factory. And at each step along the way, it's being transformed and artifacts are being added to the assembly line. And the assembly line is not a line. It's really a directed graph, but that's not important. It's that conceptual idea that you have something, a living process, that is manufacturing innovation and value for your customers. And just like Toyota and Deming said, well, let's monitor that. Let's get statistics on it. Let's not yell at people who made errors on it. Let's give people the ability to, you know, pull a chain and stop the assembly line. I think the same ideas of lean manufacturing and statistical process control apply to data and analytics.
And so you need to kind of think about what you're doing as a factory, but you also need to think about what you're doing as an innovation pipeline, a deployment pipeline. And those 2 things are actually really hard to do together. It's sort of as if you've got a Toyota factory where you can have anyone in the factory take a machine out, change it, and put it back in in a few seconds. That is really the state that you want to have in analytics, and that's a hard state to get to, right? It's hard to have lots of changes, lots of data going by, and lots of innovation going by. And so the quality tests can actually break up in a lot of different ways. There is the basic stuff of looking at whether the data I got is what I expected, you know, the row count changes, looking at sort of, you know, whether it has the right number of columns, whether the data fields are right. Those are tests that you really should have, because you wanna have those tests as close to when the data arrives as possible, just like Toyota. That way you can, kind of like Toyota does, go back and yell at the manufacturers of the, you know, the radiator and say, look, you're giving me some defective parts. Go fix it. Because oftentimes you get defective data, and sometimes the data engineering team has the choice to say, oh, I can just fix this data and then I can keep my production process going. And that can be a good thing to do sometimes, and it can be a bad thing to do sometimes. But you've got to think about how to manage that. And so tests that can tell you that the data's bad before your customer sees it are really important. And so there's a whole lot of tests, between, you know, statistical process control tests, the logical tests that you get from your business customer, and what we call historical balancing, where we look at what happened previously versus now.
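To make those categories concrete, here is a small sketch of the kinds of checks described above: a row count check, a schema check, and a historical balancing check against the previous load. It assumes pandas as the processing layer, and the column names and tolerances are invented for illustration rather than taken from any particular pipeline.

```python
# A sketch of three of the test types described above: a row-count check,
# a schema check, and a "historical balancing" check comparing this load
# against the previous one. Thresholds and column names are illustrative;
# a real pipeline would load them from configuration.
import pandas as pd

EXPECTED_COLUMNS = {"order_id", "customer_id", "amount", "order_date"}

def test_row_count(df: pd.DataFrame, previous_count: int, tolerance: float = 0.25) -> None:
    # Fail if today's row count swings more than 25% from the last load.
    change = abs(len(df) - previous_count) / max(previous_count, 1)
    assert change <= tolerance, f"row count changed by {change:.0%}"

def test_schema(df: pd.DataFrame) -> None:
    # Fail fast if an upstream source drops or renames a column.
    missing = EXPECTED_COLUMNS - set(df.columns)
    assert not missing, f"missing expected columns: {missing}"

def test_historical_balance(df: pd.DataFrame, previous_total: float, tolerance: float = 0.10) -> None:
    # Historical balancing: today's total should be within 10% of the prior load.
    total = df["amount"].sum()
    drift = abs(total - previous_total) / max(previous_total, 1.0)
    assert drift <= tolerance, f"total amount drifted by {drift:.0%}"

if __name__ == "__main__":
    today = pd.DataFrame({
        "order_id": [1, 2, 3],
        "customer_id": [10, 11, 10],
        "amount": [20.0, 35.0, 15.0],
        "order_date": ["2018-01-02"] * 3,
    })
    test_schema(today)
    test_row_count(today, previous_count=3)
    test_historical_balance(today, previous_total=65.0)
    print("all data checks passed")
```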
And there's a whole art to actually writing good tests. But, you know, the idea I find shocking to a lot of data analytics people is that testing should be 20% of your work, as automated tests, and that testing shouldn't sit on a shelf somewhere. It should be automated and deployed as part of your process. That's really the idea that we're trying to get across in data ops. Yeah. And I think that that's 1 of the areas where it differs too from just software tests, is that with software, you're testing
[00:29:17] Unknown:
in the process of building the deployable artifact, but then once it's in production, you're not generally continuing to execute those tests, because the system is essentially, you know, in a continuous state of test just by virtue of it running. Whereas with data, particularly with ingestion pipelines, you need to have those continuous tests to ensure that the data is matching the appropriate schemas and the appropriate volumes and shapes to make sure that you don't introduce those quality errors further down the line.
[00:29:43] Unknown:
Yeah. Yeah. And, you know, there's a lot of times you have data warehouse builds that could take hours. Right? And bad data gets in at the beginning, and then you don't notice it until hours or days later or, you know, God forbid, in front of the boss or the boss's boss's boss. And so, how do you set up a system where people aren't being defensive, where they don't slow things down because they're afraid to get yelled at or because their boss's boss's boss will see it, and everyone's locked in this circle of, I don't know, fear.
And I think what breaks the fear is that you can have provable confidence that what you're going to give to someone, your boss's boss's boss, will work. And the only way to have provable confidence is to build automated tests that you can trust, and run those during development and run those while they're in production and parameterize them accordingly.
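As a rough illustration of running the same checks in both places and parameterizing them accordingly, the sketch below applies one row count check in development and production with different settings. The environment names, tolerances, and the warn-in-dev versus fail-in-prod behaviour are assumptions made for the example.

```python
# A sketch of running the same check in development and production with
# environment-specific settings. The env names, tolerances, and the choice to
# warn in dev but fail hard in prod are assumptions for illustration.
ENV_SETTINGS = {
    "dev":  {"row_count_tolerance": 0.50, "fail_hard": False},  # sampled data swings more
    "prod": {"row_count_tolerance": 0.10, "fail_hard": True},   # strict before delivery
}

def check_row_count(env: str, current: int, previous: int) -> bool:
    settings = ENV_SETTINGS[env]
    change = abs(current - previous) / max(previous, 1)
    ok = change <= settings["row_count_tolerance"]
    if not ok and settings["fail_hard"]:
        raise RuntimeError(f"[{env}] row count changed by {change:.0%}")
    if not ok:
        print(f"[{env}] warning: row count changed by {change:.0%}")
    return ok

check_row_count("dev", current=500, previous=1_200)     # warns, keeps going
check_row_count("prod", current=1_150, previous=1_200)  # passes quietly
```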
[00:30:41] Unknown:
And also addressing the concept of fear of iterating and fear of continued development is that the more often you make those changes and release them into production and particularly if you do them with shorter iteration cycles, then you will decrease the amount of fear because the change sets that are getting pushed out are smaller. So it's easier to find and fix errors as they go into production than to batch them up into large volumes and then release it all as 1 giant ball of mud, and then you have to try and detangle that once it's in production and already causing errors, and you don't know exactly which change is contributing to what error.
[00:31:16] Unknown:
Yeah. Yeah. Exactly right. You know, smaller frequent releases are better because they're easier to debug, and they're also better because then your customers are getting more value quicker, and you can learn from them. Because the whole point of agile is feedback, is to counter the fact that we're all introverts who do this work, and to get these crazy business people and customers to give us feedback to make sure we're making the right choice in the way we implement. And so there is this emotional context that I'm trying to get across to people. If you're living in fear of making changes and you're waiting for the big deploys and you're kind of hoping things are right, or you've got these, like, change review boards where the 3 people who know the whole system are sitting around the table and they're reviewing your code or making sure it doesn't break all these other systems, those things are just a pain and not fun. And, like, if you can get rid of your change review board and really empower the youngest and least knowledgeable member of your team, and have him or her press a button and deploy and really have confidence that that won't break, then you've achieved a great velocity.
And that spirit, I think, isn't there in a lot of companies because of fear. And also, I think the intellectual framework of DataOps and the idea of DataOps really is not present in people in the data and analytics world. And so, you know, what I'm trying to do with the company is sort of colonize the mind space of people on data and analytic teams to say, look, DataOps is important. This has really changed the way people work in software development. It's changed the way people work in factories and manufacturing. I mean, I grew up in the eighties in Wisconsin and saw the Japanese imports cost 50,000 jobs in Milwaukee because of the fact that they could make cars better and cheaper by following Deming. And so these are a set of ideas that, I think, whether you call them agile or lean or TQM, really need to apply to what we do in data analytics.
[00:33:23] Unknown:
And on the idea of fast feedback loops encouraged by fast releases and being able to use that to determine whether or not the results that you're providing are giving the value that is desired from the business. What are some of the metrics that you track to define how that value is created and how the various steps in the overall workflow are proceeding toward that goal of producing that value?
[00:33:52] Unknown:
Yeah. I hear you. We have a thing called a tornado report, which allows you to track, by data source, where the errors were. And we also include the data engineers or the data scientists as part of that. So, in order to understand where the errors occur in production, you need to sort of find out where they come from and look at the hotspots. That's 1 area, sort of error reporting. The other part is sort of SLAs. Are you delivering on time? We also have metrics around, I think it's sort of an equivalent, there's not really a code coverage metric that exists in analytics, but we have, you know, ratios of tests to steps in the process. So you can tell sort of whether this work is well tested. So it's about errors, it's about deployment, it's about kind of ratios of code to tests that we enable with our software. But it's also something that we're continuing to evaluate and improve upon ourselves, because I think the more modern DevOps tools actually are better at that, because they've, you know, been doing it a while. So there's some inspiration and ideas that we're gonna take from those.
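For a rough sense of what those metrics could look like, here is a tiny sketch that groups test failures by data source (the tornado report idea) and computes a tests-per-step ratio as a loose analogue of code coverage. The event records and field names are invented for the example and are not DataKitchen's actual reporting format.

```python
# A sketch of error reporting by data source plus a rough test-to-step ratio.
# The test_events records are made up purely to show the shape of the idea.
from collections import Counter

test_events = [
    {"source": "claims_feed", "step": "ingest", "status": "fail"},
    {"source": "claims_feed", "step": "transform", "status": "pass"},
    {"source": "pharmacy_feed", "step": "ingest", "status": "pass"},
]

errors_by_source = Counter(e["source"] for e in test_events if e["status"] == "fail")
steps = {e["step"] for e in test_events}
tests_per_step = len(test_events) / max(len(steps), 1)

print("errors by source:", dict(errors_by_source))
print(f"tests per pipeline step: {tests_per_step:.1f}")
```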
[00:35:05] Unknown:
And also, to be able to keep those iteration cycles fast, 1 of the requirements in terms of managing continuous integration from a testing perspective is for the tests to be able to execute quickly, so that you're not waiting for an hours-long build to complete before you can determine whether the change that you made is having the desired effect. So how do you balance the need for larger quantities of data to be used for verifying the appropriate scalability and performance of a given set of changes against optimizing for the cost and speed of working in nonproduction environments?
[00:35:43] Unknown:
Yeah. You know, we don't have sort of a magic box to do that. Right? So what we think is that people should think of all the work in analytics as a directed graph, and people should be able to run parts of that graph in their development environment. They should have a basket of parameters where they can say, okay, I want to run this test based on a small input dataset or a big input dataset so that my test can run fast and quickly. And also, they should try to have their work be done where I'm working on 1 piece and my 1 piece is separate from other pieces, and I can run it over and over and over again and do the sort of design, see the results, judge the tests, and iterate and improve. Because another thing that's happening is that data engineering and data science are becoming much more of a team sport instead of the individual hero who's doing, you know, his or her coding from data to value, you know, the sexiest job of the 21st century where I do everything in analytics.
It's a team of people who are doing things, and the teams have to decompose their processes into steps. And so I think, like in software development, you should be able to choose which tests you run and have some tests be functional tests, some tests be kind of the equivalent of unit tests, some be system tests, and be able to turn those on and off based on what you're doing.
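As one way to picture the "basket of parameters" idea, the sketch below lets the same pipeline step run against a small slice of data during development and the full dataset in a scheduled run. The file name, loader, and limit values are hypothetical.

```python
# A sketch of the "basket of parameters" idea: the same transform step can be
# run against a small slice of data in development and the full dataset in a
# scheduled run. The file name and limits are invented for illustration.
from typing import Optional

import pandas as pd

def load_orders(path: str, limit: Optional[int] = None) -> pd.DataFrame:
    # In development, pass a small limit so the step and its tests finish in seconds.
    df = pd.read_csv(path)
    return df.head(limit) if limit else df

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # One isolated piece of the graph that a team member can iterate on alone.
    return df.groupby("customer_id", as_index=False)["amount"].sum()

# Development run, fast iteration on a slice:
#   dev_result = transform(load_orders("orders.csv", limit=1_000))
# Scheduled or CI run, full data:
#   full_result = transform(load_orders("orders.csv"))
```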
[00:37:02] Unknown:
Yeah. And there's a great book called Continuous Delivery by Jez Humble and Dave Farley, where 1 of the things that they talk about is the idea of having different tiers of tests, where you have your commit tests that are just very quick sanity checks to ensure that everything is meeting the required specifications as far as code quality, linting, simple things like that. So that if that fails, then it fails quickly and you know that you need to go back and fix something. And then you have the next level of tests that are just very quick unit tests, able to complete in a matter of minutes, so that you, again, know whether or not everything's going to work properly before it gets to the more long-running integration-level tests to ensure that everything is working properly together. So being able to have those quick circuit breakers of, you know, I know that this quality check isn't going to pass, so I need to go back and refactor, or, you know, everything made it through to integration testing, so I know that this whole set of error cases is already taken care of and addressed and I don't need to worry about that. So then you can continue your development cycle having that level of confidence, and wait for a potentially longer build to complete to see if there are any more deeply hidden or insidious errors that you need to address further down the line.
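One common way to express those tiers, assuming a pytest-based suite, is with markers so each stage of the build selects only the tests it needs. The marker names here are a convention assumed for the example and would normally be registered in the pytest configuration; the checks themselves are placeholders.

```python
# A sketch of tiered tests using pytest markers so fast checks run on every
# commit and slower integration checks run later in the build. The marker
# names (commit/unit/integration) are our own convention, not pytest built-ins.
import pytest

@pytest.mark.commit
def test_config_parses():
    # Seconds: fail fast if the pipeline configuration is malformed.
    assert {"source": "orders", "target": "warehouse"}  # placeholder check

@pytest.mark.unit
def test_transform_sums_amounts():
    # Minutes: exercise one transform on a tiny in-memory fixture.
    rows = [{"customer_id": 1, "amount": 2.0}, {"customer_id": 1, "amount": 3.0}]
    assert sum(r["amount"] for r in rows) == 5.0

@pytest.mark.integration
def test_end_to_end_row_counts():
    # Longer: run the full pipeline against a sampled dataset (stubbed here).
    pytest.skip("requires a provisioned test environment")

# Commit stage:      pytest -m commit
# Unit stage:        pytest -m "commit or unit"
# Integration stage: pytest -m integration
```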
[00:38:18] Unknown:
Yeah. Yeah. And I wish that more people in the data and analytics world had that thought, and were far enough along in thinking of automated testing to break the tests up into types, decide when you run them, and figure out how to optimally allocate the tests based on what your pattern is. For most people I encounter, the idea of automated regression of your data analysis suite is foreign. And so most companies will have a few tests that run in the beginning because some data engineer or ETL engineer put them in, and then they have largely manual processes to validate, or they're just winging it. And, you know, there's somebody who's grabbing some data from somewhere, they're doing some transformation, they look at it, it looks good, and then they push it out to the exec team. And so there's a lack of process discipline for most people who do data and analytics.
And I think that's also a challenge in the data ops movement, because it's kinda more fun to not have to think about your process and just do some stuff, throw it together, hey, it looks good. And that, I think, is 1 of the reasons why there's a lot of rework, there's a lot of data errors, there's a lot of mistrust in analytics. A lot of data and analytics projects are failing because of that. I mean, the biggest challenge in analytics in my mind is getting business people to trust the data, and what that means is actually making data driven decisions as opposed to gut driven decisions. Because if you're running a major brand and spending millions of dollars, or you're a CEO, you've got a lot of reasons why not to trust the data. You've got your gut, you've got your intuition, you've got what your friends say, what you heard here, some major trends. And all that is in the mix of making a business decision, and data is not the only thing that impacts it. And so, as a data and analytics field, we've got, I think, a while to go in getting people who are not in the field to start trusting and making data driven decisions. And I think that's a cultural change, and I think people are getting more and more on board, and there's more and more analytics as part of life. And, you know, when I started doing data and analytics full time in 2005 or 6, I had to explain to people what it was. I had to tell them it's charts and graphs. And they're like, oh, okay. You do charts and graphs.
Now you can't go to the airport and not see it. So there's a cultural change that's coming in analytics, away from the seat-of-the-pants sort of thing. But really, those advanced things that you're talking about are advanced in data and analytics, and I just wish they weren't, because you really should just do automated testing. Just do that. And right now, you don't need to buy any software. Just make some automated tests. If you're a data engineer, make automated tests and put them in whatever harness you have, and then your life's gonna be better. That's the 1 thing you should remember from this podcast, and do them now.
[00:40:58] Unknown:
Yeah. To your point of the charts and graphs that, you know, working in data engineering as well as with infrastructure management or, sort of, you know, cloud automation. You know, there's the tip of the iceberg that people see of, you know, the web page loads or the graph has some, you know, pretty lines. And then there's the, you know, 90% of the iceberg that's under the water. That's all the work that needs to be done to make sure that all of those things are able to happen.
[00:41:21] Unknown:
Yeah. Yeah. And from someone's perspective, you know, say I'm a big businessman or woman, I'm looking at my charts and graphs trying to, you know, decide whether to spend some money or save some money or invest or do whatever. You have no concept of all the steps that go underneath it, and the weird discontinuities that happen in technical systems. Right? Like, changing something could take a minute, and changing another thing could take 2 months. And from your perspective, it's all, I don't know, black magic underneath, and people and things are moving around. And so I used to sort of bridle against those people demanding that we make changes.
And so I used to think, oh, they just don't understand technology. They don't understand these discontinuities. But I realized that that's not unreasonable. It's not unreasonable for them to expect fast changes. It's not unreasonable for them to expect really high quality. And it's also not unreasonable for the people who are doing the work to expect to be able to try some stuff out and innovate and try all those great open source libraries or new tools out there. That's not unreasonable. So everyone in this idea of I want something fast, I wanna innovate something, I don't wanna have any errors or problems, they're all perfectly reasonable. And so, you know, I went through my own process of sort of pushing back on all of those in some ways, quality, features, time, innovation, and saying you can't have it all. And I spent 8 or 9 years building a system that tried to have it all by saying if you do it all in metadata, it'll all work. And, you know, I got, like, 80, 90% there, and there was always that last bit on the end that didn't work. And so I'm not a believer in those types of systems anymore. But if you apply these ideas that have come from manufacturing and from software development, and you apply the principles that we wrote about in the DataOps Manifesto, I think you can actually do those things. It takes a while. It's a cultural change as well as some tooling change, but I think you can go fast and not break things and innovate.
[00:43:15] Unknown:
And as far as the DataKitchen platform that you're building, how does that simplify the process of operationalizing a data analytics workflow and implementing some of the practices of data ops?
[00:43:29] Unknown:
The way that we think about it is that there's a couple of aspects to what people need to do. 1 is that there's this idea of a pipeline, which is a graph. And in that pipeline are all the pieces that you do in analytics, from grabbing data from somewhere, transforming it, putting it in a database, doing models on it, visualizing it. So you need a graph that covers all that, and we do that. And there's other tools that actually do graphs like that. I mean, there's plenty of ETL tools, and there's plenty of open source tools like Airflow or Luigi. That's 1 part of the problem. Another part of the problem is that when you run that graph, you've got to be able to test each point. So you need a test framework sort of embedded in that part of it, and you need to get test history. That's 1 thing also that we do. Since we're DataKitchen, we call those recipes and tests, and we use a bunch of cheesy food metaphors. So you've done that: I've got a data pipeline that's running, I can monitor it, I can keep track of test results. Cool. And it integrates with all the other tools that you need to have, because you may have Informatica, you may have Tableau, and you need to work with and containerize that. We've got some open source stuff that allows you to abstract some of that away. So that's 1 part. The other part is you gotta do what you call CI and CD, which is deploying.
And there's tools like that. There's Jenkins, there's tools that Amazon has or Azure has, to be able to take that pipeline and all its code and put it through a process to deploy in production. And so that's similar to what those tools do. And then the other part is that you've got to help people with building these environments to do their work. And so the dev box, like in our own software development, our dev box setup is, you know, sort of bunches of Python and shell scripts that we use to set it up. And they're great, but they're for our software engineers. So what you need, if you're gonna expect people to go fast, is some way for a data engineer or a data scientist or even someone doing Tableau to kind of spin up some resources where they can make a playground and then be able to run tests and push a button. And so you probably could piece all that together from different tools, but then the surface of interaction for your users would be half a dozen different tools and scripting languages. And for software engineers, it's probably fine because they like all the tools. But we've sort of put that together in a package, with a set of best practices, so that if you use the tool, you can then start doing DataOps. And so for us, though, since we're a small company, we're bootstrapped, we're, you know, trying to pay our employees' salaries.
Selling software and helping people through that transformation is what we're trying to do. But as a guy who's been in the industry for a while, our purpose is to get people to do DataOps. And, yeah, you can use our software, but if you wanna start doing it, you don't need to buy software. You can start applying the principles now just in Python or SQL or however you're doing your work.
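As a toy illustration of the recipe idea described above, a graph of steps where every node carries both a run function and the tests that gate it, here is a minimal sketch. It is an assumption-laden stand-in for the concept, not DataKitchen's actual API, and the step names and checks are invented.

```python
# A toy sketch of a pipeline graph where each node has a run function and a
# list of embedded tests, with results collected as test history. This only
# illustrates the concept; it is not any vendor's real interface.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class Step:
    run: Callable[[dict], dict]                      # transforms a shared context
    tests: List[Callable[[dict], None]] = field(default_factory=list)
    upstream: List[str] = field(default_factory=list)

def execute(graph: Dict[str, Step]) -> List[Tuple[str, str, str]]:
    history, done, context = [], set(), {}
    while len(done) < len(graph):
        progressed = False
        for name, step in graph.items():
            if name in done or any(u not in done for u in step.upstream):
                continue
            context = step.run(context)
            for test in step.tests:                  # run every test attached to this step
                try:
                    test(context)
                    history.append((name, test.__name__, "pass"))
                except AssertionError as err:
                    history.append((name, test.__name__, f"fail: {err}"))
            done.add(name)
            progressed = True
        if not progressed:
            raise ValueError("cycle or missing upstream step in graph")
    return history

def ingest_not_empty(context: dict) -> None:
    assert context["rows"] > 0, "ingest produced no rows"

recipe = {
    "ingest": Step(run=lambda c: {**c, "rows": 100}, tests=[ingest_not_empty]),
    "transform": Step(run=lambda c: {**c, "total": c["rows"] * 2.0}, upstream=["ingest"]),
}
print(execute(recipe))
```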
[00:46:10] Unknown:
As the need for being able to rapidly iterate and deploy systems that are designed to capture and store and process and analyze data becomes more prevalent, how do you foresee that feeding back into the ways that the landscape of tooling and data storage platforms are designed and developed, particularly with an eye towards how they are managed from an operational context?
[00:46:37] Unknown:
Yeah. Some tools are good DataOps citizens and some aren't. So databases generally are. Right? Because you can pour data into them, you can run them, sometimes you can run them in a container, sometimes you can't. But you can pour data in, you can pour code in, and you can have them in the same kind of virtual environment and get the same results. Other tools are a little bit more challenging. Some of the ETL tools, you know, they really want you to treat them like blobs instead of being able to kind of branch and merge them. You know, Tableau, for instance, is not a very good DataOps citizen. We had to do a lot of work just to be able to, like, pull a Tableau workbook out, run some tests on it, and then push it back in.
And a lot of what they try to optimize is kind of the manual process, and they've got a nice UI. But the whole point of doing this is that you should be able to script a change. Everything should be done with scripts. And so that's where the industry is on 2 challenges. 1 is that bigger companies are trying to own the entire footprint of analytics. The big Hadoop vendors, for instance, do everything under the sun, and if you just do it in their platform, you're fine. And then even tools like Tableau are sort of expanding out into data prep, expanding into data science. The data science vendors are expanding out into data visualization. And I just reject that idea that there's sort of 1 boot, 1 tool, on your neck. It's a multi tool world. And those tools need to be scriptable and deployable. You need to be able to shove in the work that you've created in them and be able to test the results and then pull it back out.
Because fundamentally, if analytics is code, then all code should be in version control tools like Git. Whether it's a predictive model or a Tableau workbook or SQL or Python, all that stuff should be in Git, because it's a central place. And again, big companies, by and large, don't do that. They'll have some code in file systems. Sometimes it'll be in Git, or sometimes they'll put it in Informatica's store somewhere, in some database somewhere. And so I think the industry needs to realize there's some work that they need to do to stop this belief, and this has been going on for a long time, that you should do everything in my environment. That's just not the way the world works. There's so many existing systems, and there's so many tool chains that need to be coordinated. And so I find that a challenge, in that we try to integrate with a lot of tools, and some are good and some aren't.
And so that's 1 of the things I think the industry can do: get better at being good DataOps
[00:49:17] Unknown:
citizens. As somebody who spends a lot of time working with configuration management, automating deployment of servers, and deploying various tooling or platforms on top of them, there are all too many occasions where I'll be trying to get something up and running, and it's not as simple as just dropping in a text file with some configuration data. I then have to figure out a way to write a script to insert the right records into a database to ensure that everything is configured and operating properly. Whenever I run into that kind of situation, it's just painful, and I wish it were simply a matter of dropping in a text file and having it ingested properly by the tool as it starts up, rather than having to go through all of those extraneous steps of either manual processes or writing convoluted scripts to get things running the way they're supposed to.
[00:50:02] Unknown:
Yeah. We have a Tableau container that we wrote as part of our analytic container spec. And all I wanted to do was take a Tableau workbook, have it connect to a data source, make a change to it, deploy it back, and get some test results out. Tableau has a REST API, and it also has a command line API, but neither of those was able to do it. So I had to actually use Selenium and talk to their UI, because the UI was the only way I could get the actual data out; the REST API could only get an image out. Fundamentally, these tools think of themselves as black boxes. If work is code, I want to take my Tableau workbook and put it in source control. Then I want to make a change, and I want to be able to make sure I haven't broken anything. So I've got to stick it in a system that's representative of production, get some test data out, and check to see if it's right. How else are you going to have your Saturday mornings free unless you're able to do that in an automated way? So, yeah, we had to jump through some hoops to do it, and Tableau is a big company. I think Looker's a bit better in the BI space.
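For reference, the pull-out and push-back half of that loop can be scripted against Tableau's REST API via the tableauserverclient package; a rough sketch follows. The server URL, credentials, site, and workbook name are placeholders, and, as noted above, the REST API won't return the rendered data, so checking the actual results still requires another mechanism (in DataKitchen's case, driving the UI).

```python
# Rough sketch: pull a Tableau workbook out and push a changed copy back
# using the tableauserverclient package. All names and credentials are
# placeholders; result verification is not covered by the REST API.
import tableauserverclient as TSC

auth = TSC.TableauAuth("ci_user", "ci_password", site_id="analytics")  # placeholders
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(auth):
    # Find the workbook we want to treat as code.
    workbooks, _ = server.workbooks.get()
    workbook = next(wb for wb in workbooks if wb.name == "daily_revenue")  # hypothetical name

    # Pull it out so it can be committed to Git and modified by a script.
    local_path = server.workbooks.download(workbook.id, filepath="/tmp")
    print(f"downloaded workbook to {local_path}")

    # ... make a scripted change to the downloaded workbook file here ...

    # Push the changed workbook back into the same project.
    item = TSC.WorkbookItem(project_id=workbook.project_id)
    server.workbooks.publish(item, local_path, mode=TSC.Server.PublishMode.Overwrite)
```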
The data engineering and ETL tools are good citizens to differing degrees. But figuring out how to be a good citizen in a DevOps, DataOps world is, I think, important for vendors, along with getting over the idea that, because customers have your 1 product, you're going to replace every other product in the $100,000,000,000 analytics industry. That idea makes me nauseous as a guy who is trying to sell our product and stands around at trade shows once a month. It's just not the way the world is. We live in a diverse tool world, and tool chains and tools matter, and people love their tools. The idea of a big boot on people's neck, where everyone's going to use 1 vendor's tool, is just silly.
[00:51:47] Unknown:
And are there any other aspects of this conversation that you think we should discuss before we start to close out the show? You know, I guess, from my perspective and from a career standpoint, I really
[00:52:00] Unknown:
like data and analytics for a lot of reasons. 1 is that there are a lot more women going into data analytics. If you go to the Open Data Science Conference in Boston, for example, there are a lot more women. It's a very diverse field with a lot of different people and different skills. And I think, in some ways, data analytics is kind of lifting and separating from IT or software development to become its own thing with its own culture, and I think that's really good. I really like the data and analytics field and all its different subsets, whether you call yourself a data engineer or a data scientist or a data artist who does Tableau. It's a really big, growing, diverse field.
I do think it has challenges with applying agile principles, and doing DataOps is 1 of the biggest problem areas I see. But there's also a whole lot of opportunity for people to grow in this career, and I'm heartened to see billboards. I was driving through Hartford 6 months ago, and I saw a billboard for a master's in data science; those things didn't even exist 2 years ago. And we've been hiring, and we are hiring. If you're a data engineer looking for a job, go to the DataKitchen website; we're hiring a bunch of data engineers. But there are also a lot of graduate programs where people are actually learning to be data engineers, and I think it's a great career. The transformation of data engineering, from the lunch pail guy who does the data work in the backroom to a skilled professional who delivers high value to the organization, is ongoing. But I'm a big believer in data engineering, and I believe data engineers should be paid well, and that they're as much the cool kids on the block now as data scientists, or even more so. It's a great career for anyone to get into, and there are just a lot of jobs out there.
[00:53:52] Unknown:
So if you're looking to make some money and have an impact, it's a good career to get into. For anybody who wants to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And with that, I'd like to thank you for taking the time out of your day to join me and talk about the work that you're doing with DataKitchen and your views on DataOps as it relates to data engineering and analytics workloads. So thank you for that, and I hope you enjoy the rest of your day. Okay. Yeah, thanks for the opportunity.
Chapters
Introduction to Christopher Bergh and DataKitchen
Christopher Bergh's Journey in Data Management
The Evolution of Data Roles
Principles of DataOps
Challenges in Data Engineering
Importance of Automated Testing in DataOps
Building Trust in Data Analytics
DataKitchen's Approach to DataOps
Future of DataOps and Tooling
Closing Thoughts and Career Opportunities in Data Engineering