Defining DataOps with Chris Bergh

Hello, and welcome to the Data Engineering podcast, the show about modern data management.

When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to data engineering podcast.com/linode

to get a $20 credit and launch a new server in under a minute. And for complete visibility into the health of your pipeline, including deployment tracking and powerful alerting driven by machine learning, Datadog has got you covered.

With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you'll have everything you need to find and fix performance bottlenecks in no time. Go to data engineering podcast.com/datadog

today to start your free 14 day trial and get a sweet new t shirt. And go to data engineering podcast.com

to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macy. And today, I'm interviewing Chris Berg about Data Kitchen and the rise of data ops. So, Chris, could you start by introducing yourself?

Hi. I'm Chris Berg. I'm head chef of a Cambridge, Massachusetts company called Data Kitchen. And, I guess our purpose for being is to help companies

do data

ops. And how did you first get involved in the area of data management?

Well,

yeah, I've had a long career. So I started working class kid from Wisconsin, went to the Peace Corps, taught math for 2 years, then went to Columbia, and then did a bunch of years at a project to automate air traffic control, with MIT and NASA. And so, you know, I've both been a very technical guy, wrote a lot of code,

and did AI back when AI wasn't wasn't cool. And then for a bunch of years, sort of got into both consumer, Internet, and then enterprise software companies and and left the technical domain and started to be a manager.

And about a dozen years ago, I joined an analytics company that did kind of it was a bootstrap company that did everything that you could think of for kind of health care analytics.

And so we did data integration and data management on a daily, weekly, monthly basis. We did visualization and charts and graphs. We had data scientists before they were called data scientists.

And we even made the mistake of building our entire

stack of analytics all metadata driven

with software. I was the chief operating officer. So my life was pretty much structured around kind of 3 things. 1 is the guy who founded the company was

a very bright, not too technical Harvard Medical School doctor who really knew the domain. And he would go off and talk to customers and have a bright idea and then come back to me as COO and say, well, here, take this bright idea and and run with it. And so I get people together in the room. I'd have some data scientists and data engineers. We called them ETL engineers back then,

and maybe some software engineers, and we'd hash it all out and say, okay, this this cool idea is gonna take us 2 weeks to do. And I'd walk back into my boss's office and say, it's gonna take us 2 weeks, and be very proud. And he would sort of look down his glasses at me and say, wow, Chris. That takes 2 weeks. I thought that should take 2 hours. And I feel, you know, feel very embarrassed and sort of walk out of my office and then,

you know, kinda get a call on my phone from 1 of our customers, and, they would say, Chris,

it's a problem in the data. If you don't fix it, you're out. And I'd have some more right here sprout then, and then I'd walk a little farther. And we had a lot of bright people that we hired, and they say, oh, there's this cool open source tool. Can can I try it out on our next project? And so my life is kind of making the trains run on time in analytics as COO of a moderately sized kind of technology and consulting company was,

you know, how do you deliver analytics

in the broadest definition, whether it's it's data or a model or a visualization? How do you deliver all those things to a business customer fast?

And how do you deliver high quality so you don't have errors? And then how do you let your team innovate?

And so that was my life for a bunch of years. And the company grew, and then we sold it to a West Coast company called Model N. And I was knocking around to figure out what my next thing was would do. And we talked to a couple hundred people and did customer interviews, my cofounders and I, who also worked at the same company, and we realized that a lot of people had the same problem we had, which was, you know, they are a, you know, there's a new role called a chief data officer or a chief analytic officer and they sometimes have data engineers

and data scientists and people who do data vis underneath them or working with them. And they're kinda getting

they have this sort of duality to them. 1 is they're getting beat up by their their business customers the same way I got beat up by my boss because they're just not going fast enough or the

things break. And the other is this sort of great desire for innovation. They go to I was at the Gartner conference for a few weeks, and there was a, you know, a mess

of of of sort

of chief data officer types there. And and they really do wanna prove their value to the organization because they really do believe that analytics and and data has power to actually help organizations make fact based and and reasonable decisions. And so they're caught with this

desire to innovate,

and this this challenge of just, you know, going too slow and making the the trains run out of time. And so the way that we touch the world is we think the, you know, the idea to enter the

idea to innovate really comes from, iteration. And so, that's where

the idea of data ops comes from, and and that's what our our company, you know, our company is focused on.

That's a lot sort of a long answer.

Did that answer your question?

Yeah. No. That was great. So it's always interesting to get a sort of detailed history of people who have been in the space for a while and

as all of the old things become new again simply because they've been rebranded. So, you know, there was the business intelligence analyst, and then there was the ETL developer and now they're data engineers and data ops as sort of an umbrella term to try and unify all these different roles that are getting new terminology

simply because they're doing the same thing but with new tools. So Yeah. Yeah. And it's it's the tech industry. Right? So all these terms, data ops, data engineering, I think of them as as a gas. They sort of expand to fill the the available space. Like, I remember when big data just meant, you know, clusters of parallel machines with dumb storage. And now big data is is everything having to do with analytics. And so,

you know, I think that's just an aspect of the tech industry we're not particularly fond of. But, you know, we have a we have an idea of what we think DataOps is and and and why it's why it's relevant to people. And, you know, we've authored a manifesto. We've we've talked about it at conferences. And we really do believe that it's a better way for people who are doing data analytics to work. And it's a much more satisfying,

much more happier way to live your life instead of sort of being browbeaten by business or frustrated that things are breaking or killing yourself with, you know, sort of heroic deeds and then waking up, you know, Saturday morning when you should be at soccer with your kids with a panic that something is wrong in the database and then having to skulk off the soccer field because

because you you just got an email saying it's broke and having to go in your car and fix something. And so it's,

and I don't know if you feel like I can identify with that, but, like, it's it's I'm I'm sort of done with, you know, making changes and having things break.

It's just not a fun way to live.

Yeah. I think anybody who has ever served any time on call can relate to something at least similar to that of getting called out in the middle of some event where you then have to go and sit and hammer away on your keyboard to try and fix something that shouldn't have broken in the first place or something that was just missed due to oversight or, missing process. So

Yeah. Yeah. And and

yeah. And I you can identify. Right? And so,

you know, I think it's that look that my wife gives me.

It's just like,

as you've been married for a while, it's like, oh, you're an idiot. And so, and

it's harder too because having been both the person on the keyboard who's done the work and the person who's made the mistakes

and the person who's managed those people.

I've been fortunate to see both sides of the equation. And, you know, I'm a big believer

in Deming, this guy who, you know, sort of industrial process control.

And a lot of the problems I think emanate not from

an individual. I mean, almost all the problems don't have to do with an individual screwing up,

and sort of yelling at someone or blaming something

for a problem is not the right attitude. It's really about the system in which people work. And if that system is right, people can do great work. And, yeah, they're gonna make errors. And the only error, I think, is is if they make errors, that's fine. You just make sure they don't make the same 1 over and over again. And so loving your errors, iterating and improving,

iterating and innovating, I think, are are what data analytic teams, data engineers, data scientists really, really need to focus on in order to be successful.

Yeah. And that borrows 1 of the ideas from the DevOps movement of being able to learn from your failure where you

use those as a as input to your feedback loops for improving your overall process and capability

and

making sure that you have automated checks in place to guard against those failures and prevent regressions from putting you in the same situation where you have that failure again for preventable for preventable reasons.

Yeah. Yeah. Exactly. And and I think, you know, the the the DevOps kinda goes back to the agile man manifesto, goes back to total quality management, goes back to this crazy guy Deming. And that sort of idea of loving your errors and and trying improving them and

automating

the testing and automating the deployment,

is is really, it's really gone through the software industry. And, you know, back when I was a development manager

in the early 2000 and late nineties, I thought I was a pretty

cool guy if I could get my team to ship software every 3 months.

And, now if I would try to get a job as a development manager saying I could ship software

every 3 months, I wouldn't get a job. Right? Because the expectation is that you should be able to ship software every 3 minutes or 3 seconds.

And

why you know, the question is why does that expectation exist in software now? And why doesn't that expectation exist in,

at least

broadly in in the realm of people who do data and analytics? And and that's you know what? Could be because I I really do think it should.

And that brings us to the question of

the way that the concepts and principles of data ops are manifested

in the way that roles are defined for people on an analytics team and in particular,

how a data engineer

can, implement and benefit from some of the principles

of how you define data ops. Yeah. And I I so, you know, having

been a data engineer and hired data engineers, I, you know, I think of data engineering as sort of a a software developerization

of what people

called DDAs or ETL engineers.

And what do I mean by that? So I think,

a lot of companies

have hired people who know how to use tools like Informatica and Talent,

and they're very comfortable that this tool is it and they exist in this sort of

very slow but, successful waterfall process to to deliver things. And I think the idea of data engineering is that it's it is a way to think of their work more from code than visual,

designs, more from a agile way of doing things than a,

you know, a waterfall way of doing things. And I think it's actually really good for, you know, people out there to think of themselves as data engineers because I think that is actually the way the field's going. And companies that we talk to don't want to have to put a ticket in a system and then have a 6 month project to update their whether it's a data warehouse or a data lake or whatever they call their data infrastructure. It's just it's not operating at the speed at which their their businesses or their customers need. And so I think, you know, whether it's a data engineer doing the process of data ops or a data scientist or whether there's actually gonna be a role called a data ops engineer,

I think data engineering is is really the

a really important field and not just something that you should cost minimize and send to India and work from tickets. I actually think the data engineering, given that it's actually the majority of

the work in the analytics cycle, is actually hugely valuable. And so I'm happy to see it.

If my impression of the term data engineering is really an upskilling or up upstatusing of the field, and I think it's just long long overdue.

And 1 of the

trends that has led

to the introduction of the concept of data engineering

is the idea of having all kinds of easily deployable resources through virtualization

and cloud resources

as well as managed services such as data warehouses or data pipelines.

And a similar transformation

has happened in terms of systems administration and and infrastructure

operations with the cloud and those similar

capabilities and things like configuration management.

So I'm wondering how the

work flow for somebody who is practicing some of the principles of data engineering and data ops would differ from more traditional

methods for building data platforms for being able to process analytic workflows from ingestion through

to analysis and delivery.

Yeah. And I I hope I never have to go back to, like, buying servers and binding your database

code to your database and then having to clone it to back it up. And I think those you know, it it it's sort of you know, the job I talked about before to do the analytics, we had hundreds of servers that we had to buy and provision. And for us having to it was very difficult to to sort of lift your data and your processing from 1 server to the other. And we had to sort of jump through hoops to figure out how to do that. And so I think the idea of cloud resources, the ephemeralization of resources, the idea of coding to create or shut down resources,

containers, all those things are really,

they give you the

the ability to deploy quickly. They give you the ability

to sort of reproduce what you've done, and and they take, you know, that DevOps term cattle, not pets. They they take in some ways, they they take that from being sort of a black art.

And, oh, we've gotta have that single point of failure because John over here in the corner knows where all the servers are and all the passwords are, and then you can't touch them, to being something,

easily deployable. So I think the sort of cloud virtualization makes a makes a big deal in how you do it. And and the other way I think it makes a big deal is is sort of like cutting and pasting of of a database. And and and I think 1 of the secrets of going fast in analytics is being able to have

versions

in development that are based on what is in production.

And so sometimes you wanna do that, have all of your you know, if your database is reasonable sized, tens of terabytes, 100 of terabytes,

sort of cloning it in in a Redshift or an EMR, and AWS is not an unreasonably slow thing to do.

If you've got p to scale database, then sort of sampling a subset.

But really, the idea is can I have a place where I do my work? That's where I've cut and pasted all the data, cut and pasted all the

exact versions and environments that I have, whether it's versions of database or versions of Python libraries or that version of, you know, R that I'm using. And then all the code, I'm acting on it. And so I've got my place to do the work. Or like with some of our customers, they have variations of what's in production. They'll have production that's got 2, 000 people on and another copy of production that's slightly different

that's got 3 people on. And so cutting and pasting of data and infrastructure really is enabled by virtualization. And that was just super hard before because you'd have have to go and talk to your database salesman who always seemed to be an ex football player and wanted to take license fees from you, and they were just I never really enjoyed talking with those types of people. But now so it's much nicer to, like, spin stuff up, try it out, shut it down, and follow the the the sort of principles that were laid out in in DevOps applied to to data. Yeah. And your comment about the sort of database vendor Oh, yeah. I just never liked the Oracle salesman. Man, they just they were always I don't know. They always, like, went to the gym before it, and it's like, oh, you've gotta get 4 more CPU licenses to try something out. And, like, why is that? I just wanna try something out to see if it works. Yeah. The explosion of

open source tools for being able to store and process and analyze data has also led into the capability of having this degree of automation because you don't have to make sure that you're constraining your environments by the number of licenses that you have or ensuring that, you know, the license file is in place on every system so that you can actually do anything with it. So I think that also plays into the ability to have those multiple environments for being able to test your different versions of code and data before you actually release it into a production flow and ensure that you have those feedback cycles before you introduce an error into a a situation where you are going to get paged on the weekend in the middle of an event. Yeah. Yeah. And I think, fundamentally, people who do data engineering think of their work as code. And

I have I think that's the right way to think of it. I I don't and I think someone who does data science, their work is code.

And so whether it has a model or a model that has a bunch of random input perimeters, it's still code. And even if you look at a Tableau workbook, you open it up, it's code.

Maybe it's XML code. But you can even there's a bunch of nested if then else statements. Now you can embed you can embed Python into Tableau.

Fundamentally,

the thought inversion is that all a lot of the tools out there are IDEs from a software development standpoint. Tableau is an IDE. RStudio is an IDE. Informatica is an IDE, integrated development environment, that produces code. And so if you think of your job in data analytics as a code

producer, and that code is has a lot of weight from someone who's managed

software teams. And if you grew up in software, you think I do code. And then there's a whole bunch of thoughts that come after that. It's like, you know, code is complexity. Code is communication. Code can create this big hairball that's gonna cause me problems later on. And so the culture of of how to treat and how to deal with the combinatorial complexity of a codebase environment is sort of, I think a lot of data and analytic teams are a bit a bit of denial of that. And I think the vendors are sort of feeding that into it. The vendors are like, hey. You do it in my system. Everything's great. And all the they all sort of say the same thing, like, oh, you'll be fast. It'll be great. But the reality is it's it's code and complexity.

And so how do you deal with that code and complexity? How do you change it? How do you modify it? And so software's got a lot of really good ideas from the DevOps movement. And, also, there is more to it

than software does because data pipelines are these sort of living things, and sort of deploying them, monitoring them, testing them,

is different. And so also the front end sort of creating environments for people to work in is is

is different than the it's similar but different than the DevOps app that most software engineers have. Yeah.

And 1 of the differentiations

between

a data oriented

platform

for being able to perform these analytics

on these different workloads

and the way that you would approach a traditional software application where it's largely going to be stateless except for whatever

customer input is put into the database via the transactional system is that

in these more analytic oriented workloads, there's generally a larger volume of data which provides a certain amount of mass and gravity that makes it difficult to make it traverse these different environments because of how

slow and expensive it can be to

make multiple copies of the same data if you're trying to replicate what you have in production, which in a lot of cases is necessary

what are some of the techniques that are available for being able to perform that management

of taking

a representative sampling of production data and bring it down to prior environments so that you can have an appropriate representation

when you're doing that testing and iteration.

Yeah. And there's there's different ways to handle it. So 1 is, you know, if your date if your data is

in the tens or hundreds of terabytes, just back it up if if and back it up to a a bucket store like s 3 or a file system and then restore it. And usually the networks and and having a copy a full copy of production.

If your domain allows the ability to do that with security and and and, you know, sort of personal information. That that that's 1 way to do it. And then the other way is is to take it and do a sampling or a a filtering of that dataset if it's too large or there's security concerns.

And

that could be another sort of process that someone does, create a create a sampling ETL to

to give you your test data. And then a third technique is there's companies like, Delphix or others that do data virtual

ization. And then if they're they allow you to sort of kind of point your data point your database at different datasets.

And so, you know, in general, I think a lot of people, if you've got if you've got tens of terabytes of data, and just just copy and paste it because the network can handle it. But, yeah, I I think for for us, it's also

the it's not a data problem, but it's you gotta have the data. You've gotta have a representative

hardware software environment that's production, which means the same

tools, the same versions of libraries. You've also got to have a copy or a branch of the code

that you're working on, whether that code is SQL code or Informatica code or whatever. You've got to have that in your environment.

And then you've got to have sort of a dial that says says I'm going to run part of the process, all the process. And in order to do that, you've got to also be able to say, I need to test each step of the process.

So if I'm going to make a change, I take a copy of what's in production,

the data, the code, the scripted environments.

I make my own place for it.

And then I run, in essence, a set of regression tests to make sure I haven't I've got a representative sample. I make a change in something. I add another test, and then I run the whole thing again to make sure I haven't broken anything. And when you have your data environment,

excuse me, having thousands of tests or think of 20% of your code is test code, whether that's

testing as monitoring production, testing as regression testing,

functional testing,

system testing, all the same concepts that exist in the software world, then you have a really high confidence that

you're not gonna get that call on Saturday morning because the tasks will tell you if you've broken anything.

And that that will also tell you that you're not the hero anymore. And

that's also a cultural problem is that there's a lot of,

I think, heroism

in in in tech where if you're if you are really good at something, you can produce a lot of code and a lot of technical debt. And then you get to be the the person who's called in to fix it. And while that can be really

emotionally satisfying for a while, it ends up not helping the team as a whole. And then you end up being pulled in in a if you're the hero and I I've done that role, you end up being pulled in a lots of directions and then you end up quitting. And so it's not as, so,

there's a lot of reasons to to build have a development environment that's representative of production, that has your code, that has and to do all the work to make these tests,

in the broadest term that allow you to make changes to production

fast enough. And I think testing and data and analytics is an underappreciated field, and it

being employed in these analysis pipelines is fairly understood from the principles of software engineering.

But testing

the data, particularly

in the volumes and shapes that are more prevalent in analytic workloads,

is something that is underappreciated and under under and, often misunderstood.

And it also doesn't necessarily have the same quantity and quality of tooling available for performing those tests. So I'm wondering what

are some of the most effective methods

and practices

for being able to,

create and run those tests in a repeatable and manageable

and fast enough

method.

Yeah. Yeah. And well, we could we could talk an hour about this. Right? Because we had to build a framework in ours. And so in our in our software to do that. And so,

yeah, because if you think conceptually, you're testing 2 things. Right? You're testing

the data.

And as the data flows, whether it's streamed in or it's a it's a kill and fill refresh or it's another

set of transactions being poured into your analytic

data warehouse, data lake, whatever you call it. That that set of processing

is you need to monitor that and make sure that the data or the artifacts are all being produced. So for the purpose of

having the business people or customers that you're delivering to not find

problems when you've delivered it. And so that is think of that I think of that as testing, but in reality, it's sort of monitoring,

in some way. And, these sort of pipelines that you do, and our our our view of the world is that pipelines aren't just data pipelines. They're data pipelines that call that also have predictive models, that also have visualizations in them. And all those things need to be tested

because someone could have changed them or some data could have come in from someone there that's crap and broken something. So that's 1 aspect of testing, right, monitoring what's in production live. The second conceptual aspect is that if the work that data engineers and data scientists do are really code, well, then all those ideas that exist in software development of regression testing or testing need to take

place. And so how you do those things exactly and the logic of doing those things and keeping track of the test results is important. Because

a lot of ways, the monitoring of what goes on, it's think of it as a Toyota factory, your your production process where data comes in on the left side, you know, what other it's Python, SQL, pick your pick your language, operates on it, and then there's some r and some predictive modeling, and then there's some visualization.

Toyota

factory. And each step along the way that it's being transformed and artifacts are being added to the assembly line. And the assembly line is not aligned. It's really a directed graph, but that's that's not important. It's that conceptual idea that you you have something,

a living process that is manufacturing

innovation and value for your customers.

And just like Toyota

and Deming said, well, let's monitor that. Let's get statistics on it. Let's not yell at people who made errors on it. Let's give people the ability to pull a, you know, pull a chain and stop the assembly line. I think the same ideas of lean manufacturing

statistical process control apply

to,

data and analytics.

And so you need to kind of think about what you're doing as a factory, but you also need to think about what you're doing as

an innovation

pipeline, deploying pipeline.

And so those 2 things are actually really hard to do together. Sort of as if you've got a Toyota factory where you can have anyone in the factory take a machine out, change it, and put it back in in a few seconds. And so that is really the state that you want to have in analytics and that's a hard state to do, right? It's hard to it's hard to have the in the lots of changes, lots of data

going by, and lots

of innovation going by. And so the quality tests are actually can break up in a lot of different ways. So there is the basic of looking at did the data I get

is what I expected, you know, the row count changes,

the, looking at sort of, you know, whether it has the right number of columns, whether the, you know, the data fields are right. That's that those are tests that you you really should have because you wanna have those tests

as close to when the data arrives as possible,

just like Toyota. And so that way you can, kind of like Toyota does, go back and yell at the manufacturers of the, you know, of the of

the radiator and say, look, you're giving me some defective parts. Go fix it. Because oftentimes you get defective data and sometimes the data engineering team has the has the choice to say, oh, I can just fix this data and then, I can keep my production process going. And so that can be a good thing to do sometimes, and it can be a bad thing to do sometimes. But you've got to think about how to how to manage that. And so

tasks that can tell you that the data's bad before your customer sees it are really important. And so there's a whole lot of tests between, you know, statistical process control tasks, between the logical tasks that you get from your business customer, between what we call historical balancing. We look at what happened previously versus now.

And there's a whole art to actually writing writing good tests. But that, you know, that idea

I find

shocking in a lot of data analytics people that testing should be 20% of your work automated test, and that testing shouldn't sit sit on a shelf somewhere. It should be automated deployed as part of your process. That's really, that that's really the idea that we're trying to get across in data ops. Yeah. And I think that that's 1 of the areas where it differs too from just software tests is that with software, you're testing

in the process of building the deployable artifact, but then once it's in production, you're not generally continuing to execute those tests because the system is essentially,

you know, in a continuous state of test just by virtue of it running whereas with data, particularly with ingestion pipelines, you need to have those continuous tests to ensure that the data is matching the appropriate schemas and the appropriate volumes and shapes to make sure that you don't introduce those quality errors further down the line.

Yeah. Yeah. And I just you know, there's a lot of times you you're building you have data warehouse builds that could take hours. Right? And and a data gets in at the beginning, and then you don't notice it

until hours or day laters or, you know, God forbid, in front of boss or the boss's boss's boss.

And so,

how do you set up a system where people aren't being defensive,

where they slow things down because they're afraid to get yelled at, but because their boss's boss's boss will see it and everyone's locked in this

circle

of,

I don't

know, fear.

And I think what breaks the fear is that you can have provable confidence that what you're going to give to someone, your boss's boss's boss, will work. And the only way to have provable confidence is to build automated tasks that you can trust

and run those during

development and run those while they're in production and parameterize them accordingly.

And also addressing the concept of fear of

iterating and fear of continued development

is

that the more often you make those changes and release them into production

and particularly if you do them with shorter iteration cycles, then you will decrease the amount of fear because the change sets that are getting pushed out are smaller. So it's easier to find and fix errors as they go into production than to batch them up into large volumes and then release it all as 1 giant ball of mud, and then you have to try and detangle that once it's in production and already causing errors, and you don't know exactly which change is contributing to what error.

Yeah. Yeah. Exactly right. Yeah. You know, smaller frequent releases are better because they're easier to bug, and they're also easier because

then your customers are getting more value quicker, and you can learn from them. Because the whole point of agile is is feedback, is to counter the fact that we're all introverts who do this work and and get these crazy business people and customers to give us feedback to make sure we're we're going we're making the right choice on what the way we implement.

And

it's

so I think the there is this emotional context that I'm trying to get across to people is that really it is

the if you're living in fear of making changes and you're waiting for the big deploys and you're kind of hoping things are right or you've got these, like, change review boards or where it's the 3 people who know the whole system are sitting around the table and they're reviewing your code or making sure it doesn't break all these other systems. Those things are just they they are,

they're just a pain and not fun. And, like, if you can get rid of your change review board and really empower

the youngest and most least knowledgeable member of your team

and have, him or her press a button and deploy and really have confidence that they can that that won't break, then then you're on to

you've achieved a great velocity.

And so

that

that spirit, I think, needs to and a lot of companies isn't there because of fear. And also I think the intellectual framework of DataOps and the idea of DataOps really is not present

in people

in the data and analytics world. And so I'm I'm, you know, what I'm trying to do with the company is sort of colonize the the mind space of people on data and analytic teams to say, look. DataOps is important. Look. It's this has really changed the way people work in software development. It's changed the way people work in in factories and manufacturing. I mean, I grew up in the eighties in Wisconsin and saw the Japanese imports

lose 50, 000 jobs in Milwaukee because of the fact that they can make cars better and cheaper by following Deming. And so

these are a set of ideas that are,

I think, the, whether you call them agile or lean or TQM, really need to apply to,

to what we do in data analytics.

And on the idea of fast feedback loops

encouraged by fast releases

and being able to use that to determine whether or not

the results that you're providing are giving the value that is desired from the business.

What are some of the metrics that you track

to define

how that value is created and how the various steps in the overall workflow are proceeding toward that goal of producing that value?

Yeah. I hear you. We do we have a thing called tornado report, which allows you to track by data source,

where the errors were. And so we also include the the data engineers or the data scientists as part of that. And so, in order to understand where the errors occur in production, you need to sort of find out where they come from and look at the hotspots. That's that's 1 area, sort of error reporting. The other part is sort of, SLAs. Are you delivering on time? We have, metrics around

I think it's sort of an equivalent. There's not really a code coverage

metric that exists in analytics, but we have a sort of, you know, ratios of tests

to

steps in the process. So you can tell sort of is this

work being done well well

tested. So it's about errors. It's about it's about deployment. It's about kind of

ratios of code to tests

that we that we enable with our software. But it's also something that we're continuing to evaluate and improve upon ourselves because I think it's,

it's

and I think the more modern DevOps tools actually are better at that

because they've, you know,

they've they've been doing it a while. So there's some inspiration and ideas that we're gonna take from those.

And also to be able to keep those iteration cycles fast, 1 of the requirements

in terms of managing

continuous integration

from a testing perspective is for the tests to be able to execute quickly so that you're not waiting for an hours long build to complete before you can determine whether the change that you made is having the desired effect. So how do you balance the need for larger quantities of data to be used for

verifying the appropriate scalability and performance of a given set of changes against optimizing for the cost and speed

of,

working in nonproduction environments?

Yeah. You know, we don't have sort of a magic box to do that. Right? And and, so what we think is is that what people should do if you think of all the work in analytics as a direct graph? And people should be able to run parts of that graph in their development environment. They should have

a basket of parameters where they can say, okay, I want to run this test based on a small input data or a big input data so that my test can run fast and quickly. And also, they should try to have their work

be done where it's, I can I'm working on 1 piece and it's my 1 piece is separate from other other pieces. And I can run it over and over and over again and make sure I do

the the sort of design, see the results, judge the tests, and iterate and improve. Because

another thing that's happening in in data engineering and data sciences is becoming much more of a team sport instead of the individual hero who's doing, you know, his or her coding from data to

value to, you know, the sexiest job of the 21st century where I do everything in analytics.

It's a team of people who are doing things and that the teams have to decompose their processes into steps.

And so I think,

you know, like in software development, you should be able to choose which tests that you do and

have some tests be functional tests, some tests be kind of the equivalent of unit tests, some be system tests, and be able to turn those on and off based on what you on what you do.

Yeah. And there's a great book called Continuous Delivery by Jess Humble and Dave Farley, where 1 of the things that they talk about is the idea of having different tiers of tests where you have your commit tests that are just very quick sanity checks

to ensure that everything is meeting the required specifications as far as code quality, lifting, simple things like that. So that if that fails, then it fails quickly and you know that you need to go back and fix something and then you have the next level of tests that are just very quick unit tests to be able to complete in a matter of minutes so that you

again, know whether or not everything's going to work properly,

before it gets to the more long running integration level tests to ensure that everything is working properly together. So being able to have those quick circuit breaks of, you know, I I know that this quality isn't going to pass, so I need to go back and refactor or,

you know, everything made it through to integration testing. So I know that this whole set of

error cases

is already taken care of and addressed. I don't need to worry about that. So then you can then continue your development cycle having that level of confidence and wait for potentially a longer build to complete to see if there are any more deeply hidden or insidious

errors that you need to address further down the line.

Yeah. Yeah. And I I wish

that more people

in the data and analytics world had that thought, and they were far enough in thinking of automated testing to break them up into types and when you run them and how to optimally

be able to allocate the task based on what your what pattern is.

Most people I encounter, the idea of automated regression of your data analysis suite is foreign.

And so there are

most companies

will have a few tests that run-in the beginning because some data engineer or ETL engineer put them in, and then they have largely manual processes to validate or they're just winging it.

And, you know, there's there's somebody who's grabbing some data from somewhere. They're doing some transformation. They they look at it. It looks good, and then they push it out to the exec team. And so there's a lack of process discipline for most people who do data and analytics.

And I think that's also a challenge in the data ops movement because it's

kinda more fun to not have to think about your process and just do some stuff, throw it together, hey, it looks good. And

that, I think, is

is 1 of the reasons why there's a lot of rework. There's a lot of data errors. There's a lot of mistrust in analytics. A lot

of data and analytics projects are failing because of that, because people are just, you know, I mean, the biggest challenge in analytics in my my mind is getting business people to trust the data. And what that means is actually make data driven decisions as opposed to gut driven decisions. Because if you're running a major brand and spending 1, 000, 000 of dollars or you're a CEO, you're like, you've got a lot of reasons

why not to trust the data. You've got your gut, you've got your intuition, you've got what your friends say, what you heard here, some major trends. And all that is in the mix of making a business decision and data is not the only thing

that impacts it. And

so,

from a data and analytics field, we've got, I think, a while to go and getting

people who are not in the field to start trusting and making data driven decisions. And I think that's a cultural change, and I think people are getting

more and more on board, and there's more and more analytics as part of life. And,

you know, when I sort of started doing data and analytics in 2005

or 6 full times, I had explained to people what it was. And I said, I had to tell them it's charts and graphs. And they're like, oh, okay. You do charts and graphs.

So I think now you can't go to the airport and not see it. So there's a cultural change that's coming in analytics and and the see the pants sort of thing, but really the those advanced things that you're talking about are advanced in data and analytics, and and I just wish they weren't because that that to to really

just do automated testing. Just do that.

And right now, you don't need to buy any software. Just make some automated tests. If you're data engineer, make automated tests and put them in whatever harness you have, and then you're gonna your life's gonna be better. That's if that's the 1 thing you should remember from this this podcast, and do them now.

Yeah. To your point of the charts and graphs that, you know, working in data engineering as well as with infrastructure

management or, sort of, you know, cloud automation. You know, there's the tip of the iceberg that people see of, you know, the web page loads or the graph has some, you know, pretty lines. And then there's the, you know, 90% of the iceberg that's under the water. That's all the work that needs to be done to make sure that all of those things are able to happen.

Yeah. Yeah. And and from someone if you look at you know, I'm a big big businessman or woman. I'm looking at my charts and graphs trying to, you know, decide whether to spend some money or save some money or

invest or do whatever.

And so you have no concept of all the steps that go underneath it.

And so and the weird discontinuities that happen in technical systems, Right? Like changing something could take a minute, and changing another thing could take up 2 months.

And for your perspective, it's all like, don't know, it's all like a matte black magic beans underneath,

and people and things are moving around. And so but they,

I used to sort of bridle against those peoples demanding they make changes.

And so I used to think, oh, they just don't understand technology. They don't understand this distance continuity. But I realized that that's not unreasonable. It's not unreasonable for them to expect fast changes. It's not unreasonable

for them to expect really high quality. And also it's not unreasonable for the people who are in doing the work to expect to be able to try some stuff out and innovate and try all those great open source libraries or new tools out there. That's not unreasonable. So everyone in this idea of I want something fast, I wanna innovate something, I don't wanna have any errors or problems, they're all perfectly reasonable. And so instead of, you know, I've got I went through my own process of sort of pushing back and on on all those in some ways, quality,

features, time, innovation, and saying you can't have it all. And so,

I spent 8 or 9 years building a system that tried to have it all by port by all if you do it all in metadata,

it'll all work. And, you know, I got, like, 80, 90% there, and there's always like a bag on the end that didn't work. And so I'm not a believer in those types of systems anymore. But if you apply these ideas that have been done from manufacturing, done from the software development, you apply the principles that we we wrote about in the DataOps manifesto,

I think you can actually do those things. It takes a while. It's a cultural change as well as some tooling change, but I think you can go fast and not break things and innovate.

And as far as the data kitchen platform that you're building, how does that simplify the process

of operationalizing

a data analytics workflow and implementing some of the practices

of data ops?

The way that we we think about it is that there's a couple of aspects to what people need to do. 1 is that there's this idea of a pipeline, which is a graph. And in that pipeline are all the pieces that you do in analytics from grabbing data from somewhere, transforming it, putting in a database,

doing models on it, visualizing it. So you need a graph that covers all that, and we do that. And there's other tools that actually do graphs like that. I mean, there's plenty of ETL tools. There's plenty of open source tools like Airflow or Luigi. That's 1 part of problem. Another part of the problem is when you run that graph, you've got to be able to test each point.

And so you need a test framework sort of embedded in that part of it. You need to get test history. That's 1 thing also that we do. So that we call those since we're data kitchen, we call those recipes and tests,

and we use a bunch of cheesy food metaphors. So you've done that. I got a data pipeline that's running. I can monitor it. I can keep track test

results. Cool. And it integrates with all the other tools that you need to have because you may have Informatica. You may have Tableau. You need to work and containerize that. We've got some open source stuff that allows you to abstract some of that stuff away. So that's 1 part. The other part is you gotta do what you call CI and CD, which is deploying.

And there's tools like that. There's Jenkins. There's,

tools that Amazon has or Azure has to be able to take that pipeline and all its code and put it through a process to deploy in production. And so that's similar to what those tools do. And then the other part is that you've got to have help people

with building and building

these environments to do their work.

And so the Dev box like in our own software development, our dev box setup is

you know, it's sort of bunches of Python and shell scripts that we use to set up our our Dev Box. And they're, you know, they're great, but they're for our software engineers. And so what you need is if you're gonna expect people to go fast, you need some way for a data engineer or a data scientist or even someone doing Tableau to kind of spin up some resources where they can make a playground and then be able to run tests and push a button. And so all that stuff is that you could theoretically kind of you probably could piece it all together from different tools, but then the surface of interaction for your users would be half a dozen different tools and scripting languages. And for software engineers, it's probably fine because they like soft they like all the tools. But we've sort of put that together in a in a package,

and have a set of best practices that if you use the tool,

you can then start doing day ops. And so for us though, since we're a

small company, we're we're bootstrapped. We're, you know, we're trying to pay our employees' salaries.

You're selling software and and helping people through that transformation is what we're trying to do. But from a guy in the industry for a while, our purpose is to get people to do to do DataOps. And, yeah, you can use your software. But there's there's you can if you wanna start doing it, you don't need to buy software. You can start applying the principles now just with in Python or SQL or however you're doing your work.

Is the

need for being able to rapidly iterate and deploy systems that are designed to capture and store and process and analyze data become more prevalent. How do you foresee that feeding back into the ways that the landscape of tooling and

data storage

platforms are designed and developed particularly with an eye towards

how they are managed from an operational context.

Yeah. Some are some are good. Some tools are good DataOps citizens and some aren't. So databases

generally are. Right? Because you can pour data in them. You can run them. Sometimes you can run them in a container, sometimes you can't. But you can pour data in, you can port a code in, and you sort of and have them in the same kind of virtual environment. You get the same results. Other tools are a little bit more challenging. Some of the ETL tools, you can you know, they they they really want

to own

treat them like blobs instead of being able to kind of

branch and merge them. You know, Tableau, for instance, is not a very good DataOps citizen. It's we had to do a lot of work just to be able to like pull a Tableau workbook out,

run some tests on it, and then push it back in.

And a lot of what

they they they try to optimize kind of on the manual process, and they've got nice UI. But the whole point of

of doing this is that you should be able to script a change. Everything should be done with scripts.

And,

every and so that's where

the industry is is is on 2 challenges. 1 is that bigger companies are trying to own the entire footprint

of analytics. And so they they bigger companies who like, the big Hadoop vendors are, like, everything and under the sun. And you just do it in our our platform, you're fine. And then even tools like like Tableau, they're sort of expanding out in the data prep, expanding on the data science. The data science vendors are expanding out in the data data visualization. And so I just reject that idea that there's sort of 1 boot, 1 tool on your neck. It's a multi tool world.

And those tools

need to be able to be scriptable, deployable. You need to be able to shove in the work that you've created in them and be able to test the results and then pull it back out.

Because fundamentally,

if analytics is code, then all code should be in tools of version control like Git. Whether it's a predictive model or whether it's Tableau Workbook or SQL or Python, all that stuff should be in

Git because it's a central place. And again, big companies,

by and large, don't do that. They'll have some code in file systems. Sometimes it'll be in Git, or sometimes they'll put it in Informatica's store somewhere, in some database somewhere. And so the I think the industry needs to realize there's some work that they need to do to to to stop this belief that,

and this has been going on for a long time, that you should do everything in my environment. And that's just not the way the world works. And there's so many existing systems, and there's so many tool chains that need to be coordinated. And so, I I find that just a challenge in that we try to integrate with a lot of tools, and some are good and some aren't.

And so that's that's 1 of the things I think the industry can do is is get better at being good data ops

citizens. As somebody who spends a lot of time working with configuration management and automating deployment of servers and deploying various tooling or platforms on top of them. There are all too many occasions where I'll be trying to get something up and running, and it's not as simple as just dropping a text file with some configuration data into it. I then have to figure out a way to write a script to insert the right records into a database to ensure that everything is configured and operating properly. And whenever I run into database to ensure that everything is configured and operating properly. And whenever I run into that kind of situation, it's just painful, and I wish it was just drop a text file, have it

ingested properly by the tool as it starts up rather than having to go through all of those extraneous steps of either manual processes or writing convoluted scripts to get things running the way that they're supposed to.

Yeah. Yeah. And, like, we just all we wanted to do, like, in we have a Tableau container that's

that we wrote, and that's part of our our analytic container, like, spec. And so in order to get all I wanted to do was take a Tableau workbook, have it connect to a data source, and then make a change to it, and then deploy it back, and get some test results out. It's like it seems like Tableau has a REST API. It also has a command line API, but neither of those was able to do it. So I had to, like, actually use Selenium and talk to their UI. And then the only way the UI is I could get the actual data out. The REST API could only get an image out. And so, like, fundamentally, think of these things as black boxes. I've I've if work is code, I wanna take my Tableau workbook and put in source code. And then I wanna make a change. And then I wanna be able to make sure I haven't broken anything. So I've gotta stick it in the system that's representative production, get some test data out, and then check to see if it's right. I mean, that seems to me like, how else are you gonna have your Saturday Saturday mornings free,

unless you're able to do that in an automated way? And so, yeah, we had to jump through some hoops to do it, and Tableau is a big company. I think Looker's a bit better in the BI space.

Some of the data engineering tools, ETL tools are are a differing degree. But how to be a good citizen of in a in a DevOps, DataOps world, I think, is important for vendors. And just stop getting over this idea that you know, your sales VP is gonna because they have your 1 product, you're gonna replace every other product in the $100, 000, 000, 000 analytic industry.

That other idea is also just makes me nauseous as a guy who, like, trying to sell our product, it goes stands around at that trade shows,

once a month. It's just it's, it's not the way the world is, and and we live in a diverse

tool world, and tool chains, and tools matter, and people love their tools. And the idea of the, you know, sort of big boot on people's neck where everyone's gonna use their tool. It's just it's just silly.

And are there any other aspects of this conversation that you think we should discuss before we start to close out the show? You know, I I guess I think from my perspective, 1 of the I you know, from a career standpoint, I really

like data and analytics for a lot of reasons. 1 is that there's a lot more

women in data going into data analytics.

And so I you know, if you go to, like, in Boston, there's gonna be the Open Data Science Conference. There's a lot more women. It's a very diverse field. It has a lot of different people with different skills. And I think,

in some

ways, the idea of the value of data analytics is is kind of lifting and separating from IT or software development to be kind of its own thing and its own culture.

And so I think that's really good. And I I I really like the the data and analytics field and all the different subsets, whether you call yourself a data engineer or a data scientist or whether you're a a data artist who does Tableau. I think all those things are really it's a really big growing diverse field.

And I think it does have challenges with agile and and applying the agile principles and and doing data ops are are 1 of of the the biggest areas I see as as the problem. But there's also just a whole lot of opportunity for people to kinda grow in in this career, and I'm heartened to see billboards. Like, I was driving through Hartford 6 months ago, and I saw a billboard for a master's in data science. And those things didn't even exist 2 years ago.

And, like, we've been hiring, and we are hiring. If you're a data engineer looking for a job, go to data datacatch.com.

We're hiring a bunch of data engineers. But there's also a lot of graduate programs where people are actually

learning to be data engineers. And I think that's a really, it's a great career. And so I think

the transformation of data engineering from, yeah, it's the lunch pail guy who does the data work in the backroom,

and to a

skilled professional who delivers a high value to the organization. I think that's that's ongoing. But I really do, I'm a big believer in data engineering, and I I believe that, you know, they should be paid well, that they have as much

cool kids in the block now or data scientists or even more so. And I think it's a it's a great career for anyone to get into, and there's just a lot of jobs out there.

So if you're looking for, looking to make some money and and have an impact, it's a good career to get into. So for anybody who wants to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And so,

with that, I'd like to thank you for taking the time out of your day to join me and talk about the work that you're doing with Data Kitchen and your views

on data ops as it relates to data engineering and analytics workloads. So thank you for that and I hope you enjoy the rest of your day. Okay. Yeah. Then thanks for the opportunity.

Data Engineering Podcast

Summary

Preamble

Interview

Contact Info

Parting Question

Links