Summary
Delivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of practices to increase the probability of success by creating value early and often, and using feedback loops to keep your project on course. In this episode Chris Bergh, head chef of Data Kitchen, explains how DataOps differs from DevOps, how the industry has begun adopting DataOps, and how to adopt an agile approach to building your data platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
- "There aren’t enough data conferences out there that focus on the community, so that’s why these folks built a better one": Data Council is the premier community powered data platforms & engineering event for software engineers, data engineers, machine learning experts, deep learning researchers & artificial intelligence buffs who want to discover tools & insights to build new products. This year they will host over 50 speakers and 500 attendees (yeah that’s one of the best "Attendee:Speaker" ratios out there) in San Francisco on April 17-18th and are offering a $200 discount to listeners of the Data Engineering Podcast. Use code: DEP-200 at checkout
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Chris Bergh about the current state of DataOps and why it’s more than just DevOps for data
Interview
- Introduction
- How did you get involved in the area of data management?
- We talked last year about what DataOps is, but can you give a quick overview of how the industry has changed or updated the definition since then?
- It is easy to draw parallels between DataOps and DevOps, can you provide some clarity as to how they are different?
- How has the conversation around DataOps influenced the design decisions of platforms and system components that are targeting the "big data" and data analytics ecosystem?
- One of the commonalities is the desire to use collaboration as a means of reducing silos in a business. In the data management space, those silos are often in the form of distinct storage systems, whether application databases, corporate file shares, CRM systems, etc. What are some techniques that are rooted in the principles of DataOps that can help unify those data systems?
- Another shared principle is in the desire to create feedback cycles. How do those feedback loops manifest in the lifecycle of an analytics project?
- Testing is critical to ensure the continued health and success of a data project. What are some of the current utilities that are available to data engineers for building and executing tests to cover the data lifecycle, from collection through to analysis and delivery?
- What are some of the components of a data analytics lifecycle that are resistant to agile or iterative development?
- With the continued rise in the use of machine learning in production, how does that change the requirements for delivery and maintenance of an analytics platform?
- What are some of the trends that you are most excited for in the analytics and data platform space?
Contact Info
- Data Kitchen
- Chris
- @ChrisBergh on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Download the "DataOps Cookbook"
- Data Kitchen
- Peace Corps
- MIT
- NASA
- Myers-Briggs Personality Test
- HBR (Harvard Business Review)
- MBA (Master of Business Administration)
- W. Edwards Deming
- DevOps
- Lean Manufacturing
- Tableau
- Excel
- Airflow
- Looker
- R Language
- Alteryx
- Data Lake
- Data Literacy
- Data Governance
- Datadog
- Kubernetes
- Kubeflow
- Metis Machine
- Gartner Hype Cycle
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:16] Unknown:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for those machine learning workloads, they just introduced dedicated CPU instances. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai.
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And managing and auditing access to all of the servers and databases that you're running is a problem that grows in difficulty alongside the growth of your teams. If you're tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it's time to talk to our friends at StrongDM. They have built an easy to use platform that lets you leverage your company's single sign on for your data platform.
Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems. And you might say there aren't enough data conferences out there that focus on the community. That's why the folks at Data Council built a better one. Data Council is the premier community-powered data platforms and engineering event for software engineers, data engineers, machine learning experts, deep learning researchers, and artificial intelligence buffs who want to discover tools and insights to build new products. This year, they will host over 50 speakers and 500 attendees in San Francisco on April 17th to 18th, and are offering a $200 discount to listeners of the Data Engineering Podcast.
Go to dataengineeringpodcast.com/datacouncil, all one word, and use the code DEP-200 at checkout. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We've partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. So go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[00:02:47] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Chris Bergh about the current state of DataOps and why it's more than just DevOps for data. So, Chris, welcome back. And for people who haven't listened to your last interview, can you just start by introducing yourself quickly?
[00:03:01] Unknown:
Hi. Thanks for having me, Tobias. My name is Chris Bergh. I'm CEO and head chef of a company in Cambridge called Data Kitchen. And I'm a bit of an older nerd, written a lot of code in my life. So I started off in Wisconsin, went to the Peace Corps and taught math for a few years, went to Columbia and studied AI back when no one in the world knew what AI was. And then went to MIT and NASA, and then got the management bug. And then about 2005, got the data and analytics bug. And I started working with, you know, what we now call data engineers and data scientists and people who did data visualization. I was COO of a company, and we had a lot of smart people that we hired. And I worked for a CEO who was a Harvard educated physician. He knew a lot about healthcare and knew a lot about analytics, but he didn't know a lot about how to make the trains run on time. And for years I had to gather a data engineer and a data scientist and other people in a room and figure out how to get my boss's request done. And I'd come back to him and say, wow, it's gonna take two weeks for our team to do something, being all excited. And he'd sort of look at me with his Harvard eyes and say, Chris, I thought that should take two hours. And, you know, when I would get out of my office, I would get calls from our customers yelling at me that the data's wrong or it's late. And then we had hired all these smart people, and they wanted to use their own tools. So I lived this life for many years of: how do you deliver innovative analytics fast, with high quality, and using the tools that you love? And that's the genesis of the company. Our company is really focused on trying to bring this idea of DataOps, or data operations, to the market.
So, last year, and I'll add a link in the show notes for anybody who wants to revisit our conversation from then, we talked about the idea of DataOps and what it means, at least as of that point in time. So can you start by just giving a quick overview
[00:04:50] Unknown:
of how you define DataOps and some of the ways that the industry has taken on that moniker and either changed or updated that definition since we talked last? Yeah. You know, I currently manage people and have managed a lot of people in my career, but I'm not really
[00:05:05] Unknown:
that charismatic or even that good a manager. You know, I have an INTP Myers-Briggs type. I've had to learn to manage, and I think learning to manage is different, because I've noticed people who are naturally good leaders. I joined a CEO forum, and there's a 29 year old woman in the forum who's probably just a much better leader than I am, just charismatic and confident and just a person you want to follow. And so leadership and management were hard for me to learn. I do have one advantage, though, which is that I think leadership and management of a group of people working together on some technically complicated thing is a distinct set of skills. So you can read Harvard Business Review and go get your MBA, but really, when you're dealing with dozens or hundreds or even thousands of people working on something like a factory, or building a big piece of software, or working on all the various parts of the value chain in analytics, that technically complicated thing that we're all working on requires a different management approach. And part of it is, yeah, you've gotta be a good leader and read Harvard Business Review and all that stuff. But a lot of it is these ideas that started when people managed factory floors, this technically complicated thing with lots of people working on it, and they were getting lots of errors and lots of problems. And there was a guy named Deming. And there's this original idea of Taylorism, break everything down into small pieces and then work on the small pieces, and there are all these apocryphal stories about why that didn't work. But really, the idea is that when you're in a technically complicated domain, if you can deliver something in a small batch to your customer, it's better.
And if you can do that quickly and iterate upon it, it's better. And if you can measure and improve over time, it's better. And those three ideas, I think, are inherent in what we call DataOps. They're inherent in Agile and DevOps. They're inherent in total quality management and lean manufacturing. They're all kind of the same ideas showing up in different forms, but they really started with: how do you manage and lead a group of people who are working on something technically complicated for a long period of time? That's a different type of management. And so in my journey to get to DataOps, it really is, well, how do you lead people? Because a lot of the time it's not that the individual workstation or tool that people have is the problem, or even that the individual is the problem. The people and the process are really where the biggest value comes from. Not to say that having a faster machine or a faster compiler or a better database doesn't help.
I'm not saying that at all. I'm just saying that really the core problem in analytics is a management problem. And I think that was the core problem in software, and I think it was the core problem in manufacturing. It's a little bit of a convoluted way of getting to what I'm talking about. I don't know if that made sense. But Yeah. No. That was a good overview.
[00:08:06] Unknown:
And like you were saying, a lot of the biggest challenges that we face as people working on these technically sophisticated projects isn't necessarily deep in the bowels of the machine. It's more in how do we, as a team, work together to achieve some outcome that's valuable to the business, and what does that even mean? How do we align all of our operations in a way that is going to deliver something that is useful to somebody at some point versus the old approach, at least in software, of having these various silos of responsibility where the developers throw things over the wall to the sys admins, and then the sys admins just have to figure out how to make it work. And so that's where the sort of genesis of DevOps came in is, you know, how do you align these different units so that they're working together instead of at cross purposes to each other? Yeah. And so that brings me to my question of how the sort of common definition or the accepted view of what DevOps is conflicts with what people are expecting when they try to bring those conceptions to this definition of DataOps
[00:09:10] Unknown:
and some ways that DataOps in particular is its own beast, unique from the DevOps practices that have been developed over the past decade? Yeah. Well, that's a great question. And so, yeah, I've tried to position DevOps, or ops of anything, in this intellectual history of the last century, and I think that's true. So in that sense, I don't think DataOps is new. It's just an application of these principles of how to get a group of people to work together where they're all working on the same analytics. You know, people have, I think, started to talk about it more. I think it's become more of a thing. And it's similar to the early days of what happened in DevOps and Agile. There was some of that, and the market hasn't settled on a definition.
People are talking about it. You get some companies who do nothing related to it using the term as a halo in marketing. And there's just sort of confusion. In a lot of ways, terms and monikers for ideas are hard, because the marketing team gets them and wants to inflate them and cover their company in glory. And it takes a little bit of intellectual work to get at, you know, what is this idea? The way I position it is, yeah, you gotta manage a team of people in a technically complicated environment. So look at Lean and Agile as ways to think about that and as a framework to do it. And so there are differences between what I call DevOps and what I call DataOps. And how do I know about those? Well, I managed teams of software engineers for many years. I've written a lot of code and managed teams in Agile, managed teams in waterfall, managed teams with DevOps principles. And the same goes for managing people who do data visualization or data science or data engineering. So I have a unique perspective, both being able to write code in both areas and managing teams in both areas. And in a lot of ways, software engineers and data scientists and data engineers are really alike from a personality standpoint, but we don't talk to each other very much. We're kind of cousin nerds. You do find some people who've come from software engineering into data engineering, but by and large, they're just different tribes. As an example, you would ask, where do you put your stuff when you're a software engineer? And 99% of people would say version control. And if I ask that same simple question to different people in different roles, maybe 5% now would say version control, and 95% would say, well, we put it in a file system or on the shared drive, or it's here on my laptop. And so that's one area where the people's responses are just different, because the people there are different.
And so in DataOps and DevOps, first of all, they're just different people. A broad generalization of people who do software is that they like technical things. They're interested in how the machine works. They're interested in this cloud versus that cloud. They're interested in learning about Git. They may know a couple of different languages: a scripting language, maybe Python. Whereas the broad generalization is that people who do data science or data engineering or data visualization are more interested in the problem, and therefore are not interested in juggling several languages to get at it. And so as a result, they tend to be less adventuresome technically. Not to say that they're not smart. It's just that if someone knows R, they're gonna know R, and they're not gonna pick up a scripting language and an XML variation and a DSL and know two or three different languages and be cognizant of it and spend their evenings reading about it. They're gonna spend their time reading about the next algorithmic technique, or learning about data preparation or visualization or the other stuff that goes on. So they're just different people. We're alike, and yet we have different expectations. And that's just one broad way that DataOps and DevOps differ. And I actually have a whole bunch of different ways. I don't know, Tobias, do you want me to go through each one, or how do you want me to go through them? I think it could be interesting to enumerate all the different ways that DataOps and DevOps, and software engineers and data engineers and data scientists, are different, but that could probably fill up a podcast just by itself.
[00:13:15] Unknown:
Yeah. So I think having this high level perspective of the difference in the primary concern: is it the technical bits and bobs that go into building something, or is it how to solve whatever the problem is? I think having that perspective is definitely very useful and is a good way to continue in this conversation. Data engineers and data scientists are more interested in figuring out the answer to this question, versus systems administrators and software developers, who are still trying to achieve the solution to a problem, but are much more interested in the various components that go into it, the architecture of the system, and how things flow through it. So having that context is definitely very useful as we continue on. And one of the things that you were talking about: when you ask a software developer where they put whatever it is they're working on, and the answer is version control, that makes me wonder what is the equivalent to version control in the data space, because I know that there are varying levels of sophistication in that, whether it's for the data storage or the machine learning models or the code that manages the ETL pipelines. So I'm wondering if we can just explore that a little bit as a next step. Yeah. And it even depends; there are actually subtribes in data and analytics. There are people who do data visualization, and there's a lot of analytics that is done self-service: Tableau, Paxata, Trifacta, Excel.
[00:14:44] Unknown:
And that's actually, I think, a good thing, because they're close to the customer and they can iterate quickly. But by and large, those are file systems or shared drives. They're not at all stored in Git and version control. Data scientists, it depends. There are some who use Git; you know, somebody released an updated R module for Git. But in a lot of cases, it tends to be their shared drives or their laptops. Data engineers, it depends. You have the data engineers who may use Airflow and more modern tools. They're kind of programmer-like data engineers, and they tend to use version control. Others keep things in whatever format is native to their system, and sometimes that is actually a binary file, sometimes it's an XML file, and sometimes it's stored in the database. So it really depends on the subgroup in data and analytics.
And it's puzzling to me, because I go to data conferences, and if you're a data engineer listening to this, well, you obviously believe data has power, and you obviously believe that it's better to make decisions based off data than not. Well, those decisions that are based off data are based on code that is running on top of the data, and that may be a visualization or an R model or ETL code. You should think of that as high order intellectual property of your company. So why is that on a shared drive? Why is that on someone's laptop, and not kept? You read about Google, and they have their own massive repository of all their code in one place. The intellectual property of Google is in there. You talk to people who run hedge funds, and they partition the code by what strategy they have, and you can't talk about it to the other people in the hedge fund, because that's the core IP. And I think in a data driven business, the core IP is the work that the people do to put together the pieces in the value chain from source data to results. And that should be kept in one place, because it's important. And version control is, I think, the best place to put it; shared drives and locking it away in an internal system are not. So that's just one case where the DataOps or DevOps mindset that applies in data analytics is just not there. And it's changing, and people are learning. A counterweight to that is that a lot of the tools that people use don't make it easy. You can put the XML that, for instance, Tableau or SSIS or Informatica use into version control, but it's actually hard to diff and hard to merge. And you end up having to parameterize it and do sort of kill and fill.
And there are a lot of proprietary engines out there that aren't really version-control or code driven; it's just an external file format, and it's my file format. It could just be binary for all we know. You can't actually merge it. And that's better in some tools. For instance, Looker in the BI space is actually better and actually supports Git and version control. Others, you know, Tableau has got a funky XML file that you struggle with. And you don't even wanna get into some ETL tools that just store everything in a freaking binary format where you don't even know what's going on. So the challenge I see is that as DataOps is getting more popular, the tool vendors are struggling with it, or I don't think they're paying a lot of attention to being a good DataOps citizen, because they'll have binary formats, or they're trying to say, just use my tool.
Basically, they're selling magic beans. Right? You use my tool and everything's wonderful. And there are no magic beans; get over it. It's code at the end of the day. It could be code that runs in an engine, it could be code that has to be compiled, but it's code. Code means complexity. Code means you have to know how to test it and deploy it and do all that stuff you do with code. Just get over it and stop selling people the magic beans that if you use my tool, everything's great. And so that's one of my frustrations: as the market matures, I think people need to demand more of the tools that they use. It's almost like the right to repair movement. Don't trap me in your tool. Give me the source code. Let me store it someplace else. Let me understand what's happening in that source code. It's just a matter of writing a better file format, and I think a lot of the vendors could do that better.
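To make the "hard to diff" point above concrete, here is a minimal sketch of one workaround teams use when a BI tool only exports a noisy XML file: strip the volatile attributes before committing, so version-control diffs show real changes. The attribute names here (`modified`, `session-id`, `thumbnail`) are hypothetical stand-ins, not the actual fields any particular vendor uses.

```python
# Sketch: normalize a BI tool's XML workbook before committing it,
# so diffs in version control reflect real edits rather than noise.
# VOLATILE_ATTRS is an invented list for illustration; a real tool
# would have its own set of per-save volatile fields.
import xml.etree.ElementTree as ET

VOLATILE_ATTRS = {"modified", "session-id", "thumbnail"}

def normalize_workbook(xml_text: str) -> str:
    root = ET.fromstring(xml_text)
    # Walk every element and delete attributes that change on each save.
    for elem in root.iter():
        for attr in VOLATILE_ATTRS & set(elem.attrib):
            del elem.attrib[attr]
    # Re-serialize so the same logical workbook yields the same bytes.
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    raw = ('<workbook modified="2019-04-01T12:00:00" version="1">'
           '<view name="sales"/></workbook>')
    print(normalize_workbook(raw))
```

A script like this can run as a pre-commit hook, which is one small way to be the "good DataOps citizen" Chris describes even when the vendor's format isn't.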
[00:19:10] Unknown:
And I think part of it too is that a lot of the early entrants to the market of data analytics and big data were largely coming from academia, where the primary responsibility is to just get something that works, regardless of how it works, and there isn't necessarily a lot of time and effort put into the operability of the system. It's just, you know, you put it up, and then you tune this knob over there and fiddle this dial over there, and then everything's fine, just don't touch it. Versus some of the tool vendors who are coming in now, who come more from the developer focus: it's software. It needs to be managed and run as software. It needs to be deployable.
And then the other thing too is that in the earlier days, we were in the era of proprietary software running the world, and so everything was closed. You don't necessarily want to open up the file format for somebody to be able to reverse engineer it or make it portable to another system. So there was that idea of lock-in, whereas in recent years, particularly over the past decade or so, open source has become more and more of a force, and everybody wants to be able to peer into the system, you know, perform their own fixes, make their own enhancements, and actually understand how everything is working. So I think that that's also playing into the increased willingness of engineers, whether data engineers, software engineers, or data scientists, to have these systems where you have the access to do that tinkering. Yeah. And so I'm curious what you have seen over the past few years, but particularly in the past year since we last talked, as far as the overall influence that this concept of DataOps has had on the types of tools and system components that have become available, or changes in the existing ones over that time, that are targeting the big data and data analytics ecosystem to make them more operational and software driven?
[00:21:01] Unknown:
You know, I gotta be honest: not a lot. I wish there was. I wish more people were doing it. And there are companies, like Looker, who have from their beginning thought of what they do as code. But there's a strong urge for proprietary lock-in, and for the expansion of scope that drives the growth of a business. Once you have a customer, you wanna take more share of wallet, and, you know, you can do more. So for instance, Tableau is an amazing company and an amazing tool, but they have a data prep feature now. You can do data prep, and it's still the same sort of funky file format that Tableau workbooks have, and they haven't really improved it. They've expanded, but they haven't improved. And so I do think that, from a developer focus, open source and treating your infrastructure as cattle, not pets, those ideas I think are really important. But proprietary lock-in is such a draw, and even the cloud providers have a whole other level of it, both taking open source and bundling it in their offering, or putting a little spin on it that makes a proprietary lock. So I think it's always gonna be there in data and analytics and in the software industry. People are looking for proprietary lock-in, and the temptation is so strong, because if you do get it, you can get a monopoly going and you can make a ton of money. From a business standpoint, it's hard for people, especially people who've got a lot of venture funding, not to succumb to that temptation. And so, you know, I think the solution there is for people, at least in data analytics, to start thinking of their work the way software engineers do: as code.
And if they are using a proprietary engine, they should think about taking that work and putting it in source control, and being able to inject the work that they do at runtime, so it runs in its engine not as a pet, but as nice robust cattle: infrastructure as code, tear-downable and runnable, and then log and monitor it and test it as it's running. I think all those things will just help you in your data and analytics journey. And don't stick with, oh, it's my development server, this guy set it up and I don't know what's going on, but this is where I do my development, and if production breaks, we're just totally screwed, it'll take a week to rebuild. Those sorts of things are, I think, anti-behaviors that you wanna work on with your team. Because the value in those seemingly mundane operational things is actually huge. If you can rebuild your environment quickly, especially using clouds or multiple clouds, or even virtualization on a private infrastructure, that value is being able to sort of cut and paste your infrastructure, cut and paste your code, and then rerun it in a new way. That gives you such flexibility, such agility, that it's worth paying the operational price to do it. And it's amazing: I saw in software that that belief didn't exist in 2000, that we had a bunch of pets as infrastructure.
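The "test it as it's running" idea above is worth making concrete. Here is a minimal, hedged sketch of the kind of in-pipeline data checks a DataOps team might gate each step with; the column names and thresholds are invented for illustration, and a real team might use a dedicated testing framework instead.

```python
# Sketch: small data-quality checks run inside a pipeline, so bad data
# fails loudly at the step that produced it instead of reaching customers.
# Column names ("patient_id", "age") and bounds are illustrative only.

def check_row_count(rows, minimum=1):
    # Guard against an extract that silently returned nothing.
    assert len(rows) >= minimum, f"expected at least {minimum} rows, got {len(rows)}"

def check_no_nulls(rows, column):
    # Guard against missing keys before a downstream join.
    bad = [r for r in rows if r.get(column) is None]
    assert not bad, f"{len(bad)} rows have null {column!r}"

def check_within_range(rows, column, lo, hi):
    # Guard against obviously impossible values.
    out_of_range = [r for r in rows if not (lo <= r[column] <= hi)]
    assert not out_of_range, f"{len(out_of_range)} rows outside [{lo}, {hi}]"

if __name__ == "__main__":
    extracted = [{"patient_id": 1, "age": 34}, {"patient_id": 2, "age": 71}]
    check_row_count(extracted, minimum=1)
    check_no_nulls(extracted, "patient_id")
    check_within_range(extracted, "age", 0, 120)
    print("all checks passed")
```

Checks like these live in version control next to the pipeline code, and their failures feed the logging and monitoring Chris mentions.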
Yeah, we did source control, but deploying it, the pull of proprietary lock-in, even the operational side of analytics, or the operational side of software development, was unappreciated. The guy who wasn't the best software developer did your releases into production, and they were paid less than other people, and they may have been fun at parties, but they weren't the real cool engineers, mainly because we just disrespected the operational side. And so in data science and data engineering, I think that respecting of the operational side needs to happen. And also, I think, the respecting of the self-service side, which is another dimension of complexity in DataOps versus DevOps that also has to happen. If anything, the DevOps movement was about upskilling, about respect for the operational side. I think that's happening in the DataOps movement, but it's still not a common practice. And I think another challenge with DataOps is that it has these self-service tools, which are basically low-code development environments.
And there are dozens of companies that do visualization tools, and now data transformation tools, and now automated machine learning tools, which are basically tools that create code. Yeah, they've got a nice UI, but at the end of the day they're code generation tools. Maybe some of that code is configuration, but they're still code-gen tools. And those are actually really valuable for analytics, because of this big idea that analytics is a river of questions: it's always experimenting, always iterating, there's always a new dataset, there's always a new question. A lot of those questions can be answered quickly by someone who doesn't have a PhD but is really good with their own set of self-service tools. I've seen it with our customers using Tableau and Alteryx, or using Excel; you can just do a lot of work that way. That's good, and you've gotta respect that work. And a lot of times people in larger IT organizations say the good work happens centrally, here in the home office; the stuff in the field is desktop data marts, it's sprawl, it's problematic, it's ungoverned.
And I think that respect has to happen. It's about getting respect between operations and centralized development, between centralized development and self-service development, and also between the data engineering role and the data science role. Seeing each one of those as valuable and useful parts of what we do in analytics is, I think, important. So part of DataOps is overcoming the biases people have about what's important to do and what's not. And there's also the issue of which data sources are important, who has access to them, and the whole idea of silos, which is something that DevOps has been working against, as far as the silo between developers and systems administrators. But in the data space, a lot of those silos can exist just by nature of
[00:26:56] Unknown:
where the data is being captured. So whether it's an application database that stores some information, or the CRM system where salespeople are logging interactions, or corporate file shares where people have Excel files or other CSV files, whatever it might be. And so I'm wondering what your thoughts are as far as general techniques to help eliminate those silos, or reduce their effects, and unify data into systems that are more self-service, or ways that they can be brought into the analytics life cycle? And also your overall thoughts on how those different silos play into some of the necessity for this idea of DataOps, and unifying the entire business across a common goal, versus having these different systems that not everybody necessarily knows about or has access to? Yeah. So data silos exist in a lot of companies, right, because they're multi-division companies, because of legacy systems.
[00:27:51] Unknown:
And it's hard for people to have access to data, or to find data. And I think those are real problems. I'm not sure how DataOps directly addresses those. I see customers doing a lot of putting things into cheap storage, like S3 in the cloud, and saying, okay, my first step is I wanna take our five or ten databases and have them backed up into the cloud, so I can have access to all the data. And then I need some people to help me understand what the heck those tables mean, so I can figure out what is predictive or what is analytically useful. So silos are a problem, and I think the techniques of using data lakes or using clouds are ways to help with that. And so the question in my mind becomes, how do you actually start getting value out of that data? And that's where I think DataOps comes in: the high-order bit here is small cycle time and small batch size. How can I do something small on the data I have from one or two silos, get it out in front of my customers, see if it has value, and then iterate and improve upon it? That's the process that DataOps tries to address: whatever tool or tool chain you use, whatever team or multiple teams in multiple locations you use, how can you make that small batch size and cycle time work? Because if it's a river of questions, you want to get feedback from your customers first, even when the data is 70% right, even when you may not know the answer, even when everything isn't engineered perfectly, so that you can figure out what you don't wanna do. And that's a pure agile and DevOps idea: maximizing the amount of work not done by getting feedback sooner.
And so if you look at the failure rate in analytics, and it's such a dirty term that no one really talks about, and it sort of drives me nuts that it isn't talked about more, half of all projects fail, either on requirements or on time to market, according to Gartner. And one Gartner analyst says 85% of analytics projects fail. And yeah, we're all excited about analytics, and it works really great. But if your business users or your customers don't use it, if it doesn't influence them, if you've spent three months or three years building a big data infrastructure and you've got one or two customers on it who are using it occasionally, you're not a success, right? You need to have users using it, impacting their lives. That's the success.
And so I think that challenge of silos is important, but you could argue that governance is important, you could argue that data literacy, people who understand the data, is important. All those things are important. But the way I see the world, the way to address those is to do your work with technical excellence, deliver it in a small repeatable cycle that doesn't kill your team when they want to change it later, get feedback, and then take that feedback and iterate and improve. That's the best way to engineer things, it's the best way to deliver, and it ends up just being a more pleasant environment, because then you don't end up with three-month death marches, or six-month death marches, and then all of a sudden people don't use it. And I see that day in and day out. People are building these complex edifices, and it's fun as an engineer to go off and work on the cool things, and you're filling out your resume to get stuff done, but the real value is: are people using it? Are you having an effect on the world? You've gotta ask yourself if that's what you wanna do. And you don't wanna hide off and build crystal palaces and then have no one use them. And
[00:31:17] Unknown:
what have you found to be some of
[00:31:19] Unknown:
the best places in the overall life cycle of an analytics project to add these feedback loops and points of identifying potential failures, and failing fast rather than, as you said, going through that three-month death march and then delivering something that nobody wants or cares about anymore? And ways to get useful information out of the overall processes that you're building to make sure that the continued viability of the project is something that is possible, let alone probable? And particularly when you're adding in things like machine learning or artificial intelligence,
[00:31:54] Unknown:
just end to end, the different places that you can have these feedback loops, and the ways that they manifest in the analytics life cycle? Yeah, that's a good question. You know, in some ways it is true that analytics is kind of a layer cake, right? You've got raw data at the bottom, processed data that you can use for visualizations or reports, and raw and processed data that you can use to do machine learning models or AI. And with that layer cake, you can get different pieces at different times. So what we try to do with our customers is guide them to deliver things that are not perfect, and to embrace errors. Maybe the data is a bit wrong. Maybe your model is not as predictive as it could be. Maybe the visualization is not quite right, maybe your schema is wrong, or you don't have all the data, but get something out sooner in whatever you're doing, so that you can actually hear what your customer is finding valuable. Because for the most part, I think we as engineers live under the assumption that everyone in the world is good at abstraction, good at taking things and abstracting them into their logical structure. And it turns out most of the world isn't. They actually have to see things in context.
And I don't know if this is a digression, but I'm gonna digress. In graduate school, I gave people a test from Legrenzi and Legrenzi; it's a psychology test. I gave people a syllogism in abstract terms: if A then B, if B then C, therefore if A then C. Given just that structure, about 5% were able to get it. But when I told that same structure in the context of a story, 95% were able to get it right. And so stories and context really matter for people's understanding. So you have to give analytics in the context of the visualization or the UI that people are interacting with to learn. And so there's a set of behaviors that I would like people to get away from. One is the crystal palaces, you know, don't believe the vendors on the magic beans. Two, fight your urge to build your own crystal palace that only you live in. Try to get someone to throw rocks at it as soon as possible, and don't fear that they're gonna disrespect you; that actual feedback is good. And then, like we talked about with the operational side, try to get away from doing things manually or disrespecting it; automate it and script it and treat it as cattle, not pets. And then the other part is this: there's fear on one side, and then there's heroism on the other.
There's a lot of hero worship in analytics. You know, the data scientist is the sexiest job of the 21st century, the Venn diagrams of data science skills, I'm the greatest person. And, you know, I like being a hero as a technologist; I have been one. But I realized that heroism means you end up creating a lot of technical debt that other people have to follow, and it also locks you into a place. And so these behaviors of getting away from fearing change, getting away from the hero mentality, getting away from doing things manually, trying to get away from vendor lock-in: that set of behaviors is what we'd like people to have, and then find a project that they can use them on.
[00:35:04] Unknown:
And it doesn't matter what, but deliver it in small batches, get feedback, and iterate and improve. And as another component of that overall feedback cycle, and something to help ensure the utility and longevity of a project as you go past the initial launch and start with the continued evolution of the project, I'm wondering if you can talk a bit about some of the various ways that you can add testing to the life cycle in these different stages of the project, for adding some of these feedback mechanisms, and also some of the ways that the overall tooling or systems have evolved to make testing easier, or more of a first-class citizen in the overall process?
[00:35:47] Unknown:
Yeah. And I just think automated tests, or automated monitoring of systems, is absolutely essential. It's a linchpin of this whole DataOps vision working. About two years ago, I was talking with a guy who runs an advisory firm for data architecture, advising very successful Fortune 500 companies. And I started to talk to him about automated testing and how essential it was. Here's a guy who's advising companies on what to do with data, and he goes, yeah, I don't ever really advise people to do automated testing. And I just wanted to hit myself on the head.
Not because he's a bad guy; it just wasn't in his scope. And I think one of the ideas here is that there are two ways to think about testing or monitoring. Think of the journey of the data as it goes from a silo through a system, through a model, through a visualization, right? On that journey, there are different places it could go. It could all go into a Hadoop platform; it could bounce from database to database or to a tool. But as the data is flowing in, you wanna make sure that your data suppliers are not screwing you over. And I use that term intentionally, not because data suppliers are bad; it's just that a lot of people don't respect data.
Typically, sometimes they'll forget something, or they'll drop a column, or the meaning of the data will change, or they'll give you a huge change in the dataset: they were giving you a million rows, and suddenly they give you ten million rows. So you need to test that data as it flows through all your code and all your systems to make sure that it's actually working, and you wanna be notified as soon as possible. Because a lot of times, in the standard way that people do systems, they sort of put it up and then wait for their users to say, this looks weird. Then they go yell at the data vendor and fix it, and that cycle takes weeks. And I think the automated testing, the monitoring, the thinking of your analytics as a factory floor where you're monitoring it and using that monitoring data as a source of statistics to understand your process and improve it, is vital. So that's one thing: testing and monitoring in production. That's similar to software, right? Where you've got Datadog or other tools that are monitoring your servers and your server logs for errors. It's a similar concept; you wouldn't run a professional software application if you didn't have someone sitting on the monitoring stream. There are a lot more steps here, and it's a little more complicated. But I think what's different about DataOps versus DevOps is that those tests that monitor the data are useful not only for monitoring the system, but also for testing the code: functional, regression, unit, and system testing. Because in one case, in production, the code is fixed and the data is varying; the data is flowing through and your code is fixed. But in development, you can fix the data and vary the code. And so the same tests apply.
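As a concrete illustration of the kind of supplier checks described here, a dropped column, a sudden jump in row count, a drifted value range, here is a minimal sketch in Python. The schema, the threshold, and the function name are all hypothetical, not from any particular tool:

```python
# A minimal data-supplier check: validate each incoming batch before it
# flows into downstream transformations. All names here are hypothetical.

EXPECTED_COLUMNS = {"order_id", "customer_id", "amount", "order_date"}

def check_incoming_batch(rows, last_row_count):
    """Return a list of failure messages; an empty list means the batch looks sane."""
    failures = []
    if not rows:
        failures.append("empty batch")
    else:
        # Did the supplier drop or rename a column?
        missing = EXPECTED_COLUMNS - set(rows[0].keys())
        if missing:
            failures.append(f"missing columns: {sorted(missing)}")
        # Did the volume jump or collapse by more than 3x since the last batch?
        if last_row_count and not (last_row_count / 3 <= len(rows) <= last_row_count * 3):
            failures.append(f"row count {len(rows)} vs previous {last_row_count}")
    # Has the meaning of the data drifted? e.g. amounts going negative
    if any("amount" in r and r["amount"] < 0 for r in rows):
        failures.append("negative amounts found")
    return failures
```

In production a check like this would run on every batch as it lands; in development, the very same function can run against a fixed sample dataset every time the code changes.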
Maybe not all of them, maybe 90% of them, but you can then use those tests as a way to validate that your system is still working, or that if I've changed component three out of a twenty-component list, it hasn't broken anything downstream. And just like in software, you should think about 10 to 20% of your work being the creation and management of these tests, whether it's test-driven development, tests next to development, or tests just after development. Building a framework of tests to help your development and prove that you haven't broken anything is really important. And just like in software, any team should be able to take a person who's just graduated from college, have them make a commit and push it to production, and be confident that it won't break anything; then you're successful operationally.
Why? Because the latticework of tests, the latticework of deployment operations, will hopefully catch the errors that any junior person will make. And in data and analytics, that's a different world. A lot of times you'll have these things called change review boards, which will be the one or two people sitting around who have the whole system in their head, and they're the ones who have to look at things and say whether this will break something or not. And that, again, drives me nuts; it's another source of friction. Because one, those people are really smart, they've got the whole system in their head, and they're looking at basic code that you should have a test to tell you whether it works or not. And two, it just slows everything down, because that group becomes a bottleneck: they can't meet very often, you've gotta have quorum, they've got follow-up questions.
And those types of process fixes end up slowing the deployment of code and slowing the getting of feedback. So the thing that unlocks that bottleneck is, number one, testing: tests that monitor production, and functional tests that actually test the code itself. If anyone takes anything out of this for data engineering, it's this: write a test, in whatever tool you have, and run it every time your system builds. And then in development, run that same test against a standard dataset that you have, or even a copy of the production dataset if you can get it. I think that one idea will help you a lot.
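One way to picture that "same test, two contexts" idea is a single validation function that a pipeline calls on every production run, and that a build script also calls against a frozen sample dataset on every commit. This is a hypothetical sketch; the toy transformation and its invariants are invented for illustration:

```python
# One set of checks, two uses: in development they run against a frozen
# sample dataset on every build; in production the same function runs
# against each live batch. Names here are illustrative, not a real framework.

def validate_revenue_rollup(rows):
    """Invariants that must hold whether the data or the code changed."""
    assert len(rows) > 0, "rollup produced no rows"
    assert all(r["revenue"] >= 0 for r in rows), "negative revenue"
    total = sum(r["revenue"] for r in rows)
    reconciled = sum(r["gross"] - r["refunds"] for r in rows)
    assert abs(total - reconciled) < 1e-6, "revenue does not reconcile"

def rollup(transactions):
    """Toy transformation under test: aggregate gross and refunds per region."""
    out = {}
    for t in transactions:
        r = out.setdefault(t["region"], {"region": t["region"], "gross": 0.0, "refunds": 0.0})
        r["gross"] += t["gross"]
        r["refunds"] += t["refunds"]
    rows = [{**r, "revenue": r["gross"] - r["refunds"]} for r in out.values()]
    validate_revenue_rollup(rows)  # same check a production run would apply
    return rows
```

Run in a build, this catches a code change that breaks an invariant; run in production, the identical check catches bad data flowing through unchanged code.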
[00:41:10] Unknown:
And a big component of the overall ideas that we've been talking about is this notion of iterative development or agile life cycles. But I'm wondering what you have found to be some of the components of a data analytics life cycle that are particularly resistant to these agile or iterative development techniques
[00:41:29] Unknown:
and ways that we can potentially work around them? Yeah. I'm not sure that any part of it is particularly resistant. I think there are people who are resistant. On one side, that tends to be more traditional IT, who get scared by agility because they've built a complex edifice that only they own, and they feel like their job's gonna get shipped to India if they don't own that technically complicated thing. And so giving up ownership and giving up trust is hard. I also see some data science professionals denying agile, saying, oh, I'm a researcher.
There's a lot of exceptionalism among data scientists who wanna punt it over the wall: I've built the model, you figure it out. And when you read the magazines, data science and now AI are the cool things that people wanna get into. Companies are hiring a bunch of people, and they all wanna be data scientists and do AI, and they've taken their classes. I totally understand that, because when I got out of graduate school, that's what I wanted to do. I wanted to do AI; that was the cool thing. But I learned pretty quickly when I worked at NASA that the AI part is really the seasoning on the main course; the calories all came from the other stuff that went with it. The systems part, servers, software, data transformations, visualizations, adoption, tests, deployment, automation, all that stuff is as important as a good model. In fact, if you've got a really good system where you can change things, you should probably get a crappy model up first and see if people are actually gonna pay attention to it, then improve the model over time. Because improving the predictive accuracy of your model by a few percentage points pales in comparison to all the other factors that go into making it real.
And so when I did air traffic control automation, the algorithms that we worked on to sequence and space aircraft were tiny from a code standpoint compared to all the other code needed to make the system work. And I bet if you look at it from a system standpoint, all the other pieces needed to make it real, to make it useful, matter just as much. So getting people to think more systemically, more operationally, is what we're trying to do, and to get rid of the sort of hero culture: you know, I can do the algorithm, so I'm awesome, and everyone else has to sort of bow down before me.
[00:43:50] Unknown:
You know, get get rid of that attitude. And as you're saying, there's all of this hype and excitement about AI and machine learning. And partially because of that, but also because of the advancement in the tools and technologies that are available for building these types of projects, there's an increase in the availability and sort of prevalence of them in these various systems. So I'm wondering what you have found to be some of the ways that delivery and maintenance of machine learning models changes the requirements
[00:44:21] Unknown:
of the analytics platform as a whole? Yeah. There's a lot of great work, like in Kubernetes and Kubeflow. There's a bunch of companies that have formed around real-time model deployment; Seldon and ParallelM and Metis Machine have all gotten funding around this, and I think there are two or three more, and they all have an open source project. And I think that's great, right? Real-time model deployment, and model deployment generally, is an issue. And there are particular cases where monitoring models and doing champion/challenger A/B testing really matters. But again, I go back to: it's a systems problem, and a people and process problem.
And I think there are tools that can help a data scientist, or someone doing AI, work better. But that hands-on-the-keyboard help, making an individual contributor write code faster, is important, but not urgent. Why do I say that? There's a shortage of data engineers; there's a shortage of really good data scientists. But again, look at it as a group of people working together to deliver value to a group of customers: how does that group work together, with all of its sub-workflows? Because there's a data science workflow that involves feature development, defining the model, and then deploying the model. But that often exists inside a larger data management, visualization, and governance workflow. It's almost as if there are lots of sub-workflows going on, and a meta-workflow that tries to manage them. So data science is great and has its set of particular problems, data engineering has its set of particular problems, and data visualization, data governance, and data dictionaries each have their own particular problems. And there are tools that help people do their work in each of those. But the way I touch the elephant of the world, the systemic problem of how to get all those people working together, is really where the value lies. And if you can help that, my bet is it helps everyone do a better job and can help with this 50% of all data and analytics projects that fail.
[00:46:31] Unknown:
And looking forward, what are some of the overall trends that you're most excited or encouraged by in the analytics and data platform space,
[00:46:41] Unknown:
and some of the things that you're keeping a closer eye on for incorporating into your own work? I look at it a couple of different ways, right? When I started focusing on data analytics in 2005, I had to explain to people what it is. You know, I do analytics, I do data; it's charts and graphs, and then they got it. And now it's everywhere, and that's awesome. The words that have been hyped, big data, data science, cloud, AI, are all becoming more recognized, and people understand what AI is. And I think all that stuff is going to blow through, because each one of those particular techniques ends up getting hyped a lot. Like, I can remember when big data meant parallel queries on cheap hardware and cheap disk.
That's what big data meant, but it's come to mean everything in analytics, and the same thing has happened with data science and with AI. All these words inflate. And I think the operational side is what's gonna come next. DataOps is the natural place to land when people have gone through every technique, because they're gonna end up with the same experience I had: it's not about the technique, it's about the team and how it works together. And that more mature management experience, that DataOps perspective, I think is what people are going to focus on. From a data and analytics viewpoint, I also just think there's a lot of great people entering the field. First of all, and I think I talked about this last time, I love the idea of data engineers, as opposed to ETL engineers or data guys. I think there's a lot of women coming into the field, which is awesome, and a lot of people coming in with master's degrees. And thinking of data engineering as a lifetime profession that you can grow in, one that's useful, not something that needs to be cost-minimized but a real source of value, I think that's just a wonderful thing.
And it also goes together with the fact that data engineers are a lot of times the people who deal with the operational issues, and the people who operate Tableau or develop models tend not to. So I think the rise of data engineers and the rise of DataOps are sort of paired together. Those things, I think, are good: the professionalization, the upskilling, the talking about it, that this is a real thing. Because I'm old enough to remember when outsourcing to India was the cool thing, and everyone was taking all their data people and putting them in Chennai, because that's what you did when you wanted to cost-minimize your data. And now it's completely different, and I think that's a really good thing. And also, as everyone knows, from a tool standpoint, the number of tools people can use keeps growing, and just filtering through all the tools is probably a part-time job. I mean, I subscribe to three or four different newsletters on data engineering, and with the number of techniques and the number of people writing, it's a full-time job just to keep up. And I don't do it; I just sort of graze at some of these articles. But trying to keep up is a full-time job. Yeah. Don't I know it?
Yeah. And it's good. I mean, it's a big creative burst, right? And it's funny: in my company, I hire some data engineers and also some software engineers. Seven or eight years ago, it was super hard to hire data engineers, because no one wanted to be one. There were, like, the music majors who knew some SQL, and then there were DBAs who had their own thing, but the term was odd, and you'd call them ETL engineers. Now it's actually easier to hire data engineers than it is to hire software engineers, because there's such a burst of interest in it. It takes us months to find a good back-end software engineer, whereas with a data engineer, we can fill up a queue pretty quickly and hire one. So it's an interesting change. Maybe it's just Cambridge, or maybe it's the way we're advertising, but that's what we see. And are there any other aspects of DataOps
[00:50:51] Unknown:
and the comparison to DevOps or anything else about the work that you're doing at Data Kitchen that we didn't discuss yet that you'd like to cover before we close out the show? Well, I think
[00:51:00] Unknown:
the first thing is, we've been talking about it for a couple of years, like three or four years, and we've been trying to explain the ideas. So we basically took all our blog writings and website writings and reformatted them into a book. If people are interested, you can check it out; we call it the DataOps Cookbook. It actually has recipes in it that you can cook, but it's a collection of ideas. There are lots of pictures, so you can scan it to get it, and it takes a lot of the ideas that I've talked about and puts them in more practical terms. And DataOps is on the Gartner Hype Cycle, which happened a few months ago and actually made a world of difference. Gartner had its first DataOps session in London, with 400 people at it, which I find just incredible. And so the term itself, likewise with the term data engineer, is just being used more, it's more common, and people are having a shared understanding. So if you wanna learn more about DataOps, read our book, read our blog. It's feeling like it's becoming a thing.
Whereas when we talked last time, its thingness was still a little bit in doubt in my mind, but I think it's coming, and I think it is a trend for the future. And to me, that's partly why I founded the company; it's obvious that this has gotta happen, and I think it's gonna be one of the most important things that happens in analytics in a while. And being part of the movement, not just being a vendor who sells some software to do it, but also trying to get these ideas out there, I think is important, because I struggled with these ideas over the years. I'd like people who are like me to not struggle in the same way I had to, to not have to start over from the beginning or get yelled at by their boss in the same way that I did, and to have a little bit more of an intellectual framework to understand their problems and how to solve them, instead of having to work from first principles.
[00:52:51] Unknown:
And for anybody who does wanna check out that book and your blog, I'll have the links in the show notes. And I'll also have you add your preferred contact information for anybody who wants to get in touch or follow along with the work that you're doing.
[00:53:03] Unknown:
And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. The biggest gap? Interesting. I don't wanna say the same thing over and over again, but I just don't think it's a tool and technology problem. You're not gonna solve the real problem with a faster query or a faster way to write code, and I don't think a better predictive model is going to change the world. I think all those things pale in comparison to the systemic problems that DataOps addresses. And so I think things are better; I love working in the cloud versus a proprietary database, and I think there are all sorts of cool things that make doing machine learning better and easier for mere mortals, which I think is gonna improve the world. But the way I look at it is really,
[00:53:54] Unknown:
I look at it from a different perspective. Alright. Well, I definitely agree with you on the fact that most technical problems are actually just problems with how we address our fellow humans. So I definitely encourage everyone to go and try to be the best that they can as far as that. And I appreciate you taking the time today to join me and discuss the work that you're doing, and to help everybody get an understanding of DataOps and how it can fit into their workflows. So thank you for all of that, and I hope you enjoy the rest of your day. Alright. Thank you.
Introduction to Chris Bergh and DataOps
Chris Bergh's Background and Journey
Defining DataOps and Its Evolution
Differences Between DataOps and DevOps
Version Control in DataOps
Challenges and Trends in DataOps Tools
Addressing Data Silos and Unified Analytics
Feedback Loops in Analytics Projects
Importance of Automated Testing in DataOps
Agile Techniques in Data Analytics
Machine Learning and DataOps
Future Trends in Data Analytics
Closing Thoughts and Resources