Summary
The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident, as so many of us do. After becoming engrossed in researching the details of distributed systems and big data management for his work, he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsletter, and the insights that he has gleaned from it.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Joe Crobak about his work maintaining the Data Engineering Weekly newsletter, and the challenges of keeping up with the data engineering industry.
Interview
- Introduction
- How did you get involved in the area of data management?
- What are some of the projects that you have been involved in that were most personally fulfilling?
- As an engineer at the USDS working on the healthcare.gov and medicare systems, what were some of the approaches that you used to manage sensitive data?
- Healthcare.gov has a storied history, how did the systems for processing and managing the data get architected to handle the amount of load that it was subjected to?
- What was your motivation for starting a newsletter about the Hadoop space?
- Can you speak to your reasoning for the recent rebranding of the newsletter?
- How much of the content that you surface in your newsletter is found during your day-to-day work, versus explicitly searching for it?
- After over 5 years of following the trends in data analytics and data infrastructure what are some of the most interesting or surprising developments?
- What have you found to be the fundamental skills or areas of experience that have maintained relevance as new technologies in data engineering have emerged?
- What is your workflow for finding and curating the content that goes into your newsletter?
- What is your personal algorithm for filtering which articles, tools, or commentary gets added to the final newsletter?
- How has your experience managing the newsletter influenced your areas of focus in your work and vice-versa?
- What are your plans going forward?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- USDS
- National Labs
- Cray
- Amazon EMR (Elastic MapReduce)
- Recommendation Engine
- Netflix Prize
- Hadoop
- Cloudera
- Puppet
- healthcare.gov
- Medicare
- Quality Payment Program
- HIPAA
- NIST National Institute of Standards and Technology
- PII (Personally Identifiable Information)
- Threat Modeling
- JBoss
- Apache Web Server
- MarkLogic
- JMS (Java Message Service)
- Load Balancer
- COBOL
- Hadoop Weekly
- Data Engineering Weekly
- Foursquare
- NiFi
- Kubernetes
- Spark
- Flink
- Stream Processing
- DataStax
- RSS
- The Flavors of Data Science and Engineering
- CQRS
- Change Data Capture
- Jay Kreps
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute, and go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey, and today I'm interviewing Joe Crobak about his work maintaining the Data Engineering Weekly newsletter and the challenges of keeping up with the data engineering industry. So Joe, could you start by introducing yourself?
[00:00:55] Unknown:
Sure. I'm Joe Crobak. I have been a software engineer for the past decade or so, mostly working at startups in New York City. And more recently, I've worked for the US federal government at the United States Digital Service, based out of Washington, DC. And my experience in industry is in the big data space for the most part, as well as server side, service based APIs and DevOps and things along those lines.
[00:01:33] Unknown:
And how did you first get involved in the big data space and data management in general?
[00:01:39] Unknown:
Yeah. So that actually dates all the way back to my undergrad CS degree. I was working with a professor on graph algorithms and got involved in a project at the National Labs in New Mexico working on a Cray supercomputer. You know, this was over a decade ago, and the server had 40 gigabytes of RAM, which was kind of revolutionary at the time, and was able to run tens of thousands of hardware threads in parallel. And so that was kind of my first foray into big data, and it's where I became a little obsessed with trying to make things run faster and debug complex systems.
And then more professionally, I was working as a Java developer while pursuing a master's degree in parallel. And I did a project in my grad program where I used Elastic MapReduce on Amazon Web Services to build a recommendation engine. This was back in the days of the Netflix recommendation challenge. And that ended up lining up pretty well with some work I was doing at the time for an ad network that I had gone to as part of an acquisition. At that ad network, I was on the team that was responsible for our ad operations and ad optimizations, and I ended up being pretty heavily involved in rolling out our Hadoop cluster, both Elastic MapReduce and, later, Cloudera running in Amazon Web Services.
And most of the team, I would say, was focused on the ad optimization algorithms, and me and a few other people were doing DevOps, things like Puppet, and making sure we had a good solid workflow engine, rather than just cron running on a time schedule. You know, this job starts at 1 AM, this next job runs at 4 AM even if the first one failed, that kind of thing. So it was kind of that confluence that got me into big data and data management, and I worked on similar challenges at a few other startups after that as well.
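To make the cron-versus-workflow-engine point concrete, here is a minimal sketch, in Python, of a dependency-aware runner where a downstream job is skipped if its upstream job fails. The job names and functions are hypothetical illustrations, not anything from the ad network's actual system.

```python
# Minimal sketch: dependency-aware job running instead of fixed cron times.
# Job names and bodies are hypothetical, for illustration only.
from typing import Callable, Dict, List

def extract_logs() -> bool:
    print("extracting raw ad logs")
    return True  # pretend the job succeeded

def build_report() -> bool:
    print("aggregating into daily report")
    return True

# Each job lists the jobs it depends on; a job only runs if all of its
# upstream dependencies succeeded, unlike two cron entries at 1 AM and 4 AM
# where the second fires even if the first failed.
JOBS: Dict[str, Callable[[], bool]] = {
    "extract_logs": extract_logs,
    "build_report": build_report,
}
DEPENDS_ON: Dict[str, List[str]] = {
    "extract_logs": [],
    "build_report": ["extract_logs"],
}

def run_pipeline() -> None:
    succeeded: set = set()
    for name in ["extract_logs", "build_report"]:  # topological order
        if all(dep in succeeded for dep in DEPENDS_ON[name]):
            if JOBS[name]():
                succeeded.add(name)
        else:
            print(f"skipping {name}: upstream dependency failed")

if __name__ == "__main__":
    run_pipeline()
```

Under a plain cron setup the 4 AM job would fire regardless; here the dependency check is what keeps it from processing data that was never produced.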
[00:04:08] Unknown:
It's always interesting seeing the breakdown of engineers between the people who are perfectly willing to manually set up a system, hand tune it once, and then just walk away hoping that everything stays running, versus the people who want to fully automate everything and ensure that you can destroy it and rebuild it at a moment's notice without having to worry about, you know, any failures in the system, by doing the destroy and rebuild cycle over and over until everything works.
[00:04:34] Unknown:
Yeah. I kind of like both of those. It's always nice to be able to destroy and rebuild things from scratch, but also to spend that time upfront to automate it so that when it comes time to put something into production, you don't have to sit there and hand hold it or throw it over the wall to someone else to figure it out. I like to try to automate myself out of a job as much as possible in whatever project I'm doing.
[00:05:01] Unknown:
Yeah. It's nice having that confidence that everything's going to work right if you need to start all over again. And as you mentioned, you were recently working with the US Digital Service, and looking back through some of your blog posts and your postings elsewhere, I noticed that you were involved with the healthcare.gov site and then later working with Medicare systems. So I'm wondering if you can just talk a bit about your experiences there, particularly as it pertains to the ways that you were managing sensitive data and ensuring that it was receiving appropriate levels of protection?
[00:05:39] Unknown:
Yeah. Absolutely. So the short setup there is that I was working at a startup in New York. And as these things go, we ended up shutting down, and I was looking for a job with high impact, especially given the fact that I had been at a startup that was pre launch. It was one of these startups that had a good mission, and I was really excited to see our product get to market, and that never happened. So I ended up working at the Digital Service, first at healthcare.gov as a federal government employee. I mean, most of the people on that project are contractors or vendors working for the government. I worked on that for quite a few months when I first joined, and then ultimately worked on the Medicare modernization effort, this program called the Quality Payment Program, which is all about shifting the way that health care pays doctors, putting more emphasis on quality rather than volume. So rather than getting paid 10 times for 10 x-rays, your doctor gets paid based on your outcomes. Do you stay out of the hospital? Do you get healthier faster? And so, as most people are probably pretty familiar, there's tons and tons of data in healthcare.
And these two programs have very different types of data. Healthcare.gov is mostly consumer data: applying for health insurance and providing the information necessary to do that. And the Medicare system is much more health level data. So claims data about when someone goes to the doctor and the doctor ends up billing Medicare, as well as information about doctors themselves, because the government and most health insurers want to make sure that a doctor is credentialed and that they are up to date and all those types of things and, you know, are not committing fraud. So there's lots of sensitive data in both of these systems.
As is common in a highly regulated industry, there's a lot of compliance. The federal government likes to take it to a whole other level, I think. We've got these 500 page risk management frameworks that come out of NIST, and a lot of times that ends up being a big paperwork exercise. It's things like how often do you patch your system, or even things like how are you suppressing fire if there's a fire in the data center. And I think what's much more interesting to me as a software developer is how do you design a system to be secure and keep people's data safe. And there are a lot of things you can do along those lines.
I'll talk about, I guess, one thing that I think was pretty interesting that we did for the Medicare systems. So this was mostly with provider data. A lot of doctors who are providers have an LLC or a small business, and they're billing Medicare using their social security number. So you've got social security numbers in the system, and you have to keep those safe. You don't want them to propagate everywhere. And there are a couple of strategies, and one of the best ways that you can avoid leaking data is to isolate it to as few systems as possible.
And the way that we did this was to introduce what we call the link key, which is essentially just a random ID that's generated and matched up to a social security number. Then you isolate the social security numbers only to the systems that need them, and everywhere else in the system you just use these opaque link keys. And now you just have one system that you need to protect really well, and in the other systems you don't have to worry about logging and all the other ways that PII can propagate. In practice, it turns out that that can be a really hard thing to do. We didn't get nearly as far down that path as we wanted to.
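To make the link key idea concrete, here is a minimal sketch in Python of how such a mapping might work, assuming a single in-memory vault as the system of record; the class and field names are hypothetical and are not taken from the actual Medicare systems.

```python
# Minimal sketch of the "link key" idea described above: one system of record
# keeps the SSN-to-key mapping, and every other system only ever sees the
# opaque key. The names and in-memory storage are illustrative assumptions.
import secrets

class LinkKeyVault:
    """Holds the only copy of the sensitive identifier."""

    def __init__(self) -> None:
        self._ssn_to_key: dict[str, str] = {}
        self._key_to_ssn: dict[str, str] = {}

    def link_key_for(self, ssn: str) -> str:
        # Reuse an existing key so the same person always maps to one key.
        if ssn not in self._ssn_to_key:
            key = secrets.token_hex(16)  # random, carries no information
            self._ssn_to_key[ssn] = key
            self._key_to_ssn[key] = ssn
        return self._ssn_to_key[ssn]

    def resolve(self, link_key: str) -> str:
        # Only the isolated system that truly needs the SSN should call this.
        return self._key_to_ssn[link_key]

vault = LinkKeyVault()
claim = {"provider": vault.link_key_for("123-45-6789"), "amount_cents": 12500}
# Downstream systems can log and pass around `claim` freely; the SSN never
# leaves the vault, so a leaked log line exposes only the opaque link key.
print(claim)
```

The point of the design is that a compromised downstream service or a careless log statement exposes only the random key, never the social security number itself.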
One of the things that we did instead was to run threat modeling sessions with the teams. And so this is where you get a group of smart people together, you think about all the data systems, you think about all the APIs and all the bits you're shuffling around between systems, and you look at it with an eye for where things can go wrong, where data could be corrupted, where data could be leaked, and you come up with threats. You think about actors. You think about the sensitive pieces of data in your system, which may not be social security numbers. In some cases, it could be credentials to another system, maybe your login system, things like that. So we would run these threat modeling exercises, and at the end you get a big list, and you're able to rank it based on how risky each of those threats is, and you're then able to start, in a more methodical way, to harden your system against those types of attacks.
And so I would say that that's an evolving process, one that you have to do continuously. And I'm not the best security expert, but we had some really smart people on our team who were able to drive that process and really improve the security posture of our systems.
[00:11:30] Unknown:
And on the healthcare.gov side, one of the initial issues that it had at launch was that, the way that the system was architected, it wasn't able to handle the sustained load of all the people who were visiting the site in a short period to sign up for it. And I'm not sure at what point in its life cycle you got involved, but I imagine that there was some work necessary to rearchitect the way that the data was received and processed and managed. So I don't know if you can speak a bit to the resulting architecture that you settled on to make sure that it was able to handle the load that it was subjected to.
[00:12:09] Unknown:
Yeah. So when I arrived at healthcare.gov, it was the third open enrollment season, and I had read the news articles and thought that everything was fixed. And boy, was I wrong. Fast forward two years further down the line to this past fall, and the team really did solve a lot of problems and make the system more resilient. But going back to when I arrived, we showed up and started to get familiar with what was going on, and the system was using a lot of technology that I was not familiar with. Or let me say that it was the enterprise version of a lot of technology that maybe I was familiar with. So an example of that would be Apache web server. Instead of, you know, the httpd that you're installing off of a yum repo or something like that, we're using the enterprise JBoss server, which bundles its own version of Apache and then talks between the Java web server and the Apache process using a custom protocol. So I show up, and immediately one of the systems that catches everyone's eye, once you're able to wade your way through the hundreds or thousands of servers that are part of the system, is the database engines, the storage services. And one of the two main ones is a system called MarkLogic, which is a distributed database built around an XML data model, or XML processing, I guess, is a better way to say it. That sounds terrifying.
It is. And it gets more and more terrifying, because the way that the application servers interacted with this XML database was pretty inefficient. And to put some context on it, the way that the system was designed was that it kept a change log, for auditing purposes, of all the different things you did as part of your application for health insurance. And if you think about representing that as XML, it starts to get big pretty fast. In fact, I've been told that there were bugs along the way where people would end up with 100 megabyte documents, because there were some quadratic bugs in adding entries into the change log. But even without those bugs, you're talking about documents that are maybe in the hundreds of kilobytes or even up to a megabyte. And, not a fault of the database, but the way it was architected, the ORM that was doing this, well, I guess it's not a relational mapping, but the mapping from Java objects in the app server to XML, really only supported, or was only implemented, one way, which was to take an entire XML document, deserialize it into a Java object, add a new entry to the change log, and then serialize that back to XML and put it on the wire back to the database. So immediately you're saturating your network. You're sending hundreds of kilobytes of data across the network for every click on the website. You're then writing that to disk, doing lots and lots of disk IO. Of course, Java is doing all these translations, so you're pegging CPU, and you're also exhausting memory on your app servers. So the hardware footprint to handle this system is kind of large for the requests per second, the queries per second, it's supporting.
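Here is a rough sketch of the read-modify-write pattern being described, next to the append-only alternative it is implicitly contrasted with. The document shape and function names are invented for illustration; the real system was Java talking to MarkLogic, not Python.

```python
# Sketch of "ship the whole document for every change" versus "append only
# the delta". Everything here (document shape, function names) is made up.
import xml.etree.ElementTree as ET

def append_entry_whole_document(db: dict, app_id: str, entry: str) -> None:
    # Anti-pattern: pull the entire (potentially huge) document over the
    # wire, parse it, append one entry, reserialize, and write it all back.
    doc = ET.fromstring(db[app_id])                     # full read + parse
    ET.SubElement(doc, "change").text = entry           # the only real change
    db[app_id] = ET.tostring(doc, encoding="unicode")   # full write

def append_entry_delta_only(changelog: list, app_id: str, entry: str) -> None:
    # Alternative: treat the audit trail as an append-only log and send only
    # the new entry, so network and disk I/O stay proportional to the change.
    changelog.append((app_id, entry))

db = {"app-1": "<application><change>created</change></application>"}
append_entry_whole_document(db, "app-1", "updated income")

log: list = []
append_entry_delta_only(log, "app-1", "updated income")
```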
And it's very close to kind of falling over. Any little perturbance in the force will cause everything to cascade, and there are so many moving pieces because of the way the system was designed to match the government's technical reference architecture. There were JMS queues in between systems. There were load balancers on load balancers on load balancers. I think I counted at one point in time that there were 8 hops before you got to the app server, within the data center. So you're talking about lots of complexity, and it becomes really easy to make the system fall over. And on top of that, of course, you have lots of batch processing to do on the back end. A lot of the way the healthcare system works, you know, you get people in who are enrolling on healthcare.gov.
Well, there has to be some translation then of sending those applications over to the insurance companies. So that was done via nightly batch jobs, and you had to then copy data out of this MarkLogic cluster, eventually into a Hadoop cluster. And you ran into all kinds of issues where batch jobs would stack up on each other, and no one had very good visibility into what was running when and what team was running things, because you're talking about dozens if not more contractors working on this system and trying to coordinate with each other. And it's a little bit of a chaotic setting.
So a lot of what we did the first year we were there was just try to understand who the key players were, get an inventory of who was pressing what button on what batch processing was happening, because, unfortunately, a lot of the batch processing had to touch the production system, and try to get out ahead of some of these issues. We also, of course, did things like rebalancing the cluster to try to eliminate hot spots. But as you can imagine, with a service that's kind of pushed to the max, you have a situation where one node on the cluster can bring down the entire system if it gets taxed too hard. So as for the way that a lot of these issues were resolved, unfortunately, the timelines to do new development didn't allow us to completely rearchitect it. And it's one of these things where I think some of us who came in from industry pitched something that was much more aggressive in terms of rearchitecting and taking load off of this XML database and onto, you know, a more traditional relational database that could probably handle the load perfectly fine.
But ultimately, the people involved decided to more or less stay the course on the architecture and make changes around the periphery. So they really got these batch processing issues under control. They were able to work with the database vendor to fix some scary distributed locking problems that would take the website down, which made it much more stable. And they are, I believe, two years now into a project to kind of put APIs in front of a lot of these systems, so that once they have, you know, an API instead of this monolithic Java app, they'll then be able to start to peel pieces off of the current database and put them onto something new. So lots to unpack there, but it's certainly interesting and eye opening, and an architecture that probably none of us would design, but kind of something that was shoehorned into a design by committee type situation.
[00:19:48] Unknown:
Yeah, it's always interesting being exposed to some of these types of architectures. Coming from a perspective of all of these open source options and massively scalable systems and, sort of, community advocated best practices for doing things, and going into something that has those more stringent and structured requirements, where everything needs to be able to match to a certain specification, that will eventually mutate the actual technical architecture. And seeing the things that result gives you an idea of what is actually possible when you are forced to work within certain constraints.
[00:20:32] Unknown:
Yeah. Absolutely. Although, as I've told some of my friends who work for some of these open source vendors, I think the government is a great customer to have, because they're not afraid to use open source. They're afraid to use open source without a vendor contract. So I think if you can get your product in front of the right people, you have a good opportunity to bring some great technology to someone who's accountable for every system in the program.
[00:21:13] Unknown:
Yeah. It also speaks to the requirement for long term sustainability of some of these projects, where we have new database systems coming out every month, new ways of processing data, and pressure to implement these new technologies. But for a system where you have a mandated time horizon that's potentially, you know, measured in the span of decades, you need to be very careful in the choices that you make for a technology to build and support these organizations and these platforms.
[00:21:49] Unknown:
Oh, yeah. Absolutely. On the Medicare side, much of Medicare claims processing runs on COBOL mainframes that were implemented in the eighties. And good choice, there are still a lot of mainframes in the world, believe it or not. But that's becoming a problem, because it turns out it's hard to find COBOL programmers nowadays. But these systems were designed and have lasted 40 years, and they're sitting there processing data that pays out trillions of dollars. Medicare itself is 3.4% of the US economy. So sometimes older or boring technology can be the right solution.
[00:22:36] Unknown:
Yeah, there's a lot of discussion, particularly in operations, where, you know, having boring systems is a good thing, because it means that you can sleep well at night knowing that you're not going to get paged with a system failure because you have some new unproven technology that's being used in production.
[00:22:54] Unknown:
Yeah. I agree on that. Of course, no one wants to be the one maintaining a 30 year old system either.
[00:23:02] Unknown:
And now, in the process of your work, you know, getting involved in these large data processing systems and working with Hadoop, you somewhere along the line decided that it would be a good idea to start a weekly newsletter detailing some of the technologies and articles pertaining to the Hadoop ecosystem. So I don't know if you can speak a bit about the decision that led you to create that newsletter.
[00:23:31] Unknown:
Yeah. So as I was saying when I was doing my introduction, I kind of fell into the Hadoop, the big data ecosystem just through being at the right place at the right time, like I think a lot of people did. And I was feeling overwhelmed with how much was out there and how many products there were. It seems like it's exploded exponentially since, but at the time, I mean, Hadoop distributions had 6 or 7 different components to them. And so my solution to that was to try to read as much as I could. I was following everyone, every blog I could find, whether it was a, you know, a big vendor or LinkedIn or Twitter. I was reading academic papers that teams were putting out and really trying to absorb everything I could, just so I could feel like I had my head above water and could stay on top of the industry. And also, you know, I kind of turned the page at some point. I was solving problems when I was working at Foursquare, and I knew that other people had solved these problems. So it turned into more of a how do I apply what other teams are doing to the problems I'm facing. And so I found myself reading 20 or so articles a week, probably, and at some point, you know, I would send a couple around internally on mailing lists with a short summary.
I'd been a big fan of a couple of other newsletters that were out there. And I said, well, I'm already doing all this research and seeing all these articles. Why don't I just turn it into a newsletter and see if other people are interested? And that was how Hadoop Weekly was launched, on the inauguration day of President Obama's second term, back in, what was that, 2013.
[00:25:31] Unknown:
That's funny timing, so that you have an anchor point that you can remember it by, you know, after many years have gone past.
[00:25:39] Unknown:
Yeah. Was that intentional timing or just coincidental? It was coincidental, you know. I had to get the domain name and the Mailchimp account and all those things, and probably did that over Christmas break, and that was the coincidental timing. But, you know, I just celebrated 5 years, and when I looked back on it, I said, wow, funny timing, and kind of relearned that it was the inauguration day. Yeah, that's funny. And a few months ago, you wrote a post
[00:26:08] Unknown:
detailing your decision to rebrand Hadoop Weekly as Data Engineering Weekly, because the space has expanded so far beyond just the Hadoop ecosystem. I don't know if you want to just briefly cover your decision around that.
[00:26:24] Unknown:
Yeah. I think there are a couple of main motivators. Data infrastructure and data engineering have become, well, let me say that I guess most people are not just using Hadoop. They're building all kinds of products with and around Hadoop. And we're starting to see tons of other products, open source or closed source. But really, there's been this momentum shift to things like Kafka and Spark and Flink. You've also got NiFi and Kubernetes. There's been a big push to the cloud, which lets people experiment with these other tools, unlike, you know, when you're kind of in a fixed data center. So we're seeing a lot more adoption of different tools and, along with that, a big emphasis on real time and stream processing.
So when I started the newsletter, I mean, you had your big companies like Oracle and IBM, but in terms of open source software, just about the only enterprise companies out there were selling Hadoop. And I guess you also had DataStax selling, you know, support for Cassandra. Now you have companies around all these different things. And so I found it more and more interesting to focus on the things around Hadoop: how you integrate with legacy systems, how you get data into a serving layer to actually build a product based on the data science that's done on your Hadoop cluster, how you manage the data workflows. And I think the tooling and the ecosystem around Hadoop has matured. A lot of it has caught up with the tools that existed in the relational database world before Hadoop was a big thing.
And it's become much more interesting to focus on those things than on Hadoop, which has now matured and is moving a lot more slowly.
[00:28:23] Unknown:
Yeah. It's starting to become one of the, you know, quote unquote boring systems that just sits there and reliably does what you tell it to, and you don't have to keep paying attention to it to make sure that everything's running as it's supposed to. So it isn't generating as much press, except for when they make their occasional releases with new features, and beyond that it just becomes part of the background, part of the day to day. Yeah. And I think also, you know, you're absolutely right on that. It used to be that you had things like Elastic MapReduce and you could get a
[00:28:58] Unknown:
Hadoop cluster up, and then maybe you could run Pig or Hive. But now there are just so many tools to launch other things and get them up and turn them into boring technology too. So I think that's a good thing. I've operated Hadoop and some of these other tools over the past couple of years, and there was a long time where they weren't boring, and I'm glad to see things mature. But you always have that balance where things start to mature and then you start to see less innovation.
[00:29:34] Unknown:
And in order to generate a new newsletter every week, you must be reading through a lot of articles, keeping track of a lot of different projects as they iterate, and watching videos. So what is your workflow for being able to keep track of all of the different things that are happening in the data engineering ecosystem?
[00:29:58] Unknown:
Yeah. So it's changed a little bit over the years. I think at this point, I get a lot of my news kind of pushed to me. I've curated a couple of different channels, whether it be mailing lists or the folks I follow on Twitter, the, you know, accounts that I follow on Twitter. I have a throwback where I'm following a bunch of RSS feeds of different companies or blogs out there. More recently, I've been focusing a lot on Medium. It seems like that's where a lot of people tend to publish or cross publish their posts. And so I have this flow of data that's coming at me throughout the week, and, you know, I bookmark interesting articles when I see them.
Usually, I don't get a chance to read everything until, you know, Saturday or Sunday, the weekend, when I'm actually compiling the newsletter. But that's kind of the current state of things. It used to be that maybe there was more emphasis on Twitter. At one point in time, I had written a script to go and find all the Hadoop related posts out there, deduping against stuff I had already seen. That turned into a war between me and the people posting Hadoop related jobs on Twitter, and ultimately, they won. So I kind of gave up on that script.
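As an aside, the dedup step in a script like that can be as simple as normalizing each candidate URL and checking it against a set of links that have already appeared. This is only a rough sketch of the idea, not Joe's actual script, and the normalization rules are assumptions.

```python
# Rough sketch: skip any candidate link already covered in a past issue.
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    parts = urlsplit(url.strip().lower())
    # Drop query strings and fragments so tracking parameters (utm_source,
    # etc.) don't make the same article look new.
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))

def fresh_links(candidates: list[str], seen: set[str]) -> list[str]:
    out = []
    for url in candidates:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            out.append(url)
    return out

seen_before = {normalize("https://example.com/hadoop-post")}
print(fresh_links(
    ["https://example.com/hadoop-post?utm_source=twitter",
     "https://example.com/new-flink-article"],
    seen_before,
))  # only the second, previously unseen article survives
```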
But I try to evolve with the communication channels, and they definitely have changed over the past 5 years. The mailing lists and Twitter have been the two mainstays, and even things like RSS feeds and Medium have been a little more recent.
[00:31:51] Unknown:
And as you're going through all of these different posts and articles, etcetera, are there any particular questions that you ask yourself about the content to determine whether you want to include it in a given issue? I'm curious if you have a particular sort of focus or purpose behind the way that you are selecting these different pieces.
[00:32:15] Unknown:
I try, for the most part, to anticipate what my readers would like, and I have to admit that that's kind of an informal feedback loop. I do hear from people occasionally. I probably hear more from people who like what I'm curating than from those who don't. And so the main question that I get at, or the main way I try to answer that question, is just: do I like the article? If there's an article where I feel like I'm getting bored, or maybe it's not making sense, maybe I'm having trouble following the post, a lot of times I will try to do some background research, but if it turns out to be something I'm not interested in, then that's kind of the first strike. And my background is much more on, as we've talked about, the data engineering, the back end, the DevOps side.
So that is what I tend to cover, and probably a little bit of what people would maybe call machine learning engineering. There was recently a post covered in my newsletter, and I can give you the link for the show notes, called The Flavors of Data Science and Engineering, and it was kind of this 8 circle or 9 circle Venn diagram of all the different roles and how they overlap. And I think the articles that I cover are pretty squarely on the data engineering, back end, DevOps side of things. That's the high level way that I narrow it down.
There are a couple of other filters, and I say filters because, honestly, every week I start out with probably 40 or 50 links that I'm evaluating to include in the newsletter. If you post an article, you could also just get really unlucky, where there are 15 or 20 other really good links that week and yours kind of fell off. But the other filters I use are: if I'm reading something from a vendor, is it just a vendor sales pitch, or is there something technically interesting in there that maybe is not vendor specific? I look for, is this blog post just a regurgitation of the documentation for the tool? You'll see a lot of people who maybe went through and just set up Spark or Hive or something for the first time and write a blog post about that. And sometimes they find something new and interesting, but a lot of times it's almost exactly the same steps that you would find in the tutorial on the website. And then, kind of on the other side of things, you sometimes find these really niche articles that are about a tuning option of a proprietary database.
And they just tell you what the option is and how to set it, and they don't really tell you what's happening under the hood. Those kinds of niche ones are the ones that will also get filtered out. I try to cast a wide net and be fair. I think I'm following most of the major vendors and some smaller ones in my RSS feeds. I have gotten some flak in the past when I, not on purpose but inadvertently, maybe missed a big blog post from one of your favorite vendors. I'll get some angry emails. So I try to give everyone credit for all the hard work they're doing and include them in the newsletter as much as possible.
[00:36:06] Unknown:
Yeah. It's difficult being the person who is, in some sense, responsible for providing a lens on the industry, as somebody who's producing a podcast on the space. So every time I go to my list of interview ideas, I have to look at it thinking through, you know, is this a topic that is going to be interesting enough to talk about for 45 minutes to an hour? Is it something that I'm going to learn something from, that my audience members are going to learn something from? And, as you mentioned, sometimes you have to be concerned about whether the conversation is just going to turn into a vendor pitch without enough technical aspects to it. So it's always challenging to try and think about things beyond just, is this something I'm interested in, and figure out whether or not it is more broadly applicable to the people who are subscribing to your newsletter or my podcast or things like that. Yeah. And I think,
[00:37:01] Unknown:
this is probably a call out from both of us. If anyone out there has feedback for me, I always love to hear it. I'm reachable on Twitter, or you can email info at data eng weekly dot com. I'm always open to feedback, or if you have articles to send my way, I love to get proposed articles from folks out there. And then, Tobias, maybe you and me should form a support group to help review each other's,
[00:37:29] Unknown:
Absolutely. Yeah. Having that deadline of, every week I need to release a new episode, it's often difficult to keep coming up with something new and interesting for the week. So I'm always happy to get input on suggestions for show ideas or feedback on past episodes about what went well or what didn't. So, as you mentioned, having that feedback cycle is very helpful, and particularly with podcasts it's often very asymmetrical: I publish something, somebody consumes it, and that's the end of the interaction. So closing that loop every now and then is very useful. Yeah. Absolutely.
And having worked on curating the newsletter for the past 5 years, I imagine that there's been a lot of back and forth in terms of influence, both from your interests on the newsletter and from the newsletter on shaping your particular interests. I don't know if you want to talk a bit about that and how it has sort of led you to your current thinking on what your plans might be going forward.
[00:38:30] Unknown:
Right. So I have nearly stopped publishing the newsletter a couple of times. It is a lot of work. I enjoy doing it, and ultimately I've made a pretty big investment going forward in trying to make it work. And I think the big reason that I've done that is that my professional career has kind of shifted. I used to, as we were talking about, work pretty deep in the weeds on all of the big data technologies. I was deploying Kafka, deploying Hadoop, HBase, all these tools, working pretty hands on with a lot of DevOps. And then when I went to work for the government, my role was very different. It was mostly to be the technical voice in the room, to set expectations with executives that, no, really, we don't need to spend $10,000,000 on a data center to, you know, run a web app that's going to see 10,000 users a month, or something like that. So the newsletter has been a way for me to stay connected to the industry that I very much enjoy.
It gives me ideas for side projects all the time. Sometimes I get to work on those; a lot of times I don't. But my interest in and relationship with the newsletter has definitely evolved over the years. It's much more of an outlet for me now to get more technical and to, you know, occasionally read a more academic paper that's probably pretty dense and in the weeds, but also to see what tools are out there and stay on top of that as things evolve. My current excitement is with the shift to things like stream processing, and having tools like Kafka and Flink to really be able to implement some of these patterns that have existed for a while, like CQRS and change data capture.
Concepts that have been around for a while, but we haven't really had good tools to implement them. And so it's been really exciting to see, you know, those go from a blog post by Jay Kreps about, you know, the log, to people now actually building systems, like at the New York Times, based on these types of concepts and these architectures, which are not necessarily novel but are enabled, in a lot of cases for the first time, by this shift in the tools that we have.
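For readers newer to these patterns, here is a minimal, library-free sketch of the log-centric idea being described: writes go to an append-only change log (the role Kafka plays in practice), and read models are built, or rebuilt from scratch, by replaying it. The event names and fields are invented for illustration.

```python
# Minimal sketch of a log-centric (CQRS / change-data-capture flavored)
# design: an append-only log of changes, plus a read model derived from it.
from dataclasses import dataclass, field

@dataclass
class ChangeLog:
    events: list = field(default_factory=list)

    def append(self, event: dict) -> None:
        self.events.append(event)  # the write side only ever appends

def build_read_model(log: ChangeLog) -> dict:
    # The read side is a pure function of the log, so it can be rebuilt from
    # scratch at any time or kept up to date by consuming new events as they
    # arrive (which is what a stream processor would do against Kafka).
    accounts: dict = {}
    for e in log.events:
        if e["type"] == "account_created":
            accounts[e["id"]] = {"email": e["email"]}
        elif e["type"] == "email_changed":
            accounts[e["id"]]["email"] = e["email"]
    return accounts

log = ChangeLog()
log.append({"type": "account_created", "id": "u1", "email": "a@example.com"})
log.append({"type": "email_changed", "id": "u1", "email": "b@example.com"})
print(build_read_model(log))  # {'u1': {'email': 'b@example.com'}}
```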
[00:41:32] Unknown:
Alright. And are there any other topics that we should discuss further before we start to close out the show? I think we covered a lot, so it was good. Okay. So for anybody who wants to get in touch with you or subscribe to the newsletter or follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, from your perspective as somebody who is keeping up with the data engineering space, what do you see as being the biggest gap in the tooling or technology that's available for data management today? I wish I had a really good and insightful answer to this question.
[00:42:08] Unknown:
I think that the thing that's been most striking to me in the data infrastructure tooling, and probably the most frustrating as well, is that we seem to always have 2 or 3 or 4 tools to do the same job. A prime example of that is workflow engines, where all of them do it kind of okay, but nobody does it really well. And I do think we're starting to see some consolidation, but, you know, a lot of the vendors have their momentum behind one tool and not the others. So hopefully we'll start to see some more consolidation, because I do think that's an area where we'll start to see vast improvements if, you know, you have people working towards the same goal on the same software project instead of multiple software projects
[00:43:01] Unknown:
functionally doing the same thing. Alright. Well, thank you again for taking the time today to join me. I have been subscribed to your newsletter for at least a few months now, and it has been very helpful for me as I try to find topics for the podcast. So thank you for that. Keep up the good work, and I hope you enjoy the rest of your day. Thank you, and thanks for having me. It was a lot of fun.
Introduction to Joe Crobak and His Background
Joe's Journey into Big Data
Working with US Digital Services
Challenges and Solutions at Healthcare.gov
Starting the Data Engineering Weekly Newsletter
Rebranding to Data Engineering Weekly
Curating Content for the Newsletter
Impact of the Newsletter on Joe's Career
Closing Thoughts and Future of Data Engineering