Summary
The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident, as so many of us do. After becoming engrossed in researching the details of distributed systems and big data management for his work, he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsletter, and the insights that he has gleaned from it.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Joe Crobak about his work maintaining the Data Engineering Weekly newsletter, and the challenges of keeping up with the data engineering industry.
Interview
- Introduction
- How did you get involved in the area of data management?
- What are some of the projects that you have been involved in that were most personally fulfilling?
- As an engineer at the USDS working on the healthcare.gov and medicare systems, what were some of the approaches that you used to manage sensitive data?
- Healthcare.gov has a storied history, how did the systems for processing and managing the data get architected to handle the amount of load that it was subjected to?
- What was your motivation for starting a newsletter about the Hadoop space?
- Can you speak to your reasoning for the recent rebranding of the newsletter?
- How much of the content that you surface in your newsletter is found during your day-to-day work, versus explicitly searching for it?
- After over 5 years of following the trends in data analytics and data infrastructure what are some of the most interesting or surprising developments?
- What have you found to be the fundamental skills or areas of experience that have maintained relevance as new technologies in data engineering have emerged?
- What is your workflow for finding and curating the content that goes into your newsletter?
- What is your personal algorithm for filtering which articles, tools, or commentary gets added to the final newsletter?
- How has your experience managing the newsletter influenced your areas of focus in your work and vice-versa?
- What are your plans going forward?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- USDS
- National Labs
- Cray
- Amazon EMR (Elastic MapReduce)
- Recommendation Engine
- Netflix Prize
- Hadoop
- Cloudera
- Puppet
- healthcare.gov
- Medicare
- Quality Payment Program
- HIPAA
- NIST National Institute of Standards and Technology
- PII (Personally Identifiable Information)
- Threat Modeling
- JBoss
- Apache Web Server
- MarkLogic
- JMS (Java Message Service)
- Load Balancer
- COBOL
- Hadoop Weekly
- Data Engineering Weekly
- Foursquare
- NiFi
- Kubernetes
- Spark
- Flink
- Stream Processing
- DataStax
- RSS
- The Flavors of Data Science and Engineering
- CQRS
- Change Data Capture
- Jay Kreps
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute, and go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey, and today I'm interviewing Joe Crobak about his work maintaining the Data Engineering Weekly newsletter and the challenges of keeping up with the data engineering industry. So Joe, could you start by introducing yourself?
[00:00:55] Unknown:
Sure. I'm Joe Crobak. I have been a software engineer for the past decade or so, mostly working at startups in New York City. And more recently, I've worked for the US federal government at the United States Digital Service, based out of Washington, DC. And my experience in industry is in the big data space for the most part, as well as server side, service based APIs and DevOps and things along those lines.
[00:01:33] Unknown:
And how did you first get involved in the big data space and data management in general?
[00:01:39] Unknown:
Yeah. So that actually dates all the way back to my undergrad CS degree. I was working with a professor on graph algorithms and got involved in a project at the National Labs in New Mexico working on a Cray supercomputer. You know, this was over a decade ago, and the server had 40 gigabytes of RAM, which was kind of revolutionary at the time, and was able to run tens of thousands of hardware threads in parallel. And so that was kind of my first foray into big data, and it's where I became a little obsessed with trying to make things run faster and debug complex systems.
And then more professionally, I was working as a Java developer while pursuing a master's degree in parallel. And I did a project in my grad program where I used Elastic MapReduce on Amazon Web Services to build a recommendation engine. This was back in the days of the Netflix recommendation challenge. And that ended up lining up pretty well with some work I was doing at the time for an ad network that I had gone to as part of an acquisition. At that ad network, I was on the team that was responsible for our ad operations and ad optimizations, and I ended up being pretty heavily involved in rolling out our Hadoop cluster, both Elastic MapReduce and, later, Cloudera running in Amazon Web Services.
And most of the team, I would say, was focused on the ad optimization algorithms, and me and a few other people were doing DevOps, things like Puppet, and making sure we had a good solid workflow engine, rather than just cron running on a time schedule. You know, this job starts at 1 AM, this next job runs at 4 AM even if the first one failed, that kind of thing. So it was kind of that confluence that got me into big data and data management, and I worked on similar challenges at a few other startups after that as well.
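To make the cron-versus-workflow-engine point concrete, here is a minimal sketch, in Python, of a dependency-aware runner where a downstream job is skipped if its upstream job fails. The job names and functions are hypothetical illustrations, not anything from the ad network's actual system.

```python
# Minimal sketch: dependency-aware job running instead of fixed cron times.
# Job names and bodies are hypothetical, for illustration only.
from typing import Callable, Dict, List

def extract_logs() -> bool:
    print("extracting raw ad logs")
    return True  # pretend the job succeeded

def build_report() -> bool:
    print("aggregating into daily report")
    return True

# Each job lists the jobs it depends on; a job only runs if all of its
# upstream dependencies succeeded, unlike two cron entries at 1 AM and 4 AM
# where the second fires even if the first failed.
JOBS: Dict[str, Callable[[], bool]] = {
    "extract_logs": extract_logs,
    "build_report": build_report,
}
DEPENDS_ON: Dict[str, List[str]] = {
    "extract_logs": [],
    "build_report": ["extract_logs"],
}

def run_pipeline() -> None:
    succeeded: set = set()
    for name in ["extract_logs", "build_report"]:  # topological order
        if all(dep in succeeded for dep in DEPENDS_ON[name]):
            if JOBS[name]():
                succeeded.add(name)
        else:
            print(f"skipping {name}: upstream dependency failed")

if __name__ == "__main__":
    run_pipeline()
```

Under a plain cron setup the 4 AM job would fire regardless; here the dependency check is what keeps it from processing data that was never produced.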
[00:04:08] Unknown:
It's always interesting seeing the breakdown of engineers between the people who are perfectly willing to manually set up a system, hand tune it once, and then just walk away hoping that everything stays running, versus the people who want to fully automate everything and ensure that you can destroy it and rebuild it at a moment's notice without having to worry about, you know, any failures in the system, by doing the destroy and rebuild cycle over and over until everything works.
[00:04:34] Unknown:
Yeah. I kind of like both of those. It's always nice to be able to destroy and rebuild things from scratch, but also to spend that time upfront to automate it so that when it comes time to put something into production, you don't have to sit there and hand hold it or throw it over the wall to someone else to figure it out. I like to try to automate myself out of a job as much as possible in whatever project I'm doing.
[00:05:01] Unknown:
Yeah. It's nice having that confidence that everything's going to work right if you need to start all over again. And as you mentioned, you were recently working with the US Digital Service, and looking back through some of your blog posts and your postings elsewhere, I noticed that you were involved with the healthcare.gov site and then later working with Medicare systems. So I'm wondering if you can just talk a bit about your experiences there, particularly as it pertains to the ways that you were managing sensitive data and ensuring that it was receiving appropriate levels of protection?
[00:05:39] Unknown:
Yeah. Absolutely. So the short setup there is that I was working at a startup in New York. And as these things go, we ended up shutting down, and I was looking for a job with high impact, especially given the fact that I had been at a startup that was pre launch. It was one of these startups that had a good mission, and I was really excited to see our product get to market, and that never happened. So I ended up working at the Digital Service, first at healthcare.gov as a federal government employee. I mean, most of the people on that project are contractors or vendors working for the government. I worked on that for quite a few months when I first joined, and then ultimately worked on the Medicare modernization effort, this program called the Quality Payment Program, which is all about shifting the way that health care pays doctors, putting more emphasis on quality rather than volume. So rather than getting paid 10 times for 10 x-rays, your doctor gets paid based on your outcomes. Do you stay out of the hospital? Do you get healthier faster? And so, as most people are probably pretty familiar, there's tons and tons of data in healthcare.
And these two programs have very different types of data. Healthcare.gov is mostly consumer data: applying for health insurance and providing the information necessary to do that. And the Medicare system is much more health level data. So claims data about when someone goes to the doctor and the doctor ends up billing Medicare, as well as information about doctors themselves, because the government and most health insurers want to make sure that a doctor is credentialed and that they are up to date and all those types of things and, you know, are not committing fraud. So there's lots of sensitive data in both of these systems.
As is common in a highly regulated industry, there's a lot of compliance. The federal government likes to take it to a whole other level, I think. We've got these 500 page risk management frameworks that come out of NIST, and a lot of times that ends up being a big paperwork exercise. It's things like how often do you patch your system, or even things like how are you suppressing fire if there's a fire in the data center. And I think what's much more interesting to me as a software developer is how do you design a system to be secure and keep people's data safe. And there are a lot of things you can do along those lines.
I'll talk about, I guess, one thing that I think was pretty interesting that we did for the Medicare systems. So this was mostly with provider data. A lot of doctors who are providers have an LLC or a small business, and they're billing Medicare using their social security number. So you've got social security numbers in the system, and you have to keep those safe. You don't want them to propagate everywhere. And there are a couple of strategies, and one of the best ways that you can avoid leaking data is to isolate it to as few systems as possible.
And the way that we did this was to introduce what we call the link key, which is essentially just a random ID that's generated and matched up to a social security number. Then you isolate the social security numbers only to the systems that need them, and everywhere else in the system you just use these opaque link keys. And now you just have one system that you need to protect really well, and in the other systems you don't have to worry about logging and all the other ways that PII can propagate. In practice, it turns out that that can be a really hard thing to do. We didn't get nearly as far down that path as we wanted to.
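To make the link key idea concrete, here is a minimal sketch in Python of how such a mapping might work, assuming a single in-memory vault as the system of record; the class and field names are hypothetical and are not taken from the actual Medicare systems.

```python
# Minimal sketch of the "link key" idea described above: one system of record
# keeps the SSN-to-key mapping, and every other system only ever sees the
# opaque key. The names and in-memory storage are illustrative assumptions.
import secrets

class LinkKeyVault:
    """Holds the only copy of the sensitive identifier."""

    def __init__(self) -> None:
        self._ssn_to_key: dict[str, str] = {}
        self._key_to_ssn: dict[str, str] = {}

    def link_key_for(self, ssn: str) -> str:
        # Reuse an existing key so the same person always maps to one key.
        if ssn not in self._ssn_to_key:
            key = secrets.token_hex(16)  # random, carries no information
            self._ssn_to_key[ssn] = key
            self._key_to_ssn[key] = ssn
        return self._ssn_to_key[ssn]

    def resolve(self, link_key: str) -> str:
        # Only the isolated system that truly needs the SSN should call this.
        return self._key_to_ssn[link_key]

vault = LinkKeyVault()
claim = {"provider": vault.link_key_for("123-45-6789"), "amount_cents": 12500}
# Downstream systems can log and pass around `claim` freely; the SSN never
# leaves the vault, so a leaked log line exposes only the opaque link key.
print(claim)
```

The point of the design is that a compromised downstream service or a careless log statement exposes only the random key, never the social security number itself.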
One of the things that we did instead was to run threat modeling sessions with the teams. And so this is where you get a group of smart people together, you think about all the data systems, you think about all the APIs and all the bits you're shuffling around between systems, and you look at it with an eye for where things can go wrong, where data could be corrupted, where data could be leaked, and you come up with threats. You think about actors. You think about the sensitive pieces of data in your system, which may not be social security numbers. In some cases, it could be credentials to another system, maybe your login system, things like that. So we would run these threat modeling exercises, and at the end you get a big list, and you're able to rank it based on how risky each of those threats is, and you're then able to start, in a more methodical way, to harden your system against those types of attacks.
And so I would say that that's an evolving process, one that you have to do continuously. And I'm not the best security expert, but we had some really smart people on our team who were able to drive that process and really improve the security posture of our systems.
[00:11:30] Unknown:
And on the healthcare.gov side, one of the initial issues that it had at launch was that, the way that the system was architected, it wasn't able to handle the sustained load of all the people who were visiting the site in a short period to sign up for it. And I'm not sure at what point in its life cycle you got involved, but I imagine that there was some work necessary to rearchitect the way that the data was received and processed and managed. So I don't know if you can speak a bit to the resulting architecture that you settled on to make sure that it was able to handle the load that it was subjected to.
[00:12:09] Unknown:
Yeah. So when I arrived at healthcare.gov, it was the third open enrollment season, and I had read the news articles and thought that everything was fixed. And boy, was I wrong. Fast forward two years further down the line to this past fall, and the team really did solve a lot of problems and make the system more resilient. But going back to when I arrived, we showed up and started to get familiar with what was going on, and the system was using a lot of technology that I was not familiar with. Or let me say that it was the enterprise version of a lot of technology that maybe I was familiar with. So an example of that would be Apache web server. Instead of, you know, the httpd that you're installing off of a yum repo or something like that, we're using the enterprise JBoss server, which bundles its own version of Apache and then talks between the Java web server and the Apache process using a custom protocol. So I show up, and immediately one of the systems that catches everyone's eye, once you're able to wade your way through the hundreds or thousands of servers that are part of the system, is the database engines, the storage services. And one of the two main ones is a system called MarkLogic, which is a distributed database built around an XML data model, or XML processing, I guess, is a better way to say it. That sounds terrifying.
It is. And it gets more and more terrifying, because the way that the application servers interacted with this XML database was pretty inefficient. And to put some context on it, the way that the system was designed was that it kept a change log, for auditing purposes, of all the different things you did as part of your application for health insurance. And if you think about representing that as XML, it starts to get big pretty fast. In fact, I've been told that there were bugs along the way where people would end up with 100 megabyte documents, because there were some quadratic bugs in adding entries into the change log. But even without those bugs, you're talking about documents that are maybe in the hundreds of kilobytes or even up to a megabyte. And, not a fault of the database, but the way it was architected, the ORM that was doing this, well, I guess it's not a relational mapping, but the mapping from Java objects in the app server to XML, really only supported, or was only implemented, one way, which was to take an entire XML document, deserialize it into a Java object, add a new entry to the change log, and then serialize that back to XML and put it on the wire back to the database. So immediately you're saturating your network. You're sending hundreds of kilobytes of data across the network for every click on the website. You're then writing that to disk, doing lots and lots of disk IO. Of course, Java is doing all these translations, so you're pegging CPU, and you're also exhausting memory on your app servers. So the hardware footprint to handle this system is kind of large for the requests per second, the queries per second, it's supporting.
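Here is a rough sketch of the read-modify-write pattern being described, next to the append-only alternative it is implicitly contrasted with. The document shape and function names are invented for illustration; the real system was Java talking to MarkLogic, not Python.

```python
# Sketch of "ship the whole document for every change" versus "append only
# the delta". Everything here (document shape, function names) is made up.
import xml.etree.ElementTree as ET

def append_entry_whole_document(db: dict, app_id: str, entry: str) -> None:
    # Anti-pattern: pull the entire (potentially huge) document over the
    # wire, parse it, append one entry, reserialize, and write it all back.
    doc = ET.fromstring(db[app_id])                     # full read + parse
    ET.SubElement(doc, "change").text = entry           # the only real change
    db[app_id] = ET.tostring(doc, encoding="unicode")   # full write

def append_entry_delta_only(changelog: list, app_id: str, entry: str) -> None:
    # Alternative: treat the audit trail as an append-only log and send only
    # the new entry, so network and disk I/O stay proportional to the change.
    changelog.append((app_id, entry))

db = {"app-1": "<application><change>created</change></application>"}
append_entry_whole_document(db, "app-1", "updated income")

log: list = []
append_entry_delta_only(log, "app-1", "updated income")
```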
And it's very close to kind of falling over. Any little perturbance in the force will cause everything to cascade, and there are so many moving pieces because of the way the system was designed to match the government's technical reference architecture. There were JMS queues in between systems. There were load balancers on load balancers on load balancers. I think I counted at one point in time that there were 8 hops before you got to the app server, within the data center. So you're talking about lots of complexity, and it becomes really easy to make the system fall over. And on top of that, of course, you have lots of batch processing to do on the back end. A lot of the way the healthcare system works, you know, you get people in who are enrolling on healthcare.gov.
Well, there has to be some translation then of sending those applications over to the insurance companies. So that was done via nightly batch jobs, and you had to then copy data out of this MarkLogic cluster, eventually into a Hadoop cluster. And you ran into all kinds of issues where batch jobs would stack up on each other, and no one had very good visibility into what was running when and what team was running things, because you're talking about dozens if not more contractors working on this system and trying to coordinate with each other. And it's a little bit of a chaotic setting.
So a lot of what we did the first year we were there was just try to understand who the key players were, get an inventory of who was pressing what button on what batch processing was happening, because, unfortunately, a lot of the batch processing had to touch the production system, and try to get out ahead of some of these issues. We also, of course, did things like rebalancing the cluster to try to eliminate hot spots. But as you can imagine, with a service that's kind of pushed to the max, you have a situation where one node on the cluster can bring down the entire system if it gets taxed too hard. So as for the way that a lot of these issues were resolved, unfortunately, the timelines to do new development didn't allow us to completely rearchitect it. And it's one of these things where I think some of us who came in from industry pitched something that was much more aggressive in terms of rearchitecting and taking load off of this XML database and onto, you know, a more traditional relational database that could probably handle the load perfectly fine.
But ultimately, the people involved decided to more or less stay the course on the architecture and make changes around the periphery. So they really got these batch processing issues under control. They were able to work with the database vendor to fix some scary distributed locking problems that would take the website down, which made it much more stable. And they are, I believe, two years now into a project to kind of put APIs in front of a lot of these systems, so that once they have, you know, an API instead of this monolithic Java app, they'll then be able to start to peel pieces off of the current database and put them onto something new. So lots to unpack there, but it's certainly interesting and eye opening, and an architecture that probably none of us would design, but kind of something that was shoehorned into a design by committee type situation.
[00:19:48] Unknown:
Yeah, it's always interesting being exposed to some of these types of architectures. Coming from a perspective of all of these open source options and massively scalable systems and, sort of, community advocated best practices for doing things, and going into something that has those more stringent and structured requirements, where everything needs to be able to match to a certain specification, that will eventually mutate the actual technical architecture. And seeing the things that result gives you an idea of what is actually possible when you are forced to work within certain constraints.
[00:20:32] Unknown:
Yeah. Absolutely. Although, as I've told some of my friends who work for some of these open source vendors, I think the government is a great customer to have, because they're not afraid to use open source. They're afraid to use open source without a vendor contract. So I think if you can get your product in front of the right people, you have a good opportunity to bring some great technology to someone who's accountable for every system in the program.
[00:21:13] Unknown:
Yeah. It also speaks to the requirement for long term sustainability of some of these projects, where we have new database systems coming out every month, new ways of processing data, and pressure to implement these new technologies. But for a system where you have a mandated time horizon that's potentially, you know, measured in the span of decades, you need to be very careful in the choices that you make for a technology to build and support these organizations and these platforms.
[00:21:49] Unknown:
Oh, yeah. Absolutely. On the Medicare side, much of Medicare claims processing runs on COBOL mainframes that were implemented in the eighties. And good choice, there are still a lot of mainframes in the world, believe it or not. But that's becoming a problem, because it turns out it's hard to find COBOL programmers nowadays. But these systems were designed and have lasted 40 years, and they're sitting there processing data that pays out trillions of dollars. Medicare itself is 3.4% of the US economy. So sometimes older or boring technology can be the right solution.
[00:22:36] Unknown:
Yeah, there's a lot of discussion, particularly in operations, where, you know, having boring systems is a good thing, because it means that you can sleep well at night knowing that you're not going to get paged with a system failure because you have some new unproven technology that's being used in production.
[00:22:54] Unknown:
Yeah. I agree on that. Of course, no one wants to be the one maintaining a 30 year old system either.
[00:23:02] Unknown:
And now, in the process of your work, you know, getting involved in these large data processing systems and working with Hadoop, you somewhere along the line decided that it would be a good idea to start a weekly newsletter detailing some of the technologies and articles pertaining to the Hadoop ecosystem. So I don't know if you can speak a bit about the decision that led you to create that newsletter.
[00:23:31] Unknown:
Yeah. So as I was saying when I was doing my introduction, I kind of fell into the Hadoop, the big data ecosystem just through being at the right place at the right time, like I think a lot of people did. And I was feeling overwhelmed with how much was out there and how many products there were. It seems like it's exploded exponentially since, but at the time, I mean, Hadoop distributions had 6 or 7 different components to them. And so my solution to that was to try to read as much as I could. I was following everyone, every blog I could find, whether it was a, you know, a big vendor or LinkedIn or Twitter. I was reading academic papers that teams were putting out and really trying to absorb everything I could, just so I could feel like I had my head above water and could stay on top of the industry. And also, you know, I kind of turned the page at some point. I was solving problems when I was working at Foursquare, and I knew that other people had solved these problems. So it turned into more of a how do I apply what other teams are doing to the problems I'm facing. And so I found myself reading 20 or so articles a week, probably, and at some point, you know, I would send a couple around internally on mailing lists with a short summary.
I'd been a big fan of a couple of other newsletters that were out there. And I said, well, I'm already doing all this research and seeing all these articles. Why don't I just turn it into a newsletter and see if other people are interested? And that was how Hadoop Weekly was launched, on the inauguration day of President Obama's second term, back in, what was that, 2013.
[00:25:31] Unknown:
That's funny timing, so that you have an anchor point that you can remember it by, you know, after many years have gone past.
[00:25:39] Unknown:
Yeah. Was that intentional timing or just coincidental? It was coincidental, you know. I had to get the domain name and the Mailchimp account and all those things, and probably did that over Christmas break, and that was the coincidental timing. But, you know, I just celebrated 5 years, and when I looked back on it, I said, wow, funny timing, and kind of relearned that it was the inauguration day. Yeah, that's funny. And a few months ago, you wrote a post
[00:26:08] Unknown:
detailing your decision to rebrand Hadoop Weekly as Data Engineering Weekly, because the space has expanded so far beyond just the Hadoop ecosystem. I don't know if you want to just briefly cover your decision around that.
[00:26:24] Unknown:
Yeah. I think there are a couple of main motivators. Data infrastructure and data engineering have become, well, let me say that I guess most people are not just using Hadoop. They're building all kinds of products with and around Hadoop. And we're starting to see tons of other products, open source or closed source. But really, there's been this momentum shift to things like Kafka and Spark and Flink. You've also got NiFi and Kubernetes. There's been a big push to the cloud, which lets people experiment with these other tools, unlike, you know, when you're kind of in a fixed data center. So we're seeing a lot more adoption of different tools and, along with that, a big emphasis on real time and stream processing.
So when I started the newsletter, I mean, you had your big companies like Oracle and IBM, but in terms of open source software, just about the only enterprise companies out there were selling Hadoop. And I guess you also had DataStax selling, you know, support for Cassandra. Now you have companies around all these different things. And so I found it more and more interesting to focus on the things around Hadoop: how you integrate with legacy systems, how you get data into a serving layer to actually build a product based on the data science that's done on your Hadoop cluster, how you manage the data workflows. And I think the tooling and the ecosystem around Hadoop has matured. A lot of it has caught up with the tools that existed in the relational database world before Hadoop was a big thing.
And it's become much more interesting to focus on those things than on Hadoop, which has now matured and is moving a lot more slowly.
[00:28:23] Unknown:
Yeah. It's starting to become one of the, you know, quote unquote boring systems that just sits there and reliably does what you tell it to, and you don't have to keep paying attention to it to make sure that everything's running as it's supposed to. So it isn't generating as much press, except for when they make their occasional releases with new features, and beyond that it just becomes part of the background, part of the day to day. Yeah. And I think also, you know, you're absolutely right on that. It used to be that you had things like Elastic MapReduce and you could get a
[00:28:58] Unknown:
Hadoop cluster up, and then maybe you could run Pig or Hive. But now there are just so many tools to launch other things and get them up and turn them into boring technology too. So I think that's a good thing. I've operated Hadoop and some of these other tools over the past couple of years, and there was a long time where they weren't boring, and I'm glad to see things mature. But you always have that balance where things start to mature and then you start to see less innovation.
[00:29:34] Unknown:
And in order to generate a new newsletter every week, you must be reading through a lot of articles, keeping track of a lot of different projects as they iterate, and watching videos. So what is your workflow for being able to keep track of all of the different things that are happening in the data engineering ecosystem?
[00:29:58] Unknown:
Yeah. So it's changed a little bit over the years. I think at this point, I get a lot of my news kind of pushed to me. I've curated a couple of different channels, whether it be mailing lists or the folks I follow on Twitter, the, you know, accounts that I follow on Twitter. I have a throwback where I'm following a bunch of RSS feeds of different companies or blogs out there. More recently, I've been focusing a lot on Medium. It seems like that's where a lot of people tend to publish or cross publish their posts. And so I have this flow of data that's coming at me throughout the week, and, you know, I bookmark interesting articles when I see them.
Usually, I don't get a chance to read everything until, you know, Saturday or Sunday, the weekend, when I'm actually compiling the newsletter. But that's kind of the current state of things. It used to be that maybe there was more emphasis on Twitter. At one point in time, I had written a script to go and find all the Hadoop related posts out there, deduping against stuff I had already seen. That turned into a war between me and the people posting Hadoop related jobs on Twitter, and ultimately, they won. So I kind of gave up on that script.
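As an aside, the dedup step in a script like that can be as simple as normalizing each candidate URL and checking it against a set of links that have already appeared. This is only a rough sketch of the idea, not Joe's actual script, and the normalization rules are assumptions.

```python
# Rough sketch: skip any candidate link already covered in a past issue.
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    parts = urlsplit(url.strip().lower())
    # Drop query strings and fragments so tracking parameters (utm_source,
    # etc.) don't make the same article look new.
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))

def fresh_links(candidates: list[str], seen: set[str]) -> list[str]:
    out = []
    for url in candidates:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            out.append(url)
    return out

seen_before = {normalize("https://example.com/hadoop-post")}
print(fresh_links(
    ["https://example.com/hadoop-post?utm_source=twitter",
     "https://example.com/new-flink-article"],
    seen_before,
))  # only the second, previously unseen article survives
```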
But I try to evolve with the communication channels, and they definitely have changed over the past 5 years. The mailing lists and Twitter have been the two mainstays, and even things like RSS feeds and Medium have been a little more recent.
[00:31:51] Unknown:
And as you're going through all of these different posts and articles, etcetera, are there any particular questions that you ask yourself about the content to determine whether you want to include it in a given issue? I'm curious if you have a particular sort of focus or purpose behind the way that you are selecting these different pieces.
[00:32:15] Unknown:
I try, for the most part, to anticipate what my readers would like, and I have to admit that that's kind of an informal feedback loop. I do hear from people occasionally. I probably hear more from people who like what I'm curating than from those who don't. And so the main question that I get at, or the main way I try to answer that question, is just: do I like the article? If there's an article where I feel like I'm getting bored, or maybe it's not making sense, maybe I'm having trouble following the post, a lot of times I will try to do some background research, but if it turns out to be something I'm not interested in, then that's kind of the first strike. And my background is much more on, as we've talked about, the data engineering, the back end, the DevOps side.
So that is what I tend to cover, and probably a little bit of what people would maybe call machine learning engineering. There was recently a post covered in my newsletter, and I can give you the link for the show notes, called The Flavors of Data Science and Engineering, and it was kind of this 8 circle or 9 circle Venn diagram of all the different roles and how they overlap. And I think the articles that I cover are pretty squarely on the data engineering, back end, DevOps side of things. That's the high level way that I narrow it down.
There are a couple of other filters, and I say filters because, honestly, every week I start out with probably 40 or 50 links that I'm evaluating to include in the newsletter. If you post an article, you could also just get really unlucky, where there are 15 or 20 other really good links that week and yours kind of fell off. But the other filters I use are: if I'm reading something from a vendor, is it just a vendor sales pitch, or is there something technically interesting in there that maybe is not vendor specific? I look for, is this blog post just a regurgitation of the documentation for the tool? You'll see a lot of people who maybe went through and just set up Spark or Hive or something for the first time and write a blog post about that. And sometimes they find something new and interesting, but a lot of times it's almost exactly the same steps that you would find in the tutorial on the website. And then, kind of on the other side of things, you sometimes find these really niche articles that are about a tuning option of a proprietary database.
And they just tell you what the option is and how to set it, and they don't really tell you what's happening under the hood. Those kinds of niche ones are the ones that will also get filtered out. I try to cast a wide net and be fair. I think I'm following most of the major vendors and some smaller ones in my RSS feeds. I have gotten some flak in the past when I, not on purpose but inadvertently, maybe missed a big blog post from one of your favorite vendors. I'll get some angry emails. So I try to give everyone credit for all the hard work they're doing and include them in the newsletter as much as possible.
[00:36:06] Unknown:
Yeah. It's difficult being the person who is, in some sense, responsible for providing a lens on the industry, as somebody who's producing a podcast on the space. So every time I go to my list of interview ideas, I have to look at it thinking through, you know, is this a topic that is going to be interesting enough to talk about for 45 minutes to an hour? Is it something that I'm going to learn something from, that my audience members are going to learn something from? And, as you mentioned, sometimes you have to be concerned about whether the conversation is just going to turn into a vendor pitch without enough technical aspects to it. So it's always challenging to try and think about things beyond just, is this something I'm interested in, and figure out whether or not it is more broadly applicable to the people who are subscribing to your newsletter or my podcast or things like that. Yeah. And I think,
[00:37:01] Unknown:
this is probably a call out from both of us. If anyone out there has feedback for me, I always love to hear it. I'm reachable on Twitter, or you can email info at data eng weekly dot com. I'm always open to feedback, or if you have articles to send my way, I love to get proposed articles from folks out there. And then, Tobias, maybe you and me should form a support group to help review each other's,
[00:37:29] Unknown:
Absolutely. Yeah. Having that deadline of, every week I need to release a new episode, it's often difficult to keep coming up with something new and interesting for the week. So I'm always happy to get input on suggestions for show ideas or feedback on past episodes about what went well or what didn't. So, as you mentioned, having that feedback cycle is very helpful, and particularly with podcasts it's often very asymmetrical: I publish something, somebody consumes it, and that's the end of the interaction. So closing that loop every now and then is very useful. Yeah. Absolutely.
And having worked on curating the newsletter for the past 5 years, I imagine that there's been a lot of back and forth in terms of influence, both from your interests on the newsletter and from the newsletter on shaping your particular interests. I don't know if you want to talk a bit about that and how it has sort of led you to your current thinking on what your plans might be going forward.
[00:38:30] Unknown:
Right. So I have nearly stopped publishing the newsletter a couple of times. It is a lot of work. I enjoy doing it, and ultimately I've made a pretty big investment going forward in trying to make it work. And I think the big reason that I've done that is that my professional career has kind of shifted. I used to, as we were talking about, work pretty deep in the weeds on all of the big data technologies. I was deploying Kafka, deploying Hadoop, HBase, all these tools, working pretty hands on with a lot of DevOps. And then when I went to work for the government, my role was very different. It was mostly to be the technical voice in the room, to set expectations with executives that, no, really, we don't need to spend $10,000,000 on a data center to, you know, run a web app that's going to see 10,000 users a month, or something like that. So the newsletter has been a way for me to stay connected to the industry that I very much enjoy.
It gives me ideas for side projects all the time. Sometimes I get to work on those; a lot of times I don't. But my interest in and relationship with the newsletter has definitely evolved over the years. It's much more of an outlet for me now to get more technical and to, you know, occasionally read a more academic paper that's probably pretty dense and in the weeds, but also to see what tools are out there and stay on top of that as things evolve. My current excitement is with the shift to things like stream processing, and having tools like Kafka and Flink to really be able to implement some of these patterns that have existed for a while, like CQRS and change data capture.
Concepts that have been around for a while, but we haven't really had good tools to implement them. And so it's been really exciting to see, you know, those go from a blog post by Jay Kreps about, you know, the log, to people now actually building systems, like at the New York Times, based on these types of concepts and these architectures, which are not necessarily novel but are enabled, in a lot of cases for the first time, by this shift in the tools that we have.
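For readers newer to these patterns, here is a minimal, library-free sketch of the log-centric idea being described: writes go to an append-only change log (the role Kafka plays in practice), and read models are built, or rebuilt from scratch, by replaying it. The event names and fields are invented for illustration.

```python
# Minimal sketch of a log-centric (CQRS / change-data-capture flavored)
# design: an append-only log of changes, plus a read model derived from it.
from dataclasses import dataclass, field

@dataclass
class ChangeLog:
    events: list = field(default_factory=list)

    def append(self, event: dict) -> None:
        self.events.append(event)  # the write side only ever appends

def build_read_model(log: ChangeLog) -> dict:
    # The read side is a pure function of the log, so it can be rebuilt from
    # scratch at any time or kept up to date by consuming new events as they
    # arrive (which is what a stream processor would do against Kafka).
    accounts: dict = {}
    for e in log.events:
        if e["type"] == "account_created":
            accounts[e["id"]] = {"email": e["email"]}
        elif e["type"] == "email_changed":
            accounts[e["id"]]["email"] = e["email"]
    return accounts

log = ChangeLog()
log.append({"type": "account_created", "id": "u1", "email": "a@example.com"})
log.append({"type": "email_changed", "id": "u1", "email": "b@example.com"})
print(build_read_model(log))  # {'u1': {'email': 'b@example.com'}}
```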
[00:41:32] Unknown:
Alright. And are there any other topics that we should discuss further before we start to close out the show? I think we covered a lot, so it was good. Okay. So for anybody who wants to get in touch with you or subscribe to the newsletter or follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, from your perspective as somebody who is keeping up with the data engineering space, what do you see as being the biggest gap in the tooling or technology that's available for data management today? I wish I had a really good and insightful answer to this question.
[00:42:08] Unknown:
I think that the thing that's been most striking to me in the data infrastructure tooling, and probably the most frustrating as well, is that we seem to always have 2 or 3 or 4 tools to do the same job. A prime example of that is workflow engines, where all of them do it kind of okay, but nobody does it really well. And I do think we're starting to see some consolidation, but, you know, a lot of the vendors have their momentum behind one tool and not the others. So hopefully we'll start to see some more consolidation, because I do think that's an area where we'll start to see vast improvements if, you know, you have people working towards the same goal on the same software project instead of multiple software projects
[00:43:01] Unknown:
functionally doing the same thing. Alright. Well, thank you again for taking the time today to join me. I have been subscribed to your newsletter for at least a few months now, and it has been very helpful for me as I try to find topics for the podcast. So thank you for that. Keep up the good work, and I hope you enjoy the rest of your day. Thank you, and thanks for having me. It was a lot of fun.
Introduction to Joe Crobak and His Background
Joe's Journey into Big Data
Working with US Digital Services
Challenges and Solutions at Healthcare.gov
Starting the Data Engineering Weekly Newsletter
Rebranding to Data Engineering Weekly
Curating Content for the Newsletter
Impact of the Newsletter on Joe's Career
Closing Thoughts and Future of Data Engineering