Summary
BuzzFeed needs to understand how its users are interacting with the myriad articles, videos, and other content that it posts. This lets them produce new content that will continue to be well-received. To surface the insights they need to grow their business, they need a robust data infrastructure that reliably captures all of those interactions. Walter Menendez is a data engineer on their infrastructure team, and in this episode he describes how they manage data ingestion from a wide array of sources and create an interface for their data scientists to produce valuable conclusions.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes or Google Play Music and tell your friends and co-workers
- Your host is Tobias Macey and today I’m interviewing Walter Menendez about the data engineering platform at BuzzFeed
Interview
- Introduction
- How did you get involved in the area of data management?
- How is the data engineering team at BuzzFeed structured and what kinds of projects are you responsible for?
- What are some of the types of data inputs and outputs that you work with at BuzzFeed?
- Is the core of your system using a real-time streaming approach or is it primarily batch-oriented and what are the business needs that drive that decision?
- What does the architecture of your data platform look like and what are some of the most significant areas of technical debt?
- Which platforms and languages are most widely leveraged in your team and what are some of the outliers?
- What are some of the most significant challenges that you face, both technically and organizationally?
- What are some of the dead ends that you have run into or failed projects that you have tried?
- What has been the most successful project that you have completed and how do you measure that success?
Contact Info
- @hackwalter on Twitter
- walterm on GitHub
Links
- Data Literacy
- MIT Media Lab
- Tumblr
- Data Capital
- Data Infrastructure
- Google Analytics
- Datadog
- Python
- Numpy
- SciPy
- NLTK
- Go Language
- NSQ
- Tornado
- PySpark
- AWS EMR
- Redshift
- Tracking Pixel
- Google Cloud
- Don’t try to be Google
- Stop Hiring DevOps Engineers and Start Growing Them
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the data engineering podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it. So you should check out linode at data engineering podcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and Go CD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today.
Enterprise add-ons and professional support are available for added peace of mind. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media.
[00:01:11] Unknown:
Your host today is Tobias Macey. And today, I'm interviewing Walter Menendez about the work that he's doing as a data engineer at BuzzFeed. So, Walter, could you please introduce yourself?
[00:01:22] Unknown:
Yeah. So my name is Walter, and I'm a data engineer at BuzzFeed. I've been with BuzzFeed for almost 2 years now, and it's been a really good time. I'm mainly responsible for the construction, the maintenance, and the overall improvement of all of the data pipelines at BuzzFeed, which then empower data literacy and data-driven decisions throughout the rest of the company. So those are my biggest responsibilities there.
[00:01:51] Unknown:
And how did you first get involved in the area of data management? And what about it interests you and keeps you going?
[00:01:57] Unknown:
I fell into data rather by accident. A lot of my experience before doing industry internships in college was in the MIT Media Lab. If you're not familiar with the MIT Media Lab, it's this research lab at MIT where, at the time, the mission was more about just playing around with different technologies and different concepts and prototyping constantly. So my background had come from working on a lot of prototypical technologies. Eventually the Media Lab switched philosophies and became more about "deploy or die," trying to push research groups to actually produce work that would be viable and could be productionized out in the real world. Part of that transition was that a lot of the projects I got involved with ended up being a bit more data intensive. So, for example, one of my research projects for a semester involved the organization of cell phone metadata.
For example, this guy from Belgium who was doing his PhD was in direct communication with all of these European phone carriers, and they had given him access to all of this carrier metadata they had about when certain phone calls were being made, from what anonymized ID to what other anonymized ID the call was being made, and from what cell phone towers the call was redirected, all that kind of stuff. So I was doing projects like that, and then my senior project in college revolved around what I'd describe as micro-trend detection: trying to basically reimplement trending topics, the way Twitter does it, but for very small and very specific geographical areas. The professor in question at the time cared about this one town in Spain that was using Twitter to run its city government. The mayor had a Twitter, the police department had a Twitter, the Department of Forestry and the Department of Urban Planning had Twitters, and you could tweet directly at all of these accounts, and that would be considered a public and legal action to talk to them. They would collect agenda items for their meetings via Twitter. So, yeah, I was always around data-heavy projects.
And so my first internship in the industry was at Tumblr, where I was on the search engineering team working on improving the way they did trending tag detection. At the time they were rolling out trending tags. If you're not familiar with Tumblr as a platform, they have tagging the same way pretty much any major platform does, and they wanted to know when people were starting to all of a sudden tag a lot of their content with the same few fields. They had done an initial pass at what they thought that should look like, but they were wondering if there were ways to detect things even sooner. The latency they had at the time was about a day behind: you would know something was trending basically the next day, but they wanted something sooner than that. So that was what I was involved in, and it was a really nice project because it exposed me to some preliminary data science work you could do with that kind of stuff, like trending tag analysis and anomaly detection, things that a lot of the data engineers I know now have dealt with. But then also being able to productionize this: how do I start writing some crons? How do I automate the same few database queries that I run over and over? So when I did my internship at Tumblr, I feel like I synthesized a lot of the kind of data management that you were talking about before into a single pipeline. And I think that's what got me thinking, oh, I really like working with data, and I like working with data at scale.
I think one of the biggest things that keeps me in data engineering is that the scale at which data engineers have to fundamentally work is just so intense from the get-go. Whereas with other projects, either there's a limited onboarding so they control the scope or the scale at which they expose a product, or it's like, oh, this runs offline, we'll never have to worry about breaking a production server or running out of resources and things like that. With data engineering, you have to get scale right from the get-go, and if it doesn't work at scale, it doesn't work at all, which is kind of a fun problem to be tackling on a frequent basis. So, yeah, that's my foray into data. It started through a bunch of research projects, and then finally someone effectively told me to do it for a job, and then I did.
[00:06:49] Unknown:
So one of the things that you mentioned that I wanted to get a definition on is the fact that at BuzzFeed, you're working on data literacy across the company. So I'm wondering if you can describe a bit about what that means in terms of your idea of the concept and how that translates to your actual job responsibilities.
[00:07:08] Unknown:
Yeah. I mean, for me that means anyone, say a client relations manager, being able to understand: this is the kind of information that we have, this is the kind of information that we collect and aggregate. How that translates into my actual job is basically making our data as easy to use and as discoverable as possible. I think it's really those two things: focusing on making data discoverable and easy to access.
You know, we've run into this a few times where all of our data sits in archives in the cloud. That's a really common thing to do: with any real-time data system, it's really common to just save all of it. And that data is just sitting there, but so many people don't really know how to access that information, and they don't really know what they could do with all that data if they just had access to it. So trying to build systems in such a way that makes that data easier to work with is our main responsibility. It doesn't really translate to a clear, concrete thing every day, but it's what we think of when we're thinking about the properties of a system and the usability of a system. More importantly, we think about what is the easiest thing we can make our end users do in order to get the data they want and draw the insights they need from it, come time to actually engage with the data. That's what data literacy means to me, and I think it's a concept that the organization also agrees with, mainly because with a lot of the product releases that we do at BuzzFeed, we don't release comfortably until we have a solid data workflow embedded into the product launch itself. If we want to see how a new product is going to work, there's data for it. If we want to A/B test something, there's data for every variant that you're working with. That's what it means to me.
[00:09:25] Unknown:
Yeah. It's interesting how the way in which data is stored and structured can really impact how people perceive it, both in terms of its utility and their ideas of what they can actually use it for. Because if it's all archived away in a database, it's much less accessible to a nontechnical person. Whereas if you have some sort of dashboard, even just cataloging the types of data that you have, whether it's metadata about where it was gathered from or maybe even just any linkage between this data and other data that you might have available, it can add a lot more overall value to the data capital that you have invested in the business.
[00:10:07] Unknown:
Yeah. And part of that also comes from being able to answer the flip side of that: being able to say, oh, we don't collect that. It's sometimes hard for us to say, are we collecting the data on that? Especially because I interface a lot with the data scientists, and sometimes they'll just straight up ask, do we have access to this kind of thing? Do we have this kind of data workflow already in place? Sometimes the answer is yes, we do, and it's just a matter of connecting the pieces together. Sometimes the answer is no, we straight up are not collecting that data, or that data is not processable in such a way that we can get that kind of information out of it. So being able to understand what the limitations of our current data workflows are is another big part of it.
[00:10:53] Unknown:
And you mentioned having a separate team for the data scientists and the data engineers. So I'm wondering if you can explain a bit about how the different teams are structured and if there are also sort of sub teams within those different organizations and the different kinds of projects that you're responsible for as a, you know, as a member of the data engineering team.
[00:11:13] Unknown:
My team is called the data infrastructure team, and we focus specifically on the word infrastructure because that's really what our deliverables are. We're delivering systems and entire architectures designed for the most efficient and the most performant transfer of data from point A to point B. That is ultimately what we deliver to end users, and that's what they interact with. My team sits directly on the pipeline, so if you're thinking of a stack, my team sits at the bottom of the stack, front-lining the ingestion of data into our BuzzFeed ecosystem.
One level above us is what I'd call the insights layer of BuzzFeed, where we have the data scientists, the data analysts, anyone who has to touch data and make a decision from it, or deliver insights to other parts of the organization. Whether that's to a product team, because we have data scientists working with people doing product launches, as I said, doing A/B testing. Also checking health metrics, like, this product is great, but it doesn't increase the use of the site, so it's not actually that great. In the case of an A/B test, there's a data scientist saying, we didn't get a statistically significant result from our A/B test, so our A/B test is no good, and we revert to the control instead of trying to decide that a variant is better for whatever experiment we were running. So they are a separate team focused directly on turning those insights into decisions. Adjacent to the data scientists, I would say, is what's called the learning tools team. They build tools designed to bridge the gap between the data pipeline and the non-engineer at BuzzFeed who needs to be able to understand the analysis of their data that's coming out of our pipelines.
So for example, one of the tools that we built internally, and I would argue the most useful thing for editors at BuzzFeed to know about, is this thing called the dashboard, which is a very simple rendering of a post's traffic over time. You can see on this dashboard when a post was published and how much traffic it got from buzzfeed.com, BuzzFeed the app, all of the native BuzzFeed experiences, and then from things like the BuzzFeed Facebook page, the BuzzFeed Twitter page, our BuzzFeed Tumblr page. So the dashboard is able to classify those two kinds of traffic. And, as I imagine won't be any surprise for a company like BuzzFeed, most of our traffic comes from social media redirects.
So editors are able to see that breakdown and say, oh, this kind of content is successful on Twitter, but not Facebook. Or it's successful only on Facebook, and it's a dud on every other platform. That team works on building the tools and the interfaces that help editors themselves make those decisions in a self-guided way. So mostly, my team sits at the bottom, ingesting the data and getting it into our pipelines and our systems, and then there's the data scientists and the learning tools group that sit above us, processing that data, synthesizing it, making decisions from it. And then they reach out to pretty much the rest of the organization and the rest of the company. So both product teams, other engineers, other data scientists,
[00:14:55] Unknown:
and then editors and client relations managers, who are able to go back to a client and say, these are the results of your post, this is the performance of your post, and things like that. And what are some of the major types of data inputs and some of the different interfaces that you deal with for ingesting data into your pipeline, and also some of the endpoints that you're delivering the data to after it's been processed?
[00:15:19] Unknown:
So I would say the main source of data that we get, and the main format that it comes in, is basically through our custom impression collection, capturing any action on most of the experiences we control. So, again, buzzfeed.com, BuzzFeed the app, things like that. We set up pixel tracking on the site, and that gets ingested to us as blobs of JSON: kind of very simple, low-level details. From there we take those blobs of JSON and serialize and deserialize them as we move them around, and then put that into whatever persistence we have, whether it's a database or our archives. Sometimes there's a message consumer ready to take that data, transform it, process it in whatever way, and then redirect that data wherever else it's going, whether back into the real-time stream or to a persistence layer. So that's the biggest thing: basically, our own custom impression information and our own custom message format. The flip side is that we collect a lot of data from our distributed platforms, to the point where we basically have an entire team at BuzzFeed devoted to collecting data from other platforms. That's data coming from Google Analytics, because, as with any good development of a system, we wanted another system to which we could compare our results.
So actually, something that we do daily is compare our impression totals with what GA thinks our posts got, and we compare to say, okay, good, our data is good, or, no, our data is actually really bad for this one day, and we don't know why. So GA is one source of data. We also get a lot of data from Facebook, which comes with its own array of challenges, because Facebook changes their API responses all the time, or they'll change the format or the semantic meanings of certain things in their message format. As a result, we'll break the ingestion of that data on our side, and that leads to some repair that needs to happen. So those are the two main sources: our custom, owned-and-operated collection, and then the distributed collection of data. And as I said, to make these things exposable, there's usually a persistence layer that we're shoving all this data into, and it all comes through the real-time event stream. If it's being consumed in real time, then, as I said, there are message handlers set up for that, and they'll consume that data and flush it to persistence. If it's offline, then as it comes into the real-time event stream, it's just archived.
And then there are offline processes that will periodically pull down some of the archives, grab those messages, and start processing them, aggregating them, doing whatever it is they do to them, and then flushing that result in whatever way they see fit, whether it's to other APIs, other databases, and things like that.
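The pipeline Walter describes, JSON blobs of impression data that get deserialized, transformed, and then routed to an archive for offline processing or back onto the real-time stream, can be sketched roughly like this. All of the field names and the routing rule here are hypothetical illustrations, not BuzzFeed's actual schema:

```python
import json


def handle_impression(raw_message: bytes) -> dict:
    """Deserialize one pixel-tracking event and normalize the fields we care about.

    The field names ("event", "post_id", "referrer") are made up for
    illustration; the real message format is internal to BuzzFeed.
    """
    event = json.loads(raw_message)
    return {
        "event": event.get("event", "pageview"),
        "post_id": event["post_id"],
        "referrer": event.get("referrer", "direct"),
    }


def route(event: dict, archive: list, realtime: list) -> None:
    # Every event lands in the archive for offline (batch) processing;
    # pageviews are also forwarded to the real-time stream for dashboards.
    archive.append(event)
    if event["event"] == "pageview":
        realtime.append(event)
```

In the real system the two lists would be an S3 archive and an NSQ topic, but the shape of the handler, deserialize, normalize, fan out, is the same.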
[00:18:25] Unknown:
And one of the things that you mentioned, particularly when you're dealing with Facebook, is that there are cases where your ingestion breaks. So I'm wondering what processes you have for validating the health and cleanliness of the data that you're ingesting, and any sort of monitoring or metrics or testing that you have embedded into your platform to be able to notify you of those kinds of situations, so that you're not polluting your overall data warehouse with bad data for a while before it surfaces that there is a problem.
[00:19:02] Unknown:
Yeah. So as I said, we basically constantly compare systems. We start with a system that we have relied on and battle-tested, which we say is correct regardless of what happens. When it comes to the raw ingestion stream, we've compared that against Google Analytics and seen that there's an acceptable variance between the two systems. We'll never get it quite right, because there are some things that GA does that we don't, and vice versa. I think 5% is basically what we said is the difference: for any given article, these two systems should only diverge by about 5%, if that at all. Going off of that, when it comes to ingesting things like distributed APIs, or even our own data: teams adjust message formats, which happens a lot, based on whether they're working on a different part of the product, like buzzfeed.com, and say they're focused on just adding some information there. They could break our own ingestion without knowing. So in terms of validation, we'll compare systems. If all of a sudden we're like, oh, hey, all of our Facebook traffic is down, does that make sense? Then we'll go back to the raw data and say, well, no, we have this much raw data for Facebook, but we only ingested this value. Clearly, our ingestion of it was wrong. So that's how we'll root out that kind of thing. For longer-term things, we have a lot of anomaly detection in place as well. We use Datadog a lot to emit metrics of just the volume levels of various things.
As you can imagine, with a website that is dependent on natural user views, the traffic wave rises and falls with the day as people visit the site and then end up busy, things like that. So we expect a certain pattern, and we have anomaly detection set up for when the pattern starts breaking and deviating: all of a sudden, we're seeing that we're getting too few events of this kind.
When we shouldn't be, we should ask the producers of that kind of data: hey, are you no longer sending this data? Have we ruled out that we've broken the ingestion of that data? Or is the data just coming in malformed? Is that an issue as well? So I would say those are the two biggest things that we end up doing: we compare notes across all of our pipelines, and then ultimately just ask, is this what we expect in steady state?
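The two checks Walter describes, a daily reconciliation against Google Analytics within a roughly 5% tolerance, and volume-based anomaly detection against an expected traffic pattern, might look something like this in outline. The function names, the z-score approach, and the threshold of 3 standard deviations are assumptions for the sketch, not BuzzFeed's actual implementation:

```python
from statistics import mean, stdev


def totals_agree(our_total: int, ga_total: int, tolerance: float = 0.05) -> bool:
    """Daily reconciliation: flag an article when our impression count and
    Google Analytics' count diverge by more than the accepted ~5% variance."""
    if ga_total == 0:
        return our_total == 0
    return abs(our_total - ga_total) / ga_total <= tolerance


def volume_is_anomalous(history: list, current: float, threshold: float = 3.0) -> bool:
    """Crude volume check: compare the current event count for this hour
    against the same hour on previous days, and flag it when it sits more
    than `threshold` standard deviations from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold
```

In production these thresholds would trigger Datadog monitors rather than return booleans, but the steady-state question being asked is the same: is this number where we expect it to be?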
[00:21:32] Unknown:
And you mentioned that the main core of your system is a real time ingestion pipeline. So I'm wondering if you can sort of enumerate some of the overall platforms and tools as well as the languages that you're using most widely across the team. And then maybe identify some of the outliers as far as languages or tools that you use and sort of custom edge cases?
[00:21:54] Unknown:
Yeah. So BuzzFeed Tech is actually a really big Python shop. We love Python. We think it's a great language to work in when it comes to working with data, obviously for a number of reasons: Python is very easy to write, very easy to read, and very easy to learn as well. BuzzFeed is the kind of place where we have people working with data who come from all different levels of tech background. Some people have been engineers for a while, so they speak tech; they're fluent in conversing in an engineering language. Whereas some people are switching careers at BuzzFeed, and they came from finance or the arts and are now moving into a technical role here. So Python is such a nice way to bridge the gap across all these different levels of experience, because then we can all of a sudden communicate with the same tools, since we're all working in Python. And obviously, Python has a lot of data-oriented packages. NumPy powers pretty much everything when it comes to anything data intensive. SciPy is used a lot. There's obviously NLTK. Pretty much any of the major machine learning and natural language processing packages that are in Python, we've used most of them. But then, back to actual data work: Python is great, but occasionally we run into Python's performance limits, where we just wish this were faster, and we know it can be faster if we just weren't in Python.
So when we hit that kind of wall, we switch to Golang, and we have a few systems written in Go. That's seemed to be a really standard pattern: we'll prototype in Python, we'll reach maturity in Python, we'll start evaluating performance, and if we can't get it out of Python, we'll do it in Go. We think it's a really nice language. It's designed for concurrency, it's designed for servers, and in my experience working with it, it almost writes like Python but reads like a more strongly typed language. So there's that. In terms of technologies, our messaging queue system is NSQ. We like NSQ a lot, again because it's written in Go, it's highly performant, and it's meant to be distributed, meant to not have a single point of failure, which is really nice for us. And it's so easy to set up and so easy to use that for us it was kind of a no-brainer.
So that's what's powering our message queuing all around the place. When it comes to consuming that data, and when it comes to quick, lightweight APIs around data, we'll use Tornado as our asynchronous framework of choice, which is really nice because we use it both for APIs and for our message handlers. We use the Python API for NSQ to start reading messages, and from there it's really easy to spin up consumption of a message on top of Tornado's asynchronous framework. We'll use the IOLoop to periodically ingest and periodically flush, transforming the data and putting it into persistence.
So those are some of our main technologies.
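The periodic ingest-and-flush pattern described above can be sketched in plain Python. This is an illustrative sketch, not BuzzFeed's actual code: `BatchFlusher`, its parameters, and the sink callable are all hypothetical names, and a real deployment would drive the periodic flush from Tornado's IOLoop (e.g. a `PeriodicCallback`) and feed `ingest` from an NSQ reader.

```python
import time
from collections import deque


class BatchFlusher:
    """Buffer incoming messages and flush them in batches, either when
    the buffer is full or when a flush interval has elapsed."""

    def __init__(self, sink, max_batch=500, flush_interval=5.0, clock=time.monotonic):
        self.sink = sink                  # callable that persists a list of messages
        self.max_batch = max_batch
        self.flush_interval = flush_interval
        self.clock = clock
        self.buffer = deque()
        self.last_flush = clock()

    def ingest(self, message):
        # Called once per consumed message (e.g. from an NSQ handler).
        self.buffer.append(message)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def maybe_flush(self):
        # Called periodically (e.g. from Tornado's IOLoop) so that a slow
        # trickle of messages still reaches persistence in bounded time.
        if self.buffer and self.clock() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        batch, self.buffer = list(self.buffer), deque()
        self.last_flush = self.clock()
        self.sink(batch)


# Usage: collect pixel events and flush them to persistence in batches.
flushed = []
flusher = BatchFlusher(sink=flushed.append, max_batch=3, flush_interval=60.0)
for event in ({"post": "a"}, {"post": "b"}, {"post": "c"}, {"post": "d"}):
    flusher.ingest(event)
# The first three events filled a batch and were flushed; "d" is still buffered
# and would go out on the next size- or time-triggered flush.
```

The point of the two triggers is that batching amortizes the cost of writes to the persistence layer without letting low-volume streams sit in memory indefinitely.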
[00:25:17] Unknown:
I find it interesting that in that whole list you didn't bring in any of the major hitters that most people would be expecting, like Spark or Kafka, the sort of JVM-oriented big data tools.
[00:25:29] Unknown:
Oh, yeah, we do use those. We actually use Spark a lot; it's our MapReduce framework of choice. It has a Python interface, which again is a theme: PySpark is really nice. It's a bit of an arbitrary choice too, since we picked it mainly for the language availability more than for the framework itself. We'll run Spark on EMR hosts in AWS a lot.
We'll write the job, deploy it to a remote AWS cluster, configure the cluster, start running the job, and then put the results back into S3, and interact with the output of our data that way. That's also where our archives live, so it makes sense that all of our data is in S3 in that sense. So yes, we do use some of the heavy hitters, but after a certain point it almost goes without saying.

Right. Fair enough. And what are some of the most significant challenges that you face, both from a technical and an organizational perspective?

I guess one organizational problem we have is just the speed at which user needs arise. That's a side effect of the way BuzzFeed is set up: the company itself is constantly refocusing its business strategies and its models, basically trying to adapt and be the sustainable media company of the future. It's really cool to see the company adapt in real time so much. But as the company adapts, so do our data needs. If we're adjusting to new platforms, we need to know how the data ecosystem works on each new platform so that we can ingest that data. Or maybe we're changing focus, and we don't care about one metric anymore and care about a different one, and we have to set up the data for that.
The biggest way we hear about this kind of thing is through our end users, directly saying: we don't need this anymore, we need this instead. And we're like, oh, okay, fine. Part of it also comes out of performance, where they say a system doesn't work as fast anymore, there's too much data in it, and as a result we can no longer satisfy our end users as well as we used to. A big issue here was that we were using AWS Redshift as our long-term data warehousing solution, and we found it got worse the more data we put into it. We were collecting a lot of really granular data, and one table got to several billion rows within a month of periodically populating it. After a certain point, querying it proved completely unacceptable.
Queries would take ten minutes when they used to take two, and that led to such a slow workflow for our end users that they were begging us for something better. Being able to adapt to those things as soon as we can is our biggest challenge: how can we do this in a way that isn't super disruptive to our overall infrastructure, but at the same time delivers the data they need efficiently and in a way that makes sense for what they're looking to get out of it? Disrupting ourselves is, I guess, what that sums up to.
[00:29:05] Unknown:
And with the need to keep an eye on constantly reimplementing or rearchitecting portions of the system, I'm sure that must lead to a certain accrual of technical debt. So I'm wondering if you can identify some of the areas in your system that have acquired the most debt, and some of the techniques or approaches that you're using to try to mitigate it?
[00:29:29] Unknown:
Yeah, I'd say for us the debt comes from the fact that a lot of the systems we initially built as a team were meant to solve one problem and solve it well. Which, if you're building a new system, is what you're going for. It makes sense to optimize for a really specific use case if you're going to build a system for it at all. An example of this is our wrapper API around the real-time event stream. Originally it was meant only for impression collection: you'd go to buzzfeed.com, you'd visit something on the site, a pixel event would be fired, we'd collect that, and that's how we knew you visited that post, when, and some metadata about the visit. At first, that's all it was meant to do. But then more and more teams heard about this system they could publish a message to and consume from elsewhere, and it became a pattern that a lot of other teams started propagating: setting up a stream, setting up a message handler to read from that stream. And we were like, oh, this is awesome.
But as more teams onboarded more and more inter-service communication onto it, we ended up with a very fragile system. If rogue behavior from one service hammered our ingestion and led to an ingestion outage, they would not only take down their own service-to-service communication, they would take down the intercommunication of other services, and more importantly, they would take down our impression collection, which is our business. Effectively, the system was not designed to handle all of these things in its initial conception. By the time we realized this, a few outages like that had happened. Thankfully it never happened in production, but we saw, with a staging instance for example, a system that sent us so many messages all at once that it managed to crash a node, and there was some minor data outage there.
That was a sign to us that the system had clearly developed another need and had to address it somehow better. Obviously, addressing tech debt comes in various forms. Throughout BuzzFeed there was a point in time, I want to say about a year ago, where we just went heads down and got rid of a lot of tech debt, and a lot of that honestly came in the form of big red diffs. I'm a big fan. I forget who said it, but it was someone I respect and admire who said that basically red diffs are good diffs. I saw plenty of pull requests that were just a sea of red where clearly we didn't use the code anymore; there was no reason for it to still be alive and running, so just axe it, just get rid of it. If a service doesn't need to send these messages, just don't. But in the particular case I'm talking about, that obviously wasn't an acceptable solution. So what we asked was: can we somehow divorce these two use cases from the same system? And that's what we ended up doing. We recognized that we still needed to empower services to intercommunicate through this asynchronous, message-based pattern, but at the same time we still needed to collect impression data and have failures of either system be independent of each other. So we split the two systems.
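The failure isolation behind that split can be sketched with a toy model: two independent, bounded ingestion points, where overload on one is contained rather than cascading into the other. The class and names here are purely illustrative, not BuzzFeed's actual architecture; a real system would isolate at the infrastructure level (separate NSQ topics, separate endpoints), not inside one process.

```python
import queue


class IngestionPoint:
    """A minimal, illustrative ingestion endpoint: accepts events into its
    own bounded queue and reports overload instead of raising to the caller."""

    def __init__(self, name, capacity):
        self.name = name
        self.events = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def ingest(self, event):
        try:
            self.events.put_nowait(event)
            return True
        except queue.Full:
            # A misbehaving producer fills this queue, but the failure is
            # contained here instead of cascading to other streams.
            self.dropped += 1
            return False


# Impressions and inter-service messages get separate ingestion points,
# so flooding one has no effect on the other.
impressions = IngestionPoint("impressions", capacity=1000)
service_bus = IngestionPoint("service-bus", capacity=10)

# A rogue producer hammers the service bus...
for i in range(50):
    service_bus.ingest({"msg": i})

# ...while impression collection continues, oblivious to the bad producer.
impressions.ingest({"post": "some-article", "event": "pixel"})
```

The design choice this models is exactly the one described in the interview: once the two data paths no longer share an ingestion point, a bad producer on the service-to-service path can degrade only its own stream.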
We effectively created two separate ingestion points for those kinds of data, such that if you do hose us on one of the two, you don't affect the other, and we can proceed to ingest data completely oblivious to the fact that someone is being a bad producer. But as I said, the thought of this didn't come until too late, when we realized that so many people had done this that it was almost a rampant thing. At that point there wasn't much we could do other than address it as a use case we now had to seriously consider and put some architecture behind.

And counter to some of the difficulties that you've experienced, what are some of the most successful projects that you've been involved in? And what do you use to measure the success of a project like that?

Talking again about our Redshift woes: when it came time to think about how we could improve that, we turned to Google, basically. We started using more and more of the Google Cloud ecosystem, which ended up being such a better move, mainly because of the way Google writes software. It's faster. The number of times I've done the same operation, like uploading a file or populating a table from a cloud file, the performance boost on Google's side was so significant that it was like, wow, we should have been here years ago. So we started working out how to slowly but surely take our data warehouse and migrate all of it into a different cloud storage provider.
I think that was one of the most impactful projects we've done of late: being able to move all that data and have it accessible to end users in a more performant way. And the measure of success there is end-user satisfaction. We had a wrap-up meeting, and one of the data scientists said, this literally just works. I have to do so little thinking about how to tell the system I want this data, or how to tell it what the data looks like.
And the performance meets expectations; it's so straightforward. So if our end users are happy, and they're having a much easier time getting access to this data, that is their true measure of success. I think that's a really important thing across a lot of data work and data management: ultimately, your data is meant to be used by someone, to be used by other people. If there is no user-centric concern during the development and maintenance of the system, then ultimately you're just serving yourself or the computer. You're not actually serving a human at that point, which I think is really important when it comes to data, because data is just numbers. It's just moving blobs of information back and forth, and machines are unintelligible to that unless you start telling them to be intelligible about it. Really, it's a human that ultimately needs to sit down, look at those numbers, and draw meaning from them.
So I'd say we take a humanistic approach: if they're happy, we're happy.
[00:37:03] Unknown:
And are there any other subjects that you think we should talk about before we start to close out the show?
[00:37:08] Unknown:
I'm looking at your list of questions, and I think that covered the big hits. Yeah, BuzzFeed is hiring. It's funny, because I hear a lot of companies saying, oh, we're looking for data engineers, and I'll get recruiting emails saying the same thing. It's funny to me how many companies are looking for data engineers when I think very few of them actually understand what they're asking for. Not to say that they don't have real data needs, but so few of them are really aware of what they're asking for. They're just saying, we have data, come engineer it.
That's such a meaningless ask, but sadly it's the kind of pitch I hear a lot. At BuzzFeed we have very established data needs, and our entire culture and our entire business are built around the data feedback loop we have on our content, our traffic patterns, and our user engagement. We have data centered on all of this, such that data is really the reason BuzzFeed the business and BuzzFeed the company are what they are today. So at BuzzFeed you'll do real data engineering, and at another place, I don't know quite exactly what you'll be doing. You can find out more about what we're hiring for and what our needs are at buzzfeed.com/about/jobs.
[00:38:41] Unknown:
Yeah, it's kind of funny how certain industry trends start to surface, and then every company decides they need to hire somebody with the new fancy job title, even if it might just be a slightly different permutation of something people are already doing that they don't properly identify. Another trend is that once companies see there's this new kind of position, they'll actively seek external candidates without necessarily identifying internal candidates they could train into the role. So for anybody out there at a company who thinks you do need a data engineer, maybe start looking internally before you expend all of your energy trying to bring on new talent.
[00:39:30] Unknown:
Yeah, and naming is obviously a very hard problem; there's a reason naming is such an issue that we always gripe about. But I thought it was really funny that Tumblr had a really cool job title for what I think was basically a dashboard-builder kind of job after the fact. They called it the weapons engineer, because it was part of their content spam and intelligence team, and they were talking about the kinds of problems that social media sites like Twitter, Facebook, and Tumblr have, and I think Google was another part of it. There was a small group of really prominent social media companies having these problems.
They were saying that they need to understand these things and address them quickly in real time, because obviously they don't want spam on their site; they want to shut that down as soon as possible. So they said, we're weaponizing our insights, and we need people to build the tools around that. But really, it just seemed like a dashboard-engineer job. Cool job name, but a pretty standard job, actually. So naming these things is hard, and understanding the thing you're encoding in the name is hard. There was an article recently saying, don't try to be Google; even Google can't be Google, in terms of a data need that was coming up. I thought it was really funny, and so meta: in short, no one really fits this canonical definition of data engineering at the scale of your company. And if no one's doing it, then what is everyone actually calling it? But that's a more philosophical conversation for a later time.
[00:41:13] Unknown:
Yeah. And on the note of training people internally, there's a great talk that I'll link to in the show notes by Jez Humble from the DevOps camp about the idea that you shouldn't try to hire a DevOps engineer; you should build them. I think that's still relevant to the idea of data engineering as well, so I'll make sure to add that to the notes.

For sure. And I actually agree with that, mainly because I am a product of it.
[00:41:39] Unknown:
I graduated from college and was just a software engineer; that's what I was looking to be. Having worked on so many different kinds of technologies, I had what I would like to consider a big breadth of experience. I had some inklings as to what I wanted to do, and I ultimately went with the data team at BuzzFeed because I like data, I like the things you can do with data, and I like the scale of data, so I stuck with that. Joining as a new grad, I knew pretty much nothing about data engineering itself; there was very little I knew up front. But there were a lot of general principles that I felt still applied when thinking about a system's properties: its fault tolerance, its latency, its overall recoverability. Does the system self-heal? There are a lot of things about these systems that present themselves in a way that could apply to pretty much any other kind of system.
It just so happens that data is really the core export of the system. There are a lot of general principles that apply regardless of what you're building, and that is something you definitely learn on the job. I've learned a lot of that in my time at BuzzFeed, and not having known much about data engineering at the start was fine.
[00:43:08] Unknown:
Alright. For anybody who wants to follow what you're up to and see some of the things you're working on, I'll have you add your preferred contact info to the show notes. And with that, I'd like to thank you for taking the time out of your day to join me and tell me more about the kind of work you're doing and some of the tools you're using at BuzzFeed. It definitely sounds like an interesting set of problems to be tackling.
[00:43:30] Unknown:
Yeah. Thank you very much for having me.
Introduction and Guest Introduction
Walter's Journey into Data Engineering
Data Literacy at BuzzFeed
Team Structure and Responsibilities
Data Inputs and Outputs
Tools and Technologies
Challenges and Technical Debt
Successful Projects and Measuring Success
Hiring and Industry Trends
Closing Remarks