Summary
The majority of blog posts and presentations about data engineering and analytics assume that the consumers of those efforts are internal business users accessing an environment controlled by the business. In this episode Ian Schweer shares his experiences at Riot Games supporting player-focused features such as machine learning models and recommender systems that are deployed as part of the game binary. He explains the constraints that he and his team are faced with and the various challenges that they have overcome to build useful data products on top of a legacy platform where they don’t control the end-to-end systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
- Your host is Tobias Macey and today I’m interviewing Ian Schweer about building the data systems that power League of Legends
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what League of Legends is and the role that data plays in the experience?
- What are the characteristics of the data that you are working with? (e.g. volume/variety/velocity, structured vs. unstructured, real-time vs. batch, etc.)
- What are the biggest data-related challenges that you face (technically or organizationally)?
- Multiplayer games are very sensitive to latency. How does that influence your approach to instrumentation/data collection in the end-user experience?
- Can you describe the current architecture of your data platform?
- What are the notable evolutions that it has gone through over the life of the game/product?
- What are the capabilities that you are optimizing for in your platform architecture?
- Given the longevity of the League of Legends product, what are the practices and design elements that you rely on to help onboard new team members?
- What are the seams that you intentionally build in to allow for evolution of components and use cases?
- What are the most interesting, innovative, or unexpected ways that you have seen data and its derivatives used by Riot Games or your players?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data stack for League of Legends?
- What are the most interesting or informative mistakes that you have made (personally or as a team)?
- What do you have planned for the future of the data stack at Riot Games?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Riot Games
- League of Legends
- Team Fight Tactics
- Wild Rift
- DoorDash
- Decision Science
- Kafka
- Alation
- Airflow
- Spark
- Monte Carlo
- libtorch
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Hevo: ![Hevo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/4VC62YUo.png) Are you sick of repetitive, time-consuming ELT work? Step off the hamster wheel and opt for an automated data pipeline like Hevo. Hevo is a reliable and intuitive data pipeline platform that enables near real-time data movement from 150+ disparate sources to the destination of your choice. Hevo lets you set up pipelines in minutes, and its fault-tolerant architecture ensures no fire-fighting on your end. The pipelines are purpose-built to be ‘set and forget,’ ensuring zero coding or maintenance to keep data flowing 24×7. All it takes is 3 steps for your pipeline to be up and running. Moreover, transparent pricing and 24×7 live tech support ensure 24×7 peace of mind for you. Don’t waste another minute on unreliable data pipelines or painstaking manual maintenance. Sprint your way towards near real-time data integration with a pipeline that is easy to set up and even easier to control. Head over to [dataengineeringpodcast.com/hevo](https://www.dataengineeringpodcast.com/hevodata) and sign up for a free 14-day trial that also comes with 24×7 support.
- Select Star: ![Select Star](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/65NZFtJd.png) So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. From analyzing your metadata, query logs, and dashboard activities, Select Star will automatically document your datasets. For every table in Select Star, you can find out where the data originated from, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth in data is built in minutes, even across thousands of datasets. Try it out for free at [dataengineeringpodcast.com/selectstar](https://www.dataengineeringpodcast.com/selectstar) If you’re a data engineering podcast subscriber, we’ll double the length of your free trial and send you a swag package when you continue on a paid plan.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Ian Schweer about building the data systems that power League of Legends. So, Ian, can you start by introducing yourself?
[00:01:36] Unknown:
Yeah. So you got my name right. Thanks for that. I'm an engineer over at Riot. I work on a team called Data Central. We are kind of the team responsible for building and maintaining any, like, data-related products or anything like that for any video game that uses the League of Legends game engine. So think like Unity or Unreal; we kinda have our own custom-made game engine. We've got a couple of games that run on there, including, like you mentioned, League of Legends and Teamfight Tactics and Wild Rift. So we've got a couple of sweet games on that. And, yeah, we maintain all the data and ML stuff for it. And do you remember how you first got started working in data? I guess, like, my story is probably a little less interesting than some folks'. I went to college for CS.
I started working at Adobe kind of as an internship and then graduated there on a video streaming team, and I was more of, like, a consultant in that role. Some fun things happened. Corporate stuff happened. I had to go find a new team. So I actually ended up on a data engineering team, and that was, like, right around 2017. I guess I just gained, like, a lot of SME and, like, I had a lot of passion for the space. One of my, I guess, subjects in school was, like, distributed systems and machine learning. So, like, the whole space of big data kinda clicked, made sense there. Then I spent some time at DoorDash.
I think you might have had some DoorDash folks here on the show before. I think you had Sudhir here. Yep. I actually got to work with Sudhir for a while. He's amazing. So I worked there, was one of the earlier engineers on the data platform team, and now I'm over here at Riot. So I've been doing it for, like, I guess, 8 years. Very cool.
[00:03:26] Unknown:
And so in terms of what you're doing now at Riot Games, for anybody who's not familiar with League of Legends, I'm wondering if you can just give a bit of an overview about the game itself and the role that data plays in the overall player experience.
[00:03:41] Unknown:
League of Legends is a 5v5, like, MOBA game. Multiplayer online battle arena, I think, is what it stands for. It was made by our founders when they were back in college. If you're actually really super interested in that, there's, like, a documentary on Netflix. It's actually pretty decent at covering the whole story better than I could. So please feel free to watch that. But the video game itself is 10 people load into a game, and then it's 5 people per side. Each player kinda picks their own champion depending on, like, you know, what they want the characters to look like or how they want to play or whatever. They each take kind of a role, which is to say they either play, like, top lane, mid lane, or bot lane, which are specific kind of divisions of the map that they are kind of in control of, or they'll play more supporting roles, or, like, a jungle, which is kind of a free-roaming kind of character.
And the whole point is to kind of take out the enemy Nexus, which is kind of guarded by turrets. So through, like, a collection of team fights and kind of, like, strategy, you slowly make your way to the opponent's base to take over their turret. And it's a pretty old game. I think the first patch came out in, like, 2008, I think it was. So it's been around for a long time, and it's become one of the most popular, like, online video games kind of throughout the world. We have something around, like, hundreds of millions of players playing, including, like, a kind of competitive esports system as well, that Riot is really kinda invested in.
[00:05:20] Unknown:
So in terms of the data that you're actually working with, I'm curious if you can give a bit of a flavor as to the characteristics of it. So thinking in terms of the 3 Vs of volume, variety, and velocity, whether it's largely structured or unstructured, whether you're dealing with real-time data, and some of the ways that that data gets consumed and repurposed?
[00:05:52] Unknown:
We definitely look a bit more like kind of a traditional big data stack in that sense. So, like, we have the Hadoops, we have the AWS Glues to power everything, yada yada yada. But I guess to, like, actually talk about what we've got going on, we kind of collect data in a couple different forms. We collect data from all the people playing it, so all of our players. So we collect, like, their hardware information, like, what they were doing inside of the game. We kind of do some meta analysis to understand, like, how well do we think they're playing, what kind of rank would we put them on, who should they be kind of playing against?
Team matching kind of stuff is really important there. We collect data from the game servers, which are, like, ostensibly the heart of League of Legends since the game is this kind of server-authoritative architecture, meaning that the server is kind of determining what moves are, like, correct and which ones to kind of ignore. We collect all of the kind of game state data from that. So, you know, a game lasts 30 to 45 minutes. So all the encounters, all the locations, everything like that, we end up collecting to do whatever we need to do with. And then, it being kind of an online game, we have a suite of microservices that do various different things, from, like, allowing players to manage their inventories or, like, whatever champions they bought or anything like that.
So we end up collecting all of that kind of data, so it's a pretty large, I guess, I would probably say it's a pretty decent variety just because, like, its logical division is pretty interesting. We tend to see most of our data be in this kind of, like, unstructured JSON form, and we do a lot of work at the data warehouse end to kind of put schemas on top of that to understand it. So I guess kind of the consequence of that is most of our analyses end up being some kind of batch. We have, you know, some kind of real-time, you know, like, use cases, but it's kinda interesting. A lot of it is definitively like a mini-batch kinda deal where it's like, we don't really wanna take any action for some kind of machine learning models until after a game is done playing, for example.
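As a rough sketch of what that schema-on-read step can look like (the field names and event shape here are hypothetical, not Riot's actual telemetry format), a warehouse-side job might project loosely structured JSON events onto a fixed schema, dropping unknown fields and tolerating missing ones:

```python
import json

# Hypothetical fixed schema applied at read time; extra fields are
# dropped and missing fields default to None (schema-on-read).
SCHEMA = ("match_id", "player_id", "event_type", "game_time_ms")

def apply_schema(raw_line: str) -> dict:
    """Project one loosely structured JSON event onto the fixed schema."""
    record = json.loads(raw_line)
    return {field: record.get(field) for field in SCHEMA}

events = [
    '{"match_id": "m1", "player_id": "p1", "event_type": "item_purchase", "game_time_ms": 412000, "extra": true}',
    '{"match_id": "m1", "player_id": "p2", "event_type": "kill"}',
]
structured = [apply_schema(e) for e in events]
```

In practice this projection would happen inside a batch engine such as Spark rather than a Python loop, but the principle, tolerating schema drift in the raw events while presenting a stable shape downstream, is the same.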
So I wouldn't really call that real time. I would say that that's kind of just after the game, and that's where a lot of our actions actually kind of end up living. So we end up being in this kind of really time-sensitive batch land. And the way we use the data, it ends up breaking into, like, 2 big areas. It lands in, like, decision science, which I think a lot of people listening are probably familiar with. These are, like, your dashboards, your, like, revenue strategies, and everything like that, kind of the typical things you see out of data teams. And then recently, like, the past couple of years, we've been really working more on, like, the machine learning and, like, the MLOps space.
We do a lot of work in, like, the player behavior space because we wanna make sure, like, people are playing fairly. So we, you know, try to determine if they were playing intentionally bad or, like, if they were just having an off day or whatever, and we try to use that to kind of inform models to say, like, hey, this player needs to be banned or something, or, like, this player maybe should go on, like, a 10-day warning or something like that. We have a lot of, like, those kind of behavioral systems going on.
We use a lot of that data to inform our matchmaking. We kinda have a really challenging problem there, which is, like, we're trying to model, like, a really latent property of our players. You know? We're trying to, like, measure what their skill is. It's not just something like a linear regression can just map with 2 or 3 features. Like, we actually need a lot of other models to kind of understand, like, how credible they are, what their normal kind of variance is across champions, what the meta is saying about what kind of champions are good right now. So we do a lot of that kind of activation to try to, like, give better games.
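To make the latent-skill point concrete: even the simplest rating systems treat skill as a hidden variable that gets nudged after every observed outcome. A minimal Elo-style update, purely illustrative and far simpler than the multi-model approach described here, might look like:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Nudge both ratings toward the observed outcome; k controls how
    fast a rating moves, i.e. how much one game is allowed to tell us."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * ((1.0 if a_won else 0.0) - exp_a)
    return rating_a + delta, rating_b - delta

# Two evenly rated players; the winner gains exactly what the loser drops.
new_a, new_b = update(1500.0, 1500.0, a_won=True)  # → (1516.0, 1484.0)
```

The gap between this sketch and a real matchmaking system is exactly the point Ian makes: real skill estimates also have to account for credibility, per-champion variance, and a shifting meta, none of which a single scalar rating captures.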
And recently, we've been trying to kind of get further into the game and actually give them more, like, recommendations. So, like, you know, the game can certainly be really confusing. If you're a new player, there's a lot of systems and a lot of, like, things that you have to interact with to play well. So, you know, we try to give you recommendations around, like, oh, what items we think you should buy, which is the currently launched example. So if you, you know, go into game and you open up the item shop, we'll kinda tell you, like, hey, based on your opponents, based on who you are and what you're playing, here are some items that would be really good for you, or here are items that are good against these opposing champions. And that is this kind of really weird, I'm sure we could talk about it whenever, but this is this really interesting space of, like, embedding machine learning into, like, binaries, which is a really weird space because most of the time I see it in services or, like, in the data warehouse itself.
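One common pattern for embedding machine learning into a game binary (and the episode's links mention libtorch) is to export the trained model from Python as TorchScript, so the serialized artifact can be loaded from C++ with `torch::jit::load`. The model below is a stand-in for illustration, not the actual item recommender:

```python
import torch

class TinyRecommender(torch.nn.Module):
    """Stand-in model: maps a small feature vector to top-3 item indices."""
    def __init__(self, n_features: int = 8, n_items: int = 16):
        super().__init__()
        self.linear = torch.nn.Linear(n_features, n_items)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        scores = self.linear(features)
        return torch.topk(scores, k=3).indices

model = TinyRecommender()
# Trace with an example input and serialize; the resulting file can be
# shipped with the game and loaded in C++ via torch::jit::load(...).
traced = torch.jit.trace(model, torch.randn(1, 8))
traced.save("recommender.pt")
```

The trade-off Ian describes follows directly from this pattern: once the artifact is baked into a build, inference quality is frozen until the next patch ships.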
[00:11:02] Unknown:
That's definitely very interesting, and there are a number of points that I think we'll dig into there. To your point about trying to infer a quantifiable aspect of skill: beyond not being a simple linear regression, it's also not a static value, because as people spend more time on the game, or step away for a long time while the game evolves without them and then come back with a play style that is no longer dominant or no longer fits into the overall play ecosystem, that introduces a number of added variables to the equation, and to being able to capture useful information to figure out things like that and manage those recommendation systems.
And then also, I'm sure there are aspects of data collection and analysis that feed into just the organizational and business elements of running the platform. I'm wondering what are some of the biggest challenges that you're facing either technically or organizationally that are data focused?
[00:12:04] Unknown:
Yeah. I think a lot of our challenges are probably more in, like, the technical space than the organizational space, but they certainly exist. A lot of it comes down to the fact that, like, League of Legends is just a really old game. Like, it's gone through a lot of different changes and migrations such that, like, even the way we just ingest data from, like, the game server has changed hands multiple times, has changed systems multiple times. You know, for example, at one point, we were using, you know, fully Kafka-based ingestion, then we moved to, like, this S3 batch ingestion.
Maybe we moved back or something. So just in that realm, we have a lot of challenges around, like, kind of reconciling the current state of technology with what the old state of the business was. Because, like, to a lot of players, it doesn't really matter if you're running on Kubernetes or not. You know? But, you know, the way we collect that data certainly does. And it's the way that kind of shifts or, like, the distribution of data kind of shifts underneath us. We have to be able to, like, reconcile all of that at ingestion time or maybe even at, you know, training time.
That ends up creating a lot of, like, interesting cases around, like, oh, hey, for this patch, ta-da, like, you know, half of this champion's data got nulled out because of a feature flag or, like, a botched migration that worked everywhere else, but not in live or something. So we tend to have a lot of challenges just trying to move across that kind of, like, large legacy space, I would say. The other big challenge comes down to kind of the way we operate with, like, our various vendors or publishers. Just recently, we announced that Riot will be taking over publishing in, like, the Southeast Asia areas. But, you know, before then, we had, like, another company involved that would kind of help, you know, operate the game for us and facilitate that exchange of data.
This is also true with Tencent and how we operate over in China. So there's kind of a lot of challenges there in, like, the privacy space that we have to think about. And then, yeah, I guess the final point is just we definitely have a lot of, like, derived features and, like, derived metadata that we have to think about. Just to give, like, an example, we used to have this measurement called credibility, which was a way we could understand if a player generally tells us the truth when they report somebody else and they say, like, hey. They were playing poorly or something. And that feature in and of itself fed into many other machine learning models.
So we have to do, like, a whole lot of legwork to transform any of this data into some kind of, like, usable metric, and it goes through many pipes before it even gets to players. So, like, I guess that complexity just kind of grows as the business changes and as we try to model things like what we understand the meta to be, you know, what we understand good champions to be. So I guess it's all maybe a typical answer, but it's all change management. Like, it always is change management, and it always is around, like, how old the system is and how the data changes underneath. And that's certainly true for us.
[00:15:18] Unknown:
And in that aspect of being locked in by some of these legacy aspects and legacy platform decisions, I'm curious if there is a particular interface as far as the data collection or manifestation of data products in the game that has created some of the constraints that you have to work around and ways that that is reflected in how you orient the overall architecture of the platform to be able to fit within some of those constraints that were implemented early in the game's history?
[00:15:55] Unknown:
That's a really fun question. It's funny because, like, I think as League got bigger and, like, as we kind of underwent our own version of the very common migration going on right now in companies, going from monolith to microservices, we still ended up having this, like, big old monolith application, this big monolith Java application that did kind of all the final routing from, you know, whatever is going on outside into the video game. That piece is still very much alive. So there are particular data flows, or, like, this one very explicit kind of JSON object that has to be changed in, like, very particular ways in order for a game to start. So that doesn't bring up any challenges in and of itself for the data team, though it certainly does for all the services teams. But for us, what that means is by the time data gets into the game, there's kind of this really strict rule that, like, data can only be in the game if it was built into the game or if it comes in from this very particular path that is really nebulous, like, really difficult to change, and, like, the effects are super nebulous because of our kind of span throughout the world.
So what that means is, like, we can't just ship a service called, like, the item recommender service that we ask the game server to kind of reach out to. Like, the architecture really kind of baked itself in such a way that it was like, hey, no, if you want this item recommendation data, we don't really have a great way to do that kind of inference unless you just bake it into the game. And I think because games are, like, pretty sensitive to latency, that restriction makes a lot of sense. But as the game kind of grew and as our feature set kind of grew, the wiggle room gets even smaller. Does that kinda make sense? So, like, you know, it becomes more and more difficult to say, like, hey, at the beginning of the game, please reach out to our machine learning service that'll take 2 seconds to compute a feature to give it back to you. Because, like, 2 seconds can make, like, a huge difference for the player experience if they're sitting on a loading screen even longer, if the game hitches because, like, you're loading some bespoke machine learning feature.
So a lot of our architecture ends up having to kind of follow this, like, don't touch the game server if you don't have to. Don't touch the sort of supreme JSON object if you don't have to. And I think I guess, like, it's funny because, you know, when you mentioned that, it's like an onion. Right? There's layers to this in the sense that, like, even security into the game servers is pretty interesting. Like, we don't allow arbitrary TCP connections to the game server. Otherwise, it's like a huge security risk. We have, you know, tons and tons of custom infrastructure in front of the game servers to monitor network activity to, like, cut out any packets that get in the way.
So even just kind of trying to be around the game server and kinda cheat the system a little bit is also super spooky. So, yeah, I think that's definitely 1 of the things that make all of our kind of products really interesting from an engineering perspective because some of these constraints are maybe hilarious, but also serve, like, a really distinct purpose that you don't really wanna, like, try to up and change willy nilly. You really need to, like, understand it. And sometimes that just means going with what is, like, kind of been demonstrated to work. So baking data in the game and just kind of accepting that, you know, inference will always be stale in 2 weeks, so to speak. Or, like, maybe even deciding that, like, this feature is too sensitive to a 2 week staleness, so we're just not going to ship it this way. We're gonna have to kind of invent another way to model this kind of problem.
[00:20:01] Unknown:
Hopefully, that answers your question. Yeah. It definitely does. And one of the other aspects that I was going to dig into is some of what you're discussing around the sensitivity to latency from the end-user perspective, both from the avenue of, I want to provide this machine learning feature or this data input to influence the way that the game experience is going to progress, but also from the perspective of, I need to be very careful about where and how and what type of instrumentation I add into the game to be able to pull data out of it as well. Because the game engine is on a very tight loop to ensure that the user has that interactivity, even if it is dealing with potentially stale network data, the end user needs to experience the game as fluid or else they're just gonna drop out. So how does that influence the way that you think about instrumenting and collecting data from that end-user experience, or where that data collection happens, whether it's on the client side or in the server, or, you know, just monitoring the TCP traffic and trying to rehydrate that into something meaningful?
[00:21:11] Unknown:
It's funny because, like, I think, like, all those constraints also give, like, a really interesting, like, outcome, which is that, like, the game server kind of knows all. It's this kind of all-knowing entity. Like, it knows everything that came into it to construct the game, all the important bits anyway, and it can record everything that happens in the game and tell you otherwise. So you really just need to be able to talk to the game server, which has kind of led to a pattern I haven't seen in too many places. You know, at DoorDash, if you wanted to say, hey, what happened on the entire journey of this delivery from restaurant to merchant to driver, or to Dasher?
You had to kind of merge it across various different systems and maintain many a foreign key to do so. In League, and I think also, I've heard that Fortnite might do something similar. I haven't dug into it. I have a connection over there who was talking to me about it recently. But we basically are in this kind of fortunate place to generate a single artifact from a game, just like a single Parquet file, essentially, of just all the events that happened, when they happened, with all the sort of dimensional data that we care about. I guess the benefit was all of that stuff was in memory at one point in one system. So, like, it's completely referentially correct. Like, there's not gonna be, like, you know, this player 1 bought this item, and then you never see player 1 again. Like, that's never going to happen.
It was proven that it was all kind of in memory at one point, unless the game server crashed or something. So that actually leads to this really interesting dataset that gets generated that we are trying to leverage kind of as much as possible. It's the singular dataset that's completely correct that we can, like, ingest into the data warehouse and say we know everything about this game once the single file finished ingesting, which is, like, a really powerful feature. We can train machine learning models off of this, which we are doing, and we're trying to say, like, you know, if you ever need raw data for inference or something, the latency that this data is available is actually really quick. So you can just kind of hit that, unfurl the file, and then do your inference or whatever. And, yeah, it actually cleans up some data quality problems, but introduces other data quality problems.
A fun one is: if there's ever some flag in this giant multimillion-line C++ codebase that's turned on on one game server, but not the other however many hundreds, you might have a duplication of data or another row inserted. And it's just this random anomaly that you now have to understand: well, what do we want to do with it? So building our pipelines in a way that anticipates that is certainly challenging, but the benefit of this kind of dataset is really powerful. This is one of the only times I've seen this happen, and it's really interesting to think about how it changes things, if that makes sense. If you go and read a blog post from, say, a Netflix, it's interesting because they do all this work to munge all this data and collect it. Maybe we don't have to do any of that, and we can just reap the benefits of their multi-armed bandit system or whatever. So I think that's definitely a really interesting capability we have.
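The duplicate-row anomaly he mentions is usually handled by deduplicating on a stable event key while counting what was dropped, so the misconfigured server shows up in metrics instead of being silently absorbed. A minimal sketch, with hypothetical key fields:

```python
# Sketch: a flag enabled on one game server out of hundreds can emit
# duplicate rows. Deduplicate on a stable key and count the drops so the
# anomaly is observable. Key fields (game_id, event_id) are illustrative.

def dedupe_events(rows, key=("game_id", "event_id")):
    seen, unique, dropped = set(), [], 0
    for row in rows:
        k = tuple(row[f] for f in key)
        if k in seen:
            dropped += 1  # feed this into a metric/alert, not just a log line
        else:
            seen.add(k)
            unique.append(row)
    return unique, dropped

rows = [
    {"game_id": 1, "event_id": 1},
    {"game_id": 1, "event_id": 1},  # duplicate from the anomalous server
    {"game_id": 1, "event_id": 2},
]
unique, dropped = dedupe_events(rows)
assert len(unique) == 2 and dropped == 1
```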
[00:24:47] Unknown:
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star's data discovery platform solves that out of the box with a fully automated catalog that includes lineage from where the data originated all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your dbt, Snowflake, Tableau, Looker, or whatever you are using, and Select Star will set everything up in just a few hours. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Another interesting aspect of your situation that I'm curious about is the team topologies and the interfaces that you have from the data engineering and data platform side of things reaching across to the developers of the game and the game engines, and some of the feedback loops that you have available for being able to say, I want to be able to create this type of information, or the game devs saying, hey, I want to be able to use this data element or this type of recommendation to power this feature in the game. And just some of the ways that the organization is structured to facilitate those interactions, and the interfaces that you have to make sure that it doesn't just become a free-for-all of everybody throwing requests everywhere.
[00:26:15] Unknown:
Yeah, definitely. I think it's the requests everywhere and the data everywhere that's the really crappy part of that, because it's hard to actually have good quality data when it's just so random. I think the best way to answer that is to talk about the way League is developed from the game engine perspective. We build the League of Legends game engine, and League Data Central is the team that helps build any data integrations out of the game into the data warehouse, or any frameworks or platforms to help you develop your machine learning products. One of the things we see is that we have this really embedded model, where we'll have data scientists working with individual game designers or game design teams, be it on Teamfight Tactics or League of Legends or Wild Rift or whatever.
And we'll try to understand: what is it that you're trying to solve? What is it that you're looking into? The data scientists are actually empowered to do a little bit of POC work. And once they understand whether the problem is solvable, they can come back to us, or to our overarching pillar, we're under an organization called Tech Foundations, which tries to provide all the infrastructure for all the game teams. We're then able to say, hey, this is a game engine feature, let's go work with the game engine team to make it more general for other teams. So that flow, from data scientists working with a specific game designer out to general change within any of the infrastructure tools, is something we see a lot.
But we also see a decent amount of emergent work, in the sense of lots of folks coming and asking, can you help us interpret this dataset? Can you help us understand what's going on here? And for that, we're trying to solve it with some self-serve tools. We're really exploring the world of data contracts, in the sense of allowing engineering teams or product teams to define what they want their data to be, and then just helping them find it. We find the tool Alation to be useful here because it gives people a way to look over the entire catalog of all the collected data, and we try to keep that somewhat up to date. So if there's some particular change they're looking for, they can just reach out to the owning team. But I definitely think the main avenue is this embedded model of data scientists really understanding the problem space from the game designer's perspective, transforming that into a data problem, and then working with however many teams across the central Tech Foundations pillar to really bring that change out.
Since we build a lot of our own tools, from the way we build the game, to the way we ship the game, to the way we design characters, we have a lot of ability to go in and say, hey, we're just going to build this new component within the artist tool to let them do that. So the folks on my team have to be pretty versatile: on any given day, we're doing something in C++, something in Spark, or something as low level in the game engine as possible. But it's all in service of trying to help data scientists discover ways that they can give data to game designers, which is, I think, a really interesting challenge.
[00:29:59] Unknown:
In terms of the current platform architecture that you've settled on, I'm wondering if you can give an overview of at least the kind of core elements and the aspects of the platform capabilities that you are optimizing for.
[00:30:17] Unknown:
There is a lot of history here. There's this really good blog post that some engineers at Riot wrote, it's seven parts or something, about how the container infrastructure has evolved and the way we ship services, and it's actually a really good demonstration of why we've had to make so many changes. But where we've settled, at least on the data platform side, is that we have an individual collection service at our edge network that collects all the data from players, and that feeds into a fairly real-time, Kafka-based ingestion system that is maintained between us and a central data team at Riot. So a lot of data comes in through that and then gets Kafka'd into S3, into our Hive data warehouse.
We have a separate Java service that we maintain that talks to the game server to ingest that file I mentioned, parse it out, understand it, and push it into the data warehouse, also over S3, although that one's not super Kafka-y. And then we have an internal collection service, similar to the one that's customer facing, but this one generally takes data from microservices. It goes through a similar path, but it's a little more vetted, a little more secure, so there are some shortcuts and optimizations there. Once those all land in S3 buckets, be it through Kafka or through custom services, we sit on top of AWS Glue and Hive, and we've partnered pretty heavily now with Databricks, since we leverage Delta tables a lot to get the ACID constraints that we might want in certain tables. And then from there, we use Airflow to do all the orchestration of all of our Spark jobs.
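The ACID behavior he gets from Delta tables is essentially MERGE-style upsert semantics: match incoming rows to existing rows on a key, update the matches, insert the rest, atomically. As a toy, in-memory illustration of those semantics (plain Python, not Delta or Spark code; in production this would be a single transactional `MERGE INTO`):

```python
# Toy illustration of MERGE-style upsert semantics: match on a key, update
# matched rows, insert unmatched ones. Field names are made up.

def merge_upsert(target, updates, key="game_id"):
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        by_key.setdefault(row[key], {}).update(row)  # update-or-insert
    return list(by_key.values())

existing = [{"game_id": 1, "winner": "blue"}]
incoming = [{"game_id": 1, "winner": "red"},   # late correction
            {"game_id": 2, "winner": "blue"}]  # new game
merged = merge_upsert(existing, incoming)
assert len(merged) == 2
assert {r["winner"] for r in merged if r["game_id"] == 1} == {"red"}
```

The point of doing this inside a Delta table rather than raw Parquet on S3 is that concurrent readers never observe the half-applied state.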
It all ends up in Tableau or some kind of front-end dashboarding tool; generally, Tableau is what we see. And then for the data activation cases, depending on what we're looking at, sometimes it's another team and we'll just dump into their S3 bucket. We're a pretty big AWS shop, so lots of tooling exists to share files across different AWS accounts without too much hassle. When we talk about the game, it's interesting because at that point we're linking into our custom compiler system. We have this really big build farm out in Las Vegas that builds the game under various compiler settings, or various shader settings, if you will, to ingest the graphical data and stuff like that.
So we'll hook into that on a batch cadence to say, here's the new item recommendation data, or here's the new detection algorithm to determine, did a player play top lane, or did they actually play bottom lane even though they said they were going to play top lane? That kind of inference about how a player actually played, all of that is translated from some Spark code into some C++ code that gets compiled into the game. And that one's really fun, because I don't know if you've ever done the exercise of trying to take a decision tree and actually codify all of its splits out into if-else statements.
But actually seeing it rendered that way is really hilarious. So we have a couple different things that end up going through this translation layer, if you will, from Spark into compiled C++. It's pretty interesting, and it works fairly well. But, yeah, I think that's the breadth of it. It's mostly a Hive shop, just with some really interesting data activation cases.
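The decision-tree-to-if-else exercise he describes can be sketched as a tiny code generator. The tree representation here is a toy nested dict, and the feature names are invented; this is not the team's actual translation layer, just the shape of the idea:

```python
# Sketch of "unrolling" a trained decision tree into C++ if/else statements
# so it can be compiled into a game binary. Tree format and feature names
# are hypothetical.

def tree_to_cpp(node, indent=1):
    pad = "    " * indent
    if "leaf" in node:  # terminal node: emit the predicted class
        return f"{pad}return {node['leaf']};\n"
    code = f"{pad}if ({node['feature']} <= {node['threshold']}) {{\n"
    code += tree_to_cpp(node["left"], indent + 1)
    code += f"{pad}}} else {{\n"
    code += tree_to_cpp(node["right"], indent + 1)
    code += f"{pad}}}\n"
    return code

tree = {
    "feature": "gold_earned", "threshold": 5000,
    "left": {"leaf": 0},
    "right": {"feature": "kills", "threshold": 3,
              "left": {"leaf": 0}, "right": {"leaf": 1}},
}
cpp_body = tree_to_cpp(tree)
assert "if (gold_earned <= 5000)" in cpp_body
assert "return 1;" in cpp_body
```

The generated body would be wrapped in a C++ function returning an enumerated value, which matches his earlier point that in-engine code can only return constants baked into the binary.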
[00:34:20] Unknown:
As far as the delivery step of the ML models in particular, you mentioned that those get baked into that compiled binary. And I'm curious what types of constraints you have around the actual size of the binary for being able to deliver it to end users, because, obviously, you don't want to bake in, you know, the entire GPT-3 model into a binary that somebody has to download and play. You want it to actually fit on their hard drive.
[00:34:50] Unknown:
Hey, by the way, in order for you to play this character, you have to download the 2 terabytes of all player data so we can give you one single number. That'd be fun. No, it's really weird. There's a lot of philosophical change, I think. When you're in service land, and we have a couple of services that are just microservices to detect, did you feed in game, those are really nice, because we've left the game server, we've come back into microservice world, and we have your standard network call procedures. But once you get into the game server, it's like, oh man, someone's just going to call this as a function, and we can't go and make separate network requests. We can't pull anything off of a hard disk or anything like that. We just have to bake all of this into RAM and constants and code that return some enumerated value that the game engine can understand.
So the constraints really come down to: is this even maintainable? A really good example is our item recommender, where it's actually a really challenging problem to know if someone selected an item that we recommended to them. Right? Because even though that recommendation happens in the game, we don't know what their decision was until after the game ended, and we ingested it, and we parsed it out, and we reinterpreted the meta and everything.
So that delay between inference and what actually happened is already super challenging. And on top of that, we have to ship things that the game engine understands, so there are a lot of format problems that we end up running up against. It's pretty funny, because we have these automated recommenders generating data that gets baked into binaries. So if we have a bad day, or if Databricks just loses a container and only half-correctly finishes a task and generates invalid JSON data, we break the entire build of the video game.
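The "invalid JSON breaks the build" failure mode suggests a validation gate between the batch job and the build farm: reject half-written output before it ever reaches the compiler. A minimal sketch; the payload shape and item/champion ids are made up for illustration:

```python
# Sketch of a validation gate: before recommender output is baked into the
# game build, reject truncated or malformed JSON so a flaky batch job can't
# break the build for every game developer. Schema check is illustrative.
import json

def validate_recommender_payload(raw: str) -> bool:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # minimal shape check: a non-empty list of recommendation records
    return isinstance(data, list) and len(data) > 0

assert validate_recommender_payload('[{"champion": 1, "items": [3006]}]')
# A half-written file from a lost container fails fast instead of breaking the build:
assert not validate_recommender_payload('[{"champion": 1, "items": [3006')
assert not validate_recommender_payload('[]')
```

In practice a gate like this would sit in CI, keeping the last known-good payload in place when validation fails.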
That's not very good for a video game company with lots of video game developers who need to build the video game. If you break the build, everyone gets frustrated. We have a lot of really interesting lines to walk depending on where we're shipping the service: whether it actually ends up in game, whether we need developers to be able to call a function and get a value back, or whether they're instead using this data system that's been built, which is maybe a little more flexible but, from a game developer's perspective, feels a bit more like an RPC call as opposed to a function call. It's really funny because it should be fairly easy, but once you start to think that you only have, you know, 60 picoseconds or something to return a function value, all of this becomes much more complex. So the models we end up shipping tend to be as pickled as possible, as compressed as possible, or as reduced as possible, such that if we do have to send layers of a neural network, for example, we only send the minimum amount of weights, or some minimum amount of data from which we can reconstitute the model in game, in C++, with as little loss as possible.
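One common way to ship "the minimum amount of weights we can reconstitute the model from with as little loss as possible" is quantization: store each float weight as an 8-bit integer plus a shared scale and offset. This is a generic technique, not a description of Riot's actual format:

```python
# Sketch of shrinking model weights before shipping: affine 8-bit
# quantization. Each weight becomes one byte; the receiver reconstitutes
# floats from (q, lo, scale). Purely illustrative.

def quantize(weights):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 if hi != lo else 1.0
    q = [round((w - lo) / scale) for w in weights]  # ints in 0..255
    return q, lo, scale

def dequantize(q, lo, scale):
    return [lo + v * scale for v in q]

weights = [0.0, -1.5, 2.25, 0.75]
q, lo, scale = quantize(weights)
restored = dequantize(q, lo, scale)
# Round-trip error is bounded by the quantization step.
assert all(abs(a - b) <= scale for a, b in zip(weights, restored))
```

At 8 bits per weight this is a 4x size reduction over float32 before any compression, which matters when the payload rides inside a client download.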
And that whole iteration loop is really challenging. I guess it's similar to the Android problem, in the sense that someone could still be running a really old version of Android on a really old phone. You could play League of Legends on a toaster if you try hard enough. So we need to understand: can we actually take the time to reconstitute this model on everyone's machine, or do we have to bake in overrides if we detect that, hey, your machine is just not going to be able to run this in a reasonable amount of time? It's kind of a smart default. I know I'm kind of dodging the question, but it's mostly because we have a couple different models sitting in game, and they all have different avenues and different constraints, all shaped by this latency we're having to play against and the variety of computers that can play League of Legends. Computers and/or toasters, I suppose. Yeah.
[00:39:23] Unknown:
And that aspect of being able to dynamically toggle whether or not you're actually going to use a particular model or enable a particular feature, based on the resource capacity of the machine that's running the game and the end user experience: I'm curious how that factors into the quality and consistency of the data that you're able to collect under those resource constraints, and some of the ways that you factor that into the platform and transformation designs to manage that lack of consistency. Both because people aren't necessarily all going to be running the exact same version of the game, because they haven't bothered to download it yet, and because their machine might not be able to run the instrumentation that would give you that additional feature or piece of information from their experience.
[00:40:18] Unknown:
It's interesting, because I feel like we're going to be solving that problem ad nauseam; it's never really going to go away. The things we've found that best help us mitigate it come from leaning into the MLOps model, or rather the DevOps model: the idea that if data scientists are shipping something, let's ship a package. Let's ship something that's version controlled, with unit tests, with as much faked data as we can. And a significant amount of effort goes into alerting. I wouldn't necessarily call it anomaly detection, but just trying to capture the cases where, hey, this game only has 80% of its data reported. I don't really need it, I'm just going to get rid of it, but I'm going to tick a counter.
And if this counter gets too high, I'm going to page an engineer to see if they can come in and do something about all this missing data. So we really try to leverage this kind of modular development, with unit tests and integration tests, to catch and test these things. We work pretty closely with our game analysis team to say, hey, we've made this kind of nebulous change, or, you know, it works on my machine, it works on all the build machines, but can you go through and try to understand how it's going to work with random users in PC bangs? Or how is this going to affect people who play this particular champion? Because we have 180 champions, and we didn't playtest all 180 of them. Can you help us understand that? Riot has a pretty good QA team and a pretty good QA system to mock that stuff out, so we leverage that as much as we can.
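The "drop the incomplete game, tick a counter, page an engineer when it gets too high" pattern he describes can be sketched in a few lines. The thresholds and the idea of returning a page signal are hypothetical stand-ins for real alerting infrastructure:

```python
# Sketch of the drop-and-count pattern for games reporting incomplete data:
# tolerate individual bad games, but page a human when the rate gets too
# high. Thresholds are illustrative.

class MissingDataMonitor:
    def __init__(self, threshold=100, min_completeness=0.8):
        self.threshold = threshold            # dropped games before paging
        self.min_completeness = min_completeness
        self.dropped = 0

    def observe_game(self, completeness: float) -> bool:
        """Drop incomplete games; return True when an engineer should be paged."""
        if completeness < self.min_completeness:
            self.dropped += 1
        return self.dropped >= self.threshold

monitor = MissingDataMonitor(threshold=3)
pages = [monitor.observe_game(c) for c in (0.5, 0.95, 0.6, 0.7)]
assert pages == [False, False, False, True]  # third dropped game triggers the page
```

The design choice is that a single 80%-complete game is noise, but a run of them is a signal worth a human's attention.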
And then I think a big part of it is that we're not afraid to be a little more experimental with the data that we collect. Since we operate in a bunch of different shards and a bunch of different regions, and they all get deployed separately, we have a little more flexibility to say, hey, try this thing out in this region for a month, collect as much data as possible, wait for these kinds of events to happen, see what our flux is, and then address it in our pipelines accordingly. I guess the best way to answer it is that our rollout for models tends to be pretty long, as opposed to other shops where sometimes shipping a model was just uploading weights to an S3 bucket, and all of a sudden we had the model shipped.
We tend to have a more structured rollout plan. We monitor metrics very closely, and we work with change advisory boards really closely. I'm not sure if that really answers the question, but I guess the answer is that we just try to do a lot of DevOps practices, and we roll out pretty slowly to gain a lot more confidence, until we're at that 95% mark and we say, okay, any variance in the data now we'll just treat as an emergent issue, or an on-call issue, and update accordingly. It certainly still happens. We've just found that trying to get in front of it in the development process saves us the most headache.
[00:43:49] Unknown:
Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day, every day, especially once they realize that 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, and spreadsheets are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from over 40 countries to set up and run low-latency pipelines. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get real-time data flow visibility with fail-safe mechanisms and alerts if anything breaks, preload transformations and auto schema mapping to precisely control how data lands in your destination, models and workflows to transform data for analytics, and reverse ETL capability to move the transformed data back to your business software to inspire timely action.
All of this, plus its transparent pricing and 24/7 live support, makes it consistently voted by users as the leader in the data pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata today and sign up for a free 14-day trial that also comes with 24/7 support. Another interesting element of how to think about the design and operation of data platforms and data systems, particularly given the longevity of the product that you're working on, is how to reduce the overhead of the onboarding experience, so that as new teammates come onto the team, or as people need to work cross-functionally, somebody working on the game engine can just self-serve some aspect of data.
How do you think about the design and user experience of those platform components to be able to reduce the level of effort required to manage that onboarding and being able to come up to speed and become effective?
[00:45:50] Unknown:
That's actually a really big challenge that our data engineering team is thinking about, because their on-call load can be quite high, depending on the time of year. If Worlds is coming up or something, and there are a lot of analysts who are maybe a bit more curious about something at that point in time, their on-call load can be pretty high, because there are a lot of questions of, where is this data? Can I trust it? Etcetera. And given that the team is pretty small, we tend to leverage vendor tools to help us as much as we can here.
We're spiking out the use of Monte Carlo to see if we can reduce the number of times people come in and say, hey, this data looks weird for this one hour, which is the one hour I care about of all hours. I think I mentioned before, we heavily use Alation to make sure that we have an up-to-date catalog that we try to funnel people to as much as we can. But I think the challenge of onboarding at League is one that exists kind of everywhere. We have this internal wiki for game engineers who are getting caught up, and it says, in big bold headline text, it will take you anywhere from 6 to 8 months to be productive.
They're just not trying to sugarcoat the fact that this is many years' worth of work, and there's probably no way to get caught up quickly. So from the data perspective, we generally try to lean as much as possible into: use our UDFs, use our defined libraries, use our tools, because they'll help you avoid common pitfalls, or save you time. A decent example is that a new analyst might come in and naively try to query all the game summary data for one particular shard.
But shards go by many names, so you might not even know you need to join against some other dimensional table. So we try to provide libraries or tools that are installed in your Databricks notebook by default, and we try to lean you into using those as much as possible, until you feel comfortable enough to start writing raw SQL against the Hive tables. It's definitely a really interesting problem that we're still trying to nail down. I think because there's this accepted bar of difficulty across all the different League of Legends teams, you see a lot of different individual onboarding projects surface.
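The "shards go by many names" pitfall is the kind of thing those default libraries can paper over: a helper that normalizes shard aliases before the analyst ever writes a query, so they don't need to know about the dimensional join. All names and aliases below are made up for illustration:

```python
# Sketch of a shard-alias helper of the sort a default-installed library
# might provide, so new analysts don't silently query the wrong shard.
# Aliases and canonical names are hypothetical.

SHARD_ALIASES = {
    "na1": "north_america", "na": "north_america",
    "euw1": "europe_west", "euw": "europe_west",
}

def canonical_shard(name: str) -> str:
    key = name.strip().lower()
    if key not in SHARD_ALIASES:
        raise KeyError(f"unknown shard alias: {name!r}")
    return SHARD_ALIASES[key]

# Different spellings of the same shard resolve to one canonical name.
assert canonical_shard("NA1") == canonical_shard("na") == "north_america"
```

Failing loudly on an unknown alias is the point: a typo becomes an error at query-build time instead of an empty (or wrong) result set.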
I was part of one that tries to say, hey, if you're a new services engineer, you come to this class and we teach you all the services infrastructure over the course of a week and a half. There's a game boot camp that we hold, so new people coming into game engineering, or folks transitioning into game engineering from another team, might spend 3 or 4 weeks just going over individual components of the game engine and how it ships. So, hopefully, if you ask me again in 2 or 3 years, we'll have a more structured answer. But right now, we're really trying to solve it with vendor tooling and libraries, to reduce the friction of getting into the ecosystem and just start playing around.
[00:49:18] Unknown:
In your experience of working on the League data team, what are the most interesting or innovative or unexpected ways that you've seen data and derivatives of the raw data used either by Riot Games as an organization or by the players?
[00:49:34] Unknown:
Yeah, I definitely think for me, it's this translating of decision trees or neural networks or whatever into C++ code. It is just such an interesting experience to have to leave model-training world, which everyone now thinks of in terms of Python libraries or Jupyter notebooks, and come into the world of: no, this is a C function that allocates a pointer, sticks a weight into it, and then multiplies it out. That is just such a trip. And it is one of the most powerful things we have, because we can really deliver customized experiences to all of our players.
But it's also one of the things that I know, if I don't go to a game company after this, is just going to rot in my brain immediately. I'm never going to need to think about how to serialize a decision tree into C++ ever again. Maybe I will, but I have my doubts. So there's that. And I also think we're fortunate to have all of the data funneled through the game server. It saves us a lot of grunt work making sure that our data is consistent with itself. We can just trust the game server to do what it does, and then put checks in place to make sure that nothing crazy happened, and no crashes happened along the way.
So that's really nice. I remember at Adobe, we spent a long time just trying to build systems to join various datasets. Having all of that go away right up front is really interesting.
[00:51:18] Unknown:
In your experience of working on the team and helping to implement some of those technological capabilities, what are the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:30] Unknown:
I think for me, the biggest thing has been that very little of being a machine learning engineer, if you will, is statistics. Very, very little. A lot of it is understanding your model. We've had outages with models that took 60 days to manifest into any kind of problem, but then the problem they manifested into was hundreds of players being banned for no apparent reason. And it's like, well, why did that happen? Having to trace the data through the pipelines and understand how you're computing a feature, or even realizing that the golden dataset we trained this model on is now just too old, and we need to recollect it and try again, has bitten us time and time again. Not actually knowing whether our model is working as expected is an ever-growing pain for us. At any point in time, without proper monitoring or proper tooling or proper testing, any given model we have is a ticking time bomb, and that has nothing to do with the fact that we used stochastic gradient descent with some particularly hypertuned parameter.
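The "golden dataset is too old" failure he describes is a drift problem, and even a crude guardrail catches a lot of it: compare a feature's live distribution against the training snapshot and flag retraining when it shifts too far. A deliberately simple sketch (real monitoring would use proper statistics such as PSI or a KS test; the tolerance here is arbitrary):

```python
# Sketch of a cheap drift check: compare a feature's live mean against the
# mean from the "golden" training snapshot, and flag retraining when the
# relative shift exceeds a tolerance. Illustrative only.
from statistics import mean

def needs_retraining(golden_values, live_values, tolerance=0.2):
    g, l = mean(golden_values), mean(live_values)
    denom = abs(g) if g != 0 else 1.0
    return abs(l - g) / denom > tolerance

# Stable feature: no flag. Shifted feature: retrain before it bans someone.
assert not needs_retraining([10, 11, 9], [10.5, 9.5, 10])
assert needs_retraining([10, 11, 9], [15, 16, 14])
```

A check like this would have surfaced the slow 60-day divergence as a metric trend long before it manifested as wrongly banned players.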
It all has to do with how we've deployed it and how we've engineered the system to work. And as I've left school and come into a couple different teams now doing data, it's become more and more apparent to me that the skills I have are in the realm of building better systems, not necessarily in designing better algorithms. And it turns out that that's a huge problem for a lot of organizations, so I'm quite lucky to have that skill set. Absolutely.
[00:53:30] Unknown:
I can relate in that regard. Yeah. As far as the work that you've been doing at Riot Games and on the League Data team, what are some of the most interesting or informative mistakes that you've made, either personally or at a team level?
[00:53:44] Unknown:
Oh, I think for us, it's definitely always been around experimentation. It wouldn't surprise me if, in the near future, our team spends a lot more time building tools to help us do experiments more easily and define what an experiment even is. Launching a new service too soon could lead to corrupt data for a month. So now, for the rest of time, all of your pipelines have to take into account the fact that you shipped the service poorly for one month in August 2020 or something. On the flip side of that, waiting too long to get in front of poorly playing players in Europe or something, not that Europe players play badly, but just any particular area of the world, taking too long to get in front of that and solve it from a data perspective could mean we lose players. It could mean that people change their reputation. It could mean that more toxic people show up to that shard, and now we have Reddit posts flaming us for days or whatever.
And that's just an ongoing problem that we've never really taken the time to address. So I think for me and for a lot of our teammates, you'll always hear us be like, oh, dang, I wish I'd spent two more weeks just thinking about this so we didn't have to deal with it. Or, we spent way too long polishing the thing, and by the time we shipped it, no one even cared anymore, because all the hype was gone, or Fortnite beat us to it, or whatever. So I think for us, it's that question of how you balance being the shop everybody wants to be, where you move fast and break things, while also accepting the fact that if you mess up data for months, you're going to have to live with that for years.
Yeah, I think that's definitely, at least for me, the biggest learning, and I know for a couple of the coworkers I talk with, that tends to be a thing that we all vent about at various times. Absolutely.
[00:55:58] Unknown:
And as you continue to build and iterate on the systems that you're supporting and work with the game developers and game engine teams, what are some of the things you have planned for the near to medium term of the data stack or any projects that you're particularly excited to dig into?
[00:56:14] Unknown:
We're really looking into what live inference means in a game engine, and how we can change the fact that we're afraid to get into the game engine because of network connectivity, network constraints, or performance constraints. If instead we treat those constraints as just parameters to build the system against, what could we do? Is there a world where we embed LibTorch into the game, and that actually is not prohibitively difficult to ship, and players can still use it? In-game inference is definitely an area we're getting into that I'm really excited about. It's a bit less the research-y reinforcement learning bits, and more just: how could we better personalize an experience? How could we tell somebody, hey, now's a good time to join a team fight, or something like that? So I'm really excited about that space. I'm also really excited about building more generalized tools for game developers, so they have this stuff out of the gate. As more and more games come out of the League engine, it has a bigger reputation to uphold. So if we build a system that really well describes the meta of TFT, imagine some system that defines it in terms of sentences, so a game designer can take it into their world and articulate it, then a new game coming up on the League engine is like, hey, we want that. We want what TFT has, because who else is making a competitive card game in the space?
So I think further generalizing our tools, and these kinds of ML systems, is something I'm really excited about. Hopefully, as the years go by, they become more and more robust, and I can talk about them again in another conversation like this.
[00:58:18] Unknown:
Are there any other aspects of the work that you're doing on the league data team or the data challenges or applications that you're supporting that we didn't discuss yet that you'd like to cover before we close out the show? Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:58:45] Unknown:
You know, it's funny. I've been listening to the show for a while, so I knew this question was coming, and I wondered what my answer would be. I feel like this is really the time to shine, but it's hilarious, because I think my answer is: I wish there were something, be it a technology, a tool, or just a better course, that helps explain database fundamentals in the world of machine learning. I feel like a lot of the challenges we end up solving, if you go back far enough in the database literature, you find the exact same problem, just phrased in business intelligence words.
So a better way to go back and uncover all of that research from database management systems, I really wish existed. I wish I didn't have to read something about how Uber solved their shuffling problem and then say, "This sounds like a replication problem. Didn't MySQL solve this?" And then, sure enough, there it is in some 1980s textbook sitting in the UC Davis library. I wish that didn't have to happen all the time, but we stand on the shoulders of giants as it is, and sometimes you don't know who the giants are. And everything old is new again. Yeah, everything old is new again. So I'm trying to take it upon myself, as I discover those things, to write blog posts or something that say, "Hey, if you remember solving this in your organization, so did Postgres in the 1990s. Here's how they did it. Let's see what we can take from that."
[01:00:23] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you've been doing on the League data team, some of the challenges that you're facing, and some of the solutions that you're building around them. It's definitely a very interesting problem space, so I appreciate you taking the time to share it, and I hope you enjoy the rest of your day. You too, man. I appreciate you having me on the show.
[01:00:48] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Ian Schweer and His Role at Riot Games
Overview of League of Legends and Data's Role
Characteristics and Collection of Game Data
Challenges in Data Management and Legacy Systems
Constraints and Architecture of Data Collection
Unique Data Generation and Quality Control
Team Collaboration and Data Integration
Current Platform Architecture and Tools
Model Constraints and Delivery in Game
Onboarding and Reducing Overhead
Innovative Uses of Data at Riot Games
Lessons Learned and Challenges Faced
Future Plans and Exciting Projects
Closing Thoughts and Contact Information