Summary
The financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael Watson and Robert Krzyzanowski share their experiences managing and leading the data engineering teams that power the business. They offer helpful insights into some of the challenges associated with working in a regulated industry, organizing teams to deliver value rapidly and reliably, and how they approach career development for data engineers. This was a great conversation for an inside look at how to build and maintain a data-driven culture.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Michael Watson and Robert Krzyzanowski about the technical and organizational challenges that they and their teams are working on at Citadel
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the size and structure of the data engineering teams at Citadel?
- How have the scope and nature of responsibilities for data engineers evolved over the past few years at Citadel as more and better tools and platforms have been made available in the space and machine learning techniques have grown more sophisticated?
- Can you describe the types of data that you are working with at Citadel?
- What is the process for identifying, evaluating, and ingesting new sources of data?
- What are some of the common core aspects of your data infrastructure?
- What are some of the ways that it differs across teams or projects?
- How involved are data engineers in the overall product design and delivery lifecycle?
- For someone who joins your team as a data engineer, what are some of the options available to them for a career path?
- What are some of the challenges that you are currently facing in managing the data lifecycle for projects at Citadel?
- What are some tools or practices that you are excited to try out?
Contact Info
- Michael
- @detroitcoder on Twitter
- detroitcoder on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Citadel
- Python
- Hedge Fund
- Quantitative Trading
- Citadel Securities
- Apache Airflow
- Jupyter Hub
- Alembic database migrations for SQLAlchemy
- Terraform
- DQM == Data Quality Management
- Great Expectations
- Nomad
- RStudio
- Active Directory
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council.
Go to dataengineeringpodcast.com/conferences
[00:01:24] Unknown:
to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Michael Watson and Rob Krzyzanowski about the technical and organizational challenges that they're facing at Citadel.
[00:01:38] Unknown:
So, Michael, can you start by introducing yourself? Yeah. No problem. So my name is Michael Watson. I'm the director of the data engineering organization here at Citadel and head of the enterprise data team. Been here for about 5 years and
[00:01:52] Unknown:
long time listener of the show, so really excited. And, Rob, how about you? Yeah. I'm Rob Krzyzanowski. I work directly with Michael. He's actually my boss. I lead 1 of the data engineering teams here at Citadel, and we work closely on both the tactical and strategic elements of the team. And so I drive a lot of the initial data initiatives at Citadel. And, Michael, do you remember how you first got involved in the area of data management?
[00:02:13] Unknown:
Sure. So, looking back over the last 10 years since I graduated undergrad, I feel like every step along the way has involved dealing with data in a research process 1 way or the other. I learned Python as an undergrad in my first introductory computer science course, and it's kinda just been full steam ahead ever since. And along the way, working at research companies, first in market research and now within the context of a hedge fund, data is the lifeblood of how we make our investment decisions. Managing the life cycle of that information, from getting raw data from a vendor or raw data from a web scrape, transforming that into a piece of information that can be consumed by an investment team, and then having that turn into an actual investment decision, is pretty much exactly what we do within the hedge fund. And so getting involved in the data management of that is almost inherent to how we just run our business.
[00:03:15] Unknown:
And, Rob, do you remember how you first got involved with data management? Yeah. So going back all the way through my academic route. Initially, I pursued a more academic career, specifically in pure mathematics, before switching over to industry. I had actually done both mathematics and computer science back in undergrad, so it was a very natural transition. Initially, I focused heavily on full stack web development before transitioning more to machine learning infrastructure and building machine learning models before joining Citadel. And then upon my transition here, I took over a lot of the more tactical initiatives in the data engineering space that we're working on here. And in terms of how that intersects with my background, I found it's been kinda great to be able to interface with both the analytic aspects and some of the engineering aspects.
[00:04:03] Unknown:
And so as you mentioned, Michael, 1 of the main aspects of working in a hedge fund is the fact that you have all of these different data sources that you need to be able to incorporate to ensure that you are making pertinent decisions given the portfolio that you're dealing with. So before we get too much into the specifics of your data engineering practice at Citadel, can you just give a bit more background about the role that data plays in the overall business of Citadel and hedge funds in general? Yeah. Totally.
[00:04:34] Unknown:
So to understand Citadel, it's good to know that at the top level, there are 2 different sides of the business. 1 is Citadel Securities, which is almost a quantitative trading firm that acts as a market maker, working in a lot of very interesting industries, but that's actually kind of a separate organization from the hedge fund itself. So if you Google Citadel, you might see Citadel Securities, you might see Citadel the hedge fund. Pretty much everything I'm talking about today is about Citadel the hedge fund. And in that context, it's something that's called a multi strategy hedge fund.
And when you work within Citadel, that has a big impact on how the organization is laid out. So each 1 of those strategies invests in a specific asset class and has engineers and technologists aligned to traders, portfolio managers, and quantitative researchers working on their bespoke use cases. So the different strategies are a fixed income business, a credit business, a commodities business that's dealing with energy products as well as agricultural products, a quantitative strategies business, and then a long/short equities business. And that's where I've spent most of my career.
And that's teams of portfolio managers that are constructing their own portfolios. So it's a bunch of different stocks in 1 sector. 1 sector could be technology stocks, where they're constructing a portfolio of, let's say, Apple, Google, IBM, Microsoft, and they have to construct it within a specific risk model, making sure they have the same number of long positions as short positions, so whether the market goes up or down, it stays relatively neutral. And then we have analysts that are reading the 10-Ks and 10-Qs, understanding the fundamentals of the companies that they're investing in very intimately.
And then we will pair them with incredibly strong data engineers, with strong computer science fundamentals, who are also strong communicators. They can understand why we might be looking at a specific dataset in the context of a company that we might be investing in, see around the corners of things that could change in the data, and build in safeguards and different pieces of an ETL pipeline that'll protect us from changes in the data source. And we have 5 different teams, for example, within the equities business, where we will have engineers sitting alongside investment teams, working through their different data use cases, alongside a role called a sector data analyst, to ingest data from the outside world, turn it into insights, and then hopefully convert that into a money making investment opportunity.
[00:07:21] Unknown:
And so you mentioned that there are these different groupings of engineers. I'm wondering how data engineers in particular fit into the overall life cycle of the decision making and incorporation of the data that is used to drive these different business decisions?
[00:07:44] Unknown:
Definitely. So right now, we have 5 different data engineering teams in kind of a hub and spoke model, where at the center there's Rob's team that does the core data engineering, where they're dealing with a lot of the infrastructure that we might be using, like our Airflow infrastructure, our JupyterHub deployments, as well as managing some of our event driven processes and DQM systems, creating tools for the other data engineers that are sitting on the trading floors with the investment teams. So that role has a little bit more of a software engineering tilt, but it's very business focused and is very business aligned with the investment teams.
And then the 3 other data engineering teams that are working with the investment teams in the equities business are a little bit more commercial. You can almost think of them as data engineering consultants, where they will work on 2 to 3 projects at a time that could last anywhere from 2 days to 2 months, each corresponding to an investment idea. So let's say somebody is trading some company that's buying and selling oranges in Georgia, and the weather patterns of Georgia might have a really big influence on the number of oranges that they might sell in a given quarter. The engineer might first find out where all the distribution centers and the retail locations of that orange distributor are, then align that with historical weather patterns in those locations, and create some sort of signal, like the number of days that it was sunny over a Friday, Saturday, and Sunday in a given quarter, and see how well that might line up to historical sales of the orange distributor.
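As an illustration of the weekend-weather signal Michael just described, here is a minimal pandas sketch. The dates, weather flags, and sales figures are all invented for illustration (a single location, for brevity); this is not anything Citadel actually uses:

```python
import pandas as pd

# Hypothetical inputs: daily weather observations and the company's
# historical quarterly sales. All values are invented.
weather = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=730, freq="D"),
    "was_sunny": (pd.Series(range(730)) % 3 == 0),
})
sales = pd.DataFrame({
    "quarter": pd.PeriodIndex(["2018Q1", "2018Q2", "2018Q3", "2018Q4"], freq="Q"),
    "revenue": [1.2e6, 1.5e6, 1.1e6, 1.4e6],
}).set_index("quarter")

# Signal: count of sunny Friday/Saturday/Sunday days in each quarter.
weekend = weather[weather["date"].dt.dayofweek.isin([4, 5, 6])]
signal = (
    weekend[weekend["was_sunny"]]
    .assign(quarter=lambda df: df["date"].dt.to_period("Q"))
    .groupby("quarter")
    .size()
    .rename("sunny_weekend_days")
)

# See how well the signal lines up with historical sales.
joined = sales.join(signal)
print(joined["revenue"].corr(joined["sunny_weekend_days"]))
```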
Again, this is just a hypothetical scenario where there's not actually an orange distributor we're modeling, but you can imagine that you could extrapolate this to other sectors of the economy. And those engineers understand, to an extent, the fundamentals of the company, but they're really focused on best practices when building out that pipeline for data: how it gets from the outside world, whether it's a web scrape or an outside vendor, is normalized and ingested internally, and then creating the interfaces that investment teams would then use to consume that in either Jupyter Notebooks or Excel or in our Python or C++ libraries. And then we have the fifth, which is the enterprise data engineering team, and they're working a little bit more with your traditional types of data. So your market data, your pricing data, data about the different types of securities you might be investing in that we're getting from myriad different vendors, and making sure that flows into all of our internal systems that would refer to that when they actually go to make a trade. And how has the overall
[00:10:30] Unknown:
nature of the responsibilities and the work that the different data engineering teams are doing evolved over the past few years at Citadel, as the tooling and capabilities have improved for being able to manage this data, as more sophisticated analysis techniques have become more mainstream in terms of machine learning and deep learning, and as the different requirements of data volumes and data quality have evolved and increased as a result?
[00:11:01] Unknown:
I would say the 1 thing that it took us a while to learn is you need an engineer, especially in the data space, sitting directly next to the end user of that dataset. Looking back maybe 3 or 4 years ago, we would have an investment team saying something like, I need to get location data about oil tankers. And they might then send that over the fence to some engineer that's maybe sitting on a different floor or even in a different office, and that engineer then has to guess why we might be using this data for modeling out oil tanker movements. They then transform that, throw it into a table, and throw that back to the investment team on the other side of the fence. And the team might say, this data structure in no way enables me to do the type of time series analysis that I would wanna do.
So over the course of the last 2 years, we've brought the investment teams and the engineers much closer together, where they're sitting on the same floor side by side, and you get a much stronger back and forth in terms of dialogue and idea generation when you have an engineer sitting directly next to an investment professional or a trader and have that free flow of ideas. So the engineer can see what's coming next, and they understand how they're using data. And some of the ideas are gonna start to come from the engineer as opposed to just from the end user, whether that's an analyst or somebody on the investment side.
[00:12:34] Unknown:
Yeah. I imagine that has impacted your overall hiring strategies as well, because of this strong correlation between the quality and capabilities of the teams as a unit, with the engineer embedded with the traders, versus some of the trends in engineering organizations where they're trying to push more for the ability to have remote engineers because of the communications technologies that we have now. Because you have different business offices, probably across the world, that are likely focusing on different companies or different business verticals, you then need to have engineers who are able to work closely with them on the particular data types that they need, whereas in a different office it might be a completely different set of projects that they're involved with. So I'm curious how that has manifested in terms of how you focus your hiring strategies, and the types of skill sets that you need to have within an office to be able to ensure that you have a well rounded capability?
[00:13:39] Unknown:
So 1 of the things that I think makes a technologist really successful at Citadel is that they have an innate interest in understanding financial markets, and they want to know how they can leverage their engineering skill sets to be able to understand something about the world that maybe no 1 else has figured out yet and get the validation of having that turn into a successful trading idea. And the engineers that have that innate driver, that it actually resonates with, are the ones that are going to be most successful. Putting somebody like that directly next to a trader, and giving those 2 an opportunity for new idea generation and new approaches to problems, even ones that have already been attempted and failed but where a new set of eyes can give a new perspective, has been incredibly successful for us. If you were to look back maybe 2 or 3 years ago, when we started trying to build out the data engineering org within Citadel, it was much more of a remote engineer model, where requirements were thrown over the fence.
The engineer would then try to understand how that would translate into an ETL pipeline or a type of analysis that they're guessing an investment team might ultimately wanna do. But because we never had the opportunity for them to sit side by side, the engineer was constantly guessing. And by bringing them in house and sitting them directly next to each other, we've seen an incredible amount of growth in the value the data engineering team was producing, and just new ideas coming out left and right. And so the engineers that have that innate interest in finance and in understanding financial markets, but also have really strong underlying software development skills, are the ones that we have found have moved the needle more than anyone else. I think that if somebody's just interested in technology for technology's sake, there's plenty of roles and opportunities within Citadel or many other firms to be successful. But specifically within the data engineering space, where we try to sit as close to the business as we can and understand how we're actually gonna use our data in the investment process, having engineers with that innate drive for understanding financial markets, and the commerciality to sit with a sometimes nontechnical business user, has been the single biggest evolution that we've gone through over the last couple years to end up in the organizational structure that we have today.
[00:16:29] Unknown:
And then in terms of the types of data that you're dealing with, you mentioned that you might be pulling from things like weather information over a certain period of time as it pertains to a business that you're looking at investing in, and then you might also be dealing with market data. So I'm curious if you can just talk through some of the categories of data that you're dealing with, and some of the process that goes into identifying which sources are valuable, then evaluating them for quality and potential bias, and then ultimately incorporating them into the overall flow of data that you're using to drive these different decisions?
[00:17:07] Unknown:
So I'll take that 1. I think 1 of the things that differentiates us, in terms of the challenges that we face around our different categories of datasets, how we evaluate them, and how we value a data product overall, is that we're operating in a space where we're looking at every sector of the economy. So that means sectors like energy, industrials, health care, financials, consumer facing services and products, technology, media, and telecommunications. And in order to effectively be able to operate across all those sectors, what we'll do is we will condense down to the critical thing that we're trying to predict, such as a top level line item for a particular public company.
And then we'll take different permutations of a dataset that may be applicable to that company. So we might take a mean, average, max, min, kinda different permutations of the dataset, and then run a correlation or a univariate regression to determine what is the best predictor for that particular company. And at the end of the day, this really ends up running into challenges where you're working maybe with a 200 terabyte dataset, or you're working with a dataset where there's a really high uptime guarantee. And what that translates to is a systematic framework that we've constructed to be able to do that in a structured way, but also apply it broadly across these different categories.
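To make the screening loop Rob outlines concrete, here is a compact sketch: aggregate a raw dataset into a few candidate permutations per quarter, then rank them by univariate correlation against the line item being predicted. The data is randomly generated and every name is hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
quarters = pd.period_range("2017Q1", periods=12, freq="Q")

# Hypothetical raw vendor data (daily observations) and the quarterly
# top level line item we're trying to predict.
raw = pd.DataFrame({
    "quarter": quarters.repeat(90),
    "value": rng.normal(size=12 * 90),
})
revenue = pd.Series(rng.normal(size=12), index=quarters, name="revenue")

# Different permutations (aggregations) of the dataset, per quarter.
features = raw.groupby("quarter")["value"].agg(["mean", "max", "min", "sum"])

# Univariate screen: rank each permutation by |correlation| with the target.
scores = features.apply(lambda col: col.corr(revenue)).abs()
print(scores.sort_values(ascending=False))  # best single predictor first
```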
[00:18:40] Unknown:
And 1 of the things that is always a challenge, particularly when you're dealing with a lot of different types of data, is just understanding what data you have, particularly when you have these multiple different teams that might be able to take advantage of a common dataset, or 1 team that has a bespoke dataset that they're dealing with. And so I'm curious what you have in terms of common infrastructure for being able to handle data cataloging and annotation on the data, for being able to understand what is the purpose of this data and what is that context that you've been able to capture, and then some of the other processing infrastructure that you have available to these different data teams, and when it's necessary for them to be able to spin up their own
[00:19:24] Unknown:
custom infrastructure for handling a special case that they're dealing with? Definitely. Yes. So the name data catalog is kinda near and dear to me, because I built 1 of the first data catalogs that we used within Citadel going back to 2016. And 1 of the challenges that we have within Citadel from a data engineering and data management perspective is the importance of secrecy and the importance of privacy. If 1 team is looking at a given dataset, they don't necessarily want anyone else in the organization to know what that dataset is, or even that they're looking at it at all, because that then gives up some information. And so 1 of the biggest challenges that we have actually is managing those permissions across the organization, so that, 1, you can make data discoverable and you don't have to reinvent the wheel every time, but, 2, you can respect the privacy and the permissions of somebody that was the first comer or the first mover on a given dataset. So it's always been a challenge we've had. But we do have a lot of our datasets cataloged in an internal system, with specific permissions around it that tie into internal permissioning, and somebody can search and then discover some of those datasets.
We also have a very large data management team that is sharing with internal stakeholders and internal teams what the new datasets coming onto the market are, and their job is to also disseminate that information internally. In terms of our shared tooling and infrastructure, we're heavy users of Airflow. Pretty much every 1 of our datasets corresponds to a given Airflow DAG in a monorepo, so that when a user wants to work on a new project, we have a library, it's called kickstart, that'll create a new directory within that monorepo.
It might create a new schema associated with that dataset. It'll create the raw templates of either a web scrape or an ETL system, and then that kinda kicks off. It also might connect to Alembic and run some of the DDL statements for creating the necessary schema. And then as they develop that pipeline, there's a dev branch that corresponds to our dev Airflow and our dev databases. Once that gets merged into master, it then gets promoted to the prod Airflow server with the prod jobs, and then updates the prod tables from the Alembic migration and goes from there.
So having that mapping of DAG to dataset to schema, and then to an entry in our internal data catalog, has been kind of a really powerful unifying factor for dealing with the thousands of different datasets that we deal with. But it's still something that we continue to work on and improve. Like, what are the corresponding DQM checks associated with each 1 of those new datasets? What are the different data access layers corresponding to each 1 of those datasets? Because we do have a pretty robust data access layer that sits on top of that. So once the data gets loaded and normalized into those final SQL tables, most everything we have ends up in SQL.
What are then the APIs, the queries, that sit on top of that, that we templatize, that allow somebody to go into Excel or Python or Jupyter or Tableau and extract the exact same view of that data in all the downstream systems that we do our analysis in? So getting, essentially, our ETL framework and the SQL schema tables working side by side, along with a cataloging framework and a data access framework that all point back to the common concept of a dataset, has been really helpful. I do think there's probably more, I take it back, there's absolutely more that we can do there to try to unify all of those. But it's something that we've been pretty successful in so far, and we just continue to push the button on what are the new integration points that we need to help get our time to analysis as short as possible.
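For readers unfamiliar with the one-DAG-per-dataset pattern Michael describes, a stripped-down sketch of what a scaffolded pipeline might look like follows. The internal kickstart templates are not public, so the task breakdown, names, and stubs here are purely illustrative, written against the modern Airflow 2 API:

```python
# Illustrative one-DAG-per-dataset layout; the dataset name, task names,
# and helper stubs are hypothetical, not Citadel's.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

DATASET = "orange_distributor_weather"  # maps to a schema + catalog entry


def extract(**context):
    """Pull raw data from the vendor feed or web scrape (stub)."""


def load(**context):
    """Normalize and load into the dataset's SQL schema (stub)."""


def dqm_checks(**context):
    """Run the data quality checks registered for this dataset (stub)."""


with DAG(
    dag_id=f"dataset__{DATASET}",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_dqm = PythonOperator(task_id="dqm_checks", python_callable=dqm_checks)

    t_extract >> t_load >> t_dqm
```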
[00:24:04] Unknown:
And then another challenge in this overall space, particularly because you have so many different teams and such a broad scope as far as the number of different offices and business areas that you're dealing with, is the overall aspect of managing the growth and maturity of the team, both in terms of hiring, which we discussed earlier, but also in terms of ensuring that engineers stay happy because they have some prospect for growth, whether that's in terms of the projects that they're involved with, or the responsibilities that they have, or some sort of promotional ladder that they have the option of climbing. So I'm curious how you handle career development and overall team management and cohesion, given the number of different teams and offices that you're dealing with and the size and scope of the business that you're working in.
[00:24:56] Unknown:
That's actually something that we've spent a lot of time thinking about over the last year, because we have an org right now where we have data engineers in London, Chicago, San Francisco, and Hong Kong that are working on a myriad of different types of problems, and aligning their skill sets as they progress through their career is really important for us. 1 thing that we're starting with this year is the creation of an entry level data engineer role that's gonna work on our enterprise data engineering team, where they'll work underneath a really strong software engineering manager as the team lead, and really focus on core software development skills within the ETL systems for our enterprise data. And that's the data feeding into our different reference systems about all the different investable instruments that we have within the firm, information about pricing data, a lot of your traditional market data. And as they really develop their software development skills within the data engineering environment, we want to give them exposure to our business data engineers that are working directly with investment teams, so that they can also develop their understanding of how we're using data in an investment environment and for a specific investment thesis.
So over the course of 1 to 2 years, they not only are at a point where they have really strong development skills and understand all of the tooling and infrastructure around our ETL systems, they also are starting to get an understanding of what that all means in the context of an investment strategy. So then, if they want to, after around 2 years, they can go in a direction where they're starting to work directly with an investment team on the trading floor. Alternatively, we have the core data engineering team that has a little bit more of a software engineering tilt and is responsible for a lot of the core infrastructure and tooling that we have around data engineering that the other business data engineers are using. So they manage things like our Airflow infrastructure, JupyterHub environments, a lot of our event driven ETL systems we run on Kafka, our DQM systems, and our data evaluation frameworks.
And if they wanna go a little bit deeper into a software development career path, they could go on to that team as well. But once you're already established mid career as a data engineer, we do have additional trajectories to go deeper on the individual contributor route, where you want to go from a data engineer into more of a data architect type of role. And there we have really strong data engineers that do more architecture and design, so that when there is a complex problem around a difficult Spark ETL system that requires a lot of tweaking of the JVM, they're the go to resource for that, or when there's a team that has a really high throughput of data going through Kafka, they'll be the go to resource that the business data engineers can lean on. And they're kind of seen as the wise data engineering expert to go to for some of these more bespoke problems.
On the flip side, we also have plenty of opportunities to grow as a data engineering manager, just because of the pace that we're growing within the org. There's constantly new teams developing. And I think 1 of the things that Citadel does really well is prepare technologists early on in their career for leadership opportunities. We have a really good internship program, where interns are constantly coming in throughout the year, and we pair these really strong college freshmen, sophomores, juniors, and people that are in grad school with some of our best early career engineers, and give the engineer within Citadel an opportunity to start leading and realize what it's like to mentor somebody that's more junior, and be there to answer questions and coach and teach and just be a nice person and help them along. Or even within our rotational program. So when somebody joins Citadel, we have this program where you work in a different team every 4 months throughout the course of the year, and we pair them in the rotational program with really strong engineers to start getting them more management experience there as well, and then start career pathing towards those management roles. So that covers everyone from really strong mid career architects in the data engineering space to new grads that need a lot more coaching.
We try to focus on creating opportunities for everyone across that space to continue to grow. And if we can do that in the context of the most important thing, and that is finding the best returns that we can in each 1 of the markets that we invest in, then we can create not only an incredibly successful hedge fund, which is always gonna be the first and foremost goal of why we're here, but we can also allow people to grow as professionals and as technologists. And finding that sweet spot where both of those are directly aligned is, I guess, part of my job in helping lead this team. I think that's something where we're doing a good job, and we need to continue to reevaluate how we keep doing great and do even better there. But that's something that I'm particularly proud of that we really try to focus on at Citadel Data Engineering.
[00:30:51] Unknown:
Another challenge, particularly because you have all these different datasets, and, as you noted, you sometimes don't want to alert different teams to which dataset you're working with, so it's not always easy to keep a global view of what datasets are available, is how you manage the life cycle of the data: from identifying it and incorporating it into your business decisions, to storing it and actually using it for the analysis, and then ultimately deciding when to either retire it or keep it updated. And so I'm curious how you approach that overall aspect given the strict regulatory environment that you're dealing with.
[00:31:30] Unknown:
1 of the things I wanna point out is that data engineering is only 1 part of our overall data strategy at Citadel. We work and partner with 2 incredible groups of people. The sector data analysts, who are working closely with the investment teams on their data strategy and how they want to extract information from these datasets and incorporate it into their investment strategy, are critical in helping to evaluate what datasets we wanna go forward with. And there's also the data strategies group, who are world class data scientists that are also helping us extract a lot of this information and figure out which of these datasets have legs. And so a lot of times the data engineers will get involved once we decide to go forward with a given data product.
And then it's no longer a question of what data do we want to ingest. It eventually becomes a question of what data do we want to turn off, to no longer support? Because there is overhead in continuing to maintain a given data product that has a corresponding pipeline and a set of DQM checks that will occasionally fail. A lot of the data that we're consuming is constantly changing and evolving. If it's coming from a web scrape, there could be a total refactor of the site, where they're using totally different CSS tags, or what was originally a static HTML page could have changed to an Angular or React application, where getting the information content out of it could change dramatically.
When that happens, it's gonna set off some alerts. Maybe 1 of our data engineering SREs might start taking a look at it, and that takes time. Right? So we wanna make sure that we're only reacting to data quality issues or failures in our ETL systems for data that's actually making an impact. And that's when our usage tracking systems come in really handy. So every time that somebody looks at a given data point from Excel or Tableau or R or Python or C++, we log that observation, so that we end up getting tens, maybe hundreds, of millions of different log entries a day that'll tie back to a given data asset.
And then at the end of the quarter, at the end of the year, we'll say, right now we're supporting 2,000 different data assets. How many of those has no 1 looked at over the course of the last 6 months? And then, of the stuff that people aren't looking at, let's turn it off. Let's stop supporting that system, so that when we know it's going to fail eventually, we don't have to have somebody spend their time trying to fix it. And so we've realized that in order to make this a sustainable process where we can continue to grow, knowing what to turn off is oftentimes more valuable than knowing what to turn on and go forward with. And that's something that we've really focused on over the last couple years.
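A minimal sketch of that usage-tracking idea, just to make the mechanics concrete. The table and column names are hypothetical, and the pruning query uses SQL Server-style date functions as an assumption:

```python
import datetime as dt


def log_access(conn, asset_id: str, client: str, user: str) -> None:
    """Record 1 observation of a data asset being read (Excel, Python, ...)."""
    conn.execute(
        "INSERT INTO data_asset_access_log (asset_id, client, username, accessed_at) "
        "VALUES (?, ?, ?, ?)",
        (asset_id, client, user, dt.datetime.utcnow()),
    )


# At review time: which supported assets has nobody touched in 6 months?
# Those are the candidates to turn off rather than keep maintaining.
STALE_ASSETS_SQL = """
SELECT a.asset_id
FROM data_assets a
LEFT JOIN data_asset_access_log l
       ON l.asset_id = a.asset_id
      AND l.accessed_at >= DATEADD(month, -6, GETDATE())
WHERE a.is_supported = 1
GROUP BY a.asset_id
HAVING COUNT(l.asset_id) = 0;
"""
```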
[00:34:55] Unknown:
And as you continue to evolve the capabilities and requirements of the data organization at Citadel, what are some of the challenges, whether technical or business oriented or team oriented, that you are facing and that you're interested in tackling in the coming weeks and months?
[00:35:16] Unknown:
Yes. 1 of the challenges that you get when you're an investment organization that invests in every industry imaginable, in every sector and every type of instrument, and that's been around for 28 years, is you have such a large sprawl of data that you've accumulated, and understanding the quality of that data across all of the different touch points is incredibly difficult. And that's something that we're looking at tackling: unifying our DQM frameworks across the different data sources that we are ingesting from, and being able to either, 1, sleep at night knowing that everything is as it should be, or, 2, at least be woken up by a dataset that you know is important, you know is wrong, and it's a system identifying it and not a person.
And if you could at least not have any unknowns in terms of your data, you might know that there's a lot of things wrong with it, but having no unknowns at least gives you a really good foundation for being able to address the different ETL systems that maybe were written 10 years ago that no one's looked at, but that some production process might be depending on. And so really tackling the DQM process and problem is, for me, 1 of my biggest goals in 2020.
[00:36:58] Unknown:
And are there any tools or practices or industry trends that you're keeping an eye on that you're excited to try and incorporate into your workflow?
[00:37:08] Unknown:
Yeah. Totally. We built 1 library here called Bong, and there are a lot of similarities to Great Expectations from very early on, when Great Expectations was just getting going. And I think that project's come a really long way. I think that 1 of the downsides of Great Expectations is that it requires somebody to write code for certain types of data unit tests, but for a lot of our engineers, that's a really strong framework. So that's something that I personally think is a great project I'd like to look at a little bit deeper.
And then there are also industrial scale DQM libraries. Some are specific to finance, some aren't, but we'll definitely be investigating more in that space. We also have a lot of really good internal tooling frameworks for how you can write DQM tests, but they're not necessarily deployed across the entire stack. So finding more unification across the different ETL systems that we have, using common tooling, even if it's not the best, so that we're using the same framework everywhere, would be a huge win.
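For context, the kind of data unit test Michael is referring to looks roughly like this in the classic pandas-backed Great Expectations API (current around the time of this episode); the DataFrame and thresholds are made up:

```python
import great_expectations as ge
import pandas as pd

# Hypothetical normalized output of a pipeline run.
df = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "GOOG"],
    "close": [310.5, 185.2, 1450.1],
})

# Each expectation is a data unit test declared in code.
gdf = ge.from_pandas(df)
gdf.expect_column_values_to_not_be_null("ticker")
gdf.expect_column_values_to_be_between("close", min_value=0, max_value=1e6)

result = gdf.validate()
print(result["success"])  # e.g. gate a DAG's promotion on this flag
```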
[00:38:10] Unknown:
And are there any other aspects of your work at Citadel, or the challenges that you're facing, or the ways that you're using data that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think 1 of the coolest things that I personally worked on over the last couple of years is how we have integrated
[00:38:27] Unknown:
Jupyter Notebooks with our deployment process for analytics. So we have an internal JupyterHub deployment that runs on HashiCorp Nomad, similar to Kube, but it's something that we adopted relatively early on, and we're relatively mature at this point. And then we, on top of that, started creating a lot of custom Jupyter plugins, where an analyst can come in and click a button. It'll then take the code out of Jupyter and store that within an internal Elasticsearch database. And then whenever a user references 1 of those specific functions that were originally in the notebook, in either Excel or Python, R, or C++,
a process that runs within HashiCorp Nomad will read that into memory using the imp module in Python and then execute it. Those functions all return pandas data frames, and those return back to the clients that originally requested them via a standardized API. And what that allows for is an analyst that doesn't necessarily know how to deploy a model or a specific data wrangling exercise to make it seamlessly accessible to a portfolio manager that maybe only knows Excel. That portfolio manager may have no idea how to do anything in Jupyter or Python, or maybe couldn't even tell you what CSV stands for. But if you can give them that information that exists from that analyst, they'll be able to leverage it in a way that maybe nobody else in the world can, because they might be an expert in the underlying economics of the company or market that they're investing in.
And creating an infrastructure that really pairs the power of analytics that you can get from a Jupyter Notebook in Python, or even RStudio, with some of the existing enterprise processes around research in either Excel or other frameworks, has been incredibly powerful for us. So we continue to push the button on how we can allow Jupyter to integrate into the research process, and we're continuing to look for new ways that we can do that going into 2020.
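The serving side of what Michael describes can be sketched in a few lines. This is not Citadel's implementation: the storage lookup is stubbed out where their Elasticsearch index would sit, and it uses a plain in-memory module in place of the `imp` machinery he mentions:

```python
import types

import pandas as pd


def fetch_source(function_name: str) -> str:
    """Stub for the lookup of notebook code saved to an internal store."""
    return (
        "import pandas as pd\n"
        "def daily_signal(ticker):\n"
        "    return pd.DataFrame({'ticker': [ticker], 'signal': [0.42]})\n"
    )


def run_notebook_function(function_name: str, *args) -> pd.DataFrame:
    # Load the stored source into a throwaway in-memory module, call the
    # requested function, and enforce the contract that it returns a
    # pandas DataFrame for the API layer to serialize back to clients.
    module = types.ModuleType(f"deployed_{function_name}")
    exec(fetch_source(function_name), module.__dict__)
    result = getattr(module, function_name)(*args)
    assert isinstance(result, pd.DataFrame)
    return result


print(run_notebook_function("daily_signal", "AAPL"))
```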
[00:40:53] Unknown:
Well, for anybody who wants to get in touch with either of you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. And I'll start with you, Michael.
[00:41:11] Unknown:
Sure. So I would say a unifying system that can connect the underlying tables, which may exist in some arbitrary format, to the concept of a dataset; link that to the concept of a set of DQM checks; link that to the underlying ETL systems that are processing it; and then ultimately link that to a set of downstream interfaces that are accessible to users, whether it be in Excel, Tableau, Looker, or any of your common downstream formats. And have all of that tied together in 1 concept of a dataset, and have that 1 concept of a dataset be permissionable using Active Directory, so that you could then deploy that within the enterprise and permission people to access different parts of that individual dataset throughout its entire lineage.
So that's something that doesn't exist today, but it would be a great benefit if it did.
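Since, as Michael says, no off-the-shelf system does this today, the following is purely a hypothetical sketch of what that unified, Active Directory-permissionable dataset concept might look like as a metadata record:

```python
from dataclasses import dataclass, field


@dataclass
class DatasetConcept:
    """1 record tying a dataset's lineage together (hypothetical)."""
    name: str
    tables: list[str]          # underlying physical tables
    etl_pipelines: list[str]   # e.g. the Airflow DAGs that produce them
    dqm_checks: list[str]      # registered data quality checks
    interfaces: list[str]      # Excel, Tableau, Looker, Python, ...
    ad_groups: dict[str, str] = field(default_factory=dict)  # stage -> AD group


def can_read(user_groups: set[str], ds: DatasetConcept, stage: str) -> bool:
    """Permission a user against 1 stage of the dataset's lineage."""
    return ds.ad_groups.get(stage) in user_groups
```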
[00:42:28] Unknown:
And, Rob, how about you? Do you have any particular gaps that you're feeling the pain of that you'd like to share? Yeah. I think 1 trajectory that data engineering has been headed towards as an industry is mirroring the revolution on the software engineering side of test driven development: starting with a core set of test cases and specifications that are really written as code and then iterating on those. You start with a couple of tests while they may be red, then you develop sort of a subsample prototype ETL pipeline, and you pass those tests, and you continue iterating. You kinda start with a small germ and blossom on top of that, as opposed to a lot of the standard data engineering processes in the industry as they exist today, where you have to keep all of that in your head. Like, you have to be able to lay out the raw pieces on metal.
And there's less of an interactivity and subsampling aspect to it that allows you to iterate as quickly and optimally as possible on the development life cycle of those datasets and really treat them as end to end products or applications that you're developing. So I think there's a lot of headway to be made there. We've made a lot of headway on that internally, and I think this is definitely something that I personally am hoping to continue seeing great developments on, both externally in the data engineering community and also internally here at Citadel.
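A toy version of the red/green loop Rob describes, with every name hypothetical: the specification is written first as a test over a small subsample, and the transform is grown until the test passes:

```python
import pandas as pd


def normalize_prices(raw: pd.DataFrame) -> pd.DataFrame:
    """The transform under development; grown iteratively to pass the test."""
    out = raw.dropna(subset=["price"]).copy()
    out["price"] = out["price"].astype(float)
    return out


def test_normalize_prices_on_subsample():
    # The spec comes first, written against a tiny subsample; initially the
    # test is red because normalize_prices doesn't exist yet.
    raw = pd.DataFrame({
        "ticker": ["AAPL", "MSFT", "GOOG"],
        "price": ["310.5", None, "1450.1"],
    })
    out = normalize_prices(raw)
    assert out["price"].notna().all()   # spec: no null prices survive
    assert out["price"].dtype == float  # spec: numeric, not strings
    assert len(out) == 2                # spec: bad rows dropped
```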
So that's something that I would love to have a follow-up conversation on, for anyone who wants to reach out to us about any of the problems that Michael has mentioned or that I've mentioned here. I would just encourage you to start that dialogue, because we have a lot of ideas for how to solve these problems. We just need to continue pairing the business facing data engineers with those that are interested more in this approach of starting with the core tools and increasing the leverage of each data engineer and each business user. And from there, I think we'll just be in a really good spot in 2022, 2025.
I really see a great future for data engineering, both here at Citadel and out there externally.
[00:44:35] Unknown:
Well, thank you both for taking the time today to join me and share the work that you're doing and some of the challenges and successes that you've had at Citadel. It's definitely an interesting problem space, and it's always great to hear about the ways that people are attacking the work that they've got. So thank you for all of your time and efforts on that front, and I hope you enjoy the rest of your day.
[00:44:56] Unknown:
Thanks, Tobias. Thanks, Tobias.
[00:45:03] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Meet the Guests: Michael Watson and Rob Krzyzanowski
Michael's Journey into Data Management
Rob's Journey into Data Management
Role of Data in Citadel's Business
Data Engineering Teams at Citadel
Evolution of Data Engineering at Citadel
Types of Data and Evaluation Process
Data Cataloging and Infrastructure
Career Development and Team Management
Data Lifecycle Management
Challenges and Future Goals
Tools and Industry Trends
Integration of Jupyter Notebooks
Biggest Gaps in Data Management Tools
Closing Remarks