Summary
The financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael Watson and Robert Krzyzanowski share their experiences managing and leading the data engineering teams that power the business. They offer helpful insights into some of the challenges associated with working in a regulated industry, organizing teams to deliver value rapidly and reliably, and how they approach career development for data engineers. This was a great conversation for an inside look at how to build and maintain a data-driven culture.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Michael Watson and Robert Krzyzanowski about the technical and organizational challenges that they and their teams are working on at Citadel
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the size and structure of the data engineering teams at Citadel?
- How have the scope and nature of responsibilities for data engineers evolved over the past few years at Citadel as more and better tools and platforms have been made available in the space and machine learning techniques have grown more sophisticated?
- Can you describe the types of data that you are working with at Citadel?
- What is the process for identifying, evaluating, and ingesting new sources of data?
- What are some of the common core aspects of your data infrastructure?
- What are some of the ways that it differs across teams or projects?
- How involved are data engineers in the overall product design and delivery lifecycle?
- For someone who joins your team as a data engineer, what are some of the options available to them for a career path?
- What are some of the challenges that you are currently facing in managing the data lifecycle for projects at Citadel?
- What are some tools or practices that you are excited to try out?
Contact Info
- Michael
- @detroitcoder on Twitter
- detroitcoder on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Citadel
- Python
- Hedge Fund
- Quantitative Trading
- Citadel Securities
- Apache Airflow
- Jupyter Hub
- Alembic database migrations for SQLAlchemy
- Terraform
- DQM == Data Quality Management
- Great Expectations
- Nomad
- RStudio
- Active Directory
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council.
Go to dataengineeringpodcast.com/conferences
[00:01:24] Unknown:
to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Michael Watson and Rob Krzyzanowski about the technical and organizational challenges that they're facing at Citadel.
[00:01:38] Unknown:
So, Michael, can you start by introducing yourself? Yeah. No problem. So my name is Michael Watson. I'm the director of the data engineering organization here at Citadel and head of the enterprise data team. Been here for about 5 years and
[00:01:52] Unknown:
long time listener of the show, so really excited. And, Rob, how about you? Yeah. I'm Rob Krzyzanowski. I work directly with Michael. He's actually my boss. I lead 1 of the data engineering teams here at Citadel, and we work closely on both the tactical and strategic elements of the team. And so I drive a lot of the initial data initiatives at Citadel. And, Michael, do you remember how you first got involved in the area of data management?
[00:02:13] Unknown:
Sure. So, looking back over the last 10 years since I graduated undergrad, I feel like every step along the way has involved dealing with data in a research process 1 way or the other. I learned Python as an undergrad in my first introductory computer science course, and it's kinda just been full steam ahead ever since. And along the way, working at research companies, first in market research and now within the context of a hedge fund, data is the lifeblood of how we make our investment decisions. Managing the life cycle of that information, from getting raw data from a vendor or raw data from a web scrape, transforming that into a piece of information that can be consumed by an investment team, and then having that turn into an actual investment decision, is pretty much exactly what we do within the hedge fund. And so getting involved in the data management of that is almost inherent to how we just run our business.
[00:03:15] Unknown:
And, Rob, do you remember how you first got involved with data management? Yeah. So going back all the way through my academic route. Initially, I pursued a more academic career, specifically in pure mathematics, before switching over to industry. I had actually done both mathematics and computer science back in undergrad, so it was a very natural transition. Initially, I focused heavily on full stack web development before transitioning more to machine learning infrastructure and building machine learning models before joining Citadel. And then upon my transition here, I took over a lot of the more tactical initiatives in the data engineering space that we're working on here. And in terms of how that intersects with my background, I found it's been kinda great to be able to interface with both the analytic aspects and some of the engineering aspects.
[00:04:03] Unknown:
And so as you mentioned, Michael, 1 of the main aspects of working in a hedge fund is the fact that you have all of these different data sources that you need to be able to incorporate to ensure that you are making pertinent decisions given the portfolio that you're dealing with. So before we get too much into the specifics of your data engineering practice at Citadel, can you just give a bit more background about the role that data plays in the overall business of Citadel and hedge funds in general? Yeah. Totally.
[00:04:34] Unknown:
So to understand Citadel, it's good to know that at the top level, there are 2 different sides of the business. 1 is Citadel Securities, which is almost a quantitative trading firm that acts as a market maker, working in a lot of very interesting industries, but that's actually kind of a separate organization from the hedge fund itself. So if you Google Citadel, you might see Citadel Securities, you might see Citadel the hedge fund. Pretty much everything I'm talking about today is about Citadel the hedge fund. And in that context, it's something that's called a multi strategy hedge fund.
And when you work within Citadel, that has a big impact on how the organization is laid out. So each 1 of those strategies invests in a specific asset class and has engineers and technologists aligned to traders, portfolio managers, and quantitative researchers working on their bespoke use cases. So the different strategies are a fixed income business, a credit business, a commodities business that's dealing with energy products as well as agricultural products, a quantitative strategies business, and then a long/short equities business. And that's where I've spent most of my career.
And that's teams of portfolio managers that are constructing their own portfolios. So it's a bunch of different stocks in 1 sector. 1 sector could be technology stocks, where they're constructing a portfolio of, let's say, Apple, Google, IBM, Microsoft, and they have to construct it within a specific risk model, making sure they have the same number of long positions as short positions, so whether the market goes up or down, it stays relatively neutral. And then we have analysts that are reading the 10-Ks and 10-Qs, understanding the fundamentals of the companies that they're investing in very intimately.
And then we will pair them with incredibly strong data engineers, with strong computer science fundamentals, who are also strong communicators. They can understand why we might be looking at a specific dataset in the context of a company that we might be investing in, see around the corners of things that could change in the data, and build in safeguards and different pieces of an ETL pipeline that'll protect us from changes in the data source. And we have 5 different teams, for example, within the equities business, where we will have engineers sitting alongside investment teams, working through their different data use cases, alongside a role called a sector data analyst, to ingest data from the outside world, turn it into insights, and then hopefully convert that into a money making investment opportunity.
[00:07:21] Unknown:
And so you mentioned that there are these different groupings of engineers. I'm wondering how data engineers in particular fit into the overall life cycle of the decision making and incorporation of the data that is used to drive these different business decisions?
[00:07:44] Unknown:
Definitely. So right now, we have 5 different data engineering teams in kind of a hub and spoke model, where at the center there's Rob's team that does the core data engineering, where they're dealing with a lot of the infrastructure that we might be using, like our Airflow infrastructure, our JupyterHub deployments, as well as managing some of our event driven processes and DQM systems, creating tools for the other data engineers that are sitting on the trading floors with the investment teams. So that role has a little bit more of a software engineering tilt, but it's very business focused and is very business aligned with the investment teams.
And then the 3 other data engineering teams that are working with the investment teams in the equities business are a little bit more commercial. You can almost think of them as data engineering consultants, where they will work on 2 to 3 projects at a time that could last anywhere from 2 days to 2 months, each corresponding to an investment idea. So let's say somebody is trading some company that's buying and selling oranges in Georgia, and the weather patterns of Georgia might have a really big influence on the number of oranges that they might sell in a given quarter. The engineer might first find out where all the distribution centers and the retail locations of that orange distributor are, then align that with historical weather patterns in those locations, and create some sort of signal, like the number of days that it was sunny over a Friday, Saturday, and Sunday in a given quarter, and see how well that might line up to historical sales of the orange distributor.
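As an illustration of the weekend-weather signal Michael just described, here is a minimal pandas sketch. The dates, weather flags, and sales figures are all invented for illustration (a single location, for brevity); this is not anything Citadel actually uses:

```python
import pandas as pd

# Hypothetical inputs: daily weather observations and the company's
# historical quarterly sales. All values are invented.
weather = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=730, freq="D"),
    "was_sunny": (pd.Series(range(730)) % 3 == 0),
})
sales = pd.DataFrame({
    "quarter": pd.PeriodIndex(["2018Q1", "2018Q2", "2018Q3", "2018Q4"], freq="Q"),
    "revenue": [1.2e6, 1.5e6, 1.1e6, 1.4e6],
}).set_index("quarter")

# Signal: count of sunny Friday/Saturday/Sunday days in each quarter.
weekend = weather[weather["date"].dt.dayofweek.isin([4, 5, 6])]
signal = (
    weekend[weekend["was_sunny"]]
    .assign(quarter=lambda df: df["date"].dt.to_period("Q"))
    .groupby("quarter")
    .size()
    .rename("sunny_weekend_days")
)

# See how well the signal lines up with historical sales.
joined = sales.join(signal)
print(joined["revenue"].corr(joined["sunny_weekend_days"]))
```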
Again, this is just a hypothetical scenario where there's not actually an orange distributor we're modeling, but you can imagine that you could extrapolate this to other sectors of the economy. And those engineers understand, to an extent, the fundamentals of the company, but they're really focused on best practices when building out that pipeline for data: how it gets from the outside world, whether it's a web scrape or an outside vendor, is normalized and ingested internally, and then creating the interfaces that investment teams would then use to consume that in either Jupyter Notebooks or Excel or in our Python or C++ libraries. And then we have the fifth, which is the enterprise data engineering team, and they're working a little bit more with your traditional types of data. So your market data, your pricing data, data about the different types of securities you might be investing in that we're getting from myriad different vendors, and making sure that flows into all of our internal systems that would refer to that when they actually go to make a trade. And how has the overall
[00:10:30] Unknown:
nature of the responsibilities and the work that the different data engineering teams are doing evolved over the past few years at Citadel, as the tooling and capabilities have improved for being able to manage this data, as more sophisticated analysis techniques have become more mainstream in terms of machine learning and deep learning, and as the different requirements of data volumes and data quality have evolved and increased as a result?
[00:11:01] Unknown:
I would say the 1 thing that it took us a while to learn is you need an engineer, especially in the data space, sitting directly next to the end user of that dataset. Looking back maybe 3 or 4 years ago, we would have an investment team saying something like, I need to get location data about oil tankers. And they might then send that over the fence to some engineer that's maybe sitting on a different floor or even in a different office, and that engineer then has to guess why we might be using this data for modeling out oil tanker movements. They then transform that, throw it into a table, and throw that back to the investment team on the other side of the fence. And the team might say, this data structure in no way enables me to do the type of time series analysis that I would wanna do.
So over the course of the last 2 years, we've brought the investment teams and the engineers much closer together, where they're sitting on the same floor side by side, and you get a much stronger back and forth in terms of dialogue and idea generation when you have an engineer sitting directly next to an investment professional or a trader and have that free flow of ideas. So the engineer can see what's coming next, and they understand how they're using data. And some of the ideas are gonna start to come from the engineer as opposed to just from the end user, whether that's an analyst or somebody on the investment side.
[00:12:34] Unknown:
Yeah. I imagine that has impacted your overall hiring strategies as well, because of this strong correlation between the quality and capabilities of the teams as a unit, with the engineer embedded with the traders, versus some of the trends in engineering organizations where they're trying to push more for the ability to have remote engineers because of the communications technologies that we have now. Because you have different business offices, probably across the world, that are likely focusing on different companies or different business verticals, you then need to have engineers who are able to work closely with them on the particular data types that they need, whereas in a different office it might be a completely different set of projects that they're involved with. So I'm curious how that has manifested in terms of how you focus your hiring strategies, and the types of skill sets that you need to have within an office to be able to ensure that you have a well rounded capability?
[00:13:39] Unknown:
So 1 of the things that I think makes a technologist really successful at Citadel is that they have an innate interest in understanding financial markets, and they want to know how they can leverage their engineering skill sets to be able to understand something about the world that maybe no 1 else has figured out yet and get the validation of having that turn into a successful trading idea. And the engineers that have that innate driver, that it actually resonates with, are the ones that are going to be most successful. Putting somebody like that directly next to a trader, and giving those 2 an opportunity for new idea generation and new approaches to problems, even ones that have already been attempted and failed but where a new set of eyes can give a new perspective, has been incredibly successful for us. If you were to look back maybe 2 or 3 years ago, when we started trying to build out the data engineering org within Citadel, it was much more of a remote engineer model, where requirements were thrown over the fence.
The engineer would then try to understand how that would translate into an ETL pipeline or a type of analysis that they're guessing an investment team might ultimately wanna do. But because we never had the opportunity for them to sit side by side, the engineer was constantly guessing. And by bringing them in house and sitting them directly next to each other, we've seen an incredible amount of growth in the value the data engineering team was producing, and just new ideas coming out left and right. And so the engineers that have that innate interest in finance and in understanding financial markets, but also have really strong underlying software development skills, are the ones that we have found have moved the needle more than anyone else. I think that if somebody's just interested in technology for technology's sake, there's plenty of roles and opportunities within Citadel or many other firms to be successful. But specifically within the data engineering space, where we try to sit as close to the business as we can and understand how we're actually gonna use our data in the investment process, having engineers with that innate drive for understanding financial markets, and the commerciality to sit with a sometimes nontechnical business user, has been the single biggest evolution that we've gone through over the last couple years to end up in the organizational structure that we have today.
[00:16:29] Unknown:
And then in terms of the types of data that you're dealing with, you mentioned that you might be pulling from things like weather information over a certain period of time as it pertains to a business that you're looking at investing in, and then you might also be dealing with market data. So I'm curious if you can just talk through some of the categories of data that you're dealing with, and some of the process that goes into identifying which sources are valuable, then evaluating them for quality and potential bias, and then ultimately incorporating them into the overall flow of data that you're using to drive these different decisions?
[00:17:07] Unknown:
So I'll take that 1. I think 1 of the things that differentiates us, in terms of the challenges that we face around our different categories of datasets, how we evaluate them, and how we value a data product overall, is that we're operating in a space where we're looking at every sector of the economy. So that means sectors like energy, industrials, health care, financials, consumer facing services and products, technology, media, and telecommunications. And in order to effectively be able to operate across all those sectors, what we'll do is we will condense down to the critical thing that we're trying to predict, such as a top level line item for a particular public company.
And then we'll take different permutations of a dataset that may be applicable to that company. So we might take a mean, average, max, min, kinda different permutations of the dataset, and then run a correlation or a univariate regression to determine what is the best predictor for that particular company. And at the end of the day, this really ends up running into challenges where you're working maybe with a 200 terabyte dataset, or you're working with a dataset where there's a really high uptime guarantee. And what that translates to is a systematic framework that we've constructed to be able to do that in a structured way, but also apply it broadly across these different categories.
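To make the screening loop Rob outlines concrete, here is a compact sketch: aggregate a raw dataset into a few candidate permutations per quarter, then rank them by univariate correlation against the line item being predicted. The data is randomly generated and every name is hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
quarters = pd.period_range("2017Q1", periods=12, freq="Q")

# Hypothetical raw vendor data (daily observations) and the quarterly
# top level line item we're trying to predict.
raw = pd.DataFrame({
    "quarter": quarters.repeat(90),
    "value": rng.normal(size=12 * 90),
})
revenue = pd.Series(rng.normal(size=12), index=quarters, name="revenue")

# Different permutations (aggregations) of the dataset, per quarter.
features = raw.groupby("quarter")["value"].agg(["mean", "max", "min", "sum"])

# Univariate screen: rank each permutation by |correlation| with the target.
scores = features.apply(lambda col: col.corr(revenue)).abs()
print(scores.sort_values(ascending=False))  # best single predictor first
```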
[00:18:40] Unknown:
And 1 of the things that is always a challenge, particularly when you're dealing with a lot of different types of data, is just understanding what data you have, particularly when you have these multiple different teams that might be able to take advantage of a common dataset, or 1 team that has a bespoke dataset that they're dealing with. And so I'm curious what you have in terms of common infrastructure for being able to handle data cataloging and annotation on the data, for being able to understand what is the purpose of this data and what is that context that you've been able to capture, and then some of the other processing infrastructure that you have available to these different data teams, and when it's necessary for them to be able to spin up their own
[00:19:24] Unknown:
custom infrastructure for handling a special case that they're dealing with? Definitely. Yes. So the name data catalog is kinda near and dear to me, because I built 1 of the first data catalogs that we used within Citadel going back to 2016. And 1 of the challenges that we have within Citadel from a data engineering and data management perspective is the importance of secrecy and the importance of privacy. If 1 team is looking at a given dataset, they don't necessarily want anyone else in the organization to know what that dataset is, or even that they're looking at it at all, because that then gives up some information. And so 1 of the biggest challenges that we have actually is managing those permissions across the organization, so that, 1, you can make data discoverable and you don't have to reinvent the wheel every time, but, 2, you can respect the privacy and the permissions of somebody that was the first comer or the first mover on a given dataset. So it's always been a challenge we've had. But we do have a lot of our datasets cataloged in an internal system, with specific permissions around it that tie into internal permissioning, and somebody can search and then discover some of those datasets.
We also have a very large data management team that is sharing with internal stakeholders and internal teams what the new datasets coming onto the market are, and their job is to also disseminate that information internally. In terms of our shared tooling and infrastructure, we're heavy users of Airflow. Pretty much every 1 of our datasets corresponds to a given Airflow DAG in a monorepo, so that when a user wants to work on a new project, we have a library, it's called kickstart, that'll create a new directory within that monorepo.
It might create a new schema associated with that dataset. It'll create the raw templates of either a web scrape or an ETL system, and then that kinda kicks off. It also might connect to Alembic and run some of the DDL statements for creating the necessary schema. And then as they develop that pipeline, there's a dev branch that corresponds to our dev Airflow and our dev databases. Once that gets merged into master, it then gets promoted to the prod Airflow server with the prod jobs, and then updates the prod tables from the Alembic migration and goes from there.
So having that mapping of DAG to dataset to schema, and then to an entry in our internal data catalog, has been kind of a really powerful unifying factor for dealing with the thousands of different datasets that we deal with. But it's still something that we continue to work on and improve. Like, what are the corresponding DQM checks associated with each 1 of those new datasets? What are the different data access layers corresponding to each 1 of those datasets? Because we do have a pretty robust data access layer that sits on top of that. So once the data gets loaded and normalized into those final SQL tables, most everything we have ends up in SQL.
What are then the APIs, the queries, that sit on top of that, that we templatize, that allow somebody to go into Excel or Python or Jupyter or Tableau and extract the exact same view of that data in all the downstream systems that we do our analysis in? So getting, essentially, our ETL framework and the SQL schema tables working side by side, along with a cataloging framework and a data access framework that all point back to the common concept of a dataset, has been really helpful. I do think there's probably more, I take it back, there's absolutely more that we can do there to try to unify all of those. But it's something that we've been pretty successful in so far, and we just continue to push the button on what are the new integration points that we need to help get our time to analysis as short as possible.
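For readers unfamiliar with the one-DAG-per-dataset pattern Michael describes, a stripped-down sketch of what a scaffolded pipeline might look like follows. The internal kickstart templates are not public, so the task breakdown, names, and stubs here are purely illustrative, written against the modern Airflow 2 API:

```python
# Illustrative one-DAG-per-dataset layout; the dataset name, task names,
# and helper stubs are hypothetical, not Citadel's.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

DATASET = "orange_distributor_weather"  # maps to a schema + catalog entry


def extract(**context):
    """Pull raw data from the vendor feed or web scrape (stub)."""


def load(**context):
    """Normalize and load into the dataset's SQL schema (stub)."""


def dqm_checks(**context):
    """Run the data quality checks registered for this dataset (stub)."""


with DAG(
    dag_id=f"dataset__{DATASET}",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_dqm = PythonOperator(task_id="dqm_checks", python_callable=dqm_checks)

    t_extract >> t_load >> t_dqm
```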
[00:24:04] Unknown:
And then another challenge in this overall space, particularly because you have so many different teams and such a broad scope as far as the number of different offices and business areas that you're dealing with, is the overall aspect of managing the growth and maturity of the team, both in terms of hiring, which we discussed earlier, but also in terms of ensuring that engineers stay happy because they have some prospect for growth, whether that's in terms of the projects that they're involved with, or the responsibilities that they have, or some sort of promotional ladder that they have the option of climbing. So I'm curious how you handle career development and overall team management and cohesion, given the number of different teams and offices that you're dealing with and the size and scope of the business that you're working in.
[00:24:56] Unknown:
That's actually something that we've spent a lot of time thinking about over the last year, because we have an org right now where we have data engineers in London, Chicago, San Francisco, and Hong Kong that are working on a myriad of different types of problems, and aligning their skill sets as they progress through their career is really important for us. 1 thing that we're starting with this year is the creation of an entry level data engineer role that's gonna work on our enterprise data engineering team, where they'll work underneath a really strong software engineering manager as the team lead, and really focus on core software development skills within the ETL systems for our enterprise data. And that's the data feeding into our different reference systems about all the different investable instruments that we have within the firm, information about pricing data, a lot of your traditional market data. And as they really develop their software development skills within the data engineering environment, we want to give them exposure to our business data engineers that are working directly with investment teams, so that they can also develop their understanding of how we're using data in an investment environment and for a specific investment thesis.
So over the course of 1 to 2 years, they not only are at a point where they have really strong development skills and understand all of the tooling and infrastructure around our ETL systems, they also are starting to get an understanding of what that all means in the context of an investment strategy. So then, if they want to, after around 2 years, they can go in a direction where they're starting to work directly with an investment team on the trading floor. Alternatively, we have the core data engineering team that has a little bit more of a software engineering tilt and is responsible for a lot of the core infrastructure and tooling that we have around data engineering that the other business data engineers are using. So they manage things like our Airflow infrastructure, JupyterHub environments, a lot of our event driven ETL systems we run on Kafka, our DQM systems, and our data evaluation frameworks.
And if they wanna go a little bit deeper into a software development career path, they could go on to that team as well. But once you're already established mid career as a data engineer, we do have additional trajectories to go deeper on the individual contributor route, where you want to go from a data engineer into more of a data architect type of role. And there we have really strong data engineers that do more architecture and design, so that when there is a complex problem around a difficult Spark ETL system that requires a lot of tweaking of the JVM, they're the go to resource for that, or when there's a team that has a really high throughput of data going through Kafka, they'll be the go to resource that the business data engineers can lean on. And they're kind of seen as the wise data engineering expert to go to for some of these more bespoke problems.
On the flip side, we also have plenty of opportunities to grow as a data engineering manager, just because of the pace that we're growing within the org. There's constantly new teams developing. And I think 1 of the things that Citadel does really well is prepare technologists early on in their career for leadership opportunities. We have a really good internship program, where interns are constantly coming in throughout the year, and we pair these really strong college freshmen, sophomores, juniors, and people that are in grad school with some of our best early career engineers, and give the engineer within Citadel an opportunity to start leading and realize what it's like to mentor somebody that's more junior, and be there to answer questions and coach and teach and just be a nice person and help them along. Or even within our rotational program. So when somebody joins Citadel, we have this program where you work in a different team every 4 months throughout the course of the year, and we pair them in the rotational program with really strong engineers to start getting them more management experience there as well, and then start career pathing towards those management roles. So that covers everyone from really strong mid career architects in the data engineering space to new grads that need a lot more coaching.
We try to focus on creating opportunities for everyone across that space to continue to grow. And if we can do that in the context of the most important thing, and that is finding the best returns that we can in each 1 of the markets that we invest in, then we can create not only an incredibly successful hedge fund, which is always gonna be the first and foremost goal of why we're here, but we can also allow people to grow as professionals and as technologists. And finding that sweet spot where both of those are directly aligned is, I guess, part of my job in helping lead this team. I think that's something where we're doing a good job, and we need to continue to reevaluate how we keep doing great and do even better there. But that's something that I'm particularly proud of that we really try to focus on at Citadel Data Engineering.
[00:30:51] Unknown:
Another challenge, particularly because you have all these different datasets, and, as you noted, you sometimes don't want to alert different teams to which dataset you're working with, so it's not always easy to keep a global view of what datasets are available, is how you manage the life cycle of the data: from identifying it and incorporating it into your business decisions, to storing it and actually using it for the analysis, and then ultimately deciding when to either retire it or keep it updated. And so I'm curious how you approach that overall aspect given the strict regulatory environment that you're dealing with.
[00:31:30] Unknown:
1 of the things I wanna point out is that data engineering is only 1 part of our overall data strategy at Citadel. We work and partner with 2 incredible groups of people. The sector data analysts, who are working closely with the investment teams on their data strategy and how they want to extract information from these datasets and incorporate it into their investment strategy, are critical in helping to evaluate what datasets we wanna go forward with. And there's also the data strategies group, who are world class data scientists that are also helping us extract a lot of this information and figure out which of these datasets have legs. And so a lot of times the data engineers will get involved once we decide to go forward with a given data product.
And then it's no longer a question of what data do we want to ingest. It eventually becomes a question of what data do we want to turn off, to no longer support? Because there is overhead in continuing to maintain a given data product that has a corresponding pipeline and a set of DQM checks that will occasionally fail. A lot of the data that we're consuming is constantly changing and evolving. If it's coming from a web scrape, there could be a total refactor of the site, where they're using totally different CSS tags, or what was originally a static HTML page could have changed to an Angular or React application, where getting the information content out of it could change dramatically.
When that happens, it's gonna set off some alerts. Maybe 1 of our data engineering SREs might start taking a look at it, and that takes time. Right? So we wanna make sure that we're only reacting to data quality issues or failures in our ETL systems for data that's actually making an impact. And that's when our usage tracking systems come in really handy. So every time that somebody looks at a given data point from Excel or Tableau or R or Python or C++, we log that observation, so that we end up getting tens, maybe hundreds, of millions of different log entries a day that'll tie back to a given data asset.
And then at the end of the quarter, at the end of the year, we'll say, right now we're supporting 2,000 different data assets. How many of those has no 1 looked at over the course of the last 6 months? And then, of the stuff that people aren't looking at, let's turn it off. Let's stop supporting that system, so that when we know it's going to fail eventually, we don't have to have somebody spend their time trying to fix it. And so we've realized that in order to make this a sustainable process where we can continue to grow, knowing what to turn off is oftentimes more valuable than knowing what to turn on and go forward with. And that's something that we've really focused on over the last couple years.
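A minimal sketch of that usage-tracking idea, just to make the mechanics concrete. The table and column names are hypothetical, and the pruning query uses SQL Server-style date functions as an assumption:

```python
import datetime as dt


def log_access(conn, asset_id: str, client: str, user: str) -> None:
    """Record 1 observation of a data asset being read (Excel, Python, ...)."""
    conn.execute(
        "INSERT INTO data_asset_access_log (asset_id, client, username, accessed_at) "
        "VALUES (?, ?, ?, ?)",
        (asset_id, client, user, dt.datetime.utcnow()),
    )


# At review time: which supported assets has nobody touched in 6 months?
# Those are the candidates to turn off rather than keep maintaining.
STALE_ASSETS_SQL = """
SELECT a.asset_id
FROM data_assets a
LEFT JOIN data_asset_access_log l
       ON l.asset_id = a.asset_id
      AND l.accessed_at >= DATEADD(month, -6, GETDATE())
WHERE a.is_supported = 1
GROUP BY a.asset_id
HAVING COUNT(l.asset_id) = 0;
"""
```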
[00:34:55] Unknown:
And as you continue to evolve the capabilities and requirements of the data organization at Citadel, what are some of the challenges, whether technical or business oriented or team oriented, that you are facing and that you're interested in tackling in the coming weeks and months?
[00:35:16] Unknown:
Yes. 1 of the challenges that you get when you're an investment organization that invests in every industry imaginable, in every sector and every type of instrument, and that's been around for 28 years, is you have such a large sprawl of data that you've accumulated, and understanding the quality of that data across all of the different touch points is incredibly difficult. And that's something that we're looking at tackling: unifying our DQM frameworks across the different data sources that we are ingesting from, and being able to either, 1, sleep at night knowing that everything is as it should be, or, 2, at least be woken up by a dataset that you know is important, you know is wrong, and it's a system identifying it and not a person.
And if you could at least not have any unknowns in terms of your data, you might know that there's a lot of things wrong with it, but having no unknowns at least gives you a really good foundation for being able to address the different ETL systems that maybe were written 10 years ago that no one's looked at, but that some production process might be depending on. And so really tackling the DQM process and problem is, for me, 1 of my biggest goals in 2020.
[00:36:58] Unknown:
And are there any tools or practices or industry trends that you're keeping an eye on that you're excited to try and incorporate into your workflow?
[00:37:08] Unknown:
Yeah. Totally. We built 1 library here called Bong, and there are a lot of similarities to Great Expectations from very early on, when Great Expectations was just getting going. And I think that project's come a really long way. I think that 1 of the downsides of Great Expectations is that it requires somebody to write code for certain types of data unit tests, but for a lot of our engineers, that's a really strong framework. So that's something that I personally think is a great project I'd like to look at a little bit deeper.
And then there are also industrial scale DQM libraries. Some are specific to finance, some aren't, but we'll definitely be investigating more in that space. We also have a lot of really good internal tooling frameworks for how you can write DQM tests, but they're not necessarily deployed across the entire stack. So finding more unification across the different ETL systems that we have, using common tooling, even if it's not the best, so that we're using the same framework everywhere, would be a huge win.
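For context, the kind of data unit test Michael is referring to looks roughly like this in the classic pandas-backed Great Expectations API (current around the time of this episode); the DataFrame and thresholds are made up:

```python
import great_expectations as ge
import pandas as pd

# Hypothetical normalized output of a pipeline run.
df = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "GOOG"],
    "close": [310.5, 185.2, 1450.1],
})

# Each expectation is a data unit test declared in code.
gdf = ge.from_pandas(df)
gdf.expect_column_values_to_not_be_null("ticker")
gdf.expect_column_values_to_be_between("close", min_value=0, max_value=1e6)

result = gdf.validate()
print(result["success"])  # e.g. gate a DAG's promotion on this flag
```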
[00:38:10] Unknown:
And are there any other aspects of your work at Citadel, or the challenges that you're facing, or the ways that you're using data that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think 1 of the coolest things that I personally worked on over the last couple of years is how we have integrated
[00:38:27] Unknown:
Jupyter Notebooks with our deployment process for analytics. So we have an internal JupyterHub deployment that runs on HashiCorp Nomad, similar to Kube, but it's something that we adopted relatively early on, and we're relatively mature at this point. And then we, on top of that, started creating a lot of custom Jupyter plugins, where an analyst can come in and click a button. It'll then take the code out of Jupyter and store that within an internal Elasticsearch database. And then whenever a user references 1 of those specific functions that were originally in the notebook, in either Excel or Python, R, or C++,
a process that runs within HashiCorp Nomad will read that into memory using the imp module in Python and then execute it. Those functions all return pandas data frames, and those return back to the clients that originally requested them via a standardized API. And what that allows for is an analyst that doesn't necessarily know how to deploy a model or a specific data wrangling exercise to make it seamlessly accessible to a portfolio manager that maybe only knows Excel. That portfolio manager may have no idea how to do anything in Jupyter or Python, or maybe couldn't even tell you what CSV stands for. But if you can give them that information that exists from that analyst, they'll be able to leverage it in a way that maybe nobody else in the world can, because they might be an expert in the underlying economics of the company or market that they're investing in.
And creating an infrastructure that really pairs the power of analytics that you can get from a Jupyter Notebook in Python, or even RStudio, with some of the existing enterprise processes around research in either Excel or other frameworks, has been incredibly powerful for us. So we continue to push the button on how we can allow Jupyter to integrate into the research process, and we're continuing to look for new ways that we can do that going into 2020.
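The serving side of what Michael describes can be sketched in a few lines. This is not Citadel's implementation: the storage lookup is stubbed out where their Elasticsearch index would sit, and it uses a plain in-memory module in place of the `imp` machinery he mentions:

```python
import types

import pandas as pd


def fetch_source(function_name: str) -> str:
    """Stub for the lookup of notebook code saved to an internal store."""
    return (
        "import pandas as pd\n"
        "def daily_signal(ticker):\n"
        "    return pd.DataFrame({'ticker': [ticker], 'signal': [0.42]})\n"
    )


def run_notebook_function(function_name: str, *args) -> pd.DataFrame:
    # Load the stored source into a throwaway in-memory module, call the
    # requested function, and enforce the contract that it returns a
    # pandas DataFrame for the API layer to serialize back to clients.
    module = types.ModuleType(f"deployed_{function_name}")
    exec(fetch_source(function_name), module.__dict__)
    result = getattr(module, function_name)(*args)
    assert isinstance(result, pd.DataFrame)
    return result


print(run_notebook_function("daily_signal", "AAPL"))
```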
[00:40:53] Unknown:
Well, for anybody who wants to get in touch with either of you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. And I'll start with you, Michael.
[00:41:11] Unknown:
Sure. So I would say a unifying system that can connect the underlying tables, which may exist in some arbitrary format, to the concept of a dataset; link that to the concept of a set of DQM checks; link that to the underlying ETL systems that are processing it; and then ultimately link that to a set of downstream interfaces that are accessible to users, whether it be in Excel, Tableau, Looker, or any of your common downstream formats. And have all of that tied together in 1 concept of a dataset, and have that 1 concept of a dataset be permissionable using Active Directory, so that you could then deploy that within the enterprise and permission people to access different parts of that individual dataset throughout its entire lineage.
So that's something that doesn't exist today, but it would be a great benefit if it did.
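Since, as Michael says, no off-the-shelf system does this today, the following is purely a hypothetical sketch of what that unified, Active Directory-permissionable dataset concept might look like as a metadata record:

```python
from dataclasses import dataclass, field


@dataclass
class DatasetConcept:
    """1 record tying a dataset's lineage together (hypothetical)."""
    name: str
    tables: list[str]          # underlying physical tables
    etl_pipelines: list[str]   # e.g. the Airflow DAGs that produce them
    dqm_checks: list[str]      # registered data quality checks
    interfaces: list[str]      # Excel, Tableau, Looker, Python, ...
    ad_groups: dict[str, str] = field(default_factory=dict)  # stage -> AD group


def can_read(user_groups: set[str], ds: DatasetConcept, stage: str) -> bool:
    """Permission a user against 1 stage of the dataset's lineage."""
    return ds.ad_groups.get(stage) in user_groups
```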
[00:42:28] Unknown:
And, Rob, how about you? Do you have any particular gaps that you're feeling the pain of that you'd like to share? Yeah. I think 1 trajectory that data engineering has been headed towards as an industry is mirroring the revolution on the software engineering side of test driven development: starting with a core set of test cases and specifications that are really written as code and then iterating on those. You start with a couple of tests while they may be red, then you develop sort of a subsample prototype ETL pipeline, and you pass those tests, and you continue iterating. You kinda start with a small germ and blossom on top of that, as opposed to a lot of the standard data engineering processes in the industry as they exist today, where you have to keep all of that in your head. Like, you have to be able to lay out the raw pieces on metal.
And there's less of an interactivity and subsampling aspect to it that allows you to iterate as quickly and optimally as possible on the development life cycle of those datasets and really treat them as end to end products or applications that you're developing. So I think there's a lot of headway to be made there. We've made a lot of headway on that internally, and I think this is definitely something that I personally am hoping to continue seeing great developments on, both externally in the data engineering community and also internally here at Citadel.
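A toy version of the red/green loop Rob describes, with every name hypothetical: the specification is written first as a test over a small subsample, and the transform is grown until the test passes:

```python
import pandas as pd


def normalize_prices(raw: pd.DataFrame) -> pd.DataFrame:
    """The transform under development; grown iteratively to pass the test."""
    out = raw.dropna(subset=["price"]).copy()
    out["price"] = out["price"].astype(float)
    return out


def test_normalize_prices_on_subsample():
    # The spec comes first, written against a tiny subsample; initially the
    # test is red because normalize_prices doesn't exist yet.
    raw = pd.DataFrame({
        "ticker": ["AAPL", "MSFT", "GOOG"],
        "price": ["310.5", None, "1450.1"],
    })
    out = normalize_prices(raw)
    assert out["price"].notna().all()   # spec: no null prices survive
    assert out["price"].dtype == float  # spec: numeric, not strings
    assert len(out) == 2                # spec: bad rows dropped
```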
So that's something that I would love to have a follow-up conversation on, for anyone who wants to reach out to us about any of the problems that Michael has mentioned or that I've mentioned here. I would just encourage you to start that dialogue, because we have a lot of ideas for how to solve these problems. We just need to continue pairing the business facing data engineers with those that are interested more in this approach of starting with the core tools and increasing the leverage of each data engineer and each business user. And from there, I think we'll just be in a really good spot in 2022, 2025.
I really see a great future for data engineering, both here at Citadel and out there externally.
[00:44:35] Unknown:
Well, thank you both for taking the time today to join me and share the work that you're doing and some of the challenges and successes that you've had at Citadel. It's definitely an interesting problem space, and it's always great to hear about the ways that people are attacking the work that they've got. So thank you for all of your time and efforts on that front, and I hope you enjoy the rest of your day.
[00:44:56] Unknown:
Thanks, Tobias. Thanks, Tobias.
[00:45:03] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Meet the Guests: Michael Watson and Rob Krzyzanowski
Michael's Journey into Data Management
Rob's Journey into Data Management
Role of Data in Citadel's Business
Data Engineering Teams at Citadel
Evolution of Data Engineering at Citadel
Types of Data and Evaluation Process
Data Cataloging and Infrastructure
Career Development and Team Management
Data Lifecycle Management
Challenges and Future Goals
Tools and Industry Trends
Integration of Jupyter Notebooks
Biggest Gaps in Data Management Tools
Closing Remarks