Summary
There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph from public data that you can use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph to serve to their customers. He discusses the challenges they are facing in scaling the platform and their engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?
- How do you define the concept of a knowledge graph?
- What are the processes involved in constructing a knowledge graph?
- Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?
- What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?
- How do you manage the software lifecycle for your ETL code?
- What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?
- What are the current challenges that you are facing in building and scaling your data infrastructure?
- How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?
- What techniques are you using to manage accuracy and consistency in the data that you ingest?
- Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?
- What are the weak spots in your platform that you are planning to address in upcoming projects?
- If you were to start from scratch today, what would you have done differently?
- What are some of the most interesting or unexpected uses of your product that you have seen?
- What is in store for the future of Enigma?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Enigma
- Chicago Tribune
- NPR
- Quartz
- CSVKit
- Agate
- Knowledge Graph
- Taxonomy
- Concourse
- Airflow
- Docker
- S3
- Data Lake
- Parquet
- Spark
- AWS Neptune
- AWS Batch
- Money Laundering
- Jupyter Notebook
- Papermill
- Jupytext
- Cauldron: The Un-Notebook
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning life cycle. Skafos maximizes interoperability with your existing tools and platforms and offers real-time insights and the ability to be up and running with cloud-based production-scale infrastructure instantaneously.
Request a demo today at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat. Your host is Tobias Macey. And today, I'm interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph as a service. So, Chris, could you start by introducing yourself?
[00:01:33] Unknown:
Yeah. I'm Christopher Groskopf, Chris, and I am the technical lead on the data engineering team at Enigma, a company that uses public data to build knowledge graphs.
[00:01:44] Unknown:
And how did you first get involved in the area of data management? So my background's a little unconventional.
[00:01:49] Unknown:
I spent most of the last decade working in journalism, as a data journalist. I worked at a number of publications, including the Chicago Tribune, NPR, and Quartz. In those, I occupied a variety of roles, all sort of circling around data. So I was a news applications developer, I was a reporter, and for a brief time, I also worked on a grant building a data warehousing tool for journalists. So I've had a number of roles using data in newsrooms and also building open source to be used in newsrooms. So I've built a toolkit called csvkit that has been pretty widely adopted and a data processing library called Agate. And so really have circled around data in a variety of different ways.
And then more recently have moved to Enigma where we're sort of applying all those lessons I learned working with public data and news to much larger challenges.
[00:02:48] Unknown:
And so can you give a quick overview of the problems that Enigma was built to solve and some of the motivation for starting the company, if you have that context?
[00:02:59] Unknown:
Sure. So Enigma's sort of key goal is we're trying to connect public data and make it useful intelligence for businesses, I mean, and any kind of user, really. So the original premise of Enigma was, like, what if we had Google for structured data? You know, Google has been this tremendous, game-changing tool for allowing us to discover the Internet, the unstructured data of the Internet. But there's not a corollary. There's not a similar tool for working with structured data, data that might be hiding in databases or different file formats.
And so Enigma sort of started out around that hypothesis and has gone through sort of a variety of iterations of trying to figure out what the shape of the problem is. So we've done work around things like anti-money laundering, pharmaceutical efficacy, a lot of different kinds of problems working with large companies and all sort of circling around this question of how public data gets applied to solve real world problems. And now we've raised our Series C, and we're reinvesting in sort of what we think is the right way to do this.
[00:04:13] Unknown:
And so one of the primary sort of concerns at Enigma is this idea of the knowledge graph. So can you give a quick definition of how you define a knowledge graph and maybe some of the broad use cases that it enables?
[00:04:33] Unknown:
Yeah. So at its most simplistic level, a knowledge graph is really just a way of structuring data about the world as a graph. So take facts about the world, and rather than putting them in rigidly schematized database tables, we structure them into entities and relationships in a graph, which can be traversed as a graph and gets all your sort of computer sciency graph capabilities. But at a sort of more abstract level, the way that we think about graphs is that graphs are a way of connecting information which shares some ontological meaning, shares some semantic meaning, but doesn't share schema. So we construct graphs from a huge variety of sources that were never intended to be used together. So dataset A and dataset B were produced by different, let's just say, federal agencies or different counties in nonstandard formats, and we want to somehow mesh those together into a common query layer and a common representation.
So by mapping those into a common ontology and saying column X in this dataset means the same thing as column Y in this dataset, we can construct these knowledge graphs where that information which was disconnected is now connected and queryable. It has a lot of power for building these datasets that cross what are traditionally siloed sets of information. So I don't think we quite have the, like, one-sentence explanation of what we think of as a knowledge graph. Knowledge graph technology has been around for a long time, and I think a lot of companies have sort of had their spin on it. It's really the category of problem we're trying to solve with the knowledge graph which I think is the most interesting part.
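As a rough illustration of the column-mapping idea described above (not Enigma's actual code), here is a minimal sketch in pandas; the source datasets, column names, and ontology attributes are hypothetical:

```python
# A minimal sketch of ontology mapping: two hypothetical public datasets that
# describe the same kind of entity with different column names are renamed
# into a shared vocabulary so they can be queried together.
import pandas as pd

# Hypothetical source datasets with incompatible schemas.
sec_filings = pd.DataFrame(
    {"registrant_name": ["Acme Holdings"], "registrant_state": ["NY"]}
)
state_registrations = pd.DataFrame(
    {"entity_nm": ["ACME HOLDINGS INC"], "jurisdiction": ["NY"]}
)

# Per-source mappings into a common ontology ("company_name", "state").
ONTOLOGY_MAPPINGS = {
    "sec_filings": {"registrant_name": "company_name", "registrant_state": "state"},
    "state_registrations": {"entity_nm": "company_name", "jurisdiction": "state"},
}

def to_ontology(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Rename source columns to ontology attributes and tag provenance."""
    mapped = df.rename(columns=ONTOLOGY_MAPPINGS[source])[["company_name", "state"]].copy()
    mapped["source"] = source
    return mapped

combined = pd.concat(
    [to_ontology(sec_filings, "sec_filings"),
     to_ontology(state_registrations, "state_registrations")],
    ignore_index=True,
)
print(combined)
```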
[00:06:20] Unknown:
And one of the challenging aspects in any data project, but particularly for something of the scope and ambition of what you're trying to do at Enigma, is establishing and adhering to a taxonomy, because that will largely define the capabilities that are possible based on the data that you're using and the way that you're structuring it. So how is that established, and how has that evolved over the time that you've been at Enigma?
[00:06:52] Unknown:
Yeah. So this is still a relatively novel approach that we're taking within Enigma to sort of build knowledge graphs that can solve many problems. Traditionally, Enigma has approached problems like this as one-offs, and we sort of learn from each of those prototypes the general patterns. And now we're sort of taking those general patterns, trying to build a more generalizable, more scalable implementation that will allow us to solve a lot of similar problems in the same way. So taxonomy becomes increasingly important in this new model. And I'm not gonna say that we have quite figured it out yet, but I think the thing that we know at this point is that we do have to derive the taxonomy we wanna use from the cases that we actually wanna solve. So we're sort of not trying to do the one true taxonomy that's gonna apply in every possible domain.
We're looking at a subset of use cases that we think have commercial value, have a lot of utility, and we're focusing our taxonomy on those and sort of iterating it the way you would iterate software to fine tune it for those use cases. And then, you know, we may build a few parallel graphs while we're sort of working out what the right models are. I think we'd love to end up on a model where we have a single internal graph that we expose views on for different sort of client uses, but we're really trying to take that laser-focused approach that the only way we can know what the right ontology definition is is to actually use it out there for real use cases and see how well it fits and what we might need to change around.
[00:08:33] Unknown:
And given the fact that you are pulling all of this information and extracting these entity representations from various public data sources, I imagine that there's a lot of variability in the quality and consistency of the data that you're using and your ability to populate all of the different attributes of these taxonomies for these entities to be able to expose them. So I'm curious what are some of the processes that you use in constructing the knowledge graph itself and some of the strategies that you use to ensure that you are able to
[00:09:11] Unknown:
achieve a certain sort of critical mass of attributes for any given entity? Yeah. That's a great question. So we think that the value of the knowledge graph is that it provides a way of sort of building up these entities from many component parts. That's really part of the value we think we can offer. You're right that no particular dataset has every attribute that we care about, nor does any particular dataset have necessarily the quality threshold we want in the final data. And, of course, we can talk at length about the problems that the original data might come with because it's public data, in terms of fields that are invalid or mixed data types or all those kinds of issues we also have to deal with. But at the level of constructing the graph, we use machine learning to entity resolve all of these disparate datasets into our common ontology domain. So it's not required that any particular dataset has any particular attribute. Right? We have datasets, for instance, from, say, the SEC and maybe datasets of corporate registrations from each state.
And with those, they'll have different ways of referring to the same company. Let's just say Enigma. They may refer to Enigma by a DUNS number in one dataset, by a text string name in another dataset, and by address in another one. They might have slightly different addresses referring to the same company. Maybe one is a CEO's address and one is a street address for the business front. We can take all of those, put those through our entity resolution algorithm, and come out with an entity which has sort of the summation of all of that. And in fact, in some cases, it has sort of derived properties that are actually better than any one source can provide. Right? So we may be able to take different numbers for a certain attribute of a company from a variety of places and have some business logic that says this number is most reliable in these cases, this number is most reliable in another set of cases.
And that allows us to really construct an entity at the end, you know, within our knowledge graph that is superior to what you could get from any one dataset. And that's really where we think the value lies.
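A toy sketch of the attribute-selection step described above, assuming a hypothetical source-priority table; Enigma's real pipeline does this with machine learning in Spark, so this only illustrates the business-logic idea of preferring the most reliable source per attribute:

```python
# Once records from several sources have been resolved to the same entity,
# pick each attribute from the most trusted source that actually has a value.
# Source names, priorities, and sample values here are purely hypothetical.
SOURCE_PRIORITY = {"purchased_reference": 0, "sec_filings": 1, "state_registrations": 2}

def merge_entity(records: list[dict]) -> dict:
    """Combine resolved records into one entity, preferring reliable sources."""
    entity: dict = {}
    for record in sorted(records, key=lambda r: SOURCE_PRIORITY[r["source"]]):
        for attribute, value in record.items():
            if attribute != "source" and value is not None and attribute not in entity:
                entity[attribute] = value
    return entity

resolved_records = [
    {"source": "state_registrations", "company_name": "ACME HOLDINGS INC",
     "address": "123 Example St", "employee_count": None},
    {"source": "sec_filings", "company_name": "Acme Holdings",
     "address": None, "employee_count": 42},
]
print(merge_entity(resolved_records))
```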
[00:11:24] Unknown:
And so can you give an overview of the architecture that you're using as the data platform and the systems that you're using for being able to collect and store and serve the knowledge graph? Sure. So,
[00:11:40] Unknown:
you know, we think about the linking platform as a holistic project that extends all the way from the moment we acquire the data from a source, which, you know, might be a website or something like that, all the way to the API that serves to the client the results of the graph. So starting at the very beginning of that process, which is the part that my team, the data engineering squad, owns, we have an in-house data platform, or really a data workflow platform, that we call Concourse, which is built on a combination of Airflow, Docker, and a handful of other technologies. And basically, the promise of that platform is that we write workflows as Python scripts, and they then sort of compile to a dockerized image and an Airflow DAG that's able to run that image. So we have a thin layer of custom code that runs as a plugin in Airflow that allows us to actually implement that.
But the sort of TLDR version is, unlike regular Airflow, where the sort of default use case is that something runs as pure Python, and then you have to sort of do something special to make it run a Docker container, ours is exactly the opposite. Everything that runs on Concourse is a Docker container, and that allows us to sort of add an additional layer of abstraction. So all of our workflows have dependency isolation to a large degree. They can even have C dependencies if we need, like, OCR. We can have Tesseract installed in that image. But they benefit from all the traditional value of Airflow. So they get orchestration, scheduling, and all of these other things that Airflow does for us. There's also a variety of other tools, sort of use case specific Python libraries that we use in that step to implement different parts of the process. We have utility libraries that do common sort of ETL or data acquisition tasks, and then we have libraries that implement features that are specific to different projects, such as ingestion for the linking platform. That's the core key sort of piece that we use for data acquisition and ingestion, that Concourse platform. It runs out in AWS, currently runs on EC2, although we're looking at the possibility of migrating that to something like ECS or maybe even one of the serverless platforms.
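A minimal sketch of the general pattern described here: an Airflow DAG whose tasks each run a prebuilt, dependency-isolated Docker image. It uses the stock DockerOperator (from the Docker provider package) rather than Enigma's in-house Concourse plugin, and the image name, registry, and schedule are hypothetical:

```python
# Every task in the DAG runs a baked Docker image rather than inline Python,
# so each workflow keeps its own pinned dependencies while still getting
# Airflow's scheduling and orchestration.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator  # apache-airflow-providers-docker

with DAG(
    dag_id="ingest_hypothetical_public_source",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    acquire = DockerOperator(
        task_id="acquire",
        image="registry.example.com/workflows/hypothetical-source:1.0.0",  # hypothetical image
        command="python -m workflow acquire",
    )
    standardize = DockerOperator(
        task_id="standardize",
        image="registry.example.com/workflows/hypothetical-source:1.0.0",
        command="python -m workflow standardize",
    )
    acquire >> standardize
```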
The output of those workflows is either raw files, like CSV or something like that, to be consumed by some downstream process. But more frequently, we have sort of a standard output to Parquet that also writes a variety of metadata, and that's what's actually consumed by the linking platform. So we have sort of a bespoke format we call a linked data package, which is a file-based representation of a graph. So when we take a dataset and we've processed it and it's ready for entity resolution and the sort of machine learning piece, it's sort of encapsulated in this linked data package artifact that we put on S3, in our data lake.
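The linked data package is a bespoke internal format, so the layout below is only a guess at its general shape: node and edge tables written as Parquet alongside a small metadata document in an S3 data lake. The bucket, keys, and metadata fields are hypothetical:

```python
# A hedged sketch of writing a file-based graph artifact to a data lake.
import json
from datetime import datetime, timezone

import boto3
import pandas as pd

def write_linked_data_package(nodes: pd.DataFrame, edges: pd.DataFrame,
                              bucket: str, prefix: str) -> None:
    """Write node/edge tables as Parquet plus a small metadata document."""
    s3 = boto3.client("s3")
    for name, table in [("nodes", nodes), ("edges", edges)]:
        body = table.to_parquet()  # returns bytes when no path is given (pyarrow engine)
        s3.put_object(Bucket=bucket, Key=f"{prefix}/{name}.parquet", Body=body)

    metadata = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "node_count": len(nodes),
        "edge_count": len(edges),
        "ontology_version": "0.1.0",  # hypothetical field
    }
    s3.put_object(Bucket=bucket, Key=f"{prefix}/metadata.json",
                  Body=json.dumps(metadata).encode("utf-8"))
```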
The next piece of that is sort of owned by a different team; we divide this into sort of three subteams. That squad owns the machine learning, which includes enrichment, feature generation, entity resolution, and all of that part of the pipeline is implemented in Spark. We have an auto-scaling Spark cluster out on AWS, and they implement those processes and can run these very large machine learning jobs to take all of our individual dataset graph fragments and resolve them into a single knowledge graph, which is sort of the principal artifact of the system, right, this singular knowledge graph that contains the resolved entities and relationships.
They then hand that off, sort of an interesting footnote, also as an LDP. Because these linked data packages are simply representations of a graph, we reuse that model as the contract for the third stage. We hand off the resolved graph to the third team, which handles sort of the hosting of the graph and then the delivery via the API. The current model is we're loading that graph into AWS Neptune, which is a hosted graph database solution. And then we have a query layer out in Amazon that's built over the top of that, which really slims down the specific type of queries, not because we're worried about performance or anything like that, but because we're really trying to serve very particular use cases. So we have these very targeted queries that clients can use to get exactly what they need out of the graph, and probably will end up exposing many endpoints in the future for different use cases. So that's sort of the endpoint. All parts of that, except for the middle piece, are basically all implemented in Python. And then that central component, being Spark, is implemented in Scala for performance. And that's kind of the architecture as it exists right now. And can you give an idea of some of the different types of data sources that you're pulling from and some of the processes that you go through to
[00:17:05] Unknown:
vet those data sources before you start implementing them in production?
[00:17:10] Unknown:
Sure. So we ingest a really wide variety of data sources. They are primarily public data sources, which means they can be anything from a website scrape to a CSV we download off an FTP. It could be some old legacy binary format. It could really be a lot of things. In some cases, we're ingesting many years back. Formats may change. So in some cases, there can be a fair amount of nuance to how we gather all of the data we want. In some cases, we're also doing things like rolling ingestion. We don't do a lot of that at the moment, but I anticipate there will be more of that. It really kind of runs the gamut of data acquisition techniques that we apply.
In terms of the sort of up-front investigation that we do, the kind of investigation we're doing right now is primarily sort of market value focused. Right? We're going out and trying to find what public data sets will serve the use case best. And when we don't find them, we do buy data as well. So we have a handful of sort of critical data sets in the graph that are data sets that we've purchased and resolved in. But the key point there is that we're really focused on getting data to fill certain pieces of our ontology. You know, if there's a particular attribute we have very poor coverage in, we can go identify a dataset that has that. Once we actually get to the point of acquiring that data, we do sort of ontology-driven validation of what we actually acquire. So at the simplest, that's: are the columns that we've ontology-mapped actually there, the columns we expect to be there? But it can also be, are they the type we expect? Do they have the fill rate we expect? Things like that.
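A simple sketch of the kind of ontology-driven checks described here, presence of the mapped columns, expected types, and fill rate, with hypothetical expectations:

```python
# Validate an acquired dataset against ontology-driven expectations before it
# is allowed to enter downstream graph construction. Expectations are invented.
import pandas as pd

EXPECTATIONS = {
    "company_name": {"dtype": "object", "min_fill_rate": 0.99},
    "state": {"dtype": "object", "min_fill_rate": 0.90},
}

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable validation failures (empty if clean)."""
    problems = []
    for column, rules in EXPECTATIONS.items():
        if column not in df.columns:
            problems.append(f"missing ontology-mapped column: {column}")
            continue
        if df[column].dtype.name != rules["dtype"]:
            problems.append(f"{column}: expected {rules['dtype']}, got {df[column].dtype}")
        fill_rate = df[column].notna().mean()
        if fill_rate < rules["min_fill_rate"]:
            problems.append(f"{column}: fill rate {fill_rate:.2%} below "
                            f"{rules['min_fill_rate']:.0%}")
    return problems
```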
And that's a system that really is in its infancy, and there's a lot of opportunity to improve that, to be able to apply those kind of quality checks in a standardized way across all the data we ingest. And it's something that I think we're gonna be doing a lot more of in the near future.
[00:19:06] Unknown:
And in terms of being able to consume all of these various data sources and process them in a timely fashion, I'm curious what you found to be some of the most challenging or unexpected aspects of being able to build the underlying infrastructure necessary to create and process these graph attributes and these graph entities and some of the software life cycle workflows that you've built in to be able to create and manage the ETL code necessary for ingesting all of these various sources?
[00:19:44] Unknown:
Sure. So I think that the most fundamental challenge that the data engineering team at Enigma has is figuring out how to scale out the bespoke part of the work that we do. Right? So every data source that we acquire is different, and it's a problem that in traditional ETL, you generally don't have. The number of data sets that you ingest tends to be relatively small, and you tend to be ingesting them for the purposes of analytical workflows, like, for instance, click tracking or something like that, where the structure is fairly rigorous, you control the pipeline, if not end to end, at least the majority of it. We don't have that. We have data sets that are very heterogeneous.
They can be different in everything from format to quality, to complexity, to size. So the hardest problem that I think we have is figuring out how to do the sort of traditional software engineering work of writing well-abstracted code, while also allowing ourselves sort of the ultimate flexibility of recognizing that really, at the end of the day, only code can account for the level of variety we see. You know, there's a long history of trying to tackle ETL with configuration or with WYSIWYG solutions. Those are sort of always unsatisfying, and they're especially ill-equipped for the sheer variety of kinds of data that we're ingesting.
Really, we need code to do that. So we've built, you know, Concourse is sort of the key piece of that because it allows us these dockerized workflows. We can isolate their dependencies and we can have them pinned to a particular version of the library, and they will run forever, or they should. Right? Once that Docker image is baked, that artifact should be able to run that code in perpetuity unless the source changes. So that's sort of one piece of firming up that contract, but then also, as we've been writing these things, we do discover cases where we need to share code. In the case of the linking platform, there's parts of that process that we wanna iterate independently of any particular workflow. We may wanna change how the ontology is consumed or how certain kinds of validation are applied or the exact output format for the linked data package.
So we've sort of got a mixed model now where we take the code for the workflow and we encapsulate that really tightly, but then there's pieces that we sort of attach, and those can change at run time. Those can be independent of any particular workflow or the Docker image that's created from that workflow. But I'm not gonna say we've got the balance perfect yet. It's something that we're constantly iterating, trying to figure out how you construct ETL processes. You know, let's just pick a number out of the clouds here. A thousand ETL processes, all of which are different, but without creating, you know, a thousand times the technical debt. And that, I sort of think, is the key problem that my team is trying to figure out.
And I think we're making good headway on it. So in order to make that process work, we really have very extensive build tooling around our workflows. Because our workflows are code, because they're not implemented in some database system or something like that, our workflows are all in source control, and that process of building the Docker images and building the DAGs to run in Airflow, that all happens in CI. And really, at this point, a very large portion of our infrastructure lives in that CI system that's in charge of running the tests for those workflows, building out those images, pushing them to the appropriate environments, you know, dev to stage to prod, and ensuring that versions are iterated correctly, that the images can be built appropriately. All of those kinds of things are sort of part of that software life cycle. And one thing that I think we try to keep front of mind at Enigma is that ETL is software. There's a lot of baggage around ETL. I think in a lot of companies, it sort of gets relegated to sort of third-tier engineering status. But at Enigma, ETL is right at the heart of the problems we're trying to solve. So we really treat everything around how we acquire data as software that is worthy of testing and tooling and automation and good quality code, all of the things that you bring to platform architecture or something like that.
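A rough sketch of the CI steps described above, run the workflow's tests, bake its Docker image, and push a version-tagged artifact for a given environment. The registry, naming scheme, and promotion rules are hypothetical, and Enigma's actual CI configuration is not public:

```python
# Test, build, and publish one workflow image for one environment.
import subprocess

def build_and_push(workflow: str, version: str, environment: str) -> None:
    """Fail fast on test failures, then bake and publish an immutable image."""
    image = f"registry.example.com/{environment}/workflows/{workflow}:{version}"

    # Fail the pipeline if the workflow's unit tests fail.
    subprocess.run(["pytest", f"workflows/{workflow}/tests"], check=True)

    # Bake the workflow and its pinned dependencies into an immutable image.
    subprocess.run(["docker", "build", "-t", image, f"workflows/{workflow}"], check=True)

    # Publish so the Airflow deployment for that environment can run it.
    subprocess.run(["docker", "push", image], check=True)

if __name__ == "__main__":
    build_and_push("hypothetical-source", version="1.0.0", environment="stage")
```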
[00:24:18] Unknown:
And one of the long-standing points of confusion or uncertainty that has come up in a number of the conversations I've had is in terms of how you create and structure the unit or integration or acceptance tests around ETL code and overall pipeline code, largely because of the volumes and varieties of data that you're dealing with, and particularly in the case of dealing with unbounded streams of data, but also in the case that you're dealing with, where you have such a large variety of data. So I'm wondering what types of tests you're creating and some of the litmus tests that you're using to ensure that the data that you're processing in production is able to meet the quality checks that you are building in during the early stages of creating those processing steps?
[00:25:14] Unknown:
Yeah. So there are sort of two answers to this question. On the linking platform side, in terms of the data that's actually delivered for entity resolution and for our knowledge graph, we take the approach that nothing should reach that part of the process that could possibly cause it to fail. So we try to push all validation of the data as early in the process as possible, and that means at the time of acquisition. So when we acquire a dataset, the last thing that we do is take the ontology and apply all these validation rules to ensure that when it enters the knowledge graph construction process, feature generation, enrichment, and entity resolution, that that process will not fail because of something that's wrong with the data. This has been a huge sort of pain point for us because that's the most complicated and longest running piece of the pipeline.
It really can't fail because one of a hundred input datasets has an integer where a string is expected. So we try to provide, like, really rigorous validation on the data that we output from ingestion, but that doesn't solve the problem of how we maintain and test individual workflows, so the individual ETL components of the process. And that's an area where I think we're iterating a lot right now. I mean, one thing I already mentioned is we really do treat those ETL processes as software in its own right, and that means we do unit testing on our workflows. Right? If there is a piece of the workflow that, I don't know, say generates URLs based on some set of inputs, we test that. We treat that as a unit of code that's worthy of a test. And if there is, for instance, let's say, a complicated XML data structure that is an input from the source, we'll take a subset of that and write a sort of soup-to-nuts test using that as an input that runs it through our process and validates that, at least for a fragment of valid source data, our workflow continues to work. Now what we don't do a good job at yet, but we're actively looking at, is how do we better handle the changes in the source, which are totally unpredictable, but which we really wanna catch immediately, before we apply any ETL code at all. So we're looking at ways of caching the structure of the data that we received last time we requested it, and then sort of looking at the delta in the data structure from when we got it last time to what we're seeing now, so that we can fail right at the beginning of the process and say, okay, this source that's out there on the Internet, they just uploaded a different schema or a different kind of file. It is no longer what we thought it was. Really, the only recourse for us is to fail out as early as possible and get that in the hands of an engineer who can inspect it and figure out how we can adapt to it. So rather than trying to build ultra-durable pipelines, which is not really possible because these public data sources change all the time, we're trying to build in really good error handling and failure cases that get that in the hands of somebody who can address the issue as quickly as possible.
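A hedged sketch of the early-failure idea described here: fingerprint the structure a source returned last time, compare it to what was just fetched, and stop before any ETL runs if the two differ. The cache location and fingerprint format are hypothetical:

```python
# Compare the structure of a freshly fetched source against a cached
# fingerprint and raise before any transformation code runs if it changed.
import json
from pathlib import Path

import pandas as pd

CACHE_DIR = Path("schema_cache")  # hypothetical location

def schema_fingerprint(df: pd.DataFrame) -> dict:
    """Capture column names and dtypes as a comparable structure."""
    return {column: str(dtype) for column, dtype in df.dtypes.items()}

def fail_if_schema_changed(source_name: str, df: pd.DataFrame) -> None:
    """Raise immediately if the source's structure no longer matches the cache."""
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / f"{source_name}.json"
    current = schema_fingerprint(df)
    if cache_file.exists():
        previous = json.loads(cache_file.read_text())
        if previous != current:
            raise RuntimeError(
                f"{source_name}: source structure changed from {previous} to {current}; "
                "failing early so an engineer can adapt the workflow."
            )
    cache_file.write_text(json.dumps(current, indent=2))
```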
[00:28:43] Unknown:
And when you're extracting all of these data sources and then building up the knowledge graph, is the graph itself something that can be easily updated incrementally, or do you have to do either like a full recompile of the node structures or recompiling large subsets of the structure?
[00:29:04] Unknown:
Yeah. So that's an area that we're really actively looking at. Right now, for sort of our first go-to-market, we are rebuilding the graph. Our iterations on our source data are not so frequent that we really need to be constantly revising the graph. Right? Most of our use cases are not real time. People are consulting the knowledge graph for information about a company, about a place. And that information generally is not something which needs to be updated on a daily or hourly basis. So we are able right now to regenerate the graph on demand. It's a fast enough process that we can do it fairly frequently.
But we also know that as the size of the graph scales, there is gonna be a threshold at which we need to do iterative updates, and we've got some, like, pretty good ideas about how we'll be able to do that. It just hasn't been a priority for the current cycle. I expect that, you know, the scale of the graph that we're building, which is already, I think, fairly large, is gonna increase by multiple orders of magnitude in the next year or two. And so we really will have to tackle that problem at some point. It just hasn't been a priority thus far. And in terms of being able to build and scale that infrastructure,
[00:30:26] Unknown:
what are some of the challenges that you're facing currently and that you anticipate coming up in the near future?
[00:30:33] Unknown:
Yeah. So this is not quite the answer you're expecting, I think, but the biggest challenge is the technical debt we acquire, either knowingly or unwittingly. So that process of scaling out the number of datasets is something that we're trying to approach very methodically, and we're trying to sort of constantly iterate on the process itself to ensure that we're learning and figuring out what the right shape of all those processes are. Aside from that, we have sort of all the traditional scaling problems that come with building a platform like this. We have to figure out how to regenerate that graph in a performant fashion even as the scale of it increases significantly. We have to figure out how to get that refreshed graph loaded up into Neptune in an efficient and time-sensitive manner.
We need to scale out horizontally our data acquisition and standardization processes a lot more than we already have. You know, we're not ingesting so many datasets now that we couldn't make do with just a fixed number of workers, but we want to be able to do things like, for instance, click a button and revalidate and apply a fresh ontology to every dataset that's part of the graph. Right? We'd like to be able to run that instantly, well, not instantly, but immediately, over every dataset that we use as an element of the graph. And that means probably moving to some kind of a serverless architecture.
At the very least, it's gonna be moving to a containerized architecture that's more flexible than the one we already have. So we're gonna be solving that kind of scaling problem too, the sort of traditional infrastructure problems. And then looking out a little further at the kinds of problems that we're gonna be solving, we've got some really interesting challenges around things like temporality in the graph, how do we encode time. We have sort of multiple different kinds of time that we care about in the context of this data, and that's a challenge that I'm sort of especially keen
[00:32:47] Unknown:
to tackle at some point in the next year or so. Yeah. I was actually just wondering about that aspect of versioning the data or being able to traverse the historical attributes of a given entity, particularly for the case of things like companies or maybe locations with some sort of historical significance, so that you can see maybe some of the different uses that it has undergone over the course of time and being able to explore that in some fashion for people who are consuming that information and using it to enrich their own analysis.
[00:33:22] Unknown:
Absolutely. It's definitely something that's on our roadmap and something that I'm particularly excited about. I think the temporality question is interesting because you do kind of have to architect the entire platform for it. The temporality of some attribute can, in different cases, be a function of what year a dataset is about, what day you acquired a dataset on, what time period a particular row is applicable to. Some of the attributes sort of decay. Right? Like, they cease to be accurate after some period of time, but others might be durable forever, or they might have fixed periods of duration that could vary from attribute to attribute. So temporality is a really challenging thing to address within the context of the knowledge graph, but I think that our sort of holistic approach to this and our way of thinking about the knowledge graph, I think it is a solvable problem. And I think that there's a lot of appetite in the market for us to figure that out and do it really well.
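A speculative sketch of how temporal context might be attached to a single attribute value, capturing the different kinds of time Chris mentions (what period a fact is about, when it was observed, and when it decays). The field names are invented, not Enigma's model:

```python
# One attribute value with the several kinds of time that may apply to it.
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

@dataclass
class TemporalAttribute:
    name: str                       # e.g. "registered_address" (hypothetical)
    value: str
    valid_from: Optional[date]      # period the fact is about, if stated
    valid_to: Optional[date]
    observed_at: datetime           # when the source dataset was acquired
    expires_at: Optional[datetime]  # when the value should be treated as decayed

    def is_current(self, as_of: datetime) -> bool:
        """A value is current if it has not decayed as of the given time."""
        return self.expires_at is None or as_of < self.expires_at
```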
[00:34:28] Unknown:
And in terms of the actual data infrastructure and the environments that you're using for processing the data ingest and ETL logic, I'm wondering if you are actually using some of the traditional software approach of having a production and a preproduction environment for being able to do some of that testing and validation logic, and some of the challenges that you've had to overcome if you do, in fact, have that capacity built in? Absolutely, we do. All of our workflows
[00:35:02] Unknown:
can run first in a staging environment and then in production. Eventually, we probably will have three environments, because we will want a truly parallel production environment where we can test system changes, in addition to having an environment to test workflow changes. You know, we have a pretty traditional software engineering process around our workflows. The workflows, you know, they flow through a process on our Jira board, which involves testing and code review and sort of all the traditional checks and balances of software engineering. Nothing gets into production that hasn't run end to end in staging. I don't think we've encountered a lot of challenges that are specific to that, with the exception that keeping sort of artifacts in sync and building proper promotion policies in CI and all of those things are just complicated. And I think they're especially complicated in our system, given that we have this idea of compiling things to Docker images and DAGs, and we need to make sure that we're sourcing the correct version for each environment.
We need to make sure that all of the libraries that individual workflows can depend on, or that the system itself can depend on, are appropriately versioned across environments, and we can ensure that the right version is going where it needs to be. Those Airflow systems are also being live-deployed with DAG changes as we iterate on workflows. So we have to make sure that they have the right versions of everything to align with each deployment of the DAGs. So it's a complicated system, but I would say it's not been a special pain point for us. It's something that we will probably have to think a lot more about again as this number of workflows continues to increase.
The level of nuance we need in that build tooling is gonna continue to increase. There will come a point at which we can't keep all of those workflows in a single repo. Other problems like that are on the horizon,
[00:37:00] Unknown:
but we're not quite there yet. And I think you also have the benefit of the fact that you don't have to try and replicate data from production to these pre-prod environments to be able to run some of these validations, because of the fact that you're rebuilding the resultant data during each run. So you don't have that issue of the data gravity between the environments to contend with, or trying to figure out some sampling subset of that production data to be able to use in validating earlier in the stages, unless you're trying to do a direct comparison between the outputs of your staging environment and what's currently in production.
[00:37:41] Unknown:
That's right. I mean, you know, S3 is a wonderful thing, and it allows us to just sort of dump all of these outputs out there and keep them forever. So we can easily go in and compare staging and production outputs, compare outputs across runs. You know, our machine learning team, our knowledge graph team, can consume the latest outputs. If one of those turns out to be invalid for any reason, they can easily roll back and consume an earlier version. We can vary the versions of the ontologies that we're ingesting with at any time. If we ingest with a version of an ontology and then we say, oh, wait, that's not gonna build the graph that we want, we can roll back and rerun that process with a different version of the ontology. So we have a fair amount of flexibility in how we wire these things together, and that's very much by design. It's reflected in the structure of the teams that work on this.
We have as, you know, clear a contract between those teams as possible, and that allows them to iterate very independently. And this is just one piece of that. And looking at your existing technical infrastructure
[00:38:45] Unknown:
and data platform, what are some of the weak spots that you are thinking about and that worry you when you start to consider the changes that you wanna have in place for the future? And what sorts of projects do you have planned to address some of those issues that you've identified?
[00:39:04] Unknown:
Yeah, sure. So, I mean, I've already touched on some of these things: making sure that we're not creating mountains of technical debt with our workflow processes is constantly on my mind. You know, the scaling solutions that we are gonna need to do that sort of, like, one-click run-everything kind of model really are pushing us to move quickly towards some sort of serverless architecture. So we're looking at, you know, the world of tools that we have available is changing so rapidly, we almost can't keep up. But there's things like AWS Batch out there now, which provide very similar functionality to what we have built on top of Airflow. So that's one thing that we're looking at. Getting that scaling equation right and being out in front of demand is gonna be really critical for us. So I think we're thinking about that. I maybe wouldn't quite say worrying yet, but it's something we have to figure out. And then in terms of, you know, the work of the other squads, I do think about graph regeneration and, again, sort of scaling ahead of that scaling curve and ensuring that the technology is there and ready when the commercial team gives us the next, you know, 100,000,000 rows of data and says this is what's next. We always have to be out ahead of that. And I think so far we've done a pretty good job, but it's, like, a continuous battle. And do you think that if you were greenfielding this entire project today
[00:40:29] Unknown:
that you would end up in some of the same spaces that you are right now? Or are there any major architectural decisions that you would make differently without the weight of legacy?
[00:40:42] Unknown:
Yeah. I mean, I don't think there's too many decisions we would make differently. I think that there are tools that have come out since we started development, things like Batch and other hosted services, that we would seriously consider building on instead of the system that we have in house, but really to save us operational overhead, not because we think they're necessarily superior software solutions to what we have. The other thing that I think we might have done a little differently is, or this is just my opinion, I'm actually not sure, I'd be interested to know if my colleagues would agree with me, but we don't have great system-wide orchestration of the process yet.
And I think that perhaps we could have built more of that upfront, or at least built the expectation of it upfront. So, for example, automatically rerunning the ingestion process. We can sort of build that as a feature, but I think the system probably would have benefited from a little more thinking about how the disparate components could be orchestrated together, so that we could sort of run the process, in an idealistic model, with one click. None of that is really hurting us that much right now, but I do think that, you know, when we talk about greenfields, I'm sort of inspired to think about really optimizing it. And those are the areas where I think we have pain right now. And in terms of the customers that you have and the types of projects that they're
[00:42:15] Unknown:
building on top of the data resources and infrastructure resources that you've created, I'm curious, what are some of the typical use cases that you've seen and maybe some of the ones that stand out as being particularly interesting or unexpected?
[00:42:32] Unknown:
Yeah. So, you know, this product line that we're working on now, we're really just getting ready to go to market with. But as I mentioned, we've built a lot of previous prototypes that look very similar and just weren't built on this particular technology stack. And those tend to be in spaces where you might expect the application of public data is really useful. So things like anti-money laundering. We've done a lot of work with banks trying to catch money launderers, which is a problem where the application of a large amount of public data is sort of an obvious choice. And especially if you can take that public data and integrate it with the data they have in house, either literally in a knowledge graph or, at the very least, in a graph-style sort of way of connecting it together, you end up with much more than the sum of its parts. So that's the kind of space that we've done a lot of work in. Other things we've done that I think are exciting, you know, we had a product called pharmacovigilance, which did that sort of same thing, but using adverse drug event data sets.
When people take a drug and get sick, that information can be reported at the local level, the state level, the federal level. The systems do not share a common schema. In some cases, there can be duplicates across those systems. So we've built tooling, working with pharmaceutical companies, to try to deduplicate those and generate a better set of records around those adverse events. I think that's a really interesting application and the kind of application that I think we're gonna see a lot more of going forward. You know, right now, we're still figuring out what the sort of optimal market case is. We know that there are many different areas in which we can apply this tooling, and we're trying to figure out, okay, which ones do we go after first? I think we have some pretty good hunches. There's a lot of interesting opportunities in areas like insurance, and, of course, in banking.
Any place where a company needs authoritative records on things that are in the public domain, like companies or places. I will say, you know, going back to sort of my history as a journalist, I think one of the most exciting applications of this technology is the ability to use ontology design to quickly assemble national or global datasets from disparate sources. So, you know, you look at an example of something like elections reporting, which can vary by county or state, and building a national dataset is a nontrivial problem that people have spent many years on. And I think that using ontology to bridge the gap between local representations of that information and sort of compile it into a de facto knowledge graph, I think that kind of model presents huge opportunities, both for building datasets that have value to our customers and also for building datasets that have value to the public domain. Enigma has a long history of giving the data we collect back to the public domain, and I don't think that stops with the knowledge graph. So I am personally very excited about the places where we can apply this technology to things which have real value to citizens and individuals as well.
[00:45:49] Unknown:
And as you mentioned at the beginning, you've recently secured a new round of funding. So I'm curious what types of new projects or business growth or feature additions you have in store for the future of Enigma and some of the ways that you're planning to grow or improve going into the future?
[00:46:11] Unknown:
Absolutely. Well, it's a super exciting time for us. We raised this round of funding, and our investors have really given us a vote of confidence in this vision we see for our knowledge graph technology. And we're doing more of everything, more proof of concepts with clients around the knowledge graph, more engineering dedicated to this technology, but scaling out really all parts of the organization. One big investment we're in the process of making is building out a team dedicated to doing the bespoke acquisition part. So we know that we wanna acquire a lot more data than we are today, so we are actively hiring for a lead for our data acquisitions team, who will sort of be responsible for scaling out that human process of going and getting all of those datasets.
And the data engineering team will sort of retain responsibility for the technical part of the process, for the tooling, the pipeline, etcetera. That's a team that we foresee hitting double digits fairly quickly. And in fact, we're gonna open a second office. So that lead that we're looking to hire right now will also be responsible for Enigma's first expansion office. So, really, expansion across the board. We've got a couple, like, very significant contracts with partners that are coming on board with the technology early or with other similar or related technologies that we've built at Enigma.
So there's just a tremendous amount of growth, and I'm excited about all of those things. All of those other problems I've mentioned, better tooling for acquisition, temporality, all of those things are things we're enabled to tackle because of that funding.
[00:47:55] Unknown:
And are there any other aspects of the work that you're doing at Enigma, knowledge graphs, public data, uses of the resources that you're building that we didn't cover yet that you think we should discuss before we close out the show? So I think the other thing,
[00:48:11] Unknown:
sort of going a little deeper into one thing we already talked about, you know, the problems of acquisition. You know, I talked about sort of the software engineering problem of how you abstract all those pieces, how you test it. But I think there's also a more fundamental problem that we're really thinking a lot about, which is what's the model for writing that kind of code? So the ETL processes we write, they exist as sort of an uncomfortable middle ground between something you want Visual Studio for and something you wanna do in a Jupyter Notebook. You know, there's sort of these two evolving models of writing data processing code. I see those sort of converging for certain use cases, and I think our use case is probably one where there is some intermediate that is better than either of the options we have right now. Jupyter Notebooks don't really work for us because it's very hard to write well-modularized code. It's hard to build the kind of abstractions that we want in a Jupyter notebook where we sort of have blocks of code that run as independent tasks, you know, like separate nodes in the DAG, if you will. But at the same time, the sort of traditional software engineering tools are also a real pain for us because our processes largely are procedural, and you wanna be able to step through them one at a time, view intermediate states, verify that your transformation did what you intended. In a lot of ways, that authoring process is more efficient for us than, you know, having to drop into the debugger and then restart something or whatever it may be. So I'm really excited, and one thing we sort of have on our moonshots list for the winter is to start looking into things like Papermill, which is a system from Netflix for automating notebooks, and other systems like it. But we'd like to see that go a step further. We'd like to think about generating notebooks, or maybe it's degenerating notebooks into another format. But there's lots of interesting things, I think, which could sort of serve our middle ground use case where we want that procedural authoring model, but we also want the flexibility to run these things and organize them the way we would organize more traditional software.
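A small sketch of the Papermill idea mentioned here: executing a parameterized notebook headlessly and keeping the executed copy as an artifact. The notebook paths and parameters are hypothetical:

```python
# Execute a parameterized notebook without opening it interactively and keep
# the executed output notebook for later review.
import papermill as pm

pm.execute_notebook(
    "workflows/hypothetical_source/transform.ipynb",   # notebook with a "parameters" cell
    "artifacts/transform_2018-09-01.ipynb",            # executed copy, kept as an artifact
    parameters={"snapshot_date": "2018-09-01", "environment": "stage"},
)
```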
[00:50:23] Unknown:
And as you're talking about that, there are two projects that come to mind that might actually fit your use case at least partially. So when you mentioned decompiling the notebooks, there's a project that came out recently called Jupytext that might be helpful. But the other project that seems like it would actually be a more direct fit is something called Cauldron notebooks, which was written specifically for being able to use a lot of the traditional software engineering principles of modularity and executability, and using them in version control, but still having some of the notebook interface. So that might be worth looking at further.
[00:51:02] Unknown:
So I'll add links to all those in the show notes as well. Great. Yeah, I'd love to look at both of those. I mean, I think this is something that I feel like there's a mind share in the data engineering community swirling around these ideas, and these might be what we're looking for. Nothing I've seen so far quite hits the nail on the head, but I am confident there is something
[00:51:23] Unknown:
that can be built that will serve us better than the tools we have today. Alright. Well, for anybody who wants to follow the work that you're up to or get in touch about any of the things we've talked about today, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you view as being the biggest gap in the tooling or technology that's available for data management today. So I do think for our use case, the biggest gap probably is in the authoring tools.
[00:51:54] Unknown:
If there's another place where we need to improve a lot, it's in the observability of our processes. I'm not sure it's a question of specific tooling. Maybe it's tooling we actually have to build for ourselves. But, you know, we have this very elaborate process where data flows through our system, and the provenance tracking within that is somewhat limited and is really something we're gonna have to address. And I haven't seen a system out there that's gonna work perfectly for us. It's probably something we're gonna be working on in the coming year. Well, thank you very much for taking the time today to talk about the work that you're doing at Enigma
[00:52:32] Unknown:
and some of the issues that you're dealing with in your data engineering organization. It's definitely been very interesting and enlightening. So thank you for that, and I hope you enjoy the rest of your night. Great. Thanks, Tobias. You too.
Introduction to Chris Groskopf and Enigma
Enigma's Mission and Knowledge Graphs
Challenges in Taxonomy and Data Quality
Data Platform Architecture
ETL Challenges and Solutions
Updating and Scaling the Knowledge Graph
Production and Preproduction Environments
Future Projects and Scaling Challenges
Customer Use Cases and Applications
ETL Process Models and Tooling
Biggest Gaps in Data Management Tooling