Summary
There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph from public data that you can use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph to serve to their customers. He discusses the challenges they are facing in scaling the platform and their engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?
- How do you define the concept of a knowledge graph?
- What are the processes involved in constructing a knowledge graph?
- Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?
- What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?
- How do you manage the software lifecycle for your ETL code?
- What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?
- What are the current challenges that you are facing in building and scaling your data infrastructure?
- How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?
- What techniques are you using to manage accuracy and consistency in the data that you ingest?
- Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?
- What are the weak spots in your platform that you are planning to address in upcoming projects?
- If you were to start from scratch today, what would you have done differently?
- What are some of the most interesting or unexpected uses of your product that you have seen?
- What is in store for the future of Enigma?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Enigma
- Chicago Tribune
- NPR
- Quartz
- CSVKit
- Agate
- Knowledge Graph
- Taxonomy
- Concourse
- Airflow
- Docker
- S3
- Data Lake
- Parquet
- Spark
- AWS Neptune
- AWS Batch
- Money Laundering
- Jupyter Notebook
- Papermill
- Jupytext
- Cauldron: The Un-Notebook
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning life cycle. Skafos maximizes interoperability with your existing tools and platforms and offers real-time insights and the ability to be up and running with cloud-based production-scale infrastructure instantaneously.
Request a demo today at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat. Your host is Tobias Macey. And today, I'm interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph as a service. So, Chris, could you start by introducing yourself?
[00:01:33] Unknown:
Yeah. I'm Christopher Groskopf, Chris, and I am the technical lead on the data engineering team at Enigma, a company that uses public data to build knowledge graphs.
[00:01:44] Unknown:
And how did you first get involved in the area of data management? So my background's a little unconventional.
[00:01:49] Unknown:
I spent most of the last decade working in journalism, as a data journalist. I worked at a number of publications, including the Chicago Tribune, NPR, and Quartz. In those, I occupied a variety of roles, all sort of circling around data. So I was a news applications developer, I was a reporter, and for a brief time, I also worked on a grant building a data warehousing tool for journalists. So I've had a number of roles using data in newsrooms and also building open source to be used in newsrooms. So I've built a toolkit called csvkit that has been pretty widely adopted and a data processing library called Agate. And so really have circled around data in a variety of different ways.
And then more recently have moved to Enigma where we're sort of applying all those lessons I learned working with public data and news to much larger challenges.
[00:02:48] Unknown:
And so can you give a quick overview of the problems that Enigma was built to solve and some of the motivation for starting the company, if you have that context?
[00:02:59] Unknown:
Sure. So Enigma's sort of key goal is we're trying to connect public data and make it useful intelligence for businesses, I mean, and any kind of user, really. So the original premise of Enigma was, like, what if we had Google for structured data? You know, Google has been this tremendous, game-changing tool for allowing us to discover the Internet, the unstructured data of the Internet. But there's not a corollary. There's not a similar tool for working with structured data, data that might be hiding in databases or different file formats.
And so Enigma sort of started out around that hypothesis and has gone through sort of a variety of iterations of trying to figure out what the shape of the problem is. So we've done work around things like anti-money laundering, pharmaceutical efficacy, a lot of different kinds of problems working with large companies and all sort of circling around this question of how public data gets applied to solve real world problems. And now we've raised our Series C, and we're reinvesting in sort of what we think is the right way to do this.
[00:04:13] Unknown:
And so one of the primary sort of concerns at Enigma is this idea of the knowledge graph. So can you give a quick definition of how you define a knowledge graph and maybe some of the broad use cases that it enables?
[00:04:33] Unknown:
Yeah. So at its most simplistic level, a knowledge graph is really just a way of structuring data about the world as a graph. So take facts about the world, and rather than putting them in rigidly schematized database tables, we structure them into entities and relationships in a graph, which can be traversed as a graph and gets all your sort of computer sciency graph capabilities. But at a sort of more abstract level, the way that we think about graphs is that graphs are a way of connecting information which shares some ontological meaning, shares some semantic meaning, but doesn't share schema. So we construct graphs from a huge variety of sources that were never intended to be used together. So dataset A and dataset B were produced by different, let's just say, federal agencies or different counties in nonstandard formats, and we want to somehow mesh those together into a common query layer and a common representation.
So by mapping those into a common ontology and saying column X in this dataset means the same thing as column Y in this dataset, we can construct these knowledge graphs where that information which was disconnected is now connected and queryable. It has a lot of power for building these datasets that cross what are traditionally siloed sets of information. So I don't think we quite have the, like, one-sentence explanation of what we think of as a knowledge graph. Knowledge graph technology has been around for a long time, and I think a lot of companies have sort of had their spin on it. It's really the category of problem we're trying to solve with the knowledge graph which I think is the most interesting part.
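As a rough illustration of the column-mapping idea described above (not Enigma's actual code), here is a minimal sketch in pandas; the source datasets, column names, and ontology attributes are hypothetical:

```python
# A minimal sketch of ontology mapping: two hypothetical public datasets that
# describe the same kind of entity with different column names are renamed
# into a shared vocabulary so they can be queried together.
import pandas as pd

# Hypothetical source datasets with incompatible schemas.
sec_filings = pd.DataFrame(
    {"registrant_name": ["Acme Holdings"], "registrant_state": ["NY"]}
)
state_registrations = pd.DataFrame(
    {"entity_nm": ["ACME HOLDINGS INC"], "jurisdiction": ["NY"]}
)

# Per-source mappings into a common ontology ("company_name", "state").
ONTOLOGY_MAPPINGS = {
    "sec_filings": {"registrant_name": "company_name", "registrant_state": "state"},
    "state_registrations": {"entity_nm": "company_name", "jurisdiction": "state"},
}

def to_ontology(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Rename source columns to ontology attributes and tag provenance."""
    mapped = df.rename(columns=ONTOLOGY_MAPPINGS[source])[["company_name", "state"]].copy()
    mapped["source"] = source
    return mapped

combined = pd.concat(
    [to_ontology(sec_filings, "sec_filings"),
     to_ontology(state_registrations, "state_registrations")],
    ignore_index=True,
)
print(combined)
```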
[00:06:20] Unknown:
And one of the challenging aspects in any data project, but particularly for something of the scope and ambition of what you're trying to do at Enigma, is establishing and adhering to a taxonomy, because that will largely define the capabilities that are possible based on the data that you're using and the way that you're structuring it. So how is that established, and how has that evolved over the time that you've been at Enigma?
[00:06:52] Unknown:
Yeah. So this is still a relatively novel approach that we're taking within Enigma to sort of build knowledge graphs that can solve many problems. Traditionally, Enigma has approached problems like this as one-offs, and we sort of learn from each of those prototypes the general patterns. And now we're sort of taking those general patterns, trying to build a more generalizable, more scalable implementation that will allow us to solve a lot of similar problems in the same way. So taxonomy becomes increasingly important in this new model. And I'm not gonna say that we have quite figured it out yet, but I think the thing that we know at this point is that we do have to derive the taxonomy we wanna use from the cases that we actually wanna solve. So we're sort of not trying to do the one true taxonomy that's gonna apply in every possible domain.
We're looking at a subset of use cases that we think have commercial value, have a lot of utility, and we're focusing our taxonomy on those and sort of iterating it the way you would iterate software to fine tune it for those use cases. And then, you know, we may build a few parallel graphs while we're sort of working out what the right models are. I think we'd love to end up on a model where we have a single internal graph that we expose views on for different sort of client uses, but we're really trying to take that laser-focused approach that the only way we can know what the right ontology definition is is to actually use it out there for real use cases and see how well it fits and what we might need to change around.
[00:08:33] Unknown:
And given the fact that you are pulling all of this information and extracting these entity representations from various public data sources, I imagine that there's a lot of variability in the quality and consistency of the data that you're using and your ability to populate all of the different attributes of these taxonomies for these entities to be able to expose them. So I'm curious what are some of the processes that you use in constructing the knowledge graph itself and some of the strategies that you use to ensure that you are able to
[00:09:11] Unknown:
achieve a certain sort of critical mass of attributes for any given entity? Yeah. That's a great question. So we think that the value of the knowledge graph is that it provides a way of sort of building up these entities from many component parts. That's really part of the value we think we can offer. You're right that no particular dataset has every attribute that we care about, nor does any particular dataset have necessarily the quality threshold we want in the final data. And, of course, we can talk at length about the problems that the original data might come with because it's public data, in terms of fields that are invalid or mixed data types or all those kinds of issues we also have to deal with. But at the level of constructing the graph, we use machine learning to entity resolve all of these disparate datasets into our common ontology domain. So it's not required that any particular dataset has any particular attribute. Right? We have datasets, for instance, from, say, the SEC and maybe datasets of corporate registrations from each state.
And with those, they'll have different ways of referring to the same company. Let's just say Enigma. They may refer to Enigma by a DUNS number in one dataset, by a text string name in another dataset, and by address in another one. They might have slightly different addresses referring to the same company. Maybe one is a CEO's address and one is a street address for the business front. We can take all of those, put those through our entity resolution algorithm, and come out with an entity which has sort of the summation of all of that. And in fact, in some cases, it has sort of derived properties that are actually better than any one source can provide. Right? So we may be able to take different numbers for a certain attribute of a company from a variety of places and have some business logic that says this number is most reliable in these cases, this number is most reliable in another set of cases.
And that allows us to really construct an entity at the end, you know, within our knowledge graph that is superior to what you could get from any one dataset. And that's really where we think the value lies.
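A toy sketch of the attribute-selection step described above, assuming a hypothetical source-priority table; Enigma's real pipeline does this with machine learning in Spark, so this only illustrates the business-logic idea of preferring the most reliable source per attribute:

```python
# Once records from several sources have been resolved to the same entity,
# pick each attribute from the most trusted source that actually has a value.
# Source names, priorities, and sample values here are purely hypothetical.
SOURCE_PRIORITY = {"purchased_reference": 0, "sec_filings": 1, "state_registrations": 2}

def merge_entity(records: list[dict]) -> dict:
    """Combine resolved records into one entity, preferring reliable sources."""
    entity: dict = {}
    for record in sorted(records, key=lambda r: SOURCE_PRIORITY[r["source"]]):
        for attribute, value in record.items():
            if attribute != "source" and value is not None and attribute not in entity:
                entity[attribute] = value
    return entity

resolved_records = [
    {"source": "state_registrations", "company_name": "ACME HOLDINGS INC",
     "address": "123 Example St", "employee_count": None},
    {"source": "sec_filings", "company_name": "Acme Holdings",
     "address": None, "employee_count": 42},
]
print(merge_entity(resolved_records))
```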
[00:11:24] Unknown:
And so can you give an overview of the architecture that you're using as the data platform and the systems that you're using for being able to collect and store and serve the knowledge graph? Sure. So,
[00:11:40] Unknown:
you know, we think about the linking platform as a holistic project that extends all the way from the moment we acquire the data from a source, which, you know, might be a website or something like that, all the way to the API that serves to the client the results of the graph. So starting at the very beginning of that process, which is the part that my team, the data engineering squad, owns, we have an in-house data platform, or really a data workflow platform, that we call Concourse, which is built on a combination of Airflow, Docker, and a handful of other technologies. And basically, the promise of that platform is that we write workflows as Python scripts, and they then sort of compile to a dockerized image and an Airflow DAG that's able to run that image. So we have a thin layer of custom code that runs as a plugin in Airflow that allows us to actually implement that.
But the sort of TLDR version is, unlike regular Airflow, where the sort of default use case is that something runs as pure Python, and then you have to sort of do something special to make it run a Docker container, ours is exactly the opposite. Everything that runs on Concourse is a Docker container, and that allows us to sort of add an additional layer of abstraction. So all of our workflows have dependency isolation to a large degree. They can even have C dependencies if we need, like, OCR. We can have Tesseract installed in that image. But they benefit from all the traditional value of Airflow. So they get orchestration, scheduling, and all of these other things that Airflow does for us. There's also a variety of other tools, sort of use case specific Python libraries that we use in that step to implement different parts of the process. We have utility libraries that do common sort of ETL or data acquisition tasks, and then we have libraries that implement features that are specific to different projects, such as ingestion for the linking platform. That's the core key sort of piece that we use for data acquisition and ingestion, that Concourse platform. It runs out in AWS, currently runs on EC2, although we're looking at the possibility of migrating that to something like ECS or maybe even one of the serverless platforms.
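A minimal sketch of the general pattern described here: an Airflow DAG whose tasks each run a prebuilt, dependency-isolated Docker image. It uses the stock DockerOperator (from the Docker provider package) rather than Enigma's in-house Concourse plugin, and the image name, registry, and schedule are hypothetical:

```python
# Every task in the DAG runs a baked Docker image rather than inline Python,
# so each workflow keeps its own pinned dependencies while still getting
# Airflow's scheduling and orchestration.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator  # apache-airflow-providers-docker

with DAG(
    dag_id="ingest_hypothetical_public_source",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    acquire = DockerOperator(
        task_id="acquire",
        image="registry.example.com/workflows/hypothetical-source:1.0.0",  # hypothetical image
        command="python -m workflow acquire",
    )
    standardize = DockerOperator(
        task_id="standardize",
        image="registry.example.com/workflows/hypothetical-source:1.0.0",
        command="python -m workflow standardize",
    )
    acquire >> standardize
```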
The output of those workflows is either raw files, like CSV or something like that, to be consumed by some downstream process. But more frequently, we have sort of a standard output to Parquet that also writes a variety of metadata, and that's what's actually consumed by the linking platform. So we have sort of a bespoke format we call a linked data package, which is a file-based representation of a graph. So when we take a dataset and we've processed it and it's ready for entity resolution and the sort of machine learning piece, it's sort of encapsulated in this linked data package artifact that we put on S3, in our data lake.
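The linked data package is a bespoke internal format, so the layout below is only a guess at its general shape: node and edge tables written as Parquet alongside a small metadata document in an S3 data lake. The bucket, keys, and metadata fields are hypothetical:

```python
# A hedged sketch of writing a file-based graph artifact to a data lake.
import json
from datetime import datetime, timezone

import boto3
import pandas as pd

def write_linked_data_package(nodes: pd.DataFrame, edges: pd.DataFrame,
                              bucket: str, prefix: str) -> None:
    """Write node/edge tables as Parquet plus a small metadata document."""
    s3 = boto3.client("s3")
    for name, table in [("nodes", nodes), ("edges", edges)]:
        body = table.to_parquet()  # returns bytes when no path is given (pyarrow engine)
        s3.put_object(Bucket=bucket, Key=f"{prefix}/{name}.parquet", Body=body)

    metadata = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "node_count": len(nodes),
        "edge_count": len(edges),
        "ontology_version": "0.1.0",  # hypothetical field
    }
    s3.put_object(Bucket=bucket, Key=f"{prefix}/metadata.json",
                  Body=json.dumps(metadata).encode("utf-8"))
```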
The next piece of that is sort of owned by a different team; we divide this into sort of three subteams. That squad owns the machine learning, which includes enrichment, feature generation, entity resolution, and all of that part of the pipeline is implemented in Spark. We have an auto-scaling Spark cluster out on AWS, and they implement those processes and can run these very large machine learning jobs to take all of our individual dataset graph fragments and resolve them into a single knowledge graph, which is sort of the principal artifact of the system, right, this singular knowledge graph that contains the resolved entities and relationships.
They then hand that off, sort of an interesting footnote, also as an LDP. Because these linked data packages are simply representations of a graph, we reuse that model as the contract for the third stage. We hand off the resolved graph to the third team, which handles sort of the hosting of the graph and then the delivery via the API. The current model is we're loading that graph into AWS Neptune, which is a hosted graph database solution. And then we have a query layer out in Amazon that's built over the top of that, which really slims down the specific type of queries, not because we're worried about performance or anything like that, but because we're really trying to serve very particular use cases. So we have these very targeted queries that clients can use to get exactly what they need out of the graph, and probably will end up exposing many endpoints in the future for different use cases. So that's sort of the endpoint. All parts of that, except for the middle piece, are basically all implemented in Python. And then that central component, being Spark, is implemented in Scala for performance. And that's kind of the architecture as it exists right now. And can you give an idea of some of the different types of data sources that you're pulling from and some of the processes that you go through to
[00:17:05] Unknown:
vet those data sources before you start implementing them in production?
[00:17:10] Unknown:
Sure. So we ingest a really wide variety of data sources. They are primarily public data sources, which means they can be anything from a website scrape to a CSV we download off an FTP. It could be some old legacy binary format. It could really be a lot of things. In some cases, we're ingesting many years back. Formats may change. So in some cases, there can be a fair amount of nuance to how we gather all of the data we want. In some cases, we're also doing things like rolling ingestion. We don't do a lot of that at the moment, but I anticipate there will be more of that. It really kind of runs the gamut of data acquisition techniques that we apply.
In terms of the sort of up-front investigation that we do, the kind of investigation we're doing right now is primarily sort of market value focused. Right? We're going out and trying to find what public data sets will serve the use case best. And when we don't find them, we do buy data as well. So we have a handful of sort of critical data sets in the graph that are data sets that we've purchased and resolved in. But the key point there is that we're really focused on getting data to fill certain pieces of our ontology. You know, if there's a particular attribute we have very poor coverage in, we can go identify a dataset that has that. Once we actually get to the point of acquiring that data, we do sort of ontology-driven validation of what we actually acquire. So at the simplest, that's: are the columns that we've ontology-mapped actually there, the columns we expect to be there? But it can also be, are they the type we expect? Do they have the fill rate we expect? Things like that.
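A simple sketch of the kind of ontology-driven checks described here, presence of the mapped columns, expected types, and fill rate, with hypothetical expectations:

```python
# Validate an acquired dataset against ontology-driven expectations before it
# is allowed to enter downstream graph construction. Expectations are invented.
import pandas as pd

EXPECTATIONS = {
    "company_name": {"dtype": "object", "min_fill_rate": 0.99},
    "state": {"dtype": "object", "min_fill_rate": 0.90},
}

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable validation failures (empty if clean)."""
    problems = []
    for column, rules in EXPECTATIONS.items():
        if column not in df.columns:
            problems.append(f"missing ontology-mapped column: {column}")
            continue
        if df[column].dtype.name != rules["dtype"]:
            problems.append(f"{column}: expected {rules['dtype']}, got {df[column].dtype}")
        fill_rate = df[column].notna().mean()
        if fill_rate < rules["min_fill_rate"]:
            problems.append(f"{column}: fill rate {fill_rate:.2%} below "
                            f"{rules['min_fill_rate']:.0%}")
    return problems
```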
And that's a system that really is in its infancy, and there's a lot of opportunity to improve that, to be able to apply those kind of quality checks in a standardized way across all the data we ingest. And it's something that I think we're gonna be doing a lot more of in the near future.
[00:19:06] Unknown:
And in terms of being able to consume all of these various data sources and process them in a timely fashion, I'm curious what you found to be some of the most challenging or unexpected aspects of being able to build the underlying infrastructure necessary to create and process these graph attributes and these graph entities and some of the software life cycle workflows that you've built in to be able to create and manage the ETL code necessary for ingesting all of these various sources?
[00:19:44] Unknown:
Sure. So I think that the most fundamental challenge that the data engineering team at Enigma has is figuring out how to scale out the bespoke part of the work that we do. Right? So every data source that we acquire is different, and it's a problem that in traditional ETL, you generally don't have. The number of data sets that you ingest tends to be relatively small, and you tend to be ingesting them for the purposes of analytical workflows, like, for instance, click tracking or something like that, where the structure is fairly rigorous, you control the pipeline, if not end to end, at least the majority of it. We don't have that. We have data sets that are very heterogeneous.
They can be different in everything from format to quality, to complexity, to size. So the hardest problem that I think we have is figuring out how to do the sort of traditional software engineering work of writing well-abstracted code, while also allowing ourselves sort of the ultimate flexibility of recognizing that really, at the end of the day, only code can account for the level of variety we see. You know, there's a long history of trying to tackle ETL with configuration or with WYSIWYG solutions. Those are sort of always unsatisfying, and they're especially ill-equipped for the sheer variety of kinds of data that we're ingesting.
Really, we need code to do that. So we've built, you know, Concourse is sort of the key piece of that because it allows us these dockerized workflows. We can isolate their dependencies and we can have them pinned to a particular version of the library, and they will run forever, or they should. Right? Once that Docker image is baked, that artifact should be able to run that code in perpetuity unless the source changes. So that's sort of one piece of firming up that contract, but then also, as we've been writing these things, we do discover cases where we need to share code. In the case of the linking platform, there's parts of that process that we wanna iterate independently of any particular workflow. We may wanna change how the ontology is consumed or how certain kinds of validation are applied or the exact output format for the linked data package.
So we've sort of got a mixed model now where we take the code for the workflow and we encapsulate that really tightly, but then there's pieces that we sort of attach, and those can change at run time. Those can be independent of any particular workflow or the Docker image that's created from that workflow. But I'm not gonna say we've got the balance perfect yet. It's something that we're constantly iterating, trying to figure out how you construct ETL processes. You know, let's just pick a number out of the clouds here. A thousand ETL processes, all of which are different, but without creating, you know, a thousand times the technical debt. And that, I sort of think, is the key problem that my team is trying to figure out.
And I think we're making good headway on it. So in order to make that process work, we really have very extensive build tooling around our workflows. Because our workflows are code, because they're not implemented in some database system or something like that, our workflows are all in source control, and that process of building the Docker images and building the DAGs to run in Airflow, that all happens in CI. And really, at this point, a very large portion of our infrastructure lives in that CI system that's in charge of running the tests for those workflows, building out those images, pushing them to the appropriate environments, you know, dev to stage to prod, and ensuring that versions are iterated correctly, that the images can be built appropriately. All of those kinds of things are sort of part of that software life cycle. And one thing that I think we try to keep front of mind at Enigma is that ETL is software. There's a lot of baggage around ETL. I think in a lot of companies, it sort of gets relegated to sort of third-tier engineering status. But at Enigma, ETL is right at the heart of the problems we're trying to solve. So we really treat everything around how we acquire data as software that is worthy of testing and tooling and automation and good quality code, all of the things that you bring to platform architecture or something like that.
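A rough sketch of the CI steps described above, run the workflow's tests, bake its Docker image, and push a version-tagged artifact for a given environment. The registry, naming scheme, and promotion rules are hypothetical, and Enigma's actual CI configuration is not public:

```python
# Test, build, and publish one workflow image for one environment.
import subprocess

def build_and_push(workflow: str, version: str, environment: str) -> None:
    """Fail fast on test failures, then bake and publish an immutable image."""
    image = f"registry.example.com/{environment}/workflows/{workflow}:{version}"

    # Fail the pipeline if the workflow's unit tests fail.
    subprocess.run(["pytest", f"workflows/{workflow}/tests"], check=True)

    # Bake the workflow and its pinned dependencies into an immutable image.
    subprocess.run(["docker", "build", "-t", image, f"workflows/{workflow}"], check=True)

    # Publish so the Airflow deployment for that environment can run it.
    subprocess.run(["docker", "push", image], check=True)

if __name__ == "__main__":
    build_and_push("hypothetical-source", version="1.0.0", environment="stage")
```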
[00:24:18] Unknown:
And one of the long-standing points of confusion or uncertainty that has come up in a number of the conversations I've had is in terms of how you create and structure the unit or integration or acceptance tests around ETL code and overall pipeline code, largely because of the volumes and varieties of data that you're dealing with, and particularly in the case of dealing with unbounded streams of data, but also in the case that you're dealing with, where you have such a large variety of data. So I'm wondering what types of tests you're creating and some of the litmus tests that you're using to ensure that the data that you're processing in production is able to meet the quality checks that you are building in during the early stages of creating those processing steps?
[00:25:14] Unknown:
Yeah. So there are sort of two answers to this question. On the linking platform side, in terms of the data that's actually delivered for entity resolution and for our knowledge graph, we take the approach that nothing should reach that part of the process that could possibly cause it to fail. So we try to push all validation of the data as early in the process as possible, and that means at the time of acquisition. So when we acquire a dataset, the last thing that we do is take the ontology and apply all these validation rules to ensure that when it enters the knowledge graph construction process, feature generation, enrichment, and entity resolution, that that process will not fail because of something that's wrong with the data. This has been a huge sort of pain point for us because that's the most complicated and longest running piece of the pipeline.
It really can't fail because one of a hundred input datasets has an integer where a string is expected. So we try to provide, like, really rigorous validation on the data that we output from ingestion, but that doesn't solve the problem of how we maintain and test individual workflows, so the individual ETL components of the process. And that's an area where I think we're iterating a lot right now. I mean, one thing I already mentioned is we really do treat those ETL processes as software in its own right, and that means we do unit testing on our workflows. Right? If there is a piece of the workflow that, I don't know, say generates URLs based on some set of inputs, we test that. We treat that as a unit of code that's worthy of a test. And if there is, for instance, let's say, a complicated XML data structure that is an input from the source, we'll take a subset of that and write a sort of soup-to-nuts test using that as an input that runs it through our process and validates that, at least for a fragment of valid source data, our workflow continues to work. Now what we don't do a good job at yet, but we're actively looking at, is how do we better handle the changes in the source, which are totally unpredictable, but which we really wanna catch immediately, before we apply any ETL code at all. So we're looking at ways of caching the structure of the data that we received last time we requested it, and then sort of looking at the delta in the data structure from when we got it last time to what we're seeing now, so that we can fail right at the beginning of the process and say, okay, this source that's out there on the Internet, they just uploaded a different schema or a different kind of file. It is no longer what we thought it was. Really, the only recourse for us is to fail out as early as possible and get that in the hands of an engineer who can inspect it and figure out how we can adapt to it. So rather than trying to build ultra-durable pipelines, which is not really possible because these public data sources change all the time, we're trying to build in really good error handling and failure cases that get that in the hands of somebody who can address the issue as quickly as possible.
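A hedged sketch of the early-failure idea described here: fingerprint the structure a source returned last time, compare it to what was just fetched, and stop before any ETL runs if the two differ. The cache location and fingerprint format are hypothetical:

```python
# Compare the structure of a freshly fetched source against a cached
# fingerprint and raise before any transformation code runs if it changed.
import json
from pathlib import Path

import pandas as pd

CACHE_DIR = Path("schema_cache")  # hypothetical location

def schema_fingerprint(df: pd.DataFrame) -> dict:
    """Capture column names and dtypes as a comparable structure."""
    return {column: str(dtype) for column, dtype in df.dtypes.items()}

def fail_if_schema_changed(source_name: str, df: pd.DataFrame) -> None:
    """Raise immediately if the source's structure no longer matches the cache."""
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / f"{source_name}.json"
    current = schema_fingerprint(df)
    if cache_file.exists():
        previous = json.loads(cache_file.read_text())
        if previous != current:
            raise RuntimeError(
                f"{source_name}: source structure changed from {previous} to {current}; "
                "failing early so an engineer can adapt the workflow."
            )
    cache_file.write_text(json.dumps(current, indent=2))
```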
[00:28:43] Unknown:
And when you're extracting all of these data sources and then building up the knowledge graph, is the graph itself something that can be easily updated incrementally, or do you have to do either like a full recompile of the node structures or recompiling large subsets of the structure?
[00:29:04] Unknown:
Yeah. So that's an area that we're really actively looking at. Right now, for sort of our first go-to-market, we are rebuilding the graph. Our iterations on our source data are not so frequent that we really need to be constantly revising the graph. Right? Most of our use cases are not real time. People are consulting the knowledge graph for information about a company, about a place. And that information generally is not something which needs to be updated on a daily or hourly basis. So we are able right now to regenerate the graph on demand. It's a fast enough process that we can do it fairly frequently.
But we also know that as the size of the graph scales, there is gonna be a threshold at which we need to do iterative updates, and we've got some, like, pretty good ideas about how we'll be able to do that. It just hasn't been a priority for the current cycle. I expect that, you know, the scale of the graph that we're building, which is already, I think, fairly large, is gonna increase by multiple orders of magnitude in the next year or two. And so we really will have to tackle that problem at some point. It just hasn't been a priority thus far. And in terms of being able to build and scale that infrastructure,
[00:30:26] Unknown:
what are some of the challenges that you're facing currently and that you anticipate coming up in the near future?
[00:30:33] Unknown:
Yeah. So this is not quite the answer you're expecting, I think, but the biggest challenge is the technical debt we acquire, either knowingly or unwittingly. So that process of scaling out the number of datasets is something that we're trying to approach very methodically, and we're trying to sort of constantly iterate on the process itself to ensure that we're learning and figuring out what the right shape of all those processes are. Aside from that, we have sort of all the traditional scaling problems that come with building a platform like this. We have to figure out how to regenerate that graph in a performant fashion even as the scale of it increases significantly. We have to figure out how to get that refreshed graph loaded up into Neptune in an efficient and time-sensitive manner.
We need to scale out horizontally our data acquisition and standardization processes a lot more than we already have. You know, we're not ingesting so many datasets now that we couldn't make do with just a fixed number of workers, but we want to be able to do things like, for instance, click a button and revalidate and apply a fresh ontology to every dataset that's part of the graph. Right? We'd like to be able to run that instantly, well, not instantly, but immediately, over every dataset that we use as an element of the graph. And that means probably moving to some kind of a serverless architecture.
At the very least, it's gonna be moving to a containerized architecture that's more flexible than the one we already have. So we're gonna be solving that kind of scaling problem too, the sort of traditional infrastructure problems. And then looking out a little further at the kinds of problems that we're gonna be solving, we've got some really interesting challenges around things like temporality in the graph, how do we encode time. We have sort of multiple different kinds of time that we care about in the context of this data, and that's a challenge that I'm sort of especially keen
[00:32:47] Unknown:
to tackle at some point in the next year or so. Yeah. I was actually just wondering about that aspect of versioning the data or being able to traverse the historical attributes of a given entity, particularly for the case of things like companies or maybe locations with some sort of historical significance, so that you can see maybe some of the different uses that it has undergone over the course of time and being able to explore that in some fashion for people who are consuming that information and using it to enrich their own analysis.
[00:33:22] Unknown:
Absolutely. It's definitely something that's on our roadmap and something that I'm particularly excited about. I think the temporality question is interesting because you do kind of have to architect the entire platform for it. The temporality of some attribute can, in different cases, be a function of what year a dataset is about, what day you acquired a dataset on, what time period a particular row is applicable to. Some of the attributes sort of decay. Right? Like, they cease to be accurate after some period of time, but others might be durable forever, or they might have fixed periods of duration that could vary from attribute to attribute. So temporality is a really challenging thing to address within the context of the knowledge graph, but I think that our sort of holistic approach to this and our way of thinking about the knowledge graph, I think it is a solvable problem. And I think that there's a lot of appetite in the market for us to figure that out and do it really well.
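A speculative sketch of how temporal context might be attached to a single attribute value, capturing the different kinds of time Chris mentions (what period a fact is about, when it was observed, and when it decays). The field names are invented, not Enigma's model:

```python
# One attribute value with the several kinds of time that may apply to it.
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

@dataclass
class TemporalAttribute:
    name: str                       # e.g. "registered_address" (hypothetical)
    value: str
    valid_from: Optional[date]      # period the fact is about, if stated
    valid_to: Optional[date]
    observed_at: datetime           # when the source dataset was acquired
    expires_at: Optional[datetime]  # when the value should be treated as decayed

    def is_current(self, as_of: datetime) -> bool:
        """A value is current if it has not decayed as of the given time."""
        return self.expires_at is None or as_of < self.expires_at
```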
[00:34:28] Unknown:
And in terms of the actual data infrastructure and the environments that you're using for processing the data ingest and ETL logic, I'm wondering if you are actually using some of the traditional software approach of having a production and a preproduction environment for being able to do some of that testing and validation logic, and some of the challenges that you've had to overcome if you do, in fact, have that capacity built in? Absolutely, we do. All of our workflows
[00:35:02] Unknown:
can run first in a staging environment and then in production. Eventually, we probably will have three environments, because we will want a truly parallel production environment where we can test system changes, in addition to having an environment to test workflow changes. You know, we have a pretty traditional software engineering process around our workflows. The workflows, you know, they flow through a process on our Jira board, which involves testing and code review and sort of all the traditional checks and balances of software engineering. Nothing gets into production that hasn't run end to end in staging. I don't think we've encountered a lot of challenges that are specific to that, with the exception that keeping sort of artifacts in sync and building proper promotion policies in CI and all of those things are just complicated. And I think they're especially complicated in our system, given that we have this idea of compiling things to Docker images and DAGs, and we need to make sure that we're sourcing the correct version for each environment.
We need to make sure that all of the libraries that individual workflows can depend on, or that the system itself can depend on, are appropriately versioned across environments, and we can ensure that the right version is going where it needs to be. Those Airflow systems are also being live-deployed with DAG changes as we iterate on workflows. So we have to make sure that they have the right versions of everything to align with each deployment of the DAGs. So it's a complicated system, but I would say it's not been a special pain point for us. It's something that we will probably have to think a lot more about again as this number of workflows continues to increase.
The level of nuance we need in that build tooling is gonna continue to increase. There will come a point at which we can't keep all of those workflows in a single repo. Other problems like that are on the horizon,
[00:37:00] Unknown:
but we're not quite there yet. And I think you also have the benefit of the fact that you don't have to try and replicate data from production to these pre-prod environments to be able to run some of these validations, because of the fact that you're rebuilding the resultant data during each run. So you don't have that issue of the data gravity between the environments to contend with, or trying to figure out some sampling subset of that production data to be able to use in validating earlier in the stages, unless you're trying to do a direct comparison between the outputs of your staging environment and what's currently in production.
[00:37:41] Unknown:
That's right. I mean, you know, S3 is a wonderful thing, and it allows us to just sort of dump all of these outputs out there and keep them forever. So we can easily go in and compare staging and production outputs, compare outputs across runs. You know, our machine learning team, our knowledge graph team, can consume the latest outputs. If one of those turns out to be invalid for any reason, they can easily roll back and consume an earlier version. We can vary the versions of the ontologies that we're ingesting with at any time. If we ingest with a version of an ontology and then we say, oh, wait, that's not gonna build the graph that we want, we can roll back and rerun that process with a different version of the ontology. So we have a fair amount of flexibility in how we wire these things together, and that's very much by design. It's reflected in the structure of the teams that work on this.
We have as, you know, clear a contract between those teams as possible, and that allows them to iterate very independently. And this is just one piece of that. And looking at your existing technical infrastructure
[00:38:45] Unknown:
and data platform, what are some of the weak spots that you are thinking about and that worry you when you start to consider the changes that you wanna have in place for the future? And what sorts of projects do you have planned to address some of those issues that you've identified?
[00:39:04] Unknown:
Yeah, sure. So, I mean, I've already touched on some of these things: making sure that we're not creating mountains of technical debt with our workflow processes is constantly on my mind. You know, the scaling solutions that we are gonna need to do that sort of, like, one-click run-everything kind of model really are pushing us to move quickly towards some sort of serverless architecture. So we're looking at, you know, the world of tools that we have available is changing so rapidly, we almost can't keep up. But there's things like AWS Batch out there now, which provide very similar functionality to what we have built on top of Airflow. So that's one thing that we're looking at. Getting that scaling equation right and being out in front of demand is gonna be really critical for us. So I think we're thinking about that. I maybe wouldn't quite say worrying yet, but it's something we have to figure out. And then in terms of, you know, the work of the other squads, I do think about graph regeneration and, again, sort of scaling ahead of that scaling curve and ensuring that the technology is there and ready when the commercial team gives us the next, you know, 100,000,000 rows of data and says this is what's next. We always have to be out ahead of that. And I think so far we've done a pretty good job, but it's, like, a continuous battle. And do you think that if you were greenfielding this entire project today
[00:40:29] Unknown:
that you would end up in some of the same spaces that you are right now? Or are there any major architectural decisions that you would make differently without the weight of legacy?
[00:40:42] Unknown:
Yeah. I mean, I don't think there's too many decisions we would make differently. I think that there are tools that have come out since we started development, things like Batch and other hosted services, that we would seriously consider building on instead of the system that we have in house, but really to save us operational overhead, not because we think they're necessarily superior software solutions to what we have. The other thing that I think we might have done a little differently is, or this is just my opinion, I'm actually not sure, I'd be interested to know if my colleagues would agree with me, but we don't have great system-wide orchestration of the process yet.
And I think that perhaps we could have built more of that upfront, or at least built the expectation of it upfront. So, for example, automatically rerunning the ingestion process. We can sort of build that as a feature, but I think the system probably would have benefited from a little more thinking about how the disparate components could be orchestrated together, so that we could sort of run the process, in an idealistic model, with one click. None of that is really hurting us that much right now, but I do think that, you know, when we talk about greenfields, I'm sort of inspired to think about really optimizing it. And those are the areas where I think we have pain right now. And in terms of the customers that you have and the types of projects that they're
[00:42:15] Unknown:
building on top of the data resources and infrastructure resources that you've created, I'm curious, what are some of the typical use cases that you've seen and maybe some of the ones that stand out as being particularly interesting or unexpected?
[00:42:32] Unknown:
Yeah. So, you know, this product line that we're working on now, we're really just getting ready to go to market with. But as I mentioned, we've built a lot of previous prototypes that look very similar and just weren't built on this particular technology stack. And those tend to be in spaces where you might expect the application of public data is really useful. So things like anti-money laundering. We've done a lot of work with banks trying to catch money launderers, which is a problem where the application of a large amount of public data is sort of an obvious choice. And especially if you can take that public data and integrate it with the data they have in house, either literally in a knowledge graph or, at the very least, in a graph-style sort of way of connecting it together, you end up with much more than the sum of its parts. So that's the kind of space that we've done a lot of work in. Other things we've done that I think are exciting, you know, we had a product called pharmacovigilance, which did that sort of same thing, but using adverse drug event data sets.
When people take a drug and get sick, that information can be reported at the local level, the state level, the federal level. The systems do not share a common schema. In some cases, there can be duplicates across those systems. So we've built tooling, working with pharmaceutical companies, to try to deduplicate those and generate a better set of records around those adverse events. I think that's a really interesting application and the kind of application that I think we're gonna see a lot more of going forward. You know, right now, we're still figuring out what the sort of optimal market case is. We know that there are many different areas in which we can apply this tooling, and we're trying to figure out, okay, which ones do we go after first? I think we have some pretty good hunches. There's a lot of interesting opportunities in areas like insurance, and, of course, in banking.
Any place where a company needs authoritative records on things that are in the public domain, like companies or places. I will say, you know, going back to sort of my history as a journalist, I think one of the most exciting applications of this technology is the ability to use ontology design to quickly assemble national or global datasets from disparate sources. So, you know, you look at an example of something like elections reporting, which can vary by county or state, and building a national dataset is a nontrivial problem that people have spent many years on. And I think that using ontology to bridge the gap between local representations of that information and sort of compile it into a de facto knowledge graph, I think that kind of model presents huge opportunities, both for building datasets that have value to our customers and also for building datasets that have value to the public domain. Enigma has a long history of giving the data we collect back to the public domain, and I don't think that stops with the knowledge graph. So I am personally very excited about the places where we can apply this technology to things which have real value to citizens and individuals as well.
[00:45:49] Unknown:
And as you mentioned at the beginning, you've recently secured a new round of funding. So I'm curious what types of new projects or business growth or feature additions you have in store for the future of Enigma and some of the ways that you're planning to grow or improve going into the future?
[00:46:11] Unknown:
Absolutely. Well, it's a super exciting time for us. We raised this round of funding, and our investors have really given us a vote of confidence in this vision we see for our knowledge graph technology. And we're doing more of everything, more proof of concepts with clients around the knowledge graph, more engineering dedicated to this technology, but scaling out really all parts of the organization. One big investment we're in the process of making is building out a team dedicated to doing the bespoke acquisition part. So we know that we wanna acquire a lot more data than we are today, so we are actively hiring for a lead for our data acquisitions team, who will sort of be responsible for scaling out that human process of going and getting all of those datasets.
And the data engineering team will sort of retain responsibility for the technical part of the process, for the tooling, the pipeline, etcetera. That's a team that we foresee hitting double digits fairly quickly. And in fact, we're gonna open a second office. So that lead that we're looking to hire right now will also be responsible for Enigma's first expansion office. So, really, expansion across the board. We've got a couple, like, very significant contracts with partners that are coming on board with the technology early or with other similar or related technologies that we've built at Enigma.
So there's just a tremendous amount of growth, and I'm excited about all of those things. All of those other problems I've mentioned, better tooling for acquisition, temporality, all of those things are things we're enabled to tackle because of that funding.
[00:47:55] Unknown:
And are there any other aspects of the work that you're doing at Enigma, knowledge graphs, public data, uses of the resources that you're building that we didn't cover yet that you think we should discuss before we close out the show? So I think the other thing,
[00:48:11] Unknown:
sort of going a little deeper into one thing we already talked about, you know, the problems of acquisition. You know, I talked about sort of the software engineering problem of how you abstract all those pieces, how you test it. But I think there's also a more fundamental problem that we're really thinking a lot about, which is what's the model for writing that kind of code? So the ETL processes we write, they exist as sort of an uncomfortable middle ground between something you want Visual Studio for and something you wanna do in a Jupyter Notebook. You know, there's sort of these two evolving models of writing data processing code. I see those sort of converging for certain use cases, and I think our use case is probably one where there is some intermediate that is better than either of the options we have right now. Jupyter Notebooks don't really work for us because it's very hard to write well-modularized code. It's hard to build the kind of abstractions that we want in a Jupyter notebook where we sort of have blocks of code that run as independent tasks, you know, like separate nodes in the DAG, if you will. But at the same time, the sort of traditional software engineering tools are also a real pain for us because our processes largely are procedural, and you wanna be able to step through them one at a time, view intermediate states, verify that your transformation did what you intended. In a lot of ways, that authoring process is more efficient for us than, you know, having to drop into the debugger and then restart something or whatever it may be. So I'm really excited, and one thing we sort of have on our moonshots list for the winter is to start looking into things like Papermill, which is a system from Netflix for automating notebooks, and other systems like it. But we'd like to see that go a step further. We'd like to think about generating notebooks, or maybe it's degenerating notebooks into another format. But there's lots of interesting things, I think, which could sort of serve our middle ground use case where we want that procedural authoring model, but we also want the flexibility to run these things and organize them the way we would organize more traditional software.
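A small sketch of the Papermill idea mentioned here: executing a parameterized notebook headlessly and keeping the executed copy as an artifact. The notebook paths and parameters are hypothetical:

```python
# Execute a parameterized notebook without opening it interactively and keep
# the executed output notebook for later review.
import papermill as pm

pm.execute_notebook(
    "workflows/hypothetical_source/transform.ipynb",   # notebook with a "parameters" cell
    "artifacts/transform_2018-09-01.ipynb",            # executed copy, kept as an artifact
    parameters={"snapshot_date": "2018-09-01", "environment": "stage"},
)
```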
[00:50:23] Unknown:
And as you're talking about that, there are two projects that come to mind that might actually fit your use case at least partially. So when you mentioned decompiling the notebooks, there's a project that came out recently called Jupytext that might be helpful. But the other project that seems like it would actually be a more direct fit is something called Cauldron notebooks, which was written specifically for being able to use a lot of the traditional software engineering principles of modularity and executability, and using them in version control, but still having some of the notebook interface. So that might be worth looking at further.
[00:51:02] Unknown:
So I'll add links to all those in the show notes as well. Great. Yeah, I'd love to look at both of those. I mean, I think this is something that I feel like there's a mind share in the data engineering community swirling around these ideas, and these might be what we're looking for. Nothing I've seen so far quite hits the nail on the head, but I am confident there is something
[00:51:23] Unknown:
that can be built that will serve us better than the tools we have today. Alright. Well, for anybody who wants to follow the work that you're up to or get in touch about any of the things we've talked about today, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you view as being the biggest gap in the tooling or technology that's available for data management today. So I do think for our use case, the biggest gap probably is in the authoring tools.
[00:51:54] Unknown:
If there's another place where we need to improve a lot, it's in the observability of our processes. I'm not sure it's a question of specific tooling. Maybe it's tooling we actually have to build for ourselves. But, you know, we have this very elaborate process where data flows through our system, and the provenance tracking within that is somewhat limited and is really something we're gonna have to address. And I haven't seen a system out there that's gonna work perfectly for us. It's probably something we're gonna be working on in the coming year. Well, thank you very much for taking the time today to talk about the work that you're doing at Enigma
[00:52:32] Unknown:
and some of the issues that you're dealing with in your data engineering organization. It's definitely been very interesting and enlightening. So thank you for that, and I hope you enjoy the rest of your night. Great. Thanks, Tobias. You too.
Introduction to Chris Groskopf and Enigma
Enigma's Mission and Knowledge Graphs
Challenges in Taxonomy and Data Quality
Data Platform Architecture
ETL Challenges and Solutions
Updating and Scaling the Knowledge Graph
Production and Preproduction Environments
Future Projects and Scaling Challenges
Customer Use Cases and Applications
ETL Process Models and Tooling
Biggest Gaps in Data Management Tooling