Summary
A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!
- Your host is Tobias Macey and today I'm interviewing Brian Platz about using JSON-LD for building linked-data products
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the term "linked data product" means and some examples of when you might build one?
- What is the overlap between knowledge graphs and "linked data products"?
- What is JSON-LD?
- What are the domains in which it is typically used?
- How does it assist in developing linked data products?
- What are the characteristics that distinguish a knowledge graph from a linked data product?
- What are the layers/stages of applications and data that can/should incorporate JSON-LD as the representation for records and events?
- What is the level of native support/compatibility that you see for JSON-LD in data systems?
- What are the modeling exercises that are necessary to ensure useful and appropriate linkages of different records within and between products and organizations?
- Can you describe the workflow for building autonomous linkages across data assets that are modelled as JSON-LD?
- What are the most interesting, innovative, or unexpected ways that you have seen JSON-LD used for data workflows?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on linked data products?
- When is JSON-LD the wrong choice?
- What are the future directions that you would like to see for JSON-LD and linked data in the data ecosystem?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Fluree
- JSON-LD
- Knowledge Graph
- Adjacency List
- RDF == Resource Description Framework
- Semantic Web
- Open Graph
- Schema.org
- RDF Triple
- IDMP == Identification of Medicinal Products
- FIBO == Financial Industry Business Ontology
- OWL Standard
- NP-Hard
- Forward-Chaining Rules
- SHACL == Shapes Constraint Language
- Zero Knowledge Cryptography
- Turtle Serialization
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Hex: ![Hex Tech Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zBEUGheK.png) Hex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable enriched data to every downstream team. You specify the customer traits, then profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades old batch computation model for an efficient incremental engine to get complex queries that are always up to date.
With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey. And today, I'm interviewing Brian Platz about using JSON-LD for building linked data products. So, Brian, can you start by introducing yourself? Yeah. Thanks, Tobias.
[00:01:32] Unknown:
Happy to be here. Brian Platz, as you said. I am the CEO and cofounder of Fluree. I've been building enterprise software my whole career. Fluree, really, for the past 4 ish years. It's an open source graph database product. And, yeah, happy to talk to you about JSON-LD, which has a strong intersection with what we do. Absolutely.
[00:01:55] Unknown:
And do you remember how you first got started working in data?
[00:01:58] Unknown:
I have been building software companies really my whole career, so I'm on, I think, number 7 at this point. And as those companies mature, you end up typically acquiring adjacent products and trying to integrate them. And that's really where, you know, data always ended up becoming the obstacle for me. You know, there's the marketing message of having a product suite and, of course, the idea of integrating apps and thinking about, you know, challenges like UI and experience and seamlessness across various applications.
That always seems like it'll be a challenge, but the challenge is always in the data, mostly because the data doesn't speak the same language. So that's what's really led me down the path of becoming very focused on trying to solve that problem.
[00:02:49] Unknown:
And in terms of the conversation today, I mentioned the idea of a linked data product, and I'm wondering if we can just start by unpacking that term and what it means and some examples of when you might want to build a linked data product.
[00:03:04] Unknown:
Yeah. So I think everyone's probably heard the buzz around data products by now, so we don't have to spend too much time on that part of it. I'm excited about data products because I have been seeing, and think, an inevitable transition happening here, hopefully in the shorter term rather than the longer term, where instead of people going out and just buying apps where data is an artifact of an app, we actually flip that completely around, where people focus on data first and foremost, and apps can kinda come and go, but you're managing a very high integrity, trusted set of data in the middle that doesn't have to change. And I think, of course, you know, machine learning and AI has been a big driver for that happening, and the fact that we now have data products, I think, is a great example of that.
The linked part of this, linked data product, linked data really implies, I think, to most people who are in the space, kinda two things. One, it implies graph as opposed to relational or rectangular kinda shaped data. We're definitely focusing on graph data here. And then what it invokes for most people as well is a set of standards put out by the World Wide Web Consortium around linked data, really in an effort to create kind of a similar environment to what we have for the web, which uses HTTP kind of fundamentally, instead creating a decentralized data network.
And there's a set of standards that it goes by, and machines would be the, you know, primary consumer of that. There's kind of a third part that the linked data adds in, at least from my perspective and where we have to go with it, and that is various forms of cryptography that enable the data to be verifiable and trusted, with very sophisticated permissions embedded within the data. You know, we call this data defending itself. So I think for this to really reach its potential, we have to bring the cryptographic sort of conversation into linked data. But, yeah, that's kind of what we mean by adding that one word in front of data product.
[00:05:17] Unknown:
And another term that folks might have come across or understand that is at least related to this concept of linked data is the idea of a knowledge graph. And I'm curious if you can explain some of the overlap between those ideas of a knowledge graph and a linked data product and, in particular, explain where they might be divergent from each other.
[00:05:39] Unknown:
Yeah. I think of them... I don't know that there's a formal dictionary definition of these, but I think of them as roughly synonymous. So a knowledge graph is a graph database made up of linked data based on these standards. So this is a knowledge graph from my perspective. I'm sure there are some other people with different definitions. But, yeah, knowledge graphs really form around the foundational concept of RDF, which is a W3C standard. And then there are all kinds of things that build up on top of RDF, and I'm sure we'll probably end up talking about that a little bit more. But when we talk about JSON-LD in particular, I'm excited about it because it does actually use RDF under the hood, but a lot of people don't know that or don't even have to know that. They're just working with JSON. And so I think it ends up making things much more approachable for folks who know JSON but might not be as familiar with concepts like RDF or even graph. It's a nice entry point into it.
So that's what excites me in particular about JSON-LD as it relates to this.
[00:06:49] Unknown:
And now digging into JSON-LD for people who haven't come across it before, the LD suffix stands for linked data. And I'm wondering if you can talk through some of the applications of JSON-LD in this space of RDF, the semantic web, which is where RDF largely found its roots, and some of the domains in which JSON-LD is typically used.
[00:07:12] Unknown:
Yeah. So where we see JSON-LD, or even RDF in general, is where there needs to be interoperability across data. And this really implies a decentralized context, so that, you know, people who don't necessarily have to know or have an agreement about a schema or vocabulary upfront are able to actually merge, combine, and link data together, because it handles these sorts of standards. So the interoperability across domains is the key here. And, of course, I think as most data scientists know, they spend a huge amount of time on their data just trying to get it into a common format. So imagine that work's already done. Right? It's already in a format that a machine can automatically figure out what it needs from, and, of course, that's gonna speak to the interoperability of data challenges. Now those exist, certainly, across boundaries, like organizational and governmental boundaries, but those also exist inside of companies. You know, in a financial services company, finance and risk probably call things by different vocabulary terms than sales and marketing in the same organization, even though it's the same underlying data. So this exists not only across these boundaries, but even inside of organizations.
The JSON serialization, I think, brings familiarity for folks. So I don't think you could find a developer who hasn't worked with JSON at some point, so people are familiar with that concept. And in fact, a simple JSON document can itself just be used as JSON-LD. Now it doesn't bring all the special characteristics into it unless you sprinkle it with a few extra things, but you can literally just take a JSON document, transact it, and use it almost like a document database, but you have full graph indexing and query capability. So where we see it, the specific domains where probably the most adoption has happened, has been the push of Google and Microsoft to embed this inside of web pages.
And now almost 50% of the web has JSON-LD in it. So this is, you know, JSON data embedded within the actual web page, and it's used, no surprise, by machines to figure out what a web page is talking about, etcetera. And in fact, my guess is most people in your audience end up using JSON-LD tens, if not hundreds, of times a day without ever knowing it. If you've posted a tweet, if they're still called tweets, I don't know what they're called anymore, or shared something on LinkedIn, and, you know, a nice description and image of that web page you shared comes up, that is because there's JSON-LD embedded in the web page that the machine is reading, and we have interoperability there across services without, you know, the web page provider and LinkedIn or Twitter or anyone else having to collaborate on that ahead of time. We're seeing it more in regulation, and we're seeing it more in an area that I would call intelligent agents, which is really AI being able to go out and not only discover data and information, but, because of the semantics and the linked nature of it, automatically figure out what that data means. So it's significantly more valuable for AI or other agents to be able to understand information.
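As a rough illustration of that embedding, here is what a consumer of a page might do to pull the JSON-LD back out, sketched in Python with only the standard library; the page snippet and its fields are invented for the example.

```python
import json
import re

# A made-up page fragment: schema.org-style JSON-LD embedded the usual way,
# inside a <script type="application/ld+json"> tag in the page head.
html = """
<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "PodcastEpisode",
  "name": "Using JSON-LD for linked data products",
  "url": "https://example.com/episodes/json-ld"
}
</script>
</head><body>...</body></html>
"""

# Naive extraction: find every ld+json block and parse it. A search engine
# or link unfurler does essentially this, then uses @context and @type to
# decide how to describe or preview the page.
pattern = re.compile(r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL)
for block in pattern.findall(html):
    doc = json.loads(block)
    print(doc["@type"], "-", doc.get("name"))
```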
[00:10:39] Unknown:
And going back to your mention of the usage in Google and a lot of these social media companies of that linked data, if I'm not mistaken, you're talking about the Open Graph specification, where, if you inspect the header of a web page, you'll see an og: prefix to identify what attribute of that Open Graph you're specifying in the page metadata.
[00:11:02] Unknown:
Yeah. Open Graph and, increasingly, schema.org, which is actually the, you know, vocabulary, I'd probably refer to it as, that Google and Microsoft put out. So web pages are described sometimes in both, sometimes in one or the other. And one of the nice things about the space, which we'll probably get to, is that it doesn't live in a world where you think that the whole universe will agree on the same schema or terminology. So there become ways to map schemas together and say this thing over here means the same thing as this thing over here, where you don't have agreement on those. We might get into that a little bit more. But, yeah, definitely Open Graph and schema.org are heavily used and recognized by the major search engines and other services that are leveraging web pages.
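To sketch that after-the-fact mapping, the snippet below links two hypothetical property IRIs with owl:equivalentProperty and lets an OWL-RL reasoner make data written against one vocabulary answer queries written against the other. It assumes the rdflib and owlrl Python packages; the IRIs are purely illustrative and are not a real Open Graph to schema.org mapping.

```python
import owlrl
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL

A = Namespace("http://example.org/vocabA/")  # one publisher's vocabulary
B = Namespace("http://example.org/vocabB/")  # another consumer's vocabulary

g = Graph()
page = URIRef("http://example.org/pages/1")

# The data was published with vocabA:title; nobody agreed on a schema upfront.
g.add((page, A.title, Literal("Linked Data Products")))

# The after-the-fact mapping: the two properties mean the same thing.
g.add((A.title, OWL.equivalentProperty, B.title))

# Materialize the OWL-RL entailments (equivalentProperty is in that profile).
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

# A query written against vocabB now finds the vocabA data too.
for row in g.query("SELECT ?t WHERE { ?s <http://example.org/vocabB/title> ?t }"):
    print(row.t)
```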
[00:11:53] Unknown:
And so for application developers and data practitioners who are thinking along these lines of JSON-LD or linked data, in the overall data life cycle, there are usually a few different stages of the creation and processing of information, where an application is generally responsible for either the capture or creation of that data. And then there are data pipelines and then data analytics that will extract and make use of that information. And so I'd like to maybe talk about some of the ways that JSON-LD and these linked data concepts are present in those different stages of the life cycle. So maybe talking first about the application development and some of the concepts of domain modeling within the application, what are some of the ways that JSON-LD and these schema semantics come into play in that development process?
[00:12:47] Unknown:
Yeah. So there's quite a bit in that question, and I think a good place to start is RDF itself. So to answer the question of where this is used and how you think about it: you know, what is RDF? RDF is triples. Technically, they go by subject, predicate, object, the three elements of it. But if you map this to, like, an Excel spreadsheet, the subject is really like your row number. The predicate is really like your column, and the third part of it is really like the value in that cell. So it's an extraordinarily simple way of representing data. In fact, any piece of conceivable data can be represented as RDF, which makes it a phenomenal way of just storing any information, because it's simple, it can store anything, and there are zero limitations.
So the questions come up on how do you serialize it and what are the best practices for it. And so I'll mention a couple things there, and we'll kinda build up to the final part of your question. So in JSON, when you think about RDF or triples, the second part, the predicate, which is really the property, and the value are just your key value pairs in a JSON object. It's as simple as that. You know, it's first name: Brian in the case of me. But how do you uniquely identify a row? And there's a special term in JSON-LD where the key is @id. So @id is where you can give something a global identifier.
Now it doesn't have to be global. It can be relative. But in being global, it means that it's easier to combine with other data that's using the same global identifiers for things. Again, you can sort of map things together after the fact. So you could say this global ID that someone was using over here is the same as this other global ID that I was using over here, and the systems that manage knowledge graph data will be able to automatically infer that. So when you query for either one, the same result will come out. So kind of this magic happens in the background. So when you create JSON-LD, what you're really trying to do for your keys, like first name, is, instead of just calling it first name, ideally use some sort of standard vocabulary.
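A small illustration of that shape, in plain Python with made-up identifiers: @id supplies the subject, and every other key/value pair in the JSON object reads as a predicate and an object. Real JSON-LD processing also handles @context, nesting, and multiple values; this is just the row, column, cell intuition.

```python
import json

# A minimal JSON-LD-style document: @id names the "row",
# every other key/value pair is a column (predicate) and cell (object).
doc = json.loads("""
{
  "@id": "https://example.com/people/brian",
  "https://schema.org/givenName": "Brian",
  "https://schema.org/jobTitle": "CEO"
}
""")

subject = doc["@id"]
triples = [(subject, predicate, obj)
           for predicate, obj in doc.items()
           if predicate != "@id"]

for s, p, o in triples:
    print(s, p, o)   # one RDF-style statement per line
```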
Again, you don't have to, because you can always address it after the fact, but we already talked about schema.org. Schema.org has a very robust way of representing people and companies and other things. And in fact, it saves you a lot of time as a developer, because there are a lot of nuances when you model out data, and a lot of smart people have already modeled out those concepts for you and thought about the edge cases that you normally only think about or run into in version two or version three of your app. But there's nothing stopping you from doing the traditional thing in application development, where, if you're developing a new app, I'm just gonna kinda create my data however I want. That's still a fine thing to do. You just wanna be able to leverage these common ontologies. And I think that is probably enough for that question, but we can certainly build on it from there. Absolutely. And now, moving into the stage of: I have the state that exists in an application or in some source system that I want to consume and perform analysis on, for the data engineers who are working on that process.
[00:16:10] Unknown:
I'm curious if you can talk to some of the ways that having the RDF specification or the JSON-LD representation already present in the application simplifies their work of doing that extraction, in particular, maintaining the context of the information that they're trying to extract. And for cases where there isn't already that representation, and it's just a tabular structure like you will rip out of any database, or just a JSON API without these linkages embedded in it, some of the ways that the engineers might think about appending the additional contextual data in that JSON-LD format to simplify downstream operations?
[00:16:50] Unknown:
Yeah. So one of the big benefits that you get is that if I have two JSON-LD files, say, or databases, however I have that data, if one file, you know, is for customers and another file or database is for orders, those can exist independently, but they can actually be queried as though they're the same database. So you think about all the effort and work that we put in today to extract our data out and put it into something like a big data warehouse, really with the aim of being able to run queries across it. We're probably doing additional transformations, you know, in that process to normalize schemas or other things. There's a lot of work involved.
And if the data natively sits in this format, all that work goes away. In fact, you don't even have to have the data physically co-resident with each other to query across it as though it's one database. So there are these great advantages to doing that. One thing that is also great about graph is that it's really a superset of almost every other way of managing data. Especially with JSON-LD, it retains a lot of the simplicity of a document database like MongoDB or something like that. It's incredibly powerful from an analytics standpoint, which is usually why people go to relational databases, because they're gonna be writing complex queries and doing analytics on that. This data not only does all of those types of analytics, there's a whole additional set of analytics that becomes easier once your data's in a graph. And because it's a superset of these other capabilities, it means that you can flatten it as well. So I might have JSON-LD data in this native, really rich form, which is, I believe, the richest way you can even represent information.
If I want to flatten that data into rectangles so I can stick it in a SQL database or put it in a CSV file, I can flatten it to relational. You can't go the other way. You can't take relational data and, without a lot of transformation work, turn it into a graph, but a graph can be easily flattened. So it creates immense flexibility. It creates a lot of reusability aspects of the data. And the challenge is that our data usually doesn't originate like this, because almost all apps are using relational databases now or some form of that. So the shift that I believe we'll see, that I think we've started to see and will continue to see, is that instead of favoring relational databases as the system of record, we start using graph as the system of record. Traditionally, graph has been used more for analytics. It's more after the fact. Some system of record is maintaining the data, and then, ultimately, I have some transformations where I'm putting it into a knowledge graph or doing other things so I can do more sophisticated analytics with it. But if we can just sort of move that up where the graph is the operational data store, then I think it eliminates, not eliminates, that's a strong word, but substantially reduces all of the data engineering work that has to go into making it usable for data scientists or machines downstream.
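Here is a hedged sketch of that two-files-one-query idea, assuming rdflib 6+ with its bundled JSON-LD parser; the customer and order documents and their identifiers are invented.

```python
from rdflib import Graph

# Two independently produced JSON-LD documents: one for customers, one for orders.
customers = """
{
  "@context": {"name": "https://schema.org/name"},
  "@id": "https://example.com/customers/42",
  "name": "Acme Corp"
}
"""

orders = """
{
  "@context": {
    "customer": {"@id": "https://schema.org/customer", "@type": "@id"},
    "total": "https://example.com/vocab/orderTotal"
  },
  "@id": "https://example.com/orders/7",
  "customer": "https://example.com/customers/42",
  "total": 125.0
}
"""

# Load both into one graph; because they use shared global identifiers,
# the order-to-customer link resolves with no ETL or warehouse step.
g = Graph()
g.parse(data=customers, format="json-ld")
g.parse(data=orders, format="json-ld")

query = """
SELECT ?order ?name ?total WHERE {
  ?order <https://schema.org/customer> ?cust ;
         <https://example.com/vocab/orderTotal> ?total .
  ?cust  <https://schema.org/name> ?name .
}
"""
for row in g.query(query):
    print(row.order, row.name, row.total)
```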
[00:20:07] Unknown:
This episode is brought to you by Hex. If you're a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex's magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you, all from natural language prompts. It's like having an analytics copilot built right into where you're already doing your work.
Then when you're ready to share, you can use Hex's drag and drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel, and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30 day free trial of the Hex Team plan. Digging more into this concept of the graph representation of the data, I'm also interested in your perspective as to how necessary it is for that to be stored in a graph-native database, versus using this flattened representation that you were mentioning, or using something like an adjacency list in a relational store, for being able to represent that graph structure?
[00:21:30] Unknown:
Yeah. I think that it isn't necessary. No. I mean, we've been combining datasets and merging datasets together for a long time. I would make the argument that it's not scalable. Right? So usually, it's a specific problem that we have to solve that we go about getting the data for. We, you know, create our pipelines. We create transformations. We do whatever we need to do to focus on building out that data. And, you know, we're done. I mean, we'll accomplish the task. But today, we have a pretty limited set of data that we're doing this work with. And part of the reason, there's a lot of reasons, but the one I'll focus on here, is that that process is expensive, and it's not super scalable, which limits how much data you're able to do this work with. If we're all more natively in this format, and machines can automatically infer what the data means, can automatically follow links and merge data together, then it makes all that work we have to do to get to that end state unnecessary, which means we have a more scalable process. I believe it's estimated that about 96% of the data that corporations end up producing from their applications, etcetera, is dark, you know, meaning that it's never used. And, you know, in some cases, we're paying to store it, but we'll never use it. This is a way of actually making data natively much more usable, much more scalable, and being able to leverage it inside of our organization. Think about the challenges and the work we do just to, you know, get our CRM data and our ERP data combined into a single thing that we're running queries against.
Now think about taking parts of that ERP data and sharing it with other partners and external organizations. A whole other layer of that work has to start again, whereas this format eliminates all of those components. It just makes sense to do this, but, of course, it's not what we're used to doing. And I think that's one of our biggest challenges right now.
[00:23:33] Unknown:
Another aspect of what you were just describing with these linked data representations and being able to easily translate between the vocabularies used in your example between the finance department and the sales department is another conversation that's been happening in the data space recently around this idea of the metrics layer or the semantic layer for the data that has traditionally been, encapsulated by the business intelligence layer, but is now being moved into a separate concern so that you can use it in more than just the business intelligence context. And I'm curious how you see these concepts of the metric store versus data linkages and shared vocabularies as either corollaries to or competitors for the same intent?
[00:24:25] Unknown:
Yeah. It's a really good question, and I actually focus on the word you used, which is layers. It's a layer. And so this type of data natively embeds semantics in it, and the semantics are highly reusable and can actually be described by other vocabularies that allow mappings between semantics. To accomplish this with traditional data, or most data that we have today, we have to add a layer, just like we have to add a layer like a data warehouse to query across datasets, just like we have to add a layer for, you know, how do we protect data after we move it into the data warehouse and assign permissions. Our focus, and this is a little bit unique to Fluree, is that we allow that data to be what we call programmable, where we can embed security directly into the data tier, which allows anything that needs access to that data to be able to query for it, but they're only gonna get the subset of the data that they're permitted to see. So using a traditional, siloed relational database that's coming off the back end of an application, all these things you're talking about, adding semantics, adding all of these things, can be done, but we're adding layer on layer on layer to accomplish them. This is a way that we can structure data fundamentally so that those layers become unnecessary.
[00:25:48] Unknown:
Particularly speaking to the world where most people are right now, where they have a data warehouse and they have their data pipelines for extracting from source systems and combining it into a shared query layer, there's that word layer again, I'm wondering if you can talk to some of the data modeling exercises that are necessary to start embedding the contextual awareness into the data, versus it being an after-the-fact exercise that has to be done, or an exercise that has to be part of that extract process, and instead making it native to the data itself rather than being something that is imposed upon it ex post facto.
[00:26:30] Unknown:
Yeah. And I'll focus on how you do this in JSON-LD, just to give a concrete representation, because there are alternatives for doing this for other types of data in different serializations. But in JSON-LD, I already talked about one special keyword, @id, and this can give a global identifier for an item that can allow things to be mapped together. And, again, you can sort of map different @ids that you find out later really talk about the same thing. Another very common one you're gonna see is @type, and this is where you're assigning the data a class. And this type of data natively supports super and subclasses.
So it's also a way of kind of combining data, but it's also a way of adding powerful capability. There are well-established ontologies out there: there's IDMP around, you know, pharmaceutical products, there's FIBO around financial services, and the list goes on and on. But in schema.org, you know, you have things like movies and books, and they're both subclasses of creative works. And in a JSON-LD database, even though I said this is a book or this is a movie, I can basically do select star from creative work, and all the books and all the movies will come out, because it understands the notion of class hierarchy within your data. So it adds very powerful capabilities. You can actually do the same thing with properties.
You know, you might have a field called mother and a field called father, and they're both subproperties of a field called parent. And just natively, you can say select, you know, parents for Tobias, and it'll return both mother and father, for example. So those are two kinda common special keys you'll see in JSON-LD. You do not have to use those, so I wanna be clear about that. Just a plain old JSON object will be able to, in most systems, transact, but those add value to it. How do you do the mapping specifically was your question, and I wanted to lead to the third kinda common special key that you'll see in a JSON-LD file, which is @context, the at symbol plus context. And what @context does is it allows you to map the names of the keys that you're using to a global identifier.
And so it doesn't really matter in your data, in your JSON, what key you use for first name: givenName, fName, gName, just name. You know, you can call it whatever you want. You can use this @context to say: anywhere you see gName, for given name, what I actually mean by it, and I'll use schema.org again, is schema.org/givenName. So you're attaching a global namespace, in this case schema.org, to that field. And now anyone receiving that, when they see gName in your file, and the machines automatically infer this, by the way, and kind of expand it out, but even as a human, as soon as you see gName, you know that it is actually referring to this vocabulary that is published and open and very well vetted, and you have an exact meaning of what that is. You also can learn other things, like a given name always exists on a person. So now I automatically know this data is about a person. And in a knowledge graph database, you could say select star from person, and all of a sudden all the people show up. Even though you never said that data was about a person, the fact that we said it has a gName, and gName stands for a given name, and a given name's always on a person, means we're able to automatically infer additional data. So a lot of things we're writing custom Python scripts or other things to infer can just natively be inferred in this data. But your question was, how do we create these mappings? So in JSON-LD, you use the @context field, and there can be some additional layers on top of that, but that's a very simple way of mapping your raw JSON data into something far more useful in a global vocabulary.
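To make the gName example concrete, here is a sketch using the pyld package's expand operation; the document and the gName key are made up, and the mapping to schema.org/givenName is just the interpretation described above.

```python
from pyld import jsonld

doc = {
    # The @context maps my local, arbitrary key names onto global IRIs.
    "@context": {
        "gName": "https://schema.org/givenName",
        "Person": "https://schema.org/Person",
    },
    "@id": "https://example.com/people/tobias",
    "@type": "Person",
    "gName": "Tobias",
}

# Expansion rewrites every term to its full IRI, which is what another system
# (or a machine) works with: no trace of the private key names remains.
expanded = jsonld.expand(doc)
print(expanded)
# Roughly:
# [{'@id': 'https://example.com/people/tobias',
#   '@type': ['https://schema.org/Person'],
#   'https://schema.org/givenName': [{'@value': 'Tobias'}]}]
```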
[00:30:30] Unknown:
Digging into that hierarchical aspect of the data and being able to build these inferences brings to mind the question of performance and some of the ways to address the lookup cost of all of these different attributes, particularly given the fact that, to your point, I wanna find all of the people, but all I have is the gName to indicate that this is a person, which brings to mind the idea of sparse matrices and some of the performance challenges around sparsity in datasets. And also, to your point of making sure that this gName is a context that I apply to these different records or to the specific field: in the world where most people are operating, that is typically done through the transformation of the data from the raw layer. So, using ELT approaches as an example, you load the raw data, and now I'm going to coalesce all of these different column names into a single column name to ensure that I'm calling this one field the same thing everywhere in downstream use cases. In this linked data approach, instead, I don't care what the column name is. I'm going to apply this contextual attribute to this column, and then I don't have to do those transformations. And so it's an emergent schema rather than an enforced schema. And so I'm interested in talking through some of the ways that these emergent properties of the data will impact the performance characteristics of the system and some of the architectural approaches to mitigate that concern.
[00:32:00] Unknown:
Yeah. Well, I'm glad you asked specifically about architectural approaches, because I think that is the primary mechanism, at least, that we think about to mitigate these. But, yeah, there are a number of very simple ways that kinda layer on these standards to do things like subclassing. That's a pretty simple thing. It doesn't have huge performance implications. There are additional standards that allow you to develop these immensely rich ontologies, and one of the most commonly used ones, for those that get that far, is OWL. And I'll say about these standards, the neat thing is that all of this is just represented as RDF data. So, you know, you think about something like a standard schema in a relational database, where, of course, you set up your schema upfront and you have a special command for establishing your schema. Here, think of your schema itself as just being described as data. But it's not just schema. It's so much more beyond that. So OWL can actually express concepts that are NP-hard problems, meaning that, computationally, and certainly quantum computers might make a big impact on this, it is gonna be immensely difficult for a computer to figure them out. So one of the things they do with OWL is they break it up into several subcategories.
There's kind of an easy category that computers can figure out really quick, sort of a more sophisticated one where it can be figured out entirely with, for example, forward-chaining rules, and then all the way up into, you know, these concepts that really aren't computationally scalable, I guess is what I would say. So one thing is, if you are getting into this area, you're applying these sophisticated rules to this data. And, by the way, one of the reasons to use this over, like, custom Python is that it's completely reusable. Once you develop these rules, not only are they data themselves, you query for the rules, you can sort of update them and do these things, but you can just layer these rules on top of any data that talks about this. And all this inferencing and magic will kind of just automatically happen for it. In any compliant system, it'll all just work. But you have to sort of work within those subsets if you're gonna try and run these in real time, for example. Or maybe I don't need to run them in real time. You know, you'll see examples, for example, in financial services where, you know, I'd have a bunch of things in the ontology that say, okay: if a person has at least two of these five types of accounts, and the combined balance of, you know, all the accounts exceeds $1,000,000, and the account is open, and this and that, that's an ontology rule that can be expressed.
Then they're gonna be a high-value customer. And you can do select star from high-value customers, and all of a sudden your high-value customers show up. But that's a bunch of computation that would normally happen in a Python script or something like that. And if you have an immense volume of data, and this is changing all the time, and you're trying to run it in real time, it's a consideration to have. Do I need to run it in real time? Can I run it on the side once a day? Is that good enough? Do I have so much volume that if I run it in real time, it's compute intensive, and do I scale that back? So you have to decide what you use and don't use, and it depends on your use case. As far as how we can address it with architecture, the thing that we're super excited about is that we have broken up the database into two pieces.
So, traditionally, a database is a server you put up, and it handles all your updates and it handles all your queries in one server. I can't think of a good reason, except for just legacy, why that is. We can actually split these two functions completely, which means our query servers become ephemeral. Our database servers, we can spin them up, spin them down. We can put them in all of our data centers right next to our apps if we want. They cache segments of the data that are hot in memory, kept updated all the time, and that allows you to push the compute layer to the edge and gives you horizontal scalability of that. So even if you wanna do some of these computationally intense algorithms, what you can do is actually push them to your consumers, push them away to the edge, and you don't have to maintain these big centralized servers that are trying to service unknown load for a whole bunch of users. So there are architectural ways of making this better as well.
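As a rough approximation of materializing a rule like the high-value-customer example above, here is a sketch with rdflib: an aggregate query finds the qualifying people, and the inferred class is written back as ordinary triples, the way a forward-chaining engine would. A real deployment would express the rule in OWL or a rule language rather than hand-written code; the IRIs, balances, and threshold are invented.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("https://example.com/vocab/")
g = Graph()

# Toy account data: balances attached to each person's accounts.
g.add((EX.alice, EX.hasAccount, EX.acct1))
g.add((EX.acct1, EX.balance, Literal(800000, datatype=XSD.integer)))
g.add((EX.alice, EX.hasAccount, EX.acct2))
g.add((EX.acct2, EX.balance, Literal(400000, datatype=XSD.integer)))
g.add((EX.bob, EX.hasAccount, EX.acct3))
g.add((EX.acct3, EX.balance, Literal(50000, datatype=XSD.integer)))

# Sum balances per person, then materialize the inferred class so that a
# later "give me the high-value customers" query is a simple lookup.
totals = g.query("""
    PREFIX ex: <https://example.com/vocab/>
    SELECT ?person (SUM(?balance) AS ?total) WHERE {
      ?person ex:hasAccount ?acct .
      ?acct ex:balance ?balance .
    }
    GROUP BY ?person
""")
for row in totals:
    if row.total.toPython() > 1_000_000:
        g.add((row.person, RDF.type, EX.HighValueCustomer))

print(list(g.subjects(RDF.type, EX.HighValueCustomer)))  # roughly: [ex:alice]
```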
[00:36:20] Unknown:
For people who have already invested the work of building up these centralized stores of their data, usually in something like a data warehouse or even a data lake or data lakehouse, whatever you wanna call it, the thing that you have all of your data stuffed into: they've already done a lot of work of trying to normalize the schemas and bring together all of these records into a common representation, often through a lot of brute force and hard work. What are some of the lessons that they should be thinking about as far as their overall workflow to be able to bring in some of this contextual knowledge of their data, embed that into their workflows, and be able to start layering on these capabilities of the emergent properties of these kinds of schema linkages, without necessarily having to do a complete rewrite or rearchitecture of their system, and instead being able to add these data linkages and these ontologies into their workflow, so that maybe, eventually, they move to a new architecture, but they're able to actually start taking advantage of these concepts today?
[00:37:26] Unknown:
I'll talk about what I think is the most logical and easiest way to get into this, as an example, in the scenario you pointed out. You know, we're seeing more and more that there are actually drivers that force you into this scenario. So one is regulation. You know, the FDA is looking to speed up vaccine development, and they're looking to push the pharmaceutical companies to be able to submit data using RDF, using JSON-LD, because it's decentralized and interoperable. It's part of that process. You know, the Department of Homeland Security in the US is looking to use verifiable credentials, which are just a cryptographic wrapping around JSON-LD data, to facilitate customs processing.
So they can show a chain of all the goods that went into a product and make sure there aren't, you know, tariff issues in there. If, in order to ship data across boundaries, you have to represent it as linked data, then, of course, you're forced to do this. It sort of doesn't matter what work you did with other things. The regulators are telling you to do this. We're seeing people wanting to leverage AI as being another driver. If you really want the AI to be able to understand what the data means natively without having to tell it everything, and you wanna scale more of your data for consumption, that's another big driver. But if you don't have one of those drivers where you don't have a choice as far as representing your data like that, well, we've talked about some of the benefits that you would get out of having it, but kind of where do you start?
And the most logical place that people start with this data and the concept of a knowledge graph is they create a knowledge graph which has the class hierarchies and everything that represents their domain, their business, and they map their existing data into that domain. The idea there is that you can introduce other business users to be able to find data. It effectively accomplishes kinda schema-on-read. So if you have different parties, you talked about semantics, mapping into that domain is a way to automatically map those semantics. A pattern we're seeing is to build that out, catalog your data, and then have new datasets registered in that, and have machine learning or other services automatically kicked off, if they have permission to see it, to enrich this catalog with additional metadata to improve the findability. So that's a real logical place to start. And once you have this kind of domain described and this knowledge graph described, instead of just pointing to all these different datasets and enabling some of this metadata and facilitation, you can actually start to represent some of that data natively in the knowledge graph format over time, and it sort of just plugs into the nodes of your overall graph. So that's where we're seeing most people start when they don't have one of these external pressures forcing them to do something.
[00:40:12] Unknown:
And as you were talking about the idea of cataloging, it also brings up the question of data catalogs and data discoverability, which is another piece that is a separate layer, going back to our earlier conversation, and another set of exercises that is necessary, usually bringing in something like a business glossary or bringing in different people to add documentation on different aspects of the data through wikis, labeling, lineage. I'm curious how this idea of the linked data attributes as the embedded context of these records, and the schema layering, and layering on these ontologies, factors into that question of data cataloging and discovery, and maybe some of the ways that it will at least accelerate, if not obviate, the need for these explicit data documentation exercises.
[00:41:03] Unknown:
Yeah. Well, those who have used data catalogs understand the value that they potentially bring, although I know, you know, some implementations are more successful than others. So what I'm referring to when I talk about this knowledge graph acting as a data catalog is sort of like data catalog 2.0. So similar ideas, similar benefits, but how do we bring even more capability into that? We focus on this idea of first modeling out your business, which is an enhanced, or sort of a knowledge graph, version of a data catalog, and then being able to use that structure you set up not just for humans to find out, you know, which datasets store sales information, but to actually allow machines to natively do that and bring some of this more sophisticated query capability to those particular assets.
Once you have that hierarchy established, you can also start using it for automatic classification. So if you're using, you know, natural language unstructured data, you're sort of running it through a generic model, but your business isn't generic. Your business looks for certain things in that. This hierarchy is a native way of actually classifying information and giving a lens to machines, machine learning, or other algorithms to do that. In Flurry, we extend this to be able to handle, permissions and permissions in a a hierarchical way. Once I have this this kind of set up, I can tag nodes. I can establish certain rules, and it's gonna apply to all data assets that sort of flow down from that. So you just start unlocking a lot of capability to taking this concept of this catalog into more of a concept of an ontology.
Solves the same thing the catalog did, but really starts to bring in all of this additional capability as well.
[00:42:54] Unknown:
Another thing that comes to mind as we're talking about layering on these ontologies and the schema, and the embedding of this contextual information into the document that has the attributes, is the idea of schema enforcement or attribute enforcement. I'm thinking in terms of things like the schema store in things like Kafka, where you can say, this is the schema for this topic, so any event that comes in has to conform to this structural specification of the data. I'm curious about some of the ways that that enforcement of attributes and data types can come into play with these contextual elements that are being appended to the documents themselves, and that idea of type and structural enforcement.
[00:43:39] Unknown:
So I think most people who've worked with relational data are used to, you know, traditional SQL-like schema enforcement. And there are a couple nuances. But as the closest analogy, the thing that we use in our product, and that other knowledge graph or graph products use, is the standard called SHACL, which stands for Shapes Constraint Language, and it allows you to establish very, very sophisticated rules as far as what data is allowed or not. And you can really use this in two ways. One is you can use it as a reporting engine. The actual reports with the data quality scoring that come out the other side come out, again, just like everything in this space, as data. So you can easily ingest that into an app, parse it, query for it, whatnot. But if you're acting as an operational data store as well, it can do enforcement like traditional schema on a relational database would. It'll reject something if it's not a string.
You know, of course, you can do kind of the next level up where you're doing cardinality min and max. You're doing regular expressions, but you can take that to entirely new levels where you can actually peer through the graph and say, this is a valid value if it's connected to this type of entity, and that entity has a value greater than x. So you can really start linking it through the graph and have validation happen in ways that you just cannot even approach in traditional schema enforcement in SQL. The twist that I mentioned is that the whole purpose of this is to end up having fully composable data. You can just layer datasets together.
It's as though it was in a data warehouse, but it doesn't even need to be in a data warehouse. You can issue queries across multiple datasets dynamically. Well, when you do that and you're combining data together, you can't enforce things like cardinality constraints. For example, if I had Tobias in my database and you had Tobias in your database, and in one it said your eyes were brown and in the other it said your eyes were blue, even though eye color would be single cardinality, I now have two different values for the same property. So in each of those datasets, I can enforce single cardinality, but when I merge them together, I can't necessarily choose a winner. All of a sudden, everything becomes multi-cardinality.
And this has benefits and challenges, but it becomes a slightly different way of thinking about composable data. It's a challenge that you just simply cannot get around. And this is where you might apply something like a SHACL rule after the fact, where, after I merge these datasets together, I wanna see if they still comply with my schema. The nice thing about SHACL, again, is it's just data. It's like another dataset. It's incredibly reusable. You don't have to recreate it. You can associate it with things and use it in so many different ways. But that would be an example of how you approach schema, and how some of this kind of composability does change things beyond what people are used to thinking about.
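A minimal sketch of that kind of shape, assuming the pyshacl package; the person shape, the properties, and the sample data are invented.

```python
from pyshacl import validate
from rdflib import Graph

# Data to check: the second person is missing a name and has a non-integer age.
data = Graph().parse(format="turtle", data="""
@prefix ex: <https://example.com/vocab/> .
ex:p1 a ex:Person ; ex:name "Brian" ; ex:age 47 .
ex:p2 a ex:Person ; ex:age "unknown" .
""")

# The shape, itself just RDF data: every ex:Person needs exactly one string
# ex:name, and ex:age (if present) must be an integer.
shapes = Graph().parse(format="turtle", data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <https://example.com/vocab/> .

ex:PersonShape a sh:NodeShape ;
  sh:targetClass ex:Person ;
  sh:property [ sh:path ex:name ; sh:datatype xsd:string ;
                sh:minCount 1 ; sh:maxCount 1 ] ;
  sh:property [ sh:path ex:age ; sh:datatype xsd:integer ] .
""")

conforms, report_graph, report_text = validate(data, shacl_graph=shapes)
print(conforms)     # False: ex:p2 breaks both constraints
print(report_text)  # the validation report, which is itself RDF data as well
```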
[00:46:38] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data diffing to compare production and development environments and column level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs into your data CI for team wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. And in your experience of working in this space of linked data, using JSON-LD as the representation for being able to embed all of this ontological and contextual information, what are the most interesting or innovative or unexpected ways that you've seen those concepts applied to data workflows?
[00:47:45] Unknown:
I'm really excited, I think, about new things that this opens up that would be extraordinarily challenging with today's technology, and we've been doing a lot of work in verifiable credentials. And this idea... you know, I've talked about embedding permissions and security and trust into the data layer. You know, right now, this always sits in the app layer. And when we suck our data out, we copy it and we paste it into the data warehouse, we lose all the data security rules. We have to try and recreate them or limit access or whatever. So, you know, embedding security into the data is a very, very powerful concept, but now we can start extending that even further.
And if we think about this data ultimately forming the world wide web of data, if you will, where these datasets can interconnect, be composed, etcetera, 1 of the coolest concepts is to be able to cryptographically have ownership embedded around the data and allow people to pluck nodes out of the graph, if you will, and store them in a wallet on their phone to prove things about themselves, have complete control over their data and their data ownership, and not have to relinquish that data. We are doing a ton of exciting work around this right now with skills and credentials in higher education. Fluree is powering a big network of a lot of the big names in higher ed to be able to have reusable credentials when you apply for a job. You can prove you have degrees. There's no way of faking any of this information. You can prove it was issued by MIT or by Arizona State University, all independently with 0 third parties.
So this concept of decentralization and trust with 0 third parties involved is super powerful. It creates sovereignty, people owning their own data. Think about your financial records. Think about proving to the mortgage company that you have a job without them having to call your employer. Think about going from 1 doctor to another, pulling your medical records into your wallet, and being able to very specifically share with a new provider which parts of that record they get; they can independently verify them even though you're the 1 that gave it to them. So this idea of giving people control over data, and then we start to wrap a special field of cryptography, zero-knowledge cryptography, around it, and it starts to open up being able to do analytics across data where you never physically share a single piece of data. So there's a lot of exciting types of cryptography that can wrap around these concepts that can enable data analytics across, you know, competitors without them ever sharing a single row or record of data. They can still do things across boundaries.
So a lot of exciting opportunity once we solve some of these fundamental rules and problems and store data in this really, really rich native way.
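For reference, a W3C Verifiable Credential of the kind described above is itself a JSON-LD document. The sketch below, written as a Python dict, shows roughly what a degree credential might look like; the DIDs, vocabulary IRIs, and proof values are hypothetical placeholders, and the proof block would in practice be produced by a signing library using the issuer's private key.

```python
# Rough shape of a verifiable credential (W3C VC data model), expressed as
# a Python dict. All identifiers and the proof values are hypothetical.
university_degree_credential = {
    "@context": [
        "https://www.w3.org/2018/credentials/v1",
        {"ex": "http://example.org/"}
    ],
    "type": ["VerifiableCredential", "ex:UniversityDegreeCredential"],
    "issuer": "did:example:arizona-state-university",
    "issuanceDate": "2024-01-15T00:00:00Z",
    "credentialSubject": {
        # The holder keeps this credential in their wallet and presents it
        # to a verifier, who checks the proof against the issuer's public key.
        "id": "did:example:holder-wallet-123",
        "ex:degree": {
            "ex:type": "ex:BachelorOfScience",
            "ex:field": "Computer Science"
        }
    },
    # Added by the issuer at signing time; shown here only as a placeholder.
    "proof": {
        "type": "Ed25519Signature2020",
        "created": "2024-01-15T00:00:00Z",
        "verificationMethod": "did:example:arizona-state-university#key-1",
        "proofPurpose": "assertionMethod",
        "proofValue": "<base58-encoded-signature>"
    }
}
```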
[00:50:44] Unknown:
And in your work of building Fluree and working in this space of linked data products and exploring the capabilities and the challenges, what are the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:50:59] Unknown:
You know, 1 of the challenges I think that we have faced the most is that we live in a world where we've grown up thinking about data in a rectangular way, whether it's using Excel, which is rectangular, or relational databases, more rectangles. Everything's rectangles. And the shift to starting to view data as a graph, kind of through this connected cloud, is a big shift, and it's a hard 1 to go through. For some people, you need to pick up the concepts gradually. And 1 of the things is, while all these ideas sound great and it sounds like it solves the world's problems, it's still just not the way the world works, so it's not practical to jump completely in. You have to kinda come into it gradually. JSON LD excites me a lot because it's a relatively new serialization of these concepts that have been around for a while.
But, of course, as I started out saying, everyone knows JSON. So it brings instant familiarity for anyone who's worked with data in a JSON format, which is everyone, to some of these ideas, and it provides a familiar foundation that you can start building some of the additional concepts on top of very gradually. And I think that's kinda what we need. 1 of the reasons we're so excited about JSON LD in particular is that we think it's gonna open up these concepts to more people and, you know, increase adoption more rapidly.
[00:52:33] Unknown:
And for people who are listening to this conversation and are excited about the ideas of embedding context and having these automatic linkages between different records and ideas, what are the cases where either the linked data approach in general or JSON LD in particular are the wrong choice?
[00:52:54] Unknown:
Linked data in general: part of the whole point of this is to make data more interoperable. So if data is never gonna need to be reused or made interoperable, if you're throwing something up and you know that data is never gonna be used downstream, it doesn't necessarily make sense to do this. You might if it's what you're familiar with, but if you're more familiar with MongoDB or Elasticsearch or MySQL or whatever it is, just use whatever you're familiar with in those scenarios. So that would be a key area of when not to use it. As for JSON LD in particular, JSON is obviously 1 of the least efficient serialization formats anyone could ever conceive of. So a lot of people who deal with JSON don't like JSON for that reason, but, of course, the counter side of that is it's ubiquitous.
It's just everywhere. So if you want compact serializations, there's other ways of getting that in a much more compact format. Now JSON LD is just 1 of many serialization types of RDF. So if you're still committed to RDF, there's Turtle, where those files end in dot t t l. Turtle stands for Terse RDF Triple Language, and terse means it's a very, very compact way of serializing and representing RDF data. So there's many formats for representing it where, if you're looking to have things in a very compact format, you might wanna look at those over using JSON as the main mechanism. JSON's really almost more for humans than it is for machines.
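As a small illustration of the trade-off, the sketch below (assuming rdflib 6+) parses a JSON-LD document and re-serializes the same triples as Turtle; the vocabulary IRIs are made up, and the point is only that both formats carry identical RDF while Turtle writes it more tersely.

```python
# Round-trip the same RDF triples between JSON-LD and Turtle with rdflib.
# IRIs are hypothetical examples.
from rdflib import Graph

doc = """
{
  "@context": {"ex": "http://example.org/"},
  "@id": "ex:brian",
  "ex:worksOn": {"@id": "ex:fluree"},
  "ex:name": "Brian"
}
"""

g = Graph()
g.parse(data=doc, format="json-ld")

print(g.serialize(format="turtle"))
# Roughly:
# @prefix ex: <http://example.org/> .
# ex:brian ex:name "Brian" ;
#     ex:worksOn ex:fluree .
```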
[00:54:33] Unknown:
And as you continue to invest in this space and explore its capabilities and challenges, what are some of the future directions that you would like to see for overall adoption or integration of these linked data concepts in the data ecosystem, and areas where there's opportunity to bring these ideas more seamlessly to a broader audience?
[00:54:58] Unknown:
It just opens up a lot of new capability, and so I'm excited about some of those new capabilities. We've talked about a few of them here: this idea of actually owning data, having complete control over who uses it and how they use it, and ideas around programmable data, being able to embed sophisticated security or logic directly into the data tier, which I think is immensely powerful and is something we spend a lot of time on. The cryptography around this space, some of it is still relatively early, but zero-knowledge proofs, this idea of proving things that data says without ever disclosing the data itself, is, I think, just gonna open up a lot of really neat possibilities.
So those are all exciting. 1 of the things that we're working on now is creating a human layer on top of this sort of global linked network that works a lot like GitHub, where every change is recorded and cryptographically verifiable. You have complete traceability and provenance of every piece of data. And an important thing that we have focused on in this is this concept of time travel, efficient time travel, being able to issue a query as of any moment in time and instantaneously get a response. Beyond just being great for lineage and provenance tracking, the reason we think this is so critical is when you are doing computing across decentralized data.
You can't have predictable results unless you have a notion of time and a clock. Because if all these datasets are constantly updating, I can never reproduce the same results for the same thing. And being able to say, okay, I'm gonna run this computation, it's gonna involve these, you know, 50 or a 100 decentralized datasets, but I want this done as of this exact moment in time. Every 1 of those datasets should know how to reproduce every state of data at every moment in time instantly. It gives you completely reliable results. You can reproduce any query with identical results at any future date, etcetera. So locking in time, this notion that a fact really isn't a fact unless you embed time. Time needs to become a core component of all data management. It's something we're really focused on, and I hope that capability continues to spread.
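A toy way to picture the time-travel idea, not tied to any particular database's API: if every assertion and retraction in an append-only log carries a transaction time, then the state as of any past moment can be rebuilt deterministically, which is what makes a query reproducible at any future date. The fact layout and helper below are purely illustrative.

```python
# Illustrative "as of time t" query over an append-only fact log.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: object
    t: int          # monotonically increasing transaction time
    asserted: bool  # True = added, False = retracted

LOG = [
    Fact("ex:tobias", "ex:eyeColor", "brown", t=1, asserted=True),
    Fact("ex:tobias", "ex:city", "NYC", t=2, asserted=True),
    Fact("ex:tobias", "ex:city", "NYC", t=5, asserted=False),
    Fact("ex:tobias", "ex:city", "Boston", t=5, asserted=True),
]

def as_of(log, subject, predicate, t):
    """Return the value(s) of (subject, predicate) as of transaction time t."""
    current = set()
    for fact in log:
        if fact.t > t or fact.subject != subject or fact.predicate != predicate:
            continue
        if fact.asserted:
            current.add(fact.value)
        else:
            current.discard(fact.value)
    return current

print(as_of(LOG, "ex:tobias", "ex:city", t=3))  # {'NYC'}
print(as_of(LOG, "ex:tobias", "ex:city", t=9))  # {'Boston'}
```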
[00:57:20] Unknown:
And that question of time and the evolution and mutation of data in an immutable way also brings up the question of what is the horizon back to which you are going to support. At what point do you need to start dropping history for storage or performance reasons? And I'm curious what your thoughts are around ways to either expand that horizon or address that need without necessarily destroying the ability to do that time travel up to a certain threshold.
[00:57:48] Unknown:
Well, you wanna be able to efficiently represent deltas, and as long as you can do that in a very efficient way, I see no reason not to retain complete history. But there's 2 parts of managing data. 1 is managing the log of changes, and we create cryptographic linkages between every 1 of those changes so you can prove out the entire dataset. You can prove it's never been tampered with. But the other part that's used even more frequently is indexes, and indexes are just optimizations of that original log. So that's where I see you having the option, and this is what we do, of saying, I don't wanna retain history in my indexes, because indexes are just creating copies and copies optimized in different ways.
And if the queries don't need to leverage that history, then there's no need to do it. And that's where most of the space in the database comes from. It comes from all the indexes. So I think you can do it forever. I mean, you can get a 2 terabyte hard drive now for, like, a $100 or $120. It's just an immense amount of data that you can actually store for virtually nothing. There's no reason not to have that provable history there. And if you actually look at most database patterns, updates don't happen a lot. We create a lot of new data, new records, new rows, if you will, but changes to existing rows are not as frequent. Now there's certain workloads where that's not the case, but probably 98% of all data has more addition of new data than churn of existing data. So it ends up not being a lot. It's super cheap to store.
You can have completely provable data if you do have that storage. Get rid of it in the indexes if you're concerned about the space.
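A minimal sketch of the kind of cryptographic linkage between changes described above, purely illustrative and not a description of any specific product's commit format: each entry embeds the hash of the previous one, so replaying the chain detects any tampering with history.

```python
# Toy hash-chained change log: verify() recomputes every hash and fails if
# any past change has been altered.
import hashlib
import json

def commit(prev_hash: str, change: dict) -> dict:
    payload = json.dumps({"prev": prev_hash, "change": change}, sort_keys=True)
    return {
        "prev": prev_hash,
        "change": change,
        "hash": hashlib.sha256(payload.encode()).hexdigest(),
    }

def verify(chain: list) -> bool:
    prev = "genesis"
    for entry in chain:
        payload = json.dumps({"prev": prev, "change": entry["change"]}, sort_keys=True)
        expected = hashlib.sha256(payload.encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

chain = []
prev = "genesis"
for change in [{"add": {"ex:tobias": {"ex:eyeColor": "brown"}}},
               {"add": {"ex:tobias": {"ex:city": "NYC"}}}]:
    entry = commit(prev, change)
    chain.append(entry)
    prev = entry["hash"]

print(verify(chain))                                          # True
chain[0]["change"]["add"]["ex:tobias"]["ex:eyeColor"] = "blue"
print(verify(chain))                                          # False: history was tampered with
```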
[00:59:33] Unknown:
Are there any other aspects of this space of linked data products, JSON LD, or the semantic and ontological aspects of data attributes and data structures that we didn't discuss yet that you'd like to cover before we close out the show? Well, I feel like we could, you know, at least I could talk about this for a long time. We've hit so many things. I'm pretty satisfied with the ground we've covered. Yeah. Definitely lots of opportunities to go deeper on a number of different areas. But, yeah, for today, we'll leave it as is. And so for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. I think it's helping people
[01:00:25] Unknown:
identify, if they're starting from scratch, an ontology or a vocabulary that might already exist for their specific needs. It takes a lot of time and usually a lot of iterations to develop a really good vocabulary, and being able to leverage other people who have thought through all those edge cases is great, and I don't think we have a lot of good tools to be able to find that. And as I said, you can just create your own and kinda do some mappings later. But, ideally, if you can find something that fits your specific use case upfront, that would be ideal. And that's something that I hope we can do more of and I hope we see more of out in the industry to help people get going in this. Absolutely. Well, thank you very much for
[01:01:05] Unknown:
taking the time today to join me and share your thoughts and experiences around these concepts of linked data, building products around them, and some of the ways to represent that information in a way that is usable and understandable to all of the different components that need to interact with it. Definitely a very interesting and rich space to explore. So I appreciate you taking the time today, and I hope you enjoy the rest of your day. Thanks, Tobias. It was a pleasure being here.
Introduction and Sponsor Messages
Interview with Brian Platz: Introduction and Background
Understanding Linked Data Products
Applications of JSON LD and RDF
JSON LD in Application Development
Data Extraction and Analysis with JSON LD
Graph Databases vs. Relational Databases
Metrics Layer and Semantic Layer in Data
Data Modeling and Contextual Awareness
Performance and Architectural Considerations
Integrating Linked Data into Existing Workflows
Data Cataloging and Discoverability
Schema Enforcement and Attribute Enforcement
Innovative Applications of Linked Data
Challenges and Lessons Learned
When Not to Use Linked Data or JSON LD
Future Directions and Opportunities
Closing Remarks and Final Thoughts