Summary
A lot of the work that goes into data engineering is trying to make sense of the "data exhaust" from other applications and services. There is an undeniable amount of value and utility in that information, but it also introduces significant cost and time requirements. In this episode Nick King discusses how you can be intentional about data creation in your applications and services to reduce the friction and errors involved in building data products and ML applications. He also describes the considerations involved in bringing behavioral data into your systems, and the ways that he and the rest of the Snowplow team are working to make that an easy addition to your platforms.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
- Your host is Tobias Macey and today I’m interviewing Nick King about the utility of behavioral data for your data products and the technical and strategic considerations to collect and integrate it
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you share your definition of "behavioral data" and how it is differentiated from other sources/types of data?
- What are some of the unique characteristics of that information?
- What technical systems are required to generate and collect those interactions?
- What are the organizational patterns that are required to support effective workflows for building data generation capabilities?
- What are some of the strategies that have been most effective for bringing together data and application teams to identify and implement what behaviors to track?
- What are some of the ethical and privacy considerations that need to be addressed when working with end-user behavioral data?
- The data sources associated with business operations services and custom applications already represent some measure of user interaction and behaviors. How can teams use the information available from those systems to inform and augment the types of events/information that should be captured/generated in a system like Snowplow?
- Can you describe the workflow for a team using Snowplow to generate data for a given analytical/ML project?
- What are some of the tactical aspects of deciding what interfaces to use for generating interaction events?
- What are some of the event modeling strategies to keep in mind to simplify the analysis and integration of the generated data?
- What are some of the notable changes in implementation and focus for Snowplow over the past ~4 years?
- How has the emergence of the "modern data stack" influenced the product direction?
- What are the most interesting, innovative, or unexpected ways that you have seen Snowplow used for data generation/behavioral data collection?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Snowplow?
- When is Snowplow the wrong choice?
- What do you have planned for the future of Snowplow?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's A-T-L-A-N, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Nick King about the utility of behavioral data for your data products and the technical and strategic considerations to collect and integrate it, and some of the work that he's doing at Snowplow to support all of those efforts. So, Nick, can you start by introducing yourself?
[00:01:43] Unknown:
Sure thing, Tobias. It's great to be here. My name is Nick King. I look after product marketing and the US business here at Snowplow. And do you remember how you first got started working in data? I've been in data for a really long time. So probably the first time was migrating off of AS/400s and mainframes, off of variants of DB2, into various other data warehouses a very long time ago. So it's been a long time. Got to work on the early days of what became SQL Azure. I worked on the early days of what became BigQuery and Dremel and some of those pieces. Worked on a bunch of AI projects in my previous life at a company called DataRobot, and then ended up at Snowplow really because I kept trying to solve problems, and I kept pulling up Snowplow to solve them. The short history was, I was back in New Zealand looking after my parents sort of post COVID, just ending COVID there, and I had 2 weeks locked in a quarantine facility in New Zealand, which sounds like New Zealand, but it's an army-guarded quarantine facility. And I was constantly, like, pulling up these projects I wanted to work on.
Just in frustration of, like, dealing with bad data, I kept creating my own data. And I knew the Snowplow guys. I was joking, it's like, hey, every time I try to solve a problem, I keep pulling you guys up. And that sort of started the conversation. Like, why don't we start working together to solve it? So that's a very TLDR
[00:02:58] Unknown:
of sort of how long I've been in data and really sort of how I ended up where I am today. The mention of the AS/400 brings back memories as well because 1 of my first jobs in tech, I was a sysadmin, and 1 of the systems that we had as part of our environment was an AS/400. And 1 of the projects that never actually finished before I had moved on to another job was actually bringing in a brand new AS/400 and running the migration to that. So
[00:03:23] Unknown:
They're really great systems, actually. I met someone the other day that still maintains them, and, you know, you can still get them, and they work pretty resiliently.
[00:03:30] Unknown:
It's been a while since I've cracked open RPG code, though. I hear it's a lot easier to work with now. Yeah. I never actually had to work in the code layer. I was more just dealing with the network interconnects and making sure that I was able to talk to the right things.
[00:03:43] Unknown:
Yeah. That's true. I remember putting SNA over IP, so that's a whole different podcast probably. But, you know, I think those fundamentals on those types of data architectures have really helped over the last sort of 20 years because, you know, back then, we really were hardware engineers trying to get performance out of that stuff. And to be fair, you know, a lot of that stuff's gone away now, but what we learned in that space makes us much better at, you know, how we architect data warehouses and how we put data together today. So I'm glad I had that tour of duty. Absolutely.
[00:04:12] Unknown:
That brings up the kind of key phrase that I have said on this podcast many times before, but the idea of mechanical sympathy of understanding what the hardware is actually doing at the fundamental layer and why you don't wanna write your code that way because it will just abuse that hardware, and you will be wasting cycles.
[00:04:28] Unknown:
100%. I would say even 10, 15 years ago, a large part of high performance data management was understanding the drives, the read/write throughput of those data systems, and then how you'd also sequence the network. You know, a lot of projects solved that. Right? I worked on a technology called Spanner, and we were using it in Google Maps Engine. And what it would do is actually time-slew the packets to write the data in the same sequence so you didn't have to have a locking mechanism. So each hard drive had a GPS chip. And then understanding, like, the time slew of how long that took to get to a hard drive in Germany versus a hard drive in Virginia allows you to sort of concurrently write all that data globally. And so, you know, there are so many of these interesting projects where the underlying mechanical elements have a huge impact on how they perform. And even columnar data warehouses, again, similar types of scenarios. Absolutely.
[00:05:22] Unknown:
In terms of the topic at hand, you mentioned in your introduction to how you got involved with data some of the idea of data creation, of having to create data to be able to solve a particular problem. And that also fits in with the idea of behavioral data. And I'm wondering, before we get too far into it, if you can just give your working definition of what constitutes behavioral data and data creation, and how that might be differentiated from some of the other types of data sources that people are likely working with.
[00:05:53] Unknown:
Yeah. So let's start with data creation. So I think, you know, for the last 30 years, really since, like, Informatica started doing ETL pipeline transformations, we've all been in a sort of ETL concept. Right? Where you take data from somewhere, you transform it, and you load it. Schema first, schema second, it doesn't really matter. There's always this concept that you had to source the data and do something with it. But as we look at the category of data creation, there's something very different in the way that, you know, modern data requires how that data is being created. So not just the lineage of the data, like, where it was being created, but also how it was being created. And what I learned, particularly in building a number of ML models in my previous life, but also just trying to maintain enterprise accountability with data, is that it becomes almost impossible to have a common understanding across a whole organization of your data. So I might understand parts of the data, you might understand parts of the data, and everyone's trying to apply their lens, their ontology to that data.
And what I found is that most people were dissatisfied with the data anyway. Right? You're kind of grabbing Salesforce data to describe 1 thing or, you know, Google data to describe something else. And that concept of passing that ownership to someone else sort of didn't make sense when you're trying to make, you know, really important business decisions. And I think what's held most people back from creating their own data is the complexity or the accessibility to create that data. If you look at the way things are architected today, most of us have access to drop an SDK onto our website or drop JavaScript onto our website. We have the ability to, like, emit events from Salesforce. We have all these different types of business systems, whether it be IoT devices. They're all really capable of emitting some form of event.
And so, really, when I talk about data creation, it's really about pushing out to each of these different endpoints, these places where something happens when an event occurs, the ability to create that event, and then run that through a pipeline where you really have this concept of schema always. And so we talk about schema first or schema second. When you live in a data creation concept, you're creating data, but your schema is always evolving as that data gets created. And the way that you evolve schema is you have good events and bad events. Good events, obviously, match the schema; bad events do not. And this also has a whole bunch of interesting things that happen. 1 is you can evolve the schema. A bad event could mean something upstream changed.
Okay? That's schema evolution. That event could be something doing something bad. It could be something attacking that endpoint. Okay? That could also be something interesting. And, traditionally, when you have an ETL pipeline, you really have no visibility upstream. So if someone changes something, your whole pipeline breaks, or you just don't know, and your data drifts. I mean, that's bad. And so all these problems existed from just this sort of lineage, this sort of waterfall approach that ETL had. And so that got us started talking much more about data creation. And it turns out, when you look at the type of feature tables or metric tables you're trying to build, if you have a common understanding of your data and the ability to create it at scale and do that schema evolution, it's far more performant than trying to maintain 50 or a 100 pipelines of data and always maintaining those ETL pipelines. So that's, you know, the definition of data creation: being able to create data across all of your enterprise systems, doing so in an event based way, and having that schema logic behind it. So that's the first part. That's sort of what Snowplow does. It's the core of what Snowplow is. But then once you've got control of the data creation, it opens up a second level of interesting things you can start doing with it.
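The "good events, bad events" split described here can be sketched in a few lines. This is an illustrative toy, not Snowplow's actual pipeline; real Snowplow validation uses versioned JSON Schemas, and the schema format and field names below are invented for the example:

```python
# Sketch of "schema always" validation: each incoming event is checked
# against the registered schema for its type. Passing events go to the
# "good" stream; failing ones go to the "bad" stream, where they can
# signal either an upstream change (schema evolution) or an attack.
# The schema format here is hypothetical and far simpler than real
# Snowplow JSON Schemas.

SCHEMAS = {
    "page_view": {"required": {"user_id", "url", "timestamp"}},
    "add_to_cart": {"required": {"user_id", "sku", "timestamp"}},
}

def route_event(event):
    """Return 'good' or 'bad' for a single event dict."""
    schema = SCHEMAS.get(event.get("event_type"))
    if schema is None:
        return "bad"  # unknown event type: possible upstream change
    if not schema["required"] <= event.keys():
        return "bad"  # missing fields: schema drift or malformed payload
    return "good"

def partition(events):
    """Split a batch of events into good and bad streams."""
    good, bad = [], []
    for e in events:
        (good if route_event(e) == "good" else bad).append(e)
    return good, bad
```

Because bad events are kept rather than dropped, the downstream team can inspect them to decide whether to evolve the schema or investigate the source.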
If you have this evolving schema that you understand, if you know the lineage of the data, then you can start to link those events together in a way that's also predictable. You can recognize ways of stitching those events together, and you can start to have much more control of what that data landscape looks like. So I'll give you, like, a practical example. We work with a company called Warby Parker. They sell glasses. You go to their website, you check some stuff. You know, you go to the Warby Parker app, you put the glasses on your face. You go into their store, you kinda swipe your credit card, you do what have you. That's 3 different, very distinct journeys in a retail experience. And in those 3 distinct examples, traditionally, you have multiple tools, ETL. You try to join it, and you might figure out you pulled that off 6 months down the line, or 3 months, or maybe if you're really good, like, 2 days after it. But it's very hard to take a real time action on that data. What we do for them is we actually stitch it together, you know, near real time. So as the event comes through, as we know it, we can start to actually inform, you know, the AR app or the in store promotion or the website to actually change the behavior in real time. So because you understand your event data, you'll see the event stream come through, and you can stitch it that way, and you've got far more powerful types of behavioral concepts you can also start to put together. So that's 1 example of a behavioral data concept.
But to get, like, into the data structure of that, you know, think of the data being layered this way. You have the system data, what created that data. You have the transaction, what moved from 1 place to the other, what was the payload data. You have the demographics, who was that. And then behavioral data is linking each of those events together to say, hey, Nick came to my store, he went to this place, he did that, he then bought, he then had a refund. And that's kind of the holy grail, I think, for a lot of us. We think of, like, a data pattern, but we very rarely can stitch together those time sequence events to do that. And so behavioral data is taking all those dimensions and putting them in a logical, connected way to kind of drive extrapolation from it, and that makes ML far easier.
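A rough sketch of the event stitching described here, grouping events per entity, ordering them in time, and linking each step to the next with the elapsed time between them. Field names (`user_id`, `action`, `timestamp`) are hypothetical, and real identity stitching across web, app, and in-store channels is far more involved:

```python
from collections import defaultdict

def stitch_journeys(events):
    """Group events by entity, order them in time, and link each event
    to its successor with the time elapsed between them.
    `events` is a list of dicts with user_id, action, and timestamp
    (epoch seconds); the field names are illustrative only."""
    by_entity = defaultdict(list)
    for e in events:
        by_entity[e["user_id"]].append(e)

    journeys = {}
    for user, evs in by_entity.items():
        evs.sort(key=lambda e: e["timestamp"])  # time-order the journey
        steps = []
        for prev, cur in zip(evs, evs[1:]):  # adjacent event pairs
            steps.append({
                "from": prev["action"],
                "to": cur["action"],
                "seconds_between": cur["timestamp"] - prev["timestamp"],
            })
        journeys[user] = steps
    return journeys
```

The output is the "fourth dimension" described above: not just which events happened, but how they link and how far apart they were.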
But even just, you know, simple, you know, what was the typical journey and how often that journey exists can be much easier to kind of replicate as well. So that's 2 very big examples, but I think it kind of covers both data creation and behavioral data for you. Yeah. I like the framing of it too because
[00:11:29] Unknown:
the way you're describing it, behavioral data is a subset of that data creation problem. And the way that you're describing it, it also comes off as being distinct from the event tracking approach that a lot of people might be familiar with, particularly with things like Google Analytics or Segment, where you have an event, and it is a discrete event, and doesn't necessarily have contextual or semantic linkage with any other event. Even if somebody is on a website, you know, maybe the semantic or contextual linkage is that it's within a given session on that website, but you don't necessarily know anything else beyond that unless you have already done the work to add that extra context. I'm curious if you can talk to that distinction between event tracking as a lot of people understand it and how it maps into this, I guess, superset of the problem space of behavioral data, and some of the extra pieces of information or some of the additional mechanisms or utilities that are necessary to be able to kind of level up into that behavioral data approach.
[00:12:30] Unknown:
There's 2 common ways I see this done. The first obvious 1 is, like, creating 1 view of a given ID, so let's just say your email address, and all the different events that exist with that, and then how those events are linked together. So you end up with this, what I call a behavioral profile of an entity. And, like, the behavioral profile could be you. It could be Tobias's house or Tobias's companies. And there are also relationships between, you know, buyer groups or, like, purchasing groups. That's sort of a human example, but the same exists for systems. There are always these relationships that exist. And so you wanna create 1 view of all the events with the linkage of events to provide that: what is the view of Tobias right now? That is incredibly powerful and informative for all sorts of systems, sales performance, market research, market targeting. And having that in 1 place can have a really big impact on how you think about that behavioral data. And that today is how most people think about what is behavioral data. But there's another set of behavioral data which is also really interesting. When you start to do that linkage, when you take an event and you say, okay, this event preceded this other event here, and this was the time between those 2 events, and then this event was linked to this event, and this was the time between these events, you end up with this 4th dimension, effectively. So you have, like, what is the event, at what time, and then how do these get linked together? And so you end up building this time series journey of what's happening.
This is really important because you start shifting to almost a graph approach. You basically have joins of events. But what's really interesting here is it's really hard for machine learning to actually predict and join those events together without some logical connection. So if you were trying to, like, look at your mouse clicks and say, okay, this user clicked these 3 places and converted, or this person clicked these 4 places and converted, the ML will generally try to find the laziest way to solve that problem, so it's always gonna look at what converted. So you end up with lots of, like, the primary conversion becomes people going from docs or getting started. You end up doing all this, like, weird stuff to try and make the ML data usable. But when you have the graph join and explain the join between the data, then suddenly the ML sees that as part of a workable feature, and it stops trying to cheat. Then you also don't introduce your own biases by trying to, you know, fake a journey.
So that's why that actual joining of events becomes a really powerful way of sequencing a journey. And particularly if there's time series modeling, or even simply just choosing a repeatable journey, like, these are the 4 steps I wanna understand, is there a way to make it 3 steps? That linkage becomes immensely powerful. And so as you're doing data exploration, being able to say, okay, show me all the customers that started with this nth step, and show me all the variations that got into that nth step. By being able to stitch all those together, you have a far more informative data set. And so we start to see that used a lot more for web optimization, product optimization, and we're starting to see it used somewhat for dynamic sort of optimization of some things. Like, we have 1 customer, Charlotte Tilbury. They use some of that data to optimize sort of their cart performance. And I'm using web examples because they're the easiest ones to explain, but, you know, that type of optimization is really interesting.
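The "show me all the customers that started with this step, and all the variations" exploration described here can be sketched as a small helper. Journey and step names are made up for illustration:

```python
from collections import Counter

def path_variations(journeys, first_step):
    """Given journeys as lists of action names, count the distinct
    action sequences among those that begin with `first_step`.
    This is the data-exploration question: starting from step N,
    which path variations occurred, and how often?"""
    matching = [tuple(j) for j in journeys if j and j[0] == first_step]
    return Counter(matching)
```

For example, comparing a 4-step path against a 3-step shortcut is just comparing the counts of two keys in the returned `Counter`.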
Another example: a large financial organization uses a similar type of behavioral data to understand, is there a security risk? The way that this action occurred, did the behavior in that event look consistent with other activities that led to that event? If we string those together, you have a much broader profile of what led that event to happen. So without naming names, I'll use another example. Bad actors tend to have different behaviors. Right? So 1 might be, if you've got, like, a script trying to hack a website, the mouse might not move. And traditionally, you don't really look at, did the mouse move before this event? Because they'd be tracked as discrete events. In behavioral data, the mouse moving event would be part of the same journey as something else. And so did it move in an organic way or an inorganic way? Did it go, like, plus x 1, you know, plus 1, plus 1, or did it move in a way that looked like the person had muscles? And so just looking at the organic versus nonorganic movement is kinda interesting. Another 1 is looking at, like, okay, was there metadata around the payload which was interesting? So were there multiple email addresses? It turns out bad people tend to have lots of email addresses that aren't theirs. They might have 400 variations of, like, @gmail, @hotmail sitting in the cache. Most good people, like, we might have 4 or 5 email addresses; you're not gonna see 1,000. And so those types of behavioral elements, it's like, at this stage, what did we see and how did it come through? Were there organic movements?
That sort of sets up how behavioral data can be used. So that gave you a, you know, an ecommerce example, a security example, an email example; all of those are why behavioral data is so interesting. But the hard part is, if you don't create it deliberately, it's really hard to sort of emulate it or kind of instigate it.
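The fraud heuristics described above, perfectly linear mouse movement and an implausible number of email addresses, can be sketched roughly as follows. The thresholds and function names are illustrative toys, not a production fraud model:

```python
def looks_scripted(mouse_points):
    """Heuristic: perfectly regular movement (identical delta on every
    sample, e.g. +1, +1, +1 in x) suggests a script rather than a hand.
    `mouse_points` is a list of (x, y) samples before the event."""
    if len(mouse_points) < 3:
        return True  # no meaningful mouse movement preceded the event
    dxs = [b[0] - a[0] for a, b in zip(mouse_points, mouse_points[1:])]
    dys = [b[1] - a[1] for a, b in zip(mouse_points, mouse_points[1:])]
    # A human hand produces jitter; a constant step in both axes does not.
    return len(set(dxs)) == 1 and len(set(dys)) == 1

def suspicious_email_volume(emails_seen, threshold=20):
    """Most real users carry a handful of addresses; hundreds of
    @gmail/@hotmail variations in one session is a red flag.
    The threshold of 20 is an arbitrary illustration."""
    return len(set(emails_seen)) >= threshold
```

The point of the journey model is that both signals are evaluated as part of the same stitched sequence leading up to the event, not as isolated data points.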
[00:17:21] Unknown:
And that brings up the topic of things like tracking plans that I've heard people discussing where they say, okay. This is the event that we want to track. This is how we're going to construct it. This is where we're going to record it. And I'm wondering if you can talk to some of the organizational patterns that are most effective and most useful for being able to work with this behavioral data and some of the ways that that maps into the actual technical application and technical, I guess, like, task flow of how the application and data teams work together to be able to understand what are the pieces of information that we want to track and how.
[00:17:58] Unknown:
Yeah. I mean, I think I see this a lot where there's, like, a Google Sheet that exists in the business, and that's, like, all the tracking plans, and, like, you know, there's usually, like, Jira ticket numbers, and it isn't up to date, and there's 1 poor soul that kind of tries to maintain that. That's 1 way to do it. Like, at least it's in 1 place, but I see that a lot. I think what we see works well, and what we sort of strive for at Snowplow, is this concept of a tracking catalog. So we go through and we actually enumerate all the schemas. We give you the idea of what you are tracking, what are the events, what are the entities, what's the version number of that. And so we try to put over the top of the schema an element of what's being tracked and where that's at. And so you always wanna be able to go through and quickly enumerate and maintain a catalog of your schemas, but also understand the relationship of your tracking catalog and your tracking plan.
And what I think is a best practice, or what I believe is a best practice there, is being very deliberate upfront as to your events and entities, and being, you know, methodical about the version numbers of what you're trying to do. Like, have a deliberate rollout process. And why the Google Sheet way and other ways don't work as well is because you sort of work async, but there's not really a very consistent release cadence for that. And so it's very easy for, like, each individual to deliver the exact tracking they need to solve their problem, and you get these weird things like duplication of events and other things, and so you end up with that duplication. When you have a tracking catalog, it allows you to think of all the events that might come, whether they be system events, DevOps events, user events, behavioral data, or what have you.
That overall universal view is really the best practice that we put into our product and that we encourage folks to follow. You could do it in a spreadsheet; it's harder. But that view of having almost a release management approach to your tracking catalog and your tracking plan is key. And then the other piece that we kind of innovated around this is, you know, often you sort of take that tracking plan, and it just becomes, you just add a new column to some table somewhere. Right? And so you end up with this table sprawl. And that's great because you've got your table, but someone at the end has to rejoin those back into something else. And so this sort of never ending Ponzi scheme of ETL processes keeps getting recreated. And so the way we would look at it is, we'd say, allow your schema to evolve, so if you need to add stuff, that sits in the schema, the tracking catalog and the tracking plan get updated, and then force yourself to have 1 universal table to kind of bring all of these together.
So an example of someone who does this really well is Strava. I don't know if you've seen the app. It's like a fitness app, kinda measures all the segments. And so everything in Strava is instrumented in Snowplow. Everything from the way their infrastructure systems work to the way their application works, the way everything kinda works is, if it can be an event, it's in the tracking catalog, it goes in this overall flow, and they build 1 master atomic table. And that atomic table then gets aggregated or modeled to what other organizations need. Having a tracking plan and catalog that maps back to this concept of an atomic table, which then gets modeled to the data products people need to use, provides a lot of consistency, but it also solves 1 universal problem, which is that everyone queries from the same source.
So you have this pure lineage effect, where it's like, we know it was created because we had a tracking plan and catalog. We know it's in the atomic table, so we know these are all events, and then we aggregate from there. Everyone is sourcing from that atomic table. And that reduces table sprawl, but it also stops, like, drift and forking of all these different tables. And then, you know, in some ways, I think we're all working towards this concept of, like, a blue check mark for data. I'd love to 1 day see a table with a blue check mark on it where you have, like, the creation, the pipeline, the tracking plan, and then it's like, okay. Guaranteed. Everything is guaranteed to this table.
And that's why the tracking catalog and tracking plan concept is 1 that's getting so much attention right now. We just have to make it much more usable, because the Google Sheet view sort of works, but not quite well enough.
[00:21:55] Unknown:
Absolutely. And another dimension of this behavioral data conversation is that a lot of the ways that people at least initially interact with this concept is through something like a JavaScript tracker on their website. But when you talk about it, it's a lot more detailed than that. It's not just the events that happened to get fired from the website. It's also the events that are triggered by, you know, the back end system that sends an email, or by the application. And that brings up the broader concept of how we as data engineers typically interact with building these data flows, where largely we rip the data out of the database, pull it out of context, and try to reconstitute it. It sort of becomes the MRE of information. So I'm wondering, what are some of the ways that this principle of data creation can serve to reduce the burden and improve the signal to noise ratio for these data integration efforts? Where rather than just pulling all of the raw records out of the database and hoping for the best, you're actually collaborating with the application team to be able to say, give me meaningful events, give me meaningful records that I don't have to work on reconstituting,
[00:23:06] Unknown:
that can have a stable API even if you want to modify the underlying schema, and then you just hand that off to me. I know what to do with it from here. Yeah. I love that, and you touched on a couple of great points in there. The first 1 is, you know, how do you reduce the signal to noise and get that to a good and known state? When you do the extraction, you also extract the aggregations. And often, systems aren't designed to be event driven. They might be point in time systems, so you lose that ability to manage change of state. The other thing is, 1 of the aggregations that causes a lot of problems for ML and BI is: is it aggregated by second, by hour, by week, by organization?
And you put that all into 1 table, and that number might be representing an entire day's activity, or it could be 1 specific event. So when you're training on those datasets, or even trying to derive some level of metric table from them, it gets really complicated trying to fix that. What you end up doing is reverse engineering the aggregations. You divide it by 7 if it's a week to get a 1 day number, and so you end up with these weird normalization activities to try and make the data fit the aggregation that's being asked for. When you take an event driven, data creation point of view, everything is effectively an atomic event, and then you can choose how to aggregate it. And then you can work through your different sources. Usually, web is where we start because it's pretty easy to deploy a tracker, a JS tracker.
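The "altitude" point above can be sketched in a few lines: if you keep atomic, per-event records, you can pick the aggregation grain (hour, day, week) at query time instead of reverse engineering it, like dividing a weekly total by 7. This is an illustrative sketch, not Snowplow's actual modeling layer.

```python
from collections import Counter
from datetime import datetime

# Atomic events: just timestamps here, one row per event.
events = [
    datetime(2023, 3, 6, 9, 15),   # Monday
    datetime(2023, 3, 6, 17, 40),  # Monday
    datetime(2023, 3, 8, 11, 5),   # Wednesday, same ISO week
]

def aggregate(events, grain):
    """Count events at the requested grain; the raw events stay untouched."""
    keys = {
        "hour": lambda t: t.strftime("%Y-%m-%d %H:00"),
        "day":  lambda t: t.strftime("%Y-%m-%d"),
        "week": lambda t: t.strftime("%G-W%V"),  # ISO year-week
    }
    return dict(Counter(keys[grain](t) for t in events))

print(aggregate(events, "day"))   # {'2023-03-06': 2, '2023-03-08': 1}
print(aggregate(events, "week"))  # {'2023-W10': 3}
```

The same 3 events answer a daily question and a weekly question exactly; a table that only stored the weekly total could never recover the daily split.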
Then we tend to see mobile, then sales data or performance data, then it's usually support data. And for each of those, you go through almost a data contract discussion with the business. What is the data we expect? What's the data you need us to produce? How do we make sure we deliver that data consistently? And then what is the right altitude or aggregation to provide that data at? If you think about where a lot of the churn comes from for data engineering teams, it's that people might have said they wanted this aggregation, but they really meant something else and just didn't know how to articulate it. Or the system might be doing weird things upstream that you don't have control of, so you're always trying to figure out why this thing's going rogue. And so the first big step forward for data engineering, when you take a data creation approach, is that it's almost like farm to table data. You know exactly where it started and where it ended up, so you know what's happened. You can have very specific discussions with the organization and get to very specific outcomes. You can document those, or maintain them in schemas and versions, and end up with a much more robust alignment with the business.
Number 2 is it gets much easier to define the data contracts and produce data products. With data contracts, we talk a lot about what data is expected. You know, how do we make sure that the data is in line with the schema the business is expecting? And data products, in some ways, are about what you produce and how you provide that data back to the organization. For the most part, the people I talk to are all heading in this direction where it's like, we don't wanna give carte blanche access to all the data. We do wanna deliver curated data products that people can come and consume from. And we, as data engineers, wanna be accountable for how that product gets created. Right? We wanna make sure we've got, you know, organic data. We wanna make sure it's farm to table. I just came up with that, but it's kinda true: you know exactly where it goes. It helps your consumers know it's good, but you can also stand behind it and troubleshoot it. Traditionally, you hoped it was right because your business constituents told you the data was right, whereas when you do data creation, you can actually get much more authoritative and much more deliberate about your curation. So that's probably the first time saving I see.
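A data contract check of the kind described above can be surprisingly small. This is a hand-rolled, hypothetical sketch; real deployments (including Snowplow's) typically use JSON Schema, but the shape of the idea is the same: before an event is accepted, assert it matches what the business agreed to.

```python
# Hypothetical "data contract" for one event type: field name -> required type.
# The fields here are invented for illustration.
CONTRACT = {
    "event_name":  str,
    "user_id":     str,
    "value_cents": int,
}

def conforms(event: dict, contract: dict) -> bool:
    """True iff every contracted field is present with the agreed type."""
    return all(
        field in event and isinstance(event[field], typ)
        for field, typ in contract.items()
    )

good = {"event_name": "purchase", "user_id": "u1", "value_cents": 1299}
bad  = {"event_name": "purchase", "user_id": "u1", "value_cents": "12.99"}
print(conforms(good, CONTRACT))  # True
print(conforms(bad, CONTRACT))   # False: value_cents is a string, not int
```

Rejecting `bad` at creation time is exactly the "good and known state" being described: the malformed value never reaches the warehouse, so nobody downstream has to reverse engineer it.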
The second is just the cognitive load that it takes to deal with ETL and data. I know Salesforce data really well. I think most of us do, except when people customize Salesforce. Right? And so you end up with all these different fields you're trying to work with, and you end up with this cognitive load of trying to understand these data structures. The other thing when you take a data creation approach is that the grammar, the way you describe each row and each column, is defined in a way that's repeatable.
And what I find happens is that organizations are able to navigate that data much faster. So in the Strava example, you know, you can get in there and you know what the data structure's gonna look like. You know what the column names are gonna be. They've got a really good governance approach, so shout out to those guys if they're listening. It's world class. But the way they do the governance means that as a new starter, I remember being able to get in there, use the data, and know what to look for. There's not 3 columns called customer 1, customer 3, and capital C Customer. There's 1 column called customer. With that data management and data governance, you've got more flexibility but much more control. That cognitive load piece is really important as well. So those are the 2 things that I tend to call out. There's a bunch more too, but those are probably the 2 I'd start with, the 2 most important ones.
[00:28:01] Unknown:
Prefect is the data flow automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that your tools shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100,000,000 business critical tasks a month. For more information on Prefect, go to dataengineeringpodcast.com/prefect today.
That's Prefect. Another issue that is embedded in the concept of behavioral data is that of the ethical and privacy considerations that need to be balanced against the business objectives and the kind of analytical outcomes that you're aiming for. And I'm wondering how you have seen teams address that balance: identifying what some of those privacy and ethical considerations are that they need to understand, how to include them in the process of determining what events to track and what specific pieces of information are allowable, and how to manage the governance of that information after it has been created?
[00:29:30] Unknown:
Yeah. You know, in some ways, there's what we're being guided to by the EU with GDPR. Just last week, the White House released another proposal. I don't know if they've signed it, but it looks like the beginnings of another safe harbor, so we can start working with European data again. GDPR is a relatively well written piece of legislation, and it gives us a lot of ideas on how to manage the data. Right? The right to be forgotten, how we share data with 3rd parties. But, also, what's interesting, I don't know if you've caught this, is that 3rd party data collection, so Google Analytics or Adobe or some other tool, is quickly being ruled as noncompliant with a lot of countries' privacy laws.
And first party data creation is actually in line with how the EU and others would like to see businesses and applications evolve. The reason for that is there's no intermediary that can get in the middle and be like, oh, we decided to do something different, and you clicked this without realizing what you were agreeing to. So the first thing is understanding the policy landscape. I think GDPR is good guidance for a lot of us on how we store the data, how we ask for permission, how we give people the right to be forgotten, and also how we collect it. Trusting others to collect the data on our behalf, while it's easy because they give us a schema you can just drop in, generally means you've lost the lineage of that data. You don't know what's happened to it, so it's very hard for you to commit to it. And I think, ultimately, what individuals want from people that use their data is accountability. What did you use this data for? What did you collect on me? And did it ultimately make me, as a consumer, better off?
So that leads into the second part of this: it's possible now to collect huge amounts of data on things that people didn't even realize were an interesting data point. What I generally recommend is, understand what your business policies are. Do you wanna collect gender? Do you wanna understand ethnicity? Do you wanna have a ZIP code? How do you think about what data is important and informative? And build a tracking plan to support that. The other thing is pseudonymize by default. We encourage our customers to always pseudonymize, and in some cases anonymize, data. That really helps customers build a policy. Once you've pseudonymized, you've collected the right data, you've got an understood business policy of what you're collecting and not collecting, and you've got the GDPR stuff built in, then you've already got a platform that is largely conformant to the guidance that, I would say, the EU is leading us with. It's 1 of their best innovations in a long time, what they've done with GDPR, and I can't believe I'm saying that, but I've actually grown to love this piece of law. But then the other thing is what we do with it as data engineers.
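One common way to implement "pseudonymize by default" is a keyed hash applied at the collection layer. This is a generic sketch, not Snowplow's actual enrichment; the field names and key handling are illustrative. With the key held separately from the warehouse, the same user still maps to the same token (so joins and counts work), but the raw identifier never lands downstream.

```python
import hashlib
import hmac

# Illustrative only: a real deployment would load this key from a secret
# store that the warehouse and analysts cannot read, and rotate it.
SECRET_KEY = b"rotate-me-and-keep-me-out-of-the-warehouse"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a deterministic keyed hash."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

event = {"event_name": "page_view", "email": "jane@example.com"}
safe_event = {**event, "email": pseudonymize(event["email"])}

# The raw email is gone, but the token is stable, so per-user analysis
# (counts, funnels, joins) still works on the pseudonymized data.
assert safe_event["email"] != event["email"]
assert pseudonymize("jane@example.com") == safe_event["email"]
```

Deterministic keyed hashing is pseudonymization, not anonymization: whoever holds the key can re-link tokens to people, which is why the key has to live outside the data platform.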
And as data engineers, I think we all have pretty good moral compasses. We know that we should make the right decision. But, ultimately, the way I look at it is, you want to protect that customer's data, and we wanna be making decisions that make their lives better because of it. You could argue, does a better glasses buying experience make my life better? Maybe. Maybe not. It's not gonna change my life. But if I can get through and have options that I might not have had before, and I feel more confident, then I think that's adding some value to that individual. So I like to use "is this adding value to the individual" as a litmus test. And then the third thing I'll say is your ability to react to things that change.
I don't think people wake up in the morning and decide, I'm gonna do something sketchy, but things can change. So I'll give you a very real example. During COVID, I was working on an application that was basically approving credit cards, approving different credit applications. And, you know, we shut the country down. Everyone was going home. We didn't know what was happening. Certain neighborhoods were much more affected by not being able to go to work every day than others. So what happened was ZIP code became a really strong indicator of whether someone was gonna be able to make their credit card payment or not, almost overnight.
Now in a time when everyone's in a really tough position, the banks actually were incented to try and help these people out, but their systems were effectively blocking based on ZIP codes. So ZIP code became a very biased indicator of whether something should be accepted or declined. None of us would have thought ZIP code was a bad thing to collect prior to COVID, but it turned out it was a really bad thing to weight your models on afterwards. So what do you do? Do you drop the column? Do you stop collecting it? Do you reweight it? The answer is you've gotta be able to understand: okay, that column suddenly became biased. Do you hold it back? Do you create another view for it? That level of flexibility and understanding is what's really important, the ability to react. And when you have these complicated ETL pipelines, it turns out you don't know if ZIP code was used somewhere way upstream. Whereas with this approach, you're able to make these decisions inside your schema and control what drives that inference. So that's a third part to that compliance angle.
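The "hold it back in a view" option mentioned above can be sketched simply. This is a hypothetical illustration, with invented field names: rather than deleting the column from the atomic data (and losing history), you publish a modeled view that withholds fields currently flagged as biased, so downstream models stop seeing the field with no upstream surgery.

```python
# Fields currently flagged as biased signals; updated when circumstances
# change (e.g. ZIP code during COVID), without touching the atomic data.
WITHHELD_FIELDS = {"zip_code"}

def model_view(row: dict) -> dict:
    """Derived view for model training: the atomic row minus withheld fields."""
    return {k: v for k, v in row.items() if k not in WITHHELD_FIELDS}

atomic_row = {"user_id": "u1", "zip_code": "98101", "limit_cents": 500000}
print(model_view(atomic_row))  # {'user_id': 'u1', 'limit_cents': 500000}
```

The atomic record keeps its full history (useful for audits and for reversing the decision later), while every consumer of the modeled view automatically loses access to the biased field the moment the flag is flipped.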
The 4th thing I would say, because this is a complicated and gnarly topic, is knowing where your data flows. And I do think this is where people are gonna get more and more in trouble over the next 3 to 5 years. As data engineers, we have to think about this. General rule of thumb: if you create the data in Europe, keep the data in Europe. Pretty obvious, but maintaining that through your pipeline is actually really hard to do. You know, at 1 of the companies I worked with in a previous life, we actually didn't know where in the world your files were stored, because that was just the way the infrastructure worked. Who knew? It could be in Germany. It could be in Australia. Your files existed somewhere in the ether, but we never knew. And so I think as data engineers, you have to be able to explain that your data has stayed in place.
And if you can explain it, then generally the auditors, or the people that want to understand, will be pretty happy with you and be supportive. And the very last thing is, like, Apple and others. With ITP, Apple basically stopped 3rd party cookies, so we're seeing a massive shift in how we're using cookies, and I expect to see a lot more there. There are even some rumors about how Apple's gonna start looking at how they price things around those pieces. With Apple being such a large part of the market, that's the way things are gonna go. And I think part of what this type of data creation can also help with is being prepared for those changes, because I do think customers do want more privacy.
I do think we're gonna see sovereign nations have different opinions from each other. Already, we're seeing countries in Europe shifting. Austria, Germany, France, Spain, Italy, and Denmark have all effectively banned GA as a data collection tool. So that flexibility is immensely important.
[00:36:06] Unknown:
As far as the organizational aspects of managing behavioral data and data creation, what are some of the ways that you have seen the adoption of that strategy impact the team dynamics or the ways that teams are structured and organized?
[00:36:25] Unknown:
I genuinely see the teams breaking into 2 or 3, usually 3, distinct functions. 1 function is sort of the running the business function. They're using the data to produce these very specific reports that everyone runs the business on. That's kinda what we know today; we see the metrics tables that get built out. The second is this evolution of behavioral understanding of the customer, almost like if this, then that. There's this element of, hey, we see this behavior equals that, and the ability to produce and repeat that becomes far more powerful. Do these 3 steps and this event happens. So the ability to analyze the customer journeys and the system journeys goes up. That's the second type of evolution of the team: we see these teams defining journeys.
And then the third, which is kind of my favorite, is when you see people activate those journeys. They start to actually take those insights in real time and change the behavior. A good example of a company that does this is Auto Trader, Auto Trader UK, I should say. They have this concept of how you consume the different cars, and they'll actually figure out what cars you like and dynamically change the cars you're seeing as you scroll down. So there's this concept of scroll depth and scroll depth location, which is cool. And what's cool with that is, 1, they've taken the behavioral insight that as you scroll, you kinda wanna see more of the cars you like further up the stream.
But also, you can then say, alright, change the behavior of the application based on that. So you're taking behavioral data that says I like Toyotas, but my stream's got Nissans in it, and therefore it's dynamically showing more Toyotas and changing the application as I scroll through it. So the first function is a view of the business, trying to run the business. The second is the definition of the behavioral, you know, if this, then that. And then the third is the actual activation of that behavior in applications, which is super cool.
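The activation pattern described above, infer a preference from behavioral events, then re-rank what the user sees, can be sketched in a few lines. This is a hypothetical illustration of the idea, not Auto Trader's actual system; the event shape and field names are invented.

```python
from collections import Counter

# Behavioral events: which makes the user dwelt on while scrolling.
scroll_events = [
    {"make": "Toyota"}, {"make": "Toyota"}, {"make": "Nissan"},
]
preference = Counter(e["make"] for e in scroll_events)  # Toyota: 2, Nissan: 1

feed = ["Nissan Leaf", "Toyota Corolla", "Nissan Micra", "Toyota Yaris"]

def rerank(feed, preference):
    # Stable sort: higher-preference makes first, original order otherwise.
    # Counter returns 0 for unseen makes, so unknown cars sink, not crash.
    return sorted(feed, key=lambda car: -preference[car.split()[0]])

print(rerank(feed, preference))
# ['Toyota Corolla', 'Toyota Yaris', 'Nissan Leaf', 'Nissan Micra']
```

In a real system the preference counts would come from the atomic event stream in near real time, and the re-ranked feed is the "activation" of that behavioral insight back into the application.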
[00:38:21] Unknown:
Yeah. It also brings in the complicated aspect of, you know, kind of nature versus nurture to some degree. You know, we're using behavioral data to inform this application behavior, which then influences the actual behaviors that the user is creating. So, you know, did their behavior map to a certain pattern because of the way that we changed the application as a result of their information, or is it the other way around?
[00:38:45] Unknown:
This is an eternal debate. I've seen this with so many examples: are we bringing people in and forcing them to make decisions? Here's the way I think about it. There's the intention. Right? What's the intention of that user? The intention is they wanna find a vehicle that suits them. The intuition is, okay, is this a good car for me or not a good car for me? I still believe humans have the ability to understand the intent and the intuition. They might have limited capacity for it, but, generally, there is that cognitive ability to go make that decision. And then I think the other thing is, does this make the individual better or worse off? If you're signing them up for something bad, and they choose it because it's the only option, that's making them worse off, so I don't agree with that. But if you show them a better option and they're better off for it, then I'm okay with that. And how do you score and measure that? I think individuals still know that there's an option, so the choice and optionality provide a natural control for it. You could argue that social media doesn't give you the same natural controls, and so it's more of a controlled environment. But, generally, a lot of the stuff we work on lives in these sort of open ecosystems, so it's less of a, are we forcing them to choose, like Twitter does, because that's what we showed them.
So, yeah, my general rule of thumb is, if you make their lives better, you give them the choice, and you give them both intention and intuition, then, generally, you're okay.
[00:40:14] Unknown:
And so now digging a bit more into Snowplow itself. We didn't dig too much into the specifics of it because I did a pretty good interview with Alex Dean, 1 of the creators of the product, about 4 years ago now. So I'm wondering if you can give some of the notable updates or changes, both to the technical aspects of how Snowplow is built and maintained, and to the shifts in the product focus and the general intent of where you're trying to take the product and the business.
[00:40:45] Unknown:
A lot's changed and grown. I'll just highlight some of the things we do that we've talked about today. First, obviously, we focus on data creation. We have all the ability to create the data, maintain pipelines, etcetera. But what I think is interesting is our ability to enrich on those pipelines and also dynamically adjust the schema. We do schema evolution with you: we allow you to update and evolve the schema. That's a really big problem that we solve for you as part of the platform. And then the other thing we've really focused on a lot is just the different types of cloud data warehouses we support.
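A minimal sketch of what additive schema evolution looks like: a new optional field gets a minor version bump, and old events still validate against the new schema. The version strings mimic Snowplow's MODEL-REVISION-ADDITION style of schema versioning, but the schemas themselves are invented for illustration.

```python
# Hypothetical schema registry: each version says which fields are required
# and which are optional. 1-0-1 adds an optional field, so old producers
# keep working unchanged against the new version.
SCHEMAS = {
    "checkout/1-0-0": {"required": {"order_id"}, "optional": set()},
    "checkout/1-0-1": {"required": {"order_id"}, "optional": {"coupon_code"}},
}

def valid(event: dict, version: str) -> bool:
    s = SCHEMAS[version]
    fields = set(event)
    return s["required"] <= fields and fields <= s["required"] | s["optional"]

old_event = {"order_id": "o-1"}
new_event = {"order_id": "o-2", "coupon_code": "SPRING"}

assert valid(old_event, "checkout/1-0-1")      # old data still conforms
assert valid(new_event, "checkout/1-0-1")
assert not valid(new_event, "checkout/1-0-0")  # new field unknown to 1-0-0
```

The asymmetry is the point: evolving forward is safe for existing producers, while the old schema version correctly rejects events it never promised to describe.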
And so, you know, we see a lot more customers now just default to Snowflake. We see a lot more Databricks. We've still got a lot of Redshift out there, and BigQuery is still there. And so we've spent a lot of time optimizing the performance of those loaders, just because in some cases we're talking billions of events a day. I just think the way data is scaling now is super exciting, but there's also just a lot of technical complexity that we take care of for customers. There's also the way that we deploy this now. There are 2 main ways we see this deployed.
1 is what we call private SaaS, which is effectively where we deploy inside of a customer's VPC. So if you're a large multinational and you wanna have complete control inside your own VPC, we effectively backplane and deploy inside your environment, which maintains the data integrity for behavioral data. The other is a cloud version, which effectively allows you to spin it up. We'll manage the front end infrastructure. The data stays on your side and goes to your data warehouse, but we take care of the provisioning of the platform per se. Which 1 you choose really depends.
The private VPC option, you know, is much more enterprises. All the enterprise controls you'd expect; you see a lot more of our larger organizations doing it. The cloud version, we see all sorts of departments, smaller organizations, and some larger organizations trialing. So there's a bunch of options there. And I think, to a lot of your questions, this data is considered to be sort of the most powerful asset companies have, so people wanna treat it accordingly, or they know they need to, so they create it that way. We've also been evolving what we call the tracking catalog, so I'm glad you brought that up. We saw these spreadsheets and said, we've gotta do better. So we built a tracking catalog, which really helps customers. It includes metadata and version tracking.
We also have, obviously, the schema management I talked about. A lot of that, what I call data product management, all the different things you need to do to create data products, is really what we're focusing on building out. You know, if a year from now, when you need to build a data product, the answer is Snowplow, then I've pulled off my mission. Because there's all this complicated stuff that data engineers need to solve for, and at least my vision, the team's vision, is that we want to build the tools to make that data product management lifecycle really powerful inside Snowplow. So that kind of gives you an idea of where we're heading. You know, I think the last thing that I find really interesting is that we're often asked, what's the best practice for ecommerce? What's the best practice for customer data?
We're figuring out how we can provide more of that insight, because we did a survey a little while ago, and a lot of the people came back like, look, this is a really hard job. People have ridiculous expectations of us. We've got lots of cold start problems that shouldn't be cold start problems. So is there a way to share more of that? 1 of my other projects is being able to make that more available for other data engineers, so at least you can see how stuff was built. We give you a starting place, an accelerator, to work with. And so that's something else we're working on. If I can have someone build a composable CDP without having to reimagine the world and go talk to 50 consultants, and I can make that available for more people,
[00:44:20] Unknown:
I think that's a good thing. And I hope they choose Snowplow to build on top of, but we're gonna continue with that open source thesis, and sharing a bunch of that insight is also pretty important to me too. Another interesting aspect of the problem space that you're working in, and some of the ways that it plays off of the overall trends in the data ecosystem, is the idea of the so called modern data stack, which seems to have catapulted a lot more businesses into the space of feeling comfortable and capable of actually building out a data platform and starting to bring in some of these analytical capabilities and analytical workflows.
And I'm curious if you have seen any meaningful impact on the adoption and uptake of Snowplow and some of the maybe kind of degrees of sophistication that teams are able to move into as a result of the broader availability of these managed data tools and data platforms so that they can actually start to think about how to incorporate data creation and behavioral data into their overall kind of data strategy.
[00:45:26] Unknown:
Look, I love what the modern data stack has done for a lot of organizations. Right? Previously, to get some of these capabilities, you'd have to spend millions of dollars and so much time doing it. And so it really derisked the opportunity to get these outcomes and exposed these technologies. At the same time, it added so many choices. Right? And so I think often there's this problem of, what is the choice, how do I start, and where do we go from there? What's been awesome is that we now know most of the customers we work with have some cloud data warehouse that is highly scalable. There are no IOPS conversations anymore. They put it on Snowflake or Databricks, what have you, and it just scales. That gave us just a lot more accessibility. I think the other thing is that the business sort of knows that this data engineering, data side of things exists and expects it to solve world hunger for them, so the business asks so many questions of those teams.
And I think up until this year, right, the business was just putting more money into data engineering and data analytics. Maybe not into the people, but into more tools, and more requests kept coming. And so that created 2 things. 1 is I think that data engineers have a seat at the table now that they didn't have before in the business. And secondarily, I think the business has also asked the data teams to provide insights, not just the raw data, to provide something that's consumable by the business. And that shift is something I've seen especially this year, since we've seen budgets get a little smaller and people be a little more conservative in how they're making purchase decisions. Like, hey, we understand this stack exists, but how do we transfer the value between the great data work the teams are doing and what the business is asking for?
And so that's definitely been a theme. If I just talk about the conversations I've had this last year, it's more like, we understand the modern data stack can build this great stuff, but how do we get this to a business outcome faster? How do we define value from this? And I think for us, the way we would talk about it with our customers, and even with each other, is that by producing data products that the business can consume, the business sees real value in that product. That's the value exchange you create. So I think the modern data stack has given us all these tools to create these data products far faster, in way smarter ways, with so much data, way more accurately, and on and on.
But at the same time, there are so many tools that we kinda wanna give guidance as well on top of the modern data stack. And so that's been key for us, how the modern data stack has opened up opportunities that weren't there before. Secondarily, I think the other thing is you've gotta be able to work with that ecosystem. And so for us, we've got an open source option there for folks. We work with as many folks as we can. We have an open architecture in the way we think about how the data is created, our open source approach. And that, I think, also helps put this into the hands of more people and more engineers.
For that, I'm super grateful. You know, 15 years ago, there wasn't a huge community of data engineers. It was a much smaller DBA community, and I'm pretty grateful for what the modern data stack has also created in terms of capability in the ecosystem as well.
[00:48:39] Unknown:
Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day, every day, especially once they realize that 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, and spreadsheets are already available as plug and play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from over 40 countries to set up and run low latency ELT pipelines with 0 maintenance. Boasting more than 150 out of the box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get real time data flow visibility with fail safe mechanisms and alerts if anything breaks, preload transformations and auto schema mapping to precisely control how data lands in your destination, models and workflows to transform data for analytics, and reverse ETL capability to move the transformed data back to your business software to inspire timely action.
All of this plus its transparent pricing and 247 live support makes it consistently voted by users as the leader in the data pipeline category on review platforms like g 2. Go to data engineering podcast .com/hevodata today and sign up for a free 14 day trial that also comes with 20 fourseven support. And in your work of moving into your role in Snowplow and working with customers and helping to understand some of the strategic and tactical elements of data creation and behavioral data management? What are some of the most interesting or innovative or unexpected ways that you have seen these practices or Snowplows specifically employed?
[00:50:12] Unknown:
I'll give you another 1, which is which I haven't seen I've seen it, like, in some of the experimental elements, which is using Snowplow to create behavioral data between warehouse robots. And, effectively, it takes each of the different warehouse robots, creates an inventory for each individually, and then tries to bring them back together into 1, you know, swarm of robots themselves. And so I think that's, you know, not just looking at the behavior of, like, interactions over a point in time, but also how interconnected systems can work. That, I think, was a pretty interesting use case, which I think most of us kinda geek out on, like, the swarm mentality.
The other 1, I think, is kinda interesting and probably something that we all experience every day. If you ever hit, like, a paywall, you know, you go to, like, an article, it sort of pops up on your feed, you click it, and then you get, like, to the first paragraph and it says, like, please pay here. For the longest time, that paywall optimization was kind of just defined by editors. It's like, you know, the sports articles apparently were the ones to, like, not convert on, so you gave away lots of sports articles, but then the, like, gossipy ones were better for conversion because it was like, oh, it's gonna be a point in time. So this whole, like, not particularly scientific, kind of rote way of, like, figuring out what to optimize the paywall on. But what we saw happen was companies like The Globe and Mail started to build their own system for dynamically adjusting the paywall, and they actually went and disrupted the paywall industry by taking that behavioral data. And they created a company called Sophi.
I think it's spelled s-o-p-h-i, and it basically is a platform that dynamically optimizes paywalls. And what's cool with that, it turns out that the gossipy articles were the highest performing ones for converting. So you can thank the data for getting more Kim Kardashian articles. And so, like, just that level of, like, behavioral information that can inform lots of other things became a really powerful product. You know, those are 2 examples, I think, where you see not just the individual behavior of, like, 1 entity, but then how you take multiple entities and turn it into some, you know, profitable or usable service as well.
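To make the paywall idea concrete, here is a toy sketch of deciding which article categories to paywall from observed conversion behavior rather than editorial intuition. The event shape, numbers, and threshold are all hypothetical; this is not Sophi's actual algorithm.

```python
from collections import defaultdict

# Hypothetical behavioral events: (category, hit_paywall, subscribed)
events = [
    ("sports", True, False),
    ("sports", True, False),
    ("gossip", True, True),
    ("gossip", True, True),
    ("gossip", True, False),
    ("politics", True, True),
    ("politics", True, False),
]

def conversion_rates(events):
    """Aggregate raw behavioral events into per-category conversion rates."""
    hits = defaultdict(int)
    subs = defaultdict(int)
    for category, hit_paywall, subscribed in events:
        if hit_paywall:
            hits[category] += 1
            if subscribed:
                subs[category] += 1
    return {c: subs[c] / hits[c] for c in hits}

def should_paywall(category, rates, threshold=0.4):
    """Paywall only categories whose observed conversion beats the threshold."""
    return rates.get(category, 0.0) >= threshold

rates = conversion_rates(events)
# With this toy data, gossip converts at 2/3 and sports at 0/2,
# so the gossip articles get the paywall and sports stay free.
```

The point of the sketch is the feedback loop: the paywall policy is recomputed from behavioral data instead of being fixed by an editor's hunch.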
[00:52:24] Unknown:
In your own experience of working in this space and working with customers and as well as trying to use a snowplow and behavioral data for solving your own problems, what are some of the most interesting or unexpected or challenging lessons you've learned in the process?
[00:52:37] Unknown:
I think my advice is, you know, really do think through, like, how you think of your tracking plan up front. And so that sounds so simple. But if you do that right, everything else flows from there, and you can evolve it. What I see go wrong is someone starts with, like, all these different things at once and tries to converge and conform. Where I see it work is, like, start simple. Start with a single tracking plan for your website and define your customer. And then once you've defined it there, evolve your schema to support a mobile application. In some ways, where I see it go wrong is you try to build these, like, massive master data management projects, you try to solve data for the whole company. I don't think I've ever seen 1 of those be successful.
I'm sure they have been, but I can't think of too many that have, like, been like, alright. We nailed it. Our data problem's now solved. And what I think Snowplow allows you to do is focus on choosing 1 problem statement, and then you can evolve it to the next 1 without having to, like, massively refactor all of that schema. So I think that would be the thing that I've seen that allows data engineers and data teams to evolve with their deployment. Generally, when you see these huge projects, I try to break them down into smaller chunks.
So those are kind of the challenges I've seen, and I think the other thing is there's a bit of a cultural shift, I think, with Snowplow. So if you've been doing ETL models forever, the concept of creating data sort of seems a bit foreign. Like, you know, can we create enough? And, like, you know, what's the difference between data creation and data exhaust? And, like, you know, is it really that broken? And so there's this sort of moment I see working with customers like, well, ETL works, and you sort of show them data creation. They're like, this actually works. We've just had this muscle memory to ETL things for so long. I think it definitely is a different way of thinking about it.
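The "start simple, then evolve your schema" advice above can be sketched in a few lines: a tracking plan starts with a single versioned event schema for the website, and the mobile use case arrives as a new schema version with an optional field instead of a refactor. The schema keys and field names here are illustrative, not Snowplow's actual Iglu registry format.

```python
# Hypothetical tracking plan: each event schema is versioned, and evolution
# means adding a new version rather than rewriting existing consumers.
TRACKING_PLAN = {
    "page_view/1-0-0": {"required": ["user_id", "page_url"], "optional": []},
    # Evolved for the mobile app: same required core, new optional field.
    "page_view/1-0-1": {"required": ["user_id", "page_url"],
                        "optional": ["device_type"]},
}

def validate_event(schema_key, payload):
    """Check an event payload against the tracking plan before sending it."""
    schema = TRACKING_PLAN[schema_key]
    missing = [f for f in schema["required"] if f not in payload]
    unknown = [f for f in payload
               if f not in schema["required"] + schema["optional"]]
    return not missing and not unknown

web_event = {"user_id": "u1", "page_url": "/pricing"}
mobile_event = {"user_id": "u1", "page_url": "/pricing", "device_type": "ios"}
```

Because old events stay valid against the old version, downstream models built on 1-0-0 keep working while new producers adopt 1-0-1.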
[00:54:30] Unknown:
And so for people who are trying to make their lives easier or be able to build more effective analytical and machine learning systems, what are the cases where Snowplow specifically or behavioral data in general might be the wrong choice?
[00:54:46] Unknown:
Yeah. In those scenarios, like, I like to ask a customer if they're data first or data centric and, like, where they sit on that modality. And I think if an organization just wants to magically have this behavioral element and doesn't consider themselves to be data first or data centric, I think, generally, I'm like, okay. We're gonna coach them towards thinking that way. Or, okay, if you don't believe that you can solve this with data, you just want the end outcome, I feel like that's usually an indication they haven't reached the data maturity that we might be looking for. That's 1. For me, the other thing I look at is like, okay.
You know, how do you think of, like, how you present the data? So we're not gonna build all the fancy dashboards for you. We're gonna plug into existing systems. We don't come with a bunch of, like, you know, graphs up front. Again, if you're data first or data centric and you've got visualization teams, you've got enterprise systems, that's kinda where we sit. So we do require some level of data sophistication. So you can't just deploy and, you know, point and click and come out with, like, you know, a pretty picture at the end. You gotta do some code and do some data engineering.
[00:55:52] Unknown:
As you continue to work with your customers and work with the business, what are some of the things you have planned for the near to medium term for Snowplow or any particular projects that you're excited to dig into?
[00:56:03] Unknown:
Data product management is a big area of focus for us. Like, what does that life cycle look like for data products and the evolution, how you create them? But we're also thinking about how we talk about data product accelerators. So I talked a little bit about how we think about how we actually provide the path to building a data product. Do you remember I talked about, you know, the pain people feel in, like, building data products and how hard it is to repeat them? So I spent a long time thinking about this. And what I realized is, like, you can't just give someone a recipe for a data product because everyone has, like, different equipment, different elements. So what we do with a data product accelerator is we really think about the end outcome of the data product, and then we kinda take people through a cooking show. Like, hey. Here's what you set up. Here's what you do at this stage. Here's how you deploy your schema. At this stage, think about, you know, connecting these pieces up. Here's how we get to that end outcome. And so we are on the path to produce a number of data product accelerators. We started off with, like, the Google Analytics ones.
We also look at things like ecom. We have things around GDPR. And so we're just taking a lot of that knowledge we've built up and building it into these accelerators to help customers and data engineers build their data products. And, you know, I think as a data engineer or, you know, probably not a full time DE anymore, but I definitely still have that in my blood, that if I can help people start out with the idea of what they want and give them a walkthrough to get, you know, half of the way there, I think that's, like, a good service for data engineers, but also for the business. Like, the business is pretty ruthless now asking about time to value and how fast you build these things. So when someone asks you for these outcomes, I'm really hoping that from Snowplow's side, we can build these accelerators. So you're like, alright. Someone needs a composable CDP. We'll deploy Snowplow. We'll run through these steps with them, going, hey, customer or internal customer, here's what you need. So you can kinda build what the customer wants way faster. And you apply, you know, schema evolution and everything else on it. I just think it's a big amount of value we can help data engineers deliver for their organizations. Those are 2 things that I'm working on at the moment. And then the last is just more ways of deploying.
So we talked about, like, the loaders. I know it's sort of in the bowels, but getting really efficient about how we load Databricks and Snowflake and others has a huge effect, especially when we're getting up into the millions or billions of events a day. So, yeah, those are 3 things that are coming. Definitely check out the data product accelerators. Data product management is coming with a tracking catalog. It'll officially be out, you know, when this podcast comes out, and so that's another thing we've got going. And, you know, for me, like, I'm just trying to get as much new stuff out as possible. Like, it's fun, but, also, I know I can help people that are my friends and people that, you know, face a lot of problems that we face.
[00:58:36] Unknown:
Are there any other aspects of the work that you're doing at Snowplow or the ways that data creation and behavioral data can be used in analytical and machine learning contexts that we didn't discuss yet that you'd like to cover before we close out the show?
[00:58:49] Unknown:
Yeah. We didn't talk a lot about, like, feature engineering and just the advantages of how data creation solves that. So a lot of my background most recently has been in ML. And so the beauty with doing data creation is you really introduce no noise into it because you create the data. So as long as you're constantly evolving, you know, collecting more or adding more features, it's really powerful when you do, you know, feature creation or you create your feature store. And so that's the beauty of this whole approach is, in some ways, the problem we really solve is we remove all the noise from your data. We get it ready. And so you'll hear us talk about AI ready data.
You know, generally, what you're doing when you're sort of preprocessing data is you're removing the noise. You're trying to, like, provide an informative feature set. And so when you take Snowplow's approach to this, you effectively have the atomic rows, which effectively become informative features, but also you're able to use a lot less data to train the model. Because there's less noise, because you had to do, like, less transformation, those rows are much more informative. So that then totally changes your loop speed for refreshing a model. So you're able to train models faster, but also detect drift faster. You're able to retrain and get a model back into production faster. So the advantages downstream from having a highly performant data creation and data management approach dramatically change how you drive ML and directly reduce the costs.
And so it's kind of a little bit of an inverse, I think. We went through this big data phase, and, like, big data was super helpful. But now if you're looking to, like, drive applied AI or ML use cases, you really want what I'd call minimum effective data. So you want the smallest amount of informative, effective data to train that model. And you hear there's a number of different concepts around tiny data, tiny ML. There's a bunch of that stuff popping up. But, really, what I think is the secret is if you can create the data to be highly specific, you can have a lot less of it to train on. And, you know, I think the downstream implications of applying data creation just for ML justify switching away from ETL as a way of sourcing the data. And, you know, if anyone's ever gone through the process of preparing data and cleaning data before you do your model, like, I would say 80, 90% of the work is not the math or the, you know, the Python to get it running. It's, you know, what is the data you're working with, and does that data adequately reflect the behavior of who you're trying to, like, work with? You know, some of my most interesting ML discoveries have nothing to do with the data, but it's more about what the people are doing around the data, what we could extrapolate from that data based on, you know, someone's repeated behavior or something we're trying to influence. And so that was probably the thing we didn't touch on, which is, you know, probably a whole other podcast itself, which I think is a really interesting branch of what data creation can do for ML.
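The atomic-rows-to-features idea described above can be sketched simply: when each event row is already well-structured, feature engineering collapses to straightforward aggregation with no noise-scrubbing pass first. The event shape and feature names here are hypothetical.

```python
from datetime import datetime

# Hypothetical atomic event rows -- one well-structured row per behavior.
atomic_events = [
    {"user_id": "u1", "event": "page_view", "ts": datetime(2023, 1, 1, 9)},
    {"user_id": "u1", "event": "add_to_cart", "ts": datetime(2023, 1, 1, 10)},
    {"user_id": "u2", "event": "page_view", "ts": datetime(2023, 1, 1, 11)},
]

def build_features(events):
    """Roll atomic rows up into per-user features for a feature store."""
    features = {}
    for e in events:
        f = features.setdefault(e["user_id"],
                                {"n_events": 0, "n_cart_adds": 0})
        f["n_events"] += 1
        if e["event"] == "add_to_cart":
            f["n_cart_adds"] += 1
    return features

features = build_features(atomic_events)
```

Because the rows were created intentionally, every one of them contributes signal, which is the "minimum effective data" point: fewer, more informative rows per model refresh.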
[01:01:46] Unknown:
Yeah. Well, fortunately, I actually have a whole other podcast on machine learning. So we'll have to set up another conversation around that. Definitely interested in discussing kind of the ways that the behavioral data and data creation aspect can feed into the feature engineering and how that might hook into things like feature platforms or feature stores and the overall machine learning life cycle. So as you said, a whole other podcast episode just on that. But for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Look, I think we've got a lot of tools and technology,
[01:02:28] Unknown:
but what I think is really missing is how to bring those together. And so I do think there's a real opportunity for us to share more and provide more about how we bring those pieces together. And so, for me, the concept of a data product accelerator, I think, is an example of, like, the technology and the tooling is there. But if we can bring that repeatability to the technology and tooling, I think that's gonna help people so much. You know? Like, I don't know if you've ever experienced this when you're, like, stuck trying to do something, and you finally find someone's code that actually works. Like, oh, I totally missed that. I think right now we've got so much innovation as individual tools, but only a small group of people have actually, like, put them together in a way, and it's really hard to access that. So I think for me, if I can bring some of that insight into the product and can help people have access to that as part of the platform, I think that's gonna be really powerful.
And, like, if I can just save, you know, someone an hour or, like, help them get to the point where they can build something an hour faster, that, you know, makes them more powerful and, you know, builds from that community. I think that's a good thing. I think that is something I wanna weave into technology and tooling far more than any other gap I see. There's plenty of other, like, nerdy gaps, but I think if you ask what's the biggest impact, it's, like, helping my data engineering friends figure out how to get to an outcome faster than having to, like, go through a 100 docs, search, drink 50 coffees to try to figure it out. Probably not the most geeky answer you ever had, but I think if you ask me, like, what's a really big gap we could solve for, that's probably number 1. That's a great answer. You'll just have to go drink the 50 coffees for fun.
[01:03:56] Unknown:
Yeah.
[01:03:58] Unknown:
That's right. That's right.
[01:04:00] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Snowplow and your perspectives on the applications and utility of data creation and behavioral data for being able to build data products and analytical systems. So I appreciate all the time and energy that you are putting into making that a more tractable problem for your customers and the ecosystem and for taking the time today to share it. And I hope you enjoy the rest of your day. Yeah. Thanks, Tobias. I really enjoyed it. Thanks for the thoughtful questions and the conversation. And good luck to everyone listening in, and hopefully, we get to keep chatting.
[01:04:38] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Behavioral Data with Nick King
Defining Data Creation and Behavioral Data
Tracking Plans and Organizational Patterns
Ethical and Privacy Considerations
Impact on Team Dynamics
Modern Data Stack and Snowplow Adoption
Future Plans for Snowplow