Summary
Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful, you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- Your host is Tobias Macey and today I'm interviewing Ariel Pohoryles about the seven principles of modern data pipelines
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by defining what you mean by a "modern" data pipeline?
- At Rivery you published a white paper identifying seven principles of modern data pipelines:
- Zero infrastructure management
- ELT-first mindset
- Speaks SQL and Python
- Dynamic multi-storage layers
- Reverse ETL & operational analytics
- Full transparency
- Faster time to value
- What are the applications of data that you focused on while identifying these principles?
- How does the application of these principles influence the ability of organizations and their data teams to encourage and keep pace with the use of data in the business?
- What are the technical components of a pipeline infrastructure that are necessary to support a "modern" workflow?
- How do the technologies involved impact the organizational involvement with how data is applied throughout the business?
- When using managed services, what are the ways that the pricing model acts to encourage/discourage experimentation/exploration with data?
- What are the most interesting, innovative, or unexpected ways that you have seen these seven principles implemented/applied?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to adapt to these principles?
- What are the cases where some/all of these principles are undesirable/impractical to implement?
- What are the opportunities for further advancement/sophistication in the ways that teams work with and gain value from data?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Rivery
- 7 Principles Of The Modern Data Pipeline
- ELT
- Reverse ETL
- Martech Landscape
- Data Lakehouse
- Databricks
- Snowflake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state of the art pipelines to collect all of your data, build a complete view of your customer, and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudderstack today. This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data diffing to compare production and development environments and column level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy.
Datafold integrates with dbt, the modern data stack, and seamlessly plugs into your data CI for team wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. Your host is Tobias Macey, and today I'm interviewing Ariel Pohoryles about the seven principles of modern data pipelines. So, Ariel, can you start by introducing yourself? Sure. Hey, Tobias. Thanks for having me on the show. So as you mentioned, my name is Ariel.
[00:01:31] Unknown:
I've been in the data space for around 20 years now, in different roles and types of organizations. I started my career in what is now called data engineering. Back then, it was just being a data professional in a large corporation, part of the internal data team developing data pipelines and reports internally. I then moved on to a product company in the data analytics field, working with organizations, evaluating technologies, and helping them choose the right one. And later on, I moved to lead a data consultancy, helping organizations implement the different data products that are available out there on the market and, obviously, their own custom solutions.
So I'd like to think I've seen all angles of the data implementation life cycle. The part I liked the most was always the product part, where I find the innovations coming through at the fastest pace. And that's why I recently joined Rivery, leading the product marketing here. If you haven't heard of Rivery, we are a startup in the space of data integration. We offer a SaaS ELT platform that really simplifies end to end data integration processes, with options in the platform ranging from no code all the way to low code and even high code, handling everything from data ingestion to data transformations, orchestration, and even reverse ETL. All of that with a
[00:02:52] Unknown:
solid foundation of DataOps management to simplify the process. So that's a bit of an intro about myself and who I work for. Yeah. And for people who do want a deeper dive on the Rivery product itself, I did do an interview with one of the founders a little while ago, which I'll link in the show notes. And in terms of your experience working in the data space, do you remember how you first got started working in data?
[00:03:14] Unknown:
Yeah. I actually majored in information systems, so data was always on my natural career path coming out of school. When I was graduating, I already started working for a small company named Intel as part of their internal data team. So I've always been in and around data. I think I was fortunate to choose a field that is becoming more and more important to businesses as they keep improving their offerings and operations using data. And to me, what's interesting is that it's also true not just in our professional life, but also in our day to day life. Data is really everywhere. And having studied and worked in the field for so long, from my perspective, it's really great to see how data is helping bring value, again, not just to businesses, but in our day to day life as well.
[00:04:02] Unknown:
And for the conversation at hand, at Rivery, you published a white paper identifying 7 principles of modern data pipelines. And before we get into that, I wanna start by getting a definition of what you mean by modern in the context of data pipelines.
[00:04:18] Unknown:
I think when you start a discussion around what modern is, it's very easy to focus on the technologies behind it. Is it cloud first? These days, obviously, does it have generative AI in it? And I think, yes, that's obviously a big part of what modern is, and definitely cloud first is a huge component of what a modern solution looks like. But I always like to think about it from the reasoning perspective of why we need those technologies in place. When I started my career as a data engineer, again, back then it wasn't called data engineering, I was dealing with very different requirements for data pipelines.
Data pipelines back then had very few data sources they had to ingest data from. In most cases, it was a single database, maybe a simple CSV or Excel file. I typically had a single target to load it into. It could be a SQL Server database on prem; that was the popular choice back then. I had pretty small data volumes that I had to ingest, and the business requirements were calling for data to be loaded maybe once a week, sometimes even once a month for a month-end close. Today, data professionals really need to deliver data pipelines that are much more complicated across all those aspects.
There are many more data sources they have to ingest data from, starting with internal data sources, but also a huge amount of external data sources. You know, there are 40,000 SaaS applications out there in the world, and each organization on average is using about 40 to 400 SaaS applications. That's a huge number of data sources that data could live in and need to be ingested into your pipeline. The data volumes are also far, far greater. The whole big data era is now gone as a buzzword, but data volumes are still much more significant than what I used to have when I built my data pipelines back in the day. We also see that data pipelines do not just end in one single database. They may need to ingest data into a data lake and then into a data warehouse, and maybe there are multiple data warehouses.
Sometimes, if you really push the solution, you also bring the data not just into a lake or warehouse, but into an operational application. So that aspect has changed as well. And, of course, the data latency: businesses are now asking, for the most part, for near real time data delivery, as close as possible to real time. So all those aspects are really what, in my mind, defines what a modern data pipeline needs to provide. And, yes, being cloud first is a key component to being able to deliver on those requirements. But first and foremost, a modern data pipeline, in my mind, is one that delivers on the requirements that the business needs.
[00:06:50] Unknown:
And just to add some juxtaposition as well: if we're talking about modern data pipelines, then the corollary is probably legacy or old data pipelines. I'm curious what are some of the aspects of these seven principles that set them apart from the way we used to meet requirements around data usage in organizations?
[00:07:16] Unknown:
Yeah. So we've categorized this as seven different principles. You mentioned technology, and that's definitely where it starts. I talked a little about being cloud first. I think that's a key component to being able to set up an infrastructure that can scale and keep up with the pace of the modern data pipeline, essentially the current business requirement of being able to set up a new data pipeline very quickly. So you need the ability to do that without having to deal with infrastructure management: zero infrastructure management, which is obviously what the cloud provides.
Then there's the whole shift from an ETL approach to an ELT first approach, where you first bring the data in as quickly as possible and then transform it in place, benefiting from the cloud's endless compute power to then leverage and use that data for other use cases. That movement toward an ELT first mindset has been underway ever since Redshift arrived about 10 years ago, and obviously with Snowflake and the other cloud data warehouses it is becoming more and more standard. The way you configure those data pipelines, so they can support the speed of delivery that the business requires, is another important aspect. We used to see systems that were quite heavy in terms of user interface, with a lot of options to drag and drop and configure different components.
I think we're seeing a move away from those heavy GUI based applications, because it just takes longer to learn them and the different knobs and features you get with those types of interfaces. The common approach to reduce the learning time is really just to choose a system that everybody can learn and pick up in a few days. And that means the system needs to speak the common languages that everybody speaks, which, in the data world, are SQL and Python. So as long as the system enables you to do transformations and bring your data in the way you want with SQL and Python, that's a key component to being able to modernize and deliver on the time to value that the business is asking for. The fourth principle we've noticed more and more in the past years is the need to support multiple storage layers for our data. It used to be that all the data could live in the warehouse, but data volumes, the cost of storing those larger volumes, and different applications and use cases start dictating a different architecture, where data needs to be stored in different locations. Sometimes it is in the data lake, sometimes it is in the warehouse, and sometimes an organization has a hybrid approach with multiple warehouses in place, even from the biggest rivals. So being able to store data in different locations is a key principle that gives us the flexibility to support the different use cases of a modern data pipeline. The last three are about really showcasing the value of the modern pipeline, especially in 2023, where every expense is being scrutinized, budgets are tightening, and we really want to show the value of each business operation that we take on. The ability to use the data in the most efficient fashion is key, and that starts with bringing data not just to a storage layer such as a lake or a warehouse, but to the business applications where the business users are working and operating, what's known in the industry as reverse ETL.
So being able to support that in a data pipeline is key as well, to close that last mile and bring more value faster to our users by providing the right data point in the right place where they operate, not locked in a warehouse or a lake where it's harder for them to access. Then there's the ability to manage all of that. We talk about a single data pipeline, but the reality is that in every system we architect there are many, many different data pipelines, or many different jobs that execute different data movement operations. So you need to be able to see where the data is, where it's being blocked if there are failures, and respond to those very quickly: getting full transparency into what's happening, both from a health perspective of our data movement and from a cost perspective, as I mentioned earlier. These are critical components to deliver on both the consistency and the quality that the business requires from our data, while also keeping the spend at a low level and optimizing as much as we can. And it all really comes together in the ability to do all of this very, very quickly. Everything that can be commoditized in a data pipeline should be. There are a lot of components that used to take a lot of time for data engineers to develop, such as coding against a certain API of an application or data source you want to ingest from. There are many different tools, like Rivery, for example, that provide that no code experience, take away all the management of those APIs, and, in a few clicks, bring the data from Salesforce, from Shopify, from Facebook Ads, from all the different applications and sources you work with, into your destination very quickly.
That's one example. We see other areas of the data pipeline that can be commoditized, such as the way we manage increments when loading data into our warehouse, or the way we map the fields from source to target. All those things are components that can be streamlined to give the engineer the ability to move much faster and really focus on the way they deliver data to the business and model it, versus the way they actually ingest it and bring it to the target. That's what makes up the last principle of being able to deliver much faster and bring value faster to our users.
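To make the ELT-first principle described above concrete, here is a minimal, generic sketch of the pattern: land the raw data first, then transform it with plain SQL inside the warehouse. This is not Rivery's API; sqlite3 is used only as a stand-in for a cloud warehouse, and the table and column names are invented for illustration.

```python
# A minimal ELT sketch: land raw data first, then transform it with SQL
# inside the warehouse. sqlite3 stands in for a cloud warehouse here.
import sqlite3

raw_rows = [  # pretend this came straight from an extract ("EL") step
    ("2023-06-01", "google_ads", 120.50),
    ("2023-06-01", "facebook_ads", 98.10),
    ("2023-06-02", "google_ads", 87.25),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_ad_spend (spend_date TEXT, channel TEXT, cost REAL)")
conn.executemany("INSERT INTO raw_ad_spend VALUES (?, ?, ?)", raw_rows)

# The "T" happens after the "EL": the transformation is plain SQL run in the
# warehouse, so anyone who knows SQL can own this step.
conn.execute("""
    CREATE TABLE daily_spend AS
    SELECT spend_date, SUM(cost) AS total_cost
    FROM raw_ad_spend
    GROUP BY spend_date
""")

for row in conn.execute("SELECT * FROM daily_spend ORDER BY spend_date"):
    print(row)
```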
[00:12:55] Unknown:
So in the process of identifying and codifying these different principles, I'm curious what the main areas of focus were as far as the ways that data was going to be used. Data is a very amorphous concept; it can have multiple different applications. Most of the time, when people are talking about data engineering, they're talking about it in the context of business analytics, where usually the consumer will be some sort of business intelligence. You also mentioned reverse ETL, which to some degree fits into that same flow of: I need to be able to understand, act on, and support my core business needs for analytical purposes, and feed that back into the operational aspects. Obviously, there are also cases where machine learning is the target consumer. There are user facing analytics.
There are cases where the data is actually being used for scientific research purposes or just statistical analysis. I'm wondering what were the main points of focus that you were looking at to identify these principles, and how universal they are to this breadth of applications for that data?
[00:14:05] Unknown:
Yeah. Great question. When Rivery started out, what we saw initially was a lot of customers coming to us with use cases or applications around marketing data. Marketing is a huge world. I don't know if you've ever seen those MarTech landscape visuals that show the number of applications that exist in the MarTech world, which is growing exponentially, I think, every year. So there's a huge number of data applications, essentially SaaS applications, that marketers typically work with, and they need to pull data from those different applications to fulfill their different use cases. It could be a use case where you're trying to analyze ad spend across different ad channels. They have data in Google Ads, they also have data in Facebook Ads, LinkedIn Ads, and Bing Ads, and they want to see all of that together in a single view and analyze the ad spend per channel.
We also see digital agencies analyzing social media data, trying to segment audiences based on the right social media channel and understand what converts best down their funnel. So these were the initial use cases and applications that really helped us understand what is required from us when we deliver a product that enables the creation of those modern data pipelines. They came with a good number of different applications and data sources. One of the things that we noticed very quickly is that it was hard for us to keep up with the number of applications they brought in every day. There was just an endless number of SaaS tools that they wanted to connect to. For many of those, the popular ones, of course, we created a no code experience out of the box with managed APIs for those data sources natively. But then we also saw many other niche marketing applications that each organization uses for their own specific requirements.
And so one of the things we needed to do was deliver a mechanism that enabled quick connectivity to what we call custom applications, the ones we don't offer natively in our tool, while still maintaining those principles of zero infrastructure management, very quick time to value, and full transparency. So that is one of the things we had to add to Rivery: the option to build custom data connections to any SaaS application out there, just via low code configuration of a few endpoints that we populate in the specific pipeline that pulls data from those custom applications. That was an eye opening experience for us, working with these types of organizations, not just the classic ones with the common applications that everybody else uses, but also the niche ones that are harder to find connectors for. The other thing that we saw working with organizations with marketing use cases is the need to scale those types of data pipelines very quickly. For example, a digital agency typically serves a lot of different customers with very similar data extraction and data transformation across different data sources. They always pull data from, let's say, Facebook and TikTok and other channels, but then they have to do it in the context of each one of their customers.
So again, zero infrastructure management, plus a few other capabilities, allowed us to create what we call a multi tenant data pipeline experience. That was key for us to be able to deliver on the requirements that those digital agencies had. Lastly, I would mention that the marketing teams we worked with typically had their own marketing data analysts as part of their team, because their work is so involved with data. They had a budget to justify, and their team was not part of the centralized data team within the organization. So for them, having the flexibility to build a data pipeline very quickly, without having access to everything that the organization has at the centralized level, was key.
The other piece that was key is that they needed to connect it back to their business applications. This is where the reverse ETL that we talked about comes into play, because they were so close to the business. They always identified those use cases where data could go beyond the warehouse, where a typical centralized data engineering team would stop, and bring it back into the business application. So I think a lot of the principles that we've developed over the years came specifically from those initial marketing related use cases that we tackled in our early days.
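As a rough illustration of the custom connector idea described above, the sketch below pages through a hypothetical REST API and yields raw records for landing in a warehouse. The URL, auth header, response shape, and pagination parameters are placeholders, not a real MarTech API or Rivery's implementation.

```python
# A sketch of the "custom connector" idea: a few endpoint settings are enough
# to pull data from a niche SaaS API. All names below are hypothetical.
import requests

BASE_URL = "https://api.example-martech.com/v1/campaigns"   # placeholder endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}        # placeholder credentials

def fetch_all_pages(page_size=100):
    """Page through the API and yield raw records for landing untransformed (ELT)."""
    page = 1
    while True:
        resp = requests.get(
            BASE_URL,
            headers=HEADERS,
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        records = resp.json().get("data", [])
        if not records:
            break
        yield from records
        page += 1

if __name__ == "__main__":
    for record in fetch_all_pages():
        print(record)  # in practice these rows would be loaded straight into the warehouse
```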
[00:18:37] Unknown:
It's easy to get lost in the technical aspects of the work that we do, because there is a lot of technology and there are algorithmic aspects to the work to be done, but the end goal is to empower business use cases and have an impact on the real world. I'm wondering what you have seen as the application of these principles and their ability to influence the capacity of organizations and their data teams to keep up with the use of, and requirement for, data in these business contexts.
[00:19:11] Unknown:
Yeah. I agree with you. As a data geek, speaking about technology is always a lot of fun for me, getting into how this tool operates and what that tool can do compared to another. It's very interesting. But at the end, it's about how an organization can keep up with the pace of the data that the business demands. And obviously it's not just about delivering the data, but delivering data in a way that is usable, with the right quality level and the right consistency as well. The foundation of all of that starts with the ability to move very quickly in a simple fashion. Many of the principles that I've talked about relate to that ability to deliver faster. And if we look at the foundation, it's really about the fact that you don't need to manage infrastructure. You have zero infrastructure management, because using a fully SaaS ELT architecture allows you to shift the discussion from, oh, I need to set up a VM and that VM needs to have x compute power and I need to maintain it and all that, to, let's just focus on the data that we actually bring to the business and build that pipeline very quickly. The other thing we saw is that to keep up with the pace of the business, we need to make sure the business realizes that data is being delivered to them. It's kind of an endless loop: the more they see data being valuable to them, the more they ask for it, and the more use you get out of your architecture and the solution you have in place, which drives the end goal of making better decisions with the data available to us. So delivering the data to the right place where the business operates was key to us, and the ability to do it in a flexible fashion, not just to the common set of applications but to the specific ones they care about, was a major game changer for our customers. I can give you a few examples. There are common applications that store your data, such as Salesforce and HubSpot, tools that a lot of organizations use these days. So reverse ETL into those applications is kind of a basic one.
But we also see customers bringing the data into more unique applications that the business users are using, or even into tools that honestly aren't data tools at all, just communication tools. For example, we have a lot of customers that take business insights and push those directly into Slack channels in a very sophisticated way. Those Slack notifications are sent out based on very specific triggers, with unique insights that help the business users move faster on their operational tasks.
This is something that I think is key to making sure that the business keeps demanding data at the fast pace that they do, and, along with the other infrastructure benefits I mentioned before, it enables the data teams to deliver that data at that pace.
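As a hedged sketch of the kind of trigger-based Slack notification described above: the webhook URL, the threshold, and the metric values are illustrative placeholders, not a specific customer setup. Slack incoming webhooks accept a simple JSON payload with a "text" field, which is what the snippet posts.

```python
# Pushing a computed insight into Slack rather than leaving it in the warehouse.
# The webhook URL and the trigger logic are illustrative placeholders.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_if_spend_spikes(channel_name: str, today_spend: float, avg_spend: float) -> None:
    """Send a Slack message only when a specific trigger condition is met."""
    if avg_spend > 0 and today_spend > 1.5 * avg_spend:
        message = (
            f":warning: Ad spend on {channel_name} is {today_spend:,.2f} today, "
            f"more than 1.5x the recent average of {avg_spend:,.2f}."
        )
        resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
        resp.raise_for_status()

# Example trigger values; in a real pipeline these would come from a warehouse query.
notify_if_spend_spikes("google_ads", today_spend=950.0, avg_spend=400.0)
```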
[00:22:21] Unknown:
Now, digging into the technical aspect that most of the people listening to this show are actually interested in, I'm wondering what you have seen, at least in broad strokes, as the core requirements for a pipeline infrastructure that can support these principles of the modern workflow, and some of the ways that teams should be thinking about tool selection and evaluation for building up their technical capacity to deliver on those business needs.
[00:22:52] Unknown:
Yeah. Absolutely. So I've mentioned the word cloud multiple times, and I think it can't be stressed enough. We still see a lot of ETL tools out there which are not really SaaS based tools, and I think that is, at some point, going to be what creates a bottleneck and slows down the implementation of your data pipelines at the speed that the business demands. So having the right infrastructure, which is cloud based and can scale up and down very rapidly to meet your demands, without having to manage it, is key.
And it's not just about the management of it. It's first and foremost about the infinite scale that you need to have, because, as we all know, data volumes just increase every day, and we need to be able to support that without having to spend all of our resources and effort on managing the infrastructure. But beyond the cloud foundation, which I think is a must these days, what we've seen is a movement of a lot of no code tools being stitched together, almost duct taped together, to form what's called a modern data stack.
Each one of those tools is very good at handling one part of your data pipeline. So if it's the starting point, the ingestion, where you pull data from different applications and have that no code experience with the managed APIs, as I mentioned earlier, that's great. But that's just a starting point. It's not a full ELT; it's just the EL part, if you will, of the ELT process. Then you need to transform data, so you have to bring in another tool that will help you take the ingested data and turn it into business insights. And if you have many of those pipelines running together, then you need an orchestration engine, both to trigger advanced scheduling and logic and to allow you to manage that entire operation and make sure it's healthy.
And finally, if you have the budget, you may need to bring in another tool for the reverse ETL aspect of the pipeline, bringing data out of your warehouse back into the business applications. So the ability to combine many of those operations into a single platform that maintains those principles, and doesn't overcomplicate things but actually simplifies them, with fewer integrations and a simple, seamless interface to see it all in one place, is a big part of what allows you to modernize much faster and move with fewer moving parts along the way. If I summarize it, it's about being on a cloud based infrastructure, of course, but also making sure that you don't over-architect your environment, and choosing the technologies that allow you to get to your goal as fast as you can. Because what we see these days is that it's mostly a question of speed. How quickly can we answer and satisfy the business needs? How quickly can we provide the data that answers the next question? The technologies that will win are the technologies that allow us to move faster, versus being bogged down with integrations, infrastructure maintenance, and long learning curves. Those are the ones that will slow us down.
[00:26:04] Unknown:
In that suite of technical capabilities that are useful and/or necessary, what are some of the ones that, in your experience, have been the hardest to implement, or some of the aspects that have required the most conversation with data teams to convince them of their value?
[00:26:21] Unknown:
Yeah. In my mind, it's often a question that comes down to build versus buy. I think many talented data engineers out there like to write code. That's part of what they've learned and what they enjoy doing. So a lot of the conversations that we have are around the value of using no code for certain operations and trusting that the system will do whatever they need it to do. I think there's a growing understanding out there that parts of the data pipeline can be fully moved to a no code experience. And the more transparency you give the user about what's happening behind the scenes, obviously, the better; that makes for a better experience and makes it easier for those users to trust those tools. We still see the need to convince people around the cost of running your operations with a combination of your own coded solutions and no code, where and when you can use it.
So I think that's one of the biggest challenges: how do we combine and marry all those things together and make sure the data engineers see the value of leveraging those managed solutions that shorten the time to value and help them focus where they need to focus, which is bringing data in the way it can be consumed and with the right insights, versus focusing on just bringing in the data itself. That's still where we see many of the conversations happening.
And I think most organizations that choose to modernize do realize that the hybrid approach, where you can combine no code and low code, is kind of the best approach. But that's still a challenge depending on the persona you have in front of you, whether they come from a no code first approach or a code first approach. Helping them realize that there's a balance between the two, and having that flexibility to deliver on the more complex requirements, is something that's required to be able to satisfy the requirements as they get more and more sophisticated.
[00:28:35] Unknown:
To that point of the users, which in general is the whole business: the capabilities and the interfaces of the tools that are used for managing the data flow, including things like onboarding new data sources and bringing data through to a point where it's ready for analysis, can have a massive impact on who is able to do those operations. And with the idea that data engineers are already overtasked and generally end up being the bottleneck, what have you seen as the organizational impact of making tool selections that reduce that barrier to entry and reduce the friction involved in bringing more people into the process of working with the data?
[00:29:23] Unknown:
Yeah, a hundred percent. I think almost every conversation we have with a customer around why they chose us starts with: we used to have a legacy tool, and we knew we needed an army of data engineers to maintain it and learn how the tool works. And with Rivery, I could take a few data analysts and, in a few weeks' time, have them building end to end data pipelines without being ETL experts ahead of it. So, yeah, it's definitely expanding the type of user that can build pipelines. As I mentioned, data analysts are becoming more and more heavy users. If we think in terms of the other personas in the industry, you have data engineers, typically the folks who are comfortable coding, and then you have what's called the BI developer. Some of those are comfortable in Python and other languages, and some of them are really just building dashboards in Tableau and other tools. Maybe they know SQL, but they're not familiar at all with infrastructure considerations and things like that. For these users to be able to come in, plug in their credentials to their data source, and then configure a data pipeline is a huge game changer. And I think we see more and more trust from data engineers to open up the keys and environments to those users to build the data pipelines, obviously with the right checks and balances.
But that is really a big part of shortening the time to value, to your point, and opening up the pool of users who can leverage those tools. So definitely data analysts, BI developers, and systems analysts are now able to build data pipelines as well, and I think that's a huge game changer. The whole approach of self serve that Tableau, Qlik, and others brought to the data visualization world is now happening in the same fashion in the ingestion, or ELT, world.
[00:31:08] Unknown:
Another aspect of the tooling question, and the impact that it has on the use and adoption of data, particularly at the organizational scale, is the question of balancing price with usage. When you're using managed services, or even self hosted systems, the incremental cost of new pipelines and new data has an impact on whether or not an organization decides that they want to invest in that expense. What are some of the ways to think about the cost, both in monetary terms and in terms of the opportunity cost of maintaining these systems, and how does that reflect in the ways that businesses are applying data to their operations?
[00:31:55] Unknown:
Yeah. So cost, again, in 2023, is a very important question that more and more data teams are evaluating. They're evaluating the ROI they bring to the business. They evaluate every aspect of it, both the value side and the spend side. And I think managed services, as you mentioned, which oftentimes price those services with a usage based model, are a great vehicle to really fine tune the cost and the spend, and to find those use cases where data brings more value to the business without a huge investment upfront. Of course, you need to make sure that your managed service provides the tools to understand what's happening in it. With Snowflake, for example, for years, and still now, it's an ongoing challenge to understand where my Snowflake costs are coming from and how I can optimize them. There are a lot of tools that both Snowflake themselves and many of their partners have developed over the years to help you manage that cost better.
So the answer is not always that using a managed service is going to be cheaper. Of course not. But if you want to find those avenues where you can reduce your costs and optimize, and still, and I'm going to go back to speed because I think speed is the most important thing these days, deliver on speed with rapid experimentation and exploration of different avenues, the way to go is with a usage based pricing model as provided by those managed services. And typically that comes together with all the benefits of the managed service that allow you to get that speed: the no code experience, and all the other items that are taken care of that you would otherwise need to set up yourself if you were to go and code it, for example. Again, the trick is to make sure that you have the right visibility into how that managed service operates and what cost is going to be incurred from it. It's not always simple. A lot of the pricing models are tuned to pricing metrics and value metrics, which are really about the value at the end. But quantifying those from a technical perspective, and estimating how they will behave, for example, how many rows am I going to ingest from a certain source, is not always easy. A data engineer doesn't always have a good idea of how many rows exist in that data or how many rows will update every month.
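For a sense of what that kind of estimate looks like, here is a back of the envelope sketch. The row counts and the per million row price are made up purely for illustration and do not reflect any vendor's actual pricing model.

```python
# Back-of-the-envelope estimate of usage-based cost from row counts.
# All numbers below are hypothetical placeholders.
monthly_rows_by_source = {
    "salesforce": 2_500_000,
    "facebook_ads": 800_000,
    "shopify": 4_200_000,
}
price_per_million_rows = 0.75  # hypothetical unit price in dollars

total_rows = sum(monthly_rows_by_source.values())
estimated_cost = total_rows / 1_000_000 * price_per_million_rows

print(f"Estimated rows per month: {total_rows:,}")
print(f"Estimated monthly cost:   ${estimated_cost:,.2f}")
```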
So, yes, that's the value that you would eventually bring to the users, new data, but how do you understand how much that's going to cost you? It's not always a simple task. There are different pricing models out there, obviously, from different vendors, and it's interesting to compare them and see which one works best. But in my mind, there's no question that a usage based pricing model is the way to go when you want to be in a world that is modern and provides the speed I'm talking about, to deliver to the business at the right time.
[00:34:46] Unknown:
In your experience of working in the industry, and more recently working at Rivery and codifying these principles, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen teams applying these seven principles, both at the technical and the organizational level, or any interesting moments that you have experienced when talking to customers about these principles?
[00:35:11] Unknown:
Yeah. I think I could probably dissect each one of those principles and point to where it was interesting or innovative in a particular use case. Many of those principles hold for almost any technology, not just data technology. But what was most interesting to me was that movement toward reverse ETL that I mentioned earlier: seeing how we're closing the loop and bringing the data and delivering it to the right place, much faster than the old paradigm of let's just land it in a warehouse, and some analyst will come along, put a BI tool on top of it, and then analyze it. That ability to really push the data into the right application at the right time opened up a lot of interesting use cases for me. I've seen some of our customers doing their whole month end closing process in applications like NetSuite, for example, where you have business users coming into NetSuite and they need to decide which deals are going to be recognized in this month's books versus next month's books. So they look at the list of deals they have.
They check a few and say, okay, what's going to happen to my bottom line if I recognize these deals this month and those deals next month. And as they want to simulate and analyze those scenarios, they just trigger, directly from NetSuite, a job in Rivery that processes the data, ingests it from the different booking systems, brings it back into NetSuite, and presents the bottom line as it would look in that scenario. Then, when they're happy with the right mix and the right scenario, they even have the option to lock it in and say, okay, that's the final version of the truth I'm going to have for this month's close. So this is a type of use case where you have not even a data analyst, but a financial person, a business user, running data pipelines seamlessly from within their own business application, without even knowing that behind the scenes Rivery is running, and still getting their job done using the data in almost real time.
To me, that's a really cool and interesting way of bringing data as close as possible to the business users.
[00:37:19] Unknown:
And in your work of developing these principles, helping customers adopt and adapt to these principles, what are the most interesting or unexpected or challenging lessons that you learned?
[00:37:30] Unknown:
You know, around technology there are lots of challenges that are ever growing. There are always new integrations, new technologies to support, even new systems of storage; a lakehouse from Databricks, for example, does not behave like a Snowflake data warehouse. So there are obviously lots of challenges around that from a technical implementation perspective. But to me, what was more interesting to see, from an organizational perspective, is how organizations are becoming more and more hybrid.
The approach that we used to talk about all the time, of one source of truth, with all my data hosted in one data warehouse, is changing more and more, especially with the ease that we have these days to spin up a new environment. And I think that's fine. We're starting to see different architectures and concepts to manage the data and make sure it's still accurate, even though it's stored in different layers. So that principle of data being stored in different locations across the business was, to me, the most interesting evolution, or lesson, of the last few years. And it was one of the key principles that we had at Rivery as we started to develop the system: to make sure that it's not monolithic in the sense that it always loads into the same storage layer. It has to be flexible enough to support an organization that chose Databricks as maybe the primary lakehouse or warehouse storage location, but also has, say, an HR unit that is using Snowflake, because maybe Snowflake is a bit simpler for those HR analysts, and they need the option to operate their own data pipelines.
Obviously it's HR data, so it needs to be very secure and treated as sensitive, in a way that's almost completely isolated from the rest of the organization and the rest of the engineering team that loads into Databricks. So to me, it was quite interesting to see how that is evolving, and I think we're going to see more and more of that happening, with shifts of data between systems. One of the new areas that we're investing in more and more is being able to ingest data not just from the source applications that we traditionally have, RDBMSs and external SaaS applications, but also from one data warehouse into another data warehouse. Not just for the sake of migration, but for the sake of syncing data across systems and making sure it's all up to date in all locations. So that's an interesting evolution that we started to see more and more as we developed those principles and thought about, okay, we need to support multiple storage layers and destinations.
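To illustrate the warehouse to warehouse sync idea in the simplest possible terms, here is a minimal sketch that copies a table between two connections in batches. sqlite3 stands in for both warehouses, and the table and column names are invented for the example; a real sync would also handle incremental loads, schema drift, and credentials.

```python
# A simple sketch of syncing a table from one storage layer to another in batches.
# sqlite3 stands in for both warehouses; table and column names are made up.
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

# Seed the "source warehouse" with a few rows.
source.execute("CREATE TABLE employees (id INTEGER, department TEXT)")
source.executemany("INSERT INTO employees VALUES (?, ?)",
                   [(1, "hr"), (2, "finance"), (3, "hr")])

# Recreate the table on the target and copy rows over in batches.
target.execute("CREATE TABLE employees (id INTEGER, department TEXT)")
cursor = source.execute("SELECT id, department FROM employees")
while True:
    batch = cursor.fetchmany(1000)
    if not batch:
        break
    target.executemany("INSERT INTO employees VALUES (?, ?)", batch)
target.commit()

print(target.execute("SELECT COUNT(*) FROM employees").fetchone())  # (3,)
```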
[00:39:59] Unknown:
And for people who are hearing these principles, thinking of ways that they might be applied within their environments, what are the cases where some or all of these principles are either undesirable or impractical to implement?
[00:40:12] Unknown:
Yeah. I think a big part of the foundation is based on cloud technologies: being able to have zero infrastructure management and to use an ELT approach versus an ETL approach. Those kinds of elements really only exist in the cloud. So if you are forced to use an on premises deployment, for whatever reason, whether it's regulation or something else the business chose, then I think you lose a lot of those principles, or you can't really implement them in the same way that I'm talking about. But if you are moving to the cloud, or if you're already in the cloud, then those principles are really cross industry and cross vertical, and there are no limitations on using them in that scenario.
[00:41:01] Unknown:
And as you continue to work with customers and keep an eye on the constantly evolving landscape of data, its applications, and the ways that teams are working with it, what are some of the opportunities that you see for further advancement and sophistication in the ways that teams work with and gain value from data?
[00:41:20] Unknown:
I think you can't have a conversation these days without talking about generative AI and how it can make a big impact on all of us in all different areas, and I think that's definitely true for data teams. It's early days, but we already see a lot of tools evolving around helping you write SQL faster, helping you speed up your processes, and maybe giving you better visibility into your pipelines' health and understanding quality issues. This is where we'll see more and more advancement and sophistication, again, all with the goal of shortening time to value. Obviously, we have our own thoughts on that, and plans for how we can leverage generative AI to improve the experience for our users and really speed up development.
I'm sure I'm not surprising anyone here, but I'm pretty confident this is where most of the advancements will come in the months and years ahead.
[00:42:15] Unknown:
Are there any other aspects of the work that you're doing at Rivery and the application of these 7 principles that we've been discussing that we didn't discuss yet that you'd like to cover before we close out the show?
[00:42:25] Unknown:
Yeah, that's a great question. We've touched on different questions earlier in this discussion around the need to stitch too many tools together and duct tape them, and the aspects we've mentioned around the complexity of legacy tools and the learning curve they involve. But I think the biggest challenge that we see these days is finding the right balance between the simplicity that we want to bring to the users and, at the same time, the ability to solve for the more complex use cases. There are a lot of tools out there that have just kept growing over many years.
With each version that's released, more and more features are added to solve for more specific scenarios and use cases. But learning those tools takes a very long time, and even just discovering that the options are there to solve those complex use cases is quite hard. On the other hand, we have tools that went to the other side of the pendulum and oversimplify the approach to data management. Sometimes those tools just focus on one very small part of the data management life cycle, if you will, and they then force organizations to use them alongside a lot of other tools, or side solutions that those organizations are forced to develop to meet their unique scenarios. So in my mind, the biggest challenge we see in most tools is how they find the right balance between providing an easy experience for what should be easy, while still providing the avenues and the capabilities to solve for the more unique scenarios and the complexity that arises from ever changing business requirements.
I think it's going to be an ongoing challenge. But when I look at the tools in the market, most of them swing to one side or the other. Obviously, at Rivery we try to find that happy path that's able to simplify what should be simple and to provide ways to solve what's a bit more complex. But still, that's the biggest challenge and gap I see in many tools. And as you look at the way those tools are applied to provide a solid data architecture to manage your data, I think the solution is definitely to combine a lot of those modern data pipeline principles I mentioned earlier, and to deliver as many of them as possible in fewer tools, maybe a single platform or a couple of platforms together, but not the sprawl of platforms we've seen too often in common, what's called modern data stack, implementations.
So in my mind, that's one way to increase the simplicity but still provide the flexibility. But I think there's still a lot of room to keep fine tuning and find the right balance, to decrease that gap and really let the users focus on the business value versus trying to learn and understand the technology that is powering it. Because at the end of the day, the technology is just here to provide an abstraction layer over the way we use data. It's not about the technology. It's about what we do with the data, and I think that's what the tools that we use for data should be focused on.
[00:45:52] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share these principles, the ways that they apply to the business, and the ways that they can help teams get more done faster. I appreciate the time that you and your team put into condensing this information and presenting it to us, and I hope you enjoy the rest of your day. My pleasure. Thank you very much for having me. Have a great one.
[00:46:16] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Defining Modern Data Pipelines
Seven Principles of Modern Data Pipelines
Applications and Use Cases of Data
Impact of Modern Data Pipelines on Business
Technical Requirements and Tool Selection
Balancing Cost and Usage in Data Management
Innovative Applications of Data Principles
Challenges and Limitations of Data Principles
Future Opportunities in Data Management