Summary
With all of the messaging about treating data as a product, it is becoming difficult to know what that even means. Vishal Singh is the head of data products at Starburst, which means that he spends all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented, and the long-term improvements in productivity that it provides.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
- Your host is Tobias Macey and today I'm interviewing Vishal Singh about his experience building data products at Starburst
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what your definition of a "data product" is?
- What are some of the different contexts in which the idea of a data product is applicable?
- How do the parameters of a data product change across those different contexts/consumers?
- What are some of the ways that you see the conversation around the purpose and practice of building data products getting overloaded by conflicting objectives?
- What do you see as common challenges in data teams around how to approach product thinking in their day-to-day work?
- What are some of the tactical ways that product-oriented work on data problems differs from what has become common practice in data teams?
- What are some of the features that you are building at Starburst that contribute to the efforts of data teams to build full-featured product experiences for their data?
- What are the most interesting, innovative, or unexpected ways that you have seen Starburst used in the context of data products?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working at Starburst?
- When is a data product the wrong choice?
- What do you have planned for the future of support for data product development at Starburst?
Contact Info
- @vishal_singh on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Starburst
- Geophysics
- Product-Led Growth
- Trino
- DataNova
- Starburst Galaxy
- Tableau
- PowerBI
- Metabase
- Great Expectations
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
- Upsolver: ![Upsolver](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/aHJGV1kt.png) Build Real-Time Pipelines. Not Endless DAGs! Creating real-time ETL pipelines is extremely time-consuming and engineering intensive. Why? Because when we attempt to shoehorn a 30-year-old batch process into a real-time pipeline, we create an orchestration hell that makes every pipeline a data engineering project. Every pipeline is composed of transformation logic (the what) and orchestration (the how). If you run daily batches, orchestration is simple and there’s plenty of time to recover from failures. However, real-time pipelines with per-hour or per-minute batches make orchestration intricate, and data engineers find themselves burdened with building Directed Acyclic Graphs (DAGs), in tools like Apache Airflow, with 10s to 100s of steps intended to address all success and failure modes, task dependencies, and maintain temporary data copies. Ori Rafael, CEO and co-founder of Upsolver, will unpack this problem that bottlenecks real-time analytics delivery, and describe a new approach that completely eliminates the need for orchestration, so you can remove Airflow from your development critical path and deliver reliable production pipelines quickly. Go to [dataengineeringpodcast.com/upsolver](https://www.dataengineeringpodcast.com/upsolver) to start your 30 day trial with unlimited data, and see for yourself how to avoid DAG hell.
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) Datafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying, you can now know exactly what will change in your database ahead of time. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit our site at [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today to book a demo with Datafold.
- Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg) Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: [dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode) today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost effective solution. Plus, it gives you more technical controls so you can fully unlock the power of your customer data. Visit rudderstack.com/legacy to take control of your customer data today. Your host is Tobias Macey, and today I'm interviewing Vishal Singh about his experience building data products with and at Starburst. So, Vishal, can you start by introducing yourself?
[00:01:21] Unknown:
Absolutely. My name is Vishal Singh. I am the lead product manager and head of data products at Starburst. I have been at Starburst for about 2 and a half years, and I joined when we were less than 60 people. We are a much bigger company now, though I still think of it as a startup, and we have innovated a lot, especially in our products. So thank you again for having me on your show; I'm really excited to talk about data products. Do you remember how you first got started working in data? Oh, yeah. It's a funny story. I did my undergrad in India, at an IIT. When I joined college, I was really bold in my 1st year of my undergrad, and I was looking at what I should be doing.
I got connected to this professor who was in geophysics; this was in 2002. He was doing some kind of modeling, and I asked, what are you trying to do? I went to talk to him, and he was trying to predict how rocks behave and how the life cycle of rocks works. And I asked, how are you going to predict that? He mentioned that you can use a backpropagation model with artificial neural networks, or you can use fuzzy logic. And my question was, what are those things? That was the first time I learned about AI, and the first time I learned that you can use data to predict the behavior of rocks.
It got me really excited. Then fast forward, I did my grad school at Penn State. My thesis was with the FAA, and it was actually about airport pavement. We were trying to predict, when a Boeing lands on the pavement, how the stress and strain work, what the life cycle of the different electrical components in the plane is, and also the life cycle of the pavement. All the data was being produced in Excel, and the amount of data they were dumping and creating was enormous. I talked to the person at the FAA and asked, so how are you going to analyze it? They said, we're gonna go through every Excel file and predict the pattern. I asked, how many Excel files do you have? They said, it's a 1,000,000. I said, this is impossible; I will never finish this job.
All I did was automate the process, and that was my thesis: how can you automate the process and predict behavior from that bulk of Excel files? Those 2 things pulled me in, even though I wasn't consciously thinking about data at the time; it was just real life-cycle analysis. 1, I got introduced to artificial neural networks and even wrote some international papers with the professor I was working with. And second, I worked with the FAA and wrote my thesis on that. Before joining Starburst, my previous company was CloudHealth, again a Boston-based startup, which was later acquired by VMware, doing cloud cost optimization. I joined the company in January 2016, and I saw how customers on the cloud were spending on Postgres, Redshift, and the other cloud services, and how that cost moved over to S3. In 2016, S3 was the place where you put everything.
In 2018, customers started asking, how can I understand what I should keep in S3 and what I should archive? So things changed, and that's when I got introduced to Starburst. I knew about Presto, which was later renamed Trino, so I knew this problem was going to get bigger, and I wanted to be in a field where we are solving the problem of data. Not just infrastructure, but how data is evolving, because I've seen the evolution over time. And I was very lucky in my timing, because between when I joined Starburst in 2020 and today, even with the 2 years of COVID and everything else, the landscape has completely changed.
[00:05:21] Unknown:
Bringing us to where you are now with the work that you're doing at Starburst, we're discussing the idea of data products, which is a term that has been thrown around in a lot of different contexts lately. I'm wondering if you can start by giving your definition of what you consider a data product to be.
[00:05:39] Unknown:
I'm gonna use the example of my undergrad and grad school again. I did not know what a data product was; in fact, I did not even know the basics of what's called artificial intelligence, or ANNs, or backpropagation. But what I wanted to know was, how can I use the data to produce a model which can be consumed to predict behavior? That is what resonated with me in my undergrad and grad school. I was given this bulk load of data, and they asked me, can you give us the right pattern, so that the FAA and other folks in the organization can use it to predict the right behavior, or even give the analysis to Boeing, and also decide how the pavement in the airport should be built?
And if we fast forward and look at data products, that's exactly what we are trying to do. The amount of data being generated every day is enormous. As a data engineer or as a data scientist, how do I go and trust the data? And even if somebody says, yes, you should be able to trust the data, how do I verify that I'm able to trust it? It's not just trust; it's trust but verify. So what a data product does is take that overload of datasets and produce a gold-standard dataset with extra metadata around it. And that extra metadata is the business context you should be using. I wanna double-click on business context, because if I go talk to marketing and sales in the same company, they will talk about 2 different things. Marketing will say, I just care about leads, and sales will say, I care about revenue. But it's exactly the same set of datasets.
So even the business context which is being added to datasets will mean something totally different to 2 different organizations, based on what they're trying to derive from the dataset. And that's where the metadata becomes really important. A few days ago, I was asking someone in my organization, how can we understand what customers are using as of today? What BI tools are they using? And I got the result, and my next question was, how do I know how fresh the data is? Let me actually go figure that out. The point is, sometimes you get the results of what you have asked your organization for, but how do you know you can even trust the result? How do you know when the results were valid from, and when they will be valid to? You may know that this result is valid as of yesterday or tomorrow, but it may not be valid in a month. How do you know how long you can use the report?
So it's not just the data; there's an extra behavioral aspect of the data, which is the metadata and metrics, and also understanding in what context the data is being used. Another angle is the governance approach. Data is being produced from different systems every day, every second. How can I securely share a dataset in my organization while not exposing any PII? For example, I just hired a data scientist to write a model on the data product. How do I ensure that the data scientist is still able to run the model while making sure they are not looking at somebody's Social Security number and the data is not getting leaked on the Internet? So the governance piece flows from the start to the end. In the end, it's all about creating business insight, and once the business insight is created, that is also a data product. Once I have created a business insight from a data product which was shared with me, how can I share that onward with full confidence, not just the confidence I had, but also instilling that confidence in the person who's going to use the data product? So it's end to end: using the data, understanding the metrics, and also trusting the data. And not only trusting it, but being able to verify that I can trust the data.
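To make the governance point concrete, one common pattern is exposing a masked view rather than the raw table; here is a minimal sketch in plain SQL, where the table, column, and role names are hypothetical rather than anything discussed in the episode:

```sql
-- Hypothetical raw table: raw.customers(customer_id, name, ssn, email, revenue)
-- Expose a governed view so a data scientist can build models on the data
-- without ever seeing the raw Social Security numbers.
CREATE VIEW curated.customers_masked AS
SELECT
    customer_id,
    name,
    -- keep only the last 4 digits of the SSN for joins and debugging
    concat('***-**-', substr(ssn, length(ssn) - 3)) AS ssn_last4,
    email,
    revenue
FROM raw.customers;

-- Grant the data-science role access to the masked view only.
GRANT SELECT ON curated.customers_masked TO ROLE data_scientist;
```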
[00:09:29] Unknown:
An interesting lens as well on the idea of data as a product relates to how software product teams operate. 1 of the major trends in recent years is the idea of product-led growth, where rather than relying on spending a whole bunch on marketing and sales teams, you try to just get people to use the product, and from that decide: okay, what are the things they care about? How can we improve it? I'm curious what you have seen as some of the ways that data teams either have adopted or could adopt some of those practices: starting with something that is functional, but not necessarily as complete as they might want it to be, and using that idea of product-led growth to get people using it and get that feedback cycle working, so that it becomes a kind of collaboration with early design partners. How are teams thinking through those aspects of product development in the work that they're doing on pipelines and data systems and data platforms?
[00:10:31] Unknown:
Great question. In fact, I wanna take a step back to why there's been this change, especially toward product-led growth, and why we call this data as a product. The term data as a product starts with the word product. When you start treating data as a product, you start to question: what problems are you trying to solve? Who's going to consume the data, and why will they consume it? And what kind of iterations do I need to do? Because in the product cycle, you start with the idea, you capture the idea, you create the solution, you get the feedback, and then you ideate again. That is a really important factor when you're serving the data and solving the use cases.
In the old days, and data has been there forever, if you went to an analyst and asked a question like, how much revenue did we make in Q4, the analyst would give you a dashboard. The dashboard was built based only on the question asked. The questions of why you are asking for revenue, whether the consumer is going to be the C-level execs, whether you're trying to take it to investors, what the reason is that you're looking for it, and what extra information should be fed alongside the revenue, those questions were not being asked. But now we are at a stage where data is evolving at such a rate that, without asking those questions, you might just take a question in your organization and create the wrong dashboard, because you didn't ask about the surrounding use cases for why the data was being produced in the first place.
So taking those software engineering principles, you're not only capturing the use cases, you're understanding them. You're breaking the use cases down into what should be version 1 and what the end version of this data product needs to be for the organization. Then, as the product is being produced for the organization to consume, you're also looking at the feedback cycle: who is consuming it, how people are consuming it in the organization, and ideating on it to evolve that data product to the next level. While doing that, you're also looking at the history, which is how my data has evolved over time, because people do that with code: what was the code yesterday, and how has the code evolved over time? The same is coming to datasets. This is why, when you look at schema changes, people ask why they cannot see how the schema looked 10 days ago and how it is going to look in the future. That gives you 2 pieces: 1, understanding the organization's behavior, and 2, understanding how I should fund my team, so that if there's more demand for a dataset, I can fund my team to produce more datasets in that direction, or stop producing the datasets that aren't in demand. That also goes to the success metrics of data. So you take exactly the same principles: whether my product is going to be successful based on who's consuming it, what the market fit is, how long the consumption is going to last, and when I retire it and add new features, or when I retire this dataset and add new aspects around it.
This is why it gets me really excited: when we talk about data, we don't just talk about the data itself, we talk about data at a human-consumption level, as a product itself. It's going to give insight and make someone's life easier, because they're trying to derive value out of it. Another thing to be
[00:13:57] Unknown:
aware of when we're talking about data as a product, and making sure that it has all of these other contextual attributes, is understanding when that might be overkill and you actually just need to get the data from point A to point B without doing all of this extra work. So it's understanding in which contexts a data product is useful and necessary, and in which cases you just need to do the mechanical work of getting data from A to B or transforming the data. Because if you start to think, okay, every time I interact with data I need to treat it as a product, that can become very overwhelming, and you get into the kind of analysis paralysis of: I can't do anything because it might not be the right thing, or if I wanna do step A, I also have to do steps X, Y, and Z. So how do you think about the necessary trade-offs of data as a product versus data as just raw information?
[00:14:55] Unknown:
Okay, it goes back to, say, software engineering principles. Sometimes I may just be debugging my dataset, and I may write some side code to debug it, which makes sense only for me. For example, I'm gonna write a cron job that runs overnight because I wanna move some datasets so I can write code against them in the morning, or I'm gonna write some quality checks on the dataset so that I can run unit tests. You're not gonna make every unit test into a product, because it is only being used to test the dataset, and the same goes for data quality and the other components around it. The point is, we need to understand what the value of that is to the organization.
If there's no value to the organization, if I'm just doing a select star from the customer table for exploration, that is not a data product, because the value is only to myself. I'm still in the exploration stage; I'm still trying to figure out how to even create a product. Not every step can be data as a product. If something is being created as a product, it needs to bring value to the complete organization or to a team, so that other teams are able to use it, and use it in a repeatable way, feeding it into other products. It's like thinking about microservices. You don't write a microservice for everything; you wanna make sure that if one is being written, it's also helping other microservices consume it and build on it. If I just want to do a select star from something, that could be a subset of a data product, but every SQL statement and every Python script cannot be a data product. As I said, it needs to bring value to the organization in order to be treated as a product.
[00:16:33] Unknown:
There's also the question of: is this an internal data product, something that I'm building that is only going to be used by people within my organization or within my team, or is this going to be an end-user-facing data product, where it needs to have maybe some extra polish or features or an explanation about its purpose and context? I'm wondering how to think about the different features, capabilities, and support systems that are necessary based on how, and by whom, the data product is going to be used.
[00:17:07] Unknown:
Great question. I'm gonna use software engineering principles again here. There's an IT department in an organization, and they may give you a tool to go run your own infrastructure. There are 2 different teams, and they may both be running infrastructure, but the infrastructure just needs to run so they can use it for the code they are writing. The same goes for data as a product. There might be data products which you're exposing internally, and they could be short-lived or they could keep evolving. The polish and the SLA may be very different when you're operating internally. You need to understand, for example, if I'm creating a data product for the sales department, they may care about how many leads are coming from marketing, and how they can use those leads to generate revenue and pipeline in the organization.
That is very internal, because we are capturing the leads and feeding them to the sales department. If the SLA isn't met one day, they can come back the next day, file a support ticket, and ask the marketing department, can I get the results out of it? Then there are data products you're exposing externally, and the SLA there is very different because of who may be using it. Let's use the example of a hedge fund. A hedge fund runs on, I need the latest data so I can go act in the stock market. The SLA there is very different, because you cannot wait a day for the data to give you insight; the market has already changed in a day. So based on the use cases and who is driving it, the SLA, the polish, and the consumer who is going to be using it determine how the data product is built and how much time needs to be spent creating it. 1 more example I'll give is the fashion industry, where you are creating data products to understand how fashion is evolving over time. I talked to a data scientist who works at 1 of the fashion companies, and he basically said, my model is never complete. I create a model with some inputs, it generates value for a few days, and then the fashion industry changes again, other things are trending, and I have to retrain my model on that. So for the product he is working on, he's basically saying, this product is going to be valid for 2 years, and I have to continuously keep working on it and updating it to make sure the business is keeping up with the recent trend. Then there are data products which are static, where I just wanna look at historical data, which you make once and never work on again because you just want the historical context.
What happened last year? That could be valid for 1 month, and then a year later you are creating a different insight looking at what happened last year. So it completely depends on who the consumer is. And I believe any product which is created for external consumers will have more polish and more user experience around it, because you're giving the dataset to people outside the organization who may not understand how the data was created, who may not have insight into how the data was used, and who have to go on the face value of the data to understand how they can use it. With an internal dataset, when the data is created, you have people you can talk to; you can still reach out to people to understand the dataset. That's why the SLA, the user experience, and the polish will be completely different, because the approachability is very different.
[00:20:35] Unknown:
In terms of the ways that teams are thinking about building data products, I'm curious what you see as some of the common challenges that they encounter, whether in the technical, organizational, or tactical elements of how to think about building data products: what components are necessary to bring to bear, and what skills they need as a team, both in terms of the individual roles and the organizational buy-in. How are your customers at Starburst and your internal teams thinking about where to even start with building a data product? How do they think about treating it as a minimum viable product and going through that iteration cycle, and what are the tools, techniques, and skills they need to make it happen?
[00:21:19] Unknown:
1 of the questions I have asked my customers is, how do you define the success metrics of the product you are creating? It's really hard to get an answer to that beyond, oh, someone's going to use it. Okay, so how do you define whether somebody using the data was successful or unsuccessful? That part is really hard to quantify at the moment when you treat data as a product. Somebody may be using the data, but not in the way the data was created for. The other piece is, what tool should I be using to create and consume this dataset? This goes into the skill set: should I be using Python, SQL, R? And then you can look at the modern data stack; there are a gazillion tools available. So how can I iterate quickly on the dataset? Time matters, in terms of cost, how much cost you're incurring for the organization while creating the product, and how fast I can create it. Creating the model takes time, but how long should I train the model? When should I stop training it? When am I ready to expose this model so it can be consumed?
Those are things people are still debating: is this the time, should I stop here, should I continue, or should I just train the model to 80% and start giving insight to my organization? Another challenge I've seen is the discoverability of data. Data is getting created literally every second; every time somebody clicks on the Internet, there's data being created. Which data should I be using? How do I know whether I have used the right amount of data? Should I be using more or less data? If I use more data, I'm gonna take longer to train, and I will spend more money and time creating the model, by which time the trend has changed. But if I use less data, then I may not be creating the right insight for the organization. So how do I discover the data I need access to? And if I found the data, has this data product or dataset already been created by someone in the organization?
Am I doing a job which has already been done in the organization? I have seen, especially in bigger enterprises, the same datasets, data products, and models being created in different teams and different business units without them ever talking to each other. They are just doing it because everyone's working remotely, people don't know what datasets exist, and everyone is moving fast. Nobody's stopping; they're like, I got a request, I need to finish it and get this out. So people are producing the same model and the same dataset twice and duplicating their work. Now, if I was able to find a dataset that I don't have to create, how do I know who to go talk to in order to understand it and get access to it? And what processes do I need to follow to get access to datasets?
The challenges go on and on, but the main problem across all of them is understanding. It's the same as when I go to a new city I've never been to: I look at the map, I look at the streets, I ask people questions. If I know the city, I can walk around it really fast, but if I don't know the city, even walking a block may take me 2 hours, and I may come back to the same block because I have no idea what I'm doing. The same problem exists with data. It's the insight: understanding the context of the data, understanding who owns the data, and making sure there aren't 10 layers of people, a centralized gatekeeper, just to grant access to data. If I have created the data, I should be able to give access to other folks. The data volume has become so big that you cannot expect 1 team to spread the tribal knowledge across your organization.
It has to be community based and culture based, where we are talking in terms of the data. It should not be that I'm sitting in 10 hours of Zoom calls or chatting with people on Slack just to understand what the first query I should write is. So there are tons of challenges when it comes to creating a data product, and tons of challenges in understanding the context of the data. And the challenges get worse, because by the time I understand what data I should be using, there's a newer version of the same dataset available in the organization, and I'm still using the old version.
[00:25:46] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. DataFold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. In terms of what you're building at Starburst, I'm wondering how you think about enabling data teams to orient their work in a way that they are producing products and actually building up a catalog of features that are useful for the organization, being able to build an engine so that you don't have to redo the same task 15 times in slightly different ways. So you're not just working in a support desk mode of waiting for somebody to ask for something and then filling their request, but instead building out a suite of capabilities that is flexible and adaptable enough for people to interact with it the way they do with other products, and to do some measure of self-serve, fulfilling their own requests without necessarily having to involve an engineer in the process.
[00:27:34] Unknown:
That has been the motive of Starburst as a company. Starburst, behind the scenes, uses Trino, previously known as PrestoSQL, as the engine to connect to all the data sources. The engine was created at Facebook, and 1 of the key reasons it was created is that you have different databases: you have a lake, you have warehouses, you have MongoDB, you have Elasticsearch, and whatnot. They each serve a purpose for how, why, and when you should use them, which means there's a reason why your data should live in Elasticsearch, a reason why your data should live in the lake, and a reason why it should live in some warehouse.
But as an analyst, if I wanna derive complete insight across all my datasets, I wanna connect those datasets, which means that without Starburst, you have to move the data before you can consume it. With Starburst, 1 of the challenges we solved even before data products is that you get access to the data in the environment where it is actually stored. You can connect and create insight between MongoDB, Elasticsearch, Snowflake, the lake, whatnot, while you're looking at the data. You don't have to move it; you can write queries on demand and get the insight. Now, once we've connected you to all the sources, you're not looking at 1 river or 1 mountain, you're looking at 10 mountains. How do you know which is the biggest peak? That's another challenge. Now I've actually created a dataset.
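For a sense of what that federation looks like in practice, a single Trino/Starburst SQL statement can join tables that live in different systems; the catalog, schema, and table names below are hypothetical, a minimal sketch rather than anything from the episode:

```sql
-- Hypothetical catalogs: "lake" (S3/Hive), "snowflake", and "mongodb".
-- Join web events in the lake with customer records in Snowflake and
-- open support tickets in MongoDB, without copying any data first.
SELECT
    c.customer_name,
    count(DISTINCT e.session_id) AS sessions_last_30_days,
    count(t.ticket_id)           AS open_tickets
FROM lake.web.events AS e
JOIN snowflake.crm.customers AS c
  ON e.customer_id = c.customer_id
LEFT JOIN mongodb.support.tickets AS t
  ON t.customer_id = c.customer_id AND t.status = 'open'
WHERE e.event_date >= date_add('day', -30, current_date)
GROUP BY c.customer_name
```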
How do I share this dataset? Do I move it again somewhere? How can people come in and consume the insights which were created from the federated system without having to understand the complexities of federation? As a business analyst, they do not care about federation. They just care about, can I get access to my revenue or leads datasets within a second or within a minute, and share them with the marketing department so they can see what just happened? This is where the 2 personas work perfectly in Starburst. As of today, we have data products where you can combine the data on demand without moving it, create the insight with business context, and share it with different departments. The other department doesn't even have to know that the data is coming from 10 sources. They're looking at a curated dataset.
And as the data behind the scenes changes, for example data in the lake or data in the warehouse, the data product will reflect the most recent version so that it's not blocking the users who are accessing it. There is also insight into how users are accessing the data, so as a data producer you can understand who in the organization is accessing it. With Starburst, you can also govern how the data should be exposed to different parts of the organization. Now what we are trying to do is, once a dataset is created, how can I use this dataset with another domain and combine them to create a new dataset? So where we wanna go with data products is that you're not just federating across different sources, you're federating across use cases.
The insights being built between the sales and marketing departments can be used to create a whole new insight which could be useful for the product department. So now, with the Starburst data product, we are federating at the use-case level. 1 of our customers actually put it well, and I'm gonna use the quote without using the name of the customer: data products are a catalog of catalogs, which means you have all the datasets, you can search the datasets, you can learn the datasets. But as an analyst, if I go try to find a phone number, the phone number exists in this dataset, it also exists in the raw datasets, it also exists in this other one. How do I understand which is the dataset I should be using?
Now we level up the conversation. You have taken all the datasets and said: this gold-standard one is the product you should be using, not just the data. We are taking the productized version of data and calling it a data product. So you're also getting: this phone number exists in this data product, which has been verified by some admin, which has been stamped and certified by somebody as the one you should be using for these use cases. You can also understand who the owner is, who can give you access, what context it has been used in in the past, what the SLA is, and what the success metrics are of this data product which is already available to you. Looking at that, you are a lot more confident, which goes to my point: trust but verify.
You might have the trust, like, I think this data product is what I should trust, but you're also able to verify it through the metrics which are available on the data product. So it takes the concept of trust but verify. And once that's available, you don't have to go looking at the raw datasets, because your persona is just creating the insight. The raw datasets stay with the people who are creating the pipelines and the data products, and they also understand how the data product is being consumed by customers. Taking an example, if I create a feature in software and the feature has never been used by anybody, I can look at the insight, I can look at the product analytics, and decide we don't need to spend any more time on this feature; let's shut it down and work on something else. The same goes for a data product. I created a data product, I understand how it is being used internally or externally, and if it's never been used, let's stop wasting our time on it and focus on other use cases. So it also helps prioritization on the data engineer or data producer side, where they can understand where they should focus their time and how they should prioritize. And that prioritization can only happen when you understand the consumption side and the success metrics of the data product being used inside or outside the organization.
[00:33:38] Unknown:
In your own work at Starburst of building some of these extra features to help your end users, but also maybe dogfooding some of them, I'm curious what are some of the complexities or challenges that you're running into as you think about, okay, I think this is what I want, and I think this is what other people want, and then figuring out what is the shortest path to decide whether or not I'm wrong.
[00:34:02] Unknown:
It's a good question. I'll start with the dogfooding piece. We do dogfood our own data products in the organization too, where we're trying to understand, for example, we have tons of connectors in Starburst, which connector is being used and in what way. If the lake connectors are the ones most used by our customers, then we want to put extra focus on those connectors so that we can make them more performant and more scalable, because that is what brings more and more customers to the platform. So we are looking at those datasets. We're also looking at datasets about our alignment with customers, but most of those notes actually go into Salesforce.
So how do you connect the datasets being collected in the lake with the data in Salesforce to generate insights? Then there's another set of data: for example, in about a month and a half, on February 9th, we're having Datanova, and from Datanova we'll be getting a lot of leads to understand who is talking about what. So those leads are also coming in. Now there are 3 sets of datasets: the data from our existing customers and what kind of data can generate revenue, the Salesforce data, and the future pipeline we can generate. If we just focus on 1 of them, we will just be serving our existing customers and cannot grow our customer base. If we just focus on the leads, then we cannot focus on existing customers.
If we focus on those 2 without focusing on the Salesforce data, then we cannot focus on the revenue aspect. And that's how we are dogfooding it. But again, sometimes I believe there's an art to it, where we talk about it like, okay, we know there's a customer who gave a talk, we have collected those datasets, and we know they asked about XYZ. Should we go in that direction, or did the customer just mention 1 of 10 things, and did we just miss the other 9? For those 9 things, sometimes the data is not there, and we have to form an opinion based on what we have seen, and that opinion is changing every second. That is the hardest part we have seen.
The 2nd hard part I can talk about is 1 thing we have in data products: a discussion board. We added the discussion board to our data products so that people would come in and add comments. But realize that most people actually have those discussions on Slack, or in Teams, or over email, or over text messages. Nobody wants another discussion board. So how do we consume the information that someone may be discussing on Slack and use that to drive and ideate on how this data product should evolve? Those are challenges we are following closely, understanding that no 1 is gonna ask a question in the discussion board in the data product; they will probably ask a question on Slack, or they will create a ticket in Jira to ask, how can I get access to it? How we connect those tools to the data product is something we are still exploring.
[00:37:01] Unknown:
Another challenge in the context of data products and the overall data ecosystem is thinking about which component of the data stack owns the experience, because it can be very fragmented having to go to 3 different places to get the entire picture of something. I'm curious how you're thinking about that at Starburst and how you've seen your customers address that challenge of: okay, I've got the context I need for this to be treated as a product in this 1 platform, but I actually have the governance aspect in this other platform, and the person who is trying to understand all of this is using a third platform to access it all. So how do I stitch it all together in a way that they're not getting frustrated and I'm not having to run around with my hair on fire?
[00:37:50] Unknown:
That is so true. Somebody I was chatting with literally had a data catalog in their company, and the data catalog was completely independent from the warehouse or the data source. Most people who are creating a dataset or data product are creating it close to where the data lives, or very close to the query engine. They are the folks who should be responsible for documentation, because they are the ones creating the dataset and they have the most context when they create it. Now, as a user who already has 10 different priorities on my hands, you're asking the person who is writing the code that, instead of writing 1 more line of code or the documentation and comments right there, I have to go to a completely different tool to write documentation, then come back to this tool and create a new data product, and then go back to the other tool to write documentation again. Now take it another level. I create a table or dataset or data product.
How does the other tool know which user created it? That user may not be present in that tool for other reasons, maybe because of the monetization aspect of that tool, and you may never bring that user into that tool. So suddenly the person who created the data product or dataset in 1 tool may not be available in another tool. If I look into that tool, I may not even know who the right person to ask a question is unless a data admin comes in and adds the owner metadata into that tool so other people can understand who the owner is.
It goes back and forth, because the simple piece is that if I look in 1 place, I can understand it. But if I look in 1 place and then go over to another place, I have to remember what was in the first place to add more context to the 2nd place. As you keep adding to the data stack, the problem gets worse. That is 1 of the things we are trying to address at Starburst with a recently released feature called data discovery. What we did in Galaxy, which is our software-as-a-service platform, is that the moment any user comes in and creates a catalog, we automatically grant that person ownership of the catalog or the cluster they created, which means that person doesn't have to go add a new set of metadata for people to understand who the owner is. If you created it, you are the creator and you are the owner. You do have an option to transfer the ownership, because for any reason you can always transfer it, but it gets created automatically.
Second thing, the metrics information. In most other tools, metric information is scraped and built after the fact. Our metrics information in Galaxy is automatically prefilled based on how users interact. A user doesn't even have to interact within Galaxy; they just have to interact with that catalog outside of Galaxy using, let's say, Tableau, Power BI, Metabase, or any other tool, and we'll capture those metrics to generate which are the top tables somebody is using and who the top users are. Those top users don't even have to be part of Starburst Galaxy; if the user is working in Tableau, that metadata comes to us and we generate it automatically. Now, if a query fails from Tableau, how do you know that the query failed from Tableau? How do you debug the query, not just get insight from it? That's where the data producers come in. They're looking at the complete metrics, not just to understand the data product but to debug it, because if you cannot debug the product, how do you know the product is up to the quality you expect it to be? So you can debug which queries have failed and how the queries are performing.
All the metadata is collected automatically. A few things we are actively working on take that ownership concept further. When a user comes and creates a table, that user is automatically made the owner of the table. If I am the owner of the table, I can decide to give privileges on that table to anyone in the organization, including myself. It's kind of like saying I have a key to the house: I can rent the house and give the key away, or I can invite people to the house. That's how we treat ownership. You can create a table and even say, I do not want to accidentally use this table outside of this use case, so I myself don't have access, but other folks do. It goes to the governance piece of the dataset or the data product.
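As a rough sketch of that ownership and privilege flow, the statements below use Trino-style SQL; the exact syntax differs between Starburst Galaxy and the underlying catalogs, and every table, role, and user name here is hypothetical:

```sql
-- The user who creates the table is treated as its owner automatically.
CREATE TABLE sales.curated.q4_revenue AS
SELECT region, sum(amount) AS revenue
FROM sales.raw.orders
WHERE order_date BETWEEN DATE '2022-10-01' AND DATE '2022-12-31'
GROUP BY region;

-- As the owner, hand out access explicitly...
GRANT SELECT ON sales.curated.q4_revenue TO ROLE marketing_analyst;

-- ...or even remove your own access so the table isn't used by accident.
REVOKE SELECT ON sales.curated.q4_revenue FROM USER "vishal@example.com";

-- Ownership itself can be transferred to another principal ("give the key away").
ALTER TABLE sales.curated.q4_revenue SET AUTHORIZATION ROLE analytics_admin;
```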
The second thing is that most metadata is populated automatically, which means that if you have already built a definition in your source, we're gonna pull that information automatically, because we believe people have so much on their plate that they should not be spending time rebuilding metadata. It should get built automatically if it has already been added somewhere. And since we are the query engine, we don't have to go scrape the data. Every time a table is added or removed, Galaxy sees what modification has been made, so we are showing live information about changes to the dataset as the data moves. There's no delay from a different tool trying to understand the dataset.
There's no delay in understanding the dataset. And to the point that I cannot create the dataset without writing the documentation: the moment the data is created, it gives you the documentation option right there. You can choose to ignore it, but it's right there; it's the next step, not the next tool. You can also see what other documentation has been created in your organization while you're in there, so while you're writing it, you can attach the documentation from a different data product, all within the context of the same tool. That is what we are going for: most people do not wanna jump from tool to tool, because it wastes time, especially in this day and age when people literally have 100 priorities and do not have the time to commit. Let's make their life easier; let's not give them 100 tools, because that will make their life harder.
[00:44:17] Unknown:
Build data pipelines, not DAGs. That's the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration, and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts, and window operations. Output can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake and run unlimited transformation pipelines for free. That way, data engineers and data users can process to their heart's content without worrying about their cloud bill. Go to dataengineeringpodcast.com/upsolver today to get a 30 day trial with unlimited data and see for yourself how to untangle your DAGs.
In your experience of working at Starburst and with your customers, I'm curious what you have seen as some of the most interesting or innovative or unexpected ways that people approach the idea of building data products, and some of the ways that they've applied Starburst in that context.
[00:45:38] Unknown:
I think the first customer of data products surprised me. When I was ideating on it last year, my intent was for it to be used within the organization; that was our version 1: a data product created for one domain that could be shared across the organization, with full visibility into the gold-standard dataset along with the business context and the ownership around it. The first customer who used data products used them for a completely different reason: they embedded them. They actually used Starburst behind the scenes to power their website, which means outside users were coming in and asking for some set of data, and instead of pre-building those datasets, they were using API calls and queries to create data products on demand.
And all the metrics Starburst was collecting, they exposed on their website. So, basically, they used our data products to make externally available data products, which was not the intent when we were ideating on data products. I was happily surprised, and it gave us another set of use cases. The other use case I've seen is one we didn't intentionally design for. Most customers have legacy datasets and also cloud datasets, and they're always moving data from legacy to cloud, especially if they have a cloud-first strategy. With Starburst and data products, instead of moving the entire dataset, they can materialize just what they need: connect to the legacy on-prem system, connect to the cloud warehouse or cloud S3, and then decide, I just want two columns from one legacy table and two columns from an ORC file in S3, and materialize that into S3. They're using data products to create small, purpose-built datasets. So instead of moving everything, they materialize only the small datasets into the cloud and then use cloud elasticity for the consumption of those datasets. Creation can be done with a small amount of compute, while consumption can use cloud compute. The customers have been really innovative in how they're using data products, which is very exciting to see as we're on this journey.
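The pattern described here, pulling a few columns from an on-prem system and a few from the lake and materializing only that slice, maps naturally to a federated CREATE TABLE AS SELECT. The following is only a sketch with invented catalog, schema, and column names, not a specific customer's query.

```sql
-- Hypothetical catalogs: "onprem_oracle" for the legacy system, "lake" for the
-- cloud data lake. Only the needed columns are pulled from each side and
-- materialized into the cloud-backed table.
CREATE TABLE lake.curated.customer_slice AS
SELECT
    c.customer_id,
    c.region,
    e.last_order_date,
    e.lifetime_value
FROM onprem_oracle.crm.customers AS c
JOIN lake.raw.order_events AS e
    ON c.customer_id = e.customer_id;
```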
[00:48:04] Unknown:
In your experience of operating in this space of data products, spending a lot of your time thinking about it, and building some of the features to enable those capabilities, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:48:18] Unknown:
A few things. One, we made mistakes about how and what to build initially. For example, with data products and materialized views, which I was talking about: we shipped the data product as a view in the first version. That wasn't a mistake exactly, but most customers were looking to create cross-organization, cross-region, cross-cloud data products. Initially we believed a data product would live locally within one region, but that was not the case; it evolved into cross-cloud and cross-organization data products, and the request to materialize data came really fast. So a few things were unexpected, and we literally acted on them to add those capabilities. One of the things I just mentioned: we spent time adding a discussion mode, which was not used often, because we assumed people would come in and have the discussion there.
But most discussions were happening outside the tool, so the question became how to bring that information into the data product. The other challenge we've seen is data across geographies. If some data is in Europe and some is in the US, we do have the Stargate connector, but how do we actually communicate with those datasets? So one fast follow was connecting datasets across geographies: how can we do it securely, making sure customers don't accidentally expose datasets and stay compliant with each region, and how can we do it cost-efficiently? We've made a couple of investments there, which are already available in Starburst, to expose data products cross-cloud and cross-region while respecting compliance. But we are still in the learning stage, and that's why the approach we're taking in our software as a service is that data products sit on top of the catalogs, but people still want to be able to search the raw datasets. That goes back to my earlier point about trust but verify: I can look at the data product, but I do want to see which raw datasets were used to create it. I may not be interested in the raw data itself, but I want to make sure that whoever created the data product did not use information that leads to the wrong conclusion. That verification of data products has been one of the biggest areas of focus as we develop Starburst Galaxy, and that's the approach we are taking.
[00:50:54] Unknown:
As you continue to iterate on the Starburst platform and the Galaxy product, and work with customers on enabling this approach of building data products rather than just completing data tasks, what are some of the things you have planned for the near to medium term?
[00:51:10] Unknown:
One of the near to medium term plans, as I said, comes from the fact that the data is evolving every day. Once data is registered in the metastore, in the lake, or in a warehouse, you can easily use Starburst to consume the data product or dataset. But we have always heard from customers: how do I know there's a new file in the lake? How do I know that something has changed? How do I know what evolved over time? And I don't want to do it manually; I want it automated. So there are a few changes we're making. When you're looking and searching for a dataset, you're not limited to datasets you have already cataloged; you can search and actually create new schemas and new tables based on new files that have landed in the lake. That is schema discovery, which is already available in Galaxy, but we're adding more user experience on top of it.
One of the key components of any product is usability. Usability could be the UX, the guidelines, or the best practices. We are really, really hyper-focused on the usability of data products and making them easy to work with. If I want to run schema discovery, to use the example I'm giving, I shouldn't have to go write it manually, figure out where a bug happened, or work out what to do if there's an unstructured file. I don't want to dig into every single file to find out why it was unstructured. Maybe I should just get an email telling me that ten files were found, nine were registered, one was unstructured, and now I have the information I need. So it's about exposing the right information to the customer, not an overload of information. How can we automate to make our customers' lives easier without giving them so much information that they get overwhelmed? That's a balance we are taking very carefully.
We want to make the usability and consumption so simple that anyone, from data engineers to consumers, can understand the dataset. At the same time, we want to give enough information that if I want to do manual schema discovery, or change the format from ORC to CSV to something else, I can do that, so we are adding those parameters too. It stays available for the sophisticated user, but without overwhelming everyone else with error cases; the default leans toward recommendations.
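For context on what the automated schema discovery saves you from, the manual equivalent of registering a newly landed set of files looks roughly like the following. The bucket path, columns, and format are invented for illustration; the syntax shown is the Hive-connector style of declaring an external table.

```sql
-- Manual registration of files that landed in the lake; schema discovery
-- aims to do this (and the schema inference) for you automatically.
CREATE TABLE lake.raw.clickstream (
    event_time  timestamp,
    user_id     varchar,
    page        varchar
)
WITH (
    format = 'ORC',
    external_location = 's3://example-bucket/landing/clickstream/'
);
```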
[00:53:31] Unknown:
Are there any other aspects of the work that you're doing at Starburst or your experience working in the space of data products and the feedback that you've gotten from customers that we didn't discuss yet that you'd like to cover before we close out the show?
[00:53:43] Unknown:
One thing comes from one of our customers. I was looking at the Trino community, and people have used Trino for many, many use cases. I don't remember exactly which customer it was, so I'll be careful about naming them, but somebody actually used Trino to run data quality checks, which I was very surprised and fascinated by. The point I want to make is that even though we are trying to ensure that customers can understand the data, explore the data, catalog the data, create data products, and create metrics and documentation within Galaxy, our primary objective is that Starburst offers optionality, and we will never walk away from that. What I mean by optionality is that if someone is already using some kind of tool, we also want to support that tool. We never want to say that you should always be using Starburst. We have been always open, always supporting the community, always making sure the tools work together. For example, if I want to use Power BI, we support that; there is already a Great Expectations integration with Trino, for example, and we will support that. The point I'm making is that because of the community, that optionality will always continue to exist.
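As a small illustration of using a query engine for data quality, a check of this kind can be expressed directly in SQL against any connected catalog. The table, columns, and time window below are hypothetical; a framework like Great Expectations would wrap similar assertions with validation and reporting on top.

```sql
-- Made-up table and columns: a basic completeness check over yesterday's
-- data, runnable through Trino or Starburst like any other query.
SELECT
    count(*)                      AS row_count,
    count_if(customer_id IS NULL) AS null_customer_ids,
    count(DISTINCT order_id)      AS distinct_orders
FROM lake.curated.orders
WHERE order_date = current_date - INTERVAL '1' DAY;
```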
We want the user experience to be grounded in Galaxy, but at the same time we want to ensure that customers get to use the tools they are most comfortable with and be productive in the tools they want to be productive in, while we keep working to bring them a great user experience within Starburst and Starburst Galaxy.
[00:55:19] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:55:34] Unknown:
One of the biggest gaps I see is reactiveness and a lack of agility. There's a confusion that to be agile you need to be reactive, and that is incorrect. You have to be planning, prioritizing, understanding what needs to be done, and iterating on top of that. And that can only be done through collaboration across the organization. As I said, there's no such thing as a one-person army; it's the organization, and driving for the organization does not come from one team. It comes from across teams, from communication across teams. Most tools I see today get hyper-focused on one use case and forget that, as a user, as a data engineer, I may have ten different problems.
Even if that one tool solves its use case perfectly, it doesn't help the person who's coming in, doing the job, trying to do right by the organization and deliver for it, who then gets lost in: okay, I solved the quality problem with one tool; how do I solve the cataloging problem? How do I solve the governance problem? How do I solve the creation problem? So the trend I see over the coming years is more collaboration across the organization, more communication across the organization, and prioritization not just across the team but across the organization: creating tools that are not just beneficial for one team, but that help the team drive insight and keep the organization on track to do the right thing for that particular organization.
[00:57:19] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your perspectives on how to build data products, the contexts in which they are useful, and how those varying applications of product thinking can serve the different use cases the data is being applied to. I appreciate all of the time and energy that you and the folks at Starburst are putting into making that a more tractable problem, and I hope you enjoy the rest of your day. Thank you, Tobias. And, actually, I'm a huge fan of your show, so I'm really, really honored to be on it. And as you can see, I get very animated and excited talking about data products and Galaxy.
[00:57:56] Unknown:
So I really appreciate you having me on your show.
[00:58:03] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Vishal Singh and Starburst
Vishal's Journey into Data Engineering
Defining Data Products
Product-Led Growth in Data Teams
Balancing Data as a Product vs. Raw Information
Challenges in Building Data Products
Starburst's Approach to Data Products
Complexities in Building Data Products
Innovative Uses of Data Products
Future Plans for Starburst