Summary
The latest generation of data warehouse platforms has brought unprecedented operational simplicity and effectively infinite scale. Along with those benefits, they have also introduced a new consumption model that can lead to incredibly expensive bills at the end of the month. To ensure that you can explore and analyze your data without spending money on inefficient queries, Mingsheng Hong and Zheng Shao created Bluesky Data. In this episode they explain how their platform optimizes your Snowflake warehouses to reduce cost, as well as identifies improvements that you can make in your queries to reduce their contribution to your bill.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
- Your host is Tobias Macey and today I’m interviewing Mingsheng Hong and Zheng Shao about Bluesky Data where they are combining domain expertise and machine learning to optimize your cloud warehouse usage and reduce operational costs
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Bluesky is and the story behind it?
- What are the platforms/technologies that you are focused on in your current early stage?
- What are some of the other targets that you are considering once you validate your initial hypothesis?
- Cloud cost optimization is an active area for application infrastructures as well. What are the commonalities and differences between compute and storage optimization strategies and what you are doing at Bluesky?
- How have your experiences at hyperscale companies using various combinations of cloud and on-premise data platforms informed your approach to the cost management problem faced by adopters of cloud data systems?
- What are the most significant drivers of cost in cloud data systems?
- What are the factors (e.g. pricing models, organizational usage, inefficiencies) that lead to such inflated costs?
- What are the signals that you collect for identifying targets for optimization and tuning?
- Can you describe how the Bluesky mission control platform is architected?
- What are the current areas of uncertainty or active research that you are focused on?
- What is the workflow for a team or organization that is adding Bluesky to their system?
- How does the usage of Bluesky change as teams move from the initial optimization and dramatic cost reduction into a steady state?
- What are the most interesting, innovative, or unexpected ways that you have seen teams approaching cost management in the absence of Bluesky?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bluesky?
- When is Bluesky the wrong choice?
- What do you have planned for the future of Bluesky?
Contact Info
- Mingsheng
- @mingshenghong on Twitter
- Zheng
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Bluesky Data
- RocksDB
- Snowflake
- Trino
- Firebolt
- BigQuery
- Hive
- Vertica
- Michael Stonebraker
- Teradata
- C-Store Paper
- OtterTune
- dbt
- infracost
- Subtract: The Untapped Science of Less by Leidy Klotz
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L. Your host is Tobias Macey. And today, I'm interviewing Mingsheng Hong and Zheng Shao about Blue Sky Data, where they are combining domain expertise and machine learning to optimize your cloud warehouse usage and reduce operational costs. So, Mingsheng, can you start by introducing yourself?
[00:01:41] Unknown:
Thank you, Tobias, for having us. My name is Mingsheng Hong. I worked with my old friend and now cofounder, Zheng, to start Blue Sky about 3 months ago. Before that, I spent 8 and a half years at Google, initially working on the database underneath Google's ads stack, what people internally jokingly refer to as the real database, because as you know, Google has a few database stacks. And then I moved on to building the machine learning infrastructure, the TensorFlow runtime, to make Google's machine learning workloads faster and cheaper. And before that, I spent 5 years working in 2 early stage startups in Boston.
And before that, I finished my master's and PhD in databases. So I've been working in data infra, and more recently ML infra, over the last 15 years or so. And, Zheng, how about yourself?
[00:02:38] Unknown:
Yeah. So my name is Zheng. I'm the CTO and cofounder of Blue Sky Data. Very nice to be here. I started my career after I graduated from UIUC in 2005, when my first job was at Yahoo, working on the web search engine. One system I got to know from that time is Hadoop. At that time, it was kind of a small project, and I didn't realize how big it would become in the industry later on. Then in 2008, I joined Facebook and became one of the first developers on the Hive project. Hive was an internal project at Facebook at that time. Again, we didn't realize how big an impact it would make on the big data industry in the later years. I spent 2 and a half years on the Hive project, then moved on to stream processing at Facebook and started that team, and later moved on to databases like MySQL and RocksDB.
After being at Facebook for 6 years, I went to Dropbox and worked on Dropbox infrastructure for about one and a half years. Then at the end of 2015, I joined Uber as the head of data infrastructure. Soon after, I transitioned back to IC and worked on Uber's data architecture, where we scaled the data architecture by about 1,000x in terms of data storage over the last 6 years. My personal contribution in the last 3 years was around cost reduction of big data at Uber, and we were able to reduce that by about one third over a three-year journey, with many, many contributors to the project. I recently graduated as a distinguished engineer from Uber, and I started Blue Sky together with Mingsheng. Mingsheng and I actually got to know each other at a common friend's wedding back 18 years ago. So we have been in the same industry, talking with each other over time, although we never worked at the same company until Blue Sky. But we actually know each other really well, and that's also why we decided to team up a couple months ago to start the company. Yes. On that last point, I'm very thankful that Zheng held out for me to start a company together.
[00:04:38] Unknown:
As you folks might know, a large portion of Uber's former data eng team members have been out starting their own companies, such as Onehouse and others, and a large portion of the modern data stack is now, you know, being operated by alumni from Uber. So I'm very thankful to have the opportunity to start Blue Sky with my old friend Zheng.
[00:05:03] Unknown:
In terms of the actual Blue Sky project, can you give a bit more detail about what it is that you're building there and some of the story behind how it came to be and why you decided that this particular problem was what you wanted to spend your time and energy on? Sure. I can start.
[00:05:18] Unknown:
So as Zheng mentioned earlier, we reconnected last winter, and we were looking for new opportunities. One of the areas that excited us both is the data cloud. We both see that the market leaders such as Snowflake and Databricks are actually doing well. They're already very successful, iconic businesses. At the same time, based on Gartner and other research, there is a $40 to $50 billion market for data warehouses, and that suggests to us that a majority of the data is still on prem. Over the next 5 to 10 years, we think this is no longer a question of if, but a question of when, regarding how people will leverage the power of the cloud for future analytical data management. And this is where we both think there is a very early opportunity, and a lot of room to improve and to help people fundamentally change how they manage data by leveraging the power of the cloud.
And specifically, we started by discussing our own career experiences. One of the areas that we are both passionate about is making data warehouses and analytical computation faster and cheaper. As we reflected on our own experiences working at some of the world's leading tech companies, we feel that this is a pretty prevalent problem. When it comes to big data efficiency, everybody is kind of doing it wrong. There's a lot of room for improvement. And then when we started talking to our friends who have been running data cloud instances at their own companies, that also really resonated with them. We didn't really have a concrete product at that time, but we were already receiving invitations from some of the current leaders in different industry spaces, such as crypto and online grocery shopping companies, to consult and help them optimize their data cloud workloads.
So then we started talking to investors, and soon things clicked really well. And that's how we started this Blue Sky journey. Our initial focus is to help Snowflake users improve their cost efficiency in running their data cloud workloads.
[00:07:36] Unknown:
Exactly as Mingsheng said. One of my most memorable experiences at my previous company is that sometimes one big data pipeline can eat up a lot of resources and cost the company a lot of money if it is not optimized well. There was one pipeline that had been running for more than a year at the company, and we didn't notice it. Once we noticed, it was really expensive, costing the company something like a quarter of a million dollars a year. So we dug in and tried to optimize it, and we were able to reduce the cost of that query by something like 1,000x, because the query was reprocessing all of the historical data every single day. Instead, we changed it to incremental computation, and the cost dropped dramatically. That is one example that really supports the case Mingsheng mentioned. There are so many opportunities to improve big data efficiency, and a lot of times the owner of the data platform may not even know how many opportunities are there. That's one of the main reasons why we started Blue Sky, with a mission to help everyone who uses big data.
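To make the kind of rewrite Zheng describes concrete, here is a minimal, hypothetical sketch of the difference between a daily job that recomputes all history and one that only processes the newest day. The table and column names (events, daily_metrics, event_date, amount) are invented for illustration and are not from the episode.

```python
# Hypothetical illustration of a full-recompute pipeline versus an incremental one.
# All table and column names are made up; the SQL is generic warehouse SQL.

FULL_RECOMPUTE = """
CREATE OR REPLACE TABLE daily_metrics AS
SELECT event_date, COUNT(*) AS events, SUM(amount) AS revenue
FROM events                              -- rescans every day of history, every run
GROUP BY event_date;
"""

INCREMENTAL = """
INSERT INTO daily_metrics
SELECT event_date, COUNT(*) AS events, SUM(amount) AS revenue
FROM events
WHERE event_date = CURRENT_DATE - 1      -- scans only the newest day's data
GROUP BY event_date;
"""
```

The work done by the first form grows with the full history, while the second stays roughly constant per day, which is where cost reductions of the magnitude described above can come from.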
[00:08:49] Unknown:
And you mentioned that your current primary focus is on customers of Snowflake, which is definitely one of the larger platforms that people are using for managing their data warehouses. But there are also a number of contenders that have significant market share, the most notable probably being BigQuery, but there is also Redshift, particularly with some of its newer architectural designs to make the platform more scalable. I believe that Azure has its own cloud data warehouse, and then there are also other third party contenders such as Firebolt and some of the managed platforms for running things like Trino.
And I'm wondering what your selection process was for deciding that Snowflake was where you wanted to spend your initial energies and what the sort of design approach has been to be able to make your current tooling adaptable to some of these other technologies to be able to give people some of the similar benefits of what you're currently focused on providing to Snowflake users?
[00:09:49] Unknown:
Let me take this question in two parts, where the first part is about why we chose Snowflake and the second part is about how this technology can potentially extend to other technologies. In terms of Snowflake, when we talk with big data users, we realize that Snowflake has a very good user experience. A lot of users who use Snowflake are, in general, pretty happy, but they are mainly worried about costs. Costs can go really, really high. Costs going high is, on one side, also a reflection of how easy Snowflake is to use, because if it is easy, then of course everybody in the company wants to use it, and then the cost will go up. And that's exactly where Blue Sky can quickly come in and provide value: to reduce the cost, to make the queries run consistently fast, and so on. Other technologies that also have a SQL interface could definitely benefit from the same techniques that Blue Sky will be building, but for the beginning, we want to focus on one technology first. Also, on that front, I have one note that might be useful for everyone: when we look at these new generation technologies, every single technology actually has its strengths. Some users are asking us, saying, hey, shall we move from A to B, or B to C, or C to A? Our advice is: don't move around. Data migration is a huge pain and a lot of cost, and we would rather have our users stay with one solution, and then we can help them make that solution work better for their use cases.
So coming back to your second question, we do intend to extend to technologies other than Snowflake, and we will also urge users not to move around too much, because every technology has its strengths and its drawbacks, and Blue Sky hopefully at some point will be able to help you even if you are working on a technology that we are not working on yet. We picked Snowflake
[00:11:41] Unknown:
also because, as we mentioned earlier, some of our friends happen to be managing Snowflake instances, and they are in pretty strong need of our help. Also, from our own research, it seems a pretty large portion of Snowflake users are higher level users coming from a data analytics or data science background, as opposed to some of the other cloud warehouse users who have more of an engineering background and are more used to improving and tuning their own databases. And as I mentioned, we will look to expand our optimization to other data cloud products.
The other thing I was mentioning is that when we started Blue Sky, we also explored the option of building yet another query engine, just like everybody else, as you mentioned, Tobias, like Firebolt and whatnot. But our thesis is that while it is possible to spend yet another thousand engineer-years and build such an engine from scratch, as Zheng and I have both done in our past work, such as Zheng's work on Hive and my earlier work at Vertica and then on Google's ads database, we do not think that is the key pain point for the end users. My own belief is that if you build a query engine from scratch, let's say starting 2 or 3 years ago, the end result might be 5 or 10 percent better than, you know, a state of the art engine.
I don't think any engine can be on average 5 to 10x better. It's possible to be better by such a magnitude on a slice of the workloads, but on average, I do not believe that will be the case. Instead, the bulk of the improvement, in terms of both performance and cost, is in tuning. It's in tuning the database product, the query engine, based on users' workloads. That is often overlooked. So we believe kind of a dirty secret in this industry, as Zheng also alluded to, is that when users claim that moving from database product A to product B got them 5 to 10x better, faster, cheaper results, it's not the product per se. It is because they did a house cleaning. They thought about the database schema design, thought about the physical schema design with indices, materialized views, and so on. And so why don't you just stay with your current database product? If you do make the same efforts, you could achieve similar results. That is why we are starting by helping our users migrate from Snowflake to Snowflake, in some sense, by doing the house cleaning and helping them optimize for their workloads. And we will do the same with the other products, including possibly products we will build in the future.
That being said, it is possible that in the longer run there's more room for improvement across the different products. There is the so-called right tool for the job, as Professor Mike Stonebraker often likes to say. So it's also important to map the right slice of the workload to the right underlying tool and infrastructure. And that's something we would also look into.
[00:14:53] Unknown:
The interesting element of what you're building at Blue Sky is that it parallels what a lot of application and infrastructure teams are dealing with in their adoption of cloud technologies, where it's very easy to get started. There's a good user experience of being able to put things into production, but then it becomes difficult to actually track what your spend is going to be, predict it, and understand how different, you know, application architectures or system architectures are going to impact your overall costs to be able to run these systems. And I'm wondering what you see as some of the commonalities and differences between the compute and storage optimization that cloud bill optimization companies are doing and what you're doing at Blue Sky to optimize the utility and costs for Snowflake users?
[00:15:40] Unknown:
Indeed, Tobias. As you mentioned, there has been a pretty large industry of cost optimization and cost visibility companies over the last 5 to 10 years in the public cloud space for AWS, GCP, and Azure. We have talked to friends from companies like CloudHealth, which was acquired by VMware a couple years ago, and there are a lot of good learnings. Between these companies and Blue Sky, we see a couple of common elements and some differences. So to start with the common elements: first of all, cost visibility is the foundational element.
Without understanding how the cost is attributed to individual computational jobs, or in the case of data clouds, individual SQL query jobs, it is hard to understand which internal users and teams in the company are spending the most. So we need to provide that kind of visibility and accountability. Some of our users refer to this as the wall of shame. We don't use this phrase, but we supply the technology that users can adopt for their own policy. Interestingly, in the case of Snowflake and other data clouds, even though the underlying products provide cost visibility in terms of how people are using the data warehouses, they do not provide attribution of the cost to individual query jobs.
Some users get surprised by that, but there is a technical reason behind it. As you know, once users start a data warehouse, they pay based on the seconds, based on the time duration they use. So during that time window, it doesn't matter whether they don't run any query or, let's say, they run 3 concurrent queries. For that reason, there is no direct attribution from the cost people pay to the individual queries they run. And the first thing we did when we built our first product, called Blue Sky Mission Control, was to implement a little algorithm that is part of our secret sauce and allows us to attribute the cost to individual queries.
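Bluesky's attribution algorithm is described here as secret sauce, so the following is only a minimal sketch of one plausible approach to the problem being described: splitting the cost of a warehouse billing window across the queries that overlapped it, proportionally to their active time. It deliberately ignores per-second concurrency weighting and other refinements a real implementation would need.

```python
from dataclasses import dataclass

@dataclass
class Query:
    query_id: str
    start: float  # epoch seconds
    end: float

def attribute_cost(queries, window_start, window_end, window_cost):
    """Split the cost of one warehouse billing window across the queries that
    ran during it, proportionally to each query's overlap with the window.
    This is a simplified model: it does not weight by how many queries ran
    concurrently in each second, and idle time is left unattributed."""
    overlap = {}
    for q in queries:
        active = min(q.end, window_end) - max(q.start, window_start)
        if active > 0:
            overlap[q.query_id] = active
    total = sum(overlap.values())
    if total == 0:
        return {}  # the warehouse was billed but sat idle for the whole window
    return {qid: window_cost * t / total for qid, t in overlap.items()}

# Example: a window costing 0.1 credits shared by two overlapping queries.
print(attribute_cost([Query("q1", 0, 60), Query("q2", 30, 60)],
                     window_start=0, window_end=60, window_cost=0.1))
```

The interesting design questions, left out of this sketch, are how to treat idle seconds, queueing time, and multi-cluster warehouses.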
Thanks to that fundamental element, we can now aggregate queries based on usernames, teams, and the organizational structure, so users can get visibility and accountability. We can also find the top-k most expensive queries or query patterns so that we can work with the users to prioritize which queries should be tuned. So I want to say the first common element is cost visibility, but when it comes to data clouds, there is a technical barrier to implementing and providing cost visibility properly, and we were able to crack that puzzle. The second common element is to pick low hanging fruit first when it comes to cost optimization.
In the case of the public clouds, that could be revisiting the VM size or the container size. For us, it could be the data warehouse size. It could also include looking at resource utilization and reconfiguring the warehouses or the VMs to reduce idle time. One story I would like to share here is that when we engaged our trial users, in the first 2 to 3 weeks we managed to identify about 20% of their overall Snowflake cost that could be optimized away. So both parties were very ecstatic about such findings. Part of the reason we were able to land such a large impact initially is that there was low hanging fruit. We could help them find very expensive jobs that, based on the users' review, don't add much business value. Some of them can simply be thrown away, and for the rest we provide suggestions for optimizing them.
So picking low hanging fruit is important. And the last common element is that we plan to charge for our product based on the value we provide, so-called value based pricing. In our area, cost reduction, the savings, is one of the key values, though not the only one. We also help people analyze more data faster and help the data engineers and the CIOs gain more cost visibility. But when it comes to cost savings, our thinking is to charge a percentage of the savings we actually deliver to them, thereby staying fully aligned with the user's interest. And that is also a common best practice among the other public cloud cost optimization tools. In addition to these common elements, there are a couple of key differences.
The first one I already mentioned: even computing cost visibility requires some non-trivial data infra expertise and understanding of the internals. This is not something that, you know, average Snowflake users will be able or willing to do themselves, and that is one of the key values that we offer. The second one is that for public cloud computation, the workloads, the user jobs, tend to be opaque to those vendors and products. In contrast, the SQL jobs and SQL queries are transparent to us. A large part of our value addition is obtained by in-depth analysis of these SQL jobs. So we can go and, for example, based on the query predicates and the GROUP BY columns, suggest how users could reconfigure their table clustering keys, or rewrite the queries in deeper ways that standard database query optimizers cannot.
So these are the values that our product can add in complement to the existing data cloud products. In contrast, this will be hard to do when the VMs, you know, the EC2 jobs, are opaque to the cost visibility tools. The last part is auto tuning. Our vision is that, over time, we want to make our product really easy to use so that users do not need to manually take the suggested tuning and optimization ideas from us. Instead, we apply them automatically for our users. This way, our users can allocate their own internal talent, their engineering resources, better on their own product and business, and leave a large portion of data cloud management to Blue Sky. As far as
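As a concrete illustration of the clustering-key suggestion mentioned above: in Snowflake the clustering key is a table property, so a recommendation derived from the observed predicates and GROUP BY columns might be applied roughly like this. The table and column names are hypothetical, and the choice of columns would in practice come from workload analysis.

```python
# Hypothetical example of applying a clustering-key recommendation in Snowflake.
# The idea is to cluster on the columns that dominate filter predicates and
# GROUP BY clauses so that micro-partition pruning can skip most of the table.
SET_CLUSTERING_KEY = """
ALTER TABLE fact_events CLUSTER BY (event_date, customer_id);
"""
```

Clustering keys are not free, since automatic reclustering itself consumes credits, so a recommendation like this typically only pays off when the targeted queries are frequent and selective on those columns.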
[00:22:05] Unknown:
the experiences that you've each had at these large companies, where in a lot of cases you were dealing with on-premise systems, because some of the companies you worked at predated the widespread availability of cloud data warehouses, or the overall scale made it such that it was not economical to use those systems. I'm wondering what are some of the lessons that you learned working at those lower levels, or working with these more self-managed systems, that have given you the insights necessary to identify these potential optimizations of workload, and to understand enough of what's happening behind the scenes at Snowflake and some of these other systems to be able to effectively target opportunities for reducing wasted cycles and wasted workflows?
[00:22:51] Unknown:
So the first thing I want to say is that on-prem data platforms and cloud data platforms are very different in terms of cost management. For on-prem data platforms, problems usually arise when the queries become slower, because on-prem data platforms are usually fixed in size. So whenever there is more workload, or inefficient workload, users first complain about the speed. In the cloud world, however, people usually won't see slow speed because cloud data platforms are able to autoscale. The only surprise a user gets is at the end of the month, when they suddenly see, whoa, their bill is actually 10x bigger than last month. So that is one very big difference that we all realized. The fundamental similarity between the two is that new workloads coming into big data can be very hard to manage, because one of the main reasons all these companies need data engineers and data scientists is to write new jobs. So by definition, the workload on the data platforms is not stable. If it's not stable, the cost will grow. So the question is: how fast is normal growth versus abnormal? There's a lot of interesting analysis in there that can help us figure out which ones are not good. One lesson that we learned is: whenever the cost is small, don't try to, like, install a lot of new systems.
In my early experience, the complexity of the data infra sometimes can kill the team. If we have 5 different query engines in the company, then not only will our data infrastructure team be spread thin, our users will also not know which query engine they should use for what kind of workload. We would rather have most users use a single engine and help them whenever their cost grows big enough. At the end of the day, if the cost of one user is only 0.1 percent of the whole data platform, it really doesn't matter if that user is 5x less efficient, because that's only 0.5% of our cost. And I would just say, simply put, the biggest lesson we learned is to always have visibility into the cost distribution first, before doing any automation.
And also, be very careful about introducing new systems, because once we start to introduce new systems, it's going to be really hard to manage the data consistency, the data quality, the schema, data discovery, and a lot of other problems. I guess Mingsheng can add some more. So let me also share
[00:25:26] Unknown:
a past episode from my own projects that I believe has relevance to modern data cloud optimization. As I mentioned earlier, I started my career at a columnar database startup called Vertica. Back then, I was fully confident that Vertica could be the system that provides a 10x speedup and scalability improvement over incumbents like Oracle and Teradata, because I had read the paper from MIT called C-Store back in 2005. Now, in terms of the day to day work, one of the key pain points was that as we went and performed POCs with our trial users, the Oracles and Teradatas each had a large army of DBAs who were very good at tuning the user workloads by hand. Unfortunately, outside of this little company of Vertica, with about 20 engineers, nobody knew how to tune a new column store.
And without tuning, the system could not realize all its potential. Our company decided to prioritize building a new component, an auto tuner called the Vertica Database Designer, and I was fortunate to tech lead this project. What this component does is automatically analyze users' table metadata, like statistics, as well as SQL queries. These queries could come from BI dashboards like Tableau or be written by people. Let's say in the modern data stack, it could be the dbt queries that are part of the transformation in the ELT workflow.
And so our database designer would analyze these queries and the table stats, and then come up with ways to organize the data, such as segmenting it across the nodes, sorting it, and figuring out which encodings to apply to individual columns. As a result, on some of the most complex workloads, we managed to even outperform the tuning results of our in-house experts. Our tech cofounder and CTO, Professor Mike Stonebraker, initially couldn't believe it. We reviewed it with him and showed him how complex the workloads could be. Back then, Vertica had some of the largest users, such as Zynga. Some of those workloads contained hundreds of tables and thousands of queries.
And it became clear that no single database expert, not even a hardworking startup engineer, could manage to tune all of those queries. So that is the power of having an algorithm-driven auto tuner. And I believe that in the modern era, similar technology can be applied to data clouds such as Snowflake, in terms of configuring the table clustering keys, as well as creating materialized views and other forms of indices that the data clouds support.
[00:28:32] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder. And as far as the automatic tuning of the engine and figuring out which parameters to tweak, it puts me in mind of the OtterTune project, which is being run by Andy Pavlo from CMU.
And I know that that's focused primarily, at least initially on operational databases, so MySQL, Postgres, MSSQL. And I'm wondering what are some of the sort of commonalities as far as the approach that he's taking and what you're trying to do with Snowflake and eventually other data clouds?
[00:29:39] Unknown:
Yes. So earlier on, Zheng and I studied the OtterTune project. We also established contact with Professor Andy Pavlo. We have some mutual friends, such as Andy Palmer, who was our early business CEO and cofounder at Vertica. What we saw is that, as you mentioned, Tobias, both OtterTune and Blue Sky are on the mission of automatically optimizing the user's data infra. The key differences are, first of all, that for now OtterTune seems to be focusing on the OLTP space, whereas Blue Sky is in the OLAP space. And secondly, the key focus for Blue Sky is to tune the system based on users' query workloads.
Whereas for now, it seems OtterTune focuses on tuning the system parameters, such as the buffer pool size. In our view, the system parameters are also very important to tune. But when it comes to data clouds, the cloud operators such as Snowflake have kind of tuned these parameters by hand, and what is left and most important to tune are the users' query workloads. So that is how we chose to start with Blue Sky.
[00:30:54] Unknown:
And so as far as the areas that drive cost and some of the issues that you identify for being able to target optimizations, what are some of the most significant contributing factors that actually drive up the costs in the first place? And what are the factors, in terms of the pricing model of things like Snowflake, the ways that the organization is using it, or some of the inefficiencies in how the data workflows are being designed and executed, that you work with to identify areas to reduce that overall spend?
[00:31:29] Unknown:
Yes. At a high level, we see three factors that lead to that kind of inefficient use and inflated cost. First of all, it's the mindset and the best practices of the users. In the on-prem era, the best practices were quite different from those of today's data clouds. But as we all know, when there is a new generation of technologies coming up, such as the first-gen data clouds we're seeing, people tend to apply their old practices while using the new tools, so-called old wine in a new bottle. As a key example, when analyzing the query history from one trial user recently, we found one peculiar SQL query that has been costing about $96 per run and generating no business value.
That query ran on what's called an XL-sized warehouse, and it simply times out after 2 hours on each run. So 2 hours of an XL warehouse would cost, I believe, 32 credits, and the public price for each credit is $3. So that's about $100 per run, and each run times out. Unfortunately, that query had been retried more than a hundred times by the time we found it in the logs. So that is basically the kind of $10,000 of damage that a single data cloud user could do to the system and to the company. And that company has hundreds of data cloud users.
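For reference, the arithmetic behind those figures, using only the numbers quoted in the episode (32 credits for 2 hours implies an XL warehouse at 16 credits per hour, and a list price of $3 per credit):

```python
# Worked arithmetic for the runaway-query example above, using the episode's figures.
credits_per_hour_xl = 16      # implied by "2 hours ... 32 credits"
hours_per_run = 2             # the query times out after 2 hours
price_per_credit = 3.0        # public list price quoted in the episode

cost_per_run = credits_per_hour_xl * hours_per_run * price_per_credit
total_wasted = cost_per_run * 100   # retried more than a hundred times

print(cost_per_run)   # 96.0   -> roughly $100 per run
print(total_wasted)   # 9600.0 -> roughly $10,000 in total
```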
So this is the kind of impact, and not in a positive way, that users can bring to a new data cloud product, partly because they are not yet used to how to use such products effectively. So we are on a mission to provide better tooling, but also to help cultivate the best practices. The second factor is organization and priority. As you know, especially in the last couple of years, a lot of the data cloud users, the companies, were in a very fast growth period. Now that fast-growth market is coming to an end, but in the last couple of years, with such huge growth on the business end, there wasn't as much focus on internal cost optimization.
And it is fair to say that even though the Snowflake users can be technically very strong, it's simply not their job, not their key priority, to be looking out for such cost optimization. And last but not least, as we mentioned earlier, with a first-generation data cloud, we also believe the current technology and product have room for improvement. So this is also what we will be focusing on for the next 5 to 10 years at Blue Sky.
[00:34:19] Unknown:
And in terms of the actual technology that you're building, you mentioned that your initial product is called Blue Sky Mission Control. I'm wondering if you could just talk through some of the design and architecture of that platform and the different signals that you're hooking into for being able to identify opportunities for cost optimization and performance tuning?
[00:34:39] Unknown:
So there are a couple of signals, from simple to complex. The first one, the simplest one, is the utilization of the warehouse. When we look at the Snowflake warehouses, sometimes there is idle time before the warehouse shuts down. Sometimes it can also be the interval idle time between the last query finishing and the next query starting. There are simple parameters that we can tune, and there are ways that we can help users, let's say, move queries around among the warehouses in order to reduce their credit usage while sustaining their query performance at the same time.
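As a rough sketch, not Bluesky's actual method, of what the idle-time signal described here might look like when computed from a warehouse's query history: gaps between consecutive queries are billed until the warehouse auto-suspends, so they can be summed and capped at the auto-suspend setting. The function and numbers are illustrative only.

```python
# Minimal sketch of the "interval idle time" signal for one warehouse.
# query_intervals: list of (start, end) epoch seconds, sorted by start time.
def idle_seconds(query_intervals, auto_suspend_s=600):
    """Sum the billable gap time between consecutive queries, capping each gap
    at the warehouse's auto-suspend setting (after which billing stops)."""
    idle = 0.0
    for (_, prev_end), (next_start, _) in zip(query_intervals, query_intervals[1:]):
        gap = next_start - prev_end
        if gap > 0:
            idle += min(gap, auto_suspend_s)
    return idle

# Two queries separated by a 15-minute gap on a warehouse that auto-suspends
# after 10 minutes -> 600 billable idle seconds.
print(idle_seconds([(0, 120), (1020, 1100)], auto_suspend_s=600))
```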
And then the next step above that is to look at the query text itself. Some of the queries users write may not be in the optimal form. To give one example: when people want to read the last 3 days of data, the right way of writing the query would be: where the timestamp column is greater than some value. But we noticed there were also cases of people writing it in a way like: where the date of the timestamp is greater than today minus 3. The main difference between these two forms is that the first way of writing can easily be pushed down as a predicate into the storage engine and make the query extremely fast, but the latter way of writing can have issues, because the query engine may not be able to push it down and use the range of the timestamp column to filter out data blocks, let's say the table's micro-partitions, so they cannot actually be pruned. Those single-query rewrites are very helpful, and a lot of times users can take those signals, just apply them to their warehouses and their query history, and they get the wins immediately. And the most complex signal that we're looking for comes from looking at the whole query history for a whole day or a whole week.
We identify the queries that are similar to each other. Those queries could be treated as incremental computation, but are actually computed from scratch every time. We then find ways to share the computation through techniques like materialized views, precomputed results, or cached subqueries, things like that. That part is the more complex technology, but it will yield the biggest gains. That's also what we are still developing right now. Those are pretty much all the signals that we have.
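The "last 3 days" example Zheng gives can be made concrete with a small, hypothetical pair of queries. The table and column names (fact_events, event_ts) are invented; the point is only the shape of the predicate.

```python
# Hypothetical illustration of the sargable-predicate rewrite described above.

# Harder to prune: the filter wraps the column in a function, so the engine
# generally cannot use the column's min/max metadata to skip micro-partitions.
FUNCTION_ON_COLUMN = """
SELECT *
FROM fact_events
WHERE CAST(event_ts AS DATE) >= CURRENT_DATE - 3;
"""

# Prunable: the bare column is compared against a constant expression, so the
# predicate can be pushed down and whole micro-partitions skipped.
BARE_COLUMN = """
SELECT *
FROM fact_events
WHERE event_ts >= DATEADD(day, -3, CURRENT_TIMESTAMP());
"""
```

Both forms return roughly the same rows, but the second keeps the predicate directly on the stored column, which is what lets partition pruning kick in.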
[00:37:05] Unknown:
So as far as the workflow for somebody who is getting started with Mission Control, where they're starting to run it against their data warehouse and understand what inefficiencies it is identifying, can you just talk through that workflow of getting set up, finding the areas to optimize, and then what they actually do with that information once it's discovered?
[00:37:30] Unknown:
So currently, we get customer leads both from our network, through warm intros from our advisors, investors, and friends, and we are also starting to get some cold leads based on our own marketing efforts. Once we establish initial contact, we basically show people our demo, and we work with them to create a read-only account for Blue Sky in their Snowflake instance. They grant us permission to select from metadata tables such as the query history. We do not need to see their business data, and that's one reason we can assure our trial users that there's no concern with sharing, essentially, their workload metadata with us. Then we start ingesting the data into our own SaaS back end, we do the analysis, and we show the results via our GUI dashboard.
On the dashboard we show the users, as I mentioned earlier, how we compute the query cost. We attribute the cost to their queries, aggregate them across key dimensions such as by user and by query type, and we show the top-k most expensive query patterns. Upon that initial review, users can then decide which of the expensive query patterns can simply be removed if they do not add enough business value. Then we also provide tuning suggestions. The tuning suggestions are a combination of our product-generated insights as well as our own insights coming from our past, you know, 15 years each, along with our founding team members, of experience manually optimizing workloads.
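Bluesky's own queries and cost model are not public, but as a rough illustration of the kind of metadata such a read-only integration works from: Snowflake exposes account-usage views such as QUERY_HISTORY that a read-only role can be granted access to. The sketch below ranks users and warehouses by total execution time over a week; it is only a proxy, since real per-query cost attribution needs the warehouse-window splitting described earlier.

```python
# Rough sketch of a metadata query over Snowflake's account-usage views.
# SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY is a standard view; the grouping and
# the "cost" proxy (execution hours) are illustrative, not Bluesky's method.
TOP_SPENDERS_LAST_WEEK = """
SELECT
    user_name,
    warehouse_name,
    COUNT(*)                          AS query_count,
    SUM(execution_time) / 1000 / 3600 AS execution_hours
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY user_name, warehouse_name
ORDER BY execution_hours DESC
LIMIT 20;
"""
```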
So it's a bit of consulting initially, with the goal of feeding our consulting knowledge into our product over time. That's the second layer, the insights. And then the last layer is the auto tuning that we will build out over time. What we have seen is that in the initial 2 to 3 weeks, we are often able to find fairly non-trivial low hanging fruit. As I mentioned, for one of the trial users, in the first 2 to 3 weeks we identified about 20% of the spend that could be optimized away, and the users spent a total of 2 hours with us. So that was a great result, and it also encouraged us to accelerate and expand our search for trial users.
Beyond the initial low hanging fruit, we also see two areas where our product can add long lasting value and become more sticky. The first is the set of monitoring dashboards we provide, so users can go and look at the findings on a daily or weekly basis and then decide what to do with the workloads. This is a form of DIY cost reduction or workflow optimization: they use our tools, then they do the rest. In the future, such tools will also be enhanced with features such as alerting support, so that when new unoptimized workloads come up, we can very quickly identify them and notify the users.
The second point of sticky value addition for our product is the auto tuning we mentioned. This way, users do not need to be bothered with learning all of the best practices and then manually applying them to optimize the workloads. Instead, we can operate on behalf of the users. And we think this will help bring their big data management back under cruise control, so our users can focus more on their own product and business.
[00:41:25] Unknown:
And you mentioned a few points of how you aim to be delivering utility and value beyond just the initial phase of finding ways to cut the bill. And I'm wondering if you can maybe talk to some of the ways that having that information as a sort of constant utility as you're working through developing your workflows, developing your data pipelines and models, informs the ways that data teams think about how to actually build their systems and how to understand sort of where they want to spend their effort and how they might optimize the value of a given table or a given dataset to make sure that they're not duplicating effort and recreating effectively the same table 3 or 4 times? So as we mentioned, in our current product, the set of dashboards would provide them with the continuously
[00:42:19] Unknown:
refreshed set of most expensive queries. So this will help them gain visibility into which queries are not well optimized. This is what some of our users refer to as the wall of shame. And we would provide some suggestions for how the query workloads could be tuned. But our long term vision is to build out auto tuning so that users need not be bothered to manually look at the queries and apply the optimizations. Instead, our tools can do this automatically for them. To add onto that,
[00:42:51] Unknown:
the opinion from Blue Sky is that users of Snowflake and so on should focus on building value. They should not spend too much time thinking about, let's say, cost efficiency or automation, because those are the things that, ideally, the platform should provide by itself. That's also where Blue Sky is going to help. If the data team has to spend a lot of time thinking about how to optimize their data usage, then they are probably not spending enough time on their main business, which may not be good for the business itself. That's the message, that's the opinion that we have in general. So I think I got your question, but once there are too many best practices, that distracts users from doing their main jobs,
[00:43:34] Unknown:
yeah, if that makes sense. Another thing that I'm curious about, if you've gained any insight on it, is how the tool chain that data teams end up using can contribute to building either more efficient or more wasteful pipelines, where maybe if they're using something like dbt, they have a better sort of core utility of reuse of datasets, versus if they're doing a lot of manual table definitions and workflows or custom code, or if there are different styles of data modeling that can contribute more to wasted spend or duplication of effort. So, you know, whether it's Data Vault or snowflake schemas or wide tables, and just how the tool chain and data modeling and organizational considerations all factor into the ways that that impacts the costs at the end of the day. Yeah. It's interesting, Tobias, that you mentioned dbt.
[00:44:31] Unknown:
So one interesting anecdote I would like to share is the following. Let me start with a caveat: it's not the adoption of dbt itself that made the quality of the code better or worse. In fact, it probably made it better, because dbt helps people manage their SQL code. But what we have seen is that the wider adoption of tools like dbt encouraged the pattern of ELT, and that brought in more people who did not necessarily use to write so many SQL-based data pipelines. And since they now need to deliver new pipelines under business pressure, as you kind of alluded to, some of them will go and copy-paste 600 lines of CTE, you know, common table expression, SQL code from somewhere else.
And so that can lead to some non-trivial duplicate computation across the different SQL pipelines being created. Now, this is more due to the factor I mentioned earlier, the organization, the business pressure, and the priority. One thing Blue Sky can help with is, over time, identifying such redundancy from the query history, flagging it, and helping people remove it or automatically rewriting it through our auto tuner. One interesting anecdote I want to share: a former head of data, who is our advisor, was running a big Snowflake instance back then. They mentioned to us recently that it used to be ETL, so the transformation was done before the data hit Snowflake,
but through adopting the ELT pattern, the cost of running the Snowflake-based transformations became too high, and their eventual solution was to go back to ETL just to reduce the cost. And to me, this is very unfortunate. We understand how people sometimes go out of their way to reduce cost, or in the name of getting the best efficiency, break down the abstraction layers of the software architecture and go handwrite assembly code, for example. We at Blue Sky think that we should use the right tool for the right job. Let people continue to use ELT.
Encourage modern software engineering best practices, with testing, continuous integration, and so on, and let a tool like Blue Sky take care of optimizing performance and cost. So we hope to partner with tools and companies like dbt to further encourage the adoption of ELT without people fearing the cost consequences.
[00:47:13] Unknown:
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite, including product analysis, user funnels, feature flags, experimentation, and it's open source so you can host it yourself or let them do it for you. You have full control over your data, and their plug-in system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog. In terms of your experience of building Blue Sky and starting to work with some of your early customers, I'm wondering what are some of the most interesting or innovative or unexpected ways that you have seen them approaching cost management and optimization in the absence of Blue Sky?
[00:48:04] Unknown:
So let me start here. And this is a learning that I think Zheng has probably also seen at his past job at Uber and so on. I have personally learned a lot about our technology and the business, but one of the key pleasant surprises to me, as I mentioned, is how within the first 2 to 3 weeks of engagement we managed to find 20% of Snowflake cost that could be optimized away, and the user spent a total of 2 hours with us. For that reason, we are very excited to go and find and onboard more trial users, where we can help people in such fundamental ways while they don't have to invest as much in the cost reduction. We understand that in this downturn market, cost optimization is top of mind for every CIO, head of data, and data engineer, and we are here to help. We have a marketing campaign where we're offering a free Snowflake workload health check for eligible users, and by that we mean people who are spending at least $50,000 a year. So if you're interested, please reach out to us to get this health check. And we will also offer some tuning advice and best practices
[00:49:12] Unknown:
with no strings attached. The biggest thing I learned is about the power of attribution, or chargeback. In a lot of cases, some people write SQL queries on their data warehouse and they just forget about them, until at some point somebody reminds them: oh, this query is too expensive, and we should shut it down. We have seen from our initial users that some of them had been trying to do their internal chargeback, and that can be very effective. But at the same time, they don't have enough time to build those chargeback systems end to end themselves, and that's also where we are helping them. But even a very simple chargeback report that's sent out to a wide mailing list on a weekly basis can already help the customer reduce the cost themselves. One thing that I was just thinking about, and I'm not sure that it's necessarily a good idea, but there's a tool in the sort of infrastructure as code space called InfraCost that hooks into Terraform to be able to say,
[00:50:07] Unknown:
if you run this code to deploy these new resources, this is how it's going to affect your bill. And I'm wondering what your thoughts are on having a similar utility for maybe somebody who's using DBT to say, if you add this model to your workflow, it's going to cost this much extra for your Snowflake bill and being able to more directly provide that input to people as they're building out these different datasets to understand, is this extra cost actually worth the value to the business that it's going to provide, or would I be better suited just, you know, pulling this data in maybe a slightly less optimized form from this table rather than materializing an entirely separate table for it? That's a very interesting idea, and I had the opportunity to talk with the founders of InfraCost,
[00:50:52] Unknown:
like, several months ago as well. So one big difference between big data and those microservices is that with microservices, usually, the footprint of the infrastructure cost is relatively stable. Even though they sometimes have a bit of auto-scaling spikes, it's not that much. A big data workload, however, can increase in cost by 10x without changing any code at all. Assume a dbt user who did not change any of their pipelines, but the incoming data is 10x more. Guess how much money they are going to pay now? The answer may surprise you. Sometimes it's not 10x; it may be 100x, because some of the operators in the SQL queries may not be linear in the input data size. As a result, a lot of cost control is actually not just about the code itself; it's a combination of code changes and data changes. But coming back to your question, I can see that if the data volume itself is relatively stable, then there would be a lot of value in adding the capability to integrate, let's say, with dbt, so that whenever there's a code change, it shows how much additional cost people will need to pay in those cases.
[00:52:02] Unknown:
And to add to that, one future product feature that, Tobias, you are hinting at sounds very interesting to me. So please allow me to also think out loud here. This is not a form of product commitment, but I am fascinated by the notion that when people add a future dbt job or a new slice of SQL workload, they could run Blue Sky at that time to estimate the cost of running, you know, their workload for a day, and Blue Sky could provide the actual cost. We could then have the future workloads get certified, or get underwritten, by Blue Sky to provide a certain visibility or even a guarantee on the cost side. And maybe we can even look into, again, huge caveat, just thinking out loud, we can even look into suggesting: hey, for similar workloads, here's how other people have been doing things, and here's the associated business value that's being generated.
One source of inspiration is looking at my own energy bill for my house. Our utility, PG&E, I believe, provides similar reports: okay, here's how other houses with similar profiles consume energy. So we would love to provide that type of data cloud health score for our future users as well. One of the fun things about this podcast is being able to trigger ideas like that and then see people react in real time and help to maybe cross pollinate some of these concepts across different businesses.
[00:53:27] Unknown:
And so in terms of your admittedly short experience so far, but as far as launching and starting to grow the Blue Sky business, I'm wondering what you've seen as some of the most interesting or unexpected or challenging lessons to date.
[00:53:40] Unknown:
I guess the biggest learning I have is when working with a market with many, many different companies. Right? Every single company may have a unique need, and the biggest challenge is usually to find out what are the common needs versus what are the needs that are special to 1 company but does not apply to others. Our past experience is limited to a small set of companies. Right? So maybe altogether, Misha and I work in about maybe 10 companies in total. But the whole market has, let's say, maybe 10, 000 of, big data users or even more. So I think the biggest lesson that we learned is some useful technologies learned from 1 customer may not be useful for others, but at the same time, there were some common technologies.
The trick is how to identify which ones are more universal versus which ones are more unique.
[00:54:29] Unknown:
So one story I wanted to share, from what I have personally experienced of how engineering teams focus on cost reduction in the absence of a tool like Bluesky, is that in a past company, a kind of world-leading tech company, I'm not going to name the name, but if you look up my LinkedIn, you can probably figure it out. There was a period of time when there would be a VP-level mandate that, at every level, every team needed to go and cut their internal, you know, big data cloud spend by some amount. And that's a struggle for many engineers, because cost optimization is often not how people get promoted.
So it's a constant battle between how much managers and senior leadership want to prioritize such one-off events and how much resistance the underlying, you know, the individual engineers feel. And so with a tool like Bluesky, we think we can really help the data engineers out there who might face similar challenges, where the company rightly tends to value how people contribute to their product and business, not cost reduction. So why not leave that work to us, so they can focus more on how they will get measured in their performance?
[00:55:44] Unknown:
Aside from the obvious answer of customers that aren't using Snowflake currently, what are the cases where Bluesky is the wrong choice and teams might be better served doing their own cost optimization or efficiency tuning versus using a service like what you're providing?
[00:56:01] Unknown:
Yes. If you are not using Snowflake today, unfortunately, you have to wait a little. But please do tell us if you are on Redshift or BigQuery or another data cloud so that we can prioritize accordingly. Another type of user who may not be a great fit for Bluesky is one who does not care about the cost. We don't think anyone has an unlimited budget, so that can be ruled out. So the remaining case is people who think they are paying very little. And indeed, if they are only paying hundreds or thousands of dollars a year, then no, this is too early. But if they are starting to pay tens of thousands of dollars a year, it could be a good time to engage, because, first of all, they may be growing fast. Who knows? The whole point of using Snowflake is that it's so easy to use, so they can ramp up fast. And secondly, if they can install, you know, the best practices and the tooling provided by Bluesky, they can apply good housekeeping. They can run a tighter ship, so to speak, early on. To provide an example, we would encourage people to apply query tagging so that they can easily identify which business reasons or which business units send which queries, and not just rely on the warehouse names or the usernames.
And with query tagging, it becomes easy to generate better dashboards and better reports for the cost visibility that we mentioned earlier. So these are the things that we can provide. Not to mention that for, you know, our dashboards, we are not going to charge a high cost. We will just, you know, provide good best practices. And down the road, if we can help you with cost reduction, then we would want to take a percentage of that. So our interests are fully aligned with our users'.
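As a concrete illustration of the query-tagging practice described above, here is a hedged sketch using the Snowflake Python connector. The connection parameters, tag format, and table names are illustrative assumptions, not Bluesky tooling; the underlying mechanism is Snowflake's QUERY_TAG session parameter, which surfaces in the ACCOUNT_USAGE query history and makes per-team or per-pipeline attribution possible.

```python
# Hedged sketch of query tagging for cost attribution in Snowflake.
# Credentials, the tag format, and the "orders" table are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
)
cur = conn.cursor()

# Tag every query in this session with the team / pipeline that owns it,
# so spend can later be attributed beyond warehouse or user names.
cur.execute("ALTER SESSION SET QUERY_TAG = 'team:growth;pipeline:daily_orders'")
cur.execute("SELECT COUNT(*) FROM orders WHERE order_date = CURRENT_DATE")

# Later, queries can be grouped by tag as a starting point for attribution,
# e.g. (elapsed time is only a rough proxy for cost):
#   SELECT query_tag, SUM(total_elapsed_time) AS total_ms
#   FROM snowflake.account_usage.query_history
#   GROUP BY query_tag;

cur.close()
conn.close()
```

The design point is simply that the tag travels with each query into the account usage views, so dashboards can slice spend by business unit or pipeline rather than only by warehouse or username.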
[00:57:40] Unknown:
As you do continue to build out the product and work with some of your early design partners and onboard new customers, what are some of the things you have planned for the near to medium term and problems that you're excited to dig into?
[00:57:52] Unknown:
For the medium term, from our discussions with customers, we realized there were a lot of people who are using non-Snowflake cloud data warehouses, let's say BigQuery, even Redshift, or Databricks SQL, things like that. And we definitely want to extend our offering to other data warehouses in the cloud over that medium term.
[00:58:12] Unknown:
And to add to that, one thing I am excited about in the medium to long term is to apply even more machine learning to our internals. Right now, we are leveraging a lot of our own kind of human expertise accumulated over the past decade or two. But over time, we think we will want to further automate the tool and make it adapt based on users' workloads with more machine learning practices. I mentioned earlier that I spent a couple of years working on machine learning infra and making ML workloads faster and cheaper. So that's an area we think we can apply to our own Bluesky internals.
And the other part is also to provide better integration of ML support on top of the data cloud platform.
[00:58:55] Unknown:
Are there any other aspects of the work that you're doing at Bluesky or the overall space of data cloud cost optimization that we didn't discuss yet that you'd like to cover before we close out the show?
[00:59:06] Unknown:
I just want to kind of remind our audience that, first, we are looking for more trial users among Snowflake users. And secondly, we are also open to expanding our founding team. So if you are someone passionate about the Bluesky mission of making the future data clouds faster, cheaper, and smarter, regardless of whether you are an engineer, product manager, marketing expert, salesperson, and so on, we would love to talk to you.
[00:59:35] Unknown:
Alright. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as the final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:51] Unknown:
When people hear the term gap, the initial reaction is to fill in the gap by adding one more thing. I recently got a book recommendation from a friend, a book called Subtract. And my own mindset is also often that less is more. I am increasingly worried about how complex the surface area of the modern data stack has become for our users. I think it's partially due to the hot venture capital over the last couple of years, but of course also due to the strong innovations from the founders and engineers. But we think this is not sustainable. I do not see how a future data engineering team can continue to learn about dozens of tools and figure out how to put them together. So I am all for continuing to add innovation and possibly complexity under the hood, kind of like an iPhone, but the surface has to be very, very simple.
And part of what Bluesky could do to contribute to that mission is that we could provide a simpler surface layer where people could send queries to us, and we help them figure out which query workloads should be mapped to which underlying query engines and data clouds. That is, by the way, also the genesis of our name. Above the clouds, the sky is supposed to be blue, and that is the layer we want to build and contribute to the modern data stack. I'd like to actually mention one more thing. It's about the technology gap itself.
[01:01:17] Unknown:
Although big data as a technology has improved a lot of other industries, we are yet to see how big data as a technology improves itself through dogfooding. Think about the query history that we have accumulated in our systems. Right? Think about all the metadata information we have collected. Right? How much analysis, how much machine learning are we doing utilizing that information? I think that's an open area and an open question for many of us to explore, and I believe the result of that will be exactly what Misha mentioned: a simpler surface area. The data system doesn't have to be that complex if we are able to utilize the metadata that we have collected and use machine learning technology to kind of guess what users need, instead of always asking users to specify exactly what they need every time.
[01:02:03] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing at Bluesky. It's definitely a very interesting problem space and an interesting approach that you're taking, and it's definitely great to see people who are trying to make it more cost effective and efficient for people to take advantage of some of these technologies and drive value in their business. So thank you both for all of the time and energy you're putting into that. I'm excited to see how it progresses from here, and I hope you enjoy the rest of your day. Thank you, Tobias. It's been a pleasure.
[01:02:32] Unknown:
Thank you.
[01:02:38] Unknown:
For listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
- Introduction to Bluesky Data
- The Genesis of Bluesky
- Why Focus on Snowflake?
- Cost Visibility and Optimization
- Factors Driving Up Costs
- Workflow and Setup of Bluesky Mission Control
- Customer Insights and Cost Management
- When Bluesky is Not the Right Choice
- Future Plans and Medium-Term Goals
- Closing Thoughts and Contact Information