Summary
The latest generation of data warehouse platforms has brought unprecedented operational simplicity and effectively infinite scale. Along with those benefits, they have also introduced a new consumption model that can lead to incredibly expensive bills at the end of the month. To ensure that you can explore and analyze your data without spending money on inefficient queries, Mingsheng Hong and Zheng Shao created Bluesky Data. In this episode they explain how their platform optimizes your Snowflake warehouses to reduce cost, as well as identifies improvements that you can make in your queries to reduce their contribution to your bill.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
- Your host is Tobias Macey and today I’m interviewing Mingsheng Hong and Zheng Shao about Bluesky Data where they are combining domain expertise and machine learning to optimize your cloud warehouse usage and reduce operational costs
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Bluesky is and the story behind it?
- What are the platforms/technologies that you are focused on in your current early stage?
- What are some of the other targets that you are considering once you validate your initial hypothesis?
- Cloud cost optimization is an active area for application infrastructures as well. What are the commonalities and differences between compute and storage optimization strategies and what you are doing at Bluesky?
- How have your experiences at hyperscale companies using various combinations of cloud and on-premise data platforms informed your approach to the cost management problem faced by adopters of cloud data systems?
- What are the most significant drivers of cost in cloud data systems?
- What are the factors (e.g. pricing models, organizational usage, inefficiencies) that lead to such inflated costs?
- What are the signals that you collect for identifying targets for optimization and tuning?
- Can you describe how the Bluesky mission control platform is architected?
- What are the current areas of uncertainty or active research that you are focused on?
- What is the workflow for a team or organization that is adding Bluesky to their system?
- How does the usage of Bluesky change as teams move from the initial optimization and dramatic cost reduction into a steady state?
- What are the most interesting, innovative, or unexpected ways that you have seen teams approaching cost management in the absence of Bluesky?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bluesky?
- When is Bluesky the wrong choice?
- What do you have planned for the future of Bluesky?
Contact Info
- Mingsheng
- @mingshenghong on Twitter
- Zheng
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Bluesky Data
- RocksDB
- Snowflake
- Trino
- Firebolt
- BigQuery
- Hive
- Vertica
- Michael Stonebraker
- Teradata
- C-Store Paper
- OtterTune
- dbt
- infracost
- Subtract: The Untapped Science of Less by Leidy Klotz
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L. Your host is Tobias Macey. And today, I'm interviewing Mingsheng Hong and Zheng Shao about Blue Sky Data, where they are combining domain expertise and machine learning to optimize your cloud warehouse usage and reduce operational costs. So, Mingsheng, can you start by introducing yourself?
[00:01:41] Unknown:
Thank you, Tobias, for having us. My name is Mingsheng Hong. I worked with my old friend and now cofounder, Zheng, to start Blue Sky about 3 months ago. Before that, I spent 8 and a half years at Google, initially working on the database underneath Google's ads stack, what people internally jokingly refer to as the real database, because as you know, Google has a few database stacks. And then I moved on to building the machine learning infrastructure, the TensorFlow runtime, to make Google's machine learning workloads faster and cheaper. And before that, I spent 5 years working in 2 early stage startups in Boston.
And before that, I finished my master's and PhD in databases. So I've been working in data infra, and more recently ML infra, over the last 15 years or so. And, Zheng, how about yourself?
[00:02:38] Unknown:
Yeah. So my name is Zheng. I'm the CTO and cofounder of Blue Sky Data. Very nice to be here. I started my career after I graduated from UIUC in 2005, when my first job was at Yahoo, working on the web search engine. One system I got to know from that time is Hadoop. At that time, it was kind of a small project, and I didn't realize how big it would become in the industry later on. Then in 2008, I joined Facebook and became one of the first developers on the Hive project. Hive was an internal project at Facebook at that time. Again, we didn't realize how big an impact it would make on the big data industry in the later years. I spent 2 and a half years on the Hive project, then moved on to stream processing at Facebook and started that team, and later moved on to databases like MySQL and RocksDB.
After being at Facebook for 6 years, I went to Dropbox and worked on Dropbox infrastructure for about one and a half years. Then at the end of 2015, I joined Uber as the head of data infrastructure. Soon after, I transitioned back to IC and worked on Uber's data architecture, where we scaled the data architecture by about 1,000x in terms of data storage over the last 6 years. My personal contribution in the last 3 years was around cost reduction of big data at Uber, and we were able to reduce that by about one third over a three-year journey, with many, many contributors to the project. I recently graduated as a distinguished engineer from Uber, and I started Blue Sky together with Mingsheng. Mingsheng and I actually got to know each other at a common friend's wedding back 18 years ago. So we have been in the same industry, talking with each other over time, although we never worked at the same company until Blue Sky. But we actually know each other really well, and that's also why we decided to team up a couple months ago to start the company. Yes. On that last point, I'm very thankful that Zheng held out for me to start a company together.
[00:04:38] Unknown:
As you folks might know, a large portion of Uber's former data eng team members have been out starting their own companies, such as Onehouse and others, and a large portion of the modern data stack is now, you know, being operated by alumni from Uber. So I'm very thankful to have the opportunity to start Blue Sky with my old friend Zheng.
[00:05:03] Unknown:
In terms of the actual Blue Sky project, can you give a bit more detail about what it is that you're building there and some of the story behind how it came to be and why you decided that this particular problem was what you wanted to spend your time and energy on? Sure. I can start.
[00:05:18] Unknown:
So as Zheng mentioned earlier, we reconnected last winter, and we were looking for new opportunities. One of the areas that excited us both is the data cloud. We both see that the market leaders such as Snowflake and Databricks are actually doing well. They're already very successful, iconic businesses. At the same time, based on Gartner and other research, there is a $40 to $50 billion market for data warehouses, and that suggests to us that a majority of the data is still on prem. Over the next 5 to 10 years, we think this is no longer a question of if, but a question of when, regarding how people will leverage the power of the cloud for future analytical data management. And this is where we both think there is a very early opportunity, and a lot of room to improve and to help people fundamentally change how they manage data by leveraging the power of the cloud.
And specifically, we started by discussing our own career experiences. One of the areas that we are both passionate about is making data warehouses and analytical computation faster and cheaper. As we reflected on our own experiences working at some of the world's leading tech companies, we feel that this is a pretty prevalent problem. When it comes to big data efficiency, everybody is kind of doing it wrong. There's a lot of room for improvement. And then when we started talking to our friends who have been running data cloud instances at their own companies, that also really resonated with them. We didn't really have a concrete product at that time, but we were already receiving invitations from some of the current leaders in different industry spaces, such as crypto and online grocery shopping companies, to consult and help them optimize their data cloud workloads.
So then we started talking to investors, and soon things clicked really well. And that's how we started this Blue Sky journey. Our initial focus is to help Snowflake users improve their cost efficiency in running their data cloud workloads.
[00:07:36] Unknown:
Exactly as Mingsheng said. One of my most memorable experiences at my previous company is that sometimes one big data pipeline can eat up a lot of resources and cost the company a lot of money if it is not optimized well. There was one pipeline that had been running for more than a year at the company, and we didn't notice it. Once we noticed, it was really expensive, costing the company something like a quarter of a million dollars a year. So we dug in and tried to optimize it, and we were able to reduce the cost of that query by something like 1,000x, because the query was reprocessing all of the historical data every single day. Instead, we changed it to incremental computation, and the cost dropped dramatically. That is one example that really supports the case Mingsheng mentioned. There are so many opportunities to improve big data efficiency, and a lot of times the owner of the data platform may not even know how many opportunities are there. That's one of the main reasons why we started Blue Sky, with a mission to help everyone who uses big data.
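To make the kind of rewrite Zheng describes concrete, here is a minimal, hypothetical sketch of the difference between a daily job that recomputes all history and one that only processes the newest day. The table and column names (events, daily_metrics, event_date, amount) are invented for illustration and are not from the episode.

```python
# Hypothetical illustration of a full-recompute pipeline versus an incremental one.
# All table and column names are made up; the SQL is generic warehouse SQL.

FULL_RECOMPUTE = """
CREATE OR REPLACE TABLE daily_metrics AS
SELECT event_date, COUNT(*) AS events, SUM(amount) AS revenue
FROM events                              -- rescans every day of history, every run
GROUP BY event_date;
"""

INCREMENTAL = """
INSERT INTO daily_metrics
SELECT event_date, COUNT(*) AS events, SUM(amount) AS revenue
FROM events
WHERE event_date = CURRENT_DATE - 1      -- scans only the newest day's data
GROUP BY event_date;
"""
```

The work done by the first form grows with the full history, while the second stays roughly constant per day, which is where cost reductions of the magnitude described above can come from.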
[00:08:49] Unknown:
And you mentioned that your current primary focus is on customers of Snowflake, which is definitely one of the larger platforms that people are using for managing their data warehouses. But there are also a number of contenders that have significant market share, the most notable probably being BigQuery, but there is also Redshift, particularly with some of its newer architectural designs to make the platform more scalable. I believe that Azure has its own cloud data warehouse, and then there are also other third party contenders such as Firebolt and some of the managed platforms for running things like Trino.
And I'm wondering what your selection process was for deciding that Snowflake was where you wanted to spend your initial energies and what the sort of design approach has been to be able to make your current tooling adaptable to some of these other technologies to be able to give people some of the similar benefits of what you're currently focused on providing to Snowflake users?
[00:09:49] Unknown:
Let me take this question in two parts, where the first part is about why we chose Snowflake and the second part is about how this technology can potentially extend to other technologies. In terms of Snowflake, when we talk with big data users, we realize that Snowflake has a very good user experience. A lot of users who use Snowflake are, in general, pretty happy, but they are mainly worried about costs. Costs can go really, really high. Costs going high is, on one side, also a reflection of how easy Snowflake is to use, because if it is easy, then of course everybody in the company wants to use it, and then the cost will go up. And that's exactly where Blue Sky can quickly come in and provide value: to reduce the cost, to make the queries run consistently fast, and so on. Other technologies that also have a SQL interface could definitely benefit from the same techniques that Blue Sky will be building, but for the beginning, we want to focus on one technology first. Also, on that front, I have one note that might be useful for everyone: when we look at these new generation technologies, every single technology actually has its strengths. Some users are asking us, saying, hey, shall we move from A to B, or B to C, or C to A? Our advice is: don't move around. Data migration is a huge pain and a lot of cost, and we would rather have our users stay with one solution, and then we can help them make that solution work better for their use cases.
So coming back to your second question, we do intend to extend to technologies other than Snowflake, and we will also urge users not to move around too much, because every technology has its strengths and its drawbacks, and Blue Sky hopefully at some point will be able to help you even if you are working on a technology that we are not working on yet. We picked Snowflake
[00:11:41] Unknown:
also because, as we mentioned earlier, some of our friends happen to be managing Snowflake instances, and they are in pretty strong need of our help. Also, from our own research, it seems a pretty large portion of Snowflake users are higher level users coming from a data analytics or data science background, as opposed to some of the other cloud warehouse users who have more of an engineering background and are more used to improving and tuning their own databases. And as I mentioned, we will look to expand our optimization to other data cloud products.
The other thing I was mentioning is that when we started Blue Sky, we also explored the option of building yet another query engine, just like everybody else, as you mentioned, Tobias, like Firebolt and whatnot. But our thesis is that while it is possible to spend yet another thousand engineer-years and build such an engine from scratch, as Zheng and I have both done in our past work, such as Zheng's work on Hive and my earlier work at Vertica and then on Google's ads database, we do not think that is the key pain point for the end users. My own belief is that if you build a query engine from scratch, let's say starting 2 or 3 years ago, the end result might be 5 or 10 percent better than, you know, a state of the art engine.
I don't think any engine can be on average 5 to 10x better. It's possible to be better by such a magnitude on a slice of the workloads, but on average, I do not believe that will be the case. Instead, the bulk of the improvement, in terms of both performance and cost, is in tuning. It's in tuning the database product, the query engine, based on users' workloads. That is often overlooked. So we believe kind of a dirty secret in this industry, as Zheng also alluded to, is that when users claim that moving from database product A to product B got them 5 to 10x better, faster, cheaper results, it's not the product per se. It is because they did a house cleaning. They thought about the database schema design, thought about the physical schema design with indices, materialized views, and so on. And so why don't you just stay with your current database product? If you do make the same efforts, you could achieve similar results. That is why we are starting by helping our users migrate from Snowflake to Snowflake, in some sense, by doing the house cleaning and helping them optimize for their workloads. And we will do the same with the other products, including possibly products we will build in the future.
That being said, it is possible that in the longer run there's more room for improvement across the different products. There is the so-called right tool for the job, as Professor Mike Stonebraker often likes to say. So it's also important to map the right slice of the workload to the right underlying tool and infrastructure. And that's something we would also look into.
[00:14:53] Unknown:
The interesting element of what you're building at Blue Sky is that it parallels what a lot of application and infrastructure teams are dealing with in their adoption of cloud technologies, where it's very easy to get started. There's a good user experience of being able to put things into production, but then it becomes difficult to actually track what your spend is going to be, predict it, and understand how different, you know, application architectures or system architectures are going to impact your overall costs to be able to run these systems. And I'm wondering what you see as some of the commonalities and differences between the compute and storage optimization that cloud bill optimization companies are doing and what you're doing at Blue Sky to optimize the utility and costs for Snowflake users?
[00:15:40] Unknown:
Indeed, Tobias. As you mentioned, there has been a pretty large industry of cost optimization and cost visibility companies over the last 5 to 10 years in the public cloud space for AWS, GCP, and Azure. We have talked to friends from companies like CloudHealth, which was acquired by VMware a couple years ago, and there are a lot of good learnings. Between these companies and Blue Sky, we see a couple of common elements and some differences. So to start with the common elements: first of all, cost visibility is the foundational element.
Without understanding how the cost is attributed to individual computational jobs, or in the case of data clouds, individual SQL query jobs, it is hard to understand which internal users and teams in the company are spending the most. So we need to provide that kind of visibility and accountability. Some of our users refer to this as the wall of shame. We don't use this phrase, but we supply the technology that users can adopt for their own policy. Interestingly, in the case of Snowflake and other data clouds, even though the underlying products provide cost visibility in terms of how people are using the data warehouses, they do not provide attribution of the cost to individual query jobs.
Some users get surprised by that, but there is a technical reason behind it. As you know, once users start a data warehouse, they pay based on the seconds, based on the time duration they use. So during that time window, it doesn't matter whether they don't run any query or, let's say, they run 3 concurrent queries. For that reason, there is no direct attribution from the cost people pay to the individual queries they run. And the first thing we did when we built our first product, called Blue Sky Mission Control, was to implement a little algorithm that is part of our secret sauce and allows us to attribute the cost to individual queries.
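Bluesky's attribution algorithm is described here as secret sauce, so the following is only a minimal sketch of one plausible approach to the problem being described: splitting the cost of a warehouse billing window across the queries that overlapped it, proportionally to their active time. It deliberately ignores per-second concurrency weighting and other refinements a real implementation would need.

```python
from dataclasses import dataclass

@dataclass
class Query:
    query_id: str
    start: float  # epoch seconds
    end: float

def attribute_cost(queries, window_start, window_end, window_cost):
    """Split the cost of one warehouse billing window across the queries that
    ran during it, proportionally to each query's overlap with the window.
    This is a simplified model: it does not weight by how many queries ran
    concurrently in each second, and idle time is left unattributed."""
    overlap = {}
    for q in queries:
        active = min(q.end, window_end) - max(q.start, window_start)
        if active > 0:
            overlap[q.query_id] = active
    total = sum(overlap.values())
    if total == 0:
        return {}  # the warehouse was billed but sat idle for the whole window
    return {qid: window_cost * t / total for qid, t in overlap.items()}

# Example: a window costing 0.1 credits shared by two overlapping queries.
print(attribute_cost([Query("q1", 0, 60), Query("q2", 30, 60)],
                     window_start=0, window_end=60, window_cost=0.1))
```

The interesting design questions, left out of this sketch, are how to treat idle seconds, queueing time, and multi-cluster warehouses.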
Thanks to that fundamental element, we can now aggregate queries based on usernames, teams, and the organizational structure, so users can get visibility and accountability. We can also find the top-k most expensive queries or query patterns so that we can work with the users to prioritize which queries should be tuned. So I want to say the first common element is cost visibility, but when it comes to data clouds, there is a technical barrier to implementing and providing cost visibility properly, and we were able to crack that puzzle. The second common element is to pick low hanging fruit first when it comes to cost optimization.
In the case of the public clouds, that could be revisiting the VM size or the container size. For us, it could be the data warehouse size. It could also include looking at resource utilization and reconfiguring the warehouses or the VMs to reduce idle time. One story I would like to share here is that when we engaged our trial users, in the first 2 to 3 weeks we managed to identify about 20% of their overall Snowflake cost that could be optimized away. So both parties were very ecstatic about such findings. Part of the reason we were able to land such a large impact initially is that there was low hanging fruit. We could help them find very expensive jobs that, based on the users' review, don't add much business value. Some of them can simply be thrown away, and for the rest we provide suggestions for optimizing them.
So picking low hanging fruit is important. And the last common element is that we plan to charge for our product based on the value we provide, so-called value based pricing. In our area, cost reduction, the savings, is one of the key values, though not the only one. We also help people analyze more data faster and help the data engineers and the CIOs gain more cost visibility. But when it comes to cost savings, our thinking is to charge a percentage of the savings we actually deliver to them, thereby staying fully aligned with the user's interest. And that is also a common best practice among the other public cloud cost optimization tools. In addition to these common elements, there are a couple of key differences.
The first one I already mentioned: even computing cost visibility requires some non-trivial data infra expertise and understanding of the internals. This is not something that, you know, average Snowflake users will be able or willing to do themselves, and that is one of the key values that we offer. The second one is that for public cloud computation, the workloads, the user jobs, tend to be opaque to those vendors and products. In contrast, the SQL jobs and SQL queries are transparent to us. A large part of our value addition is obtained by in-depth analysis of these SQL jobs. So we can go and, for example, based on the query predicates and the GROUP BY columns, suggest how users could reconfigure their table clustering keys, or rewrite the queries in deeper ways that standard database query optimizers cannot.
So these are the values that our product can add in complement to the existing data cloud products. In contrast, this will be hard to do when the VMs, you know, the EC2 jobs, are opaque to the cost visibility tools. The last part is auto tuning. Our vision is that, over time, we want to make our product really easy to use so that users do not need to manually take the suggested tuning and optimization ideas from us. Instead, we apply them automatically for our users. This way, our users can allocate their own internal talent, their engineering resources, better on their own product and business, and leave a large portion of data cloud management to Blue Sky. As far as
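As a concrete illustration of the clustering-key suggestion mentioned above: in Snowflake the clustering key is a table property, so a recommendation derived from the observed predicates and GROUP BY columns might be applied roughly like this. The table and column names are hypothetical, and the choice of columns would in practice come from workload analysis.

```python
# Hypothetical example of applying a clustering-key recommendation in Snowflake.
# The idea is to cluster on the columns that dominate filter predicates and
# GROUP BY clauses so that micro-partition pruning can skip most of the table.
SET_CLUSTERING_KEY = """
ALTER TABLE fact_events CLUSTER BY (event_date, customer_id);
"""
```

Clustering keys are not free, since automatic reclustering itself consumes credits, so a recommendation like this typically only pays off when the targeted queries are frequent and selective on those columns.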
[00:22:05] Unknown:
the experiences that you've each had at these large companies, where in a lot of cases you were dealing with on-premise systems, because some of the companies you worked at predated the widespread availability of cloud data warehouses, or the overall scale made it such that it was not economical to use those systems. I'm wondering what are some of the lessons that you learned working at those lower levels, or working with these more self-managed systems, that have given you the insights necessary to identify these potential optimizations of workload, and to understand enough of what's happening behind the scenes at Snowflake and some of these other systems to be able to effectively target opportunities for reducing wasted cycles and wasted workflows?
[00:22:51] Unknown:
So the first thing I want to say is that on-prem data platforms and cloud data platforms are very different in terms of cost management. For on-prem data platforms, problems usually arise when the queries become slower, because on-prem data platforms are usually fixed in size. So whenever there is more workload, or inefficient workload, users first complain about the speed. In the cloud world, however, people usually won't see slow speed because cloud data platforms are able to autoscale. The only surprise a user gets is at the end of the month, when they suddenly see, whoa, their bill is actually 10x bigger than last month. So that is one very big difference that we all realized. The fundamental similarity between the two is that new workloads coming into big data can be very hard to manage, because one of the main reasons all these companies need data engineers and data scientists is to write new jobs. So by definition, the workload on the data platforms is not stable. If it's not stable, the cost will grow. So the question is: how fast is normal growth versus abnormal? There's a lot of interesting analysis in there that can help us figure out which ones are not good. One lesson that we learned is: whenever the cost is small, don't try to, like, install a lot of new systems.
In my early experience, the complexity of the data infra sometimes can kill the team. If we have 5 different query engines in the company, then not only will our data infrastructure team be spread thin, our users will also not know which query engine they should use for what kind of workload. We would rather have most users use a single engine and help them whenever their cost grows big enough. At the end of the day, if the cost of one user is only 0.1 percent of the whole data platform, it really doesn't matter if that user is 5x less efficient, because that's only 0.5% of our cost. And I would just say, simply put, the biggest lesson we learned is to always have visibility into the cost distribution first, before doing any automation.
And also, be very careful about introducing new systems, because once we start to introduce new systems, it's going to be really hard to manage the data consistency, the data quality, the schema, data discovery, and a lot of other problems. I guess Mingsheng can add some more. So let me also share
[00:25:26] Unknown:
a past episode from my own projects that I believe has relevance to modern data cloud optimization. As I mentioned earlier, I started my career at a columnar database startup called Vertica. Back then, I was fully confident that Vertica could be the system that provides a 10x speedup and scalability improvement over incumbents like Oracle and Teradata, because I had read the paper from MIT called C-Store back in 2005. Now, in terms of the day to day work, one of the key pain points was that as we went and performed POCs with our trial users, the Oracles and Teradatas each had a large army of DBAs who were very good at tuning the user workloads by hand. Unfortunately, outside of this little company of Vertica, with about 20 engineers, nobody knew how to tune a new column store.
And without tuning, the system could not realize all its potential. Our company decided to prioritize building a new component, an auto tuner called the Vertica Database Designer, and I was fortunate to tech lead this project. What this component does is automatically analyze users' table metadata, like statistics, as well as SQL queries. These queries could come from BI dashboards like Tableau or be written by people. Let's say in the modern data stack, it could be the dbt queries that are part of the transformation in the ELT workflow.
And so our database designer would analyze these queries and the table stats, and then come up with ways to organize the data, such as segmenting it across the nodes, sorting it, and figuring out which encodings to apply to individual columns. As a result, on some of the most complex workloads, we managed to even outperform the tuning results of our in-house experts. Our tech cofounder and CTO, Professor Mike Stonebraker, initially couldn't believe it. We reviewed it with him and showed him how complex the workloads could be. Back then, Vertica had some of the largest users, such as Zynga. Some of those workloads contained hundreds of tables and thousands of queries.
And it became clear that no single database expert, not even a hardworking startup engineer, could manage to tune all of those queries. So that is the power of having an algorithm-driven auto tuner. And I believe that in the modern era, similar technology can be applied to data clouds such as Snowflake, in terms of configuring the table clustering keys, as well as creating materialized views and other forms of indices that the data clouds support.
[00:28:32] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder. And as far as the automatic tuning of the engine and figuring out which parameters to tweak, it puts me in mind of the OtterTune project, which is being run by Andy Pavlo from CMU.
And I know that that's focused primarily, at least initially on operational databases, so MySQL, Postgres, MSSQL. And I'm wondering what are some of the sort of commonalities as far as the approach that he's taking and what you're trying to do with Snowflake and eventually other data clouds?
[00:29:39] Unknown:
Yes. So earlier on, Zheng and I studied the OtterTune project. We also established contact with Professor Andy Pavlo. We have some mutual friends, such as Andy Palmer, who was our early business CEO and cofounder at Vertica. What we saw is that, as you mentioned, Tobias, both OtterTune and Blue Sky are on the mission of automatically optimizing the user's data infra. The key differences are, first of all, that for now OtterTune seems to be focusing on the OLTP space, whereas Blue Sky is in the OLAP space. And secondly, the key focus for Blue Sky is to tune the system based on users' query workloads.
Whereas for now, it seems OtterTune focuses on tuning the system parameters, such as the buffer pool size. In our view, the system parameters are also very important to tune. But when it comes to data clouds, the cloud operators such as Snowflake have kind of tuned these parameters by hand, and what is left and most important to tune are the users' query workloads. So that is how we chose to start with Blue Sky.
[00:30:54] Unknown:
And so as far as the areas that drive cost and some of the issues that you identify for being able to target optimizations, what are some of the most significant contributing factors that actually drive up the costs in the first place? And what are the factors, in terms of the pricing model of things like Snowflake, the ways that the organization is using it, or some of the inefficiencies in how the data workflows are being designed and executed, that you work with to identify areas to reduce that overall spend?
[00:31:29] Unknown:
Yes. At a high level, we see three factors that lead to that kind of inefficient use and inflated cost. First of all, it's the mindset and the best practices of the users. In the on-prem era, the best practices were quite different from those of today's data clouds. But as we all know, when there is a new generation of technologies coming up, such as the first-gen data clouds we're seeing, people tend to apply their old practices while using the new tools, so-called old wine in a new bottle. As a key example, when analyzing the query history from one trial user recently, we found one peculiar SQL query that has been costing about $96 per run and generating no business value.
That query ran on what's called an XL-sized warehouse, and it simply times out after 2 hours on each run. So 2 hours of an XL warehouse would cost, I believe, 32 credits, and the public price for each credit is $3. So that's about $100 per run, and each run times out. Unfortunately, that query had been retried more than a hundred times by the time we found it in the logs. So that is basically the kind of $10,000 of damage that a single data cloud user could do to the system and to the company. And that company has hundreds of data cloud users.
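For reference, the arithmetic behind those figures, using only the numbers quoted in the episode (32 credits for 2 hours implies an XL warehouse at 16 credits per hour, and a list price of $3 per credit):

```python
# Worked arithmetic for the runaway-query example above, using the episode's figures.
credits_per_hour_xl = 16      # implied by "2 hours ... 32 credits"
hours_per_run = 2             # the query times out after 2 hours
price_per_credit = 3.0        # public list price quoted in the episode

cost_per_run = credits_per_hour_xl * hours_per_run * price_per_credit
total_wasted = cost_per_run * 100   # retried more than a hundred times

print(cost_per_run)   # 96.0   -> roughly $100 per run
print(total_wasted)   # 9600.0 -> roughly $10,000 in total
```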
So this is the kind of impact, and not in a positive way, that users can bring to a new data cloud product, partly because they are not yet used to how to use such products effectively. So we are on a mission to provide better tooling, but also to help cultivate the best practices. The second factor is organization and priority. As you know, especially in the last couple of years, a lot of the data cloud users, the companies, were in a very fast growth period. Now that fast-growth market is coming to an end, but in the last couple of years, with such huge growth on the business end, there wasn't as much focus on internal cost optimization.
And it is fair to say that even though the Snowflake users can be technically very strong, it's simply not their job, not their key priority, to be looking out for such cost optimization. And last but not least, as we mentioned earlier, with a first-generation data cloud, we also believe the current technology and product have room for improvement. So this is also what we will be focusing on for the next 5 to 10 years at Blue Sky.
[00:34:19] Unknown:
And in terms of the actual technology that you're building, you mentioned that your initial product is called Blue Sky Mission Control. I'm wondering if you could just talk through some of the design and architecture of that platform and the different signals that you're hooking into for being able to identify opportunities for cost optimization and performance tuning?
[00:34:39] Unknown:
So there are a couple of signals, from simple to complex. The first one, the simplest one, is the utilization of the warehouse. When we look at the Snowflake warehouses, sometimes there is idle time before the warehouse shuts down. Sometimes it can also be the interval idle time between the last query finishing and the next query starting. There are simple parameters that we can tune, and there are ways that we can help users, let's say, move queries around among the warehouses in order to reduce their credit usage while sustaining their query performance at the same time.
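As a rough sketch, not Bluesky's actual method, of what the idle-time signal described here might look like when computed from a warehouse's query history: gaps between consecutive queries are billed until the warehouse auto-suspends, so they can be summed and capped at the auto-suspend setting. The function and numbers are illustrative only.

```python
# Minimal sketch of the "interval idle time" signal for one warehouse.
# query_intervals: list of (start, end) epoch seconds, sorted by start time.
def idle_seconds(query_intervals, auto_suspend_s=600):
    """Sum the billable gap time between consecutive queries, capping each gap
    at the warehouse's auto-suspend setting (after which billing stops)."""
    idle = 0.0
    for (_, prev_end), (next_start, _) in zip(query_intervals, query_intervals[1:]):
        gap = next_start - prev_end
        if gap > 0:
            idle += min(gap, auto_suspend_s)
    return idle

# Two queries separated by a 15-minute gap on a warehouse that auto-suspends
# after 10 minutes -> 600 billable idle seconds.
print(idle_seconds([(0, 120), (1020, 1100)], auto_suspend_s=600))
```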
And then the next step above that is to look at the query text itself. Some of the queries users write may not be in the optimal form. To give one example: when people want to read the last 3 days of data, the right way of writing the query would be: where the timestamp column is greater than some value. But we noticed there were also cases of people writing it in a way like: where the date of the timestamp is greater than today minus 3. The main difference between these two forms is that the first way of writing can easily be pushed down as a predicate into the storage engine and make the query extremely fast, but the latter way of writing can have issues, because the query engine may not be able to push it down and use the range of the timestamp column to filter out data blocks, let's say the table's micro-partitions, so they cannot actually be pruned. Those single-query rewrites are very helpful, and a lot of times users can take those signals, just apply them to their warehouses and their query history, and they get the wins immediately. And the most complex signal that we're looking for comes from looking at the whole query history for a whole day or a whole week.
We identify the queries that are similar to each other. Those queries could be treated as incremental computation, but are actually computed from scratch every time. We then find ways to share the computation through techniques like materialized views, precomputed results, or cached subqueries, things like that. That part is the more complex technology, but it will yield the biggest gains. That's also what we are still developing right now. Those are pretty much all the signals that we have.
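The "last 3 days" example Zheng gives can be made concrete with a small, hypothetical pair of queries. The table and column names (fact_events, event_ts) are invented; the point is only the shape of the predicate.

```python
# Hypothetical illustration of the sargable-predicate rewrite described above.

# Harder to prune: the filter wraps the column in a function, so the engine
# generally cannot use the column's min/max metadata to skip micro-partitions.
FUNCTION_ON_COLUMN = """
SELECT *
FROM fact_events
WHERE CAST(event_ts AS DATE) >= CURRENT_DATE - 3;
"""

# Prunable: the bare column is compared against a constant expression, so the
# predicate can be pushed down and whole micro-partitions skipped.
BARE_COLUMN = """
SELECT *
FROM fact_events
WHERE event_ts >= DATEADD(day, -3, CURRENT_TIMESTAMP());
"""
```

Both forms return roughly the same rows, but the second keeps the predicate directly on the stored column, which is what lets partition pruning kick in.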
[00:37:05] Unknown:
So as far as the workflow for somebody who is getting started with Mission Control, where they're starting to run it against their data warehouse and understand what inefficiencies it is identifying, can you just talk through that workflow of getting set up, finding the areas to optimize, and then what they actually do with that information once it's discovered?
[00:37:30] Unknown:
So currently, we get customer leads both from our network, through warm intros from our advisors, investors, and friends, and we are also starting to get some cold leads based on our own marketing efforts. Once we establish initial contact, we basically show people our demo, and we work with them to create a read-only account for Blue Sky in their Snowflake instance. They grant us permission to select from metadata tables such as the query history. We do not need to see their business data, and that's one reason we can assure our trial users that there's no concern with sharing, essentially, their workload metadata with us. Then we start ingesting the data into our own SaaS back end, we do the analysis, and we show the results via our GUI dashboard.
On the dashboard we show the users, as I mentioned earlier, how we compute the query cost. We attribute the cost to their queries, aggregate them across key dimensions such as by user and by query type, and we show the top-k most expensive query patterns. Upon that initial review, users can then decide which of the expensive query patterns can simply be removed if they do not add enough business value. Then we also provide tuning suggestions. The tuning suggestions are a combination of our product-generated insights as well as our own insights coming from our past, you know, 15 years each, along with our founding team members, of experience manually optimizing workloads.
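Bluesky's own queries and cost model are not public, but as a rough illustration of the kind of metadata such a read-only integration works from: Snowflake exposes account-usage views such as QUERY_HISTORY that a read-only role can be granted access to. The sketch below ranks users and warehouses by total execution time over a week; it is only a proxy, since real per-query cost attribution needs the warehouse-window splitting described earlier.

```python
# Rough sketch of a metadata query over Snowflake's account-usage views.
# SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY is a standard view; the grouping and
# the "cost" proxy (execution hours) are illustrative, not Bluesky's method.
TOP_SPENDERS_LAST_WEEK = """
SELECT
    user_name,
    warehouse_name,
    COUNT(*)                          AS query_count,
    SUM(execution_time) / 1000 / 3600 AS execution_hours
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY user_name, warehouse_name
ORDER BY execution_hours DESC
LIMIT 20;
"""
```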
So it's a bit of consulting initially, with the goal of feeding our consulting knowledge into our product over time. That's the second layer, the insights. And then the last layer is the auto tuning that we will build out over time. What we have seen is that in the initial 2 to 3 weeks, we are often able to find fairly non-trivial low hanging fruit. As I mentioned, for one of the trial users, in the first 2 to 3 weeks we identified about 20% of the spend that could be optimized away, and the users spent a total of 2 hours with us. So that was a great result, and it also encouraged us to accelerate and expand our search for trial users.
Beyond the initial low hanging fruit, we also see two areas where our product can add long lasting value and become more sticky. The first is the set of monitoring dashboards we provide, so users can go and look at the findings on a daily or weekly basis and then decide what to do with the workloads. This is a form of DIY cost reduction or workflow optimization: they use our tools, then they do the rest. In the future, such tools will also be enhanced with features such as alerting support, so that when new unoptimized workloads come up, we can very quickly identify them and notify the users.
The second point of sticky value addition for our product is the auto tuning we mentioned. This way, users do not need to be bothered with learning all of the best practices and then manually applying them to optimize the workloads. Instead, we can operate on behalf of the users. And we think this will help bring their big data management back under cruise control, so our users can focus more on their own product and business.
[00:41:25] Unknown:
And you mentioned a few points of how you aim to be delivering utility and value beyond just the initial phase of finding ways to cut the bill. And I'm wondering if you can maybe talk to some of the ways that having that information as a sort of constant utility as you're working through developing your workflows, developing your data pipelines and models, informs the ways that data teams think about how to actually build their systems and how to understand sort of where they want to spend their effort and how they might optimize the value of a given table or a given dataset to make sure that they're not duplicating effort and recreating effectively the same table 3 or 4 times? So as we mentioned, in our current product, the set of dashboards would provide them with the continuously
[00:42:19] Unknown:
refreshed set of most expensive queries. So this will help them gain visibility into which queries are not well optimized. This is what some of our users refer to as the wall of shame. And we would provide some suggestions for how the query workloads could be tuned. But our long term vision is to build out auto tuning so that users need not be bothered to manually look at the queries and apply the optimizations. Instead, our tools can do this automatically for them. To add onto that,
[00:42:51] Unknown:
the opinion from Blue Sky is that users of Snowflake and so on should focus on building value. They should not spend too much time thinking about, let's say, cost efficiency or automation, because those are the things that, ideally, the platform should provide by itself. That's also where Blue Sky is going to help. If the data team has to spend a lot of time thinking about how to optimize their data usage, then they are probably not spending enough time on their main business, which may not be good for the business itself. That's the message, that's the opinion that we have in general. So I think I got your question, but once there are too many best practices, that distracts users from doing their main jobs,
[00:43:34] Unknown:
yeah, if that makes sense. Another thing that I'm curious about, if you've gained any insight on it, is how the tool chain that data teams end up using can contribute to building either more efficient or more wasteful pipelines, where maybe if they're using something like dbt, they have a better sort of core utility of reuse of datasets, versus if they're doing a lot of manual table definitions and workflows or custom code, or if there are different styles of data modeling that can contribute more to wasted spend or duplication of effort. So, you know, whether it's Data Vault or snowflake schemas or wide tables, and just how the tool chain and data modeling and organizational considerations all factor into the ways that that impacts the costs at the end of the day. Yeah. It's interesting, Tobias, that you mentioned dbt.
[00:44:31] Unknown:
So one interesting anecdote I would like to share is the following. Let me start with a caveat: it's not the adoption of dbt itself that made the quality of the code better or worse. In fact, it probably made it better, because dbt helps people manage their SQL code. But what we have seen is that the wider adoption of tools like dbt encouraged the pattern of ELT, and that brought in more people who did not necessarily use to write so many SQL-based data pipelines. And since they now need to deliver new pipelines under business pressure, as you kind of alluded to, some of them will go and copy-paste 600 lines of CTE, you know, common table expression, SQL code from somewhere else.
And so that can lead to some non-trivial duplicate computation across the different SQL pipelines being created. Now, this is more due to the factor I mentioned earlier, the organization, the business pressure, and the priority. One thing Blue Sky can help with is, over time, identifying such redundancy from the query history, flagging it, and helping people remove it or automatically rewriting it through our auto tuner. One interesting anecdote I want to share: a former head of data, who is our advisor, was running a big Snowflake instance back then. They mentioned to us recently that it used to be ETL, so the transformation was done before the data hit Snowflake,
but through adopting the ELT pattern, the cost of running the Snowflake-based transformations became too high, and their eventual solution was to go back to ETL just to reduce the cost. And to me, this is very unfortunate. We understand how people sometimes go out of their way to reduce cost, or in the name of getting the best efficiency, break down the abstraction layers of the software architecture and go handwrite assembly code, for example. We at Blue Sky think that we should use the right tool for the right job. Let people continue to use ELT.
Encourage modern software engineering best practices, with testing, continuous integration, and so on, and let a tool like Blue Sky take care of optimizing performance and cost. So we hope to partner with tools and companies like dbt to further encourage the adoption of ELT without people fearing the cost consequences.
[00:47:13] Unknown:
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite, including product analysis, user funnels, feature flags, experimentation, and it's open source so you can host it yourself or let them do it for you. You have full control over your data, and their plug-in system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog. In terms of your experience of building Blue Sky and starting to work with some of your early customers, I'm wondering what are some of the most interesting or innovative or unexpected ways that you have seen them approaching cost management and optimization in the absence of Blue Sky?
[00:48:04] Unknown:
So let me start here. And this is a learning that I think Zheng has probably also seen at his past job at Uber and so on. I have personally learned a lot about our technology and the business, but one of the key pleasant surprises to me, as I mentioned, is how within the first 2 to 3 weeks of engagement we managed to find 20% of Snowflake cost that could be optimized away, and the user spent a total of 2 hours with us. For that reason, we are very excited to go and find and onboard more trial users, where we can help people in such fundamental ways while they don't have to invest as much in the cost reduction. We understand that in this downturn market, cost optimization is top of mind for every CIO, head of data, and data engineer, and we are here to help. We have a marketing campaign where we're offering a free Snowflake workload health check for eligible users, and by that we mean people who are spending at least $50,000 a year. So if you're interested, please reach out to us to get this health check. And we will also offer some tuning advice and best practices
[00:49:12] Unknown:
with no strings attached. The biggest thing I learned is about the power of attribution, or chargeback. In a lot of cases, some people write SQL queries on their data warehouse and they just forget about them, until at some point somebody reminds them: oh, this query is too expensive, and we should shut it down. We have seen from our initial users that some of them had been trying to do their internal chargeback, and that can be very effective. But at the same time, they don't have enough time to build those chargeback systems end to end themselves, and that's also where we are helping them. But even a very simple chargeback report that's sent out to a wide mailing list on a weekly basis can already help the customer reduce the cost themselves. One thing that I was just thinking about, and I'm not sure that it's necessarily a good idea, but there's a tool in the sort of infrastructure as code space called InfraCost that hooks into Terraform to be able to say,
[00:50:07] Unknown:
if you run this code to deploy these new resources, this is how it's going to affect your bill. And I'm wondering what your thoughts are on having a similar utility for maybe somebody who's using DBT to say, if you add this model to your workflow, it's going to cost this much extra for your Snowflake bill and being able to more directly provide that input to people as they're building out these different datasets to understand, is this extra cost actually worth the value to the business that it's going to provide, or would I be better suited just, you know, pulling this data in maybe a slightly less optimized form from this table rather than materializing an entirely separate table for it? That's a very interesting idea, and I had the opportunity to talk with the founders of InfraCost,
[00:50:52] Unknown:
like, several months ago as well. So one big difference between big data and those microservices is that with microservices, usually, the footprint of the infrastructure cost is relatively stable. Even though they sometimes have a bit of auto-scaling spikes, it's not that much. A big data workload, however, can increase in cost by 10x without changing any code at all. Assume a dbt user who did not change any of their pipelines, but the incoming data is 10x more. Guess how much money they are going to pay now? The answer may surprise you. Sometimes it's not 10x; it may be 100x, because some of the operators in the SQL queries may not be linear in the input data size. As a result, a lot of cost control is actually not just about the code itself; it's a combination of code changes and data changes. But coming back to your question, I can see that if the data volume itself is relatively stable, then there would be a lot of value in adding the capability to integrate, let's say, with dbt, so that whenever there's a code change, it shows how much additional cost people will need to pay in those cases.
[00:52:02] Unknown:
And to add to that, one future product feature that, Tobias, you are hinting at sounds very interesting to me. So please allow me to also think out loud here. This is not a form of product commitment, but I am fascinated by the notion that when people add a future dbt job or a new slice of SQL workload, they could run Blue Sky at that time to estimate the cost of running, you know, their workload for a day, and Blue Sky could provide the actual cost. We could then have the future workloads get certified, or get underwritten, by Blue Sky to provide a certain visibility or even a guarantee on the cost side. And maybe we can even look into, again, huge caveat, just thinking out loud, we can even look into suggesting: hey, for similar workloads, here's how other people have been doing things, and here's the associated business value that's being generated.
One source of inspiration is looking at my own energy bill for my house. Our utility, PG&E, I believe, provides similar reports: okay, here's how other houses with similar profiles consume energy. So we would love to provide that type of data cloud health score for our future users as well. One of the fun things about this podcast is being able to trigger ideas like that and then see people react in real time and help to maybe cross pollinate some of these concepts across different businesses.
[00:53:27] Unknown:
And so in terms of your admittedly short experience so far, but as far as launching and starting to grow the Blue Sky business, I'm wondering what you've seen as some of the most interesting or unexpected or challenging lessons to date.
[00:53:40] Unknown:
I guess the biggest learning I have is when working with a market with many, many different companies. Right? Every single company may have a unique need, and the biggest challenge is usually to find out what are the common needs versus what are the needs that are special to 1 company but does not apply to others. Our past experience is limited to a small set of companies. Right? So maybe altogether, Misha and I work in about maybe 10 companies in total. But the whole market has, let's say, maybe 10, 000 of, big data users or even more. So I think the biggest lesson that we learned is some useful technologies learned from 1 customer may not be useful for others, but at the same time, there were some common technologies.
The trick is how to identify which ones are more universal versus which ones are more unique.
[00:54:29] Unknown:
So one story I wanted to share, from what I have personally experienced of how engineering teams focus on cost reduction in the absence of a tool like Bluesky, is that in a past company, a kind of world-leading tech company, I'm not going to name the name, but if you look up my LinkedIn, you can probably figure it out. There was a period of time when there would be a VP-level mandate that, at every level, every team needed to go and cut their internal, you know, big data cloud spend by some amount. And that's a struggle for many engineers, because cost optimization is often not how people get promoted.
So it's a constant battle between how much managers and senior leadership want to prioritize such one-off events and how much resistance the underlying, you know, the individual engineers feel. And so with a tool like Bluesky, we think we can really help the data engineers out there who might face similar challenges, where the company rightly tends to value how people contribute to their product and business, not cost reduction. So why not leave that work to us, so they can focus more on how they will get measured in their performance?
[00:55:44] Unknown:
Aside from the obvious answer of customers that aren't using Snowflake currently, what are the cases where Bluesky is the wrong choice and teams might be better served doing their own cost optimization or efficiency tuning versus using a service like what you're providing?
[00:56:01] Unknown:
Yes. If you are not using Snowflake today, unfortunately, you have to wait a little. But please do tell us if you are on Redshift or BigQuery or another data cloud so that we can prioritize accordingly. Another type of user who may not be a great fit for Bluesky is one who does not care about the cost. We don't think anyone has an unlimited budget, so that can be ruled out. So the remaining case is people who think they are paying very little. And indeed, if they are only paying hundreds or thousands of dollars a year, then no, this is too early. But if they are starting to pay tens of thousands of dollars a year, it could be a good time to engage, because, first of all, they may be growing fast. Who knows? The whole point of using Snowflake is that it's so easy to use, so they can ramp up fast. And secondly, if they can install, you know, the best practices and the tooling provided by Bluesky, they can apply good housekeeping. They can run a tighter ship, so to speak, early on. To provide an example, we would encourage people to apply query tagging so that they can easily identify which business reasons or which business units send which queries, and not just rely on the warehouse names or the usernames.
And with query tagging, it becomes easy to generate better dashboards and better reports for the cost visibility that we mentioned earlier. So these are the things that we can provide. Not to mention that for, you know, our dashboards, we are not going to charge a high cost. We will just, you know, provide good best practices. And down the road, if we can help you with cost reduction, then we would want to take a percentage of that. So our interests are fully aligned with our users'.
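As a concrete illustration of the query-tagging practice described above, here is a hedged sketch using the Snowflake Python connector. The connection parameters, tag format, and table names are illustrative assumptions, not Bluesky tooling; the underlying mechanism is Snowflake's QUERY_TAG session parameter, which surfaces in the ACCOUNT_USAGE query history and makes per-team or per-pipeline attribution possible.

```python
# Hedged sketch of query tagging for cost attribution in Snowflake.
# Credentials, the tag format, and the "orders" table are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
)
cur = conn.cursor()

# Tag every query in this session with the team / pipeline that owns it,
# so spend can later be attributed beyond warehouse or user names.
cur.execute("ALTER SESSION SET QUERY_TAG = 'team:growth;pipeline:daily_orders'")
cur.execute("SELECT COUNT(*) FROM orders WHERE order_date = CURRENT_DATE")

# Later, queries can be grouped by tag as a starting point for attribution,
# e.g. (elapsed time is only a rough proxy for cost):
#   SELECT query_tag, SUM(total_elapsed_time) AS total_ms
#   FROM snowflake.account_usage.query_history
#   GROUP BY query_tag;

cur.close()
conn.close()
```

The design point is simply that the tag travels with each query into the account usage views, so dashboards can slice spend by business unit or pipeline rather than only by warehouse or username.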
[00:57:40] Unknown:
As you do continue to build out the product and work with some of your early design partners and onboard new customers, what are some of the things you have planned for the near to medium term and problems that you're excited to dig into?
[00:57:52] Unknown:
For the medium term, from our discussions with customers, we realized there were a lot of people who are using non-Snowflake cloud data warehouses, let's say BigQuery, even Redshift, or Databricks SQL, things like that. And we definitely want to extend our offering to other data warehouses in the cloud over that medium term.
[00:58:12] Unknown:
And to add to that, one thing I am excited about in the medium to long term is to apply even more machine learning to our internals. Right now, we are leveraging a lot of our own kind of human expertise accumulated over the past decade or two. But over time, we think we will want to further automate the tool and make it adapt based on users' workloads with more machine learning practices. I mentioned earlier that I spent a couple of years working on machine learning infra and making ML workloads faster and cheaper. So that's an area we think we can apply to our own Bluesky internals.
And the other part is also to provide better integration of ML support on top of the data cloud platform.
[00:58:55] Unknown:
Are there any other aspects of the work that you're doing at Bluesky or the overall space of data cloud cost optimization that we didn't discuss yet that you'd like to cover before we close out the show?
[00:59:06] Unknown:
I just want to kind of remind our audience that, first, we are looking for more trial users among Snowflake users. And secondly, we are also open to expanding our founding team. So if you are someone passionate about the Bluesky mission of making the future data clouds faster, cheaper, and smarter, regardless of whether you are an engineer, product manager, marketing expert, salesperson, and so on, we would love to talk to you.
[00:59:35] Unknown:
Alright. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as the final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:51] Unknown:
When people hear the term gap, the initial reaction is to fill in the gap by adding one more thing. I recently got a book recommendation from a friend, a book called Subtract. And my own mindset is also often that less is more. I am increasingly worried about how complex the surface area of the modern data stack has become for our users. I think it's partially due to the hot venture capital over the last couple of years, but of course also due to the strong innovations from the founders and engineers. But we think this is not sustainable. I do not see how a future data engineering team can continue to learn about dozens of tools and figure out how to put them together. So I am all for continuing to add innovation and possibly complexity under the hood, kind of like an iPhone, but the surface has to be very, very simple.
And part of what Bluesky could do to contribute to that mission is that we could provide a simpler surface layer where people could send queries to us, and we help them figure out which query workloads should be mapped to which underlying query engines and data clouds. That is, by the way, also the genesis of our name. Above the clouds, the sky is supposed to be blue, and that is the layer we want to build and contribute to the modern data stack. I'd like to actually mention one more thing. It's about the technology gap itself.
[01:01:17] Unknown:
Although big data as a technology has improved a lot of other industries, we are yet to see how big data as a technology improves itself through dogfooding. Think about the query history that we have accumulated in our systems. Right? Think about all the metadata information we have collected. Right? How much analysis, how much machine learning are we doing utilizing that information? I think that's an open area and an open question for many of us to explore, and I believe the result of that will be exactly what Misha mentioned: a simpler surface area. The data system doesn't have to be that complex if we are able to utilize the metadata that we have collected and use machine learning technology to kind of guess what users need, instead of always asking users to specify exactly what they need every time.
[01:02:03] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing at Bluesky. It's definitely a very interesting problem space and an interesting approach that you're taking, and it's definitely great to see people who are trying to make it more cost effective and efficient for people to take advantage of some of these technologies and drive value in their business. So thank you both for all of the time and energy you're putting into that. I'm excited to see how it progresses from here, and I hope you enjoy the rest of your day. Thank you, Tobias. It's been a pleasure.
[01:02:32] Unknown:
Thank you.
[01:02:38] Unknown:
For listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
- Introduction to Bluesky Data
- The Genesis of Bluesky
- Why Focus on Snowflake?
- Cost Visibility and Optimization
- Factors Driving Up Costs
- Workflow and Setup of Bluesky Mission Control
- Customer Insights and Cost Management
- When Bluesky is Not the Right Choice
- Future Plans and Medium-Term Goals
- Closing Thoughts and Contact Information