Summary
Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column-oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge in taking advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it enables a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it?
- How does it compare to the other available platforms for data warehousing?
- How does it differ from traditional data warehouses?
- How does the performance and flexibility affect the data modeling requirements?
- Snowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces?
- Can you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity?
- What are some of the current limitations that you are struggling with?
- For someone getting started with Snowflake what is involved with loading data into the platform?
- What is the workflow for allocating and scaling compute capacity and running analyses?
- One of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen?
- What are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about?
- When is SnowflakeDB the wrong choice?
- What are some of the plans for the future of SnowflakeDB?
Contact Info
- Website
- @KentGraziano on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- SnowflakeDB
- Data Warehouse
- Oracle DB
- MPP == Massively Parallel Processing
- Shared Nothing Architecture
- Multi-Cluster Shared Data Architecture
- Google BigQuery
- AWS Redshift
- AWS Redshift Spectrum
- Presto
- SnowflakeDB Semi-Structured Data Types
- Hive
- ACID == Atomicity, Consistency, Isolation, Durability
- 3rd Normal Form
- Data Vault Modeling
- Dimensional Modeling
- JSON
- AVRO
- Parquet
- SnowflakeDB Virtual Warehouses
- CRM == Customer Relationship Management
- Master Data Management
- FoundationDB
- Apache Spark
- SSIS == SQL Server Integration Services
- Talend
- Informatica
- Fivetran
- Matillion
- Apache Kafka
- Snowpipe
- Snowflake Data Exchange
- OLTP == Online Transaction Processing
- GeoJSON
- Snowflake Documentation
- SnowAlert
- Splunk
- Data Catalog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council.
[00:01:20] Unknown:
Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse. So, Kent, can you start by introducing yourself?
[00:01:39] Unknown:
Sure. Thanks for having me, Tobias. I am the chief technical evangelist for Snowflake. I've been with the company for just a little over 4 years now; I joined when there were only about 100 people in the company. Career-wise, I have been in the data space for nearly 30 years, and I've been doing data warehousing for over 25. I've coauthored books with Bill Inmon and Dan Linstedt, and I have done everything from being an independent consultant to being the director of a data warehousing team for a public school system, as well as a government contractor, and I've worked in the health care space as well. So I've seen the, I'll say, ups and downs of the industry over the last couple of decades and
[00:02:32] Unknown:
am still excited to be part of it. And do you remember how you first got involved in the area of data management?
[00:02:38] Unknown:
Yeah. Actually, I do. I was on a government contract in Denver, working at the Denver Federal Center with the Department of the Interior. I got into the job initially as a programmer, and the manager came to me and said, hey, I just came back from this class at IBM on relational database design. Does that sound like anything that might be of interest to you? He then proceeded to explain third normal form to me, and I was basically hooked at that point. Within a couple of months the project led to me getting my hands on Oracle version 5.1a, installing it on a PC, and building out the very first relational database system for that particular government agency, which was simply combining 3 mailing lists that were being managed in a COBOL system into a single database.
[00:03:41] Unknown:
That was, as they say, the beginning of the whole thing. And so, as you mentioned, you're now the chief technical evangelist for Snowflake. I'm wondering if you can start by explaining a bit about what the SnowflakeDB platform is for anybody who isn't familiar with it. Sure. So
[00:03:57] Unknown:
Snowflake is a cloud-based data platform. It started off as what I used to call a data warehouse as a service: it was built in the cloud, for the cloud, to allow people to load and analyze data at massive speed, taking advantage of the native architectures that the cloud brings us. And
[00:04:28] Unknown:
I know that there are a lot of architectural changes that come about as a result of the fact that it's designed specifically for cloud native environments. I'm wondering if you can compare it a bit with some of the other available platforms for data warehousing and maybe highlight some of the differences in terms of the operational characteristics and usage characteristics from traditional data warehouses?
[00:04:52] Unknown:
Sure. In the data warehousing world, we have evolved over the last couple of decades from basic relational database systems to MPP, massively parallel processing, systems with either a shared-nothing architecture or a shared-disk architecture. Because of the nature of the cloud, and looking at the nature of the cloud versus an on-premises sort of system, the founders of Snowflake invented a completely new architecture called the multi-cluster shared data architecture, which separates the compute power from the storage so that we can independently and elastically scale the storage and the compute.
In the previous architectures, basically, if you wanted more storage, you had to get a bigger box, and it had more compute for the most part. If you wanted more compute, many times you were forced into buying, again, a bigger box that had more storage you might not need. And you also found that you were generally bound by the compute power of the box that you bought; say it has 16 CPUs. While 16 CPUs may have been great when you started your data warehouse and you had 20 users and 1 ETL process, if you had the great opportunity to build a wildly successful platform and you suddenly found yourself with hundreds or thousands of users, they're now bumping into each other and having to compete for that shared resource.
With the architecture that Snowflake has developed, you can actually create independent compute clusters for these different workloads, which allows us to scale to a near-infinite level for the number of users and the number of processes that can be running, and at the same time make sure that they are not contending with each other for limited resources. It really allows people to be much more agile in the development of their ecosystem, as they can start very small and grow to very large without having to go out and purchase another system and do a migration.
It's all dynamic and elastic
[00:07:14] Unknown:
on demand in the cloud. And then there are a number of other competitors that are targeting the cloud environment, most notably things like BigQuery or Redshift, and more recently Redshift Spectrum. I'm curious if you can give a bit of a comparison of the characteristics of Snowflake as it relates to some of the other cloud-native implementations that people might be familiar with, and then also, on the more data lake side of things, systems such as Presto or some of the SQL-on-Hadoop or SQL-on-unstructured-data solutions?
[00:07:48] Unknown:
Sure. Yeah. Even some of the systems that are calling themselves cloud native are based on the previous architectures that I talked about, whether shared nothing or shared disk. And while they may be in the cloud, for the most part they are slight evolutions of the original architectures, so they don't offer this dynamic scalability under the control of the customer, on demand. You have things like query engines that can scale quite large, but in general the customer doesn't have a lot of control over that, which means they could throw a really large job at it and it can scale and really do a great job, but they don't necessarily have control over how much resource is being thrown at it, which means that, in the end, they don't have a lot of control over how much that may cost them. And some of the other ones are really ports of the older architectures into the cloud, so you're looking at things that are effectively VMs, and the architecture is still the same, where the compute and the storage are tightly coupled: if you need more storage, you gotta buy more compute, and if you need more compute, you gotta buy more storage.
Now, to be fair, many of our competitors are obviously evolving, and they're adding things to their offerings over time. We're starting to see that, but in the end it still ends up being a fairly complicated engineering feat for the folks who are managing the system. And that's one of the things that really differentiates us: the, I'll say, lack of management that you need to do. What a traditional database administrator may have had to do with performance tuning and backup and recovery and things like that, we've taken care of all that for you.
Whereas in many of the other cloud-based data warehouse systems, the DBA still has that more traditional role and is having to manage things like distribution keys and vacuuming, as well as the sizing and forecasting of utilization. So that really does make a big difference. On the other side of the house, when we start talking about data lakes and semi-structured data, one of the things that we did in Snowflake is invent a new data type called VARIANT that allows you to load semi-structured data into a column in a table and then access it natively with some minor extensions to SQL.
This saves people from having to parse it out and basically ETL the data into traditional tables and columns. In comparison, things like Presto and Hive and some of the other tools that evolved out of the Hadoop ecosystem really are SQL interfaces, if you will, on top of a file system. So you do not have some of the traditional database features such as transaction consistency and ACID compliance; those aren't necessarily there, depending on which rev you're using. Again, those are all evolving, but what we've seen over the last 10 years is that what started out as, I'll say, very minimalist SQL, because we discovered everybody really understands SQL and they don't necessarily understand Java and MapReduce, meant that a SQL interface had to be built to make those Hadoop-style systems work. And today, with the cloud, some of it's just blob storage, whether it's S3 on Amazon or Azure Blob Storage; we're still dealing with more or less a file system with a SQL interface built on top of it. This requires a lot more care and feeding and puts limitations on what you can do from a traditional database perspective, which we've got all built into the Snowflake
[00:11:52] Unknown:
platform already. And one of the things that you mentioned in there is this VARIANT data type for being able to load semi-structured data directly into Snowflake. I know that that's one of the attributes that has led to the current shift from ETL to ELT, where you're doing the transformations within the data warehouse itself rather than upfront, because traditionally there was a lot of planning and rigor that had to go into loading the data into the data warehouse to make it accessible for other people. And so I'm wondering how the performance characteristics and flexibility of Snowflake, the availability of this VARIANT data type, and some of the schema flexibility play into the overall data modeling requirements for using Snowflake and some of the shift towards this ELT workflow.
[00:12:42] Unknown:
It's been very interesting for me, because I come from a traditional relational background with third normal form modeling as well as dimensional modeling and data vault modeling in the data warehouse world. So looking at Snowflake and the features that we have, even just looking at the semi-structured data, whether it's JSON, Avro, Parquet, things like that, a lot of our customers have found they can load that into a column. Effectively, you have a table where there might be a couple of metadata columns and a VARIANT column, and a single row in that table will hold, in that variant, an entire JSON document that might have arrays and nested arrays and all sorts of really interesting structures within, we'll call it, the schema of the JSON document.
And our customers are able to build views on that using our JSON SQL syntax and make it look like a dimensional model. I found this incredibly fascinating and also incredibly powerful, in that we are now able to do things like load web logs into a database table but make them accessible to, I'll say, your average business analyst in a dimensional format. So they can actually do analytics on this type of data using Tableau, using Power BI, using tools that they're very familiar with, and they don't even know that it's JSON data under the covers. With the combination of being able to load that type of data and the SQL extensions to access it, we've actually eliminated the T in ELT, extract load transform: the transform part is now a SQL view, and it's transforming the structure into something that a business analyst understands.
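As a rough sketch of the pattern Kent describes (the table, stage, column, and JSON field names here are hypothetical, not from the episode), loading raw JSON into a VARIANT column and exposing it to analysts through a view might look something like this:

```sql
-- Land raw events in a single VARIANT column alongside a couple of metadata columns.
CREATE TABLE raw_web_events (
    load_ts  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    src_file STRING,
    payload  VARIANT
);

-- Load from a hypothetical stage holding JSON files.
COPY INTO raw_web_events (src_file, payload)
FROM (SELECT METADATA$FILENAME, $1 FROM @web_events_stage)
FILE_FORMAT = (TYPE = 'JSON');

-- A view does the "T": Snowflake's path notation and LATERAL FLATTEN turn the
-- nested document into a flat, dimensional-looking shape for BI tools.
CREATE OR REPLACE VIEW web_events_dim AS
SELECT
    payload:session_id::STRING   AS session_id,
    payload:user.country::STRING AS user_country,
    e.value:page::STRING         AS page,
    e.value:ts::TIMESTAMP_NTZ    AS event_time
FROM raw_web_events,
     LATERAL FLATTEN(input => payload:events) e;
```

A BI tool pointed at a view like web_events_dim just sees ordinary columns; the JSON handling stays inside the view definition.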
But we're able to then load data much faster and really reduce the latency and the time to value for that data. And the performance optimizations that are built into Snowflake, combined with our ability to create these independent compute clusters, which we call virtual warehouses, allow us to get the performance out of that. There is literally no indexing, there's no manual partitioning, there are no distribution keys, and so the DBAs are not having to spend, I'll say, nights and weekends trying to make this stuff perform. We're able to just build the views on it, expose it via Tableau, and the business analysts now have rapid access to all of this data.
And, additionally, we can allow them to join that semi-structured data to structured data that might have been loaded, say, from a CRM system in a more traditional schema. From a data modeling perspective, there are two factors here with Snowflake. One, in the scenarios we're talking about right now with the semi-structured data, is that we're doing almost virtual modeling. Right? The modeling approaches, whether it's 3NF, Data Vault, or dimensional, can all be done with views. So it's more about what the right structure is to represent the data in for the business use case, and that also means that you can have one set of data and have multiple representations of it serving different use cases.
And this really gets us down to that concept of the single source of truth that we've been trying to reach in the data warehousing world for several decades. The other aspect of modeling with Snowflake is that the system was designed to be what we call schema agnostic. Many of the traditional on-premises database systems really preferred a specific modeling technique, be it third normal form or dimensional for the most part, and required specific tuning to make those things work. Snowflake was designed to be performant against any modeling technique, because we wanted to ensure that wherever our customers were coming from in their legacy on-prem world, they wouldn't necessarily have to change their approach to modeling in order to get effective use out of Snowflake. And so that was actually designed in from the ground up.
One of the things that I didn't say earlier is that our founders wrote Snowflake completely from the ground up. It's a brand new relational database. So not only did they invent a new architecture to take advantage of the elasticity and the dynamic nature of the cloud and to eliminate the concurrency and performance issues that people have had with the traditional systems, they were able to do this only by starting from ground zero and not forking some previous code base. It is all new code, resulting in this really dynamic data platform that can be used for, as we were talking about, data lake types of applications as well as traditional analytics applications, and be ANSI SQL compliant at the same time. And
[00:18:14] Unknown:
there seem to be two driving factors for when somebody might decide between using a data lake versus a data warehouse. One of them is the flexibility in terms of the types of data that you can load in and the options for how to analyze it, which is in large part addressed by the flexibility in the schema modeling that Snowflake offers, and the other one is cost. And so I'm curious what that looks like in terms of a breakdown of the overall spend that somebody would be dealing with, particularly on the storage side but also to some extent the compute, of storing everything in an S3 data lake or possibly even HDFS versus what you'd be storing for the same volume of data in Snowflake, and then the amount of time and compute energy required to be able to derive value from it, or the amount of data copying necessary, particularly in the data lake scenario
[00:19:11] Unknown:
of being able to then transform it into different forms for running various analyses on it? Yeah. You know, that's a great question. One of the things with Snowflake is that we are MPP style, right? So massively parallel processing, using MPP compute nodes in the cloud. But on the storage side, we have actually implemented an advanced columnar compression algorithm as well. So from a pure storage cost perspective, our customers pay basically the same price for storage as you would on the underlying cloud vendor. In the case of AWS, we're talking about S3 storage, because under the covers that's what we're using. But the difference is that we are giving them 3 to 5x compression on that data. So from a pure data lake storage perspective, you can store that data in Snowflake and, in all likelihood, be paying way less because of the compression that we have. Our terabyte-per-terabyte storage cost is effectively the same as blob storage, except that in our case we're able to compress it, so you're going to get 3 to 5 terabytes compressed into 1 terabyte of storage. That's a huge cost savings for our customers, and that's why it's become feasible for them to consider putting their data lake in Snowflake as well as their data warehouse and doing all the analytics.
Now, your secondary question there is: okay, if we put the data lake and all that data in Snowflake, how much is it costing us? What does it look like to then transform that into analytics-style schemas? As I mentioned before, in many cases, if we're talking simply about the semi-structured data and even structured data, a lot of our customers are just using views. So the cost there ends up being the compute to execute the query, where in the past you may have had the cost of doing actual ETL processing. We certainly do have customers that are doing transformations within Snowflake as well, where it's necessary to apply some advanced business rules, some data cleansing rules, things like that, so you do have that as well. But that cost, in the case of transformations, whether you're talking ETL or ELT, is something that they were paying one way or another before already.
The advantage again with Snowflake is that our customers are finding our system runs way faster than their prior systems. I know of one customer that had a report that ran for 20 hours on their data warehouse appliance, and by simply moving the data and the schema over to Snowflake, it ran in 45 minutes. And so they experienced a massive cost reduction when looking at the cost of compute on Snowflake in comparison to what that massive data warehouse appliance cost them. And it is a different cost model for sure. I mean, it is a pay-as-you-go model. In the case of Snowflake, we charge per second per node.
So when you create these virtual clusters, we're talking about how many nodes there are. Is it a single node? Is it 2 nodes? Is it 4 nodes? Is it 8 nodes? We charge on a per-second, per-node basis and generally get linear scalability. So if somebody wants to run something faster, they use a larger virtual warehouse and it'll run in half the time, but because of the pricing model it costs them exactly the same amount of money; it just ran faster. And so then we start talking about the value of the time and what else they can do when they have access to that data so much faster.
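As a minimal sketch of that resize-for-speed idea (the warehouse name and sizes here are hypothetical), resizing is a single statement; if scaling is roughly linear, a bigger warehouse finishes sooner for a similar per-second, per-node credit spend:

```sql
-- Scale the warehouse up before a heavy job: more nodes, proportionally less wall-clock time.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- ... run the long report here ...

-- Scale back down (or let auto-suspend take over) once it finishes.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL';
```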
[00:23:14] Unknown:
And another thing that you mentioned a couple of times is the fact that, because of the ability to create these views on the underlying data structures, you can in many cases just have the data living in one place, and you don't have to worry about copying it. So I'm curious how that plays into the requirement for a master data management approach for ensuring that there are canonical schemas or canonical IDs for users or sales products or things like that, and whether that is still a necessary aspect of data modeling and data management in this Snowflake world.
[00:23:51] Unknown:
It is for certain use cases, and it really is about what the business requirements are. If you're talking about a lot of raw JSON web logs and you need to do data science on them and try to derive some insights from that data, then you're pretty much good to go with what we've been describing. If you are looking at building, I'll say, an enterprise data warehouse in a highly regulated industry that needs to be compliant with GDPR and CCPA and some of the other regulations that are out there, then we do have our enterprise customers, the really big organizations, that are still building canonical schemas and doing more of the traditional transformations and master data management type approaches.
I think the beauty of Snowflake is that it allows you to do either one. You're not locked in. You have the flexibility based on the requirements of your business and your approach and what you're trying to achieve. So we do have people doing both, for sure, that are porting over things that are what I call the highly integrated enterprise data warehouse, where it does require an integration layer using something like a data vault modeling methodology, where you will have stage tables and a core granular data warehouse that is aligning the business keys and making sure that when the reporting comes out we have a common, agreed-upon representation for a product or a customer. So all of that still does happen, and, again, it just varies from customer to customer as to what their business requirements are. And so you mentioned that
[00:25:44] Unknown:
the high-level view of how Snowflake is implemented is that there's the underlying data storage in the cloud object store, and then you run the compute on top of that in a sort of virtualized way, and in a way that gives you some segmentation in terms of who can access what. But I'm wondering if you can dig a bit more into how the Snowflake platform is architected and some of the ways that it has evolved over the years. Sure.
[00:26:12] Unknown:
So, again, for the storage layer, as we were saying, under the covers, whether it's on AWS or Google, we're using the blob storage native to that cloud, and we are writing our own proprietary file structures into that storage. As a Snowflake customer and user, you don't see it as S3 storage. You don't see any storage, really. I mean, you look at a chart and see how much storage you're using, how many terabytes of data you're storing, but you can't access or see these files. It's all something we've built: a patented file format and structure with metadata around it, and that's the storage under the covers. The compute layer is then using the compute nodes. So in the case of Amazon, we're talking about EC2 nodes under the covers, complete with their SSDs, and that becomes our dynamic memory, if you will, along with the number of threads on the compute cluster that someone has allocated.
But holding that all together is what we call the global services layer, and that's really the brains of Snowflake. That is where the metadata engine is; it's tracking where all the files are, and tracking the users and the roles. All the role-based access control and the security is tracked in there: who has access to what compute resources, who has privileges to create new compute resources, who has privileges to create new tables and load data, all of the standard things that you would need in a database from a role-based access control perspective. The global services layer also manages the transaction consistency, the ACID compliance, and all the security from an encryption perspective.
Our underlying data is always encrypted. It's 256-bit encryption in motion and at rest. So as you are uploading data into Snowflake, that data is encrypted on the client side before it comes over the wire into the Snowflake ecosystem, with a fairly robust hierarchical encryption scheme to make sure this stuff is all secured, complete with rotating keys, because one of the biggest concerns with the cloud was always: is our data secure? And I've worked in many traditional database environments over my career where the data was never encrypted and everyone was counting on the firewall to keep the data secure, when in fact people were able to get through those firewalls, read the underlying files, and actually glean data out of them.
Snowflake was built to make sure that could never happen. And because of that, we are HIPAA compliant, we're PCI compliant, and we have SOC 2 Type 2 and SOC 2 Type 1 certifications, along with a number of other security certifications, to ensure that our customers' data is always secure and that there's no way anyone can get to it. So it's really the 3 layers: the global services layer, which is the brains; your compute layer, which is dynamic, where you can create as many of these independent compute clusters as you want; and then at the bottom there is the common storage layer, allowing us to have one set of data, again not having to copy it between silos and data marts, but to have one set of data accessed by all these different compute nodes.
[00:30:05] Unknown:
And my understanding from previous conversations that I've had, one of which was actually on this podcast, is that the actual metadata layer for Snowflake is implemented on top of FoundationDB, so that you have one location for being able to identify where all these files are located in the broader S3 storage rather than having to query the S3 API and deal with those latencies as you seek through information while trying to run a query. I'm wondering if you can talk to either that or any of the other performance optimizations that are built into the system to ensure that it is operating at the sort of maximum performance that you're looking for and to be able to provide that reduced time to value for your customers.
[00:30:52] Unknown:
Yeah, and that's absolutely correct. What you were told before is right: under the covers, our metadata engine is based on FDB, and we retain, I'll say, the addresses of all of the data in there, so that when someone is querying the data it goes in there and looks for it. Depending on the nature of the query, the names of the tables, all of that's there and has pointers directly to our underlying file system to find that data as quickly as possible. And, in fact, along with all the pointers there, part of it is keeping the range of the data, to know that from A to Z of this particular column, you can find that in this set of files under the covers. So we have that sort of direct pointer there, and when people are writing queries, based on the predicates and the join conditions, we have a lot of metadata information that allows us to go directly to the files where that data is really stored. Again, to a customer it's a table, right? But under the covers those tables are made up of a lot of very small files, which we call micro-partitions, and one of our primary methods of performance is called partition pruning. It's based on the ability of the system to look at the query, go to the metadata, and then go directly to the set of files that contain the data that the query is asking for, so we're not having to do what would have been called, in the old database world, a full table scan. Obviously, if somebody needs to do that and they want to do an aggregation on a trillion-row table, well, it's going to do a full table scan. But if you're doing something more like a dimensional analysis and we want to see all the sales for product X, for region Y, for sales rep Kent, we're going to be able to hone in on that data set very quickly and only return that set, even if we're talking about multi-terabyte tables and even into the petabyte range.
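To illustrate the pruning idea with a hypothetical sales table (names and values are invented for the example): because Snowflake tracks value ranges per micro-partition in its metadata, selective predicates like these let it skip the partitions whose ranges cannot match instead of scanning the whole table.

```sql
-- Only the micro-partitions whose min/max metadata overlaps these predicates are read.
SELECT product_id, SUM(amount) AS total_sales
FROM sales
WHERE region    = 'WEST'
  AND sales_rep = 'Kent'
  AND sale_date BETWEEN '2019-10-01' AND '2019-10-31'
GROUP BY product_id;
```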
[00:33:11] Unknown:
And then in terms of the overall system architecture and the implementation details, I'm wondering if there are any sort of edge cases or limitations that you're dealing with or any of the specific challenges for being able to design and implement this across multiple different cloud vendors.
[00:33:28] Unknown:
Oh, yeah. That's interesting. That gets into the fact that we are cloud agnostic. In the experience that I've had with Snowflake so far, there doesn't seem to be anything that we can do on one cloud that we can't do on another. And that, again, was the goal of the founders: they developed Snowflake with this global services wrapper around it, and we're really writing to the APIs of the underlying cloud vendors so that we can ensure that whether you choose an AWS deployment or an Azure deployment or a Google deployment, your Snowflake experience is exactly the same in all 3. And one of the things that is coming, that we just released earlier this year, is the ability to do not only cross-region replication within a cloud provider, but now also cross-cloud replication.
So you could actually be running Snowflake on AWS and replicate it to Snowflake on Azure. In some cases our customers want to do that for disaster recovery and business continuity. In other cases it may be for latency reasons. And in some cases it may be for business reasons, where there are certain datasets that, for whatever reason, they only want in the Azure space, and other datasets they may want or need to have in the AWS space, for various business reasons. So we've done quite well, because the system was designed from the get-go not to be single cloud; it was designed to be cloud agnostic. And so we have not had these issues where, hey, this feature works great in AWS but it doesn't work in Azure. And then
[00:35:26] Unknown:
for people who are interested in getting started with using Snowflake and loading their data into it, what are some of the options they have for being able to populate the data warehouse that they set up with their data, some of the overall workflow of then allocating and scaling the compute capacity for being able to run analyses, and maybe some of the options for cost optimization to ensure that they don't leave those clusters running when they're not actually being used? Yeah. I'll start with that one first because that's the easiest one.
[00:35:58] Unknown:
When you create a virtual warehouse on Snowflake, we have an auto-suspend and an auto-resume feature, because it is pay as you go. And what we've discovered over the years is that many workloads do not need to be running 24/7, 365. In business intelligence and dashboarding and all of that, in many cases people come in, fire a query, get the results back, and they're good for the day, really. And so we have this auto-suspend feature, where the customer controls it with a parameter that says if the compute is not being used for x number of seconds, minutes, or hours, up to them, it automatically shuts down.
And then with the auto-resume feature, the next query that comes in to hit that cluster automatically turns it back on. So that's a great cost control, and people are finding that they are certainly saving a lot of money that way, in that we bill on a per-second basis. So if something runs for 5 seconds, you pay for 5 seconds. If it runs for 5 minutes, you pay for 5 minutes. If it runs for an hour, you pay for an hour. One of the other scaling features is we do have the ability to dynamically resize.
So say you are running a use case that normally runs for an hour and you're running it on a small virtual warehouse, which is 2 nodes, and you need to get it done in 15 minutes. Well, then you alter warehouse resize to a large, and now it's running on 16 nodes and it's going to run in 15 minutes instead of an hour. But it costs the same because it's per second per node. And so that's a really cool feature that we have that people do use on a fairly regular basis. On the storage side: how do we get our data into Snowflake? That's obviously a very common question. How do we get data into the cloud if all of our data is in data centers? We have a very open ecosystem of connectors. We started off with a native Python connector, an ODBC connector, a JDBC connector.
We even have an optimized Spark connector, since people were writing their ETL using Spark, which many organizations have been doing. And all the, I'll say, traditional ETL and ELT vendors have connectors. So if you're currently using SSIS or Talend or Informatica, you can repoint that to Snowflake and continue to use the same flow that you're using today. Then, of course, there are newer ones like Fivetran that are tools more born in the cloud themselves; you have an option to use those. In fact, we even have a Partner Connect feature: there's a dashboard when you log in to Snowflake where you can sign up for free trials of some of these tools.
And I've heard from customers doing a POC with us that they got access to their new Snowflake account, went in, and within 30 minutes they were up and running, provisioned with one of our partner tools and loading data from their old database into Snowflake. So there's any number of options. You can use Kafka: if you happen to be running in a Kafka environment, we just released our Kafka connector using Confluent, and so people are able to run data from Kafka straight into Snowflake. And we have, I'll say, the traditional methods.
You can upload your data into a blob storage bucket and then copy it into a table in Snowflake. We have our own ingestion engine called Snowpipe that allows us to take data that is dumped into a blob storage bucket, automatically pick that data up, and load it into tables in Snowflake. And there's even a button on our web UI where you can click to load data in, specify the table you want to load the data into, specify the file you want to pull it from, and you could be pulling it off your laptop and more or less manually loading it using our web UI. So there's a very large variety of options for customers to choose from in how they're going to load their data into Snowflake.
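As a rough getting-started sketch tying these pieces together (warehouse, stage, table, bucket, and pipe names are hypothetical, and credentials are omitted), a cost-conscious setup might create an auto-suspending warehouse, a stage over a cloud bucket, and either a one-off COPY or a Snowpipe for continuous loads:

```sql
-- A virtual warehouse that suspends itself when idle and wakes on the next query.
CREATE WAREHOUSE IF NOT EXISTS load_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 300        -- seconds of inactivity before shutting down
  AUTO_RESUME    = TRUE;

-- Landing table with a single VARIANT column for semi-structured files.
CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT);

-- External stage pointing at a cloud storage bucket (access credentials not shown).
CREATE STAGE IF NOT EXISTS landing_stage
  URL = 's3://my-landing-bucket/events/'
  FILE_FORMAT = (TYPE = 'JSON');

-- One-off bulk load from the staged files.
COPY INTO raw_events FROM @landing_stage;

-- Or continuous ingestion: Snowpipe picks up new files as they arrive in the bucket.
CREATE PIPE IF NOT EXISTS events_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_events FROM @landing_stage;
```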
[00:40:56] Unknown:
And another one of the interesting features that's enabled by your architecture is the data sharing capability, for being able to say that this data set is available to somebody else who's running a virtual data warehouse. That opens up the possibility of things like data brokerages, or allowing a certain subset of data to be shared between different business units or organizations. And I'm wondering what you have seen as some of the most interesting or unexpected uses of that capability.
[00:41:25] Unknown:
Oh, yeah. That's pretty much everybody's favorite feature these days: data sharing. Any organization now can become a data provider and effectively build what I refer to in my architectural framework as a curated data mart, and grant access to that data mart to any other Snowflake account holder, and it appears to them as a read-only database. And that has evolved now into the Snowflake Data Exchange, which is available in public preview on one of our regions, where customers have signed up and are actually making data available, like weather data.
AccuWeather has a tile on there that you can sign up with to get weather data and access it to augment the analytics in your data warehouse on Snowflake. And then just this week at AWS re:Invent, we announced the private data exchange, where we're now going to allow customers, internally within their organization rather than through a public data exchange, to find these datasets in a much more, I'll say, user-friendly interface, and to specify other organizations within their company, or potentially specific business partners, that they want to allow access to this data, in a very resilient and very secure manner.
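On the provider side, setting up a basic share is a handful of SQL statements; a minimal sketch, with hypothetical account, database, schema, and share names:

```sql
-- Provider account: create a share and expose a curated schema through it.
CREATE SHARE curated_mart_share;
GRANT USAGE  ON DATABASE analytics         TO SHARE curated_mart_share;
GRANT USAGE  ON SCHEMA   analytics.curated TO SHARE curated_mart_share;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.curated TO SHARE curated_mart_share;

-- Add the consumer's Snowflake account to the share.
ALTER SHARE curated_mart_share ADD ACCOUNTS = partner_account;

-- Consumer account: the share appears as a read-only database; no data is copied.
CREATE DATABASE partner_data FROM SHARE provider_account.curated_mart_share;
```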
One of the most interesting usages of this was very early on, with one of our customers that had a partnership with another Snowflake customer, and they provided analytics on YouTube videos. They were collecting all of that sort of data and shared it over to the other Snowflake customer, who then did augmented analytics on it with more consumer-related data that they had. So they were effectively joining that data to the YouTube analytics and providing a different level of analytics, and then in turn sharing that data back to the original data provider.
So, effectively, the raw data is coming from one organization and being shared with a partner organization, who then augmented it with their own data to get a more sophisticated data set and shared it back to the original provider. And then the really interesting thing about it was when the original consumer turned around and bought the provider, and it actually became one company. The amount of money that they saved by not having to do manual transfers of this data was incredible, and their return on investment was quite amazing to see, because of what it eliminated: even before the acquisition, they were dropping files onto a secure FTP, and then the other organization had to have an ETL process to ingest it into their data warehouse.
They were doing their analytics on it and augmenting it, and then having to spit that data back out to flat files on a secure FTP site, and then the whole process starts over again. It was quite fascinating to see that particular use case, because I call it bidirectional data sharing. It's really interesting to see what people can do with these features, and in many cases it may be use cases that our founders might not have even thought of when they first started. But it was all a result of the separation of compute from storage: because the storage is centralized, separate from the compute, that's what allows us to do data sharing and to create this logical share container on the storage layer and have it accessible by compute
[00:45:33] Unknown:
that's in a completely different Snowflake account. And what are some of the other features or use cases for Snowflake that are not as well known or as well publicized as this data sharing capability, which you think users should know about and would be able to benefit from?
[00:45:50] Unknown:
Well, there are a couple that I think DBAs really love. There's a feature called UNDROP that allows you to instantly recover anything from a table to a schema to an entire database. So the classic Monday morning oops: somebody comes in, thinks they're logged into development, but they're logged into production, and they drop a set of tables because they're about to do the next iteration, and then realize, wait a minute, I was in production. They simply issue a command, undrop table, undrop schema, whatever it was, and it's instantaneously back.
That's, I'll say, a side effect of our feature called Time Travel, which allows you to query your data as it was at a previous point in time. By default, all of our customers get 24 hours of time travel. So if you're running an ETL process in the morning and later in the afternoon you discover that, hey, there's something wrong with that process, they can very easily roll it back; they can take a look at what it looked like before the process ran. And you can go all the way up to 90 days with it, so there's literally no configuration other than setting the parameter.
Because of that feature, people can do things like undrop: if you've got a 90-day time travel window, you could actually recover something that you dropped 89 days ago, instantaneously, and that's a really cool feature. One of the other ones which I don't think people know a lot about is that we actually do result set caching. When you run a query and it produces a result set, that dataset is actually cached for 24 hours. You can go into the history tab in the Snowflake UI, review the query that you ran, say, 23 and a half hours ago, and actually take a look at that dataset without having to re-execute the query, so there's no cost involved in that particular one. And so from a development perspective, I found that very, very useful.
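A few one-liners illustrating the features just described; the table name is hypothetical, retention is governed by the DATA_RETENTION_TIME_IN_DAYS parameter, and 90-day retention assumes the appropriate edition:

```sql
-- Oops, that was production: bring the dropped table straight back.
UNDROP TABLE sales;

-- Time Travel: query the table as it looked an hour ago.
SELECT * FROM sales AT (OFFSET => -3600);

-- Extend the Time Travel window (up to 90 days, edition permitting).
ALTER TABLE sales SET DATA_RETENTION_TIME_IN_DAYS = 90;

-- One way to reuse the previous query's cached result set without re-executing it.
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```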
The classic: it's 5 o'clock, I'm working my way through building out a particular SQL query, I go home, I come back the next morning and try to remember where I was. Well, instead of having to rerun the last query that I ran before I went home last night, I simply go back,
[00:48:23] Unknown:
and look at the results cache, and I can see the dataset, I can see the query, and then I can continue on from there without having to incur the cost of re-executing the query the next morning. And so with all of the different capabilities and flexibility and benefits that Snowflake can provide, it's easy to assume that it's the right choice for every situation. But I'm wondering what you have seen as being some of the cases when Snowflake is the wrong choice.
[00:48:50] Unknown:
Well, for one, one of the things that people used to ask me when I first started, because we always said, you know, we're a data warehouse, is: well, that sounds like it's a database. And, yes, Snowflake is indeed a database, and now we say Snowflake is really a fairly robust data platform. The one thing you absolutely would not want to do with Snowflake is an OLTP application, where you're doing singleton inserts and queries, because it is an MPP system. There are 8 threads on our smallest virtual warehouse, and you're certainly going to be grossly underutilizing it if you're doing singleton inserts. So any sort of really single-row type of application is probably not the best fit.
You know, microbatches, continuous feeds, things like that are okay, and we are working, as part of one of our enhancements, on improving our ability to handle low-latency type requests. But you really wouldn't want to build, say, an order entry system on Snowflake. Even though you could use our Node.js connector and build a very fancy web UI order entry system from scratch, that would just not be the right thing to do with Snowflake. And what are some of the plans for the future of SnowflakeDB, either from the technical or the business side? Well, from a business side, the whole business continuity story and global Snowflake is very big, where we are moving forward with our cross-cloud replication.
And one of the things on the road map will be an automatic sort of failover: if your primary Snowflake account fails for some reason, the service goes down, or the telco lines go out for a particular availability zone, it will automatically fail over and automatically fail over the applications pointing to Snowflake. So that is on the road map. It's a thing we call organizations, which is going to allow you to have basically one organization with multiple Snowflake accounts under it that are all very well connected. Another is our geospatial support.
So we're going to be rolling out some support for geospatial data types, including GeoJSON, and you can imagine there are certain industries where that's going to be a really big hit. So I know that's coming down the pike. And then, with all the regulations and concerns over privacy, we are going to be adding column-level access controls, including dynamic data masking. That's a feature that's been requested particularly in the health care and financial industry space: to be able to do it not just at a row level but at a column level, and to incorporate
[00:52:05] Unknown:
data masking functions into Snowflake directly. And that's one thing that we didn't really touch on yet: the specific SQL dialect that Snowflake is using. I'm wondering how closely it sticks to the ANSI standard, and what are some of the interesting extensions or built-in functions that it provides?
[00:52:25] Unknown:
Yeah. So it is ANSI standard SQL. The extensions I think I mentioned a little bit already: the extensions we have for accessing keys within JSON and semi-structured data, as well as XML, so there are some extensions there. We have a very extensive SQL function library, which people can check out in our documentation; you can just go to docs.snowflake.com and look up the SQL library. We have a full set of windowing functions and standard SQL statistical functions, things of that nature, already in there. We also have the ability for customers to write user-defined functions, and recently we've added support for stored procedures. Kind of back to our ELT conversation a little earlier, we recently introduced a feature, streams and tasks, that will take advantage of stored procedures to do some change data capture within the Snowflake engine. So as new data is loaded, you can create a stream object on top of the table to look at what has changed and then launch a task to move the data further downstream. So we're really working on expanding what I'll call true ELT capabilities within the Snowflake engine itself.
But it's all standard SQL, you know, standard SQL functions to do the transformations and calculations that people have typically needed to do, even to the point of doing some data-science-type calculations. One of our customers built a data science application completely using Snowflake SQL; the data scientist said everything he needed was in the library to do the statistical analysis that he was trying to achieve.
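As a sketch of the streams-and-tasks pattern for simple change data capture inside Snowflake (the stream, task, warehouse, target table, and schedule below are hypothetical, and the target table is assumed to already exist):

```sql
-- Capture changes on the landing table.
CREATE STREAM raw_events_stream ON TABLE raw_events;

-- A task that wakes on a schedule and only does work when the stream has new rows.
CREATE TASK propagate_events
  WAREHOUSE = load_wh
  SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('raw_events_stream')
AS
  INSERT INTO curated_events
  SELECT payload:session_id::STRING, payload:ts::TIMESTAMP_NTZ
  FROM raw_events_stream;

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK propagate_events RESUME;
```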
[00:54:28] Unknown:
Are there any other aspects of the Snowflake platform or the ways that it's being used or the use cases that it enables that we didn't discuss yet that you'd like to cover before we close out the show? Wow.
[00:54:41] Unknown:
It's being used in pretty much every vertical. We've got folks that are using Snowflake now for security analytics, and we have an open source system that we've put out called SnowAlert. That's a really interesting use case, where people are even using Snowflake to replace some of the functionality they were previously using things like Splunk for. Marketing analytics: we have quite a few customers doing marketing analytics using Snowflake, and then certainly health care, as I've already mentioned; we have well over 100 health-care-related companies doing health care analytics.
Some companies are actually building analytic applications on top of Snowflake, so they're providing an analytic service to their customers that's powered by Snowflake under the covers. We're starting to see, I'll say, more and more of that sort of thing, in many cases in combination with data sharing. So there are lots of interesting applications, and this sort of hybrid, being able to do your data lake and your data warehouse all in one platform, is certainly a growing use case. With the addition of streams and tasks and our ELT capabilities, I expect to see that growing even more over the next couple of years.
[00:56:04] Unknown:
And so for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. There's a lot of tooling out there. I mean, the biggest
[00:56:24] Unknown:
question I keep getting is about things like data catalogs for large enterprises, and there is some tooling out there for that. We have a number of partner companies that specialize in data catalogs, and I see more of a knowledge and process gap, honestly, than a tooling gap. It's people understanding the need for data governance and some of the best practices around data management that are necessary to make effective use of all these really large datasets that people are now starting to deal with. It ends up being more of a process than a technology question right now.
The only thing I'll say on the Snowflake side that we don't handle at this point is things like images and sound, you know, MP3s, closer to what people would think of now as truly unstructured data, though in reality they all have some amount of structure to them. So that's a little bit of a gap that I think we have right now that I know we're looking at, and we have customers asking us about how we can integrate some of the use cases around that as well.
[00:57:41] Unknown:
Well, thank you very much for taking the time today to join me and share your expertise and experience of working with Snowflake. It's definitely an interesting platform, and one that has been gaining a huge amount of popularity and mind share, so it's great to see the directions that it's going and some of the capabilities that it provides. So thank you for all of your time, and I hope you enjoy the rest of your day. Thank you very much.
[00:58:10] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends
[00:58:40] Unknown:
and coworkers.
Introduction to Kent Graziano and Snowflake DB
Early Career and Involvement in Data Management
Snowflake DB Architecture and Advantages
Comparison with Other Cloud Data Warehousing Solutions
ELT Workflow and Data Modeling in Snowflake
Cost and Performance Considerations
Master Data Management and Canonical Schemas
Detailed Snowflake Platform Architecture
Getting Started with Snowflake and Cost Optimization
Data Sharing Capabilities and Use Cases
Lesser-Known Features of Snowflake
When Snowflake is Not the Right Choice
Future Plans for Snowflake DB
SQL Dialect and Extensions in Snowflake
Use Cases and Industry Applications
Biggest Gaps in Data Management Tooling