Summary
Data warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in data warehousing are oriented around cloud native architectures that take advantage of dynamic scaling and the separation of compute and storage. Firebolt is taking that a step further with a core focus on speed and interactivity. In this episode CEO and founder Eldad Farkash explains how the Firebolt platform is architected for high throughput, their simple and transparent pricing model to encourage widespread use, and the use cases that it unlocks through interactive query speeds.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Eldad Farkash about Firebolt, a cloud data warehouse optimized for speed and elasticity on structured and semi-structured data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Firebolt is and your motivation for building it?
- How does Firebolt compare to other data warehouse technologies, and what unique features does it provide?
- The lines between a data warehouse and a data lake have been blurring in recent years. Where on that continuum does Firebolt lie?
- What are the unique use cases that Firebolt allows for?
- How do the performance characteristics of Firebolt change the ways that an engineer should think about data modeling?
- What technologies might someone replace with Firebolt?
- How is Firebolt architected and how has the design evolved since you first began working on it?
- What are some of the most challenging aspects of building a data warehouse platform that is optimized for speed?
- How do you handle support for nested and semi-structured data?
- In what ways have you found it necessary/useful to extend SQL?
- Due to the immutability of object storage, for data lakes the update or delete process involves reprocessing a potentially large amount of data. How do you approach that in Firebolt with your F3 format?
- What have you found to be the most interesting, unexpected, or challenging lessons while building and scaling the Firebolt platform and business?
- When is Firebolt the wrong choice?
- What do you have planned for the future of Firebolt?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Firebolt
- Sisense
- SnowflakeDB
- Redshift
- Spark
- Parquet
- Hadoop
- HDFS
- S3
- AWS Athena
- BigQuery
- Data Vault
- Star Schema
- Dimensional Modeling
- Slowly Changing Dimensions
- JDBC
- TPC Benchmarks
- DBT
- Tableau
- Looker
- PrestoSQL
- PostgreSQL
- FoundationDB
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard earned expertise. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host is Tobias Macey, and today I'm interviewing Eldad Farkash about Firebolt, a cloud data warehouse optimized for speed and elasticity on structured and semi structured data. So, Eldad, can you start by introducing yourself? Yes. Hi there. Hi, everyone. Thanks for having me. My name is Eldad Farkash. I'm the CEO and cofounder of Firebolt.
[00:01:57] Unknown:
I've been with data for around 26 years, started really young, and then kept going with it. Luckily, it's been growing and getting more interesting over the years. So
[00:02:09] Unknown:
me and data have grown together since I was 16. And do you remember how you first got involved with data management?
[00:02:14] Unknown:
So I've actually been building startups, like, 4 startups on data management after my first job on data. My first job was system integration, a long time ago. Data warehousing was pretty new, so it was mostly around plugging in cables and connecting things. But again, things changed dramatically. Over the years, I've built a few startups. I think the most famous one is Sisense. It's a business intelligence company. It's a big company now, a unicorn. I was there for 15 years, serving as the CTO of the company, and left 2 years ago to start Firebolt and focus on what I think is more interesting going forward, which is how and where the data is stored and computed.
[00:03:00] Unknown:
And so can you describe a bit more about what Firebolt is and your motivation for building it and turning it into a business?
[00:03:07] Unknown:
So, yes, I've been with high performance computing for my entire career. Sisense began as a database company, as an HPC company. We quickly realized that nobody would buy a database from us, so we pivoted the company into a BI tool that wraps the technology and allows people to build fast dashboards. Now this was 15 years ago. Over those 15 years, in memory computing eventually became something that cannot be used to store and hold all the data. And I came to realize that my passion is really with the database, the data warehouse, and less with the BI. So I decided 2 years ago that it's time to refocus and rethink how data warehouses work and tackle the one big problem that I see with data warehousing, which is efficiency and speed. So that's how I decided to start a new journey and go with my fourth startup, building a new data warehouse.
[00:04:04] Unknown:
The data warehouse space has been around for a number of years now. It's gone through a number of different evolutions, where it started off as big on premises appliances, or just using some of the existing relational database engines on a dedicated box, and has now moved to data lakes and cloud data warehouses. I'm wondering if you can give a bit of comparison of how Firebolt sits in that overall space, particularly in reference to some of the more recent versions of the data warehouse paradigm.
[00:04:34] Unknown:
Yes. So as you mentioned, there's an evolution going on. Data warehousing started offline, on premise. Then Redshift was the first time we moved the same concept to a managed environment on the cloud. And then came Athena and BigQuery, which gave us a serverless experience. We moved from managing resources to paying per data scan, and then came Snowflake and gave us elasticity, true elasticity, isolated elasticity, decoupling storage and compute. And that's where Firebolt starts. So Firebolt assumes elasticity to be a core feature of a data warehouse. You cannot build a new data warehouse without having that core isolated elasticity baked in.
But the mission of Firebolt, the challenge of Firebolt, was to get the same elasticity without losing speed and efficiency. For that, we had to change a lot of stuff, and we pretty much re-architected almost everything throughout the data value chain in the data warehouse. But evolution-wise, Firebolt starts where Snowflake ends.
[00:05:43] Unknown:
Because of the fact that you have such a strong emphasis on performance and reducing the amount of movement for the data that's necessary, what are some of the unique capabilities or use cases that it unlocks?
[00:05:57] Unknown:
So there are 2 big use cases. One is self-service in house analytics. So the typical dashboarding, BI tools, business analysts writing SQL, whether by hand or through the BI tool, and they want to run that SQL on bigger data, more granular data. So move from pulling data into the BI tool to running the BI tool directly on the data warehouse. And even though this has been discussed for many years, if you look at how data warehousing is actually applied today, you rarely see that. In many cases, you still have people moving data into the warehouse just to move it back out into some in memory system to get the speedup. So the first use case is self-service. The second use case is customer facing. So data serves your actual product, your actual business. When someone clicks on your web app, it runs an actual SQL query on an actual table and returns results in real time. So those are 2 extremes, but those 2 extremes are, at their core, dependent on speed and efficiency. With the first, it's more about what happens after the data has been downloaded from the lake or scanned from the lake, so everything that happens between your memory and your CPU.
And the second thing is mostly around how do I scan petabytes, hundreds of terabytes of data on my lake and support a high frequency, highly concurrent, low latency environment. So I would say on one extreme, you have the Redshifts and the Snowflakes doing the more traditional BI. And on the other extreme, you have the Athenas and the BigQuerys and the Druids of the world, which try to tackle real time ingest and aggregation for actual production level scenarios. So Firebolt looks at both extremes as just 2 ends of the same need. Data goes through different phases, used by different people, but being able to crunch the same data internally and then open up the data externally should happen in one product and in one experience. So Firebolt is about consolidating internal analytics,
[00:08:17] Unknown:
and customer facing apps into 1 solution. And 1 of the other big dichotomies between data warehouses and data lakes is the fact that data lakes are intended for being able to use a variety of different tools and computational methods on the underlying data. But at the same time, it increases the amount of complexity and the number of moving pieces necessary to be able to get a complete solution out of it. And so there has been this blurring between the data warehouse and the data lake with the advent of cloud data warehouses. So I'm wondering what the perspective is for Firebolt on where it lies on that continuum and the ability of people to be able to bring a variety of different tools to interact with the data that Firebolt is managing. So let's talk a bit about data lakes, data warehouses,
[00:09:07] Unknown:
a bit of the difference between them and why they both coexist today. So first, of course, when it comes to big data, there are always multiple perceptions or philosophies that are being taken to tackle data challenges. Mostly those approaches stem from who you are. So if you're a data-heavy data engineering team with a strong engineering background, and programming would be your first choice for anything, then you would favor interoperability. You would favor working with multiple databases, multiple tools, multiple environments. So you would want your data lake to be as open as possible, supporting multiple formats, etcetera, etcetera.
And then there's another thing. There's another market. There's another type of people who come more from the business aspect. They're more business analysts. And today it's very blurred, but this type of mindset says that we shouldn't work so hard to get the data to work for us. We should use tools that simplify, that abstract it, that allow me to ignore how the lake works and focus on writing SQL. And, of course, this is an evolution and we've seen a lot of products over the years, lots of different approaches. I would say the biggest giants today, each with its own philosophy, would be Databricks and Snowflake. So Databricks is more about engineering, a lot of tools, a lot of moving parts. You need to understand what merging is, you need to understand how Parquet works, and you manage resources, etcetera. Whereas Snowflake does a lot of that, but with SQL. So it's more of a kind of you rent the warehouse and you have everything much simpler, served for you. And I think that the beauty of Snowflake is that they proved to everyone that SQL is the way to operate data and that with the right platform, people who never imagined they could work with big data can now work with big data. Of course, Snowflake has its disadvantages, but huge credit goes to Snowflake. They, I think, changed the market and brought data warehousing, or SQL driven data warehousing, back into the discussion of how to build data driven solutions.
So ELT exists because of Redshift and Snowflake and BigQuery, I mean, people run their ELT on the warehouse or they run it on Spark. So I think the market will converge into those 2 extremes. There will be people who will still want to have this extremely open environment, and there will be people who want to have a more simplified environment in which you just, you know, get the work done, each with its own advantages. But I think mostly it's about who's using it, what's the technical background of the company. Now the biggest change we've seen over the last 2 years is that the first group, the engineering mindset, is starting to realize that they actually like using SQL to run complex data tasks. So it will be interesting to see in the upcoming years how that plays out, and I hope that Firebolt will help to push that forward and convince heavy engineering teams that, yes, it can be done without understanding how files are stored in Parquet, etcetera. Now in terms of the data lake, I think it's nothing new. I mean, data lakes have existed forever. We just called them by different names. In the beginning, we called them storage devices and shared storage devices, and then they evolved. And then we had Hadoop and we had HDFS.
But I think the biggest change, and the real thing that made the data lake what it is today, is S3. So to me, S3, and of course the equivalent of S3 on other cloud vendors, but S3 was the first storage infrastructure that gave us infinite scale. It's super simple, super cheap, which changed everything. And Snowflake proved to the world that S3 can be used as this default, standard storage mechanism. So the data lake today is mostly about S3 enabling the data lake to be what it is. And I am absolutely certain that S3 and the competitors of S3 will evolve and we'll see new amazing things, giving us more speed with more scale. But the thing for me with data lakes is that they solve problems that data warehouses were trying to solve before. So the data warehouse was mostly about warehousing.
Now with the data lake, warehousing is something we can decouple and let the cloud vendor do for us. So that's the first big change between data warehousing and data lakes. But if you look today at the modern cloud data warehouse, they all exploit the lake. They operate on the lake. At Firebolt, we call ourselves a data compute platform that speaks SQL on your lake, but that's just yet another definition for the ability to decouple storage and compute. And in multiple variations of that, we use the modern elasticity, resource isolation approach. So a data lake is here to stay. The data lake will allow us to scan a lot of data, but it's just storage. So it's just the beginning of the journey. Now the second thing that is very common with, and differentiates between, the lake and the data warehouse is the format, the data format. With the data lake, people mostly relate the lake to the ability to use open formats, even though, you know, the format is really not dependent on the lake, but still people perceive data lakes as, oh, this is the place where I store my Parquet and Avro and JSON files, depending on the need. But in reality, those file formats are extremely inefficient when it comes to read intensive analytics. When those formats were invented, interoperability between big data platforms was the bottleneck. So when Parquet came, it solved a big issue. But if you look today, I mean, no matter what data pipeline you build, no matter what technology you use, you always end up with some sort of a data warehouse running your queries. It can be Athena running directly on Parquet, or it can be Redshift, which uses its own proprietary format, or a BigQuery or a Snowflake.
But the thing that differentiates clearly between the warehouse and the lake is that the warehouse vendors use their own formats for a very, very good reason, because the warehouse is where your data ends up, and this is where you want to query your data. And query means read intensive, lots of users, and this is where we need to change the format and look at how data is stored on the lake differently. This is one of the big changes that Firebolt brings: we introduce a new format, and this format is sitting, you know, stored just like that on your lake, and we'll touch on that a bit later.
But the format, and everything on top of that, is designed for read intensive querying, versus interoperability and having that format support as many platforms as possible. Instead of using the format, we use SQL as our interoperability layer. And with SQL, we can support everything from BI to ELT to data science. So instead of focusing on open data formats, we focus on data formats that give us the edge we need for the problems that we want to solve. That's it in a nutshell on data lakes versus data warehouses. I think both will blend. And I think if you look at the data lake players, they already come out with their own compute engines, and they already realized that they will probably need fewer moving parts in their systems, because it's just evolving and we solve previous problems. We just don't need that anymore. So I think the data lake is very much like the HDFS
[00:17:25] Unknown:
period, where we loved it, we used it, but eventually it will just become another storage tier. Yeah. And there's a term that's come out recently of the data lakehouse, which makes it all sound very tranquil and pleasant. But
[00:17:39] Unknown:
And that's exactly it. I mean, I love marketing, and I love product marketing. And I think within our space, specifically in our space, it's actually needed, because there are so many different technologies and mindsets and ways to solve things. And we're always working on something that was built before, and you never have the privilege to restart everything. So being respectful to what's running now, and connecting to it and supporting it, is a necessity. You can't just start from scratch, and this is why the lake will continue to be there and this is why companies will use the lake, you know, within their products. But at the end of the day, it's all about abstracting your lake away. So, yes, a lakehouse is basically compute, or merging, or a file merge process that happens on your lake. Redshift and Snowflake just won't call it that. They'll just, you know, hide it somewhere internally. One of the other aspects of the difference between data warehouses and data lakes is in how you think about
[00:18:42] Unknown:
structuring and modeling your data. Because for some of the batch compute engines, or for things like Presto, you want to work with your data in a way that's partitioned so that you limit the amount of scanning that's necessary, and so that it's easier to be able to do compaction operations and, you know, insert tombstones for any deletions so that you can process those in your batch updates. But then in data warehouses, you want to structure it in a way to optimize for the query patterns of SQL and being able to join across dimensional tables and fact tables, and you've got data vaults or, you know, star schemas. So I'm wondering, because of the fact that Firebolt is so focused on extreme performance and very fast scans across large volumes of data, how that impacts the way that users should be thinking about modeling their data as it's being loaded into Firebolt and as they're processing it and
[00:19:42] Unknown:
structure, and they can coexist. So actually, with Firebolt, we fully support semi structured data, both on the storage and on the querying side. Now let's talk a bit about semi structured. Semi structured today exists in 2 variations. The first one is you have semi structured support, usually with array types, and the data warehouse vendor or the file format allows you to store primitive types as arrays within your table. The second thing is how you query those arrays. And this is where the big differences become more relevant. Most data warehouses today will force you to flatten the data. And this is why, in many cases, even though you have amazing array manipulation support, specifically in Presto, and we'll talk about that, people still flatten the data. They still explode the data, because it's just too slow. So even though you have arrays within your table, within your database, as a primitive type, you're still not using them to query in real time; you'll flatten them. And the flattening process removes the whole value of arrays. So Firebolt takes a different approach here. We want, you know, to enjoy both worlds. We want to support native array types and multi nested array types, so you can put arrays inside each other and nest them together.
But we don't want users to flatten the data. We want users to apply an array manipulation language, an extension to SQL which was taken from the Presto standard, but we want that to run extremely fast. If you do that, you get 2 advantages. One is you keep your table smaller, because you don't need to flatten it in advance. You have your table and you have some of your fields as arrays, but then when you query the data and you apply array manipulation within SQL, you want to execute that in a completely different way than you would with relational operators. And this is where you enter the domain of LLVM just in time compilation and similar stuff. This is why Firebolt, by the way, combines vectorization and LLVM just in time compilation to support different variations of SQL use cases, even though those would look to the user as just 2 things that are part of the same query. Now I haven't mentioned it before, but when it comes to Firebolt, there are 2 big things. There is storage and there's compute. And we've discussed elasticity. So with Firebolt, you have this concept of engines. Engines are isolated clusters, isolated compute resources. You can have one or many engines running concurrently on the same database. A database is just a logical representation of your schema. And within those engines, you have the local cache, which stores intermediate data so you don't have to scan data from S3, etcetera. But when it comes to storage, it's all about tuning. So indexing your data means that Firebolt, for the first time, orders your data when you ingest it. If we can order the data, if we can do that fast enough, then we can completely change the way we build query engines, change the way we build indexes, and change most of the way data warehouses work. Clustering and sorting and ordering has been a huge pain for data warehouses over the years, and even today, it's not actually working. Most companies will not apply clustering, because it's just reordering the data in the background and it costs huge amounts of money. It doesn't work. In order to make sorting and ordering of data work, you need to do that at ingest. And if you get that right, if you get your data coming in in a natural order and coming out in a sorted order, then amazing things start to happen.
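To make the array part of that concrete, here is a rough sketch of what querying nested data in place can look like. The schema is invented, the functions follow the Presto-style array dialect Eldad refers to, and Firebolt's actual type and function names may differ.

```sql
-- Hypothetical table: one row per session, with events kept as arrays
-- instead of being exploded into one row per event.
CREATE TABLE sessions (
    session_id  BIGINT,
    user_id     BIGINT,
    event_names ARRAY(TEXT),
    event_costs ARRAY(DOUBLE)
);

-- Presto-style array manipulation with lambdas, instead of flattening
-- the table first: count error events and sum costs per session.
SELECT
    session_id,
    CARDINALITY(FILTER(event_names, e -> e LIKE 'error%'))     AS error_events,
    REDUCE(event_costs, 0.0, (acc, c) -> acc + c, acc -> acc)  AS total_cost
FROM sessions;
```

The per-event detail stays inside the row, so the table stays compact and the lambdas can be executed by the vectorized/JIT query engine rather than paying for an explode-and-group-by.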
So Firebolt introduces a primary index on your table, meaning you can have one or a hundred columns defined as your primary index. This means 2 things. One is we sort the data by this compound index. So each chunk we generate, a file or a set of files we can call it, will be ordered by this index. Now as data gets streamed in and gets ordered and stored as ordered files and uploaded and committed to S3, we also index the data. Indexing the data means we can create sparse indexes, because the data is sorted. Those sparse indexes actually live within the instance itself, so they're in memory. So we have the data ordered on S3. You can stream in a petabyte of data, and it will be sorted by this primary index. On the other hand, having the data ordered allows us to, for the first time, apply sparse indexing in memory, close to the compute, so we can prune data. And pruning data on S3 is everything. If you look at Athena or BigQuery or any other data warehouse, none of them actually tackle the pruning problem as they should, for many reasons.
But pruning means I only download the data that is actually needed for the query. And you mentioned before that a lot of times when you work with the lake, you need to think about how you store data in the lake and how you place files and folders in partitions and do all of that. You don't do that with Firebolt. With Firebolt, you just define your primary index and you let Firebolt handle the rest for you. Now when you run your query on your terabyte or petabyte scale data, the query engine will use the sparse indexing to prune as much data as it can and download as little data as it can during the query, which gives you this amazing, crazy performance boost from the first second you start using the product. So if you do a copy paste of your SQL from another compute engine into Firebolt, the first thing you'll notice is the amount of gigabytes that were actually scanned for the query, which is between 1 and 2 orders of magnitude smaller than anything out there. So this is the first problem, but it's not the only problem. This problem tackles big data. How do I enjoy S3 but not have to download so much data from S3 just because I have 5 columns as predicates?
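As a sketch of what that looks like from the user's side, something along these lines; the DDL below approximates Firebolt's primary index syntax and the schema and values are invented, so treat it as illustrative rather than exact.

```sql
-- Hypothetical fact table, sorted on ingest by the compound primary index.
CREATE FACT TABLE page_views (
    event_date  DATE,
    customer_id BIGINT,
    url         TEXT,
    duration_ms BIGINT
) PRIMARY INDEX event_date, customer_id;

-- Because the chunks on S3 are ordered by (event_date, customer_id), the
-- in-memory sparse index lets the engine skip most chunks for a query that
-- filters on those columns, so only a small slice of the data is downloaded.
SELECT url, SUM(duration_ms) AS total_ms
FROM page_views
WHERE event_date BETWEEN '2020-11-01' AND '2020-11-07'
  AND customer_id = 42
GROUP BY url;
```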
So I want to do that efficiently, and this is more in the big data domain. The second thing is pure computation, and I love this space, because this is where it started. This is where your query kernel, your query engine lives. And this is where you need to design it differently. So vectorization, LLVM just in time compilation, encodings of data, applying super-scalar compression. All of that is needed because you want to run queries that are more BI centric. So having a star schema is perfectly fine. You want to have a star schema, because that's how you look at your business. Being limited by big data platforms that force you into having this denormalized, highly flattened table doesn't make sense anymore as big data becomes just the new normal. So you want to have a system that can do big data stuff, yet you want it to feel like it's just a simple, BI friendly, star schema friendly database.
So there are 2 universes here. The first one we talked about is storage and pruning and indexing and sparse indexing and sorting and merging in the background. But the second one has nothing to do with that. It's all about in memory computing and supporting joins and taking pretty much everything that was worked on in academia over the last 4 years and trying to implement as much as possible into Firebolt. So this is how we get our edge on speed. We tackle both sides of the problem, and therefore we can support both extreme use cases, self-service analytics versus customer facing, high velocity data platforms.
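In practice, the claim is that a conventional star schema query can be run as-is against the full data volume instead of maintaining a pre-joined, flattened table; the schema below is invented purely to illustrate the shape of such a query.

```sql
-- Invented star schema: a fact table joined to two dimension tables.
-- The point is that this runs directly, without denormalizing in advance.
SELECT
    g.country,
    c.plan_name,
    SUM(o.revenue) AS revenue
FROM fact_orders   AS o
JOIN dim_geography AS g ON o.geo_id      = g.geo_id
JOIN dim_customer  AS c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2020-01-01'
GROUP BY g.country, c.plan_name;
```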
[00:27:52] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS based monitoring and analytics platform for cloud scale infrastructure, applications, logs, and more. Datadog uses machine learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. And if you start a trial and install Datadog's agent, they'll send you a free t shirt. And can you dig a bit more into the underlying architecture of Firebolt? I know that you have, as you mentioned, a proprietary file format in the form of F3 to be able to handle some of these sparse indexes, accelerating the range scans and reducing the amount of data that gets processed. I'm just wondering if you can dig through some of that and some of the ways that the overall design has evolved since you first began working on it and as you dug deeper into the problem. Mhmm. So
[00:28:55] Unknown:
so yeah, I'll try to take us through the data journey. Maybe it will help to understand what's happening behind the scenes. We always start with either a lake with files in it or some kind of streaming system. It can be Kafka or it can be a set of log generating tools, but those are our only 2 inputs. We support big data file formats sitting in the lake, on S3, and Kafka or streaming platforms. This is how we start streaming data into Firebolt. When data gets streamed, there is a micro process going on behind the scenes. You need to launch a compute resource. Some call it a warehouse, some call it an engine, but basically it's a cluster of compute that has a local cache in it and executes your SQL. When you do an insert from your S3 bucket into a Firebolt table, you can create various types of tables, fact tables, dimension tables, all sorts of tables. But when you do that, we do a few things. We, of course, convert the raw format into our own format. We'll call those data chunks. A data chunk is a set of files. Unlike Parquet, we divide each column into its own separate file. This allows us more flexibility in modifying the data after you ingest it, so you don't need to reprocess files and you don't need to go through extensive, batchy, long running processes just because someone added a new column or just because someone wants to update a column. We want to have this flexibility even though it's sitting on the lake. So the first thing is decoupling columns and storing them as different files, but this data chunk also has everything it needs to be self descriptive. So the indexes that it has, the columns, the skipping indexes, the encoding and compression that is applied per column. Everything is within the file, and everything is within the file meta format sitting on the lake. The files are being generated all the time. The size of the file actually depends on the payload that was inserted into the database. So if I'm inserting 1 row every 10 minutes, then a file with 1 row will be generated. If I insert 10,000,000 rows every second, then we will generate a file with 10,000,000 rows. We will always try to commit the data as fast as possible, because we want to provide a real time experience. This is the beginning, but we all know that we cannot have those files stay as they are, because they might be efficient to store, but not efficient to read. So files go through an evolution in which they get merged with other files as we ingest new data. And this merging process is one of the core and fundamental concepts of the architecture. It does so many things, from optimizing ordered files and converting them into fewer, bigger files. Because our files are sorted, it makes perfect sense to merge sorted files and enjoy ordering and compression, so we get much better compression than you would get from Parquet, just as a side effect of sorting. The second thing is that merging does other things as well. It does dedupe. It does upserts. It does slowly changing dimensions. It does anything that relates to data updates, and data updates with big data are a bit different than traditional batchy updates.
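For orientation, the start of that journey can be pictured as something like the statements below; the external table options and names are approximations for illustration, not Firebolt's exact syntax.

```sql
-- Hypothetical external table pointing at raw Parquet files on the lake.
CREATE EXTERNAL TABLE ext_events (
    event_time  TIMESTAMP,
    user_id     BIGINT,
    payload     TEXT
)
URL = 's3://my-bucket/events/'      -- illustrative bucket
OBJECT_PATTERN = '*.parquet'
TYPE = (PARQUET);

-- Ingest: runs on an engine, converting the raw files into Firebolt's own
-- sorted, indexed data chunks as rows are written into the fact table.
INSERT INTO fact_events
SELECT event_time, user_id, payload
FROM ext_events;
```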
With big data, updates are happening all the time. Every day, we're getting old data that we need to modify, because that's how it is, slowly arriving data. So we've decided that aside from supporting GDPR and old school batchy updates and deletes, we want to innovate on upserts and updates as well, and use the fact that we're so capable of merging efficiently that we allow you to ingest data and do upserts throughout the stream without actually needing to apply a batchy side operation to do that. So if you want to dedupe data, you just define a dedupe index on the table. If you want to update data, you just define how you want to update the data when new rows come in. So everything is decorated within the table you define, and it's fit for real world big data, versus trying to take old school snapshot, locking updates and apply them on big data. Now there's another big thing: because data in most data warehouses is stored in a natural manner, a natural order, when you update stuff, you usually change much more than you actually need, because on the lake, data is immutable. So when you upload a file, commit a new file, but you want to change just 1 row in it, you actually need to write the whole file.
So people get confused. People get, frustrated. People, you know, they spend months designing their perfect parquet kind of, cluster structure, and then they just to kind of find out that every day they need to update data and they end up rewriting 80% of the data. Now when you order the data, data gets clustered in a much better way. So updating, you know, the 10% that you usually need to update every day gets updated in a much dense more dense and clustered way. So you it just, you know, as a side effect, update much less data. Now this is 1 thing, but there's another thing which is more complex, which is if you block the system when you update the data, then you ruin half of the potential value of the system. Because if updates are blocking and if updates are long running and and you need to if updates are expensive, then it breaks the real time or near real time experience on big data. So with Firebolt, we we wanted to tackle that. And the way we did it is actually solving it on the query engine. And the way it works is like this. When you update something, we would version the files. We would like we have an internal version mechanism that, you know, knows which file was updated and what. And the query engine, it knows how to compute on data even though it has duplicates in it. So dedupe, for example. I mean, dedupe is constantly happening, but you don't want dedupe to block your system. You don't want to dedupe in a batchy manner. You want dedupe to be just hidden behind the scenes and eventually merge it. So we move from low key updates to eventual consistency via merging of data tracks. Now eventual consistency is a problem because you can't work in, you know, if if you have eventual consistency of data, it removes a lot of critical use cases. Now to get consistency, the query engine knows how to work with non dedupe data. So even though you assigned dedupe on your table, you might have at a given point of time duplicates.
Why? Because the merging process hasn't yet come to the point where it actually merges those specific files and does the dedupe. Because, as I said, everything happens through the merge process. So you find yourself, on one hand, having duplicate rows. On the other hand, you defined your table to be deduped. So whenever you run your SQL, you want your data warehouse to automatically solve the problem for you, and that's exactly what we're doing. We are decoupling the eventual merging of files that applies the dedupe from the ability of the query engine to still work on duplicate data yet give you the correct results. And this gets more complicated when you provide extensive SQL. So doing distinct counts, doing means, doing averages, doing aggregations that you can't just naturally append to. You need to have a stateful aggregation mechanism.
So I hope I made the data journey a bit clearer. Let me just try to wrap it up quickly. Data streams in from a standard system, gets converted, gets sorted, gets indexed. Those indexes are sparse. They sit within the compute engine. And when you start working on that, pruning will make sure we download as little data as possible, and then the query engine is responsible for the rest. Dedupe, upserts, and everything else is done by the merge process. It's eventually consistent, yet the query engine will always make sure you get consistent results even though your lake might be behind and still have duplicate rows. Does that make sense? Did I explain it?
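The internals are Firebolt's own, but the behavior being described can be pictured with plain SQL: even when storage still holds several versions of a row because the background merge has not reached those files yet, the query layer answers as if only the newest version per key existed. The rewrite below is purely a conceptual illustration of that guarantee, not how Firebolt actually implements it.

```sql
-- Conceptual illustration only: raw storage may still contain multiple
-- versions of the same business key; queries should behave as if only
-- the latest version of each key were present.
SELECT order_id, status, amount
FROM (
    SELECT
        order_id, status, amount,
        ROW_NUMBER() OVER (
            PARTITION BY order_id          -- the declared dedupe key
            ORDER BY ingest_version DESC   -- newest file version wins
        ) AS rn
    FROM raw_orders
) latest
WHERE rn = 1;
```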
[00:37:56] Unknown:
Yeah. That was definitely a very good explanation, and you answered my question before I even got the chance to ask it, as far as being able to handle eventual consistency and the accuracy of results, particularly in the case of things like duplicates, where you might run the report now, as the data has already been merged and processed, but then you run it again in an hour after more information has been added and some potential duplicate rows have been created. And then making sure that those analytics and the end user reports are consistent across those different phases, just taking into account the new data and deleting the duplicates. So I think it's definitely interesting that you add that intelligence into the query engine to ensure that the eventual consistency of the underlying storage doesn't compromise the end user experience.
[00:38:42] Unknown:
Exactly. For us, eventual consistency is more about what the user sees, or the result of the SQL, and not what's sitting within your lake. I mean, that's just storage. It's just like a temporary folder of files, but it shouldn't matter to the user. The user should get consistency, and to do that, we need to move eventual consistency from the storage to the query part. Another aspect of the problem that you're working through, beyond just the capability of storing the data efficiently and processing it efficiently,
[00:39:12] Unknown:
is being able to orchestrate the cloud resources that are being used to actually perform all these operations. And I know that another aspect of your value proposition is that you transparently pass through the underlying cloud costs to the end user rather than adding any sort of markup. And so I'm curious what your experience has been in terms of being able to handle the automation of those resources and trying to manage the efficiency of the overall compute and the infrastructure required there as well as reducing the cost to the end user and some of the messaging and just the systems that are necessary to help support that aspect of the problem?
[00:39:51] Unknown:
Okay. So 2 things. One is critical, because when we thought about how we're going to sell the product and how we want to sell the product, we quickly realized that we have a core problem. As a company whose vision and mission in life is to provide efficiency and speed on data, charging per compute, per scan, per resource is just counterintuitive and it's actually hurting our mission. And I'll explain why. This is coming from someone who's been sitting on the other side of the river, building the technology that supports your queries. When you're a data warehouse vendor and you charge for compute, you're really selling resources. That's your business. Your business is to sell more resources, because with today's data warehousing, resources are the only way to get speed. So the pitch is always the same: we give you fine grained elasticity so you don't have to pay more, which is completely wrong, and I'll shortly explain why. But then again, you're paying a virtual credit, a virtual price that the data warehouse vendor invented, which hides the actual cost of what you're using. And if you are a company that relentlessly works on speed and efficiency, you end up one day asking yourself, am I hurting my business if I'm going to release a new skipping index that does string search a hundred times faster?
Why? Because, you know, 6 months ago, someone from China released a new research article about it. And you look at it and you decide, okay, I love it. I want to try it out. I want to implement it. So someone will ask, okay, so what will happen to our revenue? I mean, if this SQL which runs, you know, an hour a night now runs in 5 minutes instead of an hour, then I'm hurting my revenue. So there is a conflict, a hard conflict, when you want to provide speed and efficiency and you still charge per compute. So we decided early on, even though we didn't know how to charge, we decided we're not going to charge for resources.
With Firebolt, we don't care whether you launch a thousand engines, you know, a thousand nodes, or one. We will not make any revenue out of that. Now to enable that, we use the AWS Marketplace account concept. So as a customer, as an account owner in Firebolt, you will see each and every resource consumption on your AWS bill. And it will be at the price that you're paying AWS, and it already actually supports spot instances or any other performance enhancing feature that AWS provides. And you will pay the cost that AWS charges you, versus a virtual cost that we think you should pay. Now you may ask, okay, so how do we make money? I mean, if we allow people to do whatever they want with our product, which is true, how do we still make money? The way we do it is we look at the feature set. So instead of selling compute or data scan or anything like that, we say, okay, there's a big difference between someone who wants to have a customer facing, production ready deployment, running thousands of concurrent users, 24/7, and someone who wants to, you know, just replace Athena and get fast responses in their Tableau. There's a big difference between someone who's using self-service BI and wants to work on granular data and someone who wants to embed analytics within their product. So we charge based on the features you use, and we divide that into subscription plans. It's super simple. You have kind of a free version.
The free version means that we don't make money and you only pay for electricity. You only pay for the resources that you actually consume, up to the second. So it's as fine grained as AWS can be. If AWS tomorrow can go from a second to a nanosecond, so will we. The second thing is subscription. So standard is free. Then you have the basic tier, which is, I think, up to a terabyte, and it goes through a few levels. It's like, you know, buying a SaaS subscription, like buying Atlassian. It's as simple as that. Our goal is to get as many users as possible, to be a productivity tool on data. And we're less concerned with maximizing resources, or making money out of resources, or even growing our revenue just because you as a customer gathered more data. I mean, it's ridiculous. I think it's a really big problem with data warehouses, and it contradicts speed and efficiency.
Therefore, we've decided that we will not do that. We will decouple cost from compute, just as we decoupled storage from compute, and we'll have users pay a simple subscription price that is much, much more cost effective than anything they use today. And hopefully, people will actually have their data warehouses running, versus renting their data warehouse for an hour here or there, and actually start using the data warehouse on more use cases, at more scale, and most importantly, without constantly thinking about how much a query will cost them. People ask that before anything else today. We see it all the time. The way a typical POC goes is you have an Excel file, you have a hundred queries in it, and then you have different vendors, different engines, different warehouses, different, you know, t shirt sizes, whatever, you name it. And people log the cost and people try to manage who can run what, because it costs so much.
So this is kind of over. This is not going to happen with Firebolt, and users love it. Yeah. It's definitely helpful to have that transparent cost, because with Amazon in particular, it's
[00:46:00] Unknown:
virtually impossible to understand what you're going to pay when you first launch an instance. So just passing that through directly is, I think, useful, because then the user doesn't have to worry about what additional costs they're going to incur, or things like BigQuery where it's charged based on the amount of data that gets scanned at a sort of per query rate. It definitely, as you said, puts people in the situation of only going to use this when I absolutely have to, versus allowing people to explore and experiment and unlock potentially, you know, valuable insights that they might not otherwise
[00:46:32] Unknown:
achieve if they're just trying to reduce the amount of spend and just getting the bare minimum out of the query engine. Exactly. And, you know, if you look at how we build features on top of data, the typical person will tell you, listen, if I need a new feature, it takes me 2 weeks, it's done. But if I need a data driven feature, it will take me 8 months. Why? Because they prototype it on one system, which is more cost effective and more suited for ad hoc exploration. But then when I want to take it to production, I discover that I need to pretty much do it from scratch on different toolsets, a different mindset, different, new limitations.
So it's super frustrating, and that's how we look at it. We think that someone who's building data driven features, whether it's an engineer, a data scientist, or a business analyst, should be able to move from exploration and experimentation to production extremely fast with the same platform. And we think it's extremely valuable for trimming down those 8 months
[00:47:36] Unknown:
and this frustration of, you know, building data driven features. And on the point of development, it also brings in the question of ecosystem support, and what are the types of tools that somebody might use Firebolt to replace entirely, and what are the integration points for being able to embed Firebolt into their overall data platform?
[00:47:58] Unknown:
So in terms of interoperability, we're doing 2 things. One is we have our JDBC driver, which should power Tableau, Sisense, Looker, and all the leading BI vendors. On your question of what we replace, I would say from our experience so far, when it comes to internal analytics, it's mostly Redshift and Snowflake. When it comes to external analytics, it's mostly Athena, Elastic, and BigQuery. So those would be the typical platforms we would tap into, either side by side or replace. I mean, when you're a SaaS data warehouse and you live in the cloud, it's much easier to move between systems, and we usually start sitting alongside those vendors, but quickly, and hopefully, people start moving more and more use cases onto Firebolt, because it's just more efficient. And, you know, it's a bummer to wait 10 minutes for a query to return when it should have returned in a few seconds. So I would say interoperability is mostly about providing the SQL support. By the way, we are derived from Postgres, so this is our SQL standard. And I'm talking about TPC-DS, TPC-H level. We're going to release the benchmarks later this year. So people who are interested in TPC-H and TPC-DS, stay tuned. New performance benchmarks are coming out, and you'll see new peaks on performance; usually, we've seen that with GPUs.
Now we'll see that running on commodity hardware. Again, when you work on sorted data, everything changes, even speed. So, going further ahead, we're seeing stuff like DBT a lot. This is super interesting, and we're thinking more and more about deeper integration with those modeling tools. Think LookML, DBT, and there are a few others. We want people to be able to run those scripts completely within Firebolt, and we're working hard on doing that. For a data warehouse, it's super important, because you always tap into an existing ecosystem, and you always have people running their Tableau dashboards or live dashboards in Looker, and they don't want to rebuild everything. So part of the huge challenge with building a data warehouse is actually making sure that it won't take us 10 years to get to the interoperability that we want. So a lot of resources are put in, in advance, to have as much interoperability as we can, to plug into as many existing environments as we can. And usually people don't build on top of Firebolt from scratch. They usually copy paste their SQL or they change the connection in their Tableau, and this is how we look at it. I mean, Firebolt should be a continuation of what you're doing, with a much better experience, versus, if you want to get this speed, you need to rethink everything you're doing. So big no on rethinking
[00:50:56] Unknown:
and big yes on anti frustration software, which means being able to connect to existing stuff. And you mentioned that you built off of Postgres in terms of the SQL support, which seems like a natural choice. Because of your ability to work with nested and semi structured data, I'm curious what you have found to be necessary in terms of extensions or modifications to the SQL specification and the Postgres implementation, as far as any additional built in functions or extensions to the format, for making it natural for people to work with large volumes of data? Yeah. So yes. We started with Postgres as kind of the base,
[00:51:37] Unknown:
to support BI, star schemas, and more generic use cases. But then we implemented the Presto array manipulation dialect into the data warehouse, because we believe deeply that people who write SQL will start writing array, or semi structured, SQL as well, because once they realize how much they save and how much more efficient they can become by doing that, they just start doing it. So, you know, when you look today, people flatten, people use window functions, people use all sorts of stuff. But if they had array manipulation baked into SQL, they would tackle problems differently to get the speed and efficiency they need, versus building an ELT process just because they can't do complex stuff on arrays. And when I say complex, I mean doing multistage lambdas, you know, anything we know from Presto arrays. This is the first thing. So extending SQL to support semi structured data, yet having it behave in a relational manner, was the first critical thing. The second thing is more about managing and running your ELT. SQL is our only language. You can do everything with SQL in Firebolt. And when I say everything, it's not just DML, DDL, and SELECT statements, but it's also resource management.
You manage engines with SQL. So imagine that you have this ELT script. It can be, you know, multiline, with lots of stuff going on in there, but throughout the ELT script, which is purely in SQL, you want to modify resources. You're doing a big aggregation and you want strong compute for that, but then you're moving to simpler stuff and you want to change the compute. And you can do that in the script; when the script runs, it will modify resources dynamically to support that. So SQL is growing beyond declarative selects and beyond basic operations on metadata and data, and you're now able to actually decide which resources you want to do your insert with. So you can have an insert into a table and you say, okay,
this ingestion should actually run on a specific engine. By the way, I forgot to mention: with Firebolt, we have different types of engines. Why? Because it doesn't make sense to have one type of engine for both ingest and analytics. For ingest, I would use a wider engine with many more nodes to maximize network bandwidth on S3. I want to download petabytes of data, so I want as much network bandwidth as I can get. I also want much more local storage. But when it comes to analytics, shared memory is king. I want higher-tier nodes with more memory and more cores, because with high-performance computing we can do magic there. So when you launch your resource, you decide what type of resource it is, not just the size. It can be an ingestion engine, it can be an analytics engine, it can be a generic engine. We have an upcoming real-time streaming engine that is different, because it's more useful for real-time, high-frequency ingest use cases, which is different from the sporadic up and down of, you know, I'm launching an engine to do some queries.
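[Editor's note: as a sketch of what choosing the engine inside an ELT script could look like, here is a hypothetical example. The engine DDL, option names, and the USE ENGINE statement are invented for illustration and are not quoted from Firebolt's documentation.]

```sql
-- Hypothetical syntax: provision two differently shaped engines, then switch
-- between them mid-script depending on the phase of the work.
CREATE ENGINE ingest_engine    WITH TYPE = 'INGEST',    NODES = 16;  -- wide: maximize S3 bandwidth and local storage
CREATE ENGINE analytics_engine WITH TYPE = 'ANALYTICS', NODES = 4;   -- tall: more memory and cores per node

USE ENGINE ingest_engine;           -- heavy load phase
INSERT INTO fact_events
SELECT * FROM ext_raw_events;       -- ext_raw_events: a hypothetical external table over S3

USE ENGINE analytics_engine;        -- lighter aggregation phase
INSERT INTO daily_rollup
SELECT event_date, count(*) AS events
FROM fact_events
GROUP BY event_date;
```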
So we have different hardware, stitched together under the concept of engines, and this hardware is configured differently to support different use cases. Users can use SQL to change the actual resource they're using throughout the query, depending on the phase of the query. And that's very powerful, because you have one SQL script that not only contains what you want to do, but also how you want to do it and what resources you want to apply. So SQL is growing beyond what we knew from traditional DBMSs and traditional data warehousing, because now when you write SQL, you can actually control the resources you're using. Now of course, when you allow people to control resources, you also need to give them the ability to control who can do what. So think about a quota system versus just old-school user permissions and security. With quotas, you can define which queries should consume which resources. You can define which users can do what. You can say, I want a maximum amount of money spent for this dashboard, for this user, for this use case. So we're really starting to see how the people who actually work with the data can be responsible for the resources that they operate on. You don't have this centralized, cluttered place where people manage cost in Excel anymore, and you don't need to wait for permission from someone to allow you to ask the question.
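[Editor's note: the quota idea could be pictured as something like the following. This is a purely hypothetical sketch to illustrate the concept; the statement, option names, and defaults are invented and are not Firebolt's actual syntax or feature set.]

```sql
-- Hypothetical: cap spend and concurrency per user or workload directly in SQL.
CREATE QUOTA bi_dashboard_budget
  FOR USER 'looker_service'
  WITH MAX_MONTHLY_SPEND_USD  = 500,   -- budget ceiling for this dashboard's service user
       MAX_CONCURRENT_QUERIES = 10;    -- keep the dashboard from monopolizing the engine
```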
The system takes care of it, and the system provides these new, unique feature sets embedded within SQL to support it. So semi-structured data and the ability to control resources would be the two big changes that we introduced into SQL. And in your experience of building the Firebolt technology
[00:57:04] Unknown:
and the associated business, what have you found to be some of the most interesting or unexpected or challenging lessons learned in the
[00:57:11] Unknown:
process? I've never worked at a place which is not a startup, or my own startup. To me, my whole career feels like the bullets you described. With Firebolt, though, it's a bit different, because with Firebolt it's the first time that we're building after having built a unicorn before. We learned a lot with Sisense, and a lot of smart friends joined me at Firebolt with a lot of experience that we didn't have before. And when we look at Firebolt now, the speed at which we operate is completely different. With Sisense, we raised a few million dollars when we started, and with Firebolt, we've raised an order of magnitude more money, and we have built an engineering team that we're so proud of. The engineering team is located in six different countries.
Building a data warehouse today means you need to have experts in storage. You need to have experts in cloud. You need to have experts in SaaS. You need to have experts in query engines. There are so many different things you need to combine, because it's a very complex system. And when you take complex systems and you want to simplify them for users, it's just hard work. Constant, constant hard work, and ups and downs. That's what I'm used to, and I love it, and I'll never stop doing that. So yes, Firebolt is unique because we are more mature and we are more focused, and we have a lot of experience and we know what we want to solve. We're two years old: the first year was pure research, and the second year was actually starting the company in 2019 and hiring the 30 engineers, which will double soon. The company is 99% engineering today, because we believe that, especially in this period, with COVID and everything that's happening, everything is self-serve. You don't sell the data warehouse anymore. You sell the quick experience and you sell the solution to a problem. Everything is online.
Everything needs to be SaaS. Everything needs to be instant. So it's very challenging, but we love it, and we're looking forward to having Firebolt grow and tackle the problems that we think are relevant
[00:59:28] Unknown:
within our space. So Firebolt is definitely a very impressive system and a complex feat of engineering.
[00:59:35] Unknown:
But what are the cases when it's the wrong choice and somebody should be looking to a different technical solution for their problem? So there are many things we don't do well. I mean, we don't do most things well. We do data warehousing extremely well. So if you need an OLTP system to run your app, that's not us. If you need a key value store, even though we use a key value store, FoundationDB, extensively, we're not a key value store vendor. There are so many amazing key value stores out there that have productized their tech. So we're not that. We're a data warehouse, and we're a cloud native data warehouse. We cannot run in your on-premise environment. So if you need a hybrid solution, that's not us. We only do native cloud. It means you need to have your data on S3 today. We're still only on AWS. We actually love AWS, and we think that until we get AWS to perform perfectly, we're not going to move to other cloud vendors. We think that today, and for the foreseeable future, AWS is where true big data exists, versus kind of big data concepts. S3 is still the best big data storage environment, so we want to exploit anything that's unique about AWS and have that natively embedded within our product. So for us, it's not about supporting multiple cloud vendors.
It's about supporting AWS first, perfectly. So that's what you need to run Firebolt, and this is why you wouldn't use Firebolt if you don't have your data in AWS.
[01:01:20] Unknown:
Are there any other aspects of the work that you're doing on Firebolt, or the overall space of data warehouses and data lakes and storage engines, or the future direction that you're planning on taking the technology and business, that we didn't discuss that you'd like to cover before we close out the show? I think we covered
[01:01:37] Unknown:
everything. Almost everything, I think. Yeah. We did. I hope we did well.
[01:01:41] Unknown:
Yeah. It was definitely interesting to learn more about the product, so I thank you for that. And for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think that, so first, the data management
[01:02:03] Unknown:
ecosystem is doing well. We're solving real problems, and we're enabling amazing new products based on new experiences. So I have a lot of respect for a lot of vendors. And I think that given this slow-changing environment, where, as I said, it's just not easy to move from one thing to another, things are not that bad yet. If you talk to people today, they feel like it shouldn't be that complex to operate. So people want to move forward. People want to focus on new challenges and want to provide new experiences with their offerings. If you look at how things are being built on data today, it's very limited. Most advanced companies have reached their limit in terms of what they can provide in their product given their existing tech. So with Firebolt, we're showing customers that they can move forward. They can do more and more interesting things in their product, just because they have this new type of speed and efficiency that they never imagined they would have on a data warehouse. Historically, even though I complimented the market, I think the market is still overly complex, and it was doing well because uber-smart people operated it. But if you want to open up big data, if you want to have more people involved, you need to simplify, because most people are not engineers, and most people just want answers.
And to enable them, we need to have fewer tools. We need to have more consolidated environments. And I think that's where the market goes. I think that over the next 5 to 10 years, we'll see less interoperability on the stack. I mean, the ecosystem will always exist. ETL tools and BI, they will always be there. But on the data warehouse stack, on the data lake stack, people will start using fewer vendors and fewer products, and will favor products that do more for less versus products that just talk about scale and lakes. So I think the market is moving from the tech and the how to the what we want to do.
And we'll see more companies and products coming out that simplify and hide the complexity
[01:04:24] Unknown:
so people can move faster. Yeah, I definitely agree that that simplification is needed, and I thank you for the time that you took today to join me and discuss the work that you're doing with Firebolt. I appreciate all the time and effort you've put into helping to move the industry forward and simplify the experience of users trying to work with large volumes of data. So thank you again, and I hope you enjoy the rest of your day. Thank you for having me, and
[01:04:54] Unknown:
enjoy the rest of the day as well. And, good luck to us all with data.
[01:05:04] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Call for Contributions
Interview with Eldad Farkash: Introduction and Background
What is Firebolt?
Evolution of Data Warehousing
Firebolt's Unique Capabilities and Use Cases
Data Lakes vs Data Warehouses
Data Lake House Concept
Data Modeling and Performance
Firebolt's Storage and Compute Architecture
Handling Eventual Consistency
Orchestrating Cloud Resources and Cost Management
Interoperability and Ecosystem Support
Lessons Learned in Building Firebolt
When Firebolt is Not the Right Choice
Future Directions and Closing Thoughts