Summary
One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s infrastructure and designed to serve the needs of modern systems. Evan Weaver is the co-founder and CEO of Fauna and in this episode he explains the unique capabilities of Fauna, compares the consensus and transaction algorithm to that used in other NewSQL systems, and describes the ways that it allows for new application design patterns. One of the unique aspects of Fauna that is worth drawing attention to is the first class support for temporality that simplifies querying of historical states of the data. It is definitely worth a good look for anyone building a platform that needs a simple to manage data layer that will scale with your business.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
- Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Evan Weaver about FaunaDB, a modern operational data platform built for your cloud
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what FaunaDB is and how it got started?
- What are some of the main use cases that FaunaDB is targeting?
- How does it compare to some of the other global scale databases that have been built in recent years such as CockroachDB?
- Can you describe the architecture of FaunaDB and how it has evolved?
- The consensus and replication protocol in Fauna is intriguing. Can you talk through how it works?
- What are some of the edge cases that users should be aware of?
- How are conflicts managed in Fauna?
- What is the underlying storage layer?
- How is the query layer designed to allow for different query patterns and model representations?
- How does data modeling in Fauna compare to that of relational or document databases?
- Can you describe the query format?
- What are some of the common difficulties or points of confusion around interacting with data in Fauna?
- What are some application design patterns that are enabled by using Fauna as the storage layer?
- Given the ability to replicate globally, how do you mitigate latency when interacting with the database?
- What are some of the most interesting or unexpected ways that you have seen Fauna used?
- When is it the wrong choice?
- What have been some of the most interesting/unexpected/challenging aspects of building the Fauna database and company?
- What do you have in store for the future of Fauna?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Fauna
- Ruby on Rails
- CNET
- GitHub
- NoSQL
- Cassandra
- InnoDB
- Redis
- Memcached
- Timeseries
- Spanner Paper
- DynamoDB Paper
- Percolator
- ACID
- Calvin Protocol
- Daniel Abadi
- LINQ
- LSM Tree (Log-structured Merge-tree)
- Scala
- Change Data Capture
- GraphQL
- Fauna Query Language (FQL)
- CQL == Cassandra Query Language
- Object-Relational Databases
- LDAP == Lightweight Directory Access Protocol
- Auth0
- OLAP == Online Analytical Processing
- Jepsen distributed systems safety research
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud.
Go to dataengineeringpodcast.com/alluxio, that's A-L-L-U-X-I-O, today to learn more and to thank them for their support. And understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers' time, it lets your business users decide what data they want where.
Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1,000,000 in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that, you'll get access to the Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.
We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And please help other people find the show by leaving a review on iTunes and telling your friends and coworkers. Your host is Tobias Macey. And today, I'm interviewing Evan Weaver about FaunaDB, a modern operational data platform built for your cloud. So, Evan, could you start by introducing yourself?
[00:03:14] Unknown:
Great to talk to you, Tobias. I'm Evan Weaver. I'm CEO and one of the co-founders of Fauna.
[00:03:19] Unknown:
And do you remember how you first got involved in the area of data management?
[00:03:23] Unknown:
I do remember. So in grad school, I did a bunch of work in bioinformatics. Specifically, I worked on gene orthologs in chickens, as well as a plankton simulator. And for the gene project, I ended up using Rails because we needed a web interface to throw some data up. And that was sort of my first experience with web programming. It was my first time using a real database. I got super excited about Rails because of, like, the blog demo screencast thing, and then I spent a week trying to install Postgres. And from that point forward, I was basically doomed to spend the rest of my life paginating things and working on the data side of platforms.
After that project, I went and worked at CNET Networks and did Rails sites there. Specifically, I did Chow.com and UrbanBaby.com. UrbanBaby was a threaded real-time web chat for moms. So if you take away the "for moms", it kinda sounds like Twitter. Around the same time my team at CNET left to found GitHub, I left to go to Twitter as employee number 15.
[00:04:32] Unknown:
And so in the time that you spent at Twitter, you ended up dealing with a lot of different issues related to databases and storage and consistency. And after that, you went ahead and co-founded Fauna and released the FaunaDB product. So can you start by giving a bit of an explanation about what FaunaDB is and your motivation for starting it?
[00:04:56] Unknown:
Yeah. So at Twitter, I ended up running what we called the back-end infrastructure team. We built all the distributed storage for the core business objects. So that was tweets, timelines, users, social graph, image storage, the cache, probably some other storage that I forget. We also worked on performance. And Twitter was one of the last great consumer Internet companies that was built pre-cloud. Like, we were using colocated hardware. There wasn't any cloud native software. We had to do almost everything ourselves. And it's probably why there are a lot of great infrastructure startups that spun out of Twitter. And, essentially, when we were starting to try to scale Twitter up, we got involved in the NoSQL movement, in particular Cassandra.
And we hosted the first meetup. I wrote the first tutorial. We fixed the build, because people don't remember now, but it actually didn't compile when Facebook open sourced it. And we were hoping, basically, that Cassandra would develop into a global data platform that would be multipurpose, you know, reusable, flexible, productive, and that just didn't happen. And to scale Twitter, we ended up building point solutions where we would take, like, InnoDB or Redis or Memcache or some other local storage engine and put a sharding facade in front of it that managed replication and querying and transactionality and that kind of thing.
But because Twitter was under such extreme time constraints, we just never had the chance to build that truly reusable platform that we wanted to build. So that's basically Fauna. I spent four years at Twitter. When I left, a couple people ended up coming with me, and we spent about three years in consultancy mode exploring the data space, working on a bunch of other projects, trying to understand, you know, Twitter needed a social graph, but there's probably not a market for a social graph DB. Like, what do people really need as a general, comprehensive data platform? And then in 2016, we felt like we had a prototype down. We had an initial customer. We went and raised our seed round from CRV.
Then we raised our Series A in 2017 from Point72 and GV.
[00:07:10] Unknown:
And so you've been building this platform for a while. And looking at the technical documentation, it seems to be quite the feat of engineering. So I'm wondering, at the outset, what are some of the main use cases that FaunaDB is targeting that you found people were asking for in your exploration of what the data space was looking for and what it was lacking? What we found was really missing
[00:07:36] Unknown:
was that general-purpose, you know, safe and reliable cloud native database, so to speak. Like, we found a lot of people who said, you know, I looked at NoSQL, and I like the scale. I looked at SQL, and I like the flexibility of modeling. I can't get something that does both. That's the same experience we had at Twitter, and partly why we ended up building all these systems more or less from scratch. And we decided, you know, if we don't get this done, it's not gonna happen. Like, information science doesn't say that you can't build a transactional, global, high performance operational data store. But, you know, in practice, there's so much path dependence in software development that, at the time, everyone who had tried to do that had basically gotten diverted into some niche, like time series or click tracking behavioral data, that kind of stuff. Like, the NoSQL vendors had given up on transactionality, data correctness, safety, and were promoting a worse-is-better story.
And the SQL community had fallen back to, well, vertical scale is all you need. Global is impossible. Never use anything new. So we found that there's a segment of the market, though, that was just refusing to believe that this was as good as it was gonna get, and that's our market.
[00:09:07] Unknown:
And so in recent years, there have been a couple of other projects that came out to enable scaling of transactional workloads, across data centers and potentially globally, most notable of which being CockroachDB, which I know is based on the Spanner paper out of Google. So I'm wondering if you can just do a bit of comparison as to how Fauna compares to Cockroach or any of the other products that are available in the market that are offering these global scale transactional databases?
[00:09:38] Unknown:
Yeah. There aren't many entrants in the market because of the tremendous, you know, R&D burden. I mean, five years ago, people were saying that these systems were literally impossible. So it's kind of in the cold fusion territory. And the main thing that changed that was really Google Spanner. And the Spanner paper came out, similar to the Dynamo paper. The Dynamo paper said you can have total availability if you treat your data this or that way. The Spanner paper said you can have total transactionality if you relax your availability requirements to this minimal degree, which in practice is effectively totally available.
But Spanner had followed on Percolator, and, essentially, there were two models at the time for doing global transactional multi-partition consensus, like, really ACID. And Percolator was the first. It also came out of Google. What Percolator does is essentially scale up the primary replica model to data center scale, where instead of having a single machine that uses locks to coordinate all transactions, they essentially have what's called a timestamp oracle, which is more or less a lock server that can be individually scaled up. And every node in the data center has to talk to that guy to do any useful work. And that gives you data center scale reads and writes up until, like, the limits of that machine.
And it gives you global scale-out for stale reads, but it doesn't give you global writes. And Spanner came out and said, you know, we can use atomic clocks to synchronize the write path to replace the timestamp oracle. And then everyone realized that there actually were mechanisms to deliver global transactionality. But the problem with the Spanner model, obviously, is the clocks. And systems like Cockroach, for example, attempted to import Google's clock synchronization strategy into a public cloud environment where you don't actually have atomically synchronized clocks. And part of the reason that Spanner can pull this off is because the entire software stack is controlled end to end.
The network latency is known. The service latency is known. And, like, the implementation of every part of the transactional write path is very tightly latency controlled. No garbage collection stalls. No VM pauses. No nothing. Because if you drift out of that clock tolerance, you'll violate correctness, and you won't have any way to recover. You might have corrupt transactions, and you'll have nothing to do. There's no way to roll back. There's no way to even identify what got corrupted during the window, because you don't know if the clocks have drifted out of synchronization until after it's happened.
And for that reason, you know, systems like Cockroach took a bias towards only doing reads and writes on the tablet leader. So some of the global story isn't quite there. We were looking at this at the time, and we're like, well, we're building for the WAN. Like, our customers want this to be global. We want it to be global. We're not satisfied with the limitations of this clock-based architecture. And we took a look at the academic literature, and we found, in particular, there was one alternative at the time, and that was a protocol called Calvin that came out of Daniel Abadi's lab at Yale.
[00:13:09] Unknown:
And so you've been building the Fauna database on top of this Calvin protocol, and I know that you've also taken in some of the aspects of the Raft consensus algorithm. And so I'm wondering if you can talk a bit about how Fauna itself is architected to be able to achieve this global scale and transactional consistency, and just some of the overall consensus protocol and consensus management that you use to ensure this global availability of the data as well?
[00:13:43] Unknown:
Yeah. So what Calvin does is invert the synchronization model. Instead of using clocks to figure out when transactions occurred on the data replicas, it sends the transactions themselves to a shared log, which then essentially defines the order of time. These transactions in the shared log are then asynchronously replicated out to the individual replica nodes, very similar to a traditional NoSQL system. And that gives you a ton of advantages. So at the front end, sort of in the write path, you have a Raft cluster, which is sharded, partitioned, highly available, spans nodes, that's accepting these deterministically submitted transaction effects or intermediate representations, what have you. That thing has no single point of failure. It's global. It's multi data center.
Any node can commit to it within the same median latency regardless of how complex the transaction is. Then on the read side, you can have as many data centers as you please, tailing off this log in lockstep, applying the transaction effects locally to their local copy of the data. And that means that on the read side, you get a scale-out experience which doesn't require any coordination. So we can do snapshot reads from any data center with single-millisecond latency. Whereas on the write side, you know, the latency for a commit takes about one majority round trip through the log nodes, wherever they're configured to be across the data centers. So, you know, 100 to 200 milliseconds in a typical multi-continent cluster.
That's basically the best you can do in terms of balancing, like, maximizing availability without ever giving up the benefits of transactionality.
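To make the log-ordering idea concrete, here is a toy TypeScript sketch of Calvin-style replication. It illustrates the concept only, not Fauna's implementation: a shared log assigns every transaction a position, and each replica applies the log in the same order, so replicas agree without synchronized clocks.

```typescript
// Toy model of Calvin-style consensus: transactions are ordered by a shared
// log first, then applied deterministically by every replica.

type Txn = (db: Map<string, number>) => void;

class SharedLog {
  private entries: Txn[] = [];
  // In Calvin the log itself is a replicated consensus group (Fauna uses
  // Raft here); a single array stands in for it in this sketch.
  append(txn: Txn): number {
    this.entries.push(txn);
    return this.entries.length - 1; // position in the log defines "time"
  }
  read(from: number): Txn[] {
    return this.entries.slice(from);
  }
}

class Replica {
  private db = new Map<string, number>();
  private applied = 0;
  constructor(private log: SharedLog) {}
  // Replicas tail the log in lockstep; because every replica applies the
  // same transactions in the same order, they converge without clocks.
  catchUp(): void {
    for (const txn of this.log.read(this.applied)) {
      txn(this.db);
      this.applied++;
    }
  }
  get(key: string): number | undefined {
    this.catchUp();
    return this.db.get(key);
  }
}

const log = new SharedLog();
const us = new Replica(log);
const eu = new Replica(log);

// Both replicas observe the same serial order, whichever region committed.
log.append(db => db.set("balance", 100));
log.append(db => db.set("balance", (db.get("balance") ?? 0) - 30));
console.log(us.get("balance"), eu.get("balance")); // 70 70
```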
[00:15:36] Unknown:
And as far as consensus and consistency, I'm wondering what are some of the edge cases that could lead to data conflicts and how Fauna manages resolving or alerting on those conflicts?
[00:15:50] Unknown:
So Fauna offers a functional, expression-oriented relational language. It's very similar to LINQ in the way you compose your transactions. You're writing relational patterns in, you know, maps and folds and flat maps and that kind of thing. These transactions then get processed completely atomically, you know, with ACID, in the database itself. So it's just like working with any other SQL system, except it's not SQL. If you think you might conflict and you wanna take a read intent on some other value, you just write it into the transaction. It's not like something like Cassandra, for example, which can't express reads and writes within the same transaction. So in that sense, you don't have to do anything except describe what the, you know, business model, so to speak, of the transaction or the logic is supposed to be. We allow you to push down stored procedures, which we call functions. We allow you to build unique indexes, consume multiple indexes, create views, transform data, all of which is transactionally available.
And in particular, because Calvin has a logical global log instead of dropping down to individual leaders, like Raft leaders for partitions, FaunaDB offers strict serializability, or external consistency, just like Google Spanner, which is the highest possible consistency level. So there are no anomalies in Fauna's transactional consensus resolution. There are no index phantoms. There's no, you know, reversal of real time. There's no read skew or write skew. You just don't have to worry about it.
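As an illustration of what a read-write transaction looks like as a single FQL expression, here is a hedged sketch using the JavaScript driver. The accounts class, its fields, and the transfer logic are hypothetical; Let, If, Do, Update, and Abort are standard driver functions.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET! });

// Reads both balances and writes both updates in one atomic query; there is
// no separate begin/commit step and no client-side conflict handling.
async function transfer(fromId: string, toId: string, amount: number) {
  return client.query(
    q.Let(
      {
        from: q.Get(q.Ref(q.Class('accounts'), fromId)),
        to: q.Get(q.Ref(q.Class('accounts'), toId)),
        fromBalance: q.Select(['data', 'balance'], q.Var('from')),
      },
      q.If(
        q.GTE(q.Var('fromBalance'), amount),
        q.Do(
          q.Update(q.Ref(q.Class('accounts'), fromId), {
            data: { balance: q.Subtract(q.Var('fromBalance'), amount) },
          }),
          q.Update(q.Ref(q.Class('accounts'), toId), {
            data: {
              balance: q.Add(
                q.Select(['data', 'balance'], q.Var('to')),
                amount
              ),
            },
          })
        ),
        q.Abort('insufficient funds') // aborts the whole transaction
      )
    )
  );
}
```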
[00:17:41] Unknown:
And as far as the underlying storage layer and the data modeling that Fauna supports, I'm wondering if you can talk through how that's implemented,
[00:17:50] Unknown:
and specifically, for the multi-model capacity, how the query layer is designed to be able to allow for those different query patterns on the same underlying data. Yeah. The multi-model is something we're investing a lot in now. We should talk about that in a minute for sure. The underlying storage engine is an LSM tree. It's derived from Cassandra's original leveled implementation in Java. Fauna is written in Scala and Java primarily. It's not really very special. What's special is the temporality that Fauna layers on top. Because as part of the consistency model, as part of the FQL functional query language, we offer total access to the history of your data within the configured retention period. So you can run any query at a point in time.
You can create a change feed for any query between two points in time. You can get, you know, change data capture from indexes and tables and that kind of thing. And for data that has to be retained forever, you can configure it to do so. For data which is derived and where you only wanna retain the latest version, you can also configure it to do so. That gives us a ton of power, both in the language that's exposed to the end user and for Calvin, which relies on that history to make read and write intent checking more efficient.
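For example, a snapshot read and a change-history read might look like this with the JavaScript driver. The class and document are hypothetical, and Events() assumes a reasonably recent driver; older versions expressed history as a pagination option instead.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET! });

async function history() {
  const ref = q.Ref(q.Class('accounts'), '1234');

  // Snapshot read: wrap any expression in At() with a timestamp to run the
  // whole query against the database as it was at that moment.
  const snapshot = await client.query(
    q.At(q.Time('2019-01-01T00:00:00Z'), q.Get(ref))
  );

  // Change history: paginate the events on a document; the same works for
  // index sets, giving a change feed between two points in time.
  const events = await client.query(q.Paginate(q.Events(ref)));
  return { snapshot, events };
}
```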
[00:19:15] Unknown:
And so I'm wondering given the ability to interact with these different views of the same underlying data, how an application developer would approach data modeling, particularly in relation to a SQL oriented or a document oriented data store?
[00:19:34] Unknown:
So Fauna is a document-oriented data store. We call it a relational NoSQL platform, which means you insert documents and you build relations in the form of indexes on top of them. But one thing we've discovered as we've gone to market, working with our customer base, is that people want the operational power of the platform, but they also want easy integration with the languages they're currently using. So we've just announced our platform plan, as well as launched GraphQL as one of the first available languages on top of native Fauna. And our goal is to give people a completely transparent and native experience with these familiar languages, which will give them access to the underlying power of the platform.
So if you wanna go crazy and basically stay in power user mode, you can use FQL, which gives you transparent and direct access to all the semantics and functional and operational capabilities of the underlying platform, including QoS and security and temporality and all that kind of thing. But if you're just trying to build an app, what you get now is a series of basically best-of-breed standard languages for that modeling paradigm. So for CRUD, we now have GraphQL. And for key-value, we have CQL, which is Cassandra's native language. And we're also working on SQL for relational modeling, which will launch later this year. Then we'd like to also do a couple more data domains, in particular graph, which you can currently model directly in FQL, but we don't have a standard interface for.
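As a sketch of the GraphQL path, assuming a schema with a Post type and an allPosts query has already been imported (the endpoint URL follows Fauna's documented GraphQL service; the schema itself is hypothetical):

```typescript
// The same data exposed through Fauna's GraphQL API instead of FQL.
async function listPosts(secret: string) {
  const resp = await fetch('https://graphql.fauna.com/graphql', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${secret}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      query: '{ allPosts { data { title author } } }',
    }),
  });
  const { data, errors } = await resp.json();
  if (errors) throw new Error(JSON.stringify(errors));
  return data.allPosts.data;
}
```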
What we found is that people are super excited about this strategy, because they want that shared platform, especially because Fauna lets you access the same data from different APIs. But they just don't wanna deal with the learning curve upfront, which is understandable, because FQL is pretty unique, even though it's similar to LINQ. You know, you have to get a specific driver. You have to understand Fauna's native semantics, which are very powerful, but also, you know, not necessarily intuitive or familiar out of the box.
So I would say, you know, depending on what kind of app and data you're trying to model, grab one of those APIs now and go nuts. And as soon as you need more power, you can always get it by dropping down to what's effectively, at that point, an intermediate representation. Just like a compiler compiles, you know, a higher level language to a bytecode or some internal partial representation that is more explicit and gives you more control, we're now doing the same thing with Fauna. And that kinda moves us from the database that does one thing to the data platform. You kinda get, you know, effectively
[00:22:36] Unknown:
all of, you know, AWS or Google Cloud's operational data systems in a box. And in terms of people who are first getting started on working with Fauna and interacting with the FQL syntax, or starting to work with some of these higher level interfaces, I'm wondering what are some of the common points of confusion
[00:22:58] Unknown:
or surprise or edge cases that they run up against? 1 of the things we did early on was borrow a lot of terminology from the the object relational movement in the nineties. You know, Twitter engineering and us, you know, have a reputation for kind of doing our own thing, hell or high water. And even though object relational databases basically died, we still felt that those paradigms we we felt like those those patterns were more or less optimal for modern development practices. But, the jargon that we imported from them is a little weird. So, like, instead of tables or collections, you have classes. And instead of documents or rows, you have instances.
And another thing that's a little strange, I think, which we need to fix by using more conventional language: indexes in Fauna are equivalent to views. You can transform data. You can cover multiple terms. You can rank values. You can even write one index that indexes multiple collections. And these are kind of, you know, similar to a functional programming language, something like F#, you know, OCaml, Haskell, Scala, like Fauna is written in. These are super powerful, but also super abstract concepts. And I think it's been a little difficult for a lot of our users to wrap their heads around a paradigm which is so composable, but also not necessarily familiar from the practices that they've encountered before.
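A hedged sketch of the index-as-view idea with the JavaScript driver; the class, field names, and ranking are hypothetical.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET! });

async function createView() {
  // An index as a view: terms are what you look things up by, values are
  // what the index returns, optionally transformed or reverse-ranked.
  await client.query(
    q.CreateIndex({
      name: 'posts_by_author',
      source: q.Class('posts'),
      terms: [{ field: ['data', 'author'] }],
      // Rank newest-first by storing the document timestamp as a value.
      values: [{ field: ['ts'], reverse: true }, { field: ['ref'] }],
    })
  );

  // Reading through the view.
  return client.query(
    q.Paginate(q.Match(q.Index('posts_by_author'), 'evan'))
  );
}
```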
Then, at the same time, there are features which are totally new to the database landscape, like QoS management. Like, we run Fauna Serverless Cloud as a single global Fauna cluster. We use the built-in tenancy and QoS management to provision new accounts within it, because the database hierarchy is recursive like a file system. So you can have a database that has other databases that have databases within them. Each of those can have a priority. That priority is instantaneously scheduled at a subquery level by the query scheduler on each node. You can do things like consolidate a lot of different applications and different access patterns into the same physical cluster. Those features are very foreign to DBAs, to people building back-end applications, because they've never encountered them before.
And bitemporality is similar. There are very few production systems with capable, you know, bitemporality implementations. Most people's experience of change data capture is very low level. Like, in Postgres, you have to grab some third-party plugin thing that tries to sniff the binlog and look at the binary format. And if you fall behind, you can't catch up, because the binlog is gone. Having that kind of stuff highly available and transparent in the high-level, you know, programming language of the data system is just a surprise. And, you know, we're doing a lot of work on docs, on tutorials, as well as the new APIs. But I think a lot of people encounter that initial learning curve with the foreign concepts, from the object-relational paradigms and elsewhere,
and have trouble seeing through it to the underlying operational power. And
[00:26:13] Unknown:
as far as the types of use cases that Fauna is built for and the types of application design patterns that it enables, I'm wondering if there are any sort of unique architectures that it lends itself well to that would be impractical with a single-purpose database, whether it's a relational database or a NoSQL document store or something like that? Yeah. There are a lot.
[00:26:41] Unknown:
And you need to kinda adjust the way you view the database. Like, the traditional view of a database, even if it's a distributed system, is of a, you know, single-workload, kind of brittle, operationally heavy system that you just don't wanna touch. And Fauna just isn't like that. It's entirely self-managed, whether you're operating it yourself or using our serverless cloud. You can scale it in and out, up and down. Everything happens online. Everything happens without data corruption, without service interruption, with QoS management built in automatically. And you can adopt kind of a platform approach, similar to what you see in a large enterprise like some of our customers.
They'll have an internal compute platform that uses Kubernetes, or, what have you, DC/OS or some of the older, you know, orchestration and cluster management paradigms. But databases are still special snowflakes, and you have to kinda bring your thinking forward and think, you know, what if the database didn't have to be treated differently than my stateless capabilities? What if, you know, I could provision a database for every developer, for every staging environment, for every build? What if I could, you know, run analytics workloads against the production database by giving them a low-priority read-only key, that kind of thing. Really adopting that cloud native mentality for the data tier, especially the operational data tier, is just not something people are accustomed to doing. So we have to do a lot of education there, a lot of demoing, and a lot of communication to show that, no, it really is safe. It really does work.
And at the same time, having global transparent access to all your data with low latency also leads you into a different series of design choices for your applications. Because if you have, you know, an app which only lives in US East, you know, in AWS or what have you, like, say it's been in the original data center for a decade and, you know, there's weeds growing up and rain is dripping in the roof and all that kind of thing, you don't really see the benefit of global scale-out unless you start refactoring your app to also manage, you know, data center level failover. So if you're building a greenfield app and you build it totally stateless, for example, you're using a serverless framework, then you have an experience which is much more like running a CDN, but it magically has access to transactionally correct data under the hood instead of just caching.
But it's a little difficult to kinda enter that world
[00:29:16] Unknown:
from a legacy mindset or from a legacy app. Yeah. And I was curious about what you were saying as far as being able to run analytical workloads on Fauna because I know that it's primarily built for these transactional use cases. And then also to your point about being able to spin up different instances for preproduction environments or for developers to be able to experiment with. I'm curious if there is the ability to leverage either indexes or if there's any sort of fast copy mechanism for being able to populate those preproduction databases with either the entirety or some subset of the data that's stored in the production transactional data store?
[00:29:59] Unknown:
Yeah. Currently, there is not. That's something we've been asked about a lot and wanna get on the roadmap. Most, you know, testing datasets are relatively small, so copying them at the high level isn't a big deal. But that forking, branching model has been a request that we've gotten that we would like to enable long term. One thing you can absolutely do now is, you know, for read queries, you can give a different version of the app a read-only key and test it against the production data in a completely safe way. That's something which is not really practical to do in a traditional RDBMS or another NoSQL system where all you have is administrative access.
And there are similar things you can do at the user level with our RBAC system. You can create a security model which lets untrusted clients access public data, that lets users bootstrap themselves and, you know, own their own little sphere of the data world, directly from, you know, a single-page front-end app or a mobile app or some other embedded device, like an IoT device, without any intermediate, you know, proxy or security layer on top of the database itself.
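A minimal sketch of both patterns with the JavaScript driver, assuming an admin key; 'server-readonly' is one of Fauna's built-in roles, and the database name is hypothetical.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const admin = new faunadb.Client({ secret: process.env.FAUNA_ADMIN_SECRET! });

async function provision() {
  // A read-only key so a test build can safely query production data.
  const readOnly: any = await admin.query(
    q.CreateKey({ role: 'server-readonly' })
  );

  // A throwaway child database per developer or staging environment,
  // with its own scoped key.
  await admin.query(q.CreateDatabase({ name: 'staging-alice' }));
  const stagingKey: any = await admin.query(
    q.CreateKey({ database: q.Database('staging-alice'), role: 'server' })
  );

  // Hand these secrets to the respective environments.
  return { readOnly: readOnly.secret, staging: stagingKey.secret };
}
```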
[00:31:13] Unknown:
Yeah. And that was another thing that I was impressed by, is the level of granularity that you're able to offer in terms of the access controls. So I'm wondering if you can talk a bit more about the security model of Fauna, and how user management and just overall cluster security factor in, and what the administrative interface is for being able to manage all of that? Yeah. That's a good question.
[00:31:39] Unknown:
The operational management, like true ops, like adding a node, removing a node, adding a data center, happens through an admin tool which you run locally on any machine in the cluster. But everything above that level is self-hosted and part of the logical API. So, for example, schema records aren't themselves any different than data records. Like, an index is an instance, a document, within a database. A database is itself a document within another database, all the way up to the root of the tree. And at each level of the tree, you can also provision access keys. You can define exactly what they access with, you know, Fauna programming, essentially. You can write lambdas which are embedded in the key and control, you know, exactly what they're allowed to do and at what priority. And that model, because it's self-hosted, is pushed all the way down to the individual documents. So you can have a document which serves as an identity.
You can have a scheme, say username and password, which allows you to let, you know, untrusted devices create a new record without any intermediation, by setting whatever their password or secret is supposed to be. Or you can build a stateless service that delegates that identity to something else, like LDAP or Auth0 or whatever existing identity provider, Facebook, for example, if it's a mobile app, that you already have, and then issues back access keys that have the appropriate scope, the appropriate RBAC lambdas installed, that let you really push the security that you normally model in the app all the way down into the database. And that's super beneficial for two reasons. First, it's faster, because the database can process all this locally. You're not streaming back data that the user is not allowed to see. And second, you have a much stronger guarantee of fundamental security in your system. Because, especially in a microservices environment, and this applies kinda to transactionality too, if the database doesn't handle these concerns in their totality, the more you move to a serverless or a microservices environment, the more individual code bases you have trying to agree on these access patterns, which are very, you know, very nuanced. Your typical security hole comes from, you know, a bunch of well-meaning implementations which have somehow interacted in an unexpected way. So if you can push that down into the shared data tier, you know, integrate through your data just like it's 1982 and you have Oracle or something, you say the database is the bus.
Everything talks to the database. Everything uses, you know, stored procedures if you want, built-in RBAC, that kind of thing, to make sure that what we're doing is correct, safe, properly QoS managed. Then you get a tremendous amount of flexibility at the application tier, because you just don't have to worry about that level of concern anymore.
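A minimal sketch of that identity flow with the JavaScript driver; the users class, the unique users_by_email index, and the password are hypothetical.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const server = new faunadb.Client({ secret: process.env.FAUNA_SERVER_SECRET! });

async function identityDemo() {
  // A document that doubles as an identity: credentials are write-only
  // and are never readable back out of the database.
  await server.query(
    q.Create(q.Class('users'), {
      credentials: { password: 'hunter2' },
      data: { email: 'alice@example.com' },
    })
  );

  // A client exchanges the password for a scoped token, then talks to the
  // database directly, with no proxy or security layer in between.
  const session: any = await server.query(
    q.Login(q.Match(q.Index('users_by_email'), 'alice@example.com'), {
      password: 'hunter2',
    })
  );
  const userClient = new faunadb.Client({ secret: session.secret });
  // Queries through userClient run with that identity's permissions.
  return userClient;
}
```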
[00:34:42] Unknown:
And to your point about the user controls being just another record in the database, and being able to manage stored procedures, I'm wondering what the data types are natively in FaunaDB, what capacity there is for being able to create custom or higher order data types, and what level of support there is for being able to push some measure of application logic into the database in the form of stored procedures or custom function definitions?
[00:35:10] Unknown:
Yeah. That's a good question. And there's a really interesting implementation detail in Fauna around this that we don't really talk about. You know, in your typical, like, take MySQL. When you create a MySQL database or cluster, you have to set the collation. Well, what the hell is a collation? It's the order of ranking of individual data elements, based on what language you speak, what other types of data you expect to query, what kind of indexes you wanna expose. And it's very difficult, especially in, like, a multi-tenant environment, to say what the collation should be, and it's also very confusing. And what we did in Fauna is every data type has an ordinal position in what's essentially an infinite circle of all possible data elements.
So you can sort, for example, floats and strings together, and you'll get an order which makes sense. And we tend to be careful about, you know, some of the locale-based stuff because of the way it branches. But you can, for all intents and purposes, you know, sort most languages in a way that makes sense to the end user. And this means Fauna can be statically typed internally and process efficiently all the native data types: floats, strings, bytes, longs, arrays, maps, what have you. Every data element has a rank, a predictable rank, a rank that lets you order it, sort it, partition it, what have you. But at the same time, you can do whatever you please on top of that by taking advantage of that underlying order. So you can create essentially a struct which, you know, has multiple data elements in it. If you want your data to be ranked a different way, you can create an index which transforms it before it gets ranked and also includes the original value. And this is a break, actually, from the object-relational paradigm, because the object-relational paradigm is basically: you compile, like, a native data type. You install it into the database. You have to define all this stuff about how the database will interact with it and sort it and rank it. And you usually can't, you know, create a column, for example, which includes that data type but also elements of a different data type. You end up falling back to kinda, like, the VARCHAR MySQL model, where you're like, who knows what's in here? It's just a bunch of bytes. We learned, you know, through our own experience and working with customers and that kind of thing that people don't really wanna extend their database. They wanna model their application on top of a shared set of primitives.
So that's what we offer. We offer these native data types, but the language itself is, you know, dynamically typed, even though everything's stored statically internally. And we offer stored procedures, which we call functions, which let you push down lambdas into the database, written in FQL, which can compose, transform data, augment the language, but at a high level. Like, you're not compiling a Java jar to install. You're just writing a query. Once you like the query, you can give it a name and put it into the database. Obviously, that function is itself a Fauna document, so it's versioned. You can see how it's changed by going through the temporal history.
You can, you know, transactionally depend on it when you're doing other operations if you need to. And you really get a much more composable, kind of a tuplespace, experience where the database is a compute engine over data, which it makes transparently available to all the query processes, as opposed to, you know, having to think about it in a more legacy mindset.
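For example, a stored procedure round trip might look like this with the JavaScript driver; the function name and body are hypothetical.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET! });

async function functionDemo() {
  // A "function": an FQL lambda pushed down into the database under a name.
  // The function is itself a document, so its history is queryable too.
  await client.query(
    q.CreateFunction({
      name: 'double',
      body: q.Query(q.Lambda('x', q.Multiply(q.Var('x'), 2))),
    })
  );

  // Invoke it from any later query or transaction.
  const result = await client.query(q.Call(q.Function('double'), 21));
  console.log(result); // 42
}
```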
[00:38:37] Unknown:
I really like the temporal aspect of Fauna, of being able to automatically maintain versions of records, because, as you mentioned, it's not something that's generally built in as a first-class concern in database engines. You either have to, like you said, capture the write-ahead logs in a Postgres database for change data capture, or, if you want to be able to keep that information readily available, you have to implement some sort of history table, which has its own edge cases that you have to work around depending on how the data model changes over time. So the fact that it's built in as a first-class concern, and that it's something that is accessible without having to go through all of these additional back flips and, you know, additional tooling, it's definitely a very valuable addition.
And I'm curious, what are some of the other interesting features that are often overlooked or misunderstood, and any particularly interesting or unexpected ways that you've seen Fauna used?
[00:39:47] Unknown:
Temporality is definitely one. Like, we originally, you know, we came from Twitter, right? So we thought every app would have feeds in it, which by and large was correct. But it turns out, you know, people are also struggling with much more fundamental data issues, like, is their data correct and available? But one thing we've seen is kind of an augmentation pattern: you have a traditional database and you wanna keep history, but you don't wanna deal with logs, basically. So you can add a trigger into the application layer or into the database itself, which replicates the individual changes into Fauna. Then that can let you expose those changes not just, you know, locally, but globally, as feeds, as change data capture for data services in other data centers. That can let you, for example, span clouds.
If you have your app already built in a single data center in a single cloud, but you wanna start getting data, like, let me be more concrete. Say you have a legacy application which was built in us-east-1, it uses Postgres or something like that, and you're not gonna migrate it to Google Cloud, at least not out of the gate. But you wanna take advantage of some of the unique services in Google Cloud, like the machine learning capabilities, for example. What you can do is add a trigger, either in the application layer or in the database itself, to write changes to Fauna, use Fauna Serverless or operate it yourself, span that data into Google, and then have it locally accessible, either reading from the local Fauna to update an analytics system that has its own storage, or directly querying the local Fauna cluster from, you know, a Spark kind of service with the Fauna driver. That kinda inverts the database model and takes advantage of temporality to create a bus. But you don't just have change events. You actually get to query all of your data. So you can start moving some of the patterns which are well served by that legacy database into Fauna, for things like change feeds and change data capture and that kind of thing. You can also use the security model to lock down the canonical database more tightly, and then rely on Fauna's security model to expose that data to mobile apps, for example. So you can take a database which was built for a trusted deployment environment, on the web, with servers you own and manage, and essentially use Fauna as kind of a data CDN, and get all that relational power, that modeling power, that querying power, to make that data globally, publicly available to new views into the same underlying product.
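A hedged sketch of that restartable sync loop with the JavaScript driver; the index name and the sink are hypothetical, and the events/after pagination options are an assumption based on the older driver's pagination API.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET! });

// Restartable change-data-capture loop: read events since the last
// checkpoint, hand them to another system, then advance the checkpoint.
async function syncChanges(
  lastTs: number,
  sink: (event: unknown) => Promise<void>
): Promise<number> {
  const page: any = await client.query(
    q.Paginate(q.Match(q.Index('all_posts')), {
      events: true, // stream history, not current values
      after: lastTs, // resume from the previous checkpoint
    })
  );
  for (const event of page.data) {
    await sink(event); // e.g. write into BigQuery, Postgres, etc.
    lastTs = event.ts; // checkpoint advances with each applied event
  }
  return lastTs;
}
```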
[00:42:20] Unknown:
And what are some of the cases when Fauna is the wrong choice, and you would advocate for somebody to choose a different tool or a different platform? One thing we encounter,
[00:42:32] Unknown:
and to be clear, Fauna is becoming the wrong choice less over time, but one thing we encounter with some frequency, that will remain the case indefinitely, is time series use cases. A lot of people confuse time series and temporality because they both involve time. Fauna's temporality is really about storing history, the change history of mission-critical, like, user-generated content, stuff that's super important. And time series is really about data that individually doesn't matter and is only interesting in aggregate. So the operational characteristics people are looking for are wildly different. You know, they want a lot of analytics, roll-up, aggregation features that Fauna doesn't currently have. They want it to be cheap, to the point that they give up, you know, correctness, transactionality, even availability, in order to store as much data as fast as humanly possible, because the individual rows, you know, just don't really matter.
It's all about the aggregates. So that's something we've encountered with some frequency. And, you know, if you really wanna do time series, you should grab something like Influx, or potentially Cassandra, like a truly eventually consistent NoSQL time series database which is optimized for those patterns. And right now, for OLAP, for example, Fauna doesn't have native analytics capabilities, so you can query it from something like Spark with a Fauna driver. But if you really wanna do, you know, kind of an HTAP kind of scenario, the best thing to do is use Fauna's temporality to capture data into a relational database or a cloud analytics service, for example. And the temporality is nice for that, because you can do it in a restartable, transactionally correct way. And usually these systems don't have to be, you know, globally distributed. So if you have a global Fauna cluster and you wanna do HTAP, just spin up Postgres in one data center, you know, write a connector which will sync the data you care about into Postgres in soft real time, and you'll get that capability.
As we bring our own SQL capability online, these needs will diminish. But right now, those are kind of the two main areas where it doesn't really make sense. To be clear, we are building a general purpose data platform, so we're not opposed to eventually implementing pretty much everything you would want in these different data domains. But right now, the focus is CRUD, relational, document, graph, key-value.
[00:45:14] Unknown:
And in terms of your experience of building and growing the technical and business aspects of Fauna, I'm wondering what you have found to be some of the most interesting or unexpected or challenging lessons that you've encountered in the process.
[00:45:29] Unknown:
I think it would be safe to say, when we set out to build this project, we didn't realize we'd end up solving one of the hardest problems in applied computer science, in particular, global highly available ACID transactions. There was no industry version of Calvin before Fauna. There was only the prototype that was written for purposes of the paper. So the challenge of doing that certainly exceeded our expectations. I mean, most databases are working with a single consensus layer, if they have any consensus whatsoever. And some of the patterns for Raft are relatively laid down at this point, although many people botch their Raft implementations. But to add an additional novel consensus protocol on top of that, because we've extended Calvin in quite a few ways, in particular for performance and flexibility reasons, was a tremendous challenge for us. And we were gratified to finish our Jepsen analysis with Kyle Kingsbury recently, which kind of validated the entire architecture of what we set out to build, in the context of the academic literature and the history that came before, particularly Google Spanner and Percolator and that kind of thing. So on the technical side, that has by far exceeded, I think, the level of difficulty we initially assumed. That never stopped us before. It never stopped us at Twitter, so we still got it done. But I think that was a surprise. And then I think, you know, there's the usual stuff, which I think is common to technical cofounders, where, you know, I was a director at Twitter and managed a team of about 25 people, but managing a larger company, growing it from scratch, having an executive team and line managers and that kind of thing, has been a learning experience for me, because the people are just as hard as the technical aspects of the business.
And looking forward, what do you have in store for the future of Fauna, both from the business and the technical side? The biggest thing for us right now is these new APIs. Like, we've seen, you know, a lot of our cloud users already begin to implement GraphQL adapters for Fauna. So we're super excited to release the first-party native GraphQL interface and serve that market need more directly. We also have a lot of work to do on the SQL implementation, for example. And then, you know, some of the future interfaces we'll release beyond that. At the same time, you know, we're always improving performance. We're always improving the default consistency levels you get, pushing down latency even further.
There have been a bunch of operational improvements lately, which we're excited about, which will dramatically improve certain workloads. We're making it easier to use in different operational environments, like Kubernetes and different clouds and that kind of thing. And we also have locality control on the roadmap for this year, and we've started work on the ability to define, on a record-by-record basis, where your data lives in a single Fauna cluster. That makes kind of the shared services paradigm even more powerful, because you can lay out Fauna around the world. You can have it in 25 data centers, and you can have every individual application or logical database that's accessing that cluster decide on a row-by-row basis where it wants that data to live.
And that's good for compliance. It's good for management of, you know, replication costs. It's good for offering a shared service internally or in our own cloud and really, you know, pushing you to that edge CDN kind of data experience.
[00:49:11] Unknown:
And are there any other aspects of FaunaDB or the Fauna company that we didn't discuss yet that you'd like to cover before we close out the show? I mean, we're hiring.
[00:49:22] Unknown:
If you, you know, like to work on consensus algorithms, distributed systems, if you like worrying about what can go wrong rather than what can go right, a database company is a good place to be. In high school, I did some hobby stuff with electrical engineering, and I could just never, you know, quite get it together, because it was so unpredictable. All this analog stuff happening. And I ended up going into software for that reason. I found it to be a much more predictable environment. But then, of course, I doomed myself to having the same category of problems by working on distributed systems exclusively, which are, again, especially in the cloud, incredibly unpredictable, partially analog environments. You know, latency varies, nodes shut down, data disks get corrupted.
Like, if you're excited about solving those things, please talk to us.
[00:50:11] Unknown:
And for anybody who does wanna get in touch with you about that or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:50:31] Unknown:
The serverless edge experience. Like, we're pushing the granularity of application building down to literally nothing. Like, you had kind of a series of incremental paradigm shifts from physical servers to colocated or leased servers to virtualized servers to containers, but they're still all little servers. So, like, every thought you had, you had to mentally conceptualize it as being in a box. And it just doesn't make sense from a productivity perspective to think about software, especially distributed software, this way. Like, who cares how many functions can run within one container? I don't. I just wanna know if I have the aggregate capacity to execute the workloads my users are generating. And it requires a complete inversion of that abstraction, which we finally have now, for the most part, with serverless frameworks on the compute side. We've had it for a long time with CDNs on the caching side. But data, especially operational data, is always the last thing to move, because it's the riskiest. So you can now get, you know, some serverless analytics capability with things like Snowflake. But your canonical, operational, user-generated, mission-critical, you know, data, which is the existential underpinning of the business, still lives in, essentially, you know, a mainframe. And what we're trying to push with Fauna, and what the entire industry needs to push, is, you know, bringing this paradigm to its logical conclusion, which is: you should not have to care, or even know, as an application developer, how your data tier is operated. It should be completely orthogonal.
And at the same time, as an operator, you shouldn't have to care what your applications are doing. Like, the model of the DBA who has to, like, go in and, like, tune queries and make sure everything is safe to execute and fail over nodes to hot spares and stuff is an eighties model. And we need to move past that to an arm's-length, utility computing, serverless model where, if something's behaving badly, you know, in Fauna, for example, if an application is consuming too many resources, you lower its priority. You don't have to know what it's doing as an operator. And if you want global resources as a developer, you just provision a new database. You don't even have to think about where those data centers are located. And that's the experience we're closer to with serverless, and we're already there with CDNs. But data is just harder, because the quality bar is so astronomically high. Because, you know, I mean, the NoSQL movement was notorious for essentially killing businesses; Digg comes to mind, with their experience with Cassandra.
And people are smarter now, and they demand that their database vendors really do the work. But until the vendors do, like we're doing at Fauna, we're still gonna be stuck in that mainframe mindset.
[00:53:18] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you're doing at Fauna. It's definitely a very interesting project and seems to be quite the feat of engineering, and there are a lot of great technical resources that you've put up for people to be able to understand what it is that you're doing and working on. So I appreciate all of the work that you're doing on that front, and I hope you enjoy the rest of your day. Thanks. It was great to talk to you. Thanks for having me on the show.
Introduction to Evan Weaver and FaunaDB
Evan's Journey at Twitter and the Birth of FaunaDB
Main Use Cases and Market Needs for FaunaDB
Comparison with CockroachDB and Google's Spanner
FaunaDB's Architecture and Consensus Protocols
Data Modeling and Multimodal Capacity in FaunaDB
Common Points of Confusion and Learning Curve
Unique Architectures and Use Cases for FaunaDB
Analytical Workloads and Preproduction Environments
Security Model and Access Controls in FaunaDB
Data Types and Custom Functions in FaunaDB
Interesting Features and Unexpected Uses of FaunaDB
When FaunaDB is the Wrong Choice
Challenges and Lessons in Building FaunaDB
Future Plans for FaunaDB
Biggest Gap in Data Management Technology