Summary
One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s infrastructure and designed to serve the needs of modern systems. Evan Weaver is the co-founder and CEO of Fauna and in this episode he explains the unique capabilities of Fauna, compares the consensus and transaction algorithm to that used in other NewSQL systems, and describes the ways that it allows for new application design patterns. One of the unique aspects of Fauna that is worth drawing attention to is the first class support for temporality that simplifies querying of historical states of the data. It is definitely worth a good look for anyone building a platform that needs a simple to manage data layer that will scale with your business.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
- Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Evan Weaver about FaunaDB, a modern operational data platform built for your cloud
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what FaunaDB is and how it got started?
- What are some of the main use cases that FaunaDB is targeting?
- How does it compare to some of the other global scale databases that have been built in recent years such as CockroachDB?
- Can you describe the architecture of FaunaDB and how it has evolved?
- The consensus and replication protocol in Fauna is intriguing. Can you talk through how it works?
- What are some of the edge cases that users should be aware of?
- How are conflicts managed in Fauna?
- What is the underlying storage layer?
- How is the query layer designed to allow for different query patterns and model representations?
- How does data modeling in Fauna compare to that of relational or document databases?
- Can you describe the query format?
- What are some of the common difficulties or points of confusion around interacting with data in Fauna?
- What are some application design patterns that are enabled by using Fauna as the storage layer?
- Given the ability to replicate globally, how do you mitigate latency when interacting with the database?
- What are some of the most interesting or unexpected ways that you have seen Fauna used?
- When is it the wrong choice?
- What have been some of the most interesting/unexpected/challenging aspects of building the Fauna database and company?
- What do you have in store for the future of Fauna?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Fauna
- Ruby on Rails
- CNET
- GitHub
- NoSQL
- Cassandra
- InnoDB
- Redis
- Memcached
- Timeseries
- Spanner Paper
- DynamoDB Paper
- Percolator
- ACID
- Calvin Protocol
- Daniel Abadi
- LINQ
- LSM Tree (Log-structured Merge-tree)
- Scala
- Change Data Capture
- GraphQL
- Fauna Query Language (FQL)
- CQL == Cassandra Query Language
- Object-Relational Databases
- LDAP == Lightweight Directory Access Protocol
- Auth0
- OLAP == Online Analytical Processing
- Jepsen distributed systems safety research
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud.
Go to dataengineeringpodcast.com/alluxio, that's A-L-L-U-X-I-O, today to learn more and to thank them for their support. And understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers' time, it lets your business users decide what data they want where.
Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1,000,000 in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that, you'll get access to the Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.
We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And please help other people find the show by leaving a review on iTunes and telling your friends and coworkers. Your host is Tobias Macey. And today, I'm interviewing Evan Weaver about FaunaDB, a modern operational data platform built for your cloud. So, Evan, could you start by introducing yourself?
[00:03:14] Unknown:
Great to talk to you, Tobias. I'm Evan Weaver. I'm CEO and one of the co-founders of Fauna.
[00:03:19] Unknown:
And do you remember how you first got involved in the area of data management?
[00:03:23] Unknown:
I do remember. So in grad school, I did a bunch of work in bioinformatics. Specifically, I worked on gene orthologs in chickens, as well as a plankton simulator. And for the gene project, I ended up using Rails because we needed a web interface to throw some data up. And that was sort of my first experience with web programming. It was my first time using a real database. I got super excited about Rails because of, like, the blog demo screencast thing, and then I spent a week trying to install Postgres. And from that point forward, I was basically doomed to spend the rest of my life paginating things and working on the data side of platforms.
After that project, I went and worked at CNET Networks and did Rails sites there. Specifically, I did Chow.com and UrbanBaby.com. UrbanBaby was a threaded real-time web chat for moms. So if you take away the "for moms", it kinda sounds like Twitter. Around the same time my team at CNET left to found GitHub, I left to go to Twitter as employee number 15.
[00:04:32] Unknown:
And so in the time that you spent at Twitter, you ended up dealing with a lot of different issues related to databases and storage and consistency. And after that, you went ahead and co-founded Fauna and released the FaunaDB product. So can you start by giving a bit of an explanation about what FaunaDB is and your motivation for starting it?
[00:04:56] Unknown:
Yeah. So at Twitter, I ended up running what we called the back-end infrastructure team. We built all the distributed storage for the core business objects. So that was tweets, timelines, users, social graph, image storage, the cache, probably some other storage that I forget. We also worked on performance. And Twitter was one of the last great consumer Internet companies that was built pre-cloud. Like, we were using colocated hardware. There wasn't any cloud native software. We had to do almost everything ourselves. And it's probably why there are a lot of great infrastructure startups that spun out of Twitter. And, essentially, when we were starting to try to scale Twitter up, we got involved in the NoSQL movement, in particular Cassandra.
And we hosted the first meetup. I wrote the first tutorial. We fixed the build, because people don't remember now, but it actually didn't compile when Facebook open sourced it. And we were hoping, basically, that Cassandra would develop into a global data platform that would be multipurpose, you know, reusable, flexible, productive, and that just didn't happen. And to scale Twitter, we ended up building point solutions where we would take, like, InnoDB or Redis or Memcache or some other local storage engine and put a sharding facade in front of it that managed replication and querying and transactionality and that kind of thing.
But because Twitter was under such extreme time constraints, we just never had the chance to build that truly reusable platform that we wanted to build. So that's basically Fauna. I spent four years at Twitter. When I left, a couple people ended up coming with me, and we spent about three years in consultancy mode exploring the data space, working on a bunch of other projects, trying to understand, you know, Twitter needed a social graph, but there's probably not a market for a social graph DB. Like, what do people really need as a general, comprehensive data platform? And then in 2016, we felt like we had a prototype down. We had an initial customer. We went and raised our seed round from CRV.
Then we raised our Series A in 2017 from Point72 and GV.
[00:07:10] Unknown:
And so you've been building this platform for a while. And looking at the technical documentation, it seems to be quite the feat of engineering. So I'm wondering, at the outset, what are some of the main use cases that FaunaDB is targeting that you found people were asking for in your exploration of what the data space was looking for and what it was lacking? What we found was really missing
[00:07:36] Unknown:
was that general-purpose, you know, safe and reliable cloud native database, so to speak. Like, we found a lot of people who said, you know, I looked at NoSQL, and I like the scale. I looked at SQL, and I like the flexibility of modeling. I can't get something that does both. That's the same experience we had at Twitter, and partly why we ended up building all these systems more or less from scratch. And we decided, you know, if we don't get this done, it's not gonna happen. Like, information science doesn't say that you can't build a transactional, global, high performance operational data store. But, you know, in practice, there's so much path dependence in software development that, at the time, everyone who had tried to do that had basically gotten diverted into some niche, like time series or click tracking behavioral data, that kind of stuff. Like, the NoSQL vendors had given up on transactionality, data correctness, safety, and were promoting a worse-is-better story.
And the SQL community had fallen back to, well, vertical scale is all you need. Global is impossible. Never use anything new. So we found that there's a segment of the market, though, that was just refusing to believe that this was as good as it was gonna get, and that's our market.
[00:09:07] Unknown:
And so in recent years, there have been a couple of other projects that came out to enable scaling of transactional workloads, across data centers and potentially globally, most notable of which being CockroachDB, which I know is based on the Spanner paper out of Google. So I'm wondering if you can just do a bit of comparison as to how Fauna compares to Cockroach or any of the other products that are available in the market that are offering these global scale transactional databases?
[00:09:38] Unknown:
Yeah. There aren't many entrants in the market because of the tremendous, you know, R&D burden. I mean, five years ago, people were saying that these systems were literally impossible. So it's kind of in the cold fusion territory. And the main thing that changed that was really Google Spanner. And the Spanner paper came out, similar to the Dynamo paper. The Dynamo paper said you can have total availability if you treat your data this or that way. The Spanner paper said you can have total transactionality if you relax your availability requirements to this minimal degree, which in practice is effectively totally available.
But Spanner had followed on Percolator, and, essentially, there were two models at the time for doing global transactional multi-partition consensus, like, really ACID. And Percolator was the first. It also came out of Google. What Percolator does is essentially scale up the primary replica model to data center scale, where instead of having a single machine that uses locks to coordinate all transactions, they essentially have what's called a timestamp oracle, which is more or less a lock server that can be individually scaled up. And every node in the data center has to talk to that guy to do any useful work. And that gives you data center scale reads and writes up until, like, the limits of that machine.
And it gives you global scale-out for stale reads, but it doesn't give you global writes. And Spanner came out and said, you know, we can use atomic clocks to synchronize the write path to replace the timestamp oracle. And then everyone realized that there actually were mechanisms to deliver global transactionality. But the problem with the Spanner model, obviously, is the clocks. And systems like Cockroach, for example, attempted to import Google's clock synchronization strategy into a public cloud environment where you don't actually have atomically synchronized clocks. And part of the reason that Spanner can pull this off is because the entire software stack is controlled end to end.
The network latency is known. The service latency is known. And, like, the implementation of every part of the transactional write path is very tightly latency controlled. No garbage collection stalls. No VM pauses. No nothing. Because if you drift out of that clock tolerance, you'll violate correctness, and you won't have any way to recover. You might have corrupt transactions, and you'll have nothing to do. There's no way to roll back. There's no way to even identify what got corrupted during the window, because you don't know if the clocks have drifted out of synchronization until after it's happened.
And for that reason, you know, systems like Cockroach took a bias towards only doing reads and writes on the tablet leader. So some of the global story isn't quite there. We were looking at this at the time, and we're like, well, we're building for the WAN. Like, our customers want this to be global. We want it to be global. We're not satisfied with the limitations of this clock-based architecture. And we took a look at the academic literature, and we found, in particular, there was one alternative at the time, and that was a protocol called Calvin that came out of Daniel Abadi's lab at Yale.
[00:13:09] Unknown:
And so you've been building the Fauna database on top of this Calvin protocol, and I know that you've also taken in some of the aspects of the Raft consensus algorithm. And so I'm wondering if you can talk a bit about how Fauna itself is architected to be able to achieve this global scale and transactional consistency, and just some of the overall consensus protocol and consensus management that you use to ensure this global availability of the data as well?
[00:13:43] Unknown:
Yeah. So what Calvin does is invert the synchronization model. Instead of using clocks to figure out when transactions occurred on the data replicas, it sends the transactions themselves to a shared log, which then essentially defines the order of time. These transactions in the shared log are then asynchronously replicated out to the individual replica nodes, very similar to a traditional NoSQL system. And that gives you a ton of advantages. So at the front end, sort of in the write path, you have a Raft cluster, which is sharded, partitioned, highly available, spans nodes, that's accepting these deterministically submitted transaction effects or intermediate representations, what have you. That thing has no single point of failure. It's global. It's multi data center.
Any node can commit to it within the same median latency regardless of how complex the transaction is. Then on the read side, you can have as many data centers as you please, tailing off this log in lockstep, applying the transaction effects locally to their local copy of the data. And that means that on the read side, you get a scale-out experience which doesn't require any coordination. So we can do snapshot reads from any data center with single-millisecond latency. Whereas on the write side, you know, the latency for a commit takes about one majority round trip through the log nodes, wherever they're configured to be across the data centers. So, you know, 100 to 200 milliseconds in a typical multi-continent cluster.
That's basically the best you can do in terms of balancing, like, maximizing availability without ever giving up the benefits of transactionality.
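To make the log-ordering idea concrete, here is a toy TypeScript sketch of Calvin-style replication. It illustrates the concept only, not Fauna's implementation: a shared log assigns every transaction a position, and each replica applies the log in the same order, so replicas agree without synchronized clocks.

```typescript
// Toy model of Calvin-style consensus: transactions are ordered by a shared
// log first, then applied deterministically by every replica.

type Txn = (db: Map<string, number>) => void;

class SharedLog {
  private entries: Txn[] = [];
  // In Calvin the log itself is a replicated consensus group (Fauna uses
  // Raft here); a single array stands in for it in this sketch.
  append(txn: Txn): number {
    this.entries.push(txn);
    return this.entries.length - 1; // position in the log defines "time"
  }
  read(from: number): Txn[] {
    return this.entries.slice(from);
  }
}

class Replica {
  private db = new Map<string, number>();
  private applied = 0;
  constructor(private log: SharedLog) {}
  // Replicas tail the log in lockstep; because every replica applies the
  // same transactions in the same order, they converge without clocks.
  catchUp(): void {
    for (const txn of this.log.read(this.applied)) {
      txn(this.db);
      this.applied++;
    }
  }
  get(key: string): number | undefined {
    this.catchUp();
    return this.db.get(key);
  }
}

const log = new SharedLog();
const us = new Replica(log);
const eu = new Replica(log);

// Both replicas observe the same serial order, whichever region committed.
log.append(db => db.set("balance", 100));
log.append(db => db.set("balance", (db.get("balance") ?? 0) - 30));
console.log(us.get("balance"), eu.get("balance")); // 70 70
```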
[00:15:36] Unknown:
And as far as consensus and consistency, I'm wondering what are some of the edge cases that could lead to data conflicts and how Fauna manages resolving or alerting on those conflicts?
[00:15:50] Unknown:
So Fauna offers a functional, expression-oriented relational language. It's very similar to LINQ in the way you compose your transactions. You're writing relational patterns in, you know, maps and folds and flat maps and that kind of thing. These transactions then get processed completely atomically, you know, with ACID, in the database itself. So it's just like working with any other SQL system, except it's not SQL. If you think you might conflict and you wanna take a read intent on some other value, you just write it into the transaction. It's not like something like Cassandra, for example, which can't express reads and writes within the same transaction. So in that sense, you don't have to do anything except describe what the, you know, business model, so to speak, of the transaction or the logic is supposed to be. We allow you to push down stored procedures, which we call functions. We allow you to build unique indexes, consume multiple indexes, create views, transform data, all of which is transactionally available.
And in particular, because Calvin has a logical global log instead of dropping down to individual leaders, like Raft leaders for partitions, FaunaDB offers strict serializability, or external consistency, just like Google Spanner, which is the highest possible consistency level. So there are no anomalies in Fauna's transactional consensus resolution. There are no index phantoms. There's no, you know, reversal of real time. There's no read skew or write skew. You just don't have to worry about it.
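As an illustration of what a read-write transaction looks like as a single FQL expression, here is a hedged sketch using the JavaScript driver. The accounts class, its fields, and the transfer logic are hypothetical; Let, If, Do, Update, and Abort are standard driver functions.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET! });

// Reads both balances and writes both updates in one atomic query; there is
// no separate begin/commit step and no client-side conflict handling.
async function transfer(fromId: string, toId: string, amount: number) {
  return client.query(
    q.Let(
      {
        from: q.Get(q.Ref(q.Class('accounts'), fromId)),
        to: q.Get(q.Ref(q.Class('accounts'), toId)),
        fromBalance: q.Select(['data', 'balance'], q.Var('from')),
      },
      q.If(
        q.GTE(q.Var('fromBalance'), amount),
        q.Do(
          q.Update(q.Ref(q.Class('accounts'), fromId), {
            data: { balance: q.Subtract(q.Var('fromBalance'), amount) },
          }),
          q.Update(q.Ref(q.Class('accounts'), toId), {
            data: {
              balance: q.Add(
                q.Select(['data', 'balance'], q.Var('to')),
                amount
              ),
            },
          })
        ),
        q.Abort('insufficient funds') // aborts the whole transaction
      )
    )
  );
}
```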
[00:17:41] Unknown:
And as far as the underlying storage layer and the data modeling that Fauna supports, I'm wondering if you can talk through how that's implemented,
[00:17:50] Unknown:
and specifically, for the multi-model capacity, how the query layer is designed to be able to allow for those different query patterns on the same underlying data. Yeah. The multi-model is something we're investing a lot in now. We should talk about that in a minute for sure. The underlying storage engine is an LSM tree. It's derived from Cassandra's original leveled implementation in Java. Fauna is written in Scala and Java primarily. It's not really very special. What's special is the temporality that Fauna layers on top. Because as part of the consistency model, as part of the FQL functional query language, we offer total access to the history of your data within the configured retention period. So you can run any query at a point in time.
You can create a change feed for any query between two points in time. You can get, you know, change data capture from indexes and tables and that kind of thing. And for data that has to be retained forever, you can configure it to do so. For data which is derived and where you only wanna retain the latest version, you can also configure it to do so. That gives us a ton of power, both in the language that's exposed to the end user and for Calvin, which relies on that history to make read and write intent checking more efficient.
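For example, a snapshot read and a change-history read might look like this with the JavaScript driver. The class and document are hypothetical, and Events() assumes a reasonably recent driver; older versions expressed history as a pagination option instead.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET! });

async function history() {
  const ref = q.Ref(q.Class('accounts'), '1234');

  // Snapshot read: wrap any expression in At() with a timestamp to run the
  // whole query against the database as it was at that moment.
  const snapshot = await client.query(
    q.At(q.Time('2019-01-01T00:00:00Z'), q.Get(ref))
  );

  // Change history: paginate the events on a document; the same works for
  // index sets, giving a change feed between two points in time.
  const events = await client.query(q.Paginate(q.Events(ref)));
  return { snapshot, events };
}
```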
[00:19:15] Unknown:
And so I'm wondering given the ability to interact with these different views of the same underlying data, how an application developer would approach data modeling, particularly in relation to a SQL oriented or a document oriented data store?
[00:19:34] Unknown:
So Fauna is a document-oriented data store. We call it a relational NoSQL platform, which means you insert documents and you build relations in the form of indexes on top of them. But one thing we've discovered as we've gone to market, working with our customer base, is that people want the operational power of the platform, but they also want easy integration with the languages they're currently using. So we've just announced our platform plan, as well as launched GraphQL as one of the first available languages on top of native Fauna. And our goal is to give people a completely transparent and native experience with these familiar languages, which will give them access to the underlying power of the platform.
So if you wanna go crazy and basically stay in power user mode, you can use FQL, which gives you transparent and direct access to all the semantics and functional and operational capabilities of the underlying platform, including QoS and security and temporality and all that kind of thing. But if you're just trying to build an app, what you get now is a series of basically best-of-breed standard languages for that modeling paradigm. So for CRUD, we now have GraphQL. And for key-value, we have CQL, which is Cassandra's native language. And we're also working on SQL for relational modeling, which will launch later this year. Then we'd like to also do a couple more data domains, in particular graph, which you can currently model directly in FQL, but we don't have a standard interface for.
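As a sketch of the GraphQL path, assuming a schema with a Post type and an allPosts query has already been imported (the endpoint URL follows Fauna's documented GraphQL service; the schema itself is hypothetical):

```typescript
// The same data exposed through Fauna's GraphQL API instead of FQL.
async function listPosts(secret: string) {
  const resp = await fetch('https://graphql.fauna.com/graphql', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${secret}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      query: '{ allPosts { data { title author } } }',
    }),
  });
  const { data, errors } = await resp.json();
  if (errors) throw new Error(JSON.stringify(errors));
  return data.allPosts.data;
}
```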
What we found is that people are super excited about this strategy, because they want that shared platform, especially because Fauna lets you access the same data from different APIs. But they just don't wanna deal with the learning curve upfront, which is understandable, because FQL is pretty unique, even though it's similar to LINQ. You know, you have to get a specific driver. You have to understand Fauna's native semantics, which are very powerful, but also, you know, not necessarily intuitive or familiar out of the box.
So I would say, you know, depending on what kind of app and data you're trying to model, grab one of those APIs now and go nuts. And as soon as you need more power, you can always get it by dropping down to what's effectively, at that point, an intermediate representation. Just like a compiler compiles, you know, a higher level language to a bytecode or some internal partial representation that is more explicit and gives you more control, we're now doing the same thing with Fauna. And that kinda moves us from the database that does one thing to the data platform. You kinda get, you know, effectively
[00:22:36] Unknown:
all of, you know, AWS or Google Cloud's operational data systems in a box. And in terms of people who are first getting started on working with Fauna and interacting with the FQL syntax, or starting to work with some of these higher level interfaces, I'm wondering what are some of the common points of confusion
[00:22:58] Unknown:
or surprise or edge cases that they run up against? 1 of the things we did early on was borrow a lot of terminology from the the object relational movement in the nineties. You know, Twitter engineering and us, you know, have a reputation for kind of doing our own thing, hell or high water. And even though object relational databases basically died, we still felt that those paradigms we we felt like those those patterns were more or less optimal for modern development practices. But, the jargon that we imported from them is a little weird. So, like, instead of tables or collections, you have classes. And instead of documents or rows, you have instances.
And another thing that's a little strange, I think, which we need to fix by using more conventional language: indexes in Fauna are equivalent to views. You can transform data. You can cover multiple terms. You can rank values. You can even write one index that indexes multiple collections. And these are kind of, you know, similar to a functional programming language, something like F#, you know, OCaml, Haskell, Scala, like Fauna is written in. These are super powerful, but also super abstract concepts. And I think it's been a little difficult for a lot of our users to wrap their heads around a paradigm which is so composable, but also not necessarily familiar from the practices that they've encountered before.
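A hedged sketch of the index-as-view idea with the JavaScript driver; the class, field names, and ranking are hypothetical.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET! });

async function createView() {
  // An index as a view: terms are what you look things up by, values are
  // what the index returns, optionally transformed or reverse-ranked.
  await client.query(
    q.CreateIndex({
      name: 'posts_by_author',
      source: q.Class('posts'),
      terms: [{ field: ['data', 'author'] }],
      // Rank newest-first by storing the document timestamp as a value.
      values: [{ field: ['ts'], reverse: true }, { field: ['ref'] }],
    })
  );

  // Reading through the view.
  return client.query(
    q.Paginate(q.Match(q.Index('posts_by_author'), 'evan'))
  );
}
```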
Then, at the same time, there are features which are totally new to the database landscape, like QoS management. Like, we run Fauna Serverless Cloud as a single global Fauna cluster. We use the built-in tenancy and QoS management to provision new accounts within it, because the database hierarchy is recursive like a file system. So you can have a database that has other databases that have databases within them. Each of those can have a priority. That priority is instantaneously scheduled at a subquery level by the query scheduler on each node. You can do things like consolidate a lot of different applications and different access patterns into the same physical cluster. Those features are very foreign to DBAs, to people building back-end applications, because they've never encountered them before.
And bitemporality is similar. There are very few production systems with capable, you know, bitemporality implementations. Most people's experience of change data capture is very low level. Like, in Postgres, you have to grab some third-party plugin thing that tries to sniff the binlog and look at the binary format. And if you fall behind, you can't catch up, because the binlog is gone. Having that kind of stuff highly available and transparent in the high-level, you know, programming language of the data system is just a surprise. And, you know, we're doing a lot of work on docs, on tutorials, as well as the new APIs. But I think a lot of people encounter that initial learning curve with the foreign concepts, from the object-relational paradigms and elsewhere,
and have trouble seeing through it to the underlying operational power. And
[00:26:13] Unknown:
as far as the types of use cases that Fauna is built for and the types of application design patterns that it enables, I'm wondering if there are any sort of unique architectures that it lends itself well to that would be impractical with a single-purpose database, whether it's a relational database or a NoSQL document store or something like that? Yeah. There are a lot.
[00:26:41] Unknown:
And you need to kinda adjust the way you view the database. Like, the traditional view of a database, even if it's a distributed system, is of a, you know, single-workload, kind of brittle, operationally heavy system that you just don't wanna touch. And Fauna just isn't like that. It's entirely self-managed, whether you're operating it yourself or using our serverless cloud. You can scale it in and out, up and down. Everything happens online. Everything happens without data corruption, without service interruption, with QoS management built in automatically. And you can adopt kind of a platform approach, similar to what you see in a large enterprise like some of our customers.
They'll have an internal compute platform that uses Kubernetes, or, what have you, DC/OS or some of the older, you know, orchestration and cluster management paradigms. But databases are still special snowflakes, and you have to kinda bring your thinking forward and think, you know, what if the database didn't have to be treated differently than my stateless capabilities? What if, you know, I could provision a database for every developer, for every staging environment, for every build? What if I could, you know, run analytics workloads against the production database by giving them a low-priority read-only key, that kind of thing. Really adopting that cloud native mentality for the data tier, especially the operational data tier, is just not something people are accustomed to doing. So we have to do a lot of education there, a lot of demoing, and a lot of communication to show that, no, it really is safe. It really does work.
And at the same time, having global transparent access to all your data with low latency also leads you into a different series of design choices for your applications. Because if you have, you know, an app which only lives in US East, you know, in AWS or what have you, like, say it's been in the original data center for a decade and, you know, there's weeds growing up and rain is dripping in the roof and all that kind of thing, you don't really see the benefit of global scale-out unless you start refactoring your app to also manage, you know, data center level failover. So if you're building a greenfield app and you build it totally stateless, for example, you're using a serverless framework, then you have an experience which is much more like running a CDN, but it magically has access to transactionally correct data under the hood instead of just caching.
But it's a little difficult to kinda enter that world
[00:29:16] Unknown:
from a legacy mindset or from a legacy app. Yeah. And I was curious about what you were saying as far as being able to run analytical workloads on Fauna because I know that it's primarily built for these transactional use cases. And then also to your point about being able to spin up different instances for preproduction environments or for developers to be able to experiment with. I'm curious if there is the ability to leverage either indexes or if there's any sort of fast copy mechanism for being able to populate those preproduction databases with either the entirety or some subset of the data that's stored in the production transactional data store?
[00:29:59] Unknown:
Yeah. Currently, there is not. That's something we've been asked about a lot and wanna get on the roadmap. Most, you know, testing datasets are relatively small, so copying them at the high level isn't a big deal. But that forking, branching model has been a request that we've gotten that we would like to enable long term. One thing you can absolutely do now is, you know, for read queries, you can give a different version of the app a read-only key and test it against the production data in a completely safe way. That's something which is not really practical to do in a traditional RDBMS or another NoSQL system where all you have is administrative access.
And there are similar things you can do at the user level with our RBAC system. You can create a security model which lets untrusted clients access public data, that lets users bootstrap themselves and, you know, own their own little sphere of the data world, directly from, you know, a single-page front-end app or a mobile app or some other embedded device, like an IoT device, without any intermediate, you know, proxy or security layer on top of the database itself.
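A minimal sketch of both patterns with the JavaScript driver, assuming an admin key; 'server-readonly' is one of Fauna's built-in roles, and the database name is hypothetical.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const admin = new faunadb.Client({ secret: process.env.FAUNA_ADMIN_SECRET! });

async function provision() {
  // A read-only key so a test build can safely query production data.
  const readOnly: any = await admin.query(
    q.CreateKey({ role: 'server-readonly' })
  );

  // A throwaway child database per developer or staging environment,
  // with its own scoped key.
  await admin.query(q.CreateDatabase({ name: 'staging-alice' }));
  const stagingKey: any = await admin.query(
    q.CreateKey({ database: q.Database('staging-alice'), role: 'server' })
  );

  // Hand these secrets to the respective environments.
  return { readOnly: readOnly.secret, staging: stagingKey.secret };
}
```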
[00:31:13] Unknown:
Yeah. And that was another thing that I was impressed by, is the level of granularity that you're able to offer in terms of the access controls. So I'm wondering if you can talk a bit more about the security model of Fauna, and how user management and just overall cluster security factor in, and what the administrative interface is for being able to manage all of that? Yeah. That's a good question.
[00:31:39] Unknown:
The operational management, like true ops, like adding a node, removing a node, adding a data center, happens through an admin tool which you run locally on any machine in the cluster. But everything above that level is self-hosted and part of the logical API. So, for example, schema records aren't themselves any different than data records. Like, an index is an instance, a document, within a database. A database is itself a document within another database, all the way up to the root of the tree. And at each level of the tree, you can also provision access keys. You can define exactly what they access with, you know, Fauna programming, essentially. You can write lambdas which are embedded in the key and control, you know, exactly what they're allowed to do and at what priority. And that model, because it's self-hosted, is pushed all the way down to the individual documents. So you can have a document which serves as an identity.
You can have a scheme, say username and password, which allows you to let, you know, untrusted devices create a new record without any intermediation, by setting whatever their password or secret is supposed to be. Or you can build a stateless service that delegates that identity to something else, like LDAP or Auth0 or whatever existing identity provider, Facebook, for example, if it's a mobile app, that you already have, and then issues back access keys that have the appropriate scope, the appropriate RBAC lambdas installed, that let you really push the security that you normally model in the app all the way down into the database. And that's super beneficial for two reasons. First, it's faster, because the database can process all this locally. You're not streaming back data that the user is not allowed to see. And second, you have a much stronger guarantee of fundamental security in your system. Because, especially in a microservices environment, and this applies kinda to transactionality too, if the database doesn't handle these concerns in their totality, the more you move to a serverless or a microservices environment, the more individual code bases you have trying to agree on these access patterns, which are very, you know, very nuanced. Your typical security hole comes from, you know, a bunch of well-meaning implementations which have somehow interacted in an unexpected way. So if you can push that down into the shared data tier, you know, integrate through your data just like it's 1982 and you have Oracle or something, you say the database is the bus.
Everything talks to the database. Everything uses, you know, stored procedures if you want, built-in RBAC, that kind of thing, to make sure that what we're doing is correct, safe, properly QoS managed. Then you get a tremendous amount of flexibility at the application tier, because you just don't have to worry about that level of concern anymore.
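A minimal sketch of that identity flow with the JavaScript driver; the users class, the unique users_by_email index, and the password are hypothetical.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const server = new faunadb.Client({ secret: process.env.FAUNA_SERVER_SECRET! });

async function identityDemo() {
  // A document that doubles as an identity: credentials are write-only
  // and are never readable back out of the database.
  await server.query(
    q.Create(q.Class('users'), {
      credentials: { password: 'hunter2' },
      data: { email: 'alice@example.com' },
    })
  );

  // A client exchanges the password for a scoped token, then talks to the
  // database directly, with no proxy or security layer in between.
  const session: any = await server.query(
    q.Login(q.Match(q.Index('users_by_email'), 'alice@example.com'), {
      password: 'hunter2',
    })
  );
  const userClient = new faunadb.Client({ secret: session.secret });
  // Queries through userClient run with that identity's permissions.
  return userClient;
}
```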
[00:34:42] Unknown:
And to your point about the user controls being just another record in the database, and being able to manage stored procedures, I'm wondering what the data types are natively in FaunaDB, what capacity there is for being able to create custom or higher order data types, and what level of support there is for being able to push some measure of application logic into the database in the form of stored procedures or custom function definitions?
[00:35:10] Unknown:
Yeah. That's a good question. And there's a really interesting implementation detail in Fauna around this that we don't really talk about. You know, in your typical, like, take MySQL. When you create a MySQL database or cluster, you have to set the collation. Well, what the hell is a collation? It's the order of ranking of individual data elements, based on what language you speak, what other types of data you expect to query, what kind of indexes you wanna expose. And it's very difficult, especially in, like, a multi-tenant environment, to say what the collation should be, and it's also very confusing. And what we did in Fauna is every data type has an ordinal position in what's essentially an infinite circle of all possible data elements.
So you can sort, for example, floats and strings together, and you'll get an order which makes sense. And we tend to be careful about, you know, some of the locale-based stuff because of the way it branches. But you can, for all intents and purposes, you know, sort most languages in a way that makes sense to the end user. And this means Fauna can be statically typed internally and process efficiently all the native data types: floats, strings, bytes, longs, arrays, maps, what have you. Every data element has a rank, a predictable rank, a rank that lets you order it, sort it, partition it, what have you. But at the same time, you can do whatever you please on top of that by taking advantage of that underlying order. So you can create essentially a struct which, you know, has multiple data elements in it. If you want your data to be ranked a different way, you can create an index which transforms it before it gets ranked and also includes the original value. And this is a break, actually, from the object-relational paradigm, because the object-relational paradigm is basically: you compile, like, a native data type. You install it into the database. You have to define all this stuff about how the database will interact with it and sort it and rank it. And you usually can't, you know, create a column, for example, which includes that data type but also elements of a different data type. You end up falling back to kinda, like, the VARCHAR MySQL model, where you're like, who knows what's in here? It's just a bunch of bytes. We learned, you know, through our own experience and working with customers and that kind of thing that people don't really wanna extend their database. They wanna model their application on top of a shared set of primitives.
So that's what we offer. We offer these native data types, but the language itself is, you know, dynamically typed, even though everything's stored statically internally. And we offer stored procedures, which we call functions, which let you push down lambdas into the database, written in FQL, which can compose, transform data, augment the language, but at a high level. Like, you're not compiling a Java jar to install. You're just writing a query. Once you like the query, you can give it a name and put it into the database. Obviously, that function is itself a Fauna document, so it's versioned. You can see how it's changed by going through the temporal history.
You can, you know, transactionally depend on it when you're doing other operations if you need to. And you really get a much more composable, kind of a tuplespace, experience where the database is a compute engine over data, which it makes transparently available to all the query processes, as opposed to, you know, having to think about it in a more legacy mindset.
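For example, a stored procedure round trip might look like this with the JavaScript driver; the function name and body are hypothetical.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET! });

async function functionDemo() {
  // A "function": an FQL lambda pushed down into the database under a name.
  // The function is itself a document, so its history is queryable too.
  await client.query(
    q.CreateFunction({
      name: 'double',
      body: q.Query(q.Lambda('x', q.Multiply(q.Var('x'), 2))),
    })
  );

  // Invoke it from any later query or transaction.
  const result = await client.query(q.Call(q.Function('double'), 21));
  console.log(result); // 42
}
```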
[00:38:37] Unknown:
I really like the temporal aspect of Fauna, of being able to automatically maintain versions of records, because, as you mentioned, it's not something that's generally built in as a first-class concern in database engines. You either have to, like you said, capture the write-ahead logs in a Postgres database for change data capture, or, if you want to be able to keep that information readily available, you have to implement some sort of history table, which has its own edge cases that you have to work around depending on how the data model changes over time. So the fact that it's built in as a first-class concern, and that it's something that is accessible without having to go through all of these additional back flips and, you know, additional tooling, it's definitely a very valuable addition.
And I'm curious, what are some of the other interesting features that are often overlooked or misunderstood, and any particularly interesting or unexpected ways that you've seen Fauna used?
[00:39:47] Unknown:
Temporality is definitely one. Like, we originally, you know, we came from Twitter, right? So we thought every app would have feeds in it, which by and large was correct. But it turns out, you know, people are also struggling with much more fundamental data issues, like, is their data correct and available? But one thing we've seen is kind of an augmentation pattern: you have a traditional database and you wanna keep history, but you don't wanna deal with logs, basically. So you can add a trigger into the application layer or into the database itself, which replicates the individual changes into Fauna. Then that can let you expose those changes not just, you know, locally, but globally, as feeds, as change data capture for data services in other data centers. That can let you, for example, span clouds.
If you have your app already built in a single data center in a single cloud, but you wanna start getting data, like, let me be more concrete. Say you have a legacy application which was built in us-east-1, it uses Postgres or something like that, and you're not gonna migrate it to Google Cloud, at least not out of the gate. But you wanna take advantage of some of the unique services in Google Cloud, like the machine learning capabilities, for example. What you can do is add a trigger, either in the application layer or in the database itself, to write changes to Fauna, use Fauna Serverless or operate it yourself, span that data into Google, and then have it locally accessible, either reading from the local Fauna to update an analytics system that has its own storage, or directly querying the local Fauna cluster from, you know, a Spark kind of service with the Fauna driver. That kinda inverts the database model and takes advantage of temporality to create a bus. But you don't just have change events. You actually get to query all of your data. So you can start moving some of the patterns which are well served by that legacy database into Fauna, for things like change feeds and change data capture and that kind of thing. You can also use the security model to lock down the canonical database more tightly, and then rely on Fauna's security model to expose that data to mobile apps, for example. So you can take a database which was built for a trusted deployment environment, on the web, with servers you own and manage, and essentially use Fauna as kind of a data CDN, and get all that relational power, that modeling power, that querying power, to make that data globally, publicly available to new views into the same underlying product.
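A hedged sketch of that restartable sync loop with the JavaScript driver; the index name and the sink are hypothetical, and the events/after pagination options are an assumption based on the older driver's pagination API.

```typescript
import faunadb from 'faunadb';
const q = faunadb.query;
const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET! });

// Restartable change-data-capture loop: read events since the last
// checkpoint, hand them to another system, then advance the checkpoint.
async function syncChanges(
  lastTs: number,
  sink: (event: unknown) => Promise<void>
): Promise<number> {
  const page: any = await client.query(
    q.Paginate(q.Match(q.Index('all_posts')), {
      events: true, // stream history, not current values
      after: lastTs, // resume from the previous checkpoint
    })
  );
  for (const event of page.data) {
    await sink(event); // e.g. write into BigQuery, Postgres, etc.
    lastTs = event.ts; // checkpoint advances with each applied event
  }
  return lastTs;
}
```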
[00:42:20] Unknown:
And what are some of the cases when Fauna is the wrong choice, and you would advocate for somebody to choose a different tool or a different platform? One thing we encounter,
[00:42:32] Unknown:
and to be clear, Fauna is becoming the wrong choice less over time, but one thing we encounter with some frequency, that will remain the case indefinitely, is time series use cases. A lot of people confuse time series and temporality because they both involve time. Fauna's temporality is really about storing history, the change history of mission-critical, like, user-generated content, stuff that's super important. And time series is really about data that individually doesn't matter and is only interesting in aggregate. So the operational characteristics people are looking for are wildly different. You know, they want a lot of analytics, roll-up, aggregation features that Fauna doesn't currently have. They want it to be cheap, to the point that they give up, you know, correctness, transactionality, even availability, in order to store as much data as fast as humanly possible, because the individual rows, you know, just don't really matter.
It's all about the aggregates. So that's something we've encountered with some frequency. And, you know, if you really wanna do time series, you should grab something like Influx, or potentially Cassandra, like a truly eventually consistent NoSQL time series database which is optimized for those patterns. And right now, for OLAP, for example, Fauna doesn't have native analytics capabilities, so you can query it from something like Spark with a Fauna driver. But if you really wanna do, you know, kind of an HTAP kind of scenario, the best thing to do is use Fauna's temporality to capture data into a relational database or a cloud analytics service, for example. And the temporality is nice for that, because you can do it in a restartable, transactionally correct way. And usually these systems don't have to be, you know, globally distributed. So if you have a global Fauna cluster and you wanna do HTAP, just spin up Postgres in one data center, you know, write a connector which will sync the data you care about into Postgres in soft real time, and you'll get that capability.
As we bring our own SQL capability online, these needs will diminish. But right now, those are kind of the two main areas where it doesn't really make sense. To be clear, we are building a general purpose data platform, so we're not opposed to eventually implementing pretty much everything you would want in these different data domains. But right now, the focus is CRUD, relational, document, graph, key-value.
[00:45:14] Unknown:
And in terms of your experience of building and growing the technical and business aspects of Fauna, I'm wondering what you have found to be some of the most interesting or unexpected or challenging lessons that you've encountered in the process.
[00:45:29] Unknown:
I think it would be safe to say, when we set out to build this project, we didn't realize we'd end up solving one of the hardest problems in applied computer science, in particular, global highly available ACID transactions. There was no industry version of Calvin before Fauna. There was only the prototype that was written for purposes of the paper. So the challenge of doing that certainly exceeded our expectations. I mean, most databases are working with a single consensus layer, if they have any consensus whatsoever. And some of the patterns for Raft are relatively laid down at this point, although many people botch their Raft implementations. But to add an additional novel consensus protocol on top of that, because we've extended Calvin in quite a few ways, in particular for performance and flexibility reasons, was a tremendous challenge for us. And we were gratified to finish our Jepsen analysis with Kyle Kingsbury recently, which kind of validated the entire architecture of what we set out to build, in the context of the academic literature and the history that came before, particularly Google Spanner and Percolator and that kind of thing. So on the technical side, that has by far exceeded, I think, the level of difficulty we initially assumed. That never stopped us before. It never stopped us at Twitter, so we still got it done. But I think that was a surprise. And then I think, you know, there's the usual stuff, which I think is common to technical cofounders, where, you know, I was a director at Twitter and managed a team of about 25 people, but managing a larger company, growing it from scratch, having an executive team and line managers and that kind of thing, has been a learning experience for me, because the people are just as hard as the technical aspects of the business.
And looking forward, what do you have in store for the future of Fauna, both from the business and the technical side? The biggest thing for us right now is these new APIs. Like, we've seen, you know, a lot of our cloud users already begin to implement GraphQL adapters for Fauna. So we're super excited to release the first-party native GraphQL interface and serve that market need more directly. We also have a lot of work to do on the SQL implementation, for example. And then, you know, some of the future interfaces we'll release beyond that. At the same time, you know, we're always improving performance. We're always improving the default consistency levels you get, pushing down latency even further.
There have been a bunch of operational improvements lately, which we're excited about, which will dramatically improve certain workloads. We're making it easier to use in different operational environments, like Kubernetes and different clouds and that kind of thing. And we also have locality control on the roadmap for this year, and we've started work on the ability to define, on a record-by-record basis, where your data lives in a single Fauna cluster. That makes kind of the shared services paradigm even more powerful, because you can lay out Fauna around the world. You can have it in 25 data centers, and you can have every individual application or logical database that's accessing that cluster decide on a row-by-row basis where it wants that data to live.
And that's good for compliance. It's good for management of, you know, replication costs. It's good for offering a shared service internally or in our own cloud and really, you know, pushing you to that edge CDN kind of data experience.
[00:49:11] Unknown:
And are there any other aspects of FaunaDB or the Fauna company that we didn't discuss yet that you'd like to cover before we close out the show? I mean, we're hiring.
[00:49:22] Unknown:
If you, you know, like to work on consensus algorithms, distributed systems, if you like worrying about what can go wrong rather than what can go right, a database company is a good place to be. In high school, I did some hobby stuff with electrical engineering, and I could just never, you know, quite get it together, because it was so unpredictable. All this analog stuff happening. And I ended up going into software for that reason. I found it to be a much more predictable environment. But then, of course, I doomed myself to having the same category of problems by working on distributed systems exclusively, which are, again, especially in the cloud, incredibly unpredictable, partially analog environments. You know, latency varies, nodes shut down, data disks get corrupted.
Like, if you're excited about solving those things, please talk to us.
[00:50:11] Unknown:
And for anybody who does wanna get in touch with you about that or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:50:31] Unknown:
The serverless edge experience. Like, we're pushing the granularity of application building down to literally nothing. Like, you had kind of a series of incremental paradigm shifts from physical servers to colocated or leased servers to virtualized servers to containers, but they're still all little servers. So, like, every thought you had, you had to mentally conceptualize it as being in a box. And it just doesn't make sense from a productivity perspective to think about software, especially distributed software, this way. Like, who cares how many functions can run within one container? I don't. I just wanna know if I have the aggregate capacity to execute the workloads my users are generating. And it requires a complete inversion of that abstraction, which we finally have now, for the most part, with serverless frameworks on the compute side. We've had it for a long time with CDNs on the caching side. But data, especially operational data, is always the last thing to move, because it's the riskiest. So you can now get, you know, some serverless analytics capability with things like Snowflake. But your canonical, operational, user-generated, mission-critical, you know, data, which is the existential underpinning of the business, still lives in, essentially, you know, a mainframe. And what we're trying to push with Fauna, and what the entire industry needs to push, is, you know, bringing this paradigm to its logical conclusion, which is: you should not have to care, or even know, as an application developer, how your data tier is operated. It should be completely orthogonal.
And at the same time, as an operator, you shouldn't have to care what your applications are doing. Like, the model of the DBA who has to, like, go in and, like, tune queries and make sure everything is safe to execute and fail over nodes to hot spares and stuff is an eighties model. And we need to move past that to an arm's-length, utility computing, serverless model where, if something's behaving badly, you know, in Fauna, for example, if an application is consuming too many resources, you lower its priority. You don't have to know what it's doing as an operator. And if you want global resources as a developer, you just provision a new database. You don't even have to think about where those data centers are located. And that's the experience we're closer to with serverless, and we're already there with CDNs. But data is just harder, because the quality bar is so astronomically high. Because, you know, I mean, the NoSQL movement was notorious for essentially killing businesses; Digg comes to mind, with their experience with Cassandra.
And people are smarter now, and they demand that their database vendors really do the work. But until the vendors do, like we're doing at Fauna, we're still gonna be stuck in that mainframe mindset.
[00:53:18] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you're doing at Fauna. It's definitely a very interesting project and seems to be quite the feat of engineering, and there are a lot of great technical resources that you've put up for people to be able to understand what it is that you're doing and working on. So I appreciate all of the work that you're doing on that front, and I hope you enjoy the rest of your day. Thanks. It was great to talk to you. Thanks for having me on the show.
Introduction to Evan Weaver and FaunaDB
Evan's Journey at Twitter and the Birth of FaunaDB
Main Use Cases and Market Needs for FaunaDB
Comparison with CockroachDB and Google's Spanner
FaunaDB's Architecture and Consensus Protocols
Data Modeling and Multimodal Capacity in FaunaDB
Common Points of Confusion and Learning Curve
Unique Architectures and Use Cases for FaunaDB
Analytical Workloads and Preproduction Environments
Security Model and Access Controls in FaunaDB
Data Types and Custom Functions in FaunaDB
Interesting Features and Unexpected Uses of FaunaDB
When FaunaDB is the Wrong Choice
Challenges and Lessons in Building FaunaDB
Future Plans for FaunaDB
Biggest Gap in Data Management Technology