Summary
How much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way to abstract the low level details of ETL so that you can focus on the actual problem that you are trying to solve. In this episode he explains his motivation for building the DataCoral platform, how it is leveraging serverless computing, the challenges of delivering software as a service to customer environments, and the architecture that he has designed to make batch data management easier to work with. This was a fascinating conversation with someone who has spent his entire career working on simplifying complex data problems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
- Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Raghu Murthy about DataCoral, a platform that offers a fully managed and secure stack in your own cloud that delivers data to where you need it
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what DataCoral is and your motivation for founding it?
- How does the data-centric approach of DataCoral differ from the way that other platforms think about processing information?
- Can you describe how the DataCoral platform is designed and implemented, and how it has evolved since you first began working on it?
- How does the concept of a data slice play into the overall architecture of your platform?
- How do you manage transformations of data schemas and formats as they traverse different slices in your platform?
- On your site it mentions that you have the ability to automatically adjust to changes in external APIs. Can you discuss how that manifests?
- What has been your experience, both positive and negative, in building on top of serverless components?
- Can you discuss the customer experience of onboarding onto Datacoral and how it differs between existing data platforms and greenfield projects?
- What are some of the slices that have proven to be the most challenging to implement?
- Are there any that you are currently building that you are most excited for?
- How much effort do you anticipate if and/or when you begin to support other cloud providers?
- When is Datacoral the wrong choice?
- What do you have planned for the future of Datacoral, both from a technical and business perspective?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Datacoral
- Yahoo!
- Apache Hive
- Relational Algebra
- Social Capital
- EIR == Entrepreneur In Residence
- Spark
- Kafka
- AWS Lambda
- DAG == Directed Acyclic Graph
- AWS Redshift
- AWS Athena
- AWS Glue
- Noisy Neighbor Problem
- CI/CD
- SnowflakeDB
- DataBricks Delta
- AWS Sagemaker
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances to ensure that you get the performance that you need. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of the show.
And managing and auditing access to all of those servers and databases that you're running is a problem that grows in difficulty alongside the growth of your teams. If you're tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it's time to talk to our friends at StrongDM. They have built an easy-to-use platform that lets you leverage your company's single sign-on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
Alluxio is an open source distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Your host is Tobias Macey. And today I'm interviewing Raghu Murthy about DataCoral, a platform that offers a fully managed and secure stack in your own cloud that delivers data to where you need it. So, Raghu, could you start by introducing yourself?
[00:02:54] Unknown:
Absolutely. Thanks, Tobias. As I mentioned, my name is Raghu Murthy. I've been an engineer most of my career, working on big data processing systems before it was called big data. I worked at companies like Yahoo. This is back in the day, in the year 2000, when they were having to process tens of terabytes of data, and there were not that many systems that could handle that kind of stuff, so we had to build a bunch of stuff in house. And a similar theme has kind of followed me through the years, where in 2008 I joined Facebook, where, again, their data volumes were growing pretty quickly.
And we ended up in a situation where we had to build quite a lot of systems ourselves. As part of that, I ended up working on Hive, Apache Hive, which you're probably familiar with. We built it and open sourced it. And for a 5 year period I worked on the data infrastructure stack at Facebook, where we grew from a 50 terabyte single Hadoop cluster to about 200 terabytes of data across multiple data centers. Through those years, I worked on pretty much every layer of the data infrastructure stack, starting from an auto-instrumentation library called Nectar, which would get data into Hive, and an orchestration layer on top, so that when people were trying to build pipelines, and these would turn into thousands or even millions of jobs a day, they'd need an orchestration system. So I built that. And then finally I ended up working on a project to make the Facebook data infrastructure stack multi-tenant and multi-data center.
And that was a significant amount of learnings. Over the next couple of years I did a bunch of other projects, and finally, over the past 3 years, I've been working on DataCoral, mainly as a way to apply a lot of the learnings that I've had over the years and make it so that we can give companies a way to just get started on their data without having to build any of the infrastructure that typically takes a significant amount of time. And do you remember how you first got involved in the area of data management?
[00:04:54] Unknown:
And I'm wondering what it is that keeps you interested in working in that problem space. Yeah. So
[00:05:00] Unknown:
during my undergrad years, I was actually more focused on compilers. I was always interested in the marrying of theory and practice, where there are some strong theoretical underpinnings and then you're building systems that are actually practically useful based on those theoretical underpinnings. After undergrad, when I joined Yahoo, at that point in time they were just trying to build data processing systems, as I mentioned, and it just seemed really interesting to solve those kinds of large-scale problems. But then I also started reading a bunch of papers and realized that there's plenty of theoretical underpinning, starting from the relational algebra all the way to all of the kinds of database optimizations and things like that, that has had many decades of research.
And that is what kind of keeps me going: as the compute landscape changes, going from single machines to multiple machines to, now, serverless, what exactly are the trade-offs that you have to make to build a system with the same theoretical underpinnings when the system itself is completely new? Even in grad school, I worked on a system called Trio. This was at Stanford, where we removed one assumption about databases, which is that the data in the database is 100% right. So if you have confidence levels attached to every single data item, how exactly does your query engine change? So this has been kind of ongoing for many years now. And right now, the way I'm looking at it is that we have leveraged a compute layer which is completely serverless.
Now, what can you build around this kind of compute layer so that you get a lot more visibility into how the data is flowing through these systems, rather than just getting the answer? I mean, there are plenty of systems that just give you the answer if you run a query on them. But if you have a bunch of standing queries, there are not that many systems that give you metadata about what the data quality is or what the data freshness is, because most systems are still focused on providing the actual answers about the data itself. So it's been close to two decades now, and it's still super fascinating, because every time we are finding that there are new problems to go after. There are new things around data processing, which is now around model building. So in my systems mind, it's like, well, you're doing data transformation, so now how exactly do you orchestrate it? How do you make sure that the processing itself is scalable? That's the way that I've been operating. In some manner it might seem like, okay, you have a hammer and everything starts looking like a nail. But so far it's kind of kept me going in terms of solving practical problems through this kind of thinking as well. And
[00:07:52] Unknown:
so can you give a bit of an explanation of what it is that you've built at DataCoral and your initial motivation
[00:07:59] Unknown:
for starting the company? Absolutely. So about 3 and a half years ago, I was an EIR at this VC firm called Social Capital, trying to figure out what I was going to do next. As part of that process, over a 3 or 4 month period, I ended up talking to a lot of companies, their portfolio companies, companies that would come to pitch them, and so on, about their data infrastructure. A lot of these companies have a ton of data and it's sitting in many different places, and they're all typically looking to either hire people to build the systems out or at least looking for systems that might solve that problem for them. And with every single company that I talked to, it seemed like the problem was fairly undifferentiated company to company. You basically want to collect data from different places, organize that data in, like, a query engine or a data warehouse depending on whatever your workload is, and then actually leverage the insights that might have been generated back out into applications, so that either you build out applications or you change the business processes of the company. So as I talked to each of these companies, what I realized was that there are a lot of vendors. I mean, there are a lot of companies that offer different slivers of functionality.
But even putting them together, even if you decided, okay, I'm not gonna build it myself, even then, in order to put these things together, you still have to have some expertise. You need to know what it means to bring together a couple of services that ingest data and put them together with, let's say, an orchestration layer that then allows you to transform the data. And there didn't seem to be a common thread that could provide very clear visibility into data quality and data freshness. I kind of talk about it as this impedance mismatch between systems, where when you're trying to build end-to-end data flows, at some point there is a cron job or something that runs on a schedule assuming that all the data that is needed for that particular processing step already exists.
And invariably something or other goes wrong upstream and there's no clear way to communicate it. This is where a lot of the challenges end up happening when you just set something up. And also, on an ongoing basis, you try to change the business logic of some part of your data flow and you don't really have a good sense of what else downstream might fail. So I started looking at all the services that were out there and all the systems that are out there, open source and so on, and realized that the amount of technical expertise that was needed to put them together was actually fairly high. And to top that, especially if you use SaaS services, companies had to, in some sense, give up control over their data. Right? They would be able to get the analysis or whatever it is that these SaaS tools provided, but the company's business data, or data that they would deem really private and sensitive, they would have to send through these SaaS tools before they could get value out of it. And a lot of companies wouldn't do that. Instead, they would have to build something that is homegrown. So there was this combination of way too much choice, and a lot of those choices resulting in security compromises, if you will, or data privacy compromises.
And then around that time, the clouds were also taking over. Right? I think 2015 was the time when more than 50% of new infrastructure spend was going into the cloud. So I started pulling the thread on what it means to build cloud-native end-to-end data flows as a service, or a data platform as a service. And, again, there are companies that did that too, and they all claimed to be kind of auto-scaling and all of that stuff. But once you look under the hood of these companies, these specific vendors that offer data platforms as a service or call themselves serverless, a lot of them were, in some sense, building out clusters themselves, managing those clusters, but then giving a serverless kind of interface to their customers. Internally, they were having to manage clusters.
They were having to grow these clusters, shrink them, mostly grow since they're all growing, and map customers to these clusters. So you couldn't have one giant cluster of, say, Spark or Kafka or something like that that you would use to bring in the data or manage the data. Instead, you'd have this handful of clusters that these vendors would manage, and then they would map customers to those clusters. So what I realized was that, given that they were all trying to solve for multi-tenancy themselves, they were having to build a lot of scaffolding to make multi-tenancy actually work well. And this is also around the same time that serverless computing was coming to the fore, where there's this service called AWS Lambda that was getting a little bit more popular. What I realized is that AWS Lambda as such might just make the entire cluster management problem go away. What I mean by that is that Lambda, as a service, came in and said, if you represent your workloads to follow my constraints, as in they cannot run for too long and they cannot take too many resources, then me as a service, as in Lambda, will make sure that whatever your business logic is will run at any scale. You don't have to do any provisioning. You don't have to do any of the cluster management as such. And as a software stack, it actually becomes very easy to deploy inside of customers' environments.
So because of serverless, what I realized was that it might actually be possible to build this notion of a private SaaS solution, as in all of your software is running within the customer VPC. It's still in the cloud, right? But it is running within the customer VPC and no data is leaving the customer systems. At the same time, given this notion of cross-account roles and stuff like that that these cloud providers provide, me as a vendor, I would be able to provide it as a fully managed service. So I've talked about all the technology choices, and then, finally, the motivation for founding it. As I was thinking about these kinds of technologies, the first thing that I did as an EIR was just writing code to play around with Lambda. And then there were a few portfolio companies that said, okay, hey, if you actually build something like this, then we'd actually use it and maybe even pay for it. And that is the impetus where folks at Social Capital told me, hey, maybe there's a company here, and here's a check. And that's how DataCoral was formed.
[00:14:24] Unknown:
That makes it an easy decision, somebody just handing you money to go build this thing.
[00:14:29] Unknown:
Yeah. I mean, I was already building it, and then they're like, yeah, maybe there's
[00:14:34] Unknown:
a there there. So on the website that you have, you mention the fact that DataCoral has a very data-centric view of how to build a data processing platform. So I'm wondering how that differs from the way that other platforms or platform components think about the overall processing of information as it traverses from source systems to destination systems
[00:14:58] Unknown:
and just some of the different capabilities that that unlocks, thinking about it in this different way? Absolutely. So I'd like to take us back to how things were before there was SQL and before there were relational databases. People would write programs that would process data that was sitting in files to get any answer. Then along came SQL and said, hey, just specify what you need. What is the question that you need answered? Don't tell me how. There is software that can figure out the how, but make it very clear what the input data is that the answer to your question relies on. Think the FROM clause of SQL. So with databases, there's been tons of research and, I mean, huge businesses built out of just this core concept of giving somebody a high-level language that is data-centric. It was not about writing programs or writing processing steps. It was about just declaring the question that required an answer, but then being very explicit about what the data dependency is. For me, that is a data-centric language. Now fast forward to 2008 or 2009, when we were working on Hive. Before there was Hive, rather, there was MapReduce. Hadoop was a MapReduce framework and a system, and people were writing MapReduce jobs to do, well, the classic example was word counts in large documents or whatever. As these kinds of computations got more complicated, people would write multiple MapReduce jobs and say, hey, this is the input for this MapReduce job, and it generates this other set of files. And then there are these downstream MapReduce jobs that would take the outputs of the previous MapReduce jobs as inputs, and then they would generate their own, and so on and so forth. So, essentially, people were hand coding these DAGs of MapReduce jobs. And Hive basically came along and said, hey, you don't really have to hand code these things. Again, same thing as SQL back in the day: just write SQL, or HiveQL in this case. And then the system will actually compile that and generate the DAG of MapReduce jobs that need to run, and then there will be an execution layer that does the appropriate orchestration.
So, again, Hive took off. Right? But then, when you start thinking about end-to-end data flows, as in pull data from somewhere, organize that data in multiple ways by taking it through a series of transformations, and then actually publish that data back out, we are back in the whole writing-programs phase, as in you're writing pipelines. These pipelines have jobs, and these jobs are essentially scripts. You have scripts for a bunch of things. You have scripts that pull data from different places and write it out. Then you have downstream scripts, or downstream jobs, that read from the output of the upstream jobs, do their own transformations, and write something back out into a database or a file system. And then you have this entire DAG of jobs that is being hand coded. You can think of a bunch of ETL systems that are used to build these kinds of DAGs, where essentially a lot of this orchestration is being hand coded. So our data-centric approach has been to essentially do what SQL did for data processing of files, or what Hive did for MapReduce: we provide a SQL-like language that allows you to specify end-to-end data flows, and that gets compiled into a DAG of jobs, or entire data pipelines, that can then be run in a fully orchestrated manner by our engine. So when you think about what it means for such a data-centric approach to exist, you can have somebody who is not an engineer, who knows, let's say, just SQL, and they're able to actually build out end-to-end data flows by only thinking about the shape and semantics of data. Right?
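As a toy illustration of the hand-wired orchestration being described here (a Python sketch with made-up job names, not any particular ETL tool), the job and task dependencies have to be coded by hand and kept in sync with the data dependencies:

```python
# Toy version of a hand-coded pipeline: each job's upstream dependencies are
# listed by hand and must be kept in sync with what the scripts actually
# read and write.
def extract_salesforce():
    print("pull Salesforce objects into staging files")

def load_warehouse():
    print("load staging files into the warehouse")

def build_daily_revenue():
    print("run the daily revenue SQL against the warehouse")

# Hand-maintained DAG: job -> list of upstream jobs.
jobs = {
    extract_salesforce: [],
    load_warehouse: [extract_salesforce],
    build_daily_revenue: [load_warehouse],
}

def run(job, done=None):
    """Run a job after recursively running everything it depends on."""
    done = set() if done is None else done
    for upstream in jobs[job]:
        run(upstream, done)
    if job not in done:
        job()
        done.add(job)

run(build_daily_revenue)
```

The declarative approach described next infers this same DAG from the statements themselves instead of having it wired up by hand.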
And then, as a system, yes, we are running whatever we are running. At the end of the day, there has to be a data pipeline, or there has to be a bunch of processing that needs to happen for data movement or data transformations. But all of that is hidden from the user. What the user does get back is full visibility into the freshness and the integrity, or the quality, of the data. One of the big challenges that happen when you have these data pipelines, when you have these scripts, and inside of those scripts you're probably running SQL queries for the most part if you're not actually moving data around, is that the data dependencies are actually hidden inside of the scripts. So you're having to explicitly hand code the job and task dependencies to match the data dependencies that exist. Does that make sense? Yes. What we are saying is, hey, don't worry about the jobs and tasks. That is, in some sense, the physical plan of your data flow, just like there is a physical plan for a query in a database. Instead of hand coding these physical plans, they get generated. So that is how we believe we are essentially moving people up the stack, if you will, to just thinking in terms of the shape and semantics of data, instead of any of the underlying systems, any of the scripting as such, or any of the hand orchestration that ends up happening. In our mind, this is like what happened in computer science. Earlier, there used to be assembly language that was pretty architecture specific, all about registers and memory banks, and you had to really hand code everything that needed to happen for the business logic. And then there were higher-level languages that said, hey, you don't have to do this computer engineering. Instead, you can do computer programming, where you're mainly focused on the business logic of your application. So we are calling our data-centric approach
[00:20:31] Unknown:
data programming rather than data engineering. And a couple of the things that come out from that as questions that I'm thinking of are, 1, you're mentioning the visibility of data freshness and data lineage. And I'm curious what types of strategies you're using for instrumenting these processing components that you're running. And then also as far as moving higher up the stack, I'm wondering what are the areas where you have found some of the abstractions to be leaky, where you have to try to reimagine or reengineer the way that you're consuming the data or structuring the data as it traverses these various components of the pipeline, particularly when you're dealing with these myriad different data sources that often have an ability for customers to create their own schemas and mappings that aren't necessarily going to be standardized across all of the implementations and how much upfront effort is required to be able to manage those differences in the source systems as they traverse the various layers?
[00:21:34] Unknown:
Absolutely. So, again, we are not changing the complexity of the data itself. That is something that is inherent to the data, and we are not trying to solve for that complexity by standardizing data, schemas, or anything like that. But what we are saying to the data practitioner is that the schema is the interface. What they're looking at are connectors that are just getting data with as high fidelity as possible from the source systems into a warehouse, let's say. And in that warehouse, they're actually specifying transformations in the language of that warehouse. So if it is a SQL database, then they're just writing SQL in the dialect of that database. This also actually ties into the leaky abstractions, and I'll get to that in a minute. But what we are saying is that whatever the business logic of your data flow is, if it is representable in SQL, then directly represent it in the SQL of the underlying system. If not, you're actually writing a SQL-like statement that says, hey, insert into this schema in this analytics warehouse from these Salesforce connection parameters. Right? So now, when you think about extracting data provenance, it is literally coded up in the SQL-like statements that people are writing to specify their end-to-end flows. It's just that we parse it and compile it, and we infer what the data dependencies are, simplistically speaking, by just looking at the FROM clauses of each of these statements that get written. That's how we are able to extract the data provenance statically. We don't have to really instrument anything, because people are literally telling us, hey, this processing step depends on this data, or this particular transformation depends on this data, because that's in the FROM clause. And we have not yet published a spec for our language, but you can imagine that even if it is complicated processing, let's say model building or something that is not representable in SQL, we have modeled that as a user-defined table generating function. Hive basically has these, and there are a bunch of other databases that have them as well, like stored procedures. But the idea is that when you are trying to define these data flows, you're mainly describing the data dependencies.
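As a rough sketch of how those dependencies could be inferred statically from such statements (the statements, table names, and naive parsing here are hypothetical; DataCoral has not published a spec for its language), something like this captures the idea:

```python
import re
from collections import defaultdict

# Hypothetical "data program": each target table is defined by a SQL-like
# statement, and the tables it reads from appear in its FROM/JOIN clauses.
data_program = {
    "salesforce.accounts": "SELECT * FROM salesforce_connection",
    "analytics.daily_revenue": "SELECT date, SUM(amount) FROM salesforce.accounts GROUP BY date",
    "analytics.revenue_report": "SELECT * FROM analytics.daily_revenue JOIN analytics.targets ON date",
}

def referenced_tables(sql: str) -> set:
    """Naive extraction of the tables named in FROM/JOIN clauses."""
    return set(re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, flags=re.IGNORECASE))

# Build the dependency graph (schema-level provenance) without running anything.
upstream = {target: referenced_tables(sql) for target, sql in data_program.items()}
downstream = defaultdict(set)
for target, sources in upstream.items():
    for source in sources:
        downstream[source].add(target)

print(upstream["analytics.daily_revenue"])    # {'salesforce.accounts'}
print(downstream["salesforce.accounts"])      # {'analytics.daily_revenue'}
```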
The actual processing step itself could be happening inside of a warehouse or inside of a container, but the actual data flow itself is represented in SQL, especially the transformations part. Now, there are a couple of equivalences to this, and we have leveraged a lot of those concepts too, and I can get into a lot of the thought that has gone into how we have brought together some of these concepts. So databases have this notion of views. You can build views, and views on top of other views. In a very simplistic form, you can think of all the transformation pipelines as being represented as views and views on top of other views. But then, as you probably know, creating those views, let's say the furthest downstream view, becomes really expensive if you keep creating it over and over. So you want to cache the query results, and databases have had concepts for trying to do this, which is called materialized views. But when you think about building these materialized views, you realize that databases run into problems trying to provide a consistent way for updates to get propagated between these materialized views when one view depends on another view. So most analytics databases don't really support materialized views well. Even long-standing databases like Oracle, DB2, and so on have been working on what is known as the incremental view maintenance problem. As in, let's say you have views that have joins and aggregates and so on, and a row got inserted into one of the source tables. How do you actually propagate that particular update through all of the materialized view definitions in these end-to-end pipelines? Doing that is actually pretty hard. Doing that in a consistent manner is even harder. Doing materialized views in general in a transactionally consistent manner is incredibly hard. So most analytics databases kind of give up on it, and that's the reason you have these data pipelines where you have derived tables that then get updated on a given schedule so that there's some predictability and so on. In our minds, giving up on materialized views, just calling them derived tables, and writing these scripts that get triggered and so on is, in some sense, throwing the baby out with the bathwater. You had materialized views that provided a very nice abstraction for people who just wanted to think about the shape and semantics of data, and we told them, hey, no, the database is not able to support them well, so now you have to think about how to orchestrate those transformations.
So what we have said is, especially for this notion of transformations, you can just build them as views, and we'll build a poor man's version, if you will, of a materialized view on any database, and we'll do a deterministic orchestration based on our micro-batch processing layer and so on. That is kind of our secret sauce, and that is what we have made. The secret sauce, you can think of it as our opinionated way of building data flows.
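A minimal sketch of that kind of deterministic, micro-batch refresh of derived tables in dependency order (assuming a dependency graph like the one sketched earlier and a hypothetical `run_sql` helper; this is an illustration, not DataCoral's actual engine):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical view definitions and the upstream tables each one reads from.
views = {
    "analytics.daily_revenue": (
        "SELECT date, SUM(amount) AS revenue FROM salesforce.accounts GROUP BY date",
        {"salesforce.accounts"},
    ),
    "analytics.revenue_report": (
        "SELECT * FROM analytics.daily_revenue",
        {"analytics.daily_revenue"},
    ),
}

def run_sql(sql: str) -> None:
    """Placeholder for submitting a query to the customer's warehouse."""
    print(f"running: {sql}")

def refresh(batch_id: str) -> None:
    # "Poor man's materialized view": rebuild each derived table for this
    # micro-batch, in dependency order, once its inputs are complete.
    order = TopologicalSorter({v: deps for v, (_, deps) in views.items()}).static_order()
    for table in order:
        if table in views:
            sql, _ = views[table]
            run_sql(f"CREATE TABLE {table}_{batch_id} AS {sql}")

refresh("20190401T00")
```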
Now, coming to the leaky abstractions part, we are not building a transformation layer ourselves. If you want to process large amounts of data, we are saying, hey, leverage one of the data warehouses that already exist. We don't wanna be yet another database, because, well, there are plenty that are reasonably good. They may not do everything around end-to-end data flows, but if you submit a query, they do a great job of performing the data processing for that query in an optimal way and then giving you the answers. What that means is that we are now exposing all of the problems, as such, of these underlying data warehouses. Our entire implementation is completely serverless in terms of data movement, so we built that ourselves. But once the data goes into a warehouse and it needs to get transformed there, we provide a seamless way to orchestrate the transformations, and that means that the performance of the underlying warehouse is something that our users have to worry about. The good news is that we are actually capturing a bunch of statistics about these view queries: how are they running, how much data are they processing, how long are they taking, is there skew in the data? These are all things that we can start capturing. And our goal is to build an optimizer that can optimize an end-to-end data flow, rather than what a database does, which is just optimize one query at a time. So, for example, if I know that the output of one query is going to be joined downstream on a couple of attributes, because, again, the user has specified it as part of a data flow, then I know how to lay out the data so that the downstream queries can actually be really efficient. Now, a lot of this stuff is being done right now by going over query logs and inferring what the workload looks like. But it turns out that when people are defining data flows, these are standing queries, as in the business logic of the data flow does not really change; we just want to keep the data alive. So what we have said is that maybe you can actually statically compile and optimize these data flows that leverage these warehouses to make sure that the end-to-end data flow is fully optimized. My mind was just blown a number of different ways.
[00:28:53] Unknown:
I especially like the way that you're thinking about the idea of select from Salesforce into your data warehouse, and then just letting the rest of the processing happen under the covers without having that be the primary concern of somebody working at the data engineering layer. Because, as you were saying, that's where a lot of wasted time and energy goes when, really, it's just plumbing, and the primary concern is just getting your data from one place to the other and being able to do something with it. So one of the things that I'm particularly interested in is the underlying architecture of the system that you've built in order to be able to compile this higher level query language into those different processing steps. And I know that one of the pieces is this concept of a slice that you have as far as a microservice for managing those different portions of the processing pipelines. I'm wondering if you can just talk through the overall systems architecture
[00:29:47] Unknown:
and some of the ways that it has evolved since you first began working on it. Yeah. So when I first began working on it, my goal was to just pull the thread on AWS Lambda and see how far you can go in terms of representing all the functionality needed in a data infrastructure stack completely in terms of these bite-sized operations that AWS Lambda can do really well. And it turns out you can do quite a lot. When you think about data infrastructure in general, you're thinking about three different pieces of functionality. One is collecting data from different places. You might have data coming in as event streams, data sitting in databases, like your production databases, Postgres, MySQL, Oracle, whatever else, from which you want to pull data so that you can analyze it. Or you have a huge number of applications that you're using as a business, where your business data is sitting inside of those applications.
These are all the SaaS tools that you end up using. And centralizing data from all of them is, in some sense, massively parallelizable, if you will, especially if you are thinking about just pulling data with high fidelity; you're not trying to do any transformations as such. So that was the first step that I took. And then what I realized is that you have to actually package a lot of resources, not just the Lambda function, but, like, the rules, the permissions, the orchestration layer: how do you do configuration management? How do you do state management? How do you build out the orchestration layer? And how do you provide visibility? So all of these things put together, all of the software that we put together, we wanted to make into one microservice.
And we made that one deployable unit, and that is what we ended up calling a slice. So if you had a connector to Salesforce, that would become a slice, because that is one deployable unit. In terms of how this ties back into these data programs, remember I told you that we're calling it data programming. So if you said something like select into a schema from Salesforce connection parameters, when you ran that, there'd be a microservice that would get deployed. That would be the slice that gets deployed, and then it will keep the data alive. It'll start getting data in. So that is one type of slice. And then we built this whole orchestration layer for different warehouses. Right now we support Redshift, and we've also supported AWS Athena and, of course, any other kind of big data query engine that uses the same system catalog that Athena does, which is the Glue Data Catalog. And there is a whole bunch of orchestration, a whole bunch of functionality, that is needed to push data into these warehouses as well as orchestrate the transformations that need to happen within these query engines. And, again, we built that entire thing also as a serverless component.
And we also call that a slice, because, again, we are reusing a ton of that code. So now you can think of a driver for one of these warehouses as also being a slice. When somebody says, hey, I wanna use Redshift, then there is a slice that gets deployed, which says, okay, now all of the collect slices will be able to write to Redshift. Or if I added a managed Athena slice, then all the collect slices will know how to write to Athena, or to the Glue Data Catalog in this case. So the notion of a slice for us is actually an implementation detail, which also translates to whatever the user is saying. Typically, when you're building connectors or when you're building these drivers and so on, when you think about the deployment model, they're all built and shipped as, let's say, one monolithic application.
And then a user kind of toggles the functionality that they need. But given that we are completely serverless, we are actually not even deploying software whose functionality a user is not using. Whenever a user needs a piece of functionality, a slice gets deployed for that. And remember that all of this deployment is happening inside of the customer environment, inside of the customer VPC. So when a user says, hey, I need this connector, then, automatically, our software deploys a slice inside of their VPC that then starts collecting data or organizing data inside of a query engine. Or, on the other end, there is the last part of a data infrastructure stack that not that many people talk about, but ultimately what you want is to get data out of your analytics engine back out into applications or back out into production databases so that your product can use it. We built out these deployable modules and we call them slices. There are collect slices, there are organize slices, there are harness slices. And, again, this is just an implementation detail of how a serverless system works. As far as the user is concerned, they don't really need to know that there is a slice or anything like that. All of that is an implementation detail; it just makes our job easy in terms of providing a fully managed service. So when we got started, we started building these slices, but then we built a layer on top, and this notion of slices has essentially gone into the background.
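As a rough mental model of what one of these deployable units bundles together when it lands in a customer's VPC (the field names below are hypothetical, not DataCoral's actual packaging format):

```python
# Hypothetical description of what a "slice" bundles together when it is
# deployed into a customer's VPC; the real packaging format is not public.
salesforce_collect_slice = {
    "name": "collect-salesforce",
    "lambda_functions": ["salesforce-extract", "salesforce-schema-monitor"],
    "iam_role": "collect-salesforce-role",          # permissions scoped to this slice only
    "schedule": "rate(5 minutes)",                   # micro-batch cadence
    "configuration": {"objects": ["Account", "Opportunity"]},
    "state_store": "dynamodb://slice-state/collect-salesforce",
    "publishes": {"schema": "salesforce", "format": "parquet"},  # shared metadata layer
}
```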
[00:34:33] Unknown:
And in terms of the life cycle of the data as it goes between these different slices, there are a few things that I'm curious about, one being managing schema transformations or updates or variations in data format as it goes between the different layers of the slices, and making sure that the inputs and outputs that are expected between them match up. And, also, because of the fact that it's running on Lambda, how you manage the time constraints of being able to process these different batches of data and ensure that you don't run up against the limitations of the underlying platform, both from a time and performance perspective and in overall available resources, whether it's compute cycles or memory or data storage?
[00:35:19] Unknown:
Yeah. So, again, Lambda was our main workhorse for a long, long time. But since then, as you can imagine, if things need to take longer or whatever, we have started leveraging other ways of doing the processing as well, even though we have kept the same data-centric approach. So, I talked about these slices. Each of these slices has a shared metadata layer, so all of them do configuration management the same way and state management the same way. And we leverage open data formats; we don't have our own proprietary data formats, because that is not where we are innovating. Every slice essentially publishes not only the schema of what it is publishing, but also the data format itself, like the actual physical layout, and that is something that's available across the entire data flow. Now, the other aspect of the shared metadata layer, and the fact that the shared metadata layer is where all of the data provenance is captured, is that when there is a change that happens at the source, in terms of a schema, let's say, each of these connectors that are connecting to a source is also monitoring for schema changes. So whenever there's a schema change, that results in a data event, saying that, hey, there is a schema change that happened. And then the downstream processing steps either know how to deal with it or don't. And if they don't know how to deal with it, they'll notify the user to say, hey, something changed upstream, I'm gonna stop processing now because I don't know what to do. Like, for example, let's say a column got dropped at the source. Because we have data provenance, we are able to, or actually, more technically, we have schema provenance. Data provenance is a much harder problem.
We can get to that, but that's kind of a digression. Given that we have schema-level provenance, we are able to notify our users. Like, if we're able to handle it, let's say a new attribute got added, then it'll just get added to the data in the warehouse. But let's say an attribute got dropped: if it's not being used anywhere in the downstream queries, then it can be silently dropped. But if it is being used, well, we are not trying to interpret the business logic of any of the queries, not yet anyway, so we'll just notify the user. We'll quiesce that particular processing flow and notify the user to say, hey, something has changed, you need to go change the business logic of your query to handle this particular change. And then, when they actually make that change, we provide a compile step. What that means is, let's say they change the query and they also dropped an attribute in that query, because the table in the FROM clause had dropped an attribute. Now, how do you make sure that something downstream does not fail? This is actually a pretty big problem when you're building these data pipelines where there's a bunch of scripts. So what our compile step does is, in some sense, make it so that these data programs are typed. They're actually fairly strongly typed around schema. You can think of a view definition as a function signature: whatever is in the output, the schema of the view, is the signature of that view. Then, given that we know all the downstream queries, you can think of them as all the call sites of this function. So our compile step basically says, hey, these are all of the downstream queries that'll get affected. And as a user, you can say, hey, I know how to change it. This is similar to how you do computer programming: you change the function signature, you go change all the call sites. Or at least when you program in compiled languages, you compile it, and then you have to go change all the call sites, and that forms a consistent update to your entire code base. So what we have done is we have taken that exact same thing and said, okay, now, instead of doing just one update to your business logic, you need to create an update set.
And that update set should be a consistent update to your entire data flow, not just one thread of a data flow but the entire DAG, and we make that consistent update to the data flow at the right time. For example, if you have something that's running hourly and there is something downstream that's running daily, you don't wanna change the business logic of the thing that's running hourly in the middle of the day, because then your daily process will have half the old version and half the new version of the hourly data. So instead, we change both the business logic of the hourly as well as the daily, if the daily also has to change, at the beginning of the day. All of this is actually done at compile time, so we are able to infer when exactly to make those changes and make them in a consistent manner. So, to answer your question about how we handle changes: if we can handle it directly ourselves, we do it. Otherwise, we notify the user, but then provide enough guardrails so that the updates that they're making are consistent.
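To make the "views as function signatures" analogy concrete, here is a small sketch of what such a compile-time check might look like (the column extraction is deliberately simplistic and all names are hypothetical):

```python
import re

# Output schema of a view (its "function signature") and the SQL of the
# downstream queries that read from it (its "call sites").
view_schema = {"analytics.daily_revenue": {"date", "revenue", "region"}}
downstream_queries = {
    "analytics.revenue_report":
        "SELECT date, revenue, region FROM analytics.daily_revenue",
}

def columns_used(sql: str, table: str) -> set:
    """Very naive: the columns listed in the SELECT clause of a query over `table`."""
    if table not in sql:
        return set()
    select_list = re.search(r"SELECT\s+(.*?)\s+FROM", sql, flags=re.IGNORECASE | re.DOTALL)
    return {c.strip() for c in select_list.group(1).split(",")} if select_list else set()

def compile_update(view: str, new_schema: set) -> list:
    """Return the downstream queries broken by a schema change, so the user
    can fix them all together as one consistent 'update set'."""
    broken = []
    for name, sql in downstream_queries.items():
        missing = columns_used(sql, view) - new_schema
        if missing:
            broken.append((name, missing))
    return broken

# Suppose the 'region' column is dropped upstream:
print(compile_update("analytics.daily_revenue", {"date", "revenue"}))
# -> [('analytics.revenue_report', {'region'})]
```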
[00:39:53] Unknown:
And so, in terms of the overall experience and life cycle of a customer who's onboarding onto DataCoral, I'm curious what the process looks like both for greenfield projects and for customers that have an existing data platform and want to start converting over to using DataCoral for the majority of their processing. So, given our
[00:40:16] Unknown:
kind of microservices-based architecture, we could actually live alongside whatever else a customer might have. In fact, that's in our name, Coral, because it can grow on top of anything. So whether it is brownfield or greenfield, we can start off with whatever the customer might need. If they just want a bunch of connectors, because they wanna pull data from different sources and they don't have the resources to actually build out those connectors, we can just provide those connectors. And if they want us to provide a managed warehouse, we can do that as well.
And in some cases, what ends up happening is people start off with some of these connectors to ingest data, then they see that they get this kind of full power of materialized views, so then they start using that. And then finally, in the fullness of time, they're able to actually use us for end-to-end data flows. And, again, as I said, we can live alongside whatever they might have. The only part that we are working on, and that we work with our customers on as well, is this whole notion of noisy neighbor problems, especially in data warehouses that are not serverless themselves. If they have a bunch of workloads that result in a lot of load on the underlying warehouse, and then our software also has to run queries on the warehouse at around the same time, then we run into some scalability challenges and so on, and we work with our customers to make sure that there's enough capacity. So we've had both cases. We have been the very first data infrastructure stack for people: small companies have come to us and said, hey, we are thinking about building out a data warehouse, we don't know where to start. And we have gone in there, we have, in fact, deployed their first Redshift instance as part of our installation, and we have provided an end-to-end stack for them pretty quickly. We've also had cases where people had their own Redshift clusters or they had their own custom-built scripts. And in those cases, we tell them, hey.
Tell us about a few use cases that you know you'd love to solve, but you just don't have the resources right now to tackle them, because your main work, like building out KPIs or KPI dashboards or whatever else, is actually taking a significant amount of time. And we start off with those use cases. We can start off small, but then we grow over time. The more that our customers see that they can get a significant amount of value without having to spend a bunch on engineering, the more they start using us. We know that we have succeeded if data engineers, data scientists, and data analysts start getting creative. They start saying, oh, the stuff that I was worrying about is no longer a worry, so let me think about this other dataset that I might be able to bring in to understand better what a customer is doing, or understand better what my sales team or marketing team is doing, and so on. We've had customers who also said, hey, I wanna just pull in some of the Slack messages that are happening in certain channels, like help channels, so that I can analyze what kinds of questions people are asking and see if there's something that I need to build in the product. Those are not the types of things that you think about upfront when you are thinking about building your standard business dashboards or whatever. But once you start elevating people from dealing with data pipelines to thinking just about the shape and semantics of data, then you start seeing that they're truly leveraging all the data that they have. And there are a whole number of other avenues I'd like to explore
[00:43:46] Unknown:
because I think this is a very fascinating topic and a very fascinating platform, but I'm wondering what you have found to be some of the most interesting or unexpected or challenging aspects of the lessons learned, or the work that you've done, in the process of building DataCoral?
[00:44:06] Unknown:
So building DataCoral is kind of two things. Actually, it's many things, but I break it down into two things. One is the technology itself, and I think we've talked quite a bit about the technology. And then the second thing is about building out a company, building out a startup, hiring people, getting people excited about the same vision, and actually, finally, delivering the value to our customers. In terms of the challenges, clearly, the biggest challenge for us right now is hiring. It's a pretty hard market, so we're spending quite a lot of time trying to get the best people that we can on all aspects of the company: on the technology side, on engineering, on sales, on marketing, on product. And I believe that we have actually assembled a pretty solid team. And, of course, to boot, we have some pretty amazing and supportive investors. Social Capital did our seed round, and Madrona, based in Seattle, did our Series A. And it has been immensely helpful for me, as an engineer and first-time entrepreneur, to be able to build out the company with the support of such investors.
So that's around the company building itself, and I'm guessing that's true for most companies nowadays anyway. Now, going back to the technology side, I think some of the unexpected challenges are around where exactly the bottlenecks lie when you make certain technology choices. We are building a distributed system, and typically, when you're building a distributed system, there are many choices that you have to make: how do you do configuration management, how do you do state management, how do you do orchestration of workloads, how do you do resource management and resource scheduling? And then, finally, how do you actually provide visibility into the workloads and the overall work that's happening in the system? The one choice that has been clear and continues to be clear is the choice of how resource management and resource scheduling work. We have chosen serverless technologies like AWS Lambda to do that, but the rest is something that we have built. And to top that, the fact that we are building multiple isolated installations rather than a single multi-tenant system provides some pretty unique challenges around what even a CI/CD pipeline looks like. Deploying to a handful of clusters is very different than deploying to hundreds of isolated installations.
A serverless installation is actually much easier to deploy than a serverful one. But even so, building a CI/CD pipeline for a serverless service that runs as multiple isolated installations is surprisingly hard, as you can imagine. So we are solving the problems of both a SaaS company and an on-prem company: we are providing a private SaaS offering, and we are hitting a bunch of scalability challenges around just the CI/CD itself. I also want to mention that I actually believe, and maybe even want, most applications, even SaaS, to move towards private SaaS. Why is it that my company's data is sitting in some other company's systems? As data privacy concerns grow, I do believe this model of deploying software inside the customer's VPC, for not just infrastructure but even applications, should become a thing. And we are on track for that. But as you can imagine, between building a business, building out the technology, and deploying it, there are plenty of challenges.
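As a concrete illustration of that CI/CD challenge, a minimal sketch of rolling one versioned serverless stack out to many isolated customer installations might look like the following. This is not DataCoral's actual tooling; the customer registry, cross-account role name, stack name, and template URL are hypothetical, and AWS CloudFormation plus an assumed cross-account role are just one plausible mechanism.

# Minimal sketch: rolling one versioned serverless stack out to many
# isolated customer installations, one AWS account/region at a time.
# All names (customer registry, role, stack, template URL) are hypothetical.
import boto3

CUSTOMER_ACCOUNTS = [
    {"account_id": "111111111111", "region": "us-east-1", "stack": "datacoral-slices"},
    {"account_id": "222222222222", "region": "us-west-2", "stack": "datacoral-slices"},
]
TEMPLATE_URL = "https://s3.amazonaws.com/example-artifacts/slices-v42.yaml"  # hypothetical artifact

def deploy_to_customer(customer):
    # Assume a role inside the customer's own account so the software runs
    # (and the data stays) in their VPC rather than in a shared SaaS account.
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=f"arn:aws:iam::{customer['account_id']}:role/deployer",  # hypothetical role
        RoleSessionName="cicd-rollout",
    )["Credentials"]
    cfn = boto3.client(
        "cloudformation",
        region_name=customer["region"],
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    # Update this customer's isolated stack to the new template version.
    cfn.update_stack(
        StackName=customer["stack"],
        TemplateURL=TEMPLATE_URL,
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    # Wait for this installation to converge before touching the next one.
    cfn.get_waiter("stack_update_complete").wait(StackName=customer["stack"])

for customer in CUSTOMER_ACCOUNTS:
    deploy_to_customer(customer)

Gating each installation on its own stack-update waiter keeps a bad release from fanning out to every customer at once, which is one way deploying hundreds of isolated installations differs from updating a single multi-tenant service.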
[00:47:40] Unknown:
And just briefly, DataCoral definitely seems like a very useful and flexible platform, but what are the cases where it's the wrong choice and somebody should start thinking along a different avenue?
[00:47:53] Unknown:
Yeah. So we have built a pretty opinionated data flow platform. What we have said is that we do micro-batch processing, which by definition makes us a poor fit for real-time applications or any requirements where your data latencies need to be in seconds or sub-seconds. We are not at all a good choice for that. There is also a lot we are yet to build. We have plans around model building and things like that, but we are not there yet, so those are places where we recommend our customers look elsewhere. But for building near-real-time, minute- or hour-latency data flows without having to hand-code any of the pipelines, those are things we believe we can do a pretty good job at. And is there anything that you have planned for the near to medium future of DataCoral, both from a technical and business perspective, that you'd like to share before we start to close out the show? Yeah. Absolutely. On the technology side, there are always more connectors to build. There are a couple of drivers we are excited about working on for the analytics engines: one is Snowflake, the other is Databricks Delta. And on the machine learning side, we are looking at SageMaker as another system to integrate with. On the business side, we are of course trying to grow the business. We have a bunch of customers, and we are talking to a bunch of other companies that could become customers. As a venture-backed company, as you can imagine, we raised our Series A about six months ago and will probably be raising again around the middle of next year. So we are tracking the business to get the right kinds of customers, get to a repeatable sales model, and build out a big enough pipeline for the next stage of the company.
[00:49:34] Unknown:
And one last question I have before we start to move on: I'm curious how much effort you anticipate if and when you begin to support other cloud providers beyond Amazon?
[00:49:54] Unknown:
Yeah. So clearly, that is a question that has come up quite a bit. There have been solutions, or services, that say they are cloud agnostic or multi-cloud, but in most of those cases they end up not leveraging everything a particular cloud has to offer. That means they have to rebuild, or start from scratch on, a bunch of services that those clouds already offer as a platform as a service. So we believe we need to be cloud-best rather than cloud-agnostic. Right now we are all in on AWS because it's the biggest cloud provider, it has a lot of features that are pretty enterprise ready, and they're pretty solid. But the concepts remain the same: the high-level language, the way to build things in a serverless manner, all of that is conceptually the same. So when we take on the next cloud, like Azure or GCP, we will leverage the same concepts. In terms of leveraging the specific cloud, though, we look very closely at what that cloud has to offer, so that the interface we provide to our customer stays the same, which is making the shape and semantics of data front and center, while underneath we really invest in that cloud to truly leverage everything it has to offer. The simple economic reason is that if a cloud provider has a 50- or 100-person team building out a platform service that you can use, why not use it? Just because another cloud does not provide it does not mean that, in that cloud, it is not the right way to do things.
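To illustrate the "cloud-best, not cloud-agnostic" idea in code, here is a minimal hypothetical sketch: the interface the customer programs against stays constant, while each implementation leans on the managed services of a specific cloud. The class and method names, the dt partition column, and the psycopg2-style connection are assumptions for illustration, not DataCoral's actual API.

# Minimal sketch of "cloud-best, not cloud-agnostic": one interface,
# cloud-specific implementations underneath. Names are hypothetical.
from abc import ABC, abstractmethod

class MaterializedView(ABC):
    """A time-partitioned table defined by its shape and a query."""

    @abstractmethod
    def refresh(self, partition: str) -> None:
        """Recompute one time partition of the view."""

class RedshiftMaterializedView(MaterializedView):
    # AWS-flavored implementation: push the work down to Redshift rather
    # than re-implementing a query engine the cloud already provides.
    def __init__(self, connection, name: str, sql: str):
        self.connection = connection  # assumed DB-API connection, e.g. psycopg2
        self.name = name
        self.sql = sql

    def refresh(self, partition: str) -> None:
        with self.connection.cursor() as cur:
            # Replace the partition idempotently: delete it, then re-insert
            # the rows produced by the view's defining query.
            cur.execute(f"DELETE FROM {self.name} WHERE dt = %s", (partition,))
            cur.execute(
                f"INSERT INTO {self.name} SELECT * FROM ({self.sql}) q WHERE dt = %s",
                (partition,),
            )
        self.connection.commit()

# A BigQuery- or Synapse-backed class would implement the same interface
# against GCP or Azure services without changing any caller code.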
[00:51:23] Unknown:
Alright. And are there any other aspects of the work that you're doing at DataCoral, or any of the other topics we discussed today, that you think we should cover further before we start to close out the show?
[00:51:36] Unknown:
Yeah. Absolutely. I think what would be really interesting for me to understand from your audience is whether they find that these kinds of abstractions we're talking about actually make sense. Of course, until about six months ago we were in stealth, and since then we have not really talked that much about ourselves. That is going to change, and doing this podcast is a good first step. Our belief is that data practitioners should be thinking more in terms of the shape and semantics of data instead of the plumbing. So we'd love to chat more as we get feedback, from both our own customers and your audience, about what they think of providing these kinds of abstractions, as well as the serverless implementations that we have done.
[00:52:08] Unknown:
And for anybody who does have any feedback or additional questions for you, I'll have you add your preferred contact information to the show notes. As a last question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:52:32] Unknown:
So in my mind, the gap is not in the tooling; the gap is in a common thread, or a common layer, that knows how to make all of these tools work well together. There are choices for all kinds of things you want to do, starting from the stuff that we do, which is collecting and organizing data, to harnessing data, data governance, and data cleaning. Multiple people have published periodic tables of all the different companies that offer different features, or different parts of whatever is needed for data management, but it seems like there's no common layer that somebody can look at and say, okay, this is what data management actually means, this is how data is flowing, this is what the data quality is, and so on. So, at the risk of adding one more piece of noise, what we have said is that we want to be that layer, but we are really far from accomplishing that goal.
[00:53:23] Unknown:
Alright. Well, thank you very much for taking the time today to talk about the work that you're doing at DataCoral. It's definitely a very fascinating platform, and there are a number of ideas you've incorporated into it that I think bear further exploration. So thank you very much for that, and I hope you enjoy the rest of your day. Yeah. Thank you so much.
Introduction to Raghu Murthy and DataCoral
Raghu's Journey in Data Management
The Genesis of DataCoral
Data-Centric Approach in DataCoral
System Architecture and Slices
Managing Schema Transformations and Resource Constraints
Customer Onboarding and Use Cases
Challenges in Building DataCoral
When DataCoral is Not the Right Choice
Future Plans for DataCoral
Closing Thoughts and Feedback