Summary
As data engineers, the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data transformations that we write and maintain. Sean Knapp founded Ascend to address the operational challenges of running a production-grade and scalable Spark infrastructure, allowing data engineers to focus on the problems that power their business. In this episode he explains the technical implementation of the Ascend platform, the challenges that he has faced in the process, and how you can use it to simplify your dataflow automation. This is a great conversation for understanding all of the incidental engineering that is necessary to make your data reliable.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
- Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Sean Knapp about Ascend, which he is billing as an autonomous dataflow service
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what the Ascend platform is?
- What was your inspiration for creating it and what keeps you motivated?
- What was your criteria for determining the best execution substrate for the Ascend platform?
- Can you describe any limitations that are imposed by your selection of Spark as the processing engine?
- If you were to rewrite Spark from scratch today to fit your particular requirements, what would you change about it?
- Can you describe the technical implementation of Ascend?
- How has the system design evolved since you first began working on it?
- What are some of the assumptions that you had at the beginning of your work on Ascend that have been challenged or updated as a result of working with the technology and your customers?
- How does the programming interface for Ascend differ from that of a vanilla Spark deployment?
- What are the main benefits that a data engineer would get from using Ascend in place of running their own Spark deployment?
- How do you enforce the lack of side effects in the transforms that comprise the dataflow?
- Can you describe the pipeline orchestration system that you have built into Ascend and the benefits that it provides to data engineers?
- What are some of the most challenging aspects of building and launching Ascend that you have dealt with?
- What are some of the most interesting or unexpected lessons learned or edge cases that you have encountered?
- What are some of the capabilities that you are most proud of and which have gained the greatest adoption?
- What are some of the sharp edges that remain in the platform?
- When is Ascend the wrong choice?
- What do you have planned for the future of Ascend?
Contact Info
- @seanknapp on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Ascend
- Kubernetes
- BigQuery
- Apache Spark
- Apache Beam
- Go Language
- SHA Hashes
- PySpark
- Delta Lake
- DAG == Directed Acyclic Graph
- PrestoDB
- MinIO
- Parquet
- Snappy Compression
- Tensorflow
- Kafka
- Druid
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from terabytes to petabytes of analytic data.
He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. And having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it's storing too much data. ChaosSearch frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don't forget to thank them for supporting the show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.
For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the Data Orchestration Summit and Data Council in New York City. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey. And today, I'm interviewing Sean Knapp about Ascend, which he is billing as an autonomous data flow service. So, Sean, could you start by introducing yourself? Yeah. Hi. Thanks for having me, Tobias. I'm Sean Knapp, the founder and CEO of Ascend.io. And do you remember how you first got involved in the area of data management?
[00:03:09] Unknown:
I do. It was actually at the very start of my career. This was back in 2004. I had just graduated college as a computer science major and ended up starting at Google on the front end engineering team. And part of our big mandate was to do a lot of experimentation with the various sort of usability factors on web search, really trying to get better engagement with our users. And a lot of my academic background was around cognitive science and human computer interaction. And what I found very quickly was, in experimenting with all these usability factors with our users, I ended up spending far more time writing data pipelines to analyze their usage, because, you know, when you're at Google scale, simply trying to figure out what your users did yesterday required pretty remarkably sized infrastructure and sophistication to inform what you should do next and the efficacy of what you had done before.
And so very quickly, I went from being a front end engineer to specializing deeply in data engineering and,
[00:04:21] Unknown:
data science. And how did you find that difference in terms of the tool sets going from a front end engineer to dealing with a lot of back end data processing?
[00:04:31] Unknown:
It was incredibly different in that, you know, front end is a very visual experience. You have what I think are fairly mature tools and technologies. When we went into a lot of the data processing domain, what I found was the raw capabilities, the ability to store and process data at incredibly large volumes, were quite mature. The rest of the tooling and ecosystem around that wasn't as advanced. And that was sort of my first experience in data management and pipelines in general: it's very easy to write a pipeline to do a thing, but to write many pipelines that depend on each other and interconnect and do more sophisticated and advanced things together really became a lot of
[00:05:23] Unknown:
that challenge that I observed really early on in my career. Yeah. It's definitely interesting how a lot of these tools have developed from, as you said, being very technically capable, but they leave a lot of sharp edges that you can harm yourself on if you don't know exactly what you're doing and have a lot of background in the space. And it's interesting to see how the industry has been progressing to add a bit more polish and safety features into these systems to make them easier to approach. Yep. I wholeheartedly agree. And I think it's something that we see in the natural
[00:05:57] Unknown:
evolution of many technology domains. And I think that this is really sort of the big encumbrance we find today and the thing that makes life as a data engineer more challenging. To put it simply, just trying to maintain these powerful yet brittle and finicky technologies is really becoming
[00:06:19] Unknown:
one of the larger pain points of modern day data engineering. So can you start a bit by explaining what the Ascend platform is that you've been building?
[00:06:28] Unknown:
Yeah, I'd be really happy to. So, you know, about four years ago, we sat down and took a look at the data ecosystem and said, look, we've been building complex analytics and machine learning and data systems for really the last, at that point, 11 or 12 years of my career. And one of the things that I really wanted to try and tackle was how do we make our lives as data engineers easier, more effective, and simpler. We felt like we were constantly reinventing the wheel, constantly being paged at 3 in the morning. There must be something better. And so the sort of core thesis behind Ascend was: what if we could create a new technology, something that's not a storage or a processing system, but instead a control system, similar to what we've seen in other industries and other domains?
For example, like how infrastructure now has Kubernetes, a declarative model for infrastructure with an intelligent control plane that orchestrates the underlying infrastructure. Could we do something like that for data pipelines? And so we spent a good bit of time really architecting what a declarative model would look like for data pipelines, and whether we could architect an orchestration system that pairs both declarative configurations and code with an intelligent control plane to automate a lot of the operation, performance tuning, and scaling of data pipelines that
[00:08:10] Unknown:
to date we have had to simply manually code. And can you talk a bit more about some of the background and origin story and your inspiration for creating it in the first place, and some of the aspects of the problem space that you're working in that keep you motivated? Absolutely. So one of the things that was really motivating is,
[00:08:29] Unknown:
you know, a few years before I started Ascend, I spent a lot of time with our engineering teams building these pipelines where, simply put, we felt that we could describe what were fairly big, hefty pipelines with a handful of SQL statements. Yet we found we were writing a ton of code and dealing with incredible pains of maintaining those systems due to the intricacies of the data and challenges like late-arriving data, deduplicating data, and failures in the underlying systems. And so we took a step back and said, well, there's an innumerable set of heuristics and problems that we encounter here. And when we think about it from an academic perspective, we have great processing technology, and we have great storage technology, but we don't have a pipeline engine. You know, when I use a database or a data warehouse, there's a query planner and a query optimizer that run and optimize within that database engine. We don't get the same thing with pipelines today; we're, in essence, reimplementing database query planner logic at every single stage of every single pipeline, and this seems like a really interesting and incredibly hard challenge to solve.
But if we can solve that, we can truly introduce a new wave of data engineering and get ourselves out of this data plumbing business and far more into the data architecting business. And so it was really exciting to think through what we could actually enable across the ecosystem if we got
[00:10:07] Unknown:
people out of the muck. And my understanding is that the foundational layer for the platform that you've built is using Spark. So I'm wondering if you can talk a bit about your criteria for selecting the execution substrate for the platform and some of the features of Spark that lent themselves well to the project that you were trying to build? Yeah. That's a great question. And so interestingly enough, when we originally started,
[00:10:34] Unknown:
Ascend, we started with BigQuery as the execution engine. We started with a simple SQL dialect, because it was entirely declarative in nature, and we knew that BigQuery was a very scalable engine. And, obviously, as a result, we operated purely in the Google Cloud environment. The idea behind this was, in those early stages, to really just prove out the concept of this declarative pipeline engine and orchestration layer. And as we proved that out and found that we could really solve some pretty powerful and compelling challenges, we then started to change our focus to how do we make this extensible to multiple compute engines? How do we make this extensible to multiple cloud environments?
And that's where we started to do a lot of our research. We looked into Beam. We looked into Spark. We looked into a handful of other technologies, and also interviewed a lot of our friends and customers and so on across spaces and industries. We found that Spark was simply the most popular. It was one that we had the most expertise and understanding in house, and it was one where we felt we could really provide a multi-cloud, multi-platform advantage. And for developers, one of the things we also found was there was a strong desire to still build for Spark, just to not have to deal with the brittleness and the finickiness of it, per se. And so we actually found some really cool capabilities in the approach of not only can we manage a Spark infrastructure for you, but we can remove a lot of the scaffolding around using Spark and really focus a lot more of the engineering time on data frame transformations and logic as opposed to parameterization and tuning and tweaking.
So we found we could get not just this multi-cloud and multi-platform benefit, but really could expose Spark to our users in a way that was really compelling. And I'm curious
[00:12:39] Unknown:
what are some of the sharp edges and limitations of Spark that you've run into in the process of building on top of it? And if you were to rewrite Spark from scratch today to fit your particular set of requirements, are there any aspects of it that you would change?
[00:12:57] Unknown:
Yeah. I'd say, you know, there's a bunch of tweaks and nuances we've done over the course of this. For example, as we became HIPAA compliant, we had to do quirky things with how you manage both encryption and compression of data as it's being stored and in transit, things that just weren't quite properly supported there. But I'd say these are always just the small sharp edges you find. The really big, interesting one that I would love to see Spark fully embrace, and they're doing a lot more with this in 3.0, is native Kubernetes support. We're really big fans of running Spark on K8s.
We actually have been running K8s as an underlying infrastructure since January of 2016, and all of our Spark usage today runs on elastic Kubernetes infrastructure. And so really continuing to invest in that tight connection between Spark and Kubernetes has been an area of extreme interest for us. So can you describe a bit more of the technical implementation
[00:14:09] Unknown:
of Ascend?
[00:14:11] Unknown:
Yeah. The technology itself really works at a couple of different layers. You know, at the infrastructure layer, we've designed it to run on all three clouds, that being Amazon, Azure, and Google. And as a unified infrastructure layer, we run two Kubernetes clusters. One is for what we call our control plane. That is all of our microservices that operate at the metadata layer. It's about 15, maybe 20 microservices now, a combination of Node, Golang, and Scala services that mostly talk gRPC to each other and build a pretty cohesive model of what's going on in the system. And I can dive more into that. And then the data plane is the other Kubernetes infrastructure.
That's elastically scaled on spot and preemptible instances, and it runs both Spark on Kubernetes, for a lot of our Spark infrastructure, and also runs workers. These are, essentially, auto-scaled, Go-based workers that we use for a lot of processing that sits outside of Spark, where the shape and model of the work required fits better into a custom set of work that's run directly on Kubernetes as opposed to in Spark. But both of those run inside of this elastic compute infrastructure.
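To make the Spark-on-Kubernetes piece a little more concrete, here is a minimal sketch of what pointing a PySpark session at a Kubernetes-backed data plane can look like. The master URL, namespace, container image, and executor sizing are illustrative placeholders, not Ascend's actual configuration.

```python
from pyspark.sql import SparkSession

# Placeholder master URL, namespace, image, and sizing; executors are
# scheduled as pods on an elastic (spot/preemptible) node pool.
spark = (
    SparkSession.builder
    .appName("dataflow-transform")
    .master("k8s://https://kubernetes.default.svc:443")
    .config("spark.kubernetes.namespace", "data-plane")
    .config("spark.kubernetes.container.image", "registry.example/spark-py:3.0.0")
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

# Once the session exists, transforms look like any other PySpark job.
df = spark.createDataFrame([("purchase", 10), ("refund", 2)], ["event_type", "amount"])
df.groupBy("event_type").count().show()
```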
[00:15:35] Unknown:
And what are some of the ways that the implementation details and the overall system architecture have evolved since you first began working on it, and some of the assumptions that you had going into the project?
[00:15:53] Unknown:
Yeah. I'd say, you know, one of the key things as part of this notion of declarative pipelines is that the core engine that operates on everything is this control plane. The control plane's responsibility is to take this blueprint that is the output of the data architecture and continually compare it against the data plane, essentially what has already been calculated and what exists today. And that control plane has to answer the questions of: does what already exists in the data plane reflect the blueprint at the logic plane? If it doesn't, what needs to be regenerated or updated or deleted, and how do I dynamically do this?
So when we first architected the system and that control plane, the idea behind this was, well, let's inspect the entire graph. So you apply a bunch of compiler theory, and you do a bunch of things like SHA not just the data, but all the transforms that we're performing, recursively all the way down to the data SHAs, to try and rapidly determine: have we done the appropriate work, or do we need to do new work? But what we found is even at scale, this becomes hard. And so we started to invest a lot of time and energy in answering the question of, what happens if I'm tracking not just 10 or even a 100,000 transforms that may have millions of partitions of data? What if I'm tracking hundreds of millions or billions of partitions of data? Not just records, but actual individual files.
How can I, within a second or two, rapidly determine whether or not my blueprint actually matches what exists in the underlying storage layer? And so this was one of these huge areas of investment, where we spent well over a year with a big chunk of the team building the next generation of our control plane to be able to do that: essentially, inspect this massive blueprint of data and the underlying physical state of data, and within a matter of a second or two, tell you what new work has to be done. And that was probably one of the biggest undertakings that we've had to go through as a company.
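As a rough illustration of the "SHA everything" idea described above, here is a minimal sketch, not Ascend's implementation, of content-addressed fingerprints: a partition's identity is derived from its transform code plus the fingerprints of its inputs, so a control plane can check whether materialized output still matches the blueprint without recomputing anything.

```python
import hashlib

def partition_sha(transform_code: str, input_shas: list) -> str:
    """Stable identity for a partition: SHA of its transform plus its inputs' SHAs.

    If the code or any upstream partition changes, the SHA changes, which tells
    the control plane that any cached output no longer matches the blueprint.
    """
    digest = hashlib.sha256(transform_code.encode("utf-8"))
    for sha in input_shas:
        digest.update(sha.encode("utf-8"))
    return digest.hexdigest()

# A tiny metadata index standing in for "what already exists in storage".
materialized = {}   # partition SHA -> storage path

def needs_work(transform_code: str, input_shas: list) -> bool:
    """True only if this exact transform over these exact inputs was never computed."""
    return partition_sha(transform_code, input_shas) not in materialized

raw_sha = partition_sha("read:s3://example-bucket/raw/2019-12-01", [])
materialized[raw_sha] = "s3://example-bucket/.store/raw/2019-12-01"

clean_code = "SELECT * FROM raw WHERE amount > 0"
print(needs_work(clean_code, [raw_sha]))   # True: compute and commit it
materialized[partition_sha(clean_code, [raw_sha])] = "s3://example-bucket/.store/clean/2019-12-01"
print(needs_work(clean_code, [raw_sha]))   # False: reuse the cached partition
```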
[00:18:15] Unknown:
And in terms of the interface that an end user of the Ascend platform would be interacting with, how does that compare to the Spark API, and what are some of the additional features and functionality that you've layered on or changes in terms of the programming model that's available?
[00:18:34] Unknown:
Yeah. The interfaces themselves, from a sort of tactical perspective, are similar in the sense that there's command lines, there's SDKs, there's APIs. We also offer a really rich UI experience too, where you can navigate the entire lineage and dependency of all the various data sets, see where they came from, and get the profile of data on them and so on. I'd say the mental model, however, is slightly different. When we think of how our data flows work, they are separated from the execution layer in the sense that there's no job that is run. Right? So when you send something to Spark, you're saying run this job.
When you send something to Ascend, you're saying make it so. And so the fundamental approach is, you can send declarative instructions to Spark, but it's contained within a task, or really a job itself, not a task, in Spark's vernacular. And so the idea behind this is, what if you could take not just one job, but your entire world of jobs, and make those all declarative? You know, I would say, if Spark were to have a perpetual query planner and optimizer that looked across all jobs that it had ever done before and understood the storage layer as well as the compute layer, and then the dependencies between those, that's kind of how Ascend looks at it. And so, you know, we've spent very little time trying to optimize how Spark itself runs, but we've spent a lot of time trying to optimize what gets sent to Spark. And so those paradigms are, I think, best summarized as the difference between imperative programming models and declarative programming models.
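A toy contrast between the two mental models follows. The `blueprint` structure and `reconcile` function are invented for illustration and are not Ascend's (or Spark's) API; they only show the shape of declaring a desired dataflow and letting a control loop decide what work is needed.

```python
# Imperative: tell the engine to run a specific job right now, e.g.
#   run_spark_job("clean_events", script="clean_events.py", schedule="0 * * * *")

# Declarative: describe the dataflow you want to exist; a control plane
# continually compares that blueprint with current state and schedules only
# the work needed to close the gap ("make it so").
blueprint = {
    "raw_events":   {"type": "read",      "source": "s3://example-bucket/raw/"},
    "clean_events": {"type": "transform", "sql": "SELECT * FROM raw_events WHERE amount > 0"},
    "daily_totals": {"type": "transform", "sql": "SELECT day, SUM(amount) FROM clean_events GROUP BY day"},
}

def reconcile(blueprint: dict, current_state: dict) -> list:
    """Return the components whose materialized state is missing or stale."""
    return [name for name, spec in blueprint.items() if current_state.get(name) != spec]

# A control loop runs this continuously; nothing is ever "submitted" by hand.
print(reconcile(blueprint, current_state={}))   # all three components need work
```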
[00:20:24] Unknown:
And my understanding is that the way somebody would actually create a desired end state with the Ascend platform is by writing this series of transforms, which you guarantee to be side effect free. And so I'm wondering if you can talk a bit more about that mechanism, some of the ways that you ensure that there are no side effects in the transforms, and some of the challenges in terms of the conceptual model that engineers go through coming from Spark and working with Ascend?
[00:20:56] Unknown:
Mhmm. Yeah, I'd be happy to. So, you know, trying to have transforms be side effect free sort of fits into a couple of categories here. One thing that we did, and this was one of the benefits of starting with SQL, is we can parse that SQL, and we can analyze and understand that SQL, the output schema, the inferred partitioning mechanisms, and really optimize that in harmony with the downstream transforms in SQL that are working on that data. And so part of the ability to avoid side effects is, if you start first with a language like SQL, it makes it much easier for us to always know where a piece of data came from, why it got there, and whether the calculation in that partition is still valid or something upstream changed and it needs to be recalculated. And that was because we could just parse SQL really effectively and understand what that dependency chain is. So that was really where it started.
As we opened up more PySpark and data frame transformation capabilities, it really became a balance of exposing a lot more of the raw horsepower and capabilities while asking users to inform the control plane with enough hints so that you know things like: is this a full reduction? Is it a partial reduction? Is it a straight map operation? We can infer a fair bit from those code snippets, but we do, at the same time, need the developer's sort of architectural assistance to properly optimize the system. Then, at the underlying layers, we've put a lot of work into our storage layer to do things like a lot of optimization around deduplication.
So, for example, before we ever send any task or any job to Spark, we first look and say, well, what's the transformation that is being done, and what sets of data is it being done on? Have we ever done this for any data pipeline anywhere else in our ecosystem, and is it possible to optimize it? Is it already actually sitting in S3 or GCS or Azure Blob somewhere? And if it is, we don't even have to send that work. We can simply leverage the same piece of data. And so that optimization layer, tied with much more of a functional execution model where we never overwrite that individual piece of data but instead introduce a safety layer of atomic commits at the metadata layer, allowed us to ensure that no data was ever propagated unless it passed the integrity checks and was properly committed and became a new update of that model. So, to recap, it was this combination of much more declarative models understanding the developer's intent, which then informs that control plane, with the safeguards of functional programming models tied to safety checks like atomic commits and integrity checks, to guarantee that nothing ever flows through that shouldn't actually be there. And my understanding too is that in addition to the execution layer,
[00:24:22] Unknown:
Ascend has the concept of a structured data lake capability as well. And I'm wondering if you can do a bit of compare and contrast between what you're building there and some of the capabilities that are available through Delta Lake. Yeah. I'd be happy to. So,
[00:24:39] Unknown:
you know, what we did with the Ascend structured data lake is we said, all data that can be managed and exposed via the structured data lake is really the outputs of various pipelines and data flows. And what we can do as part of this is leverage all of the underlying metadata that we collect, where we know how to do things like dynamically partition the data based off of the transforms that are being performed and what we already understand about the profiling of the data. We can also do things like guarantee atomic commits and atomic reads of the data, again based off of really inserting this abstraction layer between where you're accessing the block level, if you will, or the partition level of data within a blob store and the metadata layer, to ensure that level of consistency.
And so for us, our approach around this was really oriented towards: can we ensure that you have a well-formed, well-structured data lake that is directly reflective of the pipelines that operate on it and exposes more of that metadata? What I'd say about Delta Lake, which is a really cool technology too, is that it solves a slightly different problem, which at the core is how do you insert additional data in a way that gets your data lake to behave a little bit closer to how a data warehouse would, where you can get snapshotting and compression of data as you're incrementally adding to those datasets.
It has a handful of other really cool capabilities as well, but I would say that they both, you know, would work in concert with one another, but solve different problems.
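For the atomic commit and atomic read behavior described here, a common pattern, sketched generically below with the local filesystem standing in for a blob store plus metadata service (this is not Ascend's code), is to write the data files first and then atomically swap a single manifest pointer, so readers only ever see a fully committed set of partitions.

```python
import json
import os
import tempfile

LAKE_DIR = "/tmp/structured_lake"
MANIFEST = os.path.join(LAKE_DIR, "_current_manifest.json")

def commit(partition_files: list) -> None:
    """Publish a new set of partition files in one atomic metadata step."""
    os.makedirs(LAKE_DIR, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=LAKE_DIR)
    with os.fdopen(fd, "w") as f:
        json.dump({"files": partition_files}, f)
    os.replace(tmp_path, MANIFEST)   # atomic rename: readers see old or new, never half-written

def read_committed() -> list:
    """Readers resolve the manifest first, so uncommitted files stay invisible."""
    with open(MANIFEST) as f:
        return json.load(f)["files"]

commit(["part-0000.snappy.parquet", "part-0001.snappy.parquet"])
print(read_committed())
```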
[00:26:24] Unknown:
And for a data engineer who is onboarding onto the Ascend platform and looking to complete a particular project, do you have any concrete examples of what the overall workflow looks like and how that fits into some of the broader ecosystem of data platforms?
[00:26:41] Unknown:
Yeah, we do. One of the things that we've opened up that's pretty cool for a data engineer that wants to try Ascend, and we have examples of this now on a weekly basis, is folks literally go to our website. You can get straight access to the product in a trial environment without ever having to put in a credit card or talk to anybody. It's literally instantaneous to start building on the product. And then there's really that experience of how they integrate into their existing systems. We've worked pretty hard to make this seamless, and this is actually one of our metrics and goals, something we're trying to optimize. As you drop into any sort of a trial environment, or if you're already a customer, you simply create one of three core constructs.
It's either a read connector, a transform, or a write connector. And so most users obviously start first with the read connector. You describe where your data is sitting. It could be sitting inside of S3 or Redshift or a Kafka queue. And as you describe that, you literally point Ascend towards it. And Ascend will start actually listing it out and analyzing the data. If it's not Snappy-compressed Parquet files, we're going to convert it and store it internally for you so it's optimized for Spark automatically, and then you can actually just start building against it. You can do things like, we have a capability called queryable data flows, where we expose any stage of any data pipeline like a warehouse, where you can do just dynamic ad hoc queries, and then rapidly convert those to full stages of pipelines called transforms, and vice versa. And so the iteration process becomes really quick as you start to build up a few stages.
And then on the tail end, the way that it works is we make it really easy to either write your data back out to Redshift, BigQuery, S3, GCS, ABS, etcetera, or actually access the data if you're trying to put it into, like, Tableau or Power BI or others. We have SDKs and APIs as well as a URL structure for embedding those straight into other systems to read out the records or the raw byte streams from any data flow. So we've added a lot of these capabilities to try and make it as fast as possible to be able to create that end-to-end hello world
[00:29:01] Unknown:
experience, if you will. And can you also talk a bit more about the pipeline orchestration system that you've built into Ascend and some of the overall benefits that it provides to data engineers as well? Yeah, I'd be happy to. You know, at the pipeline orchestration
[00:29:15] Unknown:
level, the really key benefit of this goes back to what we talked about, which is that shift from an imperative system to a declarative system. The benefit is that we get to automate and offload a lot of the painful pieces of things like, well, what happens when you have late-arriving data? Do you need to shift how you're partitioning that data, or do you have to retract previous calculations, update them, and propagate them through as a result? Things like that end up being handled in a declarative model, as the system itself simply analyzes and detects what data has started to move through and what data has already moved through, and knows the lineage and dependency of those. And so what we found is it's much faster to architect and design these pipelines and get them deployed out at scale. The other big benefit that we've seen is the deduplication factor that we get with a lot of the underlying infrastructure and storage, as well as the pipelines, allows us to do things like rapid branching of pipelines that don't require reprocessing. So if you take an existing pipeline that may be running in production, you can simply branch that like you would branch code and only modify bits and pieces of it. And the only reprocessing that's done, even if you're developing on it, is really just the deltas, the changes to that dataset and that pipeline, versus having to reprocess everything. Then I'd say the third piece, which is also related, is that we remove a lot of the scheduling burden.
You used to have to set, like, I want to monitor these datasets at these times with these timers and triggers and check for these parameters. And even as you're gluing together different DAGs of pipelines, there was a huge scheduling burden to actually pair all these together. What we figured out was we can actually remove a lot of that and simply make it far more, hence the name, dataflow-esque, based off of the data models as opposed to a scheduling and trigger based model. For people who have existing
[00:31:24] Unknown:
investment in the overall Spark ecosystem and existing Spark jobs, is there a simple migration capability that you have available as far as being able to translate their existing jobs to the new programming model or run them directly on Ascend while they work on remapping them to the paradigms that you are supporting, or is it something that they would have to do piecemeal, where they just do direct migrations from the existing infrastructure and then do the translation to deploy onto Ascend? Yeah. That's a really good question.
[00:31:57] Unknown:
We've definitely worked with a bunch of customers to really accelerate that migration path. One of the really cool things that we found is that ability to both simplify the code as it migrates over and also optimize a lot of the pipelines themselves. In one of the best examples, actually a case study on our website, we were able to, as part of the migration process, cut out 98% of the code required for a particular use case, simply because so much of it was the scaffolding around spinning up Spark and managing the parameterization and tuning, versus the actual architecture of the data and the system itself.
And so what we find with a lot of our customers as they start to build on Ascend is that ability to either integrate into their existing pipelines or rapidly migrate them over to Ascend in a much simpler declarative model. The other piece that I think is super important to highlight is, we are really big believers that your tech stack isn't gonna have just any one piece of technology. Right? Like, you're gonna run some stuff with this, and you're probably gonna run some of your own Spark stuff, or Presto, or some other Hive or Hadoop cluster. And as a result, this is one of the reasons why we launched the notion of this structured data lake: hey, let's actually give you the ability to point your own Spark jobs, your own Hadoop jobs, or even notebooks directly at the underlying optimized internal storage layer of Ascend so that we can really plug in just like any other piece of your infrastructure. It's just an S3-compatible interface at the end of the day,
[00:33:43] Unknown:
that can hook into any of the rest of your tech stack. And my understanding is that for providing that S3 interface, you're actually using the MinIO product for the gateway interfaces on top of the non-S3 storage systems. So I'm wondering if you can talk a bit more about that and some of the other technical specifics of the structured data lake as far as the file formats or schema enforcement that you have in place. Yeah. So the MinIO
[00:34:11] Unknown:
gateway is super cool. Like, we're really big fans of the technology. What we essentially did was, every environment that we stand up has that gateway running. And the sort of classic MinIO approach is to map an S3-compatible interface to Azure Blob Store, Google Cloud Storage, or a Hadoop file system, or a plethora of others, but really sort of preserve the same file path structure, more or less. And what we did that was really fun was we grabbed that and created a different handler inside of MinIO that, rather than mapping directly back to any particular blob store, first talks to our control plane. And the idea behind this is that we do all this advanced deduplication of data and jobs and tasks, not too dissimilar from how you would see, like, a network file system do block-level deduplication.
You have to actually construct a virtual file system based off of the metadata itself. And so as you're listing out the data services and the data flows and all the underlying datasets that you have access to, that MinIO gateway is actually talking to our control plane and saying, hey, what is the actual file path structure and system in place? It's even listing the underlying partitions of a particular dataset itself, maybe digging into all sorts of different directories and structures that are just optimized, that are shared across a lot of the sort of upper component-level model, but optimized for dedupe. And so that communication model required a bunch of optimizations really tuned for performance and so on, but it gives you, in essence, a really consistent and elegant S3-compatible data lake. I'd say the interesting way that we expose that data today is we give you the straight Snappy-compressed Parquet files that we actually pull into Spark and process and move around on our own. But we do also give you the ability to stream out those records through an HTTP or HTTPS interface, in CSV or JSON or other formats as well, if you want to dynamically convert those to feed an application. So we've taken both the sort of low-level access of just raw Parquet files at the S3 interface as well as a sort of more application-level API to get the JSON or CSV or other formats as well.
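Because the structured data lake is exposed through an S3-compatible gateway serving Snappy-compressed Parquet, any standard S3 client can read it. Here is a hypothetical sketch using pandas with s3fs; the endpoint URL, bucket, path, and credentials are placeholders rather than real Ascend values.

```python
import pandas as pd

# Placeholder endpoint, bucket/path, and credentials; requires the pyarrow
# and s3fs packages. Any S3-compatible client would work the same way.
df = pd.read_parquet(
    "s3://example-data-service/example-dataflow/daily_totals/",
    storage_options={
        "key": "EXAMPLE_ACCESS_KEY",
        "secret": "EXAMPLE_SECRET_KEY",
        "client_kwargs": {"endpoint_url": "https://lake.example-host.example"},
    },
)
print(df.head())
```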
And what have been some of the most interesting or challenging aspects of building and launching Ascend that you have dealt with? You know, I think there's been a lot of really interesting and challenging aspects. One is, obviously, there's a pretty broad range of use cases that people have for data pipelines. And, you know, you get everything from people who are like, I have one file that is a truly, truly massive file and dataset that has to be crunched through on its particular cadence, to other folks who literally are generating hundreds of millions of files across their dataset that they're trying to push through Spark. And so the fun part as engineers is to actually take the sort of plethora of different use cases and access models and boil that down into a couple of these reusable patterns that we believe can actually fit multiple use cases at the same time, and sort of refactor the problem scope, if you will, into just a handful of really powerful capabilities that give sort of a new layer and a new platform to build on top of. And what are some of the most interesting or unexpected
[00:37:58] Unknown:
lessons that you've had to learn in the process or edge cases that you've encountered while building Ascend, both from the technical and business perspective?
[00:38:07] Unknown:
Yeah. I'd say, you know, on the edge cases and so on from the technical perspective, we placed a lot of bets really early on in the Kubernetes domain, and that's paid off tremendously well. One of the things that we've had to work hard on is the sort of various levels of Kubernetes support across the cloud providers. So, you know, we will greatly relish the day when we can actually just run on all of the hosted Kubernetes capabilities that are, like, super tuned and optimized and on spot or preemptible instances, as we'd love to get away from having to manage a lot of that. We're not quite there yet. And so having to build up a lot more expertise around managing on the infrastructure side is, for example, one of those areas. I would say on the nontechnical side, or I guess I would say the pseudo-technical side, there's been a really interesting investment on our side in how we have a really high-output, really high-skill-set team on the product and engineering side, where we've created a different model than we've seen with a lot of other startups. We've gone not with the classic agile and scrum methodology, where it tends to be team and codebase centric, but we've shifted, really successfully, to a model that is much more agile based off of the projects themselves. And so we find that teams end up coming together and dispersing very quickly based off of the projects they're working on, really towards some goal and some outcome of a new feature or a new capability that they're building.
And one of the things that we found coming off of this is the output and what we're able to accomplish is really, really remarkable by comparison. The teams find that they're in fewer meetings, they're moving faster, they're launching more features, and probably the coolest part is everybody gets to own big capabilities that they get to drive all the way through. And so we're finding, really across the board, that especially for a fast-moving startup such as ours, that model of product management and software engineering
[00:40:24] Unknown:
has really been far more fruitful and exciting for us. And then in terms of the overall capabilities of the system or business success that you've achieved so far, what are some of the elements that you're most proud of? And in terms of feature sets or capabilities, are there any that have gained the greatest level of adoption?
[00:40:44] Unknown:
Yeah. I'd say, not too surprisingly, the last two capabilities we've announced have been really, really popular. One is queryable data flows. We were actually shocked a little bit when we launched it. We quietly launched it to see if folks would notice, and not too many folks really discovered it inside of the product. But once we actually started to socialize it with our users, we've been shocked at how much they're using this queryable data flow approach. It just makes it so much faster to build pipelines. So we've been really happy to get the metrics off of that. And then the structured data lake has been really awesome. Honestly, it makes it so much easier to connect into so many other parts of the ecosystem, and we're super stoked about that. And then I'd say, honestly, the last part that's been really fun to watch has been we just started to open up a lot more tutorials and how-tos on the product, a bunch natively integrated into the product and a bunch more on our dev portal.
And we're seeing people really gravitate towards using those as a way of testing out the platform and self-serving on it. That's been really cool to watch.
[00:41:58] Unknown:
And in terms of the current state of the system, what are some of the sharp edges that remain, and when is Ascend the wrong choice for a given project?
[00:42:08] Unknown:
Yeah, that's a great and super fair question. You know, I'd say the easiest one that we see a lot of the time is, if your data volume is either moving or evolving slowly enough or, frankly, could fit inside of a data warehouse, I'd honestly recommend you keep your data there. Building pipelines is harder and more challenging. We're working super hard to make it way easier to build pipelines, but we see a lot of folks whose data, honestly, can and should fit inside of a warehouse, and you should do it. And then, you know, I'd say there becomes a point in time where it really makes sense to build pipelines: you're doing more advanced or sophisticated things on your data, or the volumes are getting larger. And, you know, we oftentimes find folks who have massive datasets that definitely don't belong inside of a data warehouse, and things are either crumbling or they're just spending insane amounts of money on it. That's usually probably 6 to 12 months later than they should have started moving to pipelines, but that becomes a really good time to change gears and really look at getting it out of the warehouse, whether it's into Ascend or another technology.
[00:43:16] Unknown:
And then looking forward, what do you have planned for the near to medium term future of Ascend and any trends that you see in terms of the industry that give you inspiration for the long term future of the platform?
[00:43:29] Unknown:
Yeah. There's a couple of trends that we see that I think are gonna be really impactful. You know, one is obviously the move to multi-cloud. We see a lot of interest across the ecosystem in being able to run not just in one cloud, but across all of them, in ways that treat each cloud region or zone really as a micro data center with localized compute and storage and with intelligent movement of data based off of dependencies. And so we see that as a really interesting domain. The other area that we see that is really interesting is, you know, we use Spark heavily, but there's a ton of other technologies out there that open up a whole world of other use cases that are really interesting for us, everything from TensorFlow to Kafka to Druid and many other technologies that are out there. All of these open up different use cases and different capabilities. And this is part of why we architected and designed Ascend this way, really as a control plane. You know, we have the ability to orchestrate across different storage layers and across different processing layers, and analyze and monitor access patterns of how users are tapping into data, and, as a result, can do really optimal things that you ordinarily couldn't do if you were just pulling data off of disk. And so doing things like intelligently building profiles off of how you're querying your data, and perhaps moving it into a Druid or into a more memory-backed storage layer, are things that we can really start to do, extending the underlying compute infrastructure out to the appropriate tools and technologies dynamically.
[00:45:24] Unknown:
So those are both areas that we're really excited about. And are there any other aspects of the work that you're doing at Ascend, or other aspects of the idea of declarative data pipelines, or anything along those lines that we didn't discuss yet that you'd like to cover before we close out the show? No. I'd say the thing that we're really
[00:45:45] Unknown:
observing is, you know, as we start to look into 2020 and see a lot of the sort of big efforts and initiatives that we're all putting into data engineering, the shift from more imperative-based systems to declarative-based systems is something we've seen across a bunch of other technology landscapes. And what we're finding is, for most teams in most companies, as we get more and more use cases, more and more data, and more and more people all contributing to those data pipelines and that data network inside of our various companies, we just get an exponential increase in complexity.
And what has actually helped alleviate the pains, the late-night or early-morning pages that wake us up when we really don't wanna be working, and that complexity we're trying to battle with, is the shift to these declarative-based models, which allows us to just write cleaner, simpler, more compact code and more elegant systems as a result.
[00:46:50] Unknown:
And I think we're gonna see that really become increasingly important as, frankly, data engineering just gets more and more popular. Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I'd say that the biggest gap,
[00:47:16] Unknown:
at the end of the day, is being able to answer the question, or I guess I'll put it as a couple of questions: what data do I have? How was it generated? And where did it come from? Because if we can't answer those questions, it's really hard to build systems that are more automated. It all then falls back on us as engineers. So we have to be able to build technology that can answer those questions. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Ascend.
[00:47:45] Unknown:
Definitely looks like a very interesting platform that's solving a lot of painful pieces of data management in the current landscape. So thank you for all of your efforts on that front, and I hope you enjoy the rest of your day. Thanks so much. You too, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Sean Knapp: Introduction and Background
The Ascend Platform: Concept and Development
Technical Implementation and Architecture
User Interface and Programming Model
User Workflow and Integration
Pipeline Orchestration and Migration
Challenges and Lessons Learned
Future Plans and Industry Trends
Closing Thoughts and Final Questions