Summary
Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrassingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy-to-use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open source data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects then this is definitely a conversation worth listening to.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Airbyte is and the story behind it?
- Businesses and data engineers have a variety of options for how to manage their data integration. How would you characterize the overall landscape and how does Airbyte distinguish itself in that space?
- How would you characterize your target users?
- How have those personas instructed the priorities and design of Airbyte?
- What do you see as the benefits and tradeoffs of a UI oriented data integration platform as compared to a code first approach?
- What are the complex/challenging elements of data integration that make it such a slippery problem?
- motivation for creating open source ELT as a business
- Can you describe how the Airbyte platform is implemented?
- What was your motivation for choosing Java as the primary language?
- incidental complexity of forcing all connectors to be packaged as containers
- shortcomings of the Singer specification/motivation for creating a backwards incompatible interface
- perceived potential for community adoption of Airbyte specification
- tradeoffs of using JSON as interchange format vs. e.g. protobuf/gRPC/Avro/etc.
- information lost when converting records to JSON types/how to preserve that information (e.g. field constraints, valid enums, etc.)
- interfaces/extension points for integrating with other tools, e.g. Dagster
- abstraction layers for simplifying implementation of new connectors
- tradeoffs of storing all connectors in a monorepo with the Airbyte core
- impact of community adoption/contributions
- What is involved in setting up an Airbyte installation?
- What are the available axes for scaling an Airbyte deployment?
- challenges of setting up and maintaining CI environment for Airbyte
- How are you managing governance and long term sustainability of the project?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Airbyte used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Airbyte?
- When is Airbyte the wrong choice?
- What do you have planned for the future of the project?
Contact Info
- Michel
- @MichelTricot on Twitter
- michel-tricot on GitHub
- John
- @JeanLafleur on Twitter
- johnlafleur on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Airbyte
- Liveramp
- Fivetran
- Stitch Data
- Matillion
- DataCoral
- Singer
- Meltano
- Airflow
- Kotlin
- Docker
- Monorepo
- Airbyte Specification
- Great Expectations
- Dagster
- Prefect
- DBT
- Kubernetes
- Snowflake
- Redshift
- Presto
- Spark
- Parquet
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting. It often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. Your host is Tobias Macey. And today, I'm interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines. So Michel, can you start by introducing yourself?
[00:02:04] Unknown:
So I'm Michel. I have been working in the data industry since I started my career in 2007, at first more on the financial data side. And in 2011, I moved to the US and actually started in this company called LiveRamp, which is today a public company. And over there, I was running all the integration teams, which was, like, 30 people. And we were basically powering all the data exchanges from LiveRamp and to LiveRamp. So we're talking hundreds of terabytes of data that were delivered on a daily basis. So data integration has been, like, one of my core competencies since I started, yeah, working.
[00:02:45] Unknown:
And, John, how about you? Airbyte is actually my fourth startup. I was into B2Cs and into dev tools. But the latest startup before Airbyte was a software engineering management platform that sits on top of all the dev tools. So we had to build all those ETL pipelines, like six, seven, before we could bring any value, and it was a mess. So that's when I got into data. And Michel and I, we've known each other for seven years, and we knew we wanted to work together. So my latest startup, this one, well, didn't end well. The two first ones were exited. But, at that point, we decided to do something about it.
[00:03:26] Unknown:
And so that brings us to the project that you're working on now, which is Airbyte. So I'm wondering if you can give a bit of background about what it is that you're building there and some of the story behind it and how it got started.
[00:03:37] Unknown:
Yeah. So one thing that we've discovered over the years, and John in his last experience, is that building integrations is hard. Building it, like, technically is easy, but the complexity comes from, like, the maintenance of it. And it's a problem that is taking a lot of people's time in every company that we've talked to, every company that we've been at. And we felt like everybody is redoing the same thing over and over again. Like, you have, like, a hundred Stripe connectors that exist in the world, even more. And what we thought at that point is we want to be able to better leverage the human effort on, like, providing these connectors, so anybody can use them and contribute to them and cover this long tail of integrations.
So while we were investigating for that project, we actually talked to customers of existing solutions like Fivetran, Stitch Data, or Matillion, and each one of them was actually building a parallel system to cover integrations that were not supported or that were not behaving the way they wanted. And that's what really motivated John and me to go for, like, an open source approach where people can address the long tail, and we can work with the community on that.
[00:04:53] Unknown:
And as you mentioned, data integration is something that you would think would be a solved problem because of how long it's been a problem and how many different efforts have been made to try to address it. But I'm wondering if you can just talk through some of the landscape of data integration and some of the issues that exist in the different solutions and how Airbyte is aiming to distinguish itself in that space.
[00:05:20] Unknown:
You have three different options. You have the closed source, cloud based ones, like Fivetran or Stitch Data. The issue here is that, well, they will never be able to really cover the long tail of integrations because, as Michel mentioned, the issue is really about maintaining the connectors. So if you're closed source and cloud based, you will always have this ROI consideration to support a new connector. So after eight years, when you look at Fivetran, they have 150 connectors. So that's why we talk to their customers; right now, we have 500 companies that tested us, and so we talked a lot with them. A lot of them are using Fivetran, but they need other connectors as well. So that's one of the issues with closed source. And with cloud based, there is also data privacy, which is not a first class citizen.
The other type of company is closed source and self hosted, like Matillion or DataCoral. So they solve the self hosted problem, but not the closed source one. And it's really a top down sales cycle in that case. You cannot have just, like, a bottom up approach where you have a data engineer that needs to fix any connector, build any connector, and just start using them. And that's where the open source part comes in. And in open source, you have Singer, but the issue is, I don't know if you've seen that, but Talend purchased Stitch Data, who is the owner of Singer, and they stopped investing in it. And Singer is also a lot of repos. There's less maintenance.
There's not much standardization. And at that point, you see a lot of their taps that are going out of date, and that's where we come in. As the open source option, we wanna standardize the way it's being done. And open source also enables us to address other use cases that closed source cannot address, like databases, certifications, these kinds of things. Yeah. I've definitely been aware of the Singer spec for a while now and kind of saw some of the initial
[00:07:18] Unknown:
interest around it. I know that there was never really a great way to find out information about how you actually implement it and tie things together and manage the overall deployment and monitoring of it. And I know that the Meltano project has recently pivoted to try to be kind of the de facto way of using the Singer taps and targets and trying to level up the overall Singer ecosystem. And then there are also a number of projects like Embulk and Gobblin that have been approaching it in a slightly different way where there's sort of the monolithic core with the different plugins that you can add in for sources and destinations, but it's not necessarily as flexible as the Singer specification where you have just this interchange format along the same lines of, like, a UNIX pipe.
And then there are other things. You mentioned Fivetran and Matillion. But then in the previous generations, there were things like SQL Server Integration Services, which was closed source, or the Pentaho suite, which was more of a drag and drop GUI type of approach. And so it's definitely interesting to see the evolutions of the ways that the data integration problem is trying to be solved as the overall data ecosystem continues to grow and change in terms of the best practices.
[00:08:35] Unknown:
Yeah. I mean, one thing that Singer attempted to address was really around, like, building the connector. But building the connector is just the visible part of the iceberg at that point, which is, as I mentioned, like, writing a connector is something that you can do. It takes you a few hours, but the real complexity, and that's why it's still an unsolved problem, is, like, a connector is gonna live. It depends on an external resource, an external system, and this external system is gonna change. And what you want is you want to have a process. Like, a few years ago, when we built the integration team at LiveRamp, we were maintaining, like, a thousand or 2,000 connectors, and we had to have a very, very strict process on how do you test, how do you monitor.
It's not just about writing the connector. It's really everything that goes around it. And, yes, it's good to have an interchange format, but you also need to have a process around maintenance, and that's what is the most important for connectors.
[00:09:36] Unknown:
And so for the Airbyte project, I'm wondering if you can give a bit more context about how you view the target users for it and some of the ways that that persona has helped to guide and inform the way that you have designed the overall system and the interfaces that are available for interacting with it?
[00:09:55] Unknown:
The first user that we're really targeting is the data engineer. It means it's the person that is spending a lot of time maintaining connectors and making sure that data migrations happen correctly. And this is a huge burden for this team. So what we want first is to unburden them by providing them with a working solution out of the box. But this data is not always consumed by data engineers, and that's where we are also thinking about the second category of users that we're targeting. People are becoming more data savvy in organizations. You have more data scientists, data analysts, and more people that need to interact with data. And warehouses like Snowflake and BigQuery have enabled these new roles to become smarter with data. But it's good to have a processing engine for getting insight from data, but first, you need to get the data in. And when we think about our users and how they inform us, a data engineer doesn't want to spend time, like, enabling a new pipeline.
By it being a burden for them, it prevents these other roles from actually leveraging that data. And what we want is to get these two parts, the data consumption and the data production, to work better together and just make them more autonomous.
[00:11:14] Unknown:
In terms of the actual sort of default interface, I know that you have oriented around a UI driven mechanism of being able to set up sources and destinations. And I'm wondering if you can discuss some of the benefits and trade offs of that approach as opposed to a more just sort of textual approach where it's code native and everything goes into source code.
[00:11:38] Unknown:
Yeah. As I mentioned, the thing is, organizations and people in organizations are more data savvy. And these people are not always technical, and what they understand is a UI. So first, the UI is more like, how can we provide value as quickly as possible? How can we make them autonomous as quickly as possible? Now the way we're also thinking about it is it has to integrate well with the data infrastructure that is already present in these teams, in these data teams. And that's why right now it's very focused on the UI, but behind the scenes, it's also powered by an API.
And in the end, like, with this API, we will be able to have a more, like, textual way of configuring Airbyte and running data replication on Airbyte. But it's really about getting 80% of the value as quickly as possible at that point.
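For teams that prefer a code-first workflow, the API mentioned here is the natural hook. The sketch below is illustrative only; the endpoint paths, payload fields, and port are assumptions rather than a documented contract, so check the API reference of the version you run.

```python
# Rough sketch of driving an Airbyte-style deployment through its HTTP API
# instead of the UI. Endpoint paths, payload fields, and the port are assumed
# for illustration; consult the API reference of the version you run.
import requests

AIRBYTE_API = "http://localhost:8000/api/v1"  # assumed local deployment


def create_source(workspace_id: str, source_definition_id: str, config: dict) -> str:
    """Register a source (for example a Postgres database) and return its id."""
    resp = requests.post(
        f"{AIRBYTE_API}/sources/create",
        json={
            "workspaceId": workspace_id,
            "sourceDefinitionId": source_definition_id,
            "connectionConfiguration": config,
            "name": "my-postgres",
        },
    )
    resp.raise_for_status()
    return resp.json()["sourceId"]


def trigger_sync(connection_id: str) -> dict:
    """Kick off a sync for an existing source/destination connection."""
    resp = requests.post(
        f"{AIRBYTE_API}/connections/sync",
        json={"connectionId": connection_id},
    )
    resp.raise_for_status()
    return resp.json()
```

Wrapping calls like these in version-controlled configuration is one way to get the more textual workflow without waiting for a dedicated tool.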
[00:12:38] Unknown:
And so before we dig into the technical aspects of Airbyte, I'm wondering if you can give a bit more background as to what your motivation is and how you view the business opportunities of creating this open source platform and the overall ecosystem that is available for growth in the data integration market?
[00:13:01] Unknown:
So when we started, as Michel mentioned, we tried to talk to as many of Fivetran, Stitch, and Matillion's customers as possible, and we wanted to see some patterns. So that's how we learned that closed source cloud based wouldn't really fix the problem, but only an open source one would. So our first goal is really to become the standard for open source. That's our goal for 2021. So we won't be focusing on any monetization-related features until 2022. And at that point, what we see is, if we become the standard, then we can have several business model options. The first one could be the standard open core model where any feature that addresses the need of an individual contributor should be open source. So that includes connectors, but anything that addresses the need of the company could be licensed. And that's where we're thinking of a cloud based control panel. Your data stays in the data plane, so in your infrastructure, we won't have access to it. Where we can provide an SLA, for instance, or any enterprise features such as data quality, privacy compliance features, SSO, user access management, these kinds of things. And that's one option that we see; there would be a lot of business in it. And we're completely fine with, like, 90% or 95% of our users just using the community edition.
We're very happy if we make a change and we help a lot of companies, and the impact of the company is much more than our revenues. And the second business model is more what we call powered by Airbyte, where we can power all your connectors with our API, and you offer those connectors to your own clients. So you're in charge of your UI, and you integrate with our API, and we power your connectors in the back end.
[00:14:59] Unknown:
Digging a bit more into the actual implementation of Airbyte, can you describe how the overall system is architected and some of the ways that the approach or the overall goals of the platform have grown or evolved since you first began working on it?
[00:15:15] Unknown:
Yeah. And I just want to go back on one thing regarding the UI, the API, and a more, like, descriptive way of configuring a data pipeline. If you're thinking of how AWS started, they started first as just a UI, then they provided an API, and then you had tools like Terraform that went on top of it that leverage the API behind the scenes. And we see that as a nice trend on how we can, like, address many types of usages, but starting with the one where you get the value directly. Regarding your question on the architecture, there are two main parts to Airbyte. The first one is the one we call core, which is everything that is related to configuration, everything that is related to our API, our UI, and also the scheduler.
And the piece that also takes care of running the different synchronization and replication processes. And on the other side, we also have the integrations. So all the infrastructure that comes into play to build a solid integration. So it's about how it's being packaged, how it's being tested, how it's being monitored. So we have, like, really these two sides to the project. Yeah, I mean, we've made some choices today around, like, the scheduling because we wanted to get the value as quickly as possible. At the moment, we're talking with several data teams where what they want is they don't want to use our scheduler, which is not, I think, the best in the market compared to an Airflow.
And they want to have Airflow actually schedule and manage all the, like, scheduling and triggering of these replication jobs. So right now what we're doing is we are making our scheduler a lot more mature so that it can interact very well with all these external data systems. So that's one thing we've learned during our journey, and what we're putting a lot of effort on is, like, very deep integration with the rest of the data stack. Same thing, we're also leveraging dbt for a lot of our model transformation. Right now, everything is happening behind the scenes, but what we're seeing with our users is that they want to have access to the dbt models that we leverage to do this transformation so that they can then cascade more transformations and more analysis on these generated models.
[00:17:40] Unknown:
I also know that the core implementation is based on Java and that you also support Python for being able to build some of these connectors. I'm wondering what your decision making process looked like for choosing the overall technology stack and some of the design goals that you had that also led you down the path of using Docker as kind of the first class concern for how to package and deploy the overall system?
[00:18:07] Unknown:
For the languages that we use, so Python for all the connectors is, I think, something that people are very familiar with. And the fact that, for example, the Singer community has been very involved with Python shows that it is a language that has a lot of success for building these connectors. So that's why we went for Python. Now we also support connectors in Java, but I can talk about that when we discuss Docker. Now the reason why we use Java for core is more like a historical reason where what we've seen over the past 10 years is that a lot of the data technology has been built either with Java or Scala or, like, JVM based technology. And also it's the one the team is the most familiar with. And at that point, Java is really, we're comfortable with it. We're not attached to it. We are ready to have, like, a Kotlin implementation of a part of our core.
It's really about how quickly can we deliver value to our users. And we know that in the data world and in enterprise, people are very familiar with Java. So it seems like a ubiquitous language at that point. But if someone wants to write a very important part of the core in Go, that's something that we would support as well. The language is a tool to get to a goal at that point. Now on the side of the connectors, so this is a very interesting question. The thing with running connectors is, when you have the code, then you have the tests that run around it, and to run that you also need to know what environment you need.
And that is actually a real problem that a lot of people that have been using Singer in the past have encountered, where they don't have the proper version of Python. They don't have the right path environment. There are a lot of things that come into, like, the configuration and the environment. It makes packaging harder, meaning that suddenly you can pull dependencies that are not up to date. And the reason why we wanted to use Docker is because the data infrastructure is moving more and more toward containerization. Like, Kubernetes is becoming ubiquitous, so we wanted to be very compatible with these systems. We wanted the connector to be fully shipped with the environment that it requires to actually run.
And it also allows us to let the community contribute in the language of their choice as long as they follow the protocol. And, like, typically, right now, we have one contributor who is working on a complete coverage of the Google Analytics API, and they're doing it in Elixir. And we don't need to know how to install an Elixir environment. Nobody should know about that. It's just shipped in Docker, and it works out of the box. So that's really about the simplicity of the maintenance. It obviously has some downsides, but these are more like execution complexities.
And once they are fixed, we're good to go. We need to be able to properly schedule these containers, and that's something we're working on. But after it's done, we won't have to worry about them. Right? And we won't have to worry about languages. So that's what motivated that choice. Yeah. It's definitely
[00:21:35] Unknown:
an interesting kind of balancing act because, on the one side, you want to have some level of homogeneity in the implementation so that if somebody comes to the project, then they will be able to dig into the various different connectors if there's something that they wanna tweak or understand. And if there are a number of different languages, then it kind of increases that barrier. But at the same time, you want to be able to bring in everybody who has an idea of how to implement the connector and not force them into a particular language that they're not necessarily familiar with. It's interesting how
[00:22:09] Unknown:
Docker and containers have kind of changed that dynamic a little bit. Yeah. It's also amazing for everything related to developer experience. That's a huge thing that we're working on: how can we make the onboarding of community members seamless? Having it backed by Docker, having it backed by containers, means that anyone can come into the project, and they don't need to install some random CLI on their laptop. They can just open the project and, boom, they can start developing on the connector.
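To make the "any language, as long as it follows the protocol" idea concrete, here is a minimal sketch of a connector as a standalone executable that could be shipped in a Docker image: it takes a command and writes newline-delimited JSON messages to stdout. The command names and message fields are simplified assumptions loosely modeled on the Airbyte/Singer style, not the exact specification.

```python
# Minimal sketch of a connector-as-executable: it reads a command from argv and
# emits newline-delimited JSON messages on stdout. Field names are simplified
# assumptions loosely modeled on the Airbyte/Singer style of protocol.
import json
import sys
import time


def discover() -> None:
    # Advertise one stream and its schema so the platform knows what to expect.
    print(json.dumps({
        "type": "CATALOG",
        "catalog": {"streams": [{
            "name": "users",
            "json_schema": {
                "type": "object",
                "properties": {"id": {"type": "integer"},
                               "email": {"type": "string"}},
            },
        }]},
    }))


def read() -> None:
    # Emit records; a real connector would page through an API or database here.
    for row in [{"id": 1, "email": "a@example.com"}]:
        print(json.dumps({
            "type": "RECORD",
            "record": {"stream": "users", "data": row,
                       "emitted_at": int(time.time() * 1000)},
        }))


if __name__ == "__main__":
    command = sys.argv[1] if len(sys.argv) > 1 else "read"
    {"discover": discover, "read": read}[command]()
```

Because the container image carries its own runtime, the same contract holds whether the body of the connector is written in Python, Java, or Elixir.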
[00:22:43] Unknown:
On the point of contribution and community growth, I also noticed that you have taken a monorepo approach where all the connectors are housed in the same repository as the core implementation. And I know that it's another balancing act where, on the one hand, it makes it easy to find all the connectors because they're all in one place. But on the other hand, it also means that somebody who's contributing needs to be able to navigate their way through the project, and it can complicate some of the ways that you handle things like version numbering and deployment of individual connectors because then you have to version the entire repository all at once. And I'm wondering what your thoughts are on that and what led you to go down the monorepo path.
[00:23:25] Unknown:
So the monorepo is a decision that we made because we knew that we were going to iterate on the protocol a lot. I think that's one piece that has been hindering Singer: it's very hard for them to iterate on the protocol because every single connector is just spread across, like, hundreds of different GitHub repos. So if they want to make a breaking change, and I think when you start a project, you are going to be doing breaking changes. Having the monorepo allows us to keep all the connectors here, and whenever we do that breaking change, we can migrate all these connectors to properly adapt to the new protocol. Now as the protocol matures, I think the different connectors are gonna be spread across different repos for sure, because there will be people who want to just have it in their own repo. They don't want to contribute to the main one, and that's fine. It's just that for now, it gives us a lot of control on how we develop the protocol and how we improve it.
It also allows us to provide a developer environment for contributors. That's especially important at first, when you want to onboard new people, where the build system is just working. So if someone wants to create a connector, they have, like, the test infrastructure in place. They have, like, the integration testing infrastructure in place. So there is a lot less logistics that you need to do to create a new connector. Also, it's a way for us to just keep the community's eyes on only one repo. So making sure that people know everything that's happening and they don't have to look in, like, a hundred different places.
[00:25:06] Unknown:
Yeah. Again, it's interesting to have the federated approach where anybody who wants to build their own connector and manage it themselves and contribute it to the community can do so, but then you have to have some means of cataloging all of the available connectors so that when you are a newcomer to the project, you don't have to, you know, dig through forum posts and GitHub issues and try to piece together what is the actual totality of the ecosystem rather than just having it all in one place and just, here's a list, here are all the connectors.
[00:25:37] Unknown:
Yeah. And definitely, the day we start splitting up the monorepo, if we need to, that is gonna be a prerequisite: we need to have a way of cataloging this and making the discoverability of these connectors extremely easy.
[00:25:55] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plug ins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
Digging a bit more into the actual protocol level of how you manage the interchange and pluggability of the sources and targets and some of the transformations, I know that you have a number of blog posts that dig into your experiences of exploring the Singer specification and the implementations there and some of the lessons that you learned from it and some of the reasoning behind choosing your own protocol specification that is forwards compatible with Singer if somebody wants to translate a Singer tap or a target into the Airbyte system, but it's not backwards compatible. And I'm wondering if you can just talk through some of the shortcomings that you saw and the ways that Singer manages that data interchange.
And, also, given that you do have this new protocol that you have documented, what you see as the potential for a more widespread community adoption of that specification with alternate implementations?
[00:27:25] Unknown:
On the Singer side, I think they came up with a spec, but at that point, specs are not enough, especially for something that scales so much. Like, there are tens of thousands of connectors that need to happen. So you need to have an environment around it. What has happened is people have added their own extensions to the protocol. And, you know, when we initially started Airbyte, we were relying on Singer taps and Singer targets. And we realized that with time, we were just spending more time trying to reverse engineer all these little add ons that were made that were not consistent across different repos, different sources and targets. And we said, okay. It was a tough decision for us because you don't want to reinvent the wheel when there is something that exists. What we realized is that actually it exists, but it's losing a lot of traction. And we said, okay. There are a few reasons why, and we're going to solve it. We're going to learn from what they've done well, learn from what they didn't do well, and come up with something that can reunite the community there. And that's why also we made sure that the protocol can be compatible with what Singer was doing, so that the effort that has been put in by the Singer community is not wasted, and they can leverage what they built. The interchange format, the fact that they are using JSON, I think it was in our interest to do the same.
That might not remain JSON. At the end of the day, it's more a matter of describing the data model of your protocol. And I'm pretty sure that in the future, we will need to support more efficient serialization, both in terms of volume and speed. But for now, like, JSON gives us a lot of visibility and auditability in what's happening as we're still developing the protocol. But I can imagine that we will have layers for maybe putting that into, like, an Avro message or, like, a protobuf message or a Thrift message. What really matters is, like, what is the schema of your protocol?
The interchange,
[00:29:23] Unknown:
you can change it if you need to. Yeah. I was definitely interested in digging more into the use of JSON as the interchange format. Because as you mentioned, it's not necessarily the most efficient, but it is easy to just dump it out to disk or cat it out to a terminal to see what's happening and be able to unpack it. And it's recognized by so many different programming languages, but it's good to see that you have some thoughts as to the forward direction of maybe adding support for binary protocols for better efficiencies. I'm also interested in understanding your perspective on what you see as being the trade offs of using JSON as the interchange format due to the potential for information loss as you convert to and from JSON, thinking in terms of things like type information from maybe a richly typed source or some of the contextual data that you may or may not be able to encapsulate in that JSON specification and maintain across that interchange boundary?
[00:30:21] Unknown:
The way we're thinking about these connectors, there are two pieces. On one side, you have the data. On the other side, you have the catalog. And that's something that we believe Singer did well, like, separating the two. The catalog allows you to describe the schema of the data so that even if you lose data on the exchange side, you can still reinterpret it from the catalog. And that is very important for us because today, we might be using, like, the most basic types of JSON. So maybe we are losing the fact that something was a float versus just a number. But as we rely on the catalog to explain what this type is, we can always recreate that information afterward or maybe serialize it differently if it's a float.
To describe the schema, we're actually using JSON Schema, which has the feature to describe more advanced types, has the feature for describing, like, constraints on the data, and that's what we're gonna leverage. And in the protocol, when we serialize and deserialize, it is something where we're gonna enrich this feature and be smarter about not losing information on the data that we replicate.
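As a concrete illustration of that point, a catalog entry can carry typing detail that the raw JSON values would otherwise blur. The structure below is an assumption for the sake of the example, loosely modeled on the JSON Schema usage described here, not the exact specification.

```python
# Sketch of a stream catalog entry whose JSON Schema preserves information that
# plain JSON records lose: numeric intent, formats, constraints, and valid enums.
# The exact structure is illustrative, not the published specification.
stream_catalog = {
    "name": "orders",
    "json_schema": {
        "type": "object",
        "properties": {
            # A bare JSON number blurs int vs float; the schema keeps the intent.
            "total": {"type": "number", "multipleOf": 0.01},
            "quantity": {"type": "integer", "minimum": 0},
            "status": {"type": "string",
                       "enum": ["pending", "shipped", "cancelled"]},
            "ordered_at": {"type": "string", "format": "date-time"},
        },
        "required": ["total", "quantity", "status"],
    },
}

# A destination can consult this schema when it writes, for example mapping
# "total" to a DECIMAL column instead of a generic FLOAT, even though the
# record itself arrived as untyped JSON.
```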
[00:31:36] Unknown:
Another interesting area to dig into is the overall interfaces that you have and the extension points in the system for Airbyte for being able to integrate with the broader ecosystem of data tools, thinking in terms of things like Great Expectations for being able to embed quality checks in your data flows for the extract and load process or being able to integrate different orchestrators like Dagster or Airflow or Prefect to be able to hook into the life cycle of the pipelines and either have them manage the execution as part of their scheduled runs or be able to use the completion of a pipeline as a trigger to kick off some downstream pipeline and things like that.
[00:32:18] Unknown:
I think, like, in open source, you want to be the best at doing something. And, like, Great Expectations, Airflow, Dagster, or dbt are doing, like, really a phenomenal job at what they're doing. So that's why we're really focusing on becoming the best at moving data and integrating very well with them. So for instance, Airflow, the integration with them should be coming within a couple of months, but actually, you can already use us with Airflow using our API. And definitely dbt, Dagster, Great Expectations are in our short term roadmap.
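Until a native integration lands, the API route described here can be wired into Airflow with a plain HTTP call from a task. This is only a sketch under assumptions: the endpoint path, port, and connection id are placeholders, and a production DAG would also poll the resulting job until it completes.

```python
# Sketch of triggering an Airbyte-style sync from an Airflow DAG via plain HTTP.
# The endpoint path, port, and connection id are placeholder assumptions.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def trigger_airbyte_sync(**_):
    resp = requests.post(
        "http://airbyte-server:8000/api/v1/connections/sync",
        json={"connectionId": "REPLACE-WITH-YOUR-CONNECTION-ID"},
        timeout=30,
    )
    resp.raise_for_status()


with DAG(
    dag_id="airbyte_extract_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="trigger_sync", python_callable=trigger_airbyte_sync)
```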
[00:32:50] Unknown:
And so for people who are interested in getting started with Airbyte and using it within their data flows, what's involved in actually getting it set up and creating a pipeline and being able to gain visibility into the data flows and just the overall maintenance and implementation of Airbyte within somebody's overall data platform?
[00:33:12] Unknown:
So we've optimized a lot on single instance runs. Right now, we're also working on making it multi node and integrating better with Kubernetes. We have an alpha version of it. If you want just the simplest version of Airbyte, it's just a matter of running Docker Compose. It's gonna spin up a bunch of containers and you're up and running. You have a UI. And at that point, you can just connect to the UI and you can start configuring your sources and destinations, and it will start syncing and replicating data. So it's as simple as that. In terms of the maintenance, we are iterating on, like, making the upgrade path a lot nicer and the visibility into the system. So right now we expose as many logs as we can, and when we help the community or, like, our users understand what's going on, generally, they just need to send us the logs and we get it and we debug together, and that gives us visibility. Now we want to integrate with more, like, log ingestion systems so that people can see that in a dashboard.
But that will come with time and with where the community is pushing us. Yeah. The maintenance, we're working on the upgrade path at the moment. We've released something last month on making sure that we can upgrade configuration. Back in the day, it was just, you have to remove everything and restart from scratch. Now we can actually go from one version to the next, and you don't lose your existing configuration, existing data state. Yeah. We're really going for, like, the simplicity of operation and the simplicity of deployment.
[00:34:41] Unknown:
And then you mentioned the work that you're doing to make it multi node. I'm wondering if you can just talk through some of the axes for scaling a deployment of Airbyte and scaling the throughput capacity of a given set of pipelines and just some of the challenges that exist in data integration that might be unique to that space?
[00:35:01] Unknown:
So there are a few dimensions on which you can improve scale. The first one is, depending on how many connectors you have, you might want to spread that across multiple nodes, and that's to us one of the more straightforward ways to scale your data replication. We already have people who have, like, 15 or 20 connectors configured, and at that point, like, one instance is not enough, so you need to run that on multiple nodes. So this is really for, like, the number of existing replications. Now the other dimension on which you need to be able to scale is on the scale of the data. So when you're talking to an API, that's okay. In general, the volume is not gonna be more than 10 gigs a day. Now with databases, when you start integrating with, like, a Kafka stream or, like, these very high throughput or, like, clickstream inputs, this changes. And at that point, things that we're exploring, for example, on databases, and we're gonna release something in the next month or two, is more around smarter database replication using a change data capture solution and also on partitioning the workers. So typically, if you want to replicate, like, a Kafka stream or Kafka topic, you might want to have more than just one worker pulling data from this Kafka topic. And so these are the kinds of axes for scale that we have today. The one we have is around
[00:36:24] Unknown:
breaking down all these integration on multiple nodes, and we're gonna work on the next ones in the next during the year. In terms of being able to manage the upgrade flow for people who are running Airbyte and being able to ensure that they can do sort of a continuous integration or continuous delivery approach for bringing in new connectors or upgrading the versions of Airbyte. What are some of the challenges that you're seeing from your own experience of running it and that the users should be aware of as they're designing their deployment strategies for bringing Airbyte into their infrastructure?
[00:36:58] Unknown:
So a protocol change is generally gonna be something that is easy to catch. Meaning, Airbyte provides a guardrail here where it's not going to sync data if the protocol is incompatible. So at that point, it's a matter of updating the different connectors. Now the real challenge with data integration is more like, what if the data format changes at the source level? How do you handle the migration? And that's where the complexity is. And I think there are a few automated strategies that you could put in place when, like, types are changing, columns are being renamed, and this is something that will be configurable by the user within Airbyte on what kind of strategies they want to adopt. But there are always gonna be cases where it is not possible to have an automated migration. And at that point, it becomes the responsibility of Airbyte to act as a safety net and prevent data corruption until someone knows what to do to get this migration to pass. And that you cannot know, because it's so ingrained into your data infrastructure that you cannot know. And I would say that's for every data system that you have, that's probably one of the hardest problems. And because we are really focusing on extract and load, we want to make sure that we protect people from ingesting incompatible data. That is the biggest thing for us.
And whichever system you use, you need to have these guardrails in place.
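One way to picture that kind of guardrail, purely as a hedged sketch rather than Airbyte's actual policy, is a pre-sync check that compares the schema the source currently reports against the schema the connection was configured with and refuses to sync on incompatible changes.

```python
# Hedged sketch of a pre-sync "guardrail": compare the configured schema with
# the freshly discovered one and refuse to sync on incompatible changes rather
# than risk corrupting the destination. The rules are simplified illustrations.
def breaking_changes(configured: dict, discovered: dict) -> list:
    problems = []
    old_props = configured.get("properties", {})
    new_props = discovered.get("properties", {})
    for column, old_schema in old_props.items():
        if column not in new_props:
            problems.append(f"column dropped at source: {column}")
        elif new_props[column].get("type") != old_schema.get("type"):
            problems.append(
                f"type changed for {column}: "
                f"{old_schema.get('type')} -> {new_props[column].get('type')}"
            )
    return problems


issues = breaking_changes(
    {"properties": {"id": {"type": "integer"}, "total": {"type": "number"}}},
    {"properties": {"id": {"type": "string"}, "amount": {"type": "number"}}},
)
if issues:
    # A real system would halt the sync here and surface the issues to the operator.
    raise SystemExit("refusing to sync:\n" + "\n".join(issues))
```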
[00:38:26] Unknown:
And in terms of the specific categories of connectors, as I look through the work that you've done and the work in the Singer ecosystem and just the overall space of point to point integrations for the extract and load paradigm, it seems that the majority of effort as far as the destinations has gone into kind of the major data warehouse vendors. So thinking in terms of, like, Redshift, Snowflake, and that there is not as much effort or that there's just some incidental complexity that prevents a lot of agreement on how to do things like loading data into S3 for use with things like Spark or Presto or
[00:39:08] Unknown:
the support for maybe more open source data warehouse or data lake infrastructure. And I'm wondering what your thoughts are on that. I would say one of the reasons why data warehouses have been so popular in terms of integrations is because these are the technologies that enable this new category of users to get insight from data. Now I would say, today, warehouses were at the top of the connectors that were asked for by the community. Now we have the new level of these priorities, which is more around, like, data lakes, S3, GCS, Azure Blob Storage, and we're working on it at the moment. Meaning, what we've been doing over the past few months is, like, really understanding what people want when they talk about a data lake. What do they want when they talk about an S3 integration?
Yeah. At that point, we need to provide something that can be as simple as just dumping a CSV, but people want more than that. They want to have support for writing Parquet files. They want to be able to have better partitioning of that data on S3. And so it is a very hard problem, and we are really in a requirements gathering phase at that point on what people want. And once we have that, then, yes, you can start using, like, your Redshift Spectrum on top of your data or, like, your Presto or any kind of file based query engine.
[00:40:30] Unknown:
Yeah. It definitely becomes pretty evident as you start to dig into that problem domain of writing these things out to S3 how much the data warehouse does for you in that regard, because you can just say, write it out to this table, and then the data warehouse will handle the partitioning and the disk allocation and making sure that the indexing is set up properly, versus if you're just writing it to S3. As you said, there are so many questions to be answered in that regard. And so in terms of the overall project, obviously, you have a business goal of having a sort of sustainable company that you can continue to grow and work on for years to come, but you also have this open source foundation.
Wondering if you can discuss your thoughts on the approach you're taking to governance of the open source project and the long term sustainability and viability, if for whatever reason, the startup doesn't remain viable over the long term.
[00:41:26] Unknown:
We will want to give, like, commit access to more and more contributors with time. And we're still learning, Michel and I; it's our first open source experience. But our feeling right now is really to give more control with time. We are kind of like a federation, if you know, like, that terminology, where we think we'll have a lot of contributor growth and user growth. And it only works and scales if you give more control to the community. In terms of sustainability of the project, so in case the business goes down, for sure, as we give more control, there will always be the open source part. Now we hope and we think that an open source project can really change the world if you have a business behind it at the beginning.
And our goal is to change, like, data movement overall. So our mindset is to make it a profitable business. And as mentioned before, we're perfectly fine that 95% of our customers might only use the community edition. So that would be our first view right now, and we're still learning.
[00:42:32] Unknown:
And so as you have been building the project and onboarding more people to it and sort of spreading it around and raising awareness around it, what are some of the most interesting or unexpected or innovative ways that you've either seen Airbyte used or contributions or requests that you've seen come in as to ways to extend or integrate with it?
[00:42:55] Unknown:
One of the monetization ideas that we mentioned before, which is around, like, embedding Airbyte into an actual product, is not something that we thought about initially. We were really thinking about the analytics case, solving data movement for one company, but not powering the data pipelines of a product. So that was one thing. Now we recently talked to one of our community members, and they're actually using Airbyte to populate their cache, to warm up their cache. So every hour you have an Airbyte job that runs, and it's going to populate the cache with the latest version of the data. Not something we thought about. So it's cool to discover these gold nuggets.
[00:43:37] Unknown:
Like, our plan right now, we've been discussing with Meilisearch. We've been discussing with other companies, and we see use cases with other actors in the data industry. For instance, like, right now, we're working on a tutorial where we help you just save your Slack messages on your free plan and search for them, like, indefinitely. So you're not locked in on that part, and we can help you do that. And just, when you can move data freely, it unlocks a lot of use cases like this. So I think we'll have a lot of fun with tutorials to expose these use cases to the community.
[00:44:13] Unknown:
And in your experience of building the Airbyte platform and growing the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:44:24] Unknown:
It's really about the amount of effort that is being put into this today in companies. Sometimes we talk to organizations that have, like, 20, 30 engineers just maintaining connectors. And we knew it was a problem, but 30 people is, like, an enormous amount of human time that is put into a problem that could be solved differently. And to us, it's like these people, these engineers, they want to do something else in general. They want to be smarter with the data instead of focusing on plumbing. And we're talking with them, it seems like, every day, and, yeah, they're, like, just waiting for a solution like that to appear. I think that's why we're getting this growing community so quickly.
Now there is an interesting, it's more like, I would say, a gap, which is people are looking more and more for, like, traceability of the data, understanding how data has been derived. Because we're at the top of the data ingestion, it is actually something that we can propagate down, like, the data value chain, ensuring that people understand where the data is coming from and how it's been synced.
[00:45:45] Unknown:
I would say, like, what is interesting with open source is the inbound needs you have, the interest you have. We have it all across the board; I think the US right now is about 35, 40% of our leads. The rest is really in Europe, Asia, everywhere. And it's from early stage companies to enterprises. And what was surprising to me is the early stage part. We thought at the beginning, it would be more about the medium, midsize companies and enterprises. But we see a lot of startups using us because, well, data is getting everywhere. And, as soon as you have data, you need to move that data. So we are at the beginning of a long journey.
[00:46:24] Unknown:
For people who are looking for a means of building a data integration solution or onboarding new data sources, what are the cases where Airbyte is the wrong choice?
[00:46:35] Unknown:
I would say today it's when you want to integrate unstructured data. This is not something we've been focusing on at all. We're really focused on, like, structured and semi-structured data. If you have, like, blobs of data with no schema, it's not something where we're gonna be very good. We actually talked to a few companies who needed that, and, yeah, that's
[00:46:56] Unknown:
probably not the right time for us to be solving that problem. I would think of potentially one case: if you have a lot of data, so you need more than your workstation to replicate data, and you have no data engineers to help you, you're only, like, a data analyst, and you just want this data replicated. In that case, you need somebody to help you deploy us on more than your own server. So that's the case where you might want to have the cloud based approach. But we're thinking about providing a hosted version, like, in the next four, five, six months. So this is something that we'd be able to address at that point.
[00:47:37] Unknown:
As you continue to iterate on Airbyte and as more people continue to use it and provide feedback, what are some of the things that you have planned for the near to medium term future of the project into the business?
[00:47:48] Unknown:
Reliability is number one, ease of deployment, ease of maintenance. It has to become a no brainer for users, and it has to work all the time in every situation. Like, connectors is like a thousand paper cuts problem. And right now, with the community, we're learning how to tackle and solve these thousand paper cuts. So reliability, reliability. And after that, it's really focusing on having better support and focusing on building our community, because we need the community to help us to, like, build this long tail of integrations. So it's both, like, a technical challenge for us and also, like, making sure that we become the open source standard for solving that
[00:48:38] Unknown:
problem. I would add to that integration with the data stack, especially, so, Great Expectations, dbt, Airflow, Dagster. We have an alpha version for Kubernetes as well. So, as we mentioned, you know, really the goal by the end of the year is that we become the open source standard, the obvious choice. That's really our goal before we focus on any monetization features.
[00:49:00] Unknown:
Are there any other aspects of the Airbyte project or the overall space of data integration and the work that you're doing to grow the community around the open source aspect and the business layers that we didn't discuss yet that you'd like to cover before we close out the show?
[00:49:16] Unknown:
So, like, we're hiring senior software engineers and a founding developer advocate. Our focus is really about building up the community and increasing the conversion between users and contributors, so making it as easy as possible to help us build connectors and maintain them. So, yeah, it's a long journey. We're starting with the reliability, but the goal is really to change how data is being moved and, like, for it not to be an issue anymore in the midterm future.
[00:49:45] Unknown:
And for anybody who wants to follow along with what you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:00] Unknown:
We're getting better at processing data. We're actually becoming very good, and I think we've seen it with the success of Snowflake. But with more opportunity to leverage data, we're starting to discover more problems, like discoverability of the data, metadata associated with the data, the quality of that data, like, the security, because now you're opening data to more and more people. So you need to make sure that you are following, like, security, privacy; there are all these things that are coming that have been unlocked by the ease of data processing.
And I think we're gonna see a lot more open source and commercial companies that are gonna come into this industry in the next few years because, yeah, it's becoming democratized. And with democratization come all these, like, side effects
[00:50:54] Unknown:
of control, more control on the data. The next three, four years will be very interesting in data.
[00:51:01] Unknown:
Well, thank you both very much for taking the time today to join me and share the work that you're doing on Airbyte. It's something that I've been keeping an eye on for a little while now. So definitely excited to try it out and experiment with it a bit. So thank you both for all the time and energy you're putting into solving some of the problems around data integration,
[00:51:22] Unknown:
Tobias.
[00:51:29] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introductions: Michel Tricot and John Lafleur
The Airbyte Project: Background and Motivation
Challenges in Data Integration
Target Users and Design Goals of Airbyte
Business Opportunities and Open Source Strategy
Technical Architecture of Airbyte
Community Contributions and Monorepo Approach
Protocol and Data Interchange Format
Integration with Broader Data Ecosystem
Scaling and Deployment Strategies
Connector Categories and Data Lakes
Governance and Sustainability of Airbyte
Unexpected Use Cases and Community Feedback
When Airbyte is Not the Right Choice
Future Plans for Airbyte
Final Thoughts and Closing Remarks