Summary
Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrassingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy-to-use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open source data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects then this is definitely a conversation worth listening to.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Airbyte is and the story behind it?
- Businesses and data engineers have a variety of options for how to manage their data integration. How would you characterize the overall landscape and how does Airbyte distinguish itself in that space?
- How would you characterize your target users?
- How have those personas instructed the priorities and design of Airbyte?
- What do you see as the benefits and tradeoffs of a UI oriented data integration platform as compared to a code first approach?
- What are the complex/challenging elements of data integration that make it such a slippery problem?
- motivation for creating open source ELT as a business
- Can you describe how the Airbyte platform is implemented?
- What was your motivation for choosing Java as the primary language?
- incidental complexity of forcing all connectors to be packaged as containers
- shortcomings of the Singer specification/motivation for creating a backwards incompatible interface
- perceived potential for community adoption of Airbyte specification
- tradeoffs of using JSON as interchange format vs. e.g. protobuf/gRPC/Avro/etc.
- information lost when converting records to JSON types/how to preserve that information (e.g. field constraints, valid enums, etc.)
- interfaces/extension points for integrating with other tools, e.g. Dagster
- abstraction layers for simplifying implementation of new connectors
- tradeoffs of storing all connectors in a monorepo with the Airbyte core
- impact of community adoption/contributions
- What is involved in setting up an Airbyte installation?
- What are the available axes for scaling an Airbyte deployment?
- challenges of setting up and maintaining CI environment for Airbyte
- How are you managing governance and long term sustainability of the project?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Airbyte used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Airbyte?
- When is Airbyte the wrong choice?
- What do you have planned for the future of the project?
Contact Info
- Michel
- @MichelTricot on Twitter
- michel-tricot on GitHub
- John
- @JeanLafleur on Twitter
- johnlafleur on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Airbyte
- Liveramp
- Fivetran
- Stitch Data
- Matillion
- DataCoral
- Singer
- Meltano
- Airflow
- Kotlin
- Docker
- Monorepo
- Airbyte Specification
- Great Expectations
- Dagster
- Prefect
- DBT
- Kubernetes
- Snowflake
- Redshift
- Presto
- Spark
- Parquet
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting. It often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. Your host is Tobias Macey. And today, I'm interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines. So Michel, can you start by introducing yourself?
[00:02:04] Unknown:
So I'm Michel. I have been working in the data industry since I started my career in 2007, at first more on the financial data side. And in 2011, I moved to the US and actually started in this company called LiveRamp, which is today a public company. And over there, I was running all the integration teams, which was, like, 30 people. And we were basically powering all the data exchanges from LiveRamp and to LiveRamp. So we're talking hundreds of terabytes of data that were delivered on a daily basis. So data integration has been, like, one of my core competencies since I started, yeah, working.
[00:02:45] Unknown:
And, John, how about you? Airbyte is actually my fourth startup. I was into B2Cs and into dev tools. But the latest startup before Airbyte was a software engineering management platform that sits on top of all the dev tools. So we had to build all those ETL pipelines, like six, seven, before we could bring any value, and it was a mess. So that's when I got into data. And Michel and I, we've known each other for seven years, and we knew we wanted to work together. So my latest startup, this one, well, didn't end well. The two first ones were exited. But, at that point, we decided to do something about it.
[00:03:26] Unknown:
And so that brings us to the project that you're working on now, which is Airbyte. So I'm wondering if you can give a bit of background about what it is that you're building there and some of the story behind it and how it got started.
[00:03:37] Unknown:
Yeah. So one thing that we've discovered over the years, and John in his last experience, is that building integrations is hard. Building it, like, technically is easy, but the complexity comes from, like, the maintenance of it. And it's a problem that is taking a lot of people's time in every company that we've talked to, every company that we've been at. And we felt like everybody is redoing the same thing over and over again. Like, you have, like, a hundred Stripe connectors that exist in the world, even more. And what we thought at that point is we want to be able to better leverage the human effort on, like, providing these connectors, so anybody can use them and contribute to them and cover this long tail of integrations.
So while we were investigating for that project, we actually talked to customers of existing solutions like Fivetran, Stitch Data, or Matillion, and each one of them was actually building a parallel system to cover integrations that were not supported or that were not behaving the way they wanted. And that's what really motivated John and me to go for, like, an open source approach where people can address the long tail, and we can work with the community on that.
[00:04:53] Unknown:
And as you mentioned, data integration is something that you would think would be a solved problem because of how long it's been a problem and how many different efforts have been made to try to address it. But I'm wondering if you can just talk through some of the landscape of data integration and some of the issues that exist in the different solutions and how Airbyte is aiming to distinguish itself in that space.
[00:05:20] Unknown:
You have three different options. You have the closed source, cloud based ones, like Fivetran or Stitch Data. The issue here is that, well, they will never be able to really cover the long tail of integrations because, as Michel mentioned, the issue is really about maintaining the connectors. So if you're closed source and cloud based, you will always have this ROI consideration to support a new connector. So after eight years, when you look at Fivetran, they have 150 connectors. So that's why we talk to their customers; right now, we have 500 companies that tested us, and so we talked a lot with them. A lot of them are using Fivetran, but they need other connectors as well. So that's one of the issues with closed source. And with cloud based, there is also data privacy, which is not a first class citizen.
The other type of company is closed source and self hosted, like Matillion or DataCoral. So they solve the self hosted problem, but not the closed source one. And it's really a top down sales cycle in that case. You cannot have just, like, a bottom up approach where you have a data engineer that needs to fix any connector, build any connector, and just start using them. And that's where the open source part comes in. And in open source, you have Singer, but the issue is, I don't know if you've seen that, but Talend purchased Stitch Data, who is the owner of Singer, and they stopped investing in it. And Singer is also a lot of repos. There's less maintenance.
There's not much standardization. And at that point, you see a lot of their taps that are going out of date, and that's where we come in. As the open source option, we wanna standardize the way it's being done. And open source also enables us to address other use cases that closed source cannot address, like databases, certifications, these kinds of things. Yeah. I've definitely been aware of the Singer spec for a while now and kind of saw some of the initial
[00:07:18] Unknown:
interest around it. I know that there was never really a great way to find out information about how you actually implement it and tie things together and manage the overall deployment and monitoring of it. And I know that the Meltano project has recently pivoted to try to be kind of the de facto way of using the Singer taps and targets and trying to level up the overall Singer ecosystem. And then there are also a number of projects like Embulk and Gobblin that have been approaching it in a slightly different way where there's sort of the monolithic core with the different plugins that you can add in for sources and destinations, but it's not necessarily as flexible as the Singer specification where you have just this interchange format along the same lines of, like, a UNIX pipe.
And then there are other things. You mentioned Fivetran and Matillion. But then in the previous generations, there were things like SQL Server Integration Services, which was closed source, or the Pentaho suite, which was more of a drag and drop GUI type of approach. And so it's definitely interesting to see the evolutions of the ways that the data integration problem is trying to be solved as the overall data ecosystem continues to grow and change in terms of the best practices.
[00:08:35] Unknown:
Yeah. I mean, one thing that Singer attempted to address was really around, like, building the connector. But building the connector is just the visible part of the iceberg at that point, which is, as I mentioned, like, writing a connector is something that you can do. It takes you a few hours, but the real complexity, and that's why it's still an unsolved problem, is, like, a connector is gonna live. It depends on an external resource, an external system, and this external system is gonna change. And what you want is you want to have a process. Like, a few years ago, when we built the integration team at LiveRamp, we were maintaining, like, a thousand or 2,000 connectors, and we had to have a very, very strict process on how do you test, how do you monitor.
It's not just about writing the connector. It's really everything that goes around it. And, yes, it's good to have an interchange format, but you also need to have a process around maintenance, and that's what is the most important for connectors.
[00:09:36] Unknown:
And so for the Airbyte project, I'm wondering if you can give a bit more context about how you view the target users for it and some of the ways that that persona has helped to guide and inform the way that you have designed the overall system and the interfaces that are available for interacting with it?
[00:09:55] Unknown:
The first user that we're really targeting is the data engineer. It means it's the person that is spending a lot of time maintaining connectors and making sure that data migrations happen correctly. And this is a huge burden for this team. So what we want first is to unburden them by providing them with a working solution out of the box. But this data is not always consumed by data engineers, and that's where we are also thinking about the second category of users that we're targeting. People are becoming more data savvy in organizations. You have more data scientists, data analysts, and more people that need to interact with data. And warehouses like Snowflake and BigQuery have enabled these new roles to become smarter with data. But it's good to have a processing engine for getting insight from data, but first, you need to get the data in. And when we think about our users and how they inform us, a data engineer doesn't want to spend time, like, enabling a new pipeline.
By it being a burden for them, it prevents these other roles from actually leveraging that data. And what we want is to get these two parts, the data consumption and the data production, to work better together and just make them more autonomous.
[00:11:14] Unknown:
In terms of the actual sort of default interface, I know that you have oriented around a UI driven mechanism of being able to set up sources and destinations. And I'm wondering if you can discuss some of the benefits and trade offs of that approach as opposed to a more just sort of textual approach where it's code native and everything goes into source code.
[00:11:38] Unknown:
Yeah. As I mentioned, the thing is, organizations and people in organizations are more data savvy. And these people are not always technical, and what they understand is a UI. So first, the UI is more like, how can we provide value as quickly as possible? How can we make them autonomous as quickly as possible? Now the way we're also thinking about it is it has to integrate well with the data infrastructure that is already present in these teams, in these data teams. And that's why right now it's very focused on the UI, but behind the scenes, it's also powered by an API.
And in the end, like, with this API, we will be able to have a more, like, textual way of configuring Airbyte and running data replication on Airbyte. But it's really about getting 80% of the value as quickly as possible at that point.
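For teams that prefer a code-first workflow, the API mentioned here is the natural hook. The sketch below is illustrative only; the endpoint paths, payload fields, and port are assumptions rather than a documented contract, so check the API reference of the version you run.

```python
# Rough sketch of driving an Airbyte-style deployment through its HTTP API
# instead of the UI. Endpoint paths, payload fields, and the port are assumed
# for illustration; consult the API reference of the version you run.
import requests

AIRBYTE_API = "http://localhost:8000/api/v1"  # assumed local deployment


def create_source(workspace_id: str, source_definition_id: str, config: dict) -> str:
    """Register a source (for example a Postgres database) and return its id."""
    resp = requests.post(
        f"{AIRBYTE_API}/sources/create",
        json={
            "workspaceId": workspace_id,
            "sourceDefinitionId": source_definition_id,
            "connectionConfiguration": config,
            "name": "my-postgres",
        },
    )
    resp.raise_for_status()
    return resp.json()["sourceId"]


def trigger_sync(connection_id: str) -> dict:
    """Kick off a sync for an existing source/destination connection."""
    resp = requests.post(
        f"{AIRBYTE_API}/connections/sync",
        json={"connectionId": connection_id},
    )
    resp.raise_for_status()
    return resp.json()
```

Wrapping calls like these in version-controlled configuration is one way to get the more textual workflow without waiting for a dedicated tool.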
[00:12:38] Unknown:
And so before we dig into the technical aspects of Airbyte, I'm wondering if you can give a bit more background as to what your motivation is and how you view the business opportunities of creating this open source platform and the overall ecosystem that is available for growth in the data integration market?
[00:13:01] Unknown:
So when we started, as Michel mentioned, we tried to talk to as many of Fivetran, Stitch, and Matillion's customers as possible, and we wanted to see some patterns. So that's how we learned that closed source cloud based wouldn't really fix the problem, but only an open source one would. So our first goal is really to become the standard for open source. That's our goal for 2021. So we won't be focusing on any monetization-related features until 2022. And at that point, what we see is, if we become the standard, then we can have several business model options. The first one could be the standard open core model where any feature that addresses the need of an individual contributor should be open source. So that includes connectors, but anything that addresses the need of the company could be licensed. And that's where we're thinking of a cloud based control panel. Your data stays in the data plane, so in your infrastructure, we won't have access to it. Where we can provide an SLA, for instance, or any enterprise features such as data quality, privacy compliance features, SSO, user access management, these kinds of things. And that's one option that we see; there would be a lot of business in it. And we're completely fine with, like, 90% or 95% of our users just using the community edition.
We're very happy if we make a change and we help a lot of companies, and the impact of the company is much more than our revenues. And the second business model is more what we call powered by Airbyte, where we can power all your connectors with our API, and you offer those connectors to your own clients. So you're in charge of your UI, and you integrate with our API, and we power your connectors in the back end.
[00:14:59] Unknown:
Digging a bit more into the actual implementation of Airbyte, can you describe how the overall system is architected and some of the ways that the approach or the overall goals of the platform have grown or evolved since you first began working on it?
[00:15:15] Unknown:
Yeah. And I just want to go back on one thing regarding the UI, the API, and a more, like, descriptive way of configuring a data pipeline. If you're thinking of how AWS started, they started first as just a UI, then they provided an API, and then you had tools like Terraform that went on top of it that leverage the API behind the scenes. And we see that as a nice trend on how we can, like, address many types of usages, but starting with the one where you get the value directly. Regarding your question on the architecture, there are two main parts to Airbyte. The first one is the one we call core, which is everything that is related to configuration, everything that is related to our API, our UI, and also the scheduler.
And the piece that also takes care of running the different synchronization and replication processes. And on the other side, we also have the integrations. So all the infrastructure that comes into play to build a solid integration. So it's about how it's being packaged, how it's being tested, how it's being monitored. So we have, like, really these two sides to the project. Yeah, I mean, we've made some choices today around, like, the scheduling because we wanted to get the value as quickly as possible. At the moment, we're talking with several data teams where what they want is they don't want to use our scheduler, which is not, I think, the best in the market compared to an Airflow.
And they want to have Airflow actually schedule and manage all the, like, scheduling and triggering of these replication jobs. So right now what we're doing is we are making our scheduler a lot more mature so that it can interact very well with all these external data systems. So that's one thing we've learned during our journey, and what we're putting a lot of effort on is, like, very deep integration with the rest of the data stack. Same thing, we're also leveraging dbt for a lot of our model transformation. Right now, everything is happening behind the scenes, but what we're seeing with our users is that they want to have access to the dbt models that we leverage to do this transformation so that they can then cascade more transformations and more analysis on these generated models.
[00:17:40] Unknown:
I also know that the core implementation is based on Java and that you also support Python for being able to build some of these connectors. I'm wondering what your decision making process looked like for choosing the overall technology stack and some of the design goals that you had that also led you down the path of using Docker as kind of the first class concern for how to package and deploy the overall system?
[00:18:07] Unknown:
For the languages that we use, so Python for all the connectors is, I think, something that people are very familiar with. And the fact that, for example, the Singer community has been very involved with Python shows that it is a language that has a lot of success for building these connectors. So that's why we went for Python. Now we also support connectors in Java, but I can talk about that when we discuss Docker. Now the reason why we use Java for core is more like a historical reason where what we've seen over the past 10 years is that a lot of the data technology has been built either with Java or Scala or, like, JVM based technology. And also it's the one the team is the most familiar with. And at that point, Java is really, we're comfortable with it. We're not attached to it. We are ready to have, like, a Kotlin implementation of a part of our core.
It's really about how quickly can we deliver value to our users. And we know that in the data world and in enterprise, people are very familiar with Java. So it seems like a ubiquitous language at that point. But if someone wants to write a very important part of the core in Go, that's something that we would support as well. The language is a tool to get to a goal at that point. Now on the side of the connectors, so this is a very interesting question. The thing with running connectors is, when you have the code, then you have the tests that run around it, and to run that you also need to know what environment you need.
And that is actually a real problem that a lot of people that have been using Singer in the past have encountered, where they don't have the proper version of Python. They don't have the right path environment. There are a lot of things that come into, like, the configuration and the environment. It makes packaging harder, meaning that suddenly you can pull dependencies that are not up to date. And the reason why we wanted to use Docker is because the data infrastructure is moving more and more toward containerization. Like, Kubernetes is becoming ubiquitous, so we wanted to be very compatible with these systems. We wanted the connector to be fully shipped with the environment that it requires to actually run.
And it also allows us to let the community contribute in the language of their choice as long as they follow the protocol. And, like, typically, right now, we have one contributor who is working on a complete coverage of the Google Analytics API, and they're doing it in Elixir. And we don't need to know how to install an Elixir environment. Nobody should know about that. It's just shipped in Docker, and it works out of the box. So that's really about the simplicity of the maintenance. It obviously has some downsides, but these are more like execution complexities.
And once they are fixed, we're good to go. We need to be able to properly schedule these containers, and that's something we're working on. But after it's done, we won't have to worry about them. Right? And we won't have to worry about languages. So that's what motivated that choice. Yeah. It's definitely
[00:21:35] Unknown:
an interesting kind of balancing act because, on the one side, you want to have some level of homogeneity in the implementation so that if somebody comes to the project, then they will be able to dig into the various different connectors if there's something that they wanna tweak or understand. And if there are a number of different languages, then it kind of increases that barrier. But at the same time, you want to be able to bring in everybody who has an idea of how to implement the connector and not force them into a particular language that they're not necessarily familiar with. It's interesting how
[00:22:09] Unknown:
Docker and containers have kind of changed that dynamic a little bit. Yeah. It's also amazing for everything related to developer experience. That's a huge thing that we're working on: how can we make the onboarding of community members seamless? Having it backed by Docker, having it backed by containers, means that anyone can come into the project, and they don't need to install some random CLI on their laptop. They can just open the project and, boom, they can start developing on the connector.
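To make the "any language, as long as it follows the protocol" idea concrete, here is a minimal sketch of a connector as a standalone executable that could be shipped in a Docker image: it takes a command and writes newline-delimited JSON messages to stdout. The command names and message fields are simplified assumptions loosely modeled on the Airbyte/Singer style, not the exact specification.

```python
# Minimal sketch of a connector-as-executable: it reads a command from argv and
# emits newline-delimited JSON messages on stdout. Field names are simplified
# assumptions loosely modeled on the Airbyte/Singer style of protocol.
import json
import sys
import time


def discover() -> None:
    # Advertise one stream and its schema so the platform knows what to expect.
    print(json.dumps({
        "type": "CATALOG",
        "catalog": {"streams": [{
            "name": "users",
            "json_schema": {
                "type": "object",
                "properties": {"id": {"type": "integer"},
                               "email": {"type": "string"}},
            },
        }]},
    }))


def read() -> None:
    # Emit records; a real connector would page through an API or database here.
    for row in [{"id": 1, "email": "a@example.com"}]:
        print(json.dumps({
            "type": "RECORD",
            "record": {"stream": "users", "data": row,
                       "emitted_at": int(time.time() * 1000)},
        }))


if __name__ == "__main__":
    command = sys.argv[1] if len(sys.argv) > 1 else "read"
    {"discover": discover, "read": read}[command]()
```

Because the container image carries its own runtime, the same contract holds whether the body of the connector is written in Python, Java, or Elixir.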
[00:22:43] Unknown:
On the point of contribution and community growth, I also noticed that you have taken a monorepo approach where all the connectors are housed in the same repository as the core implementation. And I know that it's another balancing act where, on the one hand, it makes it easy to find all the connectors because they're all in one place. But on the other hand, it also means that somebody who's contributing needs to be able to navigate their way through the project, and it can complicate some of the ways that you handle things like version numbering and deployment of individual connectors because then you have to version the entire repository all at once. And I'm wondering what your thoughts are on that and what led you to go down the monorepo path.
[00:23:25] Unknown:
So the monorepo is a decision that we made because we knew that we were going to iterate on the protocol a lot. I think that's one piece that has been hindering Singer: it's very hard for them to iterate on the protocol because every single connector is just spread across, like, hundreds of different GitHub repos. So if they want to make a breaking change, and I think when you start a project, you are going to be doing breaking changes. Having the monorepo allows us to keep all the connectors here, and whenever we do that breaking change, we can migrate all these connectors to properly adapt to the new protocol. Now as the protocol matures, I think the different connectors are gonna be spread across different repos for sure, because there will be people who want to just have it in their own repo. They don't want to contribute to the main one, and that's fine. It's just that for now, it gives us a lot of control on how we develop the protocol and how we improve it.
It also allows us to provide a developer environment for contributors. That's especially important at first, when you want to onboard new people, where the build system is just working. So if someone wants to create a connector, they have, like, the test infrastructure in place. They have, like, the integration testing infrastructure in place. So there is a lot less logistics that you need to do to create a new connector. Also, it's a way for us to just keep the community's eyes on only one repo. So making sure that people know everything that's happening and they don't have to look in, like, a hundred different places.
[00:25:06] Unknown:
Yeah. Again, it's interesting to have the federated approach where anybody who wants to build their own connector and manage it themselves and contribute it to the community can do so, but then you have to have some means of cataloging all of the available connectors so that when you are a newcomer to the project, you don't have to, you know, dig through forum posts and GitHub issues and try to piece together what is the actual totality of the ecosystem rather than just having it all in one place and just, here's a list, here are all the connectors.
[00:25:37] Unknown:
Yeah. And definitely, the day we start splitting up the monorepo, if we need to, that is gonna be a prerequisite: we need to have a way of cataloging this and making the discoverability of these connectors extremely easy.
[00:25:55] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plug ins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
Digging a bit more into the actual protocol level of how you manage the interchange and pluggability of the sources and targets and some of the transformations, I know that you have a number of blog posts that dig into your experiences of exploring the Singer specification and the implementations there and some of the lessons that you learned from it and some of the reasoning behind choosing your own protocol specification that is forwards compatible with Singer if somebody wants to translate a Singer tap or a target into the Airbyte system, but it's not backwards compatible. And I'm wondering if you can just talk through some of the shortcomings that you saw and the ways that Singer manages that data interchange.
And, also, given that you do have this new protocol that you have documented, what you see as the potential for a more widespread community adoption of that specification with alternate implementations?
[00:27:25] Unknown:
On the Singer side, I think they came up with a spec, but at that point, specs are not enough, especially for something that scales so much. Like, there are tens of thousands of connectors that need to happen. So you need to have an environment around it. What has happened is people have added their own extensions to the protocol. And, you know, when we initially started Airbyte, we were relying on Singer taps and Singer targets. And we realized that with time, we were just spending more time trying to reverse engineer all these little add ons that were made that were not consistent across different repos, different sources and targets. And we said, okay. It was a tough decision for us because you don't want to reinvent the wheel when there is something that exists. What we realized is that actually it exists, but it's losing a lot of traction. And we said, okay. There are a few reasons why, and we're going to solve it. We're going to learn from what they've done well, learn from what they didn't do well, and come up with something that can reunite the community there. And that's why also we made sure that the protocol can be compatible with what Singer was doing, so that the effort that has been put in by the Singer community is not wasted, and they can leverage what they built. The interchange format, the fact that they are using JSON, I think it was in our interest to do the same.
That might not remain JSON. At the end of the day, it's more a matter of describing the data model of your protocol. And I'm pretty sure that in the future, we will need to support more efficient serialization, both in terms of volume and speed. But for now, like, JSON gives us a lot of visibility and auditability in what's happening as we're still developing the protocol. But I can imagine that we will have layers for maybe putting that into, like, an Avro message or, like, a protobuf message or a Thrift message. What really matters is, like, what is the schema of your protocol?
The interchange,
[00:29:23] Unknown:
you can change it if you need to. Yeah. I was definitely interested in digging more into the use of JSON as the interchange format. Because as you mentioned, it's not necessarily the most efficient, but it is easy to just dump it out to disk or cat it out to a terminal to see what's happening and be able to unpack it. And it's recognized by so many different programming languages, but it's good to see that you have some thoughts as to the forward direction of maybe adding support for binary protocols for better efficiencies. I'm also interested in understanding your perspective on what you see as being the trade offs of using JSON as the interchange format due to the potential for information loss as you convert to and from JSON, thinking in terms of things like type information from maybe a richly typed source or some of the contextual data that you may or may not be able to encapsulate in that JSON specification and maintain across that interchange boundary?
[00:30:21] Unknown:
The way we're thinking about these connectors, there are two pieces. On one side, you have the data. On the other side, you have the catalog. And that's something that we believe Singer did well, like, separating the two. The catalog allows you to describe the schema of the data so that even if you lose data on the exchange side, you can still reinterpret it from the catalog. And that is very important for us because today, we might be using, like, the most basic types of JSON. So maybe we are losing the fact that something was a float versus just a number. But as we rely on the catalog to explain what this type is, we can always recreate that information afterward or maybe serialize it differently if it's a float.
To describe the schema, we're actually using JSON Schema, which has the feature to describe more advanced types, has the feature for describing, like, constraints on the data, and that's what we're gonna leverage. And in the protocol, when we serialize and deserialize, it is something where we're gonna enrich this feature and be smarter about not losing information on the data that we replicate.
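As a concrete illustration of that point, a catalog entry can carry typing detail that the raw JSON values would otherwise blur. The structure below is an assumption for the sake of the example, loosely modeled on the JSON Schema usage described here, not the exact specification.

```python
# Sketch of a stream catalog entry whose JSON Schema preserves information that
# plain JSON records lose: numeric intent, formats, constraints, and valid enums.
# The exact structure is illustrative, not the published specification.
stream_catalog = {
    "name": "orders",
    "json_schema": {
        "type": "object",
        "properties": {
            # A bare JSON number blurs int vs float; the schema keeps the intent.
            "total": {"type": "number", "multipleOf": 0.01},
            "quantity": {"type": "integer", "minimum": 0},
            "status": {"type": "string",
                       "enum": ["pending", "shipped", "cancelled"]},
            "ordered_at": {"type": "string", "format": "date-time"},
        },
        "required": ["total", "quantity", "status"],
    },
}

# A destination can consult this schema when it writes, for example mapping
# "total" to a DECIMAL column instead of a generic FLOAT, even though the
# record itself arrived as untyped JSON.
```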
[00:31:36] Unknown:
Another interesting area to dig into is the overall interfaces that you have and the extension points in the system for Airbyte for being able to integrate with the broader ecosystem of data tools, thinking in terms of things like Great Expectations for being able to embed quality checks in your data flows for the extract and load process or being able to integrate different orchestrators like Dagster or Airflow or Prefect to be able to hook into the life cycle of the pipelines and either have them manage the execution as part of their scheduled runs or be able to use the completion of a pipeline as a trigger to kick off some downstream pipeline and things like that.
[00:32:18] Unknown:
I think, like, in open source, you want to be the best at doing something. And, like, Great Expectations, Airflow, Dagster, or dbt are doing, like, really a phenomenal job at what they're doing. So that's why we're really focusing on becoming the best at moving data and integrating very well with them. So for instance, Airflow, the integration with them should be coming within a couple of months, but actually, you can already use us with Airflow using our API. And definitely dbt, Dagster, Great Expectations are in our short term roadmap.
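Until a native integration lands, the API route described here can be wired into Airflow with a plain HTTP call from a task. This is only a sketch under assumptions: the endpoint path, port, and connection id are placeholders, and a production DAG would also poll the resulting job until it completes.

```python
# Sketch of triggering an Airbyte-style sync from an Airflow DAG via plain HTTP.
# The endpoint path, port, and connection id are placeholder assumptions.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def trigger_airbyte_sync(**_):
    resp = requests.post(
        "http://airbyte-server:8000/api/v1/connections/sync",
        json={"connectionId": "REPLACE-WITH-YOUR-CONNECTION-ID"},
        timeout=30,
    )
    resp.raise_for_status()


with DAG(
    dag_id="airbyte_extract_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="trigger_sync", python_callable=trigger_airbyte_sync)
```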
[00:32:50] Unknown:
And so for people who are interested in getting started with Airbyte and using it within their data flows, what's involved in actually getting it set up and creating a pipeline and being able to gain visibility into the data flows and just the overall maintenance and implementation of Airbyte within somebody's overall data platform?
[00:33:12] Unknown:
So we've optimized a lot on single instance runs. Right now, we're also working on making it multi node and integrating better with Kubernetes. We have an alpha version of it. If you want just the simplest version of Airbyte, it's just a matter of running Docker Compose. It's gonna spin up a bunch of containers and you're up and running. You have a UI. And at that point, you can just connect to the UI and you can start configuring your sources and destinations, and it will start syncing and replicating data. So it's as simple as that. In terms of the maintenance, we are iterating on, like, making the upgrade path a lot nicer and the visibility into the system. So right now we expose as many logs as we can, and when we help the community or, like, our users understand what's going on, generally, they just need to send us the logs and we get it and we debug together, and that gives us visibility. Now we want to integrate with more, like, log ingestion systems so that people can see that in a dashboard.
But that will come with time and with where the community is pushing us. Yeah. The maintenance, we're working on the upgrade path at the moment. We've released something last month on making sure that we can upgrade configuration. Back in the day, it was just, you have to remove everything and restart from scratch. Now we can actually go from one version to the next, and you don't lose your existing configuration, existing data state. Yeah. We're really going for, like, the simplicity of operation and the simplicity of deployment.
[00:34:41] Unknown:
And then you mentioned the work that you're doing to make it multi node. I'm wondering if you can just talk through some of the axes for scaling a deployment of Airbyte and scaling the throughput capacity of a given set of pipelines and just some of the challenges that exist in data integration that might be unique to that space?
[00:35:01] Unknown:
So there are a few dimensions on which you can improve scale. The first one is, depending on how many connectors you have, you might want to spread that across multiple nodes, and that's to us one of the more straightforward ways to scale your data replication. We already have people who have, like, 15 or 20 connectors configured, and at that point, like, one instance is not enough, so you need to run that on multiple nodes. So this is really for, like, the number of existing replications. Now the other dimension on which you need to be able to scale is on the scale of the data. So when you're talking to an API, that's okay. In general, the volume is not gonna be more than 10 gigs a day. Now with databases, when you start integrating with, like, a Kafka stream or, like, these very high throughput or, like, clickstream inputs, this changes. And at that point, things that we're exploring, for example, on databases, and we're gonna release something in the next month or two, is more around smarter database replication using a change data capture solution and also on partitioning the workers. So typically, if you want to replicate, like, a Kafka stream or Kafka topic, you might want to have more than just one worker pulling data from this Kafka topic. And so these are the kinds of axes for scale that we have today. The one we have is around
[00:36:24] Unknown:
breaking down all these integration on multiple nodes, and we're gonna work on the next ones in the next during the year. In terms of being able to manage the upgrade flow for people who are running Airbyte and being able to ensure that they can do sort of a continuous integration or continuous delivery approach for bringing in new connectors or upgrading the versions of Airbyte. What are some of the challenges that you're seeing from your own experience of running it and that the users should be aware of as they're designing their deployment strategies for bringing Airbyte into their infrastructure?
[00:36:58] Unknown:
So a protocol change is generally gonna be something that is easy to catch. Meaning, Airbyte provides a guardrail here where it's not going to sync data if the protocol is incompatible. So at that point, it's a matter of updating the different connectors. Now the real challenge with data integration is more like, what if the data format changes at the source level? How do you handle the migration? And that's where the complexity is. And I think there are a few automated strategies that you could put in place when, like, types are changing, columns are being renamed, and this is something that will be configurable by the user within Airbyte on what kind of strategies they want to adopt. But there are always gonna be cases where it is not possible to have an automated migration. And at that point, it becomes the responsibility of Airbyte to act as a safety net and prevent data corruption until someone knows what to do to get this migration to pass. And that you cannot know, because it's so ingrained into your data infrastructure that you cannot know. And I would say that's for every data system that you have, that's probably one of the hardest problems. And because we are really focusing on extract and load, we want to make sure that we protect people from ingesting incompatible data. That is the biggest thing for us.
And whichever system you use, you need to have these guardrails in place.
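One way to picture that kind of guardrail, purely as a hedged sketch rather than Airbyte's actual policy, is a pre-sync check that compares the schema the source currently reports against the schema the connection was configured with and refuses to sync on incompatible changes.

```python
# Hedged sketch of a pre-sync "guardrail": compare the configured schema with
# the freshly discovered one and refuse to sync on incompatible changes rather
# than risk corrupting the destination. The rules are simplified illustrations.
def breaking_changes(configured: dict, discovered: dict) -> list:
    problems = []
    old_props = configured.get("properties", {})
    new_props = discovered.get("properties", {})
    for column, old_schema in old_props.items():
        if column not in new_props:
            problems.append(f"column dropped at source: {column}")
        elif new_props[column].get("type") != old_schema.get("type"):
            problems.append(
                f"type changed for {column}: "
                f"{old_schema.get('type')} -> {new_props[column].get('type')}"
            )
    return problems


issues = breaking_changes(
    {"properties": {"id": {"type": "integer"}, "total": {"type": "number"}}},
    {"properties": {"id": {"type": "string"}, "amount": {"type": "number"}}},
)
if issues:
    # A real system would halt the sync here and surface the issues to the operator.
    raise SystemExit("refusing to sync:\n" + "\n".join(issues))
```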
[00:38:26] Unknown:
And in terms of the specific categories of connectors, as I look through the work that you've done and the work in the Singer ecosystem and just the overall space of point to point integrations for the extract and load paradigm, it seems that the majority of effort as far as the destinations has gone into kind of the major data warehouse vendors. So thinking in terms of, like, Redshift, Snowflake, and that there is not as much effort or that there's just some incidental complexity that prevents a lot of agreement on how to do things like loading data into S3 for use with things like Spark or Presto or
[00:39:08] Unknown:
the support for maybe more open source data warehouse or data lake infrastructure. And I'm wondering what your thoughts are on that. I would say one of the reasons why data warehouses have been so popular in terms of integrations is because these are the technologies that enable this new category of users to get insight from data. Now I would say, today, warehouses were at the top of the connectors that were asked for by the community. Now we have the new level of these priorities, which is more around, like, data lakes, S3, GCS, Azure Blob Storage, and we're working on it at the moment. Meaning, what we've been doing over the past few months is, like, really understanding what people want when they talk about a data lake. What do they want when they talk about an S3 integration?
Yeah. At that point, we need to provide something that can be as simple as just dumping a CSV, but people want more than that. They want to have support for writing Parquet files. They want to be able to have better partitioning of that data on S3. And so it is a very hard problem, and we are really in a requirements gathering phase at that point on what people want. And once we have that, then, yes, you can start using, like, your Redshift Spectrum on top of your data or, like, your Presto or any kind of file based query engine.
[00:40:30] Unknown:
Yeah. It definitely becomes pretty evident as you start to dig into that problem domain of writing these things out to S3 how much the data warehouse does for you in that regard, because you can just say, write it out to this table, and then the data warehouse will handle the partitioning and the disk allocation and making sure that the indexing is set up properly, versus if you're just writing it to S3. As you said, there are so many questions to be answered in that regard. And so in terms of the overall project, obviously, you have a business goal of having a sort of sustainable company that you can continue to grow and work on for years to come, but you also have this open source foundation.
Wondering if you can discuss your thoughts on the approach you're taking to governance of the open source project and the long term sustainability and viability, if for whatever reason, the startup doesn't remain viable over the long term.
[00:41:26] Unknown:
We will want to give, like, commit access to more and more contributors with time. And we're still learning, Michel and I; it's our first open source experience. But our feeling right now is really to give more control with time. We are kind of like a federation, if you know, like, that terminology, where we think we'll have a lot of contributor growth and user growth. And it only works and scales if you give more control to the community. In terms of sustainability of the project, so in case the business goes down, for sure, as we give more control, there will always be the open source part. Now we hope and we think that an open source project can really change the world if you have a business behind it at the beginning.
And our goal is to change, like, data movement overall. So our mindset is to make it a profitable business. And as mentioned before, we're perfectly fine that 95% of our customers might only use the community edition. So that would be our first view right now, and we're still learning.
[00:42:32] Unknown:
And so as you have been building the project and onboarding more people to it and sort of spreading it around and raising awareness around it, what are some of the most interesting or unexpected or innovative ways that you've either seen Airbyte used or contributions or requests that you've seen come in as to ways to extend or integrate with it?
[00:42:55] Unknown:
One of the monetization ideas that we mentioned before, which is around, like, embedding Airbyte into an actual product, is not something that we thought about initially. We were really thinking about the analytics case, solving data movement for one company, but not powering the data pipelines of a product. So that was one thing. Now we recently talked to one of our community members, and they're actually using Airbyte to populate their cache, to warm up their cache. So every hour you have an Airbyte job that runs, and it's going to populate the cache with the latest version of the data. Not something we thought about. So it's cool to discover these gold nuggets.
[00:43:37] Unknown:
Like, our plan right now, we've been discussing with Meilisearch. We've been discussing with other companies, and we see use cases with other actors in the data industry. For instance, like, right now, we're working on a tutorial where we help you just save your Slack messages on your free plan and search for them, like, indefinitely. So you're not locked in on that part, and we can help you do that. And just, when you can move data freely, it unlocks a lot of use cases like this. So I think we'll have a lot of fun with tutorials to expose these use cases to the community.
[00:44:13] Unknown:
And in your experience of building the Airbyte platform and growing the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:44:24] Unknown:
It's really about the amount of effort that is being put into this today in companies. Sometimes we talk to organizations that have, like, 20, 30 engineers just maintaining connectors. And we knew it was a problem, but 30 people is, like, an enormous amount of human time that is put into a problem that could be solved differently. And to us, it's like these people, these engineers, they want to do something else in general. They want to be smarter with the data instead of focusing on plumbing. And we're talking with them, it seems like, every day, and, yeah, they're, like, just waiting for a solution like that to appear. I think that's why we're getting this growing community so quickly.
Now there is an interesting, it's more like, I would say, a gap, which is people are looking more and more for, like, traceability of the data, understanding how data has been derived. Because we're at the top of the data ingestion, it is actually something that we can propagate down, like, the data value chain, ensuring that people understand where the data is coming from and how it's been synced.
[00:45:45] Unknown:
I would say, like, what is interesting with open source is the inbound needs you have, the interest you have. We have it all across the board; I think the US right now is about 35, 40% of our leads. The rest is really in Europe, Asia, everywhere. And it's from early stage companies to enterprises. And what was surprising to me is the early stage part. We thought at the beginning, it would be more about the medium, midsize companies and enterprises. But we see a lot of startups using us because, well, data is getting everywhere. And, as soon as you have data, you need to move that data. So we are at the beginning of a long journey.
[00:46:24] Unknown:
For people who are looking for a means of building a data integration solution or onboarding new data sources, what are the cases where Airbyte is the wrong choice?
[00:46:35] Unknown:
I would say today it's when you want to integrate unstructured data. This is not something we've been focusing on at all. We're really focused on, like, structured and semi-structured data. If you have, like, blobs of data with no schema, it's not something where we're gonna be very good. We actually talked to a few companies who needed that, and, yeah, that's
[00:46:56] Unknown:
probably not the right time for us to be solving that problem. I would think of potentially one case: if you have a lot of data, so you need more than your workstation to replicate data, and you have no data engineers to help you, you're only, like, a data analyst, and you just want this data replicated. In that case, you need somebody to help you deploy us on more than your own server. So that's the case where you might want to have the cloud based approach. But we're thinking about providing a hosted version, like, in the next four, five, six months. So this is something that we'd be able to address at that point.
[00:47:37] Unknown:
As you continue to iterate on Airbyte and as more people continue to use it and provide feedback, what are some of the things that you have planned for the near to medium term future of the project into the business?
[00:47:48] Unknown:
Reliability is number one, ease of deployment, ease of maintenance. It has to become a no brainer for users, and it has to work all the time in every situation. Like, connectors is like a thousand paper cuts problem. And right now, with the community, we're learning how to tackle and solve these thousand paper cuts. So reliability, reliability. And after that, it's really focusing on having better support and focusing on building our community, because we need the community to help us to, like, build this long tail of integrations. So it's both, like, a technical challenge for us and also, like, making sure that we become the open source standard for solving that
[00:48:38] Unknown:
problem. I would add to that integration with the data stack, especially, so, Great Expectations, dbt, Airflow, Dagster. We have an alpha version for Kubernetes as well. So, as we mentioned, you know, really the goal by the end of the year is that we become the open source standard, the obvious choice. That's really our goal before we focus on any monetization features.
[00:49:00] Unknown:
Are there any other aspects of the Airbyte project or the overall space of data integration and the work that you're doing to grow the community around the open source aspect and the business layers that we didn't discuss yet that you'd like to cover before we close out the show?
[00:49:16] Unknown:
So, like, we're hiring senior software engineers and a founding developer advocate. Our focus is really about building up the community and increasing the conversion between users and contributors, so making it as easy as possible to help us build connectors and maintain them. So, yeah, it's a long journey. We're starting with the reliability, but the goal is really to change how data is being moved and, like, for it not to be an issue anymore in the midterm future.
[00:49:45] Unknown:
And for anybody who wants to follow along with what you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:00] Unknown:
We're getting better at processing data. We're actually becoming very good, and I think we've seen it with the success of Snowflake. But with more opportunity to leverage data, we're starting to discover more problems, like discoverability of the data, metadata associated with the data, the quality of that data, like, the security, because now you're opening data to more and more people. So you need to make sure that you are following, like, security, privacy; there are all these things that are coming that have been unlocked by the ease of data processing.
And I think we're gonna see a lot more open source and commercial companies that are gonna come into this industry in the next few years because, yeah, it's becoming democratized. And with democratization come all these, like, side effects
[00:50:54] Unknown:
of control, more control on the data. The next three, four years will be very interesting in data.
[00:51:01] Unknown:
Well, thank you both very much for taking the time today to join me and share the work that you're doing on Airbyte. It's something that I've been keeping an eye on for a little while now. So definitely excited to try it out and experiment with it a bit. So thank you both for all the time and energy you're putting into solving some of the problems around data integration,
[00:51:22] Unknown:
Tobias.
[00:51:29] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introductions: Michel Tricot and John Lafleur
The Airbyte Project: Background and Motivation
Challenges in Data Integration
Target Users and Design Goals of Airbyte
Business Opportunities and Open Source Strategy
Technical Architecture of Airbyte
Community Contributions and Monorepo Approach
Protocol and Data Interchange Format
Integration with Broader Data Ecosystem
Scaling and Deployment Strategies
Connector Categories and Data Lakes
Governance and Sustainability of Airbyte
Unexpected Use Cases and Community Feedback
When Airbyte is Not the Right Choice
Future Plans for Airbyte
Final Thoughts and Closing Remarks