Summary
Airbyte is one of the most prominent platforms for data movement. Over the past 4 years they have invested heavily in solutions for scaling the self-hosted and cloud operations, as well as the quality and stability of their connectors. As a result of that hard work, they have declared their commitment to the future of the platform with a 1.0 release. In this episode Michel Tricot shares the highlights of their journey and the exciting new capabilities that are coming next.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Your host is Tobias Macey and today I'm interviewing Michel Tricot about the journey to the 1.0 launch of Airbyte and what that means for the project
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Airbyte is and the story behind it?
- What are some of the notable milestones that you have traversed on your path to the 1.0 release?
- The ecosystem has gone through some significant shifts since you first launched Airbyte. How have trends such as generative AI, the rise and fall of the "modern data stack", and the shifts in investment impacted your overall product and business strategies?
- What are some of the hard-won lessons that you have learned about the realities of data movement and integration?
- What are some of the most interesting/challenging/surprising edge cases or performance bottlenecks that you have had to address?
- What are the core architectural decisions that have proven to be effective?
- How has the architecture had to change as you progressed to the 1.0 release?
- A 1.0 version signals a degree of stability and commitment. Can you describe the decision process that you went through in committing to a 1.0 version?
- What are the most interesting, innovative, or unexpected ways that you have seen Airbyte used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airbyte?
- When is Airbyte the wrong choice?
- What do you have planned for the future of Airbyte after the 1.0 launch?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Airbyte
- Airbyte Cloud
- Airbyte Connector Builder
- Singer Protocol
- Airbyte Protocol
- Airbyte CDK
- Modern Data Stack
- ELT
- Vector Database
- dbt
- Fivetran
- Meltano
- dlt
- Reverse ETL
- GraphRAG
[00:00:11] Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'd like to welcome back Michel Tricot to talk about the journey to the 1.0 launch of Airbyte and what that means for the project. So, Michel, can you start by introducing yourself for anybody who hasn't heard your previous appearance?
[00:00:29] Michel Tricot:
Yeah, of course. Thank you for having me, Tobias. So I'm Michel Tricot, and I am the cofounder and the CEO of Airbyte.
[00:00:35] Tobias Macey:
And do you remember how you first got started working in data and what it is about it that has kept you interested for so long?
[00:00:42] Michel Tricot:
I mean, I could go very far back in the past and look at me as a teenager collecting DivX files and websites, because at the time Google was barely starting. But more professionally, it started really in 2008, working on financial data. I thought I was going to be a trader, and I ended up doing financial data, and I had a lot of fun with it. What happened after that is I moved to the US and joined an AdTech company that was very small at the time and became very large. Everything related to AdTech and MarTech is about data, and data at internet scale, and I would say that's really where I got burned the most and where I learned a lot of these principles around what it means to work with data and what the process of data manufacturing is. And at the end of the day, I started Airbyte in 2020 and have been having a blast since then.
[00:01:38] Tobias Macey:
And for people who are curious about Airbyte and some of the specifics and internals, I'll refer them back to the previous interview we did, and I'll link that episode in the show notes. But for somebody who is just hearing about Airbyte in this episode, can you give a quick overview of what it is that you're building and some of the notable milestones that you have gone through on your path to this 1.0 release?
[00:02:03] Michel Tricot:
Yeah. So we started Airbyte in 2020, and Airbyte is an open source data movement platform. You can almost think of it as the highway for getting data from point A to point B, point A being a place where you have siloed data, and point B being a place where you can actually extract value from data. We started in 2020; that was the first open source release. In 2021, we started to get more and more traction from the community, up to a point where in the summer of 2021 the whole team was completely frozen. We couldn't do anything, we couldn't build the product; we were just spending so much time on Slack helping our community be successful with Airbyte. For me, I see that as the first real milestone. Yes, you've released something, and you never know if it's going to work, but that was the first real milestone that I can remember, because we really suffered a lot from it. It was fun, but it was very frustrating to not be able to build more. And in 2022, the milestone was releasing Airbyte Cloud. That was our first time having to really operate Airbyte. Before, it was open source software: we just put it on GitHub and people would figure it out. Here, we had to be our own customers, learning how you need to run Airbyte at scale, and all the pain points that go with it. That was quite a journey in 2022.
That was very much a painful year, because everything we did in 2021 was about moving fast, and 2022 was: okay, we need to take a step back and look at what it actually means. 2023, I would say, was the connector builder. You know, Airbyte is nothing without connectors, because that's what allows us to connect point A and point B together. At some point in 2021 or 2022, I wrote an internal document about what we call nailing the maintenance, which is: how do you create a very large catalog of connectors, lots of breadth, and how do you make sure that these connectors are high quality, well maintained, etcetera. In 2023, we really built the first big building block to getting to that state. And, yeah, now we're here in 2024.
We have a ton of connectors, and we're launching the Airbyte 1.0 version. So very, very happy about that.
[00:04:44] Tobias Macey:
On that point of having the connectors that are available be high quality and reliable, I know that that was one of the major concerns early in your design process, particularly given the state of the ecosystem around the Singer framework and protocol. And I'm curious if you can talk to some of the lessons that you learned in those early days of evaluating the Singer protocol, deciding that it didn't suit your needs, and building the Airbyte protocol and the interfaces you had around that. I know you also went through a major revision of that protocol in that time frame as well. What are some of the lessons learned early in that process that helped inform the decisions you made about what it actually means for those connectors to be reliable, which elements of that you wanted to be customizable versus which elements needed to be standardized, etcetera?
[00:05:37] Michel Tricot:
Yeah, of course. So when we think about building connectors, and building a large number of those, the thing to think about is that you're not just building a connector. What you're building is a factory, where you get raw material as input, and you get a high quality connector as output. When I look at the past two or three years, this is what we've been building. The byproduct of it is that we get more connectors, but what we are building is that factory, to get from someone saying, hey, I need to get data from X, to: I have a working connector that will work for me for the long term. To me, that's the big challenge in what we're doing: what does that factory look like? When we started in 2020, yes, there was the Singer framework. Actually, initially Airbyte was running with those connectors, and very fast we realized that the quality was not there, because it was basically up to every single human to make sure that a connector always works, and there was no real testing or way to be prescriptive about what a high quality connector is.
And that's where we started to change things very fast. I think in 2021 we had already moved away from that protocol and built our own. Obviously, we took some inspiration, because there were still some good things there. But what we did first was: when someone builds a connector, they need to encode the whole environment that the connector needs. You know, in 2021, some people asked us, why is that a Docker image? And I'd say, it doesn't matter that it's an image. What matters is that by making it an image, you have to encode which libraries need to exist. If you need some specific SSL library installed on your system, it will be encoded somewhere and you will never forget about it. So it was a way of making sure that a connector is completely self contained and can run anywhere, no matter what your environment looks like. You only need a way to execute a Docker image, and that's all. And if you don't want Docker, it's fine: you can still read the Dockerfile and you know what you need. This way, you're basically documenting what you need, as code.
And for us, that was really the first step. The next steps were very much around how we do the testing of those connectors: how we create sandboxes, how we replay API calls, etcetera. The goal is to create that asset of: what does it mean to have a high quality connector? Well, you need to be able to run tests, real tests. You need to be able to validate that the data looks like the schema you're describing, all these things. And we've been building every single step of that process of manufacturing a connector. Today, we have a ton of dashboards internally that are monitoring every single connector. We know which streams are working and which streams are low quality, and at that point we can invest more time in those, or someone from the community is going to look at it and help us fix it.
And finally, the last one was: how do we make it simpler to build these connectors? The theory we have here is that we have a protocol, and you can build anything with that protocol, but it's very painful. Nobody creates a website with raw TCP/IP; they all have higher levels on top of that. And that's basically what we've done since the beginning. First we were very close to the protocol. Then in 2021 we had the first version of the CDK; that's when we started to have more community adoption. Then another version of it. Then we started to have low code, then no code. That's what we're doing to minimize the amount of effort for building, but most importantly for maintaining, these connectors. Today, most of our connectors are using low code or no code, meaning that they can be maintained in just a minute or two. That's what we want to get to: it has to be low cost.
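The layering Michel describes bottoms out in the Airbyte protocol: a connector is a self-contained program (typically a Docker image) that writes newline-delimited JSON messages to stdout, with a `type` field distinguishing data records from state checkpoints. A minimal consumer of such a stream might look like the sketch below; this is an illustration of the message shape only, not the platform's actual implementation, which handles many more message types and edge cases:

```python
import json

# The Airbyte protocol is newline-delimited JSON on stdout. Each message
# carries a "type" field; RECORD messages hold data, STATE messages hold
# checkpoints the platform can persist and resume from.
def consume_protocol_lines(lines):
    """Split a connector's stdout stream into data records and state checkpoints."""
    records, states = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            records.append(msg["record"]["data"])
        elif msg.get("type") == "STATE":
            states.append(msg["state"])
        # Other types (LOG, TRACE, ...) would be routed to platform logging here.
    return records, states

# Simulated connector output: two records followed by one checkpoint.
stdout_lines = [
    '{"type": "RECORD", "record": {"stream": "users", "data": {"id": 1}, "emitted_at": 0}}',
    '{"type": "RECORD", "record": {"stream": "users", "data": {"id": 2}, "emitted_at": 0}}',
    '{"type": "STATE", "state": {"data": {"users": {"cursor": 2}}}}',
]
records, states = consume_protocol_lines(stdout_lines)
print(records)  # [{'id': 1}, {'id': 2}]
```

Because the contract is just "JSON lines on stdout", any runtime that can execute an image and read a pipe can host a connector, which is what makes the CDK and low-code layers possible on top.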
[00:10:05] Tobias Macey:
Another aspect of the connectors and the ecosystem around them is that you have done substantial investment in simplifying that development process, or the generation with the low code and no code interfaces, specifically on the source side. The destinations are still more complex, and rightly so, because of the need to manage how updates are inserted into the platforms, and the different types of destinations, whether it's a data lake or a database or a data warehouse. And I'm wondering if you can talk to some of the struggles or challenges in simplifying the interface and the development effort for generating those destination connectors, beyond just the source connectors that are a little bit more straightforward, at least in some cases. Obviously, there are very complex source connectors as well.
[00:10:59] Michel Tricot:
Yeah. So, today, a lot of the destinations that we support have a lot of specificities. You could load data into a SQL-based store just by saying INSERT, but that is going to be very inefficient. So then you go and start looking at: what is the most efficient way to load data into Redshift? What is the most efficient way to load data into Snowflake, or BigQuery, or ClickHouse, etcetera? That's why having a CDK for those heavier infrastructure destinations means building an abstraction. You can do it; we actually have one that is coming out with the CDK. We built the Databricks destination on that abstraction (sorry, not Oracle, Databricks), and we're going to be able to use that abstraction for more and more destinations. But it's probably not going to get to the point where you can do low code or no code, because with these systems you want to take advantage of their specificities. Now, a place where I think destinations are going to become easier to build is when we start addressing more API types of destinations, because those already have some kind of generalized interface: it's an API, and you're pushing data there. There might be some specificities, but they will be easier to encode in low code or no code. So this is a place where, yes, we will be investing in a framework.
[00:12:34] Tobias Macey:
Another interesting aspect of both the timing of when you first launched Airbyte and the location in the data stack that you're targeting is that there have been numerous shifts in the past 4 years: in particular, starting in 2020, the rise and subsequent decline or dissolution of the "modern data stack", a term that nobody really wants to throw around anymore. And then, maybe even more impactful, the rise and adoption of generative AI and the need for vectorization of data for semantic retrieval capabilities.
And I'm wondering if you can talk to some of the ways that those industry trends have impacted the way that you think about the areas of focus and the capabilities that you're building into Airbyte and the requirements around what Airbyte can deliver to the end consumers?
[00:13:33] Michel Tricot:
Yeah. So, you know, I can still remember the first README we had for Airbyte, which was "open source ELT tool" or something like that. I don't want to claim all the credit, but I do believe that in 2021 we were the ones who talked about "data movement" for the first time. Maybe that's wrong, I don't know, but I like to think that it's true. The reason we quickly moved away from just ETL and ELT and went for data movement is that we wanted to position and develop Airbyte in a way that is at the infrastructure layer, so that it's not just about pushing data into a data warehouse, although today that is one of our main use cases. We've seen people doing other things with Airbyte that were also super interesting, and that's why we wanted to go a level lower and look at it as data movement: moving data from point A to point B, providing intelligence on top of these pipes, and making sure that data can be sent to a place of value.
And for me, that's why, when I hear about the modern data stack: yes, we used the term a little bit, and we stopped using it, as you said. But at the end of the day, what the modern data stack really is, is how you create an architecture and infrastructure that helps you build a long term system. People are going to continue to adopt warehouses; there is no question there. People need to compute analytics, to build dashboards, etc. Maybe we don't call it the modern data stack, but at the end of the day, once they do something like that, we want to guide them on how to future proof it. So, yes, you have your warehouse; think about your storage, think about your compute, think about your network, your data movement, how you bring data into it, and then what comes after that. Do you need to do reverse ETL? Do you need to do dashboarding? Those are real problems that people will have. We just stopped calling them the modern data stack, but they will still be there. Now, on the gen AI side, that's actually where we got a lot of pull from the community.
The same way we got some pull for reverse ETL, which we never really got to, we got a lot of pull from the community: we provide pipes, so how do we make sure our pipes can connect to different systems than warehouses? How do we make sure they can connect to data lakes? How do we make sure they can connect to vector databases or other types of destinations? For us, that's just a question of how we evolve the way we build destinations, because sources are not going to change that much. We still need to pull data from Salesforce, from HubSpot, from Postgres, from Oracle, from SAP. It's just where the data goes, and what kind of intermediate process you need to put in place. If you need to do embedding of the data, well, how do you configure Airbyte to do so? Those are things we need to continue to develop, because it's part of building pipes.
[00:16:47] Tobias Macey:
As far as the ecosystem of data movement, that is maybe the core essence of what data engineering is: it's just moving data from one place to another so that you can use it for some process. Even just constraining the scope to extract and load, and I say "just", maybe that's not the right word to use, which is a big area of focus and was the initial step of the modern data stack, having that easy extract and load capability, there are a lot of complexities, both explicit and incidental, that come out of it: a lot of edge cases and performance challenges. I'm wondering if you can talk to some of the hard-won lessons that you learned in the process of building and scaling Airbyte about the hard realities of data movement and data integration.
[00:17:48] Michel Tricot:
Yeah. The hard thing about it is that the moment you depend on a system you don't control, but that you're supposed to link to other systems, in a way you're responsible when it goes down. That's really the problem, and that's why, except for people like Airbyte, nobody wants to do something like that; it's just crazy. This, to me, is the real complexity: sometimes, even if you want to create the best possible platform, you cannot create a pipe that is as good as you would want it to be. You know, we have people who say, oh, pulling data from this API is slow. I say, yes, it's slow, but we don't have a choice, because it's rate limited on the other side, so you cannot go fast.
But in a way, it always reflects on the tool, and the question for us, as operators and builders of a platform, is how we train and educate people to understand that, yes, pulling data from this API is actually really hard, because you have all these rate limiting constraints; maybe there are some pieces of data that you cannot get out; maybe this particular API endpoint does not support incremental updates. Although the platform supports it, the source does not support it, or the destination does not support it. I think this is the real challenge when you're moving data: you can build the most perfect platform, but you're always dependent on the source. So the only thing you can do is educate and make the most of it. When you have rate limiting, you can say, yes, if we don't rate limit, they're going to close down your account, and explain why we have to do these things. But I think here it's more about educating the humans that depend on that data. And this is actually something we did for 1.0, this concept of checkpointing, because sometimes it takes forever to pull data from an API.
The last thing you want is for something to go wrong in the middle and, boom, you have to start from the beginning. That is the worst possible experience. Here, this is us having to work around the fact that the source does not support incremental updates and create a system of checkpointing, to make sure that although the experience is not the best, it's not the worst. And this is the kind of thing we need to do as an infrastructure product: all these weird edge cases, and setting the right expectations.
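The checkpointing behavior Michel describes can be sketched in a few lines of Python. The names and page structure here are purely illustrative, not Airbyte's actual implementation; the point is that the sync saves a cursor after each page, so a retry resumes from the last checkpoint rather than from page zero:

```python
# Illustrative checkpointed sync: pull pages from a simulated paginated API,
# saving a cursor after each page so a crash does not restart from scratch.
PAGES = {0: ([1, 2], 1), 1: ([3, 4], 2), 2: ([5], None)}  # cursor -> (rows, next cursor)

def sync(state, fail_at_cursor=None):
    """Pull all remaining pages, starting from the last checkpoint in `state`."""
    cursor = state.get("cursor", 0)
    pulled = []
    while cursor is not None:
        if cursor == fail_at_cursor:
            raise ConnectionError("simulated mid-sync failure")
        rows, next_cursor = PAGES[cursor]
        pulled.extend(rows)
        state["cursor"] = next_cursor  # checkpoint after the page is processed
        cursor = next_cursor
    return pulled

state = {}
try:
    sync(state, fail_at_cursor=2)   # fails after the first two pages
except ConnectionError:
    pass
# The retry resumes from cursor 2 instead of re-pulling everything:
print(sync(state))  # [5]
```

Without the saved cursor, the second call would have to re-pull pages 0 and 1, which is exactly the "start from the beginning" experience the checkpointing work avoids.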
[00:20:26] Tobias Macey:
Over the past 4 years, as you said, you started with a very straightforward "here's the open source project, here are a bunch of connectors, let's see what happens", to now committing to the 1.0 release. Obviously, there have been a lot of evolutionary steps architecturally. What are the core decisions from an architectural perspective that you made early on that have proven to be most effective and most stable? And what are some of the aspects that have required the most constant change and exploration to settle on what you're now committing to as a stable 1.0 version?
[00:21:08] Michel Tricot:
Yeah. The one decision I'm so happy we made in the past is running connectors as Docker images. Absolute game changer. I know it adds a little bit of overhead, but it matters for how we develop and how we think about maintenance, and that's the thing that is the most important: these connectors need to always work. So for me, that was probably the best decision we've made. The worst decision we made, and I don't know if it's the worst one, I think it was a good decision at the time but we had to pay the price later, was that we did not optimize for scale at the time. And for me, as a founder, you should not optimize for something like scale until you have to hit that scale.
What we were optimizing for at the beginning was very much time to value: making sure that people can just download the software, get up and running, experience the value, and boom, it's done. But that had some implications. It meant there were some developments and some architecture choices that were suboptimal for operating Airbyte at scale. I think when we started to run and operate cloud, that's really when we started to hit this kind of problem. In 2023, we onboarded our largest customer on cloud. Thousands of jobs running in parallel.
It broke our cloud. Nobody saw it, but we saw it, and we felt it, to the point where we were having a meeting and someone said, hey, I think we made a bad deal here; it's going to cost us more than they are paying us, and it creates fires everywhere on cloud. And I said, no, actually, we want to be able to take on this kind of volume for the price that we provide. It forced us to work on a completely new way of running connectors. We released it, I think the first version was in December, still behind a feature flag: a way to very quickly spin up new data planes, so that if we have a customer that needs a lot more, or if we run into limitations on Google's or AWS's APIs for Kubernetes, boom, we can spin up a new cluster and everything works magically. But that took a lot of time, because the fundamentals were not there in the platform at the time, so we really had to build toward: okay, we need scale, we need a lot of scale. The other decision, and this one really got us, was that we started to use dbt in our connectors directly. When we put data into warehouses, we would transform it using a generated dbt script on the warehouse.
And although that was very good at the beginning, when the volumes of data were small, the moment we started to have more data, we started to have people complaining about the cost of running Airbyte's normalization. The last thing we want is to incur more cost. So that was also a big change in how we architected not just the platform but some of the connectors, around how we manage data warehouses. Initially it was a great decision, because it allowed us to move very fast, but at some point we had to really focus on: what does it mean to run Airbyte at scale? We cannot do that. We have to be more efficient; we have to be smarter.
[00:24:51] Tobias Macey:
Some other interesting evolutions of the Airbyte platform are around its operability, where in the very early versions, it was: you deploy it, here's a web app, you point and click, here are your connections, off to the races. Then, over time, you developed more API interfaces for being able to define connections as code and trigger syncs as code. And now you also have the PyAirbyte interface, where you can eschew the web server component and do it entirely in a Python script. I'm wondering if you can talk to some of the internal conversations and decisions that precipitated each of those capabilities in the platform.
[00:25:39] Michel Tricot:
Yeah. I think it was very much about the maturity of the audience we were talking to. What happened in 2020 and 2021 is that a lot of companies were revamping their data stacks; I think COVID gave them a lot of breathing room to invest in this kind of project, and a lot of the people we were talking to at the time wanted a very easy solution, just a UI. Now, you know, when you have an open source project like that, you don't control how people use your software. What we started to see happening very fast was people not using Airbyte for their own personal use case, but to create an application on top of it. They started to hack around the terrible API we had at the time and to operate Airbyte programmatically.
And even ourselves, internally, we started to have more needs, like workflow orchestration, where a UI is not sufficient, and very fast an API became something we had to do. The piece around PyAirbyte really shows the power of how we designed connectors as very thin pieces of code, because you can transpose them to any system. PyAirbyte is an example of that: let's get rid of the platform and all the heavy features, just use the connector itself within code, and pull the data from code. For me, it was really a progression in terms of the maturity of the user. First, very early on, it was BI people or data engineers bringing up warehouses, wanting something very simple. Then you start having people who develop applications on top of Airbyte, or who have more advanced use cases with orchestrators and things like that. And then it was: now we have application developers building with Airbyte, and the platform is good, but maybe it's too prescriptive for them. They don't want a UI, they don't want an API, they just want access to the data. So how do we serve this audience? PyAirbyte, I think, means we're now in software development land, with people building applications, or people in notebooks doing experimentation, and they don't have to take on the whole platform. But when they go to production, they probably need the platform, because it has all the bells and whistles for reliability.
[00:28:16] Tobias Macey:
Absolutely. That's definitely a big piece of my understanding about the utility of PyAirbyte: a big part of the value of the Airbyte platform is all of the state management, being able to see when the syncs ran, managing the checkpointing and the incremental state so that I don't have to rerun everything all the time. But that point of being able to quickly test out the connectors in a development environment, as you said, brings a lot of value, because you don't have to set up the entire platform just to test something out. Circling back a little bit to the LLM and generative AI application ecosystem of generating the embeddings: it's very straightforward to do that in a naive manner of just, give me the text, I'll chunk it at arbitrary delimiters and feed it through into the vector database.
Now that that ecosystem has developed a bit more maturity and a bit more understanding about what the appropriate chunking strategies are, how to associate different attributes and metadata with those vectors so that you can do things like filtering, and, now that things like GraphRAG are starting to gain some attention, how to associate those vectors with graph nodes, I'm wondering if you can talk to some of the challenges you have in understanding how best to surface those concepts in Airbyte. And when is it just a matter of: it doesn't make sense to do that in Airbyte, because there's too much customization and application-specific logic that needs to be embedded, so you need to move that into a different layer?
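For a concrete reference point, the naive fixed-size chunking Tobias describes, with overlap and per-chunk metadata for later filtering, can be sketched as below. The field names are made up for illustration; this is not an Airbyte or vector-database schema:

```python
def chunk_text(text, size=200, overlap=50, metadata=None):
    """Naive fixed-size chunking with overlap; each chunk carries source metadata."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = size - overlap  # each new chunk re-reads the last `overlap` characters
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + size]
        if not piece:
            break
        chunks.append({
            "text": piece,
            "start": start,                # position in the source document
            "metadata": dict(metadata or {}),  # e.g. source file, for filtering
        })
    return chunks

doc = "a" * 450
chunks = chunk_text(doc, size=200, overlap=50, metadata={"source": "example.txt"})
print(len(chunks))  # 3
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, which is exactly the kind of knob (along with delimiter-aware splitting and metadata schemas) that tends to be application-specific rather than platform-level.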
[00:29:55] Michel Tricot:
Yeah. I actually think that right now we still know almost nothing. And you see that: we talk to a lot of companies, whether they are startups or more mature ones, and most of the time they don't use platforms right now. They work and build everything very vertically, and the reason they do it is that the pace of innovation and development cannot be blocked by the features a specific platform supports. That's why you're seeing a lot of it built very vertically. I think platforms will come, but we're probably one to three years out from that.
So at that point, Airbyte, if you look at it from a UI perspective, can do the first POC. We can get you to that first moment of understanding that you can get all that data into a vector store with Airbyte. Now, the thing we really need to figure out is how the world is going to evolve, because maybe today we have something for chunking, or for embedding, or for appending metadata, that will work for some of the use cases, but it will probably never work for most use cases. And that's why we have PyAirbyte, actually: today, we just want to be there when people look and say, oh, I don't want to build a connector to bring data from, I don't know, Gong calls or whatnot.
We just want to be there and provide PyAirbyte, and then be with them as they experiment, because they don't know, and we don't know, what they need today. So for us it's about whether we can be there at the entry point and learn from how they're using it, so that we can bring that learning into the platform and at some point have a data movement platform for this type of use case. That's really the path we've taken: we try to be there as they develop.
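To make the chunking discussion above concrete, here is a minimal sketch of the kind of preprocessing involved: splitting a document into overlapping chunks and attaching metadata to each one, so a vector store can filter results before similarity search. All names and parameters here are illustrative, not Airbyte's actual implementation.

```python
# Naive overlapping chunker that carries metadata alongside each chunk.
# A vector store would embed chunk["text"] and index chunk["metadata"]
# for pre-filtering (e.g. only search chunks from a given source).

def chunk_document(text, source, author, chunk_size=200, overlap=50):
    """Split text into overlapping chunks, each with attached metadata."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "text": text[start:end],
            "metadata": {
                "source": source,   # e.g. which connector produced it
                "author": author,   # arbitrary filterable attribute
                "offset": start,    # position in the original document
            },
        })
        if end == len(text):
            break
        start = end - overlap

    return chunks

# Hypothetical usage: a transcript pulled from a "gong" source.
chunks = chunk_document("example text about a sales call", "gong", "alice")
```

The chunk size, overlap, and metadata fields are exactly the knobs that tend to be application-specific, which is the platformization difficulty being described.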
[00:32:13] Tobias Macey:
Another element of what you're building and how it exists in the ecosystem is that the competitive landscape for data movement has changed fairly substantially over that same time period. When you started out, the Stitch data ecosystem was kind of the option if you wanted open source, but there was a high degree of variability in the quality, and there was no cohesive platform for operating those connectors. Meltano came up to try and address that situation and help bootstrap that community. On the commercial side, there was Fivetran, which is still one of the more notable fully commercial options.
But there have also been a number of other entrants; most notably, I'm thinking of dlt as a very composable data movement capability. And I'm wondering how that shifting landscape helps you understand what the needs in the ecosystem are, so that you know how to build for and address them, as well as how to position yourself for people who are coming to this problem fresh and trying to understand: what are my options, what are the differentiators, and how do I think about which tool to select to solve my problem?
[00:33:30] Michel Tricot:
Yeah. I would say that, primarily, if you don't have connectors, it doesn't matter what you do. That's the value you need to deliver. When I think about how we built Airbyte, there was a ping-pong between connectors and platform. We started with a ton of connectors but a mediocre platform. Then we invested a lot in the platform and did not invest as much in connectors. And now that we have a platform that runs at high speed and high reliability, we're focusing again on connectors, because at the end of the day people take what the platform gives them as a given. They don't think there is effort to be done in the platform, when actually a lot of what we provide is: hey, you don't have to worry about data schema changes, et cetera.
So there will always be other players, but at the end of the day people will pick another platform if the connectors are not the right quality. That's going to be the deciding thing, now that we have a dominant player on the fully commercial side and Airbyte on open source. Yes, other players can come, but people will make the choice based on the connectors that are available. Now, there is a question around use cases, and the thing we're looking to address with PyAirbyte is how to handle use cases that are more sophisticated than pushing data into a warehouse and working downstream of it.
Here, the sophistication comes from what kind of preprocessing you need to do on your data, and the fact that you may not be using a warehouse to process it. PyAirbyte is here to do that first discovery. And I think we have a strong community, a lot of connectors, and a lot of use cases that are addressed today.
[00:35:37] Tobias Macey:
Absolutely. And also, as you look across the entire data landscape, it's definitely not a zero-sum game. There are still so many people who are just writing bespoke code, spending all of their time maintaining those different point-to-point connections, that if they choose any of these tools, it's a win for the whole ecosystem.
[00:35:56] Michel Tricot:
Yes, exactly. And honestly, that's also the power of open source: preventing people from having to build. I mean, I'm a lazy engineer. Most engineers are a little bit lazy, not in the bad sense, but they just don't want to have to build the thing they don't want to build, and they're going to find open source. So the question here is: do they start adopting Airbyte, or do they start adopting something else? I think today a lot of it is coming to Airbyte. But we have to be ready for the people who say, I don't want a platform, I just want the connectors, because I'm building something that no platform can do today. That's why having our connectors available as code is very important.
[00:36:50] Tobias Macey:
Now, we've talked a lot about the history, the journey that you've been on, and we've touched a few times on the fact that you're gearing up for this 1.0 release. When you do commit to a 1.0 version, there is a signaling of a certain degree of stability and commitment to the way things are right now, and to the fact that you're not going to drastically change anything. I'm wondering if you could talk to the decision process, the discussions, and the confidence building that you've done, both internally and with the community, that makes you feel that you're ready for that 1.0 version, and what that 1.0 version is intended to signal.
[00:37:30] Michel Tricot:
Yeah. We had a lot of conversations internally about when we should do 1.0. At some point we were even talking about doing it in 2021. That would have been a huge mistake. The thing about 1.0 is that it's really about having hit a milestone in terms of the technology. Now, technology is not a milestone by itself; it has to be proven. It has to be proven by the type of workloads we're running and the type of feedback we're getting. Today, we believe the platform we have is the foundation, or the on-ramp, for even more data movement use cases, and that's what we wanted 1.0 to be. One of the proof points we wanted was: do we have more than a certain number of enterprise users, whether they are paid users or open source users?
And we got to that one. We actually hit that number earlier this year, I think it was in April or so. But what we wanted to see was critical data analytics pipelines being handled with Airbyte at a certain amount of scale. The other piece was around connector quality, because, as I told you, we have this ping-pong between connectors and platform. For us, the question was: do we feel good about the manufacturing process we have in place for connectors? And today, we do. Our ability to bring in more contributions from the community, our ability to react when a connector has an issue, our ability to maintain a connector over time: I think today we've hit the number we wanted. And after that, it's just diversity of workloads.
We wanted to have the click-through UI. We wanted to have programmatic access with the API. We wanted to have more of the CI/CD side with Terraform, and we wanted to have PyAirbyte for people who are building on top of data connectors. That was really what we wanted to get with 1.0.
[00:39:45] Tobias Macey:
And as you have been gearing up to that 1.0 launch, what has that meant in terms of the requirements you have around new connectors being adopted, the requirements around existing connectors and their ongoing maintenance, the commitment to which connectors you are going to continue to support and which ones may still phase out and be deprecated, and the commitments to community engagement and what people can expect from Airbyte going forward?
[00:40:20] Michel Tricot:
Yeah. So I think here we need to make a distinction, with regard to connectors, between sources and destinations. For destinations, our goal is: the moment we start building a destination ourselves, as Airbyte, that is something we'll continue to maintain, because we've made that choice after seeing a lot of users and enterprises asking for that particular destination. At that point, it also becomes something that we would charge for. So the moment they are Airbyte-maintained, they will be there, probably, forever. I'm sure there will be some exceptions, but that's the idea.
Now, in terms of community or marketplace connectors, at the source level, what we've done is deprecate a few of those in the past. It was more about us reinvesting in high quality connectors and making sure that as we build that engine, we also simplify the work for ourselves. People can still take them, but making those connectors fit with the systems we're building is too much of a lift; it would require a full rewrite to be compatible with the level of quality we want. So at that point, we just deprecated them. Some people are actually taking them, reshaping them, and putting them back, but for us that deprecation was a one-time thing. It was more about: this is all the learning we've had. These connectors were built in 2021, we don't want to maintain them, and nobody in the community is doing it, so we're just going to remove them, because they reflect badly on the platform and on the people who are using Airbyte. But now it's going to be very different. We're also publishing more of the metrics that we look at internally around connector quality and connector reliability.
So these are now all available in our documentation. That's also something we wanted to do with 1.0: providing the right expectations for how a connector behaves and how it works, and making it much, much easier to modify and maintain. To me, that's the big thing about 1.0: we're not going to remove connectors, we're going to provide the right tools to make them better, and we're also providing good visibility into their quality. I'm not sure if that fully answers your question, though.
[00:43:03] Tobias Macey:
No, I think you did a good job addressing it. I was mainly just trying to get at what changes people can expect around Airbyte from the community side, as well as the commitments that you're making with this 1.0 release, and any substantial changes that needed to be made ahead of that release to put you in a good position going forward.
[00:43:26] Michel Tricot:
Yeah. The other thing we've done, as part of the system we're building, is work on how we manage community contributions. I have a graph that at some point was showing how fast we reply to PRs, and there was a moment when it was not good. Now we've gotten it down to maybe a day or two, but that's part of making the process much stronger. Our goal, especially for connectors, is: how can we get contributions into the mainline faster? That's part of building that process, building that factory for connectors.
[00:44:09] Tobias Macey:
Another interesting decision that you made early on, and it seems like it has paid dividends, is that you are keeping all of the connector definitions, from the source code perspective, in a monorepo. So all of the connectors and the platform were all in, well, actually now it's two repositories, but all of the connectors, at least, are in one repository, versus being spread around GitHub and having to be recollected via some website, et cetera. As you move towards this 1.0 release, is that something you intend to continue with? And what are your thoughts on the possibility or viability of having some sort of community discovery platform for people who are building other connectors out of band of the Airbyte repository, or any support that you want to have for third-party connector libraries?
[00:45:07] Michel Tricot:
Yeah. Actually, we do have some; there is a company that has been building connectors by themselves, and they host them on their own repo. So the other thing we're releasing with 1.0 is this concept of a connector marketplace. The connector marketplace can be both for community members and for vendors who want to build their own connectors. This is a place where we're going to listen to what people are telling us, but ideally, most of the low-code connectors should live in the Airbyte repo. The reason we think it's a pretty low lift is that, one, we can now very quickly review this type of PR, but also there are no security concerns with these connectors, because they are low-code. We control the execution framework of these connectors; it's just a YAML file that gets executed.
As for connectors on other repos, would we put them on the platform right away? I don't know, because we're dealing with data, and we want to make sure it's not just the quality but also the security of these connectors. We want everything we run behind the scenes to validate that a connector is not doing something crazy to run before that connector runs on the Airbyte platform. But people can already do it; they can already build connectors in their own repos. We just want to vet the connectors, because I've worked in data since 2008.
Data is a key asset that every company has, and it's sometimes very privacy-centric. We have to make sure that this data is always safe, and that's basically what we guarantee by having connectors on Airbyte: they go through our security processes and reviews.
[00:47:13] Tobias Macey:
As you have been building Airbyte, growing the company, the platform, and the ecosystem, what are some of the most interesting or innovative or unexpected ways that you've seen the technology applied?
[00:47:26] Michel Tricot:
Yeah. The first one, which we discovered in 2022, was people who built a Redis destination and were just using Airbyte to do a cache warm-up. I was like, yeah, that's a cool use case, I love it. It's the kind of thing where you don't have control; you're just giving people a hammer and wood, and they build something you never thought about. The other one, and this one became a product, was this Powered by Airbyte thing. Initially, we were just thinking of Airbyte as, hey, you're using it to push data into your warehouse, and people started to build applications on top of it. That's a cool use case; never thought of that. We kept seeing people asking, hey, how can I build on top of Airbyte? And that's when we released the first version of the API, so that people don't have to click on buttons and fill forms.
Those were, I would say, two very interesting ones, because one led to a product, and the other one was just nice, very, very smart. We have a few others, but I don't know if we want to cover all of them.
[00:48:44] Tobias Macey:
And in your own experience of going through this journey, building this platform, building this business, what are the most interesting or unexpected or challenging lessons that you've learned?
[00:48:55] Michel Tricot:
Yeah. The first one was when we started our batch, someone gave us advice. Two pieces of advice, actually. One was: don't optimize for scale, optimize for time to value. The second one was: you will never know who is using your software. Well, we followed the first piece of advice, which was around scale. We did not invest in scale initially. That was very cool at the beginning, and became a big boulder that we had to push forward at the end, but now we're there. On not knowing our users: we had Slack very early on, and very early on in the UI we were asking, hey.
What do you do? Where are you from? Et cetera. Having that ability to know the customer or the user, whether on cloud or on open source, was possible because people were willing to help us by giving us information like that, and it gave us a direct line to people. Especially early on (we still do it now, but maybe not at such a large scale), every time we were thinking, oh, should we build X or Y, we would go on Slack, or we would put the question in a newsletter, and within 30 minutes or an hour, boom, everyone was telling us whether they preferred X or Y.
So in terms of product development and where you focus your team: invaluable. Now, I would say that the thing that is hard about building open source, especially for infrastructure, is that you don't control who is using your software, meaning that you will have people who face issues and will not be successful using Airbyte. As a product builder, it pains me to be in that situation, where I know someone cannot be successful with Airbyte. You have to live with the fact that some people will not be successful with it, because you cannot control who is downloading it, and you cannot say, oh, your use case is probably not a good fit for Airbyte. I think that, especially as we're building open source, the question is: how do we widen the net, the spectrum of users we want to be able to make successful?
That's really the challenge.
[00:51:31] Tobias Macey:
And on that note, what are the cases where Airbyte is the wrong choice and somebody should either build something custom or use some other off-the-shelf technology?
[00:51:44] Michel Tricot:
You know what? I would have had an answer for you a few months ago, but now we've released PyAirbyte. Pre-1.0, yes, there was a gap around streaming types of use cases. We don't do those; it's planned for post-1.0. But otherwise, with PyAirbyte, you can do whatever you want. You have access to the whole catalog of connectors as code, and you can add all the logic you want on top of it. So that would be the former gap, and there's no real gap today.
[00:52:21] Tobias Macey:
And looking forward to that post-1.0 world, you hinted at streaming. What are some of the other capabilities that you have planned for the future of Airbyte once you have moved past this 1.0 milestone? You've committed to stability in the platform and the interfaces, and so you have a more stable target to build off of and move fast on.
[00:52:46] Michel Tricot:
Yeah. For us, it's going to be very much about operational use cases. I think we've been very focused on analytics and data warehouse use cases. All the AI use cases are pushing us in the direction of operational use cases, and reverse ETL is an operational use case. Streaming, in my head, only makes sense for operational use cases, where it's not a human making the decision but a machine. So a big focus for us post-1.0 is: how do we address things that are not analytics but are operationally driven?
The other one is that we have a large roadmap around enterprise connectors. We're starting to release some of those, but we have a long roadmap, and we're working with our enterprise customers to build them, so we want to release them in the next few months. Speed is going to be a big deal. I think we have some low-hanging fruit that we want to make faster. We've made a few changes in the past few months, and speed has gone up quite significantly, but I think we can still do a 10x, so consider that a timestamp. That's something I'm very excited about. Data needs to move as fast as possible; we can never be the bottleneck.
[00:54:07] Tobias Macey:
We didn't really touch on it throughout this conversation, but I'll also say thank you for all of the work that you and your company have done from the outreach perspective: providing a lot of high quality blog posts, conducting the state of data engineering surveys year over year to help collect some of that information, and all of the investment that you're putting back into the community as well. With that being said, are there any other aspects of the work that you've been doing on Airbyte, the 1.0 launch, or the future that you're building towards that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:45] Michel Tricot:
No, but I think unstructured data is going to be unleashed, and I'm very curious to see how that's going to pan out. We have some theories, but basically, 80% of the data that was very, very expensive to leverage is going to be unleashed. I'm looking forward to that world, and I want to make sure we're positioning Airbyte in that field as well, because we're data geeks: we want to push and get more data through the pipes.
[00:55:21] Tobias Macey:
Alright. Well, for anybody who wants to keep in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:55:38] Michel Tricot:
Yeah. I think the biggest gap is: what do we do with unstructured data? There is going to be a lot around the security and the privacy of that data, because it's not as easy as running a regex on something. So to me, that's a big deal today.
[00:55:58] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you and your team have been doing on Airbyte. Congratulations on the 1.0 milestone. I've been using Airbyte for several years now, so I appreciate all of the value that it has provided for my own work. Thank you again for all the time and energy you folks are putting into it, and I hope you enjoy the rest of your day.
[00:56:22] Michel Tricot:
Thank you, Tobias.
[00:56:31] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'd like to welcome back Michel Tricot to talk about the journey to the 1.0 launch of Airbyte and what that means for the project. So, Michel, can you start by introducing yourself for anybody who hasn't heard your previous appearance?
[00:00:29] Michel Tricot:
Yeah, of course. Thank you for having me, Tobias. I'm Michel Tricot, and I am the cofounder and CEO of Airbyte.
[00:00:35] Tobias Macey:
And do you remember how you first got started working in data and what it is about it that has kept you interested for so long?
[00:00:42] Michel Tricot:
I mean, I could go way back and look at myself as a teenager collecting DivX files and building websites, because at the time Google was barely starting. But more professionally, I really started in 2008, working on financial data. I thought I was going to be a trader, and I ended up doing financial data, and I had a lot of fun with it. After that, I moved to the US and joined an AdTech company that was very small at the time and became very large. Everything related to AdTech and MarTech is about data, and data at internet scale, and I would say that's where I got burned the most and where I learned a lot of the principles around what it means to work with data and what the process of data manufacturing is. At the end of the day, I started Airbyte in 2020 and have been having a blast since then.
[00:01:38] Tobias Macey:
And for people who are curious about Airbyte and some of the specifics and internals, I'll refer them back to the previous interview we did, and I'll link that episode in the show notes. But for somebody who is just hearing about Airbyte in this episode, can you give a quick overview of what it is that you're building and some of the notable milestones that you have gone through on your path to this 1.0 release?
[00:02:03] Michel Tricot:
Yeah. So we started Airbyte in 2020, and Airbyte is an open source data movement platform. You can almost think of it as the highway for getting data from point A to point B, point A being a place where you have siloed data, and point B being a place where you can actually extract value from data. We started in 2020; that was the first open source release. In 2021, we started to get more and more traction from the community, up to a point where, in the summer of 2021, the whole team was completely frozen. We couldn't do anything, we couldn't build the product; we were just spending so much time on Slack helping our community be successful with Airbyte. I see that as the first real milestone. Yes, you've released something, and you never know if it's going to work, but this one was the first real milestone I can remember, because we really suffered a lot from it. It was fun, but it was very frustrating to not be able to build more. In 2022, the milestone was releasing Airbyte Cloud. It was our first time having to really operate Airbyte. Before, it was open source software, so we just put it on GitHub and people would figure it out. Here, we had to be our own customers: this is how you need to run Airbyte at scale, and these are all the pain points that go with it. That was quite a journey in 2022.
We like to refer to that one as a very painful year, because everything we did in 2021 was about moving fast, and 2022 was: okay, we need to take a step back and look at what this actually means. 2023, I would say, was the connector builder. Airbyte is nothing without connectors, because that's what allows us to connect point A and point B together. At some point in 2021 or 2022, I wrote an internal document about what we call nailing the maintenance: how do you create a very large catalog of connectors, with lots of breadth, and how do you make sure that these connectors are high quality, well maintained, et cetera? In 2023, we really built the first big building block for getting to that state. And now we're here in 2024.
We have a ton of connectors, and we're launching the Airbyte 1.0 version. So very, very happy about that.
[00:04:44] Tobias Macey:
On that point of having the connectors that are available be high quality and reliable, I know that was one of the major concerns early in your design process, particularly given the state of the ecosystem around the Stitch framework and protocol. I'm curious if you can talk to some of the lessons that you learned in those early days: evaluating the Stitch protocol, deciding that it didn't suit your needs, and building the Airbyte protocol and the interfaces you had around it. I know you also went through a major revision of that protocol in that time frame, and I'd love to hear some of the lessons learned early in that process that helped inform the decisions you made about what it actually means for those connectors to be reliable, which elements you wanted to be customizable versus which needed to be standardized, et cetera.
[00:05:37] Michel Tricot:
Yeah, of course. So when we think about building connectors, and building a large number of them, the thing to think about is that you're not just building a connector. What you're building is a factory, where you get raw material as input and you get a high quality connector as output. When I look at the past two or three years, this is what we've been building. The byproduct is that we get more connectors, but what we are building is that factory for getting from someone saying, hey, I need to get data from X, to: I have a working connector that will work for me for the long term. To me, that's the big challenge in what we're doing: what does that factory look like? When we started in 2020, yes, there was the Stitch framework. Initially, Airbyte was actually running with those connectors, and very quickly we realized that the quality was not there, because it was basically up to every single human to make sure that a connector always works, and there was no real testing or way to be prescriptive about what a high quality connector is.
That's where we started to change things very fast. I think by 2021 we were already off of that protocol, and we built our own. Obviously, we took some inspiration, because there were still some good things there. But what we did first was: when someone builds a connector, they need to encode the whole environment that the connector needs. In 2021, some people asked us, why is that a Docker image? And I would say: it doesn't matter that it's an image. What matters is that by making it an image, you have to encode what libraries need to exist. If you need a specific SSL library installed on your system, it will be encoded somewhere and you will never forget about it. So it was a way of making sure that a connector is completely self-contained and can run anywhere, no matter what your environment looks like. You only need a way to execute a Docker image, and that's all. And if you don't want Docker, it's fine; you can still read the Dockerfile and know what you need. This way, you're basically documenting what you need, as code.
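As a hedged illustration of that point, a connector's Dockerfile might look something like the following sketch. The base image, paths, and package names are hypothetical, not Airbyte's actual layout; the idea is only that every dependency is encoded in the image definition.

```dockerfile
# Hypothetical connector image: every system and Python dependency the
# connector needs is declared here, so it runs identically anywhere a
# container runtime exists.
FROM python:3.10-slim

# System libraries (e.g. a specific SSL library) are encoded in the
# image instead of being assumed present on the host.
RUN apt-get update \
    && apt-get install -y --no-install-recommends libssl-dev \
    && rm -rf /var/lib/apt/lists/*

COPY . /airbyte/integration_code
RUN pip install /airbyte/integration_code

ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]
```

Even without Docker, this file serves as executable documentation of the connector's environment, which is the point being made above.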
For us, that was really the first step. The next steps were very much around how we do the testing of those connectors: how we create sandboxes, how we replay API calls, et cetera. The goal is to create that definition of what it means to have a high quality connector. You need to be able to run tests, real tests. You need to be able to validate that the data matches the schema you're describing, all of these things. We've been building every single step of that process of manufacturing a connector, and today we have a ton of dashboards internally that monitor every single connector. We know which streams are working and which streams are low quality, and at that point we can invest more time in those, or someone from the community will look at it and help us fix it.
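The kind of automated check described above can be sketched as a small validator that compares records emitted by a connector against the stream's declared schema. This is a simplified stand-in, not Airbyte's actual test harness, and the schema format is invented for illustration.

```python
# Validate emitted records against a declared per-field type schema.
# Returns a list of (record_index, field, problem) tuples, so a test
# harness can fail a connector build when the list is non-empty.

def validate_records(records, schema):
    """Check each record's fields against the declared types."""
    problems = []
    type_map = {"string": str, "integer": int, "boolean": bool}
    for i, record in enumerate(records):
        for field, declared in schema.items():
            if field not in record:
                problems.append((i, field, "missing"))
            elif not isinstance(record[field], type_map[declared]):
                problems.append((i, field, "wrong type"))
    return problems

# Hypothetical stream schema and two emitted records.
schema = {"id": "integer", "email": "string"}
records = [{"id": 1, "email": "a@b.c"}, {"id": "2", "email": "x@y.z"}]
issues = validate_records(records, schema)
```

Real connector tests also replay recorded API responses and exercise sandbox accounts, as mentioned above; schema conformance is just the most mechanical layer.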
And finally, the last one was how do we make it simpler to build these connectors? The theory that we have here is that we have a protocol, and you can build anything with that protocol, but it's very painful. Nobody creates a website with TCP/IP. They all have higher level layers on top of that. And that's basically what we've done since the beginning, which was first very close to the protocol. Then in 2021, we had the first version of the CDK. That's when we started to have more community adoption. Then another version of it. Then we started to have low code, then no code. And that's basically what we're doing to just minimize the amount of effort for building, but most importantly for maintaining these connectors. Today, most of our connectors are using low code or no code, meaning that they can be maintained in just 1 or 2 minutes. And that's what we want to get to. It has to be low cost.
[00:10:05] Tobias Macey:
Another aspect of the connectors and the ecosystem around them is that you have done substantial investment in simplifying that development process, or the generation, with the low code and no code interfaces, specifically on the source side. The destinations are still more complex, and rightly so, because of the need to be able to manage how updates are inserted into the platforms, the different types of destinations, whether it's a data lake or a database or a data warehouse. And I'm wondering if you can talk to some of the struggles or some of the challenges in simplifying the interface and simplifying the development effort for being able to generate those destination connectors, beyond just the source connectors that are a little bit more straightforward, at least in some cases. Obviously, there are very complex source connectors as well. But yeah. So, today, a lot of the
[00:10:59] Michel Tricot:
destinations that we support have a lot of specificities. You could load any data into a SQL based store by just doing inserts, but this is going to be very inefficient. So then you go and you start looking at: what is the most efficient way to load data into Redshift? What is the most efficient way to load data into Snowflake or BigQuery or ClickHouse, etcetera? And that's why having a CDK for those more heavyweight infrastructure destinations means building an abstraction. You can do it. We actually have one that is coming out with a CDK. We built the Databricks destination on that abstraction, and we're going to be able to use that abstraction for more and more destinations. But it's probably not going to get to the point where you can do low code or no code, because with these systems you want to take advantage of their specificities. Now, a place where I think destinations are going to become easier to build is when we start addressing more API types of destinations, because those already have some kind of generalized interface, which is: it's an API, and you're pushing data there. There might be some specificities, but they will be easier to encode in low code or no code. So this is a place where, yes, we will be investing in a framework there.
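The gap between the naive path and the warehouse-specific bulk path can be illustrated with a toy example. Here `sqlite3` stands in for a real warehouse and `executemany` stands in for a bulk-load primitive like `COPY` or staged loads; none of this is Airbyte's destination code.

```python
# Why destinations resist full low-code treatment: a row-by-row INSERT works
# everywhere but is slow, so each warehouse gets its own specialized bulk
# path. sqlite3 is a stand-in here for a real destination.
import sqlite3

rows = [(i, f"user_{i}") for i in range(1000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

# Naive, destination-agnostic path: one statement per record.
for row in rows[:10]:
    conn.execute("INSERT INTO users VALUES (?, ?)", row)

# Specialized bulk path: one batched call for the remaining records.
conn.executemany("INSERT INTO users VALUES (?, ?)", rows[10:])

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

On a real warehouse the batched path might go through object storage staging and a `COPY` command, which is exactly the per-destination specificity a generic abstraction has to encode.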
[00:12:34] Tobias Macey:
Another interesting aspect of both the timing of when you first launched Airbyte as well as the location in the data stack that you're targeting is that there have been numerous shifts in the past 4 years, in particular, starting in 2020, the rise and then subsequent decline or dissolution of the modern data stack, which is a term that nobody wants to really throw around anymore. And then, also, maybe even more impactful is the rise and adoption of generative AI and the need for vectorization of data for the semantic retrieval capabilities.
And I'm wondering if you can talk to some of the ways that those industry trends have impacted the way that you think about the areas of focus and the capabilities that you're building into Airbyte and the requirements around what Airbyte can deliver to the end consumers?
[00:13:33] Michel Tricot:
Yeah. You know, I can still remember the first README we had for Airbyte, which was "open source ELT tool" or something like that. I don't want to just pull the cover to myself, but I do believe that in 2021 we were the ones who talked about data movement for the first time. Maybe that's wrong, I don't know, but I like to think that it is true. And the reason we quickly moved away from just ETL and ELT and went for data movement is we wanted to position Airbyte, and to develop Airbyte, in a way that is at the infrastructure layer, so that it's not just about pushing data into a data warehouse, although today that is one of our main use cases. We've seen people doing other things with Airbyte that were also super interesting, and that's why we wanted to go a level lower and just look at it as data movement, which is moving data from point A to point B, providing intelligence on top of these pipes, and just making sure that data can be sent to a place of value.
And for me, that's why, when I hear about the modern data stack, yes, we used the term a little bit, and we stopped using it, as you said. But at the end of the day, what the modern data stack really is, is: how do you create an architecture and infrastructure that helps you build a long term system? People are going to continue to adopt warehouses, there is no question there. People need compute for analytics, to do dashboards, etcetera. Maybe we don't call it the modern data stack, but at the end of the day, once they do something like that, we want to guide them on how to future proof it. So, yes, you have your warehouse: think about your storage, think about your compute, think about your network, like data movement, how you bring data into it, and then what comes after that. Do you need to do reverse ETL? Do you need to do dashboarding? Those are real problems that people will have. We just stopped calling them the modern data stack, but they will still be there. Now, on the gen AI side, that's actually where we got a lot of pull from the community.
The same way we got some pull for reverse ETL, we never really got to it, but we got a lot of pull from the community. We provide pipes. How do we make sure that our pipes can connect to different systems than warehouses? How do we make sure they can connect to data lakes? How can we make sure they can connect to vector databases or other types of destinations? And for us, that's just how we evolve how we build destinations, because sources are not going to change that much. We still need to pull data from Salesforce, from HubSpot, from Postgres, from Oracle, from SAP. It's just: where does it go, and what kind of intermediate process do you need to put in place? So if you need to do embedding of the data, well, how do you configure Airbyte to do so? And those are things that we need to continue to develop, because it's part of building pipes.
[00:16:47] Tobias Macey:
As far as the ecosystem of data movement, that is maybe the core essence of what data engineering is. It's just moving data from one place to another so that you can use it for some process. Even just, and I say "just," maybe that's not the right word to use, but constraining the scope to extract and load, which is a big area of focus and was the initial step of the modern data stack, having that easy extract and load capability, there are a lot of complexities, both explicit as well as incidental complexity that comes out of it, a lot of edge cases, performance challenges. I'm wondering if you can talk to some of the hard won lessons that you learned in the process of building and scaling Airbyte about the hard realities of data movement and data integration.
[00:17:48] Michel Tricot:
Yeah. The hard thing about it is, the moment you depend on a system that you don't control, but that you're supposed to link to other systems, well, in a way you're responsible when it goes down, and that's really the problem. And that's why, except people like Airbyte, nobody wants to do something like that. It's just crazy. And this to me is the real complexity: sometimes, even if you want to create the best possible platform, you cannot create a pipe that is as good as you would want it to be. You know, we have people who say, oh, pulling data from this API is slow. I say, yes, it's slow, but we don't have a choice, because it's rate limited on the other side, so you cannot go fast.
But in a way it always reflects on the tool, and now the question for us, as operators and builders of a platform, is how do we train and educate people to understand that, yes, pulling data from this API is actually really hard, because you have all these rate limiting things. Maybe there are some pieces of data that you cannot get out. Maybe this particular API endpoint does not support incremental updates: although the platform supports it, the source does not support it, or the destination does not support it. And I think this is the real challenge when you're moving data: you can build the most perfect platform, but you're always dependent on the source. So the only thing you can do is educate, and make the most of it. When you have rate limiting, you can say, yes, if we don't rate limit, they're gonna close down your account, and explain why we have to do these things. But I think here it's more about educating the humans that depend on that data. This is actually something we did for 1.0, this concept of checkpointing, because sometimes it takes forever to pull data from an API.
The last thing you want is something goes wrong in the middle and, boom, you have to start from the beginning. That is the worst possible experience. And here, this is us having to work around the fact that the source does not support incremental, and just create a system of checkpointing to make sure that, although the experience is not the best, it's not the worst. And this is the kind of thing we need to do as an infrastructure product: all these weird edge cases, and setting the right expectations.
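The checkpointing idea can be sketched as follows. The `fake_api` function and the `state` dict are invented for illustration; this is not Airbyte's state protocol, just the core pattern of persisting a cursor after each page so a failed sync resumes instead of restarting.

```python
# Minimal checkpointing sketch: persist a cursor after each page so a
# failed sync resumes where it left off instead of pulling everything again.

def fake_api(cursor, page_size=3):
    """Pretend paginated API holding 10 records total."""
    records = list(range(cursor, min(cursor + page_size, 10)))
    return records, cursor + len(records)

def sync(state, fail_after=None):
    """Pull pages, checkpointing the cursor into `state` after each one."""
    pulled = []
    pages = 0
    while state["cursor"] < 10:
        records, state["cursor"] = fake_api(state["cursor"])  # checkpoint here
        pulled.extend(records)
        pages += 1
        if fail_after is not None and pages >= fail_after:
            raise RuntimeError("simulated mid-sync failure")
    return pulled

state = {"cursor": 0}
try:
    sync(state, fail_after=2)   # crashes after two pages (records 0-5)
except RuntimeError:
    pass
resumed = sync(state)           # resumes at record 6, not from the beginning
```

Because the cursor was checkpointed before the simulated failure, the second run only pulls records 6 through 9.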
[00:20:26] Tobias Macey:
Over the past 4 years, as you said, you started with a very straightforward "here's the open source project, here are a bunch of connectors, let's see what happens" to now committing to the 1.0 release. Obviously, there have been a lot of evolutionary steps architecturally. What are the core decisions from an architectural perspective that you made early on that have proven to be most effective and most stable? And what are some of the aspects that have required the most constant change and exploration to settle on what you're now committing to as a stable 1.0 version?
[00:21:08] Michel Tricot:
Yeah. The decision that I'm so happy we made in the past is running connectors as Docker images. Absolute game changer. I know it adds a little bit of overhead, but it matters for how we develop and how we think about maintenance, and that's the thing that is the most important: these connectors need to always work. So for me, that was probably the best decision we've made. The worst decision we've made, and I don't know if it's the worst one, I think it was a good decision at the time but we had to pay the price later, was that we did not optimize for scale at the time. And for me, as a founder, you should not optimize for something like scale until you have to hit some scale.
What we were optimizing for at the beginning was very much time to value: making sure that people can just download the software, get up and running, experience the value, and boom, it's done. But it had some implications. It means that there were some developments, or some architecture choices, that were suboptimal for operating Airbyte at scale. And, yeah, I think when we started to run cloud and to operate cloud, that's really when we started to hit this kind of problem. In 2023, we onboarded our largest customer on cloud. Thousands of jobs running in parallel.
It broke our cloud. Nobody outside saw it, but we saw it and we felt it, to the point where we were having a meeting and someone was saying, hey, I think we made a bad deal here. It's going to cost us more than they are paying us, it creates fires everywhere on cloud. And I said, yeah, actually, no, we want to be able to take on this kind of volume for the price that we provide. And it forced us to work on a completely new way of running connectors. It's something we released, I think the first version was in December, still behind a feature flag, and it was about how we can very quickly spin up new data planes, so that if we have a customer that needs a lot more, or if we run into limitations on Google's APIs or AWS's APIs for Kubernetes, boom, we can spin up a new cluster and everything works magically. But that took a lot of time, because the fundamentals were not there in the platform at the time, so we had to really work through: okay, we need scale, we need a lot of scale. The other one, and this one got us into trouble, was that we started to use dbt in our connectors directly. So when we put data into warehouses, we would just transform it using a generated dbt script on the warehouse.
And although it was very good at the beginning, when the volumes of data were small, the moment we started to have more data, we started to have people complaining about the cost of running Airbyte's normalization. And the last thing we want is to incur more cost. So that was also a big change in how we architect, not just the platform, but really some of the connectors, around how we manage data warehouses. Initially it was a great decision because it allowed us to move very fast, but at some point we had to really focus on: okay, what does it mean to run Airbyte at scale? We cannot do that. We have to be more efficient. We have to be smarter.
[00:24:51] Tobias Macey:
Some other interesting evolutions of the Airbyte platform are around the operability of it, where in the very early versions, it was: you deploy it, here's a web app, you point and click, here are your connections, off to the races. And then over time, you developed more API interfaces for being able to do connections as code, triggering syncs as code. And now you also have the PyAirbyte interface, where you can eschew the web server component and do it entirely in a Python script. And I'm wondering if you can talk to some of the internal conversations and decisions that precipitated each of those capabilities in the platform.
[00:25:39] Michel Tricot:
Yeah. I think it was very much a function of the maturity of the audience we were talking to. What happened in 2020 and 2021 is a lot of companies were just revamping their data stacks. I think COVID gave them a lot of breathing room to invest in this kind of project, and a lot of the people we were talking to at the time wanted a very easy solution, just a UI. Now, you know, when you have an open source project like that, you don't control how people use your software. And what we started to see happening very fast was people not using Airbyte for their own personal use case, but to create an application on top of it. And they started to hack around our terrible API that we had at the time and try to operate Airbyte programmatically.
And even ourselves internally, we started to have more needs for things like workflow orchestration, and a UI is not sufficient. So very fast, an API became something we had to do. The piece around PyAirbyte really shows the power of how we designed connectors as very thin pieces of code, because you can transpose them to any system. And PyAirbyte is an example of that: let's get rid of the platform, all the heavy features, and let's just use the connector itself within code and pull the data from code. But for me it was really a progression in the maturity of the user. First, very early BI people or data engineers bringing in warehouses, wanting something very simple. Then you start having people who develop applications on top of Airbyte, or who have more advanced use cases with orchestrators and things like that. And then it was: oh, now we have application developers that are building with Airbyte, and the platform is good, but maybe it's too prescriptive for them. They don't want a UI, they don't want an API, they just want to have access to the data. So how do we serve this audience? With PyAirbyte, I think, we're now in the software development land of people building applications, or people in notebooks doing experimentation, and they don't have to take on the whole platform. But when they go to production, they probably need the platform, because it has all the bells and whistles for reliability.
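The "connector as a thin piece of code" idea can be sketched like this. The `InMemorySource` class and its interface are hypothetical, invented for illustration; this is not PyAirbyte's actual API, just the shape of the idea: a configured connector you iterate for records directly in a script or notebook, with no platform in between.

```python
# Toy illustration of a connector as a thin, embeddable piece of code:
# configure it once, then read records straight into Python.

class InMemorySource:
    """Hypothetical minimal source connector."""

    def __init__(self, config):
        self.config = config

    def read(self, stream):
        # A real connector would page through an external API here.
        for i in range(self.config["record_count"]):
            yield {"stream": stream, "id": i}

source = InMemorySource({"record_count": 3})
records = list(source.read("users"))   # pull data directly, no platform
```

The platform features Michel mentions (scheduling, retries, state management) would wrap around an object like this when you move from a notebook to production.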
[00:28:16] Tobias Macey:
Absolutely. Yeah, that's definitely a big piece of my understanding about the utility of PyAirbyte: a big part of the value of the Airbyte platform is all of the state management, being able to see when the syncs ran, making sure to manage the checkpointing, the incremental state, so that I don't have to rerun everything all the time. But that point of being able to quickly test out the connectors in a development environment, as you said, brings a lot of value, because you don't have to set up the entire platform just to test something out. Circling back a little bit to the LLM and generative AI application ecosystem of generating the embeddings, it's very straightforward to do that in a naive manner of: just give me the text, I'll chunk it at arbitrary delimiters and feed it through into the vector database.
Now that ecosystem has developed a bit more maturity and a bit more understanding about what the appropriate chunking strategies are, how to associate different attributes and metadata with those vectors so that you can do things like filtering, and, now that things like graph RAG are starting to gain some attention, being able to associate those vectors with graph nodes. I'm wondering if you could talk to some of the challenges of understanding how best to surface those concepts in Airbyte, and when it doesn't make sense to do that in Airbyte because there's too much customization and application specific logic that needs to be embedded, so you need to move that into a different layer.
[00:29:55] Michel Tricot:
Yeah. I actually think that right now we still know almost nothing. And you see that when we talk to a lot of companies, whether they are startups or more mature companies: most of the time they don't use platforms right now. They build everything very vertically, and the reason they are doing it is that the pace of innovation and development cannot be blocked by what features a specific platform supports. That's why you're seeing a lot of it built very vertically. I think platforms will come, but we're probably 1 to 3 years ahead of that.
So at that point, with Airbyte, if you look at it from a UI perspective, we can do the first POC. We can get you to that first moment of understanding that you can get all that data with Airbyte into a vector store. Now, the thing that we really need to figure out is, okay, how is the world going to evolve? Because maybe today we have something for chunking, or for embedding, or for appending metadata, that will work for some of the use cases, but it will probably never work for most use cases. And that's why we have PyAirbyte, actually, because today we just want to be there when people look and say, oh, I don't want to build a connector to bring data from, I don't know, Gong calls or whatnot.
We just want to be there and provide PyAirbyte, and then be with them as they experiment, because they don't know, and we don't know, what they need today. So for us, it's just about: can we be there at the entry point and learn from how they're using it, so that then we can bring this learning into the platform, and at some point have that as a data movement platform for this type of use case. And that's really the path we've taken: we try to be there as they develop.
[00:32:13] Tobias Macey:
Another element of what you're building and how it exists in the ecosystem is that the competitive landscape for data movement has changed fairly substantially over that same time period, where when you started out, the Stitch data ecosystem was kind of the option if you wanted open source. There was a high degree of variability in the quality. There was no cohesive platform for operating those. Meltano came up to try and address that situation and help to bootstrap that community. On the commercial side, there was Fivetran, who is still one of the more notable fully commercial options.
But then there have also been a number of other entrants, most notably, I'm thinking of DLT as a very composable data movement capability. And I'm wondering how that shifting landscape helps you understand what are the needs in the ecosystem so that you know how to build and address them as well as how to position yourself for people who are coming to this problem fresh and trying to understand what are my options, what are the differentiators, and how do I think about which tool to select to solve my problem?
[00:33:30] Michel Tricot:
Yeah. I would say that, primarily, if you don't have connectors, it doesn't matter what you do. That's the value you need to get. You know, when I think about how we built Airbyte, there was a ping pong between connectors and platform, where we started with a ton of connectors but a mediocre platform. Then we invested a lot in the platform and did not invest as much in connectors. And now that we have a platform that is running at high speed, high reliability, etcetera, we're focusing again on connectors. Because at the end of the day, people take what the platform gives you as a given. They don't think that there is effort to be done in the platform, when this is actually a lot of what we provide: hey, you don't have to worry about data schema changes, etcetera.
So there will always be other players. At the end of the day, people will pick another platform if the connectors are not the right quality. That's gonna be the thing. And now that we have a dominant player on the fully commercial side, and Airbyte on open source, yes, other players can come, but at the end of the day, people will make the choice based on the connectors that are available. Now, there is a question around use cases, and I think the thing we're looking to address with PyAirbyte is how we address use cases that are more sophisticated than pushing data, sophisticated use cases that are more downstream of warehouses.
Here, the sophistication comes from what kind of preprocessing you need to do on your data, the fact that you're not using a warehouse to process your data. And PyAirbyte is here to do that first discovery. I think we have a strong community, a lot of connectors, and a lot of use cases that are addressed today. So
[00:35:37] Tobias Macey:
Absolutely. And also, as you look across the entire data landscape, it's definitely not a zero sum game. There are still so many people who are just writing bespoke code, spending all of their time maintaining those different point to point connections, that if they choose any of the tools, then it's a win for the whole ecosystem.
[00:35:56] Michel Tricot:
Yes. Exactly. And, honestly, I would say that's also the power of open source: preventing people from having to build by providing it. I mean, I'm a lazy engineer. Most engineers are a little bit lazy, not in the bad sense, but they just don't want to have to build the thing they don't want to build, and they're gonna find it in open source. So the question here is, do they start adopting Airbyte, or do they start adopting something else? And I think today, a lot of it is coming to Airbyte. But yes, we have to be very ready for, and provide very strong support for, the "I don't want a platform, I just want the connectors, because I'm building something that no platform can do today" case. And that's why having our connectors as code, for in-code use cases, is very important.
[00:36:50] Tobias Macey:
Now we've talked a lot about the history, the journey that you've been on. We've touched a few times on the fact that you're gearing up for this 1.0 release. When you do commit to a 1.0 version, there is a signaling of a certain degree of stability and commitment to the way things are right now, and the fact that you're not going to drastically change anything. And I'm wondering if you could talk to the decision process, the discussions, the confidence building that you've done both internally and in the community that makes you feel that you're ready for that 1.0 version, and what that 1.0 version is intended to signal.
[00:37:30] Michel Tricot:
Yeah. We had a lot of conversations internally about when we should do 1.0. At some point, we were even talking about doing it in 2021. That would have been a huge mistake. The thing about 1.0 is that it's really about having hit a milestone in terms of the technology. Now, technology is not a milestone by itself. It has to be proven. It has to be proven by the type of workloads that we are running, the type of feedback that we are getting. And, you know, today, we believe that the platform we have is the foundation, or the ramp, for even more data movement use cases. And that's what we wanted 1.0 to be. And the proof points that we wanted to have were: do we have more than a certain number of enterprise users, whether they are paid users or whether they are open source users?
And we got to this one. We actually hit that number earlier this year, I think it was in April or something like that. But that was something we wanted to see: critical data analytics pipelines being handled with Airbyte at a certain amount of scale. The other piece was around connector quality, because, as I told you, we have this ping pong between connectors versus platform versus connectors. And for us, it was: do we feel good about the manufacturing process that we have in place for connectors? And today, we do. Our ability to bring in more contributions from the community, our ability to react when a connector has an issue, our ability to maintain that connector over time. I think today, we've hit the number we wanted. And after that, it's just diversity of workloads.
We wanted to have, yes, the click-the-UI experience. We wanted to have the programmatic side with the API. We wanted to have more of the CI/CD side with Terraform, and we wanted to have PyAirbyte for people who are building on top of data connectors. That was really what we wanted to get with 1.0.
[00:39:45] Tobias Macey:
And as you have been gearing up to that 1.0 launch, what has that meant in terms of the requirements that you have around new connectors being adopted, the requirements around existing connectors and their ongoing maintenance, the commitment to which connectors you are going to continue to support, which ones may still phase out and be deprecated, and the commitments to the community engagement and what they can expect from Airbyte going forward.
[00:40:20] Michel Tricot:
Yeah. So I think here we need to make the distinction, with regard to connectors, between sources and destinations. For destinations, our goal is that the moment we start building a destination ourselves, as Airbyte, this is something that we'll continue to maintain, because we've also made that choice based on seeing a lot of users and enterprises asking for that particular destination. At that point, it also becomes something that we would charge for. So the moment they are Airbyte-maintained, they will be there for, like, probably ever. I'm sure there will be some exceptions, but that's the thing.
Now, in terms of community or marketplace connectors, at the source level, what we've done is we've deprecated a few of those in the past, and it was more about us reinvesting into high quality connectors and making sure that, as we build that engine, we also simplify the work for ourselves. People can still take them, but right now, in order to make those connectors fit with the systems we are building, it's too much of a lift, and it would require a full rewrite to be compatible with the level of quality we want. So at that point, we just deprecated them. Some people are actually taking them, reshaping them, and publishing them back, but for us, that deprecation was a one time thing. It was more about: hey, this is all the learning we've had. These connectors that were built in 2021, we don't want to maintain them, and nobody in the community is doing it, so we're just gonna remove them, because they reflect badly on the platform and the people who are using Airbyte. But now it's gonna be very different. We're also publishing more of the metrics that we look at internally around connector quality and connector reliability.
So these are now all available in our documentation. That's also the thing we wanted to do with 1.0: setting the right expectations for how a connector is behaving, how it is working, and making it much, much easier to modify and to maintain. And that's, to me, the big thing about 1.0. We're not gonna remove connectors. We're just gonna provide the right tools to make them better, and we're also providing good visibility into the quality. I'm not sure if that fully answers your question, though.
[00:43:03] Tobias Macey:
No, I think that you did a good job addressing it. I was mainly just trying to get at what changes people can expect around Airbyte from the community side, as well as the commitments that you're making with this 1.0 release, and any substantial changes that need to be made ahead of that release to put you in a good position going forward.
[00:43:26] Michel Tricot:
Yeah. The other thing we've done is, as part of the system that we're building, there is also how we manage community contributions. You know, I have a graph, and at some point it was showing how fast we reply to PRs, and there was a moment where it was not good. Now we've got it down to maybe a day or 2, but that's part of making the process much stronger. And our goal, especially for connectors, is: how can we get contributions into the mainline faster? That's part of building that process, building that factory for connectors.
[00:44:09] Tobias Macey:
Another interesting decision that you made early on, and one that seems to have paid dividends, is that you are keeping all of the connector definitions, from the source code perspective, in a monorepo. Well, actually, now it's in two repositories, but all of the connectors at least are in one repository, versus being spread around GitHub and having to be recollected via some website, etcetera. As you move towards this 1.0 release, is that something that you intend to continue with? And what are your thoughts on the possibility or viability of having some sort of community discovery platform for people who are building other connectors out of band of the Airbyte repository, or any support that you want to have for third-party connector libraries?
[00:45:07] Michel Tricot:
Yeah. Actually, we do have some of that. There is a company that has been building connectors by themselves, and they host them on their own repo. So another thing that we're releasing with 1.0 is this concept of a connector marketplace. The connector marketplace can be both for community members and for vendors who want to build their own connectors. This is also a place where we're going to listen to what people are telling us, but ideally, most of the low-code connectors should live in the Airbyte repo. The reason we think that's a pretty low lift is, one, we can now very quickly review this type of PR, and two, there are no security concerns with these connectors because they are low code. We control the execution framework of these connectors; a connector is just a YAML file that gets executed.
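(For readers unfamiliar with the low-code format: a declarative connector is defined entirely by a manifest file that the platform's engine interprets, rather than by arbitrary code. A minimal sketch of such a manifest might look like the following; the endpoint, stream name, and version string here are illustrative assumptions in the spirit of Airbyte's declarative format, not an actual connector from this episode.)

```yaml
# Illustrative sketch of a declarative (low-code) source manifest.
# The API endpoint and stream shown here are hypothetical examples.
version: "0.78.0"
type: DeclarativeSource

check:
  type: CheckStream
  stream_names: ["users"]

streams:
  - type: DeclarativeStream
    name: users
    retriever:
      type: SimpleRetriever
      requester:
        type: HttpRequester
        url_base: "https://api.example.com"
        path: "/v1/users"
        http_method: GET
      record_selector:
        type: RecordSelector
        extractor:
          type: DpathExtractor
          field_path: ["data"]
```

Because the manifest is pure configuration, the platform can validate and run it inside its own execution framework, which is what makes review and security vetting of these contributions comparatively cheap.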
As for connectors on other repos, would we put them on the platform right away? I don't know, because at the end of the day we're dealing with data, and we want to make sure of not just the quality but also the security of these connectors. Everything that we run behind the scenes to validate that a connector is not doing something crazy, we want that to run before the connector runs on the Airbyte platform. But people can already do it. They can already build connectors in their own repos. It's just that we want to vet the connectors, because, and I've worked in data since 2008,
data is a key asset that every company has, and it's often very privacy-centric. We have to make sure that this data is always safe, and that's basically what we guarantee by having connectors on Airbyte: they go through our security processes and reviews.
[00:47:13] Tobias Macey:
As you have been building Airbyte, growing the company, the platform, and the ecosystem, what are some of the most interesting or innovative or unexpected ways that you've seen the technology applied?
[00:47:26] Michel Tricot:
Yeah. The first one, which we discovered in 2022, was people who built a Redis destination and were using Airbyte to do a cache warm-up. I was like, yeah, that's a cool use case, I love it. It's the kind of thing where you don't have control; you're just giving people a hammer and some wood, and they build something you never thought about. The other one, and this one became a product, was this Powered by Airbyte idea. Initially, we were just thinking of Airbyte as, hey, you use it to push data into your warehouse, and then people started to build applications on top of it. That's a cool use case, never thought of that. We kept seeing people asking, hey, how can I build on top of Airbyte? And that's when we released the first version of the API, so that people don't have to click buttons and fill out forms.
Those, I would say, were two very interesting ones, because one led to a product and the other was just very, very smart. We have a few others, but I don't know if we want to cover all of them.
[00:48:44] Tobias Macey:
And in your own experience of going through this journey, building this platform, building this business, what are the most interesting or unexpected or challenging lessons that you've learned?
[00:48:55] Michel Tricot:
Yeah. The first one: when we started our batch, someone gave us advice. Two pieces of advice, actually. One was, don't optimize for scale, optimize for time to value. The second one was, you will never know who is using your software. Well, we followed the first piece of advice, the one around scale. We did not invest in scale initially. That was very cool at the beginning, but it became a big boulder that we had to push forward at the end; now we're there. As for not knowing our users: we had Slack very early on, and very early on in the UI we were asking, hey,
what do you do? Where are you from? Etcetera. Having that ability to know the customer or the user, whether on cloud or on open source, turned out to be possible. People were willing to help us by giving us information like that, and it gave us a direct line to them. Especially early on, and we still do it now, though maybe not at such a large scale, every time we were wondering, should we build X or Y, we would just go on Slack, or we would put the question in a newsletter, and within 30 minutes or an hour, boom, everyone was telling us: we prefer X, or we prefer Y.
So in terms of product development and where you focus your team, that was invaluable. Now, I would say the thing that is hard about building open source, especially for infrastructure, is that you don't control who is using your software, meaning you will have people who face issues and will not be successful using Airbyte. As a product builder, it pains me to be in a situation where I know that someone cannot be successful with Airbyte. You have to live with the fact that some people will not be successful with it, because you cannot control who is downloading it, and you cannot say, oh, yeah, your use case is probably not a good fit for Airbyte. I think that's the place where, especially as we're building open source, the question is how we want to widen the net, the spectrum of users that we want to be able to make successful.
That's really the challenge.
[00:51:31] Tobias Macey:
And on that note, what are the cases where Airbyte is the wrong choice and somebody should either build something custom or use some other technology that's off the shelf?
[00:51:44] Michel Tricot:
You know what? I would have had an answer for you a few months ago, but now we've released PyAirbyte. So I would say, for pre-1.0, yes, there was one gap around streaming types of use cases. We don't do those; it's planned for post-1.0. But otherwise, with PyAirbyte now, you can do whatever you want. You have access to the whole catalog of connectors in code, and you can add all the logic you want on top of it. So, yeah, that would be the former gap, and there's no gap today.
[00:52:21] Tobias Macey:
And looking forward to that post-1.0 world, you hinted at streaming. What are some of the other capabilities that you have planned for the future of Airbyte once you have moved past this 1.0 milestone? You've committed to stability in the platform and the interfaces, so you have a more stable target to build off of and move fast on.
[00:52:46] Michel Tricot:
Yeah. For us, it's going to be very much about operational use cases. I think we've been very focused on analytics and data warehouse use cases. All the AI use cases are pushing us in the direction of operational use cases, and reverse ETL is an operational use case. Streaming, in my head, only makes sense for operational use cases, where it's not a human making the decision but a machine. So a big focus for us post-1.0 is how we address things that are not analytics but are operationally driven.
The other one is that we have a large roadmap around enterprise connectors. We're starting to release some of those, but we have a long roadmap, and we're working with our enterprise customers to build them, so we want to release them in the next few months. Speed is also going to be a big deal. I think we have some low-hanging fruit that we want to make faster. You'll see, we've made a few changes in the past few months and speed has gone up quite significantly, but I think we can still do a 10x. That's something I'm very excited about. Data needs to move as fast as possible; we can never be the bottleneck.
[00:54:07] Tobias Macey:
We didn't really touch on it throughout this conversation, but I'll also say thank you for all of the work that you and your company have done from the outreach perspective: providing a lot of high-quality blog posts, conducting the state of data engineering surveys year over year to help collect some of that information, and all of the investment that you're putting back into the community. With that being said, are there any other aspects of the work that you've been doing on Airbyte, the 1.0 launch, or the future that you're building towards that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:45] Michel Tricot:
No, but I think unstructured data is going to be unleashed, and I'm very curious to see how that pans out. We have some theories, but basically the 80% of data that was very, very expensive to leverage is going to be unleashed. I'm looking forward to that world and want to make sure that we're positioning Airbyte in that field as well, because we're data geeks; we want to push and get more data through the pipes.
[00:55:21] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:55:38] Michel Tricot:
Yeah. I think the biggest gap is what we do with unstructured data. It's going to involve a lot of things around the security and privacy of that data, because it's not as easy as running a regex on something. So to me, that's a big deal today.
[00:55:58] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you and your team have been doing on Airbyte. Congratulations on the 1.0 milestone. I've been using Airbyte for several years now, so I appreciate all of the value that it has provided for my own work. Thank you again for all the time and energy you folks are putting into it, and I hope you enjoy the rest of your day.
[00:56:22] Michel Tricot:
Thank you, Tobias.
[00:56:31] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Welcome
Michel Tricot's Journey in Data
Overview of Airbyte and Milestones
Building Reliable Connectors
Challenges in Destination Connectors
Industry Trends Impacting Airbyte
Lessons in Data Movement and Integration
Architectural Decisions and Evolution
Operability and API Interfaces
Generative AI and Data Movement
Competitive Landscape and Differentiation
Commitment to 1.0 Release
Community Engagement and Connector Management
Innovative Uses of Airbyte
Challenges and Lessons Learned
Future Plans Post 1.0
Closing Remarks and Future of Data Management