Summary
The first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy-to-use open source option. The Meltano project aims to provide a solution to that situation. In this episode, project lead Douwe Maan shares the history of how Meltano got started, the motivation for the recent shift in focus, and how it is implemented. The Singer ecosystem has laid the groundwork for a great option to empower teams of all sizes to unlock the value of their data, and Meltano is building the remaining structure to make it a fully featured contender for proprietary systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your data engineering career? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Douwe Maan about Meltano, an open source platform for building, running & orchestrating ELT pipelines.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Meltano is and the story behind it?
- Who is the target audience?
- How does the focus on small or early stage organizations constrain the architectural decisions that go into Meltano?
- What have you found to be the complexities in trying to encapsulate the entirety of the data lifecycle in a single tool or platform?
- What are the most painful transitions in that lifecycle and how does that pain manifest?
- How and why has the focus of the project shifted from its original vision?
- With your current focus on the data integration/data transfer stage of the lifecycle, what are you seeing as the biggest barriers to entry with the current ecosystem?
- What are the main elements of your strategy to address these barriers?
- How is the Meltano platform in its current incarnation implemented?
- How much of the original architecture have you been able to retain, and how have you evolved it to align with your new direction?
- What have you found to be the challenges that your users face when going from the easy on-ramp of local execution to then trying to scale and customize their pipelines for production use?
- What are the most critical features that you are focusing on building now to make Meltano competitive with managed platforms?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Meltano?
- When is Meltano the wrong choice?
- What is your broad vision for the future of Meltano?
- What are the most immediate needs for contribution that will help you realize that vision?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show, please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Meltano
- GitLab
- Mexico City
- Netherlands
- Locally Optimistic
- Singer
- Stitch Data
- dbt
- ELT
- Informatica
- Version Control
- Code Review
- CI/CD
- Jupyter Notebook
- LookML
- Meltano Modeling Syntax
- Redash
- Metabase
- Apache Superset
- Apache Airflow
- Luigi
- Prefect
- Dagster
- TransferWise
- PipelineWise
- 12 Factor Application
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your data engineering career? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Douwe Maan about Meltano, an open source platform for building, running, and orchestrating ELT pipelines. So, Douwe, can you start by introducing yourself? Yes. Of course. First of all, thanks for having me, Tobias.
[00:01:28] Unknown:
So my name is Douwe Maan, like you mentioned, and I work at GitLab. I've been at GitLab for a little bit over 5 years now. I originally joined as a developer, then I became development lead, which turned into an engineering management role at some point. And then about 9 months ago, 4 years into my time at GitLab, I moved over to the Meltano project. And Meltano, of course, is what I'm here to talk about today. And do you remember how you first got involved in the area of data management? Yeah. So, really, 9 months ago, when I joined the Meltano team, is when I got involved in the area of data management. Like I mentioned, my background is in software engineering. I joined GitLab as a developer 5 years ago, and I moved into development lead and then engineering management. And a year ago or so, the Meltano team at GitLab was in need of an engineering manager. And at that point, having seen GitLab grow from 10 people to about a thousand, I was kinda starting to feel that itch of wanting to work on a new project again, and maybe something smaller. So when the Meltano opportunity came around within GitLab, I didn't hesitate, and I grabbed it. So I've only been involved in anything related to the data space for about 9 months or so. Before that, of course, I knew about data management, but I've really only started to read up and become an expert recently.
[00:02:39] Unknown:
Given that you're so new to this area, what are some of the aspects of the learning curve that you've been running into as you get ramped up on the project and the use case that it fills, and some of the challenges within the overall ecosystem that you're trying to tackle? Yeah. Great question. So it's interesting in a way, because I joined the Meltano team back in,
[00:02:59] Unknown:
September of last year, but it wasn't really until March of this year that I actually started digging into, you know, data engineering, data management, as you call it, and the tools that are available in this space and the problems that they're meant to solve. When I came on board to the Meltano project, it had already been around for about a year and a half inside GitLab. So we had a pretty clear idea of what the tool was that we were trying to build, and I was really just approaching it as a developer building that tool based on the roadmap that we had laid out ourselves. Since then, I've been coming up to speed on the data space, what kind of tooling is available here, the role that GitLab and Meltano might fill in the space, and how that relates to, you know, what we originally set out to do 2 years ago and where our opportunities lie today. The learning curve has actually been less steep than I thought it would be, because I've had the opportunity to talk with a lot of really great people from the data scene, who I've met and been introduced to through channels like the Locally Optimistic Slack, and also the Singer and dbt ecosystems have a lot of really great people in them, who have been able to basically point me in all the right directions. I wouldn't at all call myself an expert on data management in general, but I am a little bit of an expert now in specifically the open source ELT field, in large part because of what I've learned from these people.
And through that information, I've actually been pleasantly surprised by the amount of material that is available online and in writing about,
[00:04:20] Unknown:
you know, how all of this fits together. So the learning curve hasn't been as steep as I expected, actually. And given that you are still new to it and you still have sort of the beginner mindset as it pertains specifically to data management and ELT, what are some of the benefits that you see that providing as somebody who is taking on the project lead role of Meltano?
[00:04:41] Unknown:
Yeah. I think one thing that makes a difference is that my background is very much in software engineering. So that means that, especially coming out of GitLab, which is, of course, an end-to-end platform for the entire DevOps lifecycle, from the get-go I am approaching data engineering and this entire topic with this software development and DevOps mindset, and I kinda come into it expecting all of the benefits that best practices of DevOps, like code review, continuous integration and delivery, and version control in general, provide. Historically, it looks like a lot of data engineering and ELT pipelines have been implemented and realized through more visual tools; Informatica is one that, of course, is a big name, even though today it might no longer be, you know, the go-to modern tool. But it means that, in trying to build Meltano, making it fit into the DevOps lifecycle, where it's just another software engineering project with people interacting through version control and code review, to me is a given. And it means that in figuring out how to build a data engineering or ELT tool, these are kind of fundamentals from the beginning, instead of something that I had to learn over time. So I think one advantage of my being a software engineer by trade who is now getting into data engineering, instead of the opposite, which is, of course, what you see behind a lot of other data engineering tools out there, is that people who are looking for some of these software engineering and DevOps benefits will find that Meltano is probably closer to what a traditional software engineering project might look like. So from the get-go, if you start working with Meltano, we will expect you to be comfortable with topics like version control, continuous integration and deployment, and code review and the like, because we think that these are a core part of what Meltano and similar tools bring to a team, and where a lot of that extra value actually comes from, if you're looking at it from a, you know, collaboration and, ultimately, efficiency and output perspective. So in that sense, my being a novice means I really have to hear from experienced data engineers what it is that they would like a tool like Meltano to do, with me then kind of figuring out, okay, how do we fit that into this DevOps approach to data engineering?
It's been really valuable that I don't have too many preconceived notions of what an ELT pipeline looks like. And that's also why, and you'll see this later on in the call too, one of the main things that I'm actually looking for in terms of contribution to Meltano right now is just in the form of feedback from, you know, people experienced with data engineering in general, and with open source data engineering around Singer taps and targets in particular, but we'll get to those in a second, I'm sure. And digging deeper into Meltano itself, can you give a description about what it is and some of the story behind it? And I know that you recently pivoted in terms of the main focus behind it, so if you can give a bit of context there as well. Absolutely. My pleasure. So, like you mentioned, today Meltano is explicitly an open source platform for building, running, and orchestrating ELT pipelines. And in a moment, I'll clarify, because these are specifically ELT pipelines built out of Singer taps and targets for the extraction and loading bits, and then dbt models for the transformation bit. But originally, when Meltano was founded 2 years ago within GitLab, this was only a part of what we wanted to realize with Meltano. So 2 years ago or so, in the summer of 2018, the GitLab data team was scaling up, ramping up. GitLab as a whole was growing, and we realized we needed to do more with the data that we were gathering. So we started to build our data team and put together our data stack. And coming from the open source background, GitLab itself being an open source project originally, and even today being an open core product where a really large amount of our engineering time every day actually goes into the open source version, which is, you know, freely available to all, rather than the proprietary edition that we make money on, we started looking for open source tooling first. Before we checked out some of the more popular proprietary and paid tools, we wanted to see if it was possible to build a full data stack, with everything our data engineers, analytics engineers, as well as analysts and data scientists would need, just out of open source tooling. And what we found is that all of this open source tooling already did exist, and if you, you know, went through the trouble of actually tying all of these components together, you could build a pretty robust data integration pipeline out of only open source components. But we also realized that the glue in between these different components, so you gotta think of the extractor, the loader, the transformations themselves, but also the orchestration, which manages running this on a schedule and then making sure it keeps running reliably and that we will be notified when it fails, we recognized that between a lot of these open source tools that existed, this glue hadn't necessarily been filled in. Or at least there wasn't an open source tool available that you could really just get started with, and 10 minutes later see your data flow from the data source to the data warehouse, and then also have the opportunity to actually start analyzing it. So what we realized is that there would be value in us building the tooling to glue together these various open source components. So Meltano was founded in GitLab with the idea being, well, we want this ourselves. Let's build it, and let's also build it in the open for the wider community.
Relatively quickly, though, we came to the conclusion that the pace of development of Meltano was not able to keep up with the growing needs of the actual GitLab data team that was, you know, set to use it. And this had a lot to do with the fact that, of course, we wanted to extract data from various data sources, SaaS APIs, and data formats, and load it into a data warehouse. But we realized that if we were going to have to write and maintain these data extractors ourselves, this would take a lot of time and a lot of effort that, at that point, might be better spent on actually, you know, getting something out of that data, and we might be better off going with a proprietary tool for the moment. But being GitLab, we still very much believed that this should all be possible with open source tooling. So the Meltano project stuck around, even though the GitLab data team at the time was no longer actually using it. And since with GitLab we had found quite some success in offering, you know, a full tool for the DevOps lifecycle to an entire engineering team, doing everything from version control, issue tracking, CI/CD, and some amount of tooling around, you know, security checking as well, we saw that there was a place in the open source space for a similar tool for the data lifecycle. Again, the idea being that, composed of various open source components, you could spin up this tool called Meltano and immediately have kind of a starting point for your entire data team, a single source of truth for what their data pipelines and their data stack look like. Not just from the data integration perspective, like how do you get it from the sources to the warehouse, but also what you do with that data next, where, of course, you start looking into analytics or BI software, you know, notebooking with Jupyter, or other data science practices that ultimately all connect with that data warehouse. So that's where Meltano started, with this kind of end-to-end vision. We wanted to build a tool that does all of this, because we saw value there with GitLab as well. But then over the last 2 years, it became clear that this end-to-end vision, while it resonated with other people and data teams out there, we hadn't necessarily been able to actually attract teams that were, at that point, willing to start evaluating it internally, or were actually able to start contributing to it to make this vision a reality with us. So 3 months ago, 2 months ago, back in March, I really started looking into, okay, we've built something pretty cool now, and it does actually work. You can use Meltano for the end-to-end data lifecycle, so it can do everything from data integration to data transformation, and it had some basic point-and-click analytics functionality built in as well, with a modeling language inspired by LookML that allows you to basically describe the schema in the data warehouse, and then describe how these various tables relate to each other and can be joined, so that you can then use the Meltano interface to kind of point and click and get some simple dashboards and reports out of that. But we realized that this future where an entire team would be able to use Meltano and, I don't wanna say add nothing else, because we recognize that on any of these kinds of steps in the lifecycle, specific teams or people might have bigger needs that go beyond what Meltano is able to deliver today, or ever.
And, you know, Meltano is not too opinionated about whether or not you use all of its parts, or whether you kind of pick and choose and decide to swap certain things out for possibly a proprietary tool or some other open source tool. But we realized that the story of Meltano, and a team actually deploying it, would still depend on that entire team, of course, being convinced of the extra value they would get out of using Meltano compared to what they are using today. And then if we wanted this team to actually contribute and help us make this a reality, that also means that this team would need to be, you know, pretty technical, or at least comfortable contributing to Python projects. And not just the Meltano project itself, but also the specific extractors and loaders for all of the various data sources and data warehouses. So we came to the conclusion that with this end-to-end vision, we were not actually able to get the people excited that we needed to get excited to make this a reality with us. Because if you wanna convince an entire team of the value of this integrated tool, then it basically already needs to have reached a level of quality in each of the various steps that make up that lifecycle. And ultimately, the people getting the most value out of any data project are, of course, the people getting the insights at the end, the people doing the analysis or running, you know, notebook projects against it, for example. And these are not always the same people who are actually capable of contributing to extractors, which are highly technical Python projects specialized in, for example, pulling data out of Salesforce or Google Analytics or Facebook Ads or what have you. So we reached the conclusion that in order to make this eventual future, in which there is a single end-to-end tool that teams can get started with, built only out of open source components, for that to be a reality, we really had to start at the beginning of the journey, which is data integration. And the interesting thing is that we realized that in the open source space, there exist a number of really great open source tools that kind of sit at the end of the data lifecycle. So think of BI and analytics tools like Redash, Metabase, or Apache Superset, which you can connect to the data warehouse and then get going from there. And there exists really great open source transformation tooling as well, specifically dbt, you know, with its dbt models, which analytics engineers who are capable of SQL are, obviously for great reason, getting really excited about these days to transform their data. But then for the first step in the pipeline, where you're actually getting your data out of a data source and piping it into a data warehouse, we realized that there were some open source projects that have kind of attempted to make a dent here, and to try to offer something that could really serve as an alternative to some of these proprietary and hosted tools out there. But today, in large part, I think because these hosted, proprietary ELT and data integration platforms have such great data source and data warehouse support, both in quantity and in quality,
a lot of companies out there, even if they do use open source technology in the transformation or the analytics stage, or if they were interested in putting together an open data stack that is completely based on open source components, including something like Airflow for orchestration, for the actual data integration bit, most companies that don't have the resources to actually build and maintain all of the extractors themselves would end up opting, for very understandable reasons, for one of these paid, hosted, proprietary tools. Where, of course, you pay some money upfront, or, I guess, in most cases, on a subscription or usage basis, and then you hand all of the burden of both maintaining and building these extractors, as well as the burden of actually keeping these pipelines running stably in a production environment, over to this other party. So we realized that the place where the Meltano end-to-end story would really kinda have to start is in that integration bit, because no one is going to switch to a full open source, end-to-end data lifecycle tool unless its data integration chops are competitive with what people can find in paid and proprietary tools out there. So the decision was made to, for the time being, start focusing specifically on turning Meltano into, you know, a really great and truly competitive open source alternative to these proprietary ELT platforms out there, with the kind of greater ideological goal being to make the power of data integration available to all by building this true open source alternative. Because right now, the data integration space has essentially become pay-to-play, where unless you actually have the resources in-house to build and maintain these extractors and all of this tooling for running and orchestrating these pipelines yourself, you are almost forced to go with one of the paid options out there. Which means that a large portion of the companies out there in the world who would benefit from doing something more with their data are currently not actually able to make progress on that goal, or on that ideal, until they have figured out the data integration step, which usually now means paying for it. So we realized that there was a great opportunity in the open source data space. Not on the analytics or BI front specifically, because, like I mentioned, there are a number of tools that already fill that need, and on the data transformation stage, the same is the case. But on the data integration front, we felt that that is really where the open source data story kind of falls apart, because for most companies today, the open source tooling available in the space just isn't sufficient and cannot truly compete with the paid options out there. So the history of that is that, back in March, we pivoted very specifically to the ELT side of things, and that's what we've been trying to bring to the attention of the public over the last 2 months or so. And a month ago, we officially announced that new direction with a blog post, and that is also what sparked my reaching out to you over Twitter and my being on this podcast today. So I'd love to talk more, over the duration of this interview, about this future direction, what this means, and where we can find contribution from people. Yeah. It's definitely, as you said, one of the biggest challenges is
[00:18:40] Unknown:
once you have the data, then what you do with it is generally fairly specific to the organization and the questions that you're asking. But getting a hold of the data in the first place, as you said, is one of the challenges, because of the fact that there are so many different sources, and the number of sources is generally growing at any given time for any given organization. And also, as those data sources evolve and mature themselves, the specifics of how to integrate with them, or the format of the data that they're producing, is going to evolve. Which means ongoing maintenance, because you can't just write the integration once and then not have to touch it again. You have to make sure it stays up to date with all of the representations and all of the available options for what that source dataset is able to provide to you. And as you mentioned, one of the projects that you are working on using to help bootstrap your work is the Singer project, which already has some library of taps and targets for being able to pull that data out and load it into some destination. But I also know that the overall community around that solution is a bit of a patchwork. There's not really any sort of cohesive aspect to it. And so I'm wondering, in your efforts to build this open source data integration platform, what are you seeing as the primary strengths of Singer as an option and the benefits of using that as your basis going forward? And what are some of the shortcomings that exist, either in the community or technological aspects of it, that you're trying to improve or work around in the work that you're doing at Meltano? Yeah. Great question, and that's exactly where I wanted to go with this next. So like you mentioned, the Singer
[00:20:20] Unknown:
ecosystem. So Meltano data integration, of course, starts with extracting and loading. So you need an extractor and a loader. You need a tool that manages pulling the data out of a data source, and then another tool that manages pushing the data into a data warehouse, or another file format, or whatever it might be. And Singer is a specification that describes how to write scripts that can take on that role of extractor and loader. And specifically, what the Singer specification does is describe a format for the intermediary, I guess, format that the Singer extractor outputs, and that then serves as input to the Singer target. So ultimately, the Singer specification is not much more than a description of how the tap and target can communicate. Like, what is the format that this extracted data should be in at the intermediary step, that then any arbitrary target can take as input for it to convert that into the correct, you know, insert statements or whatever you have, in order to load the data into a data warehouse. So a project like Meltano, a data integration platform, always starts with, like, how are we gonna write these extractors and loaders? So when Meltano was originally started and the GitLab data team started, you know, looking into building its extractors and its loaders for the data sources that we ourselves wanted to connect with, we of course looked around to see what formats, what options, and what libraries of existing taps and targets, or rather extractors and loaders, already existed, and we came across Singer pretty quickly. And the interesting thing about Singer is that it was built by Stitch, specifically. Stitch is one of these hosted ELT platforms that we think Meltano will be able to compete with one day. And Stitch created the Singer specification to allow Stitch users, as well as data engineering consultancies that serve Stitch users, to build extractors for data sources that Stitch did not support yet, which could then be kind of plugged into the Stitch system once they had passed some review process. So what you see, if you look at the current library of Singer taps and targets, is that they exist for a good number of data sources, but especially those for which Stitch doesn't have native, out-of-the-box support just yet. And this is very powerful, because, like you mentioned, new SaaS services are popping up every day, and there are various regions around the world where the most popular SaaS tools might not be the same ones that a US company is likely to use. So it's really great that Stitch allows their users to, you know, kind of build these plugins that allow data to be extracted from sources that weren't previously supported. And because Singer has this kind of existing community around it, of both users who are trying to build these taps and targets for use with Stitch, as well as data engineers and data consultancies that are building these for use by their customers.
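To make that intermediary format concrete, here is a rough sketch of what flows over standard output between a tap and a target: one JSON message per line, per the Singer specification. The tap name, stream, and field names here are illustrative assumptions, not taken from any specific tap:

```sh
tap-gitlab --config tap_config.json | head -3
# {"type": "SCHEMA", "stream": "projects", "key_properties": ["id"], "schema": {"properties": {"id": {"type": "integer"}, "name": {"type": "string"}}}}
# {"type": "RECORD", "stream": "projects", "record": {"id": 1, "name": "example-project"}}
# {"type": "STATE", "value": {"projects": "2020-06-01T00:00:00Z"}}
```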
We saw this as the most promising open source ecosystem of extractors and loaders today. And another advantage is that it's all written in Python, which is, of course, the de facto language of data engineers today as well. There are some alternative, you know, open source ELT specifications and sets of extractors and loaders that have been written in languages like Ruby, but obviously that was a downside from the perspective of wanting to make it really easy for actual data engineers to get started with maintaining these, which is probably why we picked Singer at GitLab originally. But since Singer was originally founded, and still explicitly kind of is, primarily intended to be used with the Stitch platform,
the Singer taps and targets by themselves don't get you a data pipeline that you would actually be comfortable running in production. So a couple of things that we have found are currently lacking in the Singer ecosystem. First of all, there's no really great story about, once you've found a Singer tap and a Singer target, an extractor and a loader, for the data source and data warehouse that you're trying to use, how do you actually turn this into a data pipeline that you would be comfortable running in production, where you're not having to double-check every day to make sure that it didn't break down? Because at the lowest level, Singer taps and targets are just simple executables that take a couple of flags and then use standard in and standard out to input and output data following the Singer specification. So you need some kind of runner tool around that, which can actually keep track of piping these together in a reliable way, managing the configuration of both the tap and the target, so you should think about, you know, credentials or other configuration options, as well as managing the state of the pipeline, so that when the pipeline is run a second time, it starts off where the first run left off. In the Singer ecosystem, a couple of these different runners currently exist, and you can run them locally and they work just fine. But then if you wanna deploy these into production, you have to figure out yourself how you're going to orchestrate them. Fortunately, you know, Airflow supports a Bash operator, which allows you to just call out to one of these runners, and every orchestration platform or workflow management system supports Bash scripts or commands in a similar way, but it still requires quite a lot of manual setup work. So as I said in the beginning, the idea was that Meltano would provide the glue around the open source components it would, you know, consist of. So from day one, Meltano kinda started filling in this glue and turned itself into a runner for Singer taps and targets. But once you have a runner that takes care of configuration and entity selection and state management, you still actually want to run this in production. So if you're comfortable deploying Airflow, or deploying, you know, Luigi, or deploying Prefect, or what have you, then you should already be able to use Singer taps and targets with one of these existing runners, or with Meltano as a runner.
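For a sense of what that hand-rolled setup looks like without a runner, here's a minimal sketch following the Singer CLI conventions; the tap and target names, config files, and state handling are illustrative assumptions:

```sh
# Extract from the source and pipe records straight into the loader.
# --config holds credentials, --catalog selects entities, and --state
# tells the tap where the previous run left off.
tap-gitlab --config tap_config.json --catalog catalog.json --state state.json \
  | target-postgres --config target_config.json \
  >> state_output.json

# The target emits STATE messages on stdout as it commits data; keeping
# only the last one gives the next run its starting point.
tail -1 state_output.json > state.json
```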
But this still means that the learning curve and the barrier to entry are pretty high if you compare this to someone who can just go, for example, to stitch.com, sign up, immediately be presented with a dashboard, you know, with all of the logos of the supported data sources, click a connect button, enter some credentials, and then have your pipeline running there, and be confident that you can just kinda forget about it and you'll be notified if something breaks. And for the most part, you can expect that the platform will kind of fix it itself, especially if it turns out that the data source changed in a way that made the extractor incompatible. With a party like Stitch, you can, of course, assume that they will have the resources to fix that. So then even if your pipeline fails, it will stop failing and it'll work again on the next interval. And currently, in the open source Singer ecosystem, tooling around running taps and targets and deploying them and monitoring them, and actually being able to say, hey, I set this up once, and now I'll deploy it, and I won't have to worry about it again, that doesn't really exist. So a big barrier to entry there is that if you actually wanna use an open source, free data pipeline, you have to figure all of this stuff out yourself. And again, Meltano is trying to make that a lot easier by providing this tooling and glue that makes it that simple to set up a Singer-based data pipeline that can have an optional transformation step using a dbt model, and then also orchestrating this on top of a supported orchestrator like Airflow, and deploying it using, for example, a Dockerfile, which I'm actually working on right now. So this is one of the barriers to entry to the Singer ecosystem. Another one is the fact that the Singer taps that have been created vary in quality, in maintenance, and in feature completeness relative to the proprietary data connectors that you might find at a paid, hosted vendor. Part of the reason for this, of course, is the fact that, like I said before, for really understandable reasons, most companies, even if they at some point explore using an open source ELT platform, would usually decide to just go with one of the paid vendors anyway, because you don't necessarily wanna take on that burden of maintaining the data source extractors and loaders that you need all by yourself.
So a lot of these taps, even if they've been written once and they work, are not necessarily used frequently enough, or in production, for them to actually be maintained to the level that you can get started with them today and get great quality out of them. And part of the reason for that is also that if you actually use Singer taps with Stitch, you're, of course, less inclined to build a Singer tap for a data source that Stitch already supports out of the box. And since Stitch's own extractors and loaders are not actually open source, that means that the Singer taps that do exist are more often for kind of niche markets and local markets, while all of the popular tools are served by Stitch, but not necessarily by Singer taps, because there just hasn't been as much of a motivation to build those. Because, again, most people using Singer are probably using it with Stitch, and at that point, why would you build a Singer tap if the Stitch extractor already exists? And the same kind of goes for Singer targets, because if you use Singer taps with Stitch, it's actually still Stitch that is responsible for loading this data into your data warehouse, which means that the Singer targets that exist have all been written by people who do wanna run Singer taps outside of the Stitch ecosystem.
While Stitch itself hasn't been particularly motivated to support those, because, of course, at that point you're kind of competing with the hosted offering that they sell. So they are not inclined to build this tooling around the Singer ecosystem, because in a way, they would be empowering the open source community to not need them as much anymore. So this is a couple of things. Like I mentioned, the quality and the quantity of Singer taps and targets are currently a barrier to entry for new users. The lack of tooling, or the lack of a great deployment strategy, is a barrier to entry. And then there's also the fact that building Singer taps and targets today is not as easy as it could be, or, you know, really as we would want it to be or as it should be. Because while there exists a lot of documentation around the Singer specification, and there exist, of course, a number of Singer taps that are all open source, and you can find their repos on GitHub and review their code, there doesn't exist a kind of cohesive set of best practices, or a boilerplate or almost templated system, for how to get started with a tap that will be, you know, feature-complete and robust and reliable and ready for production. And right now, building your own tap is very much a matter of reviewing 5, 6, 7 different taps, taking the best bits from all of them, and then trying to piece it together yourself. So additionally, there is opportunity in providing better tooling around actually building and maintaining and testing taps and targets, which will, of course, increase people's confidence in their own data pipelines. Because today, as good as Meltano could be as a runner and deployment platform for Singer taps and targets, ultimately, the quality of your Singer-based, you know, ELT pipeline is only going to be as good as the specific tap and target that you're using. So we see big opportunity, too, in building more tooling and documentation, and perhaps, you know, util libraries that go beyond what is currently available, to make it easier for people to set up a Singer tap for a new SaaS API they want to include in their data project.
And these are all things that various parties, various of these data consultancies, have been asking for. They are already using Singer taps, and some of them also Singer targets and pipelines. Some of them have open sourced some of the tooling they're running themselves, but none of them have gone so far as building an entire suite of tools to really make this as easy to use, get started with, and keep using as the proprietary, hosted options out there. And that is exactly the gap that we want to fill, because by empowering the existing Singer community, the Singer ecosystem can really start living up to its potential. And once it does, in combination with the Meltano platform to actually run and build and deploy them, you will end up in a place where companies that are currently not doing anything with their data at all can get started. Because, for whatever reason, they might not be able to afford one of the hosted options, or there might be legal reasons, like, for example, GDPR in Europe, or HIPAA in the US if you're dealing with health information, which preclude them from using one of the hosted tools, or from using anything but, for example, their highest-tier subscription levels, which actually do come with things like HIPAA compliance, or GDPR compliance in case you want this to be hosted in Europe. So we wanna get to a place where, through building Meltano, we empower the existing Singer community to the extent that the Singer ecosystem grows to a place where, in combination with the Meltano tools, even people who are not currently familiar with Singer, and who are not even comfortable writing or maintaining Python taps and targets themselves, will be able to come here, find something really easy to deploy, and get started with a great set of data sources and warehouses supported out of the box, so that they can really get started with Meltano. While it might otherwise have taken them, you know, another 6 months, another funding grant, or another 1 or 2 data hires before they would have been able to get started doing anything with their data. And that's why, very explicitly, it's kind of about empowering people to start doing more with their data, and turning this tooling into a commodity, so that every company can benefit from doing something with their data to the extent that they currently can't.
[00:33:26] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages, and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt.
And I wanna dig a bit more into the specifics of the actual Singer specification and the fact that it uses standard out and standard in as the transfer mechanism. But before we get to that, I wanna dig a bit more into the focus that you have currently, in terms of the target audience that you are working with and trying to cater to in the current incarnation of Meltano, as you ramp up to the point of being more generally applicable, and how that particular focus is informing or constraining the architectural and design decisions that you are making as you build Meltano and its current implementation?
[00:34:42] Unknown:
Yeah. Great question. And, like you said, it's very important to stress that the target audience of Meltano today is different than what it might be 6 months from now, and that's different from what it might be a couple of years from now. But today, we are specifically targeting data engineers who are already familiar and comfortable with, or at least exposed to, Singer taps and targets. We are specifically targeting people who are already in the Singer community and part of the ecosystem, because these are the people that we can get the most relevant feedback from at this point. These are the people who are already running Singer tap and target based pipelines in production, based on their own kind of hand-rolled setup, and these are the people actually building these Singer taps and targets, either for their own usage or, in the case of consultancies, for their clients. So these are the people who can, at this point, give us the most feedback to make Meltano the go-to tool for building, running, and orchestrating ELT pipelines built out of Singer taps and targets, as it says on the homepage. Because these are the people who are already doing it without Meltano, and we wanna, with all of their feedback, make Meltano the tool that they actually wanna use going forward. And that will, of course, empower more people too, because then the Meltano tool will be heavily informed by people who have actually already done this. So if that describes you, dear listener, then definitely check out Meltano. The second kind of people we're interested in at this point are those who have already decided that they are interested in running open source data integration. Data engineers who are comfortable, you know, working with open source projects, and are comfortable working with, let's call it, beta-quality projects in some sense. People who won't necessarily be looking for a massive amount of support, and who are excited to build this with us. They have already decided they want open source data pipelines, but they might not necessarily be familiar with Singer yet. And these are the people who we wanna show that, because of Meltano, Singer is kind of the best possible option right now if they want a pipeline like this. Because these are the people who, at the next stage, will start using, first of all, the existing Singer taps and targets, but would then also be comfortable potentially contributing to them if they find bugs that they wanna fix, or building new ones for data sources that are not currently supported. And the people we're attracting right now are mostly smaller companies who, like I mentioned earlier, are not currently part of the market that is addressed by these paid and proprietary tools, for a myriad of reasons. But we are seeing quite a lot of interest from developing countries, for example, where, of course, local income, and the local prices they can charge for their products, are in a lot of cases far lower than what is common in the US, which also automatically means that US tools are often out of reach for these companies. So they are more interested in open source.
And after that, once these data engineers have really gotten the Singer ecosystem to a place where the quality and quantity of data sources start getting closer and closer to what the proprietary, hosted platforms offer today, then little by little, the target audience of Meltano will start growing in the upward direction, where even people who are currently paying will start wondering, why am I actually paying for this if this Meltano thing seems pretty robust, and it seems to be able to do all the same stuff I'm currently paying for? And at that point, of course, there will still be companies who are looking for 24/7 support, who are looking for all the various things you get with, you know, a proprietary vendor. But I can also see a future in which GitLab or other parties will start offering a hosted version of Meltano, which will then, you know, hopefully still be cheaper and more extensible than the other platforms out there today, because we do kind of build on this community-supported ecosystem of taps and targets.
And just to take this question one step further, the ultimate future I see is one in which writing taps and targets, or extractors and loaders, will no longer be a responsibility specifically of data engineering teams or data engineering tools. I can see a future in which Meltano data pipelines, and specifically Singer taps, will be so much of an open source standard that SaaS providers, especially newcomers onto the market, will themselves author their own official Singer tap. And in the same way, you'll see data warehouses shipping their own Singer targets, because it allows them to immediately plug into the data projects and data pipelines of all of these users of Meltano and Singer. While right now, a newcomer like this onto the market would need to wait for one of the big hosted parties to either decide to allocate resources themselves to building that integration, which will take a while, chances are, or it might be that, you know, a customer will need to pay for them to be able to include that new SaaS API or SaaS tool in their data integration story. What this means is that today, if you are entering a market in which a lot of SaaS tools already exist that are widely supported by data integration platforms, you are at a significant disadvantage as a newcomer, because many of your prospective customers, who you would want to switch away from what they're currently using, will not do so if what they're currently using is supported by a data integration platform, but what you want them to switch to, your new tool or your new data warehouse, is not yet. So in the future, I don't think we'll be depending as much on individual community members or data engineers to build these integrations.
I think we'll see it being almost a given that any company that wants to be part of people's data pipelines will themselves have an official tap or an official target, which ultimately helps everyone, including the end user, because everyone using Meltano, or any other Singer-compatible data integration platform, will be able to connect with all of these data sources and data warehouses from day one, without ever having to worry about the quality of the individual tap or target, or what to do if a bug occurs, because you can expect that this will be part of the expected offering of the actual SaaS provider you're using. So that's kind of where we wanna go in terms of target audience, but today we're targeting specifically people already part of the Singer community, because today, Meltano is primarily a Singer tap and target
[00:40:59] Unknown:
running, deployment, and orchestration platform. And digging into the specifics of how Meltano is implemented, how much of the architecture have you been able to retain from the original direction, where it was trying to be this all-encompassing tool that included the entirety of the lifecycle? Yeah. So since the beginning, we knew that while we wanted Meltano to be a convention-over-configuration
[00:41:27] Unknown:
tool, where, you know, most people would just be able to get started without having to tweak too much, we did recognize that not everyone would want to use every part of Meltano. Like, the 7 letters in the word Meltano actually stand for model, extract, load, transform, analyze, notebook, and orchestrate, because that was kind of the wider end-to-end vision we had in mind at the time. But we knew that we were not gonna be able to convince everyone to go and use all of Meltano at once. So architecturally, Meltano starts with the concept of plugins, and extractors and loaders and transformations, but also transformers like dbt and orchestrators like Airflow, are all plugins to Meltano.
And ultimately, your Meltano project, which is a single source of truth for your data project and your data pipelines, has a meltano.yml file, which collects the various plugins that you've plugged in, which are kind of just dependencies that point at either a specific pip package or, you know, a Git repo URL that contains a Python package. So because of this plugin-based approach, the pivot, in terms of, okay, we're gonna focus only on ELT for the time being, and not on these other steps that have to do with analysis and notebooking, etcetera, only really meant that we wouldn't stress those other plugins anymore. Because if you're using Meltano with only extract, load, and transform plugins, even in the previous iteration, it would basically already be the exact ELT tool that we have today. So that plugin-based approach means that Meltano is very much pick and choose, and you don't need to use all of it. You can use it as a simple Singer tap and target runner. You can use it as a pipeline runner if you also wanna add dbt transformations to that. And you can use it as a system to kind of abstract away the orchestration layer, if you're comfortable only using pipelines consisting of E, L, and T steps that just need to be run on a schedule. So actually, the original architecture, of making this very much pick-and-choose and plugin-based, allowed us to pivot relatively easily to focus on only a specific part of that whole story. And someone using Meltano today, if they don't dig deep into the documentation, will never know that it can actually do a couple of other things that we are now, for the moment, explicitly not stressing. But we have also not removed these things from Meltano either, because if we do find a user who is, you know, motivated and inspired enough, like, hey, it would be cool if it also did basic point-and-click analytics,
we want this developer, this contributor, to be able to start contributing in that direction, and making that part of it more powerful, because we do see Meltano very much evolving in the direction where the community takes us. It doesn't necessarily need to be exactly what GitLab had in mind from the beginning, and it's very likely that we will be, you know, spending months or years just focusing on ELT. But just like we saw with GitLab, there is power in allowing people to go beyond the standard functionality it offers today and add some extra features that they would want. It's very much up to the community to see where it goes. Fortunately, the plugin-based architecture allows for that really easily. And just as an example of the power of that: right now, Singer taps and targets to Meltano are just extractor and loader plugins that happen to use the Singer runner. So hypothetically, theoretically, if another extractor/loader framework comes up that people start asking us for, or if an alternative to dbt becomes popular, it is doable to add a new transformer plugin type, or a new extractor/loader plugin type, to Meltano, which will allow us to move in that direction.
Because, again, we want to be the glue between these different tools, more so than lock people into a specific set of tools. And the idea is very much that the Meltano project is your data project, where your data engineers, analytics engineers, analysts, etcetera, work from. We want to be able to evolve with data teams as they decide to move to different tools over time. What we've seen recently is that we started out supporting specifically the Airflow orchestrator, which means that if you are using Meltano and you want to start orchestrating, or in this case running pipelines on a schedule, it's really easy to add Airflow as kind of the backend orchestrator implementation. But because this is also plugin-based, it's relatively straightforward to add support for another orchestrator like Prefect or Luigi. So then, again, it's up to individual data teams what they prefer, what they already have experience with, or what they want to plug into that they already have deployed. Meltano makes it really easy to specify the different tools that your project consists of and how those are tied together, more so than locking you into any specific combination of tools. And that architectural pattern is very much what has allowed us to pivot as easily as we did. It's pretty crucial to the future that we see, with Meltano basically outliving the specific open source tools that are in vogue today and that people might come towards. So I think it's less likely that we'll ever move away from Singer taps and targets, because obviously we are also investing in having that ecosystem grow and empowering the community.
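To make that plugin model concrete, the following is a minimal Python sketch of how a plugin-based ELT tool can treat extractors, loaders, transformers, and orchestrators uniformly. This is an illustration only, not Meltano's actual internals; the class names and plugin entries are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Plugin:
    """A declared dependency on an external tool."""
    name: str                      # e.g. "tap-gitlab" or "dbt"
    pip_url: str                   # a pip package name or a Git repo URL
    config: Dict[str, str] = field(default_factory=dict)

@dataclass
class Project:
    """In-memory view of a project file that collects plugins by type."""
    plugins: Dict[str, List[Plugin]] = field(default_factory=dict)

    def add(self, plugin_type: str, plugin: Plugin) -> None:
        # Plugin types: "extractors", "loaders", "transformers",
        # "orchestrators", ... New types can be added without touching
        # the rest of the tool.
        self.plugins.setdefault(plugin_type, []).append(plugin)

# Pick and choose: a project using only extract and load plugins is a
# complete, valid project; transform and orchestrate are optional extras.
project = Project()
project.add("extractors", Plugin("tap-gitlab", "tap-gitlab"))
project.add("loaders", Plugin("target-postgres", "target-postgres"))
project.add("orchestrators", Plugin("airflow", "apache-airflow"))
```

Because each plugin is just a declared dependency plus configuration, swapping Airflow for Prefect means declaring a different entry under orchestrators rather than rearchitecting the tool, which is exactly the property that made the ELT-only pivot cheap.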
But on the front of orchestration, you're already seeing that while Airflow is not necessarily losing popularity, projects like Prefect are being considered by a lot of new teams over Airflow. Of course, this will also evolve with the data space, and hopefully Meltano will be able to evolve with the data space as well. Yeah. I definitely appreciate the pluggable
[00:46:52] Unknown:
aspect of Meltano and being able to replace the orchestrator, as you said, with something like Prefect or Dagster, and the fact that the Singer taps and targets are able to be built and iterated on in isolation without having to worry about how they hook into the overall ecosystem or the specifics of Meltano. Digging more into the Singer specification, I'm wondering what you have found to be some of the challenges that you and your users are facing when going from that easy on-ramp of: I can run this locally on my machine, I can get data out of this service, and then I can pipe it into this other service by just using the pipe operator in Bash, to the complexities of scaling those, deploying them into production, and monitoring their execution and their overall health, and some of the ways that you are looking to address that within Meltano?
[00:47:44] Unknown:
Yeah, great. So the Singer specification itself we have so far not really considered changing. We think that, as it currently stands, it serves the needs of data engineers and what you would expect from a specification that enables a data pipeline like this. And of course, from our side, there is value in explicitly starting out not wanting to change the specification, because right now we are not at all in a position where we can do something as decisive as that, and our power is very much in trying to become the go-to runner for Singer-based pipelines.
What we have found is that on the side of specific taps and targets, there is a lot that teams can do to actually improve them to be more ready for scale. A group of people who have done a really great amount of work there is TransferWise, a UK-based company, who recently published PipelineWise, which is their own runner for Singer taps and targets and comes with their own forks of a number of taps and targets as well. They are spending a lot of time making these targets for Snowflake, Postgres, and BigQuery really great, feature-complete, and ready for production. And there's still a lot of opportunity there. Like I mentioned, what we want to do with Meltano is at some point also empower people to build taps and targets that are just as robust as the ones that TransferWise and some other data teams out there are currently building. On the front of the Singer specification, so far I think it provides enough for us to be able to build this platform on top of it. Of course, Singer taps are already being run in production by Stitch, and they were of course built to serve the needs that Stitch has for plugging them into its existing infrastructure, and to compete with the extractors that Stitch natively supports, which, for some definition of native, I don't think are actually written using the Singer specification. So the Singer specification is fine. But I think where people really run into trouble when they try to deploy these is that there are a number of moving parts. You have the configuration for Singer taps and targets that needs to be provided in a config.json file, which can be passed as a flag to the actual executable. This configuration file will contain a mix of standard configuration values, but it will also contain the credentials used to connect to either the SaaS service or the data warehouse. So if you want to deploy a Singer tap or target into production, you've got to figure out: okay, how am I going to manage these sensitive secrets separately from the settings that I'm fine having checked into a Git repo somewhere?
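As a rough illustration of that split (the tap name, setting names, and environment variable below are hypothetical), a deployment typically keeps the harmless settings in version control and overlays the credentials from environment variables at runtime, writing the merged config.json that the tap expects:

```python
import json
import os

# Non-sensitive settings, safe to check into the Git repo.
base_config = {
    "start_date": "2020-01-01T00:00:00Z",
    "user_agent": "my-pipeline",
}

# Credentials come from the environment at deploy time, never from Git.
secrets = {
    "api_token": os.environ["TAP_EXAMPLE_API_TOKEN"],  # hypothetical variable
}

# Merge and write the single config file the tap executable expects,
# e.g. `tap-example --config /tmp/config.json`.
with open("/tmp/config.json", "w") as f:
    json.dump({**base_config, **secrets}, f)
```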
That separation is not currently addressed by the Singer specification or the tooling provided around it. Similarly, a Singer tap, when it runs, updates an internal state dictionary to record how far it has progressed with syncing data from the data source. At the end of the data pipeline, that state needs to be saved and then passed to the tap on the next invocation, so that it starts off where it left off. But if you want to run Singer taps in production and all you have are a tap and a target as two little pipeable executables, you yourself have to set up the infrastructure around them to manage this state.

And then there's also entity selection. If you have a data source that supports a lot of different entities and properties, a lot of different tables and columns, the way that you tell a Singer tap to sync only a subset of those is by providing a catalog file. A catalog file describes the entire schema, and in it you can select specific entities and properties. But generating this catalog file today means that you either have to write it completely manually, based on the specific entities and properties you know you're looking for, or you can invoke discovery mode, which is implemented by a lot of taps and literally just means running the tap as you normally would with --discover as a flag. That results in the tap generating a catalog.json file which by default selects every attribute, or some subset of attributes. Then actually informing the tap, when you truly run it in sync mode, to extract only a subset of these entities and properties means modifying that JSON file generated by discovery mode to add the "selected": true property to the specific properties you want to extract. That is literally a manual process right now, which might involve modifying a massive JSON file, which is of course very error-prone. So Meltano also helps out with that, by adding some commands that make it easier to specify, in a declarative, rule-based way, which entities and properties you're actually looking for, so that this catalog file can be automatically generated and passed to the tap.

For configuration, we do something similar, in that Meltano has different configuration layers. Environment variables, first of all, following the twelve-factor app principle of environment variables being the go-to way of exposing sensitive or environment-specific configuration to your application. And then there's another configuration layer, which is the actual config object inside your project's meltano.yml file, which you can use for non-sensitive configuration. And there are 1 or 2 more options there, based on your preferences and your setup. And on the front of managing the state of these taps and targets, which is really the state of the pipeline, Meltano manages that for you: it comes with the concept of scheduled pipelines, and it knows that for each of these scheduled pipelines it needs to store the state and reuse it on the next invocation.
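Pulling those moving parts together, here is a bare-bones sketch of what a runner ends up doing by hand without a tool like Meltano: run discovery, mark `"selected": true` in the catalog, pipe the tap into the target, and persist the last emitted state for the next invocation. The tap, target, and stream names are hypothetical, and a real runner would be far more careful about buffering and error handling.

```python
import json
import os
import subprocess

# 1. Discovery mode: ask the tap to describe every stream it supports.
raw = subprocess.run(
    ["tap-example", "--config", "config.json", "--discover"],
    capture_output=True, text=True, check=True,
).stdout
catalog = json.loads(raw)

# 2. Entity selection: set "selected": true in each stream's top-level
#    metadata entry (the one with an empty breadcrumb).
for stream in catalog["streams"]:
    wanted = stream["tap_stream_id"] == "orders"  # hypothetical stream
    metadata = stream.setdefault("metadata", [])
    entry = next((m for m in metadata if m.get("breadcrumb") == []), None)
    if entry is None:
        entry = {"breadcrumb": [], "metadata": {}}
        metadata.append(entry)
    entry["metadata"]["selected"] = wanted

with open("catalog.json", "w") as f:
    json.dump(catalog, f)

# 3. Sync mode: pipe the tap's stdout into the target's stdin, resuming
#    from the state saved by a previous run, if any.
state_args = ["--state", "state.json"] if os.path.exists("state.json") else []
tap = subprocess.Popen(
    ["tap-example", "--config", "config.json", "--catalog", "catalog.json"]
    + state_args,
    stdout=subprocess.PIPE,
)
target = subprocess.run(
    ["target-example", "--config", "target_config.json"],
    stdin=tap.stdout, capture_output=True, text=True, check=True,
)
tap.stdout.close()
tap.wait()

# 4. The target emits state to stdout; keep the last line so the next
#    invocation starts where this one left off.
state_lines = [line for line in target.stdout.splitlines() if line.strip()]
if state_lines:
    with open("state.json", "w") as f:
        f.write(state_lines[-1])
```

This is essentially the glue that Meltano's scheduled pipelines, configuration layers, and declarative selection commands are meant to replace.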
This is functionality that is relatively basic for a runner of Singer taps and targets, and a number of these runners exist, like I mentioned before, because Singer taps and targets are used by various data consultants and teams in production, at their clients or in their own team's data stacks. But then you've still got to go 1 step further, which means you actually want to deploy these pipelines into your own cloud, or onto Kubernetes, or maybe you want a Helm chart to make it really easy to deploy. And we really want to get to a place where someone doesn't need to know that running Singer taps and targets means dealing with state and configuration and entity selection. We want to make these simply configurable plugins in the Meltano platform, where Meltano manages everything else for you, both managing these various aspects of each Singer tap, target, and pipeline and actually making it really easy to deploy these into production.
So in terms of the runner tooling that manages, like I said, state, config, and entity selection, a number of tools already exist, but none of them have gone so far as trying to abstract away these aspects of Singer taps and targets. We do abstract that away, by allowing users to interact with extractor and loader plugins just like they would with other Meltano plugins like the Airflow orchestrator or the dbt transformer, which means that as a Meltano user, the way you configure any of these different plugins is identical, whether that's through environment variables, through the meltano config CLI, or through the config object in your meltano.yml file. So we intentionally abstract away the Singer tap and Singer specification specifics, because ultimately we think it confuses data teams that just want to get stuff done more than it actually helps them to expose those underlying bits of what makes a Singer tap or target a Singer tap or target. And in your experience
[00:55:23] Unknown:
of taking over this team and working with Meltano
[00:55:27] Unknown:
and helping to understand the direction to take and actually building out the platform, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned in the process? I mean, the most unexpected lesson for me really came when I found myself in the position, back in March, where I needed to figure out where to take Meltano from here. Like, what is the best bet for the next couple of months, or the next 6 months, or whatever. Because, obviously, from GitLab's perspective, Meltano is an R&D project, and it's been invested in for 2 years. But we do expect, of course, some results at the end, at least in user uptake and an actual increase in contributions. And unfortunately, we hadn't seen much of that over the last 2 years. While there had been some initial interest, and some people had been giving us feedback over the last 2 years, very few people had actually converted into users and contributors. So I ended up in a position where I needed to figure out how to change that, where to go from here. And at that point, like I mentioned earlier, back in March, I very much wasn't read up on the state of the data space and the state of ELT tooling, and how Meltano fits into the needs of the data engineering community, specifically the open-source-minded data engineering community. So 1 thing I was just really fortunate to find, through talking to some of these Singer community members, is that by building this Singer runner for ourselves, we had almost unknowingly built something they were really looking for and waiting for. A lot of people expressed to us that while they had been able to get what they needed out of the ecosystem up to that point, they did feel that the potential of the community and the ecosystem was far larger than what was being realized today, and that that had a lot to do with lacking tooling and documentation.
So I was just really fortunate to find that, because we as the Meltano team decided to build something that made it easier to use Singer taps and targets, it turned out over time that we had that whole time been building the exact thing that the Singer ecosystem had already concluded it needed in order to go further and grow. I was happy to find that we hadn't gone completely in the wrong direction, building an end-to-end platform that no one wanted, or betting on some open source technology that was actually falling out of favor and out of popularity. And I found, within a couple of weeks of starting to think about where to go from here and talking to these Singer ecosystem data engineers, that Meltano actually resonated a lot with these people when explained in the terms of: let's build a true open source alternative for the data integration problem. More so than when it was described as: let's build an end-to-end platform for the data life cycle and for data teams. We still believe that there is value in this end-to-end story in the future, and I would love to see the community take it there and develop it in that direction if that is where we decide we can bring value. But I was just very happy to see that what we've done so far, even if it didn't pay off immediately, is paying off massively now, because we built something that has really hit a nerve over the last month. And the response that I've gotten from Singer community members and data consultancies that are using Singer taps and targets, or evaluating them for the data stacks that they offer to their end users: all of the ones that have spoken to us, which is a good amount of the ones in the community, have been really excited to not just use it and try it out and give us feedback, but to actually build it and make it happen with us. I would never have expected, for example, that within 2 weeks of our announcing the new direction and the new focus for Meltano, a company called Applied Labs would already reach out and say that they are planning to replace PipelineWise, which like I mentioned is another Singer pipeline runner, with Meltano in Applied Data, the integrated data platform that they offer to their clients. Because they see the value that will come from not just focusing on building a tool that can run Singer taps and targets, but also a tool that provides the UI around it, so that these pipelines can be treated just like a pipeline in a tool like Stitch, where you just have a UI, you click your connector, you configure it, you hit the start button, and then you check in on a schedule to see if the monitoring, the graphs, and everything still looks good. We want to develop Meltano in that same direction, because today it is most appropriate for data engineers who are highly technical, and we want to get it to the place where everyone can start using Meltano really easily. And they see that value, and they have actually committed to putting 1 and a half engineers on the Meltano project for the next 2 months, specifically to focus on building out this data pipeline management user interface, which we already have in a basic form in Meltano right now if you run the meltano ui command.
But, explicitly, I've been focusing on the CLI and deployment story over the last month or so. It's been really great, in confirming what we're trying to do here, to see that community members are already starting to contribute, not just a couple of hours a week when they feel like it, but as people who actually believe in this vision just as much as we do, want to make it happen, and are putting their money where their mouth is by working with us on making that data pipeline UI a reality. And that will definitely be part of making a Meltano that can truly compete with the hosted options out there a reality. Because we know that not all of the users we want to target, especially the less technical ones at smaller startups, are necessarily going to be comfortable running CLIs locally, or managing and deploying their own project using a Dockerfile. So I've been really heartened to see that we're not alone, and we seem to really have hit a nerve. I could never have predicted that a month and a half ago, when I was faced with where to go from here and all of the options seemed equally good and bad. But I'm glad that, with the help of some of these data engineers I've talked to, we've been able to arrive at something that we are all really excited about, and that we're going to try and realize together. Because our intention is for Meltano to not be another tool built by GitLab that's going to try to get some users out of the GitLab community. We really want to build a tool for the data engineering community, with the data engineering community, and that's playing out exactly as I hoped it would a month ago. Are there any other aspects of your work on Meltano
[01:01:42] Unknown:
or the overall space of data integration or some of the challenges
[01:01:46] Unknown:
in an end to end tool for managing the data life cycle that we didn't discuss that you'd like to cover before we close out the show? No, I think we've covered pretty much all of it, and I very explicitly don't want to talk too much about the end-to-end vision today, because even though it's still in the back of my mind as an eventual future into which I could see Meltano developing, it will really be up to the community that we attract, and I want to build something great with that community. And so far, it seems we're doing that. So, thank you so much for giving me this opportunity to talk about the project and reach a broader audience. I hope that people in the audience who hear some things that might be relevant to them will check out Meltano and give us some feedback. And even if today it might be quite far from something you would actually consider deploying into production, if you give us that feedback, know that I will continue to work 40 hours a week to make it a reality. And like I mentioned, people are starting to step up who are going to be investing significant time as well. So even if today Meltano is not quite what you were expecting it to be, check it out a month or 6 months from now, see where we've gone, and help get us there with your feedback. Let's build this together. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing or contribute to the project, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I mean, my answer can only really be 1 thing, which is that the gap is in the lack of a true open source solution even existing in this space. I think you can never call a market or a space saturated until there is an open source equivalent that can actually rival the paid offerings out there, especially in a space where most of the end users, or a good amount of the end users, are themselves programmers perfectly capable of coming together and building something like this if we combine our forces. So in the data management space and in the data integration space, I think there's a massive opportunity to disrupt it from the bottom with open source technology, and we can do it together. Otherwise, great data integration tools exist. I'm not claiming that there is nothing out there today if you want to integrate your data. But as long as there's not something open source, as long as there's not something free and truly open and accessible to everyone out there, then to a significant
[01:04:00] Unknown:
part of the market, it is as if there were no tool at all. And that is what I think is the biggest gap in the space today: the lack of open source solutions. Well, thank you very much for taking the time today to join me and discuss the work that you're doing on Meltano. It's definitely a very interesting project, and 1 that I intend to keep a close eye on and possibly employ for my own data platform uses. So thank you for all the time and effort you've put into that, and the rest of your team as well, and I hope you enjoy the rest of your day. Thank you. You too, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Project Overview
Guest Introduction: Douwe Maan
Transition to Data Management
Software Engineering Background and DevOps Mindset
Meltano: Origin and Evolution
Challenges in Data Integration
Singer Specification and Ecosystem
Target Audience and Community Engagement
Meltano's Pluggable Architecture
Challenges in Scaling and Deploying Singer Taps
Lessons Learned and Community Feedback
Future Vision and Closing Thoughts