Summary
Data integration in the form of extract and load is the critical first step of every data project. A large number of commercial and open source projects offer that capability, but it is still far from a solved problem. One of the most promising community efforts is the Singer ecosystem, but it has been plagued by inconsistent quality and design of plugins. In this episode the members of the Meltano project share the work they are doing to improve the discovery, quality, and capabilities of Singer taps and targets. They explain their work on the Meltano Hub and the Singer SDK and their long term goals for the Singer community.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Douwe Maan, Taylor Murphy, and AJ Steers about their work to level up the Singer ecosystem through projects like Meltano Hub and the Singer SDK
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what the Singer ecosystem is?
- What are the current weak points/challenges in the ecosystem?
- What is the current role of the Meltano project/community within the ecosystem?
- What are the projects and activities related to Singer that you are focused on?
- What are the main goals of the Meltano Hub?
- What criteria are you using to determine which projects to include in the hub?
- Why is the number of targets so small?
- What additional functionality do you have planned for the hub?
- What functionality does the SDK provide?
- How does the presence of the SDK make it easier to write taps/targets?
- What do you believe the long-term impacts of the SDK on the overall availability and quality of plugins will be?
- Now that you have spun out your own business and raised funding, how does that influence the priorities and focus of your work?
- How do you hope to productize what you have built at Meltano?
- What are the most interesting, innovative, or unexpected ways that you have seen Meltano and Singer plugins used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with the Singer community and the Meltano project?
- When is Singer/Meltano the wrong choice?
- What do you have planned for the future of Meltano, Meltano Hub, and the Singer SDK?
Contact Info
- Douwe
- Taylor
- @tayloramurphy on Twitter
- Blog
- AJ
- @aaronsteers on Twitter
- aaronsteers on GitLab
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Singer
- Meltano
- Meltano Hub
- Singer SDK
- Concert Genetics
- GitLab
- Snowflake
- dbt
- Microsoft SQL Server
- Airflow
- Dagster
- Prefect
- AWS Athena
- Reverse ETL
- REST (REpresentational State Transfer)
- GraphQL
- Meltano Interpretation of Singer Specification
- Vision for the Future of Meltano blog post
- Coalesce Conference
- Running Your Data Team Like A Product Team
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
[00:01:55] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Douwe Maan, Taylor Murphy, and AJ Steers about their work to level up the Singer ecosystem through projects like Meltano Hub and the Singer SDK. So Douwe, can you start by introducing yourself? Thanks, Tobias, for having us again. My name is Douwe Maan, like you mentioned, and I am the CEO of Meltano, and I've been the general manager of Meltano while it was a project inside GitLab for the past year. My history is in developer tools and open source developer tools, especially. I've been involved with GitLab since it started out as a little open source project a little bit over 6 years ago. And Taylor, how about yourself? My name is Taylor Murphy.
[00:02:30] Unknown:
I'm the head of product and data with Meltano. Joined the team in March of this year, but I've been involved with Meltano since before it was even called Meltano, when it was a small project within GitLab called the BizOps project.
[00:02:43] Unknown:
I served kind of as the primary customer of the project then and helped guide some of the early development, but now actually get to lead kind of where the project is going. And, AJ, how about you? Hi. I'm AJ Steers, also Aaron Steers for people who knew me before. My background is in data warehousing, but I also have a software background as well. So as I was learning data warehousing and solving data warehousing problems, I was constantly reminded of the fact that, hey, software developers have it a little easier than we do, or better tools. So, really, my journey has been through traditional data warehousing, then Amazon and AWS and Amazon Digital Video, then startups and consulting, really trying to bring the best practices of software over to data engineering as well.
[00:03:25] Unknown:
And going back to you, Douwe, do you remember how you first got involved in data management?
[00:03:29] Unknown:
Well, my first involvement in data management has been these past 18 or so months that I've been working on Meltano. Before that, you know, I was a software developer and an engineering manager working on GitLab, a DevOps platform. And like Taylor mentioned, Meltano has been an internal project in GitLab to try to apply some of those same software development best practices to the data life cycle problem. So joining the Meltano team is all the data background I have, but I've been catching up very quickly over the last 18 months,
[00:03:58] Unknown:
partially with the help of people like Taylor and AJ. And Taylor, how did you first get involved in data? My background is actually in chemical engineering, and I got a PhD in the same field, but I spent a lot of time during my PhD doing some data modeling work. It was half gathering data in the lab and then analyzing it, you know, using basic tools like Excel. We had some custom code in MATLAB. Moving out of grad school, I joined a small startup in Nashville called Concert Genetics, and I was kind of their 1st data scientist hire. That was my title. But the work that I was doing on a day to day basis was very much data cleaning, basic analysis stuff, eventually moving into more complex analysis, all around pulling information around genetic tests, curating that data, standardizing it, and really building a bunch of custom data pipelines with the work of the engineers on the team. I was there for about 4 and a half years and kind of grew the data team out around me. So hired some true data scientists, hired some what we would call data engineers, and even kind of proto analytics engineers, because we were doing a lot of work on Redshift.
In 2018, I moved to GitLab specifically to continue to improve my data engineering skills and pretty quickly found myself in a situation with the opportunity to lead the data team for the company. So I stepped up into that role, hired some fantastic data analysts and data engineers, and really built the foundation of the data stack that's still in use today on things like Snowflake and dbt. I'm a huge dbt fanboy, active in the community, and we leverage it for, I'd say, 90 plus percent of the data work that we're doing within GitLab. So, eventually, I moved back to a data engineering role, as a staff data engineer, because I wanted to do a little bit more individual contributor stuff. Yeah. Moved on to the Meltano team when that opportunity came in early March of this year. So that's kinda my long background, and I'm still learning every day and excited to see what else is out there. And AJ, how did you get involved in data? As I mentioned before, I started off in data and traditional data warehousing, coming up through products like Microsoft SQL Server and all of those.
[00:05:59] Unknown:
And very early on started trying to bring software automation to that pipeline.
[00:06:05] Unknown:
We've had an episode before, Douwe and I, about the Meltano project. So we've dug deep into sort of what the goals there are, and a big portion of the foundation of that is the Singer ecosystem. And I know that there's a lot of work going on there to try and help improve the overall quality and level up the ecosystem as a whole. So I'm wondering if you can just start by describing a bit about what the Singer ecosystem even is and maybe some of the history of, you know, where it's been, where it was when you kind of came in with Meltano and started trying to level it up, and some of the work that's necessary to bring it up to par, where you feel that it will be in a good position to move forward?
[00:06:44] Unknown:
So I'll start by explaining a little bit more about Singer, and then we can delve into the state of the ecosystem today. So Singer is an open source standard for data connectors. It describes how you can build 2 different executables, 2 different pieces of software: 1 to connect with data sources, a second to connect with data destinations, that can then be combined to sync data from any source to any destination. And really what Singer is, is a description of the protocol that these 2 tools can use to communicate. What that means is that you can build, with whatever programming language you like, an executable that has the desired behavior for the source and another for the destination, and have that power your EL pipelines.
So the ecosystem around Singer then becomes all of the different connectors that have been written by different people, different organizations, at various times, that speak the Singer protocol. They can all be combined with each other, at least in theory, to load data from any source to any destination. And the ecosystem also includes all of the different tools that have been built that use these Singer connectors, these taps and targets as they are called, and that, without end users necessarily knowing, use Singer taps and targets to power data integration functionality in these products.
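To make the protocol description above concrete, here is a minimal sketch of the newline-delimited JSON message stream a tap writes to stdout for a target to consume. The `users` stream, its fields, and the bookmark key are hypothetical examples, not from any real tap:

```python
import json

# A tap communicates with a target by writing one JSON message per line
# to stdout. The three core message types are SCHEMA, RECORD, and STATE.
messages = [
    # SCHEMA declares the shape of records in a stream before any arrive.
    {"type": "SCHEMA", "stream": "users",
     "schema": {"properties": {"id": {"type": "integer"},
                               "name": {"type": "string"}}},
     "key_properties": ["id"]},
    # RECORD carries one row of actual data.
    {"type": "RECORD", "stream": "users",
     "record": {"id": 1, "name": "Ada"}},
    # STATE lets the target checkpoint progress for incremental replication.
    {"type": "STATE", "value": {"bookmarks": {"users": {"last_id": 1}}}},
]

def emit(msgs):
    """Serialize messages the way a tap would write them to stdout."""
    return "\n".join(json.dumps(m) for m in msgs)

output = emit(messages)
```

Because both sides only agree on this line-oriented contract, any tap can be piped into any target (`tap-foo --config ... | target-bar --config ...`) regardless of what language either is written in.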
[00:08:10] Unknown:
And so given that the ecosystem is this kind of constellation of loosely connected projects where there are different plugins for data sources and destinations, and the idea is that they'll all be able to communicate with each other to wire together a pipeline with, you know, little to no effort on the part of the person who's trying to integrate the pieces. But given how distributed these efforts are, I'm wondering if you can just talk through the current weak points and the challenges that exist within the ecosystem.
[00:08:38] Unknown:
I think the story certainly has changed since the last time you were able to talk with Douwe. We would say, you know, 6 months ago, what are the current challenges of the Singer ecosystem? And back then, we identified really that getting these taps and targets into a production state was 1 of the big challenges. You can read the READMEs on these projects, and they'll tell you, you know, here's the command to run it locally. Everybody always had the question of how do I get this into production? How do I, you know, drive my analytical decisions based on this data that I'm moving into a warehouse? So that was 1 big challenge. We also found that it was hard to build these taps and targets just generally. It wasn't clear the best methods to use, whether you had to do it in Python or not, understanding the spec, it kind of took you down this big rabbit hole.
Then because of that, we saw that a lot of the taps and targets had a very inconsistent level of quality and also behavior. Some taps would implement all parts of the spec, some, you know, would kind of do slightly different things, and there's really no consistency between and across these different projects. And then kind of the last piece we identified was that it was hard to discover the large world of Singer taps and targets that are out there. And so we've done a lot to address that, and we'll talk about that in this podcast. We think we've addressed a lot of these issues, and we have an answer to each of these challenges. But specifically today, I think the challenge for the Singer ecosystem, and for us as Meltano to really help level up the Singer ecosystem, is an awareness piece. So we really want to build a true open source ecosystem. That means having a large library of taps and targets that are maintained by the wider community, similar to how, you know, client API libraries have a vibrant open source community around them. We aim to build the tools and be the community leader to support the broader ecosystem.
I think the second challenge that we're facing a little bit is the reputation of Singer. Singer came out with a lot of promise. And because of these original challenges that I mentioned, it started to tarnish the name a little bit in certain people's minds. And so now I think we're pushing back against that and saying, like, no, Singer is actually really great. We are really embracing the protocol and the larger ecosystem. And I think we have kind of the track record over the past year certainly to show that that's the case, and there's a lot more that we wanna do. And now that we're a separate company, I think we're gonna be able to really invest the resources into leveling up this community into what it was originally promised to be. And to your point too about the fact that there was all of this promise and excitement in the early days of the protocol being announced, and it was originally sponsored by the Stitch company, which has since been acquired.
[00:11:11] Unknown:
So sort of the ecosystem had this champion to begin with, but then as they got acquired and they focused their efforts elsewhere, the community was kind of left without any real direction or leadership. And so it started to stagnate a bit, is my understanding. And so the amount of effort that was put into these different taps and targets started to wane. And so the overall level of quality that existed across those targets was inconsistent because, as you said, there wasn't any real direction or best practices for how to implement these projects. And so now with the work that you're doing with Meltano, I'm wondering if you can just talk through what you see as kind of the current role of the Meltano project and the Meltano team within the broader ecosystem of the Singer project and the Singer specification, and some of the broader goals that you have for Meltano as it continues to try to carry this ecosystem forward?
[00:12:03] Unknown:
As Taylor has been discussing, a number of these issues in the Singer ecosystem have had to do with the lack of attention it had been receiving from the original kind of founder or creator of the protocol, Stitch. And Meltano actually became a member of the Singer ecosystem and community when, in building this end to end data product, which Meltano was at the time, we needed to select a technology for the open source data integration bit. And we came across Singer in around 2018, and we were very excited along with the rest of the community to have this new standard coming up and this library of connectors quickly expanding. And a year ago, when we were trying to figure out how to kind of reimagine Meltano, we recognized that the part that we had built that was getting the most traction was actually the tooling around Singer, because more and more people were looking for a way to run Singer based pipelines without the Stitch platform.
And they were looking for people and organizations in the community who were going to provide better tooling and better support and better training material around Singer. So we realized that the biggest opportunity for Meltano a year or so ago was to really embrace the Singer community and the standard and start to try addressing these concerns that people had rightfully been having about the technology and its future. So the first problem is addressed by Meltano itself. It allows you to build, run, and deploy these Singer based pipelines with very easy configuration and deployments. The other problem that Taylor mentioned has to do with the difficulty of building new connectors and the inconsistent quality between different connectors written at different times by different teams.
All kind of using this relatively vague specification of the Singer protocol that really leaves a lot of things up in the air, and that leads to, you know, different behaviors being implemented by different people. This we're addressing with the SDK for building taps and the SDK for building targets that we have been developing over the last couple of months in collaboration with the community, which provide a much higher baseline for building connectors on top of this spec, without end users, people implementing the new connectors, needing to know every single detail of the Singer specification.
AJ can talk a little bit more about what the SDK does, since he is the lead engineer on that. The third thing has to do with discoverability of these connectors. On the 1 hand, it's difficult today to find out whether a connector already exists for a certain source or destination. But even if you've managed to find a git repo for a connector, there are many cases in which that canonical repo is not actually the most actively maintained fork. This also has to do with this lack of interest in Singer that Stitch has had over the last year and a half or so, where the canonical repositories for various connectors maintained by them have a long list of unreviewed pull requests. And you need to manually go into the network view on GitHub and find a much more maintained fork in order to find 1 that you would actually wanna use and contribute to. So we are building the Meltano Hub to increase the discoverability of connectors in general, and then also surface health related metrics, like how frequently is this updated? How many unreviewed, you know, pull requests and issues are there on the repository? And we are in general building out Meltano with other tools like dbt and Airflow, Dagster and Prefect to provide a full open source operating system, if you will, for open source data stacks that people can build on top of, with Singer forming the data integration part there. So the role in the Singer community that we have today is, I would say, the primary company really embracing it, building new tooling, getting the community to rally around these projects. We have seen a lot of interest from various companies that over the last few years have started using Singer, that are now collaborating with us on the SDK and the Hub, because they all recognize the need for such a kind of central core set of tooling for the ecosystem and the community to build on top of.
[00:16:11] Unknown:
Taking the sort of 2 main efforts that you're doing separately between the SDK and the Meltano Hub, let's first dig into kind of the discoverability piece of it, because as you mentioned, a very difficult aspect of decentralized communities can be understanding what are the available options, what is the relative quality and activity. And, you know, when you have cases where a given repository is forked multiple times, which is the 1 that everybody is actually using versus the 1 that was easiest to find? And so in terms of the Meltano Hub, I'm wondering if you can just talk through some of the goals that you have for it and some of the challenges that you have in terms of identifying which repositories and projects to include, and just the overall design elements to make it easy for people to discover the tools that they actually need to complete their tasks.
[00:17:00] Unknown:
The primary goal with the Hub is to bring a level of discoverability and transparency to the ecosystem that just doesn't exist currently. In terms of, you know, what are some measures that we can identify to say, hey, this is the higher quality tap and target. So there's gonna be multiple layers for any given connector that's listed on the Hub. So we started out by primarily just listing, you know, the taps and targets that are discoverable within Meltano. Those have gone through a review process. We're confident of the settings that are required to run the tap and target within Meltano, and we have a good YAML backed definition for what that tap or target is. The next phase was, basically, you know, we scraped GitHub using some fairly tight search terms around, like, tap, target, Python. I think we included Singer as well, to kind of pull a full list of everything that's available on GitHub.
We're actually doing this using Meltano, running dbt within Meltano against Athena as the target, to clean up the data and to basically say, like, okay, here's the big list of taps and targets. And then from there, it's been a manual curation effort to initially go through and identify, like, yes, this is related to Singer. It has, you know, a README, and it seems to be, you know, somewhat active. There's a commit within, you know, the past couple of years. So there's that initial kind of manual work. And the iterations that we've taken it to now include all of the forks that are available. And we have a bunch of proxy metrics now, because eventually we want to be more automated. But to really kind of bootstrap the discoverability, we're putting a lot of manual effort into this at first.
So, obvious things: if the fork has more stars than the canonical repo, we're gonna help prioritize that 1. If there are more recent commits than some of the other projects, that'll kind of factor into whether we consider this a high quality tap and target. 1 of the big things, though, I think, is we're gonna have to rely on the community to help us identify. It's like, hey, the project owner for this tap doesn't seem to be responsive. I've forked it, I've implemented this in my version. Can we, you know, promote this on the Hub? And I think that's an excellent path to show that, yes, this is a better version of the tap and to show that this is kind of a community driven effort. You know, long term, I think the goal would definitely be to have these different project maintainers able to directly push to the Hub themselves to identify, like, hey, I'm actively working on this, or to even, you know, introduce new taps and targets that we haven't discovered yet. So short term, there are some very obvious things that I think we can do, and we are surfacing those on the Hub currently. Eventually, we wanna get a little bit more automated and dynamic, responsive to the community, and able to take an even broader array of feedback on what is a high quality tap and target.
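The proxy-metric ranking described here could be sketched roughly as below. The metric names, weights, and repos are illustrative assumptions for the sake of the example, not the Hub's actual scoring logic:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Repo:
    name: str
    stars: int
    last_commit: datetime
    open_unreviewed_prs: int
    is_canonical: bool  # the original repo vs. a community fork

def health_score(repo: Repo, now: datetime) -> float:
    """Toy scoring: stars and recency count for, stale PRs count against."""
    days_stale = (now - repo.last_commit).days
    score = repo.stars * 1.0
    score -= days_stale * 0.1          # penalize inactivity
    score -= repo.open_unreviewed_prs  # penalize an unreviewed PR backlog
    if repo.is_canonical:
        score += 5                     # slight tiebreak toward the canonical repo
    return score

def best_variant(repos, now):
    """Pick which fork of a connector to surface on a hub listing."""
    return max(repos, key=lambda r: health_score(r, now))
```

The point of the sketch is the design choice it encodes: an actively maintained community fork can outrank an abandoned canonical repo, which matches the "promote the better fork" workflow described in the conversation.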
[00:19:49] Unknown:
In terms of the actual list of taps and targets, when I was going through, there's a very significant list of taps for being able to source data from, and the number of available targets is comparatively smaller with the obvious targets of things like Snowflake and Redshift. And I'm wondering if you can just discuss what your opinions are as to why the number of targets is relatively small, particularly compared to the number of available taps.
[00:20:13] Unknown:
There are a couple of things going on here with the smaller number of targets. 1 of them, as you just mentioned, is that there are fewer, like, mainstream targets, or let's call them data platforms to load to, than there are total SaaS sources out there. So obviously, you know, we right now are approaching 200 different sources that we can support, or that are taps out there, and we're putting those on the Hub. We think that's gonna be 1,000 in a couple years, because the number of SaaS sources and SaaS systems where you might wanna get data from is just going to grow and grow and grow. And as it gets easier to write taps, you're gonna wanna pull data from them. It's not always the case that you can load data back into those systems. And so most people are just loading it into, you know, Redshift, Snowflake.
We just wrote 1 for Athena based on a preview of the target SDK. And there are just fewer, you know, 1st class data platforms that people wanna use for their, you know, dbt back end, for instance. I mean, a good measure would be, can we support every target that dbt supports? Because that's where we expect a lot of data engineers will eventually be doing their data processing. And so we wanna make sure we have a 100% coverage there. I do wanna say 1 more thing that we started to think about in terms of broadening the number of targets we do support. The term reverse ETL has come up, which I'm not a huge fan of. I prefer to think of ELTP, with P being publish. So extract, load, transform, and then finally publish, optionally.
But anyway, that final step is something we're thinking a lot about, that reverse ETL or the publish step of the pipeline. And we think there are things we can do in the target SDK that make that more realistic for people. For instance, loading data back to Salesforce, loading data into, like, an account management system, or even some, you know, SaaS systems out there that do support data ingestion. And so the challenge with those is that the contract is always very strict. You can't just load arbitrarily shaped data into a target. So in order to make that viable, we need a decent mapping paradigm that can shape the data, rename it, alias it, nest it, or unnest it to make those viable. That kinda gives you a lay of the land for the targets and how we're thinking about it. Yeah. Definitely
[00:22:27] Unknown:
is a slightly more complicated piece of the puzzle, because pulling data out of a system is fairly straightforward. You know, they tell you these are the shapes of the data that I have. But then when you're loading data into a system, that's where a lot of opinion and special cases and, you know, particular needs come into play. And so you have to be a lot more flexible in terms of how you actually load the data into the target or, you know, the potential transformations that you might want to do in flight, where I know with sort of the ELT approach, you try to minimize or eliminate that, but, you know, it's still something that has to be factored into the target aspect. And, also,
[00:23:01] Unknown:
the capabilities of the targets are going to be different, but you still need to be able to support all the data sources, where, you know, depending on the target, it might have to be entirely flat data, so you have to unnest it, or it might support nested data structures, and the levels of nesting might be different between them. So, like, it definitely adds a lot more complexity to the problem. Yeah. Exactly. A great example that came up recently in these conversations, we have office hours where we engage with the community, get ideas and challenges, and talk about new features and needs. But 1 of the things that came up in 1 of those sessions was that, you know, let's say you're performing all your data transformation in Snowflake, and Snowflake is notoriously, like, all caps in all of their column names, at least in how it's stored internally. And so you can override that, but it's a lot of work and it's a lot of pain. If you have a downstream target from that that requires a certain casing, where do you even start? We are working on this, and we have some issues open right now to explore this. But we think an inline mapping transform that can take aliases or can do minor transforms inline, kind of analogous to the map step of a MapReduce, where you're not necessarily joining or aggregating data, just kind of aliasing it and doing some cleansing inline, is a good way to open up a large number of target SaaS APIs.
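A minimal version of that inline mapping idea might look like the sketch below: a map step that renames and re-cases record keys as Singer messages pass from tap to target. The alias rules and record shown are hypothetical, and a real implementation would also need to rewrite the corresponding SCHEMA message:

```python
import json

def make_mapper(aliases=None, key_case=None):
    """Build an inline map step that rewrites RECORD keys.
    aliases: {old_name: new_name}; key_case: 'lower', 'upper', or None."""
    aliases = aliases or {}

    def map_key(key):
        key = aliases.get(key, key)
        if key_case == "lower":
            return key.lower()
        if key_case == "upper":
            return key.upper()
        return key

    def transform(line):
        msg = json.loads(line)
        # Only RECORD payloads are reshaped here; STATE passes through
        # untouched, and SCHEMA handling is omitted for brevity.
        if msg.get("type") == "RECORD":
            msg["record"] = {map_key(k): v for k, v in msg["record"].items()}
        return json.dumps(msg)

    return transform
```

Like the map step of a MapReduce, each message is rewritten independently, with no joins or aggregation, which is what keeps this kind of transform cheap enough to run between a tap and a strict-contract target.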
[00:24:12] Unknown:
And then as you continue to build out the Meltano hub and add functionality, and you mentioned wanting to be able to automate some of the different elements of determining whether or not a given tap or target should be included in the hub or be able to publish some of the quality metrics. I'm wondering what are some of the additional features or plans that you have for the project. I know that when I was looking through the issue tracker, there was something for being able to potentially have a sort of core test case for a tap or a target to determine if it meets a certain baseline quality level. Wondering if you can just talk through some of the ideas that you have for the Meltano hub and the goals that you have for it in the near to medium term.
[00:24:54] Unknown:
In the short term, with these initial couple of releases, we've prioritized having a statically generated site. So everything is very transparent and open about what data is driving the website, where it is stored, and where we are pulling it from, because this is an open source ecosystem and we want to be fully transparent. Long term, and that means probably within less than a year, we want to move this to more of a dynamically generated and backed website, similar to what you would see with, like, the npm registry or PyPI, where different folks can register their own taps and targets without having to make a pull request to this single repository.
We wanna support that decentralized nature of the ecosystem. Specifically, to your point about automated testing and validation of different taps and targets, the interesting problem is that we, being Meltano, are certainly not gonna be able to test every tap. There's a long tail of thousands of APIs that we would never be able to test. And so coming up with a generic case for everything basically boils down to, like, are we able to run the executable? And by that, I mean, does it output its readme, or does it list its configuration?
The cool thing with that, though, is that as more taps are built using the SDK that we've made, we can integrate this automation within any project that is built using the SDK. I'll let AJ talk a little bit more about this, but we can have kind of mock data in there. We can output a set of common CI tests that can also be uploaded to the hub to say, like, hey, this was tested against this fake data, and we even have that data available for download. So as these pieces all start to integrate together, we can continue to slowly level up the entire ecosystem, because you'll have high-quality taps that emit data that's easily read by the hub and can be communicated to people to help them understand what is high quality and what is not. I'll just touch on that one point you mentioned. We actually did add recently
[00:26:57] Unknown:
a generic CI suite, and we will continue to add onto this suite. But for any tap, you can run this in your CI pipeline. It does some basic plumbing testing to make sure that you can open up each stream, emit at least one record, and close it down. We'll keep expanding what we think we can generically test, so that test suite gets better and better. There shouldn't be any barrier beyond, you know, learning your CI pipeline enough to put that test script in there, and we might even be able to help with that process. So there should be no reason why tap maintainers don't have CI tests. And as more and more do, then we can plug that in as a key metric, if not one of the most important metrics, for the health of a tap. One more point I would like to add as well is part of the
[00:27:42] Unknown:
growth that we've seen in the ecosystem has been driven by the partners of Meltano. There are consulting companies that build taps and targets for different customers, and we've had good success with these consulting companies basically sponsoring, or taking primary ownership of, if you will, an individual tap, and committing to being a responsive maintainer of a given project. So within the hub, we wanna highlight that. We wanna make it easy for either consulting companies or individuals to adopt a tap or target if its status is kind of languishing. And so, again, we're looking to the broader community, whether it's individuals potentially doing some free work, or a consulting company that is managing a tap for their clients. We just wanna build a little more dynamic system to engage everybody within the hub. And digging into the SDK itself,
[00:28:37] Unknown:
I know that, as you said, you're trying to provide a baseline to build off of so that people don't have to know all of the ins and outs of the Singer protocol. I'm wondering if you can just talk through the functionality that you're trying to provide, and how that support, or the particular areas of focus, differs between the tap and the target
[00:28:55] Unknown:
SDK? I'll start with a little bit of the evolution of what's been done before. Previously, most of the investments were made in terms of helper libraries and maybe template projects, but the developer had to know when and how to implement each piece. And what happened was that core functionality pieces, like just emitting data, were the first things to get delivered, and then logging metrics, complex schemas, making sure you handle every data type that might flow your way, those things got added only when needed. And so that's why you see breakage sometimes on taps that haven't had a lot of love: they solved what they needed to get the data through. Contrast that with the SDK. Every single tap will support 100% of the Singer spec without any real effort on the developer's part. The developer can just focus on integrating with their API. Like, how do I get records? Okay, I got a record. I'm just gonna emit a series of dictionary objects, and my development work is mostly done at that point. We still need to define for the downstream consumer what our schema might be, in terms of, you know, these 12 columns or whatnot. But we're really letting the developer focus on what's specific to that tap. By design, there isn't any plumbing for the Singer spec in the code that they need to write. We're doing that on purpose so that they can just focus on that capability, but the long term vision is that this also enables the spec itself to grow. It allows us to make sure that if there are extensions to the spec that are optional, all taps and targets will deal with those gracefully.
And that really makes us future proof as well. Any taps or targets that are built on it will, to some degree, be able in the future, by upgrading their dependency on the SDK, to just instantly support new capabilities. For generic taps, developers just implement a method called get_records. But if you're writing a tap that hits a REST API or GraphQL, you can just specify the things that are relevant to the REST API or GraphQL. So for REST, you need the path and you need the URL params; just override those. You don't have to deal with the requests library. You don't have to deal with retry or backoff or any of that stuff. And same thing with GraphQL: you just put in what your GraphQL query should be. You don't have to deal with the rest of the implementation. Even auth: we've got three supported auth mechanisms, JWT, simple auth, and OAuth, where you can save tons of time on those pieces of the plumbing. And then also, again, as I mentioned before, future proof them.
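The division of labor described here can be illustrated with a toy model. This is not the real singer_sdk API, just a sketch of the shape: a base class owns all the Singer plumbing, and the developer only supplies a schema and a get_records method that yields plain dictionaries.

```python
import json

# Toy illustration of the SDK's division of labor (not the actual
# singer_sdk classes): the base class handles message plumbing; a
# developer only fills in the stream name, schema, and get_records().
class Stream:
    name = None
    schema = {}

    def get_records(self):
        raise NotImplementedError

    def sync(self, out):
        # Plumbing the developer never writes: SCHEMA first, then RECORDs.
        out.append(json.dumps({"type": "SCHEMA", "stream": self.name,
                               "schema": self.schema, "key_properties": []}))
        for record in self.get_records():
            out.append(json.dumps({"type": "RECORD", "stream": self.name,
                                   "record": record}))

class UsersStream(Stream):
    name = "users"
    schema = {"properties": {"id": {"type": "integer"}}}

    def get_records(self):
        # The only "real" work: yield plain dicts, one per source record.
        yield {"id": 1}
        yield {"id": 2}
```

In the actual SDK, a REST stream would additionally expose things like the base URL, path, and URL params as the override points, with retries and auth handled underneath.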
[00:31:27] Unknown:
And so in terms of the overall design space of the Singer specification, I know you mentioned that one of the challenges of the ecosystem is this kind of fragmentation of incomplete support for the spec, and that there are different elements that might be extra effort that somebody didn't want to go through. And in your work of building the SDK to move that out of the path of consideration for people who just want to be able to integrate with a given data source, I'm wondering what are some of the detailed pieces of the spec that you've had to tackle, and any design challenges that you've run up against in terms of how to approach building the SDK to make it easier for people to not have to worry about those pieces?
[00:32:10] Unknown:
Yes. That's a great question. We've had several, actually, and it's difficult to implement these things generically. But I can give a couple of really good examples. One is incremental streams, meaning a stream that should automatically be able to resume itself from the last time it was run. You've got a billion rows, ten billion rows in your fact table, and you just want the new ones, because you don't wanna replicate everything every time. For that incremental replication, there's an assumption, and it's in the spec: if you get interrupted when 90% is through, the spec has capabilities so that you don't have to redo that 90%, even mid-stream if it failed. But the logic for that all breaks down if your data is not emitted in a sorted fashion.
Having 90% of the records, if the stream is not sorted, doesn't really guarantee that I can resume at any point; I have to start over again. So that was one challenge we had to solve: we wanted incremental replication whether the stream is sorted or not, and we have that built into the SDK now. If you specify that your stream is sorted, with just a boolean is_sorted equals true, then we apply a more stringent criteria, but you get the benefit of being more likely to resume after an interruption. Similarly, it's been very hard for people to implement REST APIs that have a nested parent-child relationship. So you have a project, and underneath the project you have an epic; underneath the epic you have issues and comments; underneath the comments you have emoji reactions.
In the REST API, every one of those things is keyed by a nested identifier. Previously, developers had to write very complicated loops dealing with that structure: how do you retry and restart, where is the bookmark kept? That was a very hard thing for people to implement in the past. Now we have very simple parent-child relationship support in the SDK. You can just specify that one stream is a parent to another, and what context the parent should send to the child. Once you set that up, you're done: the SDK will take care of the orchestration, including the bookmarking and the resume. So those are two examples of hard things to work out. But if we solve them generically, and as other developers start to join forces and contribute to the SDK with us, we're getting a lot of very powerful capabilities in the SDK.
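The parent-child orchestration described above can be pictured with a small sketch. The real singer_sdk expresses this through stream classes (a child declares its parent and the parent supplies a context per record); the function and field names below are illustrative, not the SDK's API.

```python
# Toy sketch of parent-child stream orchestration. For each parent record,
# a context (e.g. {"epic_id": ...}) is derived and handed to the child
# stream, which might fetch something like /epics/{epic_id}/issues. A
# per-context bookmark lets an interrupted sync resume with the parents
# that were not yet completed.
def sync_parent_child(parent_records, get_child_context, fetch_child_records):
    state = {"completed_contexts": []}  # per-context bookmark for resume
    out = []
    for parent in parent_records:
        context = get_child_context(parent)
        for child in fetch_child_records(context):
            out.append(child)
        state["completed_contexts"].append(context)
    return out, state
```

The key point is that the developer only supplies the two callbacks; the looping, ordering, and state tracking live in one generic place.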
[00:34:26] Unknown:
I wanna jump in and add something onto that as well that's less technical, but I think was well received within the community. One of the first things I did when I joined Meltano was to basically read all the documentation out there and get really familiar with the current state of things. And I found a gap in how the Singer specification was actually described. So I took a stab at basically rewriting it in a way that walks people through the basics and then dives into the technical details. We have published our interpretation of the Singer spec, as we're calling it, on the Meltano Hub. I think it simplifies, for everyone new to the ecosystem, what the Singer spec actually is. Like, what is happening? What information is being piped from one executable to another? We dive into that, and then we highlight the different types of messages, and also what configurations and what each kind of key represents.
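For readers following along, the information being piped from one executable to another boils down to a handful of message types, one JSON object per line, from a tap's stdout to a target's stdin. The values below are made up for illustration:

```python
import json

# The three core Singer message types: SCHEMA describes a stream's shape,
# RECORD carries one row of data, and STATE records where the tap left off.
messages = [
    {"type": "SCHEMA", "stream": "users",
     "schema": {"properties": {"id": {"type": "integer"},
                               "email": {"type": "string"}}},
     "key_properties": ["id"]},
    {"type": "RECORD", "stream": "users",
     "record": {"id": 1, "email": "jane@example.com"}},
    {"type": "STATE",
     "value": {"bookmarks": {"users": {"replication_key_value": 1}}}},
]
# What actually travels over the pipe: newline-delimited JSON.
pipe = "\n".join(json.dumps(m) for m in messages)
```

This is exactly why the spec is agnostic to how taps and targets are run: anything that reads and writes these lines can participate.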
So we've had a few people reach out to us and say that reading through that has helped them understand some confusing aspects of the spec that were previously intertwined with a specific way of running the taps and targets. And we tried to write this in a way that is agnostic to how you're actually running or executing it. It's just the pure spec and the pure configurations
[00:35:38] Unknown:
that are expected to be consumed by a tap or target. And at the same time, with the SDK, we're making it less necessary to actually know all of those details. But of course, there is still a large library of these taps and targets that have been written with the kind of utility libraries that AJ was referring to earlier, where making sense of them, fixing bugs, or extending their behavior does require you to become pretty confident that you're not accidentally breaking something because of a misunderstanding of the spec.
So in connectors built with the SDK, that danger of accidentally breaking something will be much lower. The maintenance surface will be much smaller and much more specific to the source you're implementing, rather than implementing all of these Singer bits and pieces. But it's already making a big difference for the community just to have a clearer description of the spec itself, for when you do need to do that debugging, or if you are contributing to the SDK itself and you want to reimplement some aspect of the spec in a way where the user implementing the connector does not need to know it, but can do it through some integration point inside the classes that you can override within the SDK.
And just one data point that I think is really cool: we've started seeing people porting existing taps and targets to the SDK, and they are reporting on average about 70% savings in code surface, with increased performance, and in some cases functionality that is technically optional in Singer now being available for free because the SDK just implements it if you use the correct integration points.
[00:37:15] Unknown:
In terms of the actual specification, as you've been working through building out the SDK, you've had to dig into the capabilities the specification is aiming for and how it intends to implement those capabilities. And you also mentioned the potential for future additions or enhancements to the specification. I'm wondering what are some of the aspects of the spec that you have run into during the process of building the SDK that you feel are ripe for redefining or addition, or areas where you were particularly impressed by the level of elegance that they
[00:38:27] Unknown:
provided in terms of the way to think about these data transfer problems. I'll start off with a positive word about the spec: it really is a cool spec. There are some things we might have done differently, but it has so much already planned out, and it's a very capable spec. What we're finding is that there are a lot of extensions that people have added that add specificity or add capabilities. One of those is ACTIVATE_VERSION, which basically says: when I send you this message downstream, you can go ahead and delete records prior to this version, or mark them as deprecated. It's an extension to the spec that basically deactivates prior versions of those records.
And that's something we will probably document and carry into the SDK directly. So at best, a tap and target that both understand that extension can take advantage of it. But even at worst, neither will be broken: if the target doesn't support that, or can't support deprecating records, then it would just gracefully warn or continue
[00:39:31] Unknown:
without that. There are a number of things that are not currently in the spec, and there have been requests from the community to add additional functionality, and we're working with community members on various things there. But there are also things that were never explicitly defined in the spec at all, and taps and targets have been implementing behavior in certain ways, depending on particular behavior that is actually not defined in the spec, which means that you'll sometimes run into errors when you run a particular tap with a particular target. So part of what we're doing with the SDK, and with our kind of rewritten spec, is choosing specific new guidelines and rules with regards to some of those points.
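The ACTIVATE_VERSION extension mentioned earlier can be pictured like this. The field names follow common community usage, but since this is an extension rather than the core spec, treat the exact shape as an assumption:

```python
import json

# A tap emits records tagged with a version number, then signals that the
# new version is active, so the target may soft-delete or deprecate rows
# belonging to older versions of the same stream.
messages = [
    {"type": "RECORD", "stream": "users", "version": 2, "record": {"id": 1}},
    {"type": "ACTIVATE_VERSION", "stream": "users", "version": 2},
]
wire = [json.dumps(m) for m in messages]
```

A target that does not understand the message can simply skip it, which is the graceful degradation described above.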
That includes messages that have been used by a lot of taps and targets already, like ACTIVATE_VERSION, putting those in stone and making them semi-official. And it includes things like having expectations around how the incremental replication state payloads are structured. There's a little JSON dictionary that defines where a tap left off on the previous run, which can then be used on a subsequent run to pick up right there and not repeat any work. Certain taps and targets have been written with certain expectations of what will be in there, and that incompatibility is something we can resolve in the SDK by having every tap and target that uses the SDK do things a particular way.
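A minimal sketch of how a tap might consult that state dictionary on a subsequent run. The "bookmarks"/"replication_key_value" layout shown here matches common Singer usage, but the helper itself is hypothetical:

```python
# Hypothetical helper: look up where the previous run left off for a given
# stream, so the tap only requests rows newer than the saved bookmark.
def get_starting_replication_value(state, stream, default=None):
    return (state.get("bookmarks", {})
                 .get(stream, {})
                 .get("replication_key_value", default))
```

Standardizing this lookup in one place is exactly what removes the incompatibilities between connectors that each guessed at the state layout.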
[00:40:50] Unknown:
Yeah. So just to add one point to the question of what are the weaknesses, or the opportunities for improvement, in the spec. One thing we have noticed is that in terms of performance, it's not always optimal to read and write a single row or record at a time. And that is kind of core to the spec: the source emits the lowest common denominator, a record, one record at a time, and the target receives one record at a time. We think that for the future and scalability of high-volume, high-throughput taps and targets, and also for taps and targets that are already optimized for file-based batches, we want to extend the Singer spec by providing support for a batch message type. That is, instead of sending one record at a time, we can actually send a pointer to a file. And that pointer to a file could have been automatically created by the tap, or it could have been created by the SDK.
Either way, it will be ingested by the target, either through an adapter already built into the SDK that just provides this capability, or by native integration with whatever the downstream target is. If that downstream target can just read files from S3 super quickly, why not let it do that instead of reloading and reprocessing a single row at a time? That's a case where we think the Singer spec can be improved. It's a slight change to the paradigm, but it's still respecting the channel; it's just sending a different type of message in the spec, adding support for that additional message type to enable batch messages. And we have some precedent for this: the Wise folks, formerly TransferWise, have created a feature called Fast Sync. We're doing something slightly different, but it's in cooperation with them and based on some success they've already seen with this batch type support.
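At the time of this conversation the batch message was a proposal, so the shape below is illustrative only; the field names ("encoding", "manifest", the example bucket path) are assumptions, not a finalized spec:

```python
import json

# Illustrative shape of a BATCH-style message: rather than one RECORD per
# row, the tap hands the target a pointer to a bulk-loadable file that the
# target (or an SDK adapter) can ingest directly.
batch_message = {
    "type": "BATCH",
    "stream": "orders",
    "encoding": {"format": "jsonl", "compression": "gzip"},
    "manifest": ["s3://example-bucket/orders/batch-0001.jsonl.gz"],
}
wire = json.dumps(batch_message)
```

The channel is unchanged, still one JSON message per line over stdout/stdin, which is why this can remain an opt-in extension rather than a new protocol.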
[00:42:34] Unknown:
The way we are implementing these new features, like the batch messages or the official iteration of ACTIVATE_VERSION, is that this will effectively be opt-in behavior, where the orchestrator, like Meltano, will handle a little bit of negotiation so that the tap and target can figure out which optional features they both support. That way we can keep pushing forward the Singer ecosystem, the Singer protocol, and the SDK without leaving behind old connectors that may not have been updated with that new functionality yet. So if you are building a connector on top of the SDK, you can just update it to the latest version of the SDK and automatically get the benefits of these new features. But since this is a decentralized ecosystem of connectors, living in individual repos maintained in many cases by individual data teams or people, we need to find a way to scale this protocol to having thousands of connectors out there that might not all have been updated in the last couple of months. So we are building this SDK and these protocol extensions in a way that allows falling back onto the base Singer specification as it exists today, with the more performant or additional-functionality features being enabled only if the tap and target find out about each other that they both support them.
We don't want to be in a place where only the latest iteration of the SDK or the latest iteration of the spec will work in platforms like Meltano. It's very much core to the way we are designing these extensions that they are optional, since the core of the Singer spec is really good when it comes to communicating to a system what the data looks like, the schema, and what that data actually is. That will be good enough for the vast majority of pipelines. Everything else is extra,
[00:44:23] Unknown:
and that is built into the way we are extending it. The benefit of us working on both the Meltano Hub and the SDK is that when new features or capabilities are added and implemented in a way that benefits all of the taps and targets built with the SDK, we can bring that information into the hub as well, visualize it, and make it easy for people to understand: does this target support incremental replication, or upserts? Does this tap have incremental replication, or do you have to get everything every time? We wanna be able to highlight the capabilities of existing taps and targets. Some of that's gonna be, like, we may need to initially inspect the code to pull that information out, but we expect, as the ecosystem continues to grow, that we'll just have that information programmatically shared on the hub, and you'll know what a tap or target is capable of.
[00:45:12] Unknown:
And so now that you have spun Meltano out into its own business, and you've raised funding, and you have your own sort of corporate mission for the project and the technology that you're building and supporting, I'm wondering if you can just discuss how that influences the priorities and the focus of the work that you're doing, and some of your plans for what you intend to productize in terms of Meltano to make this sustainable going forward?
[00:45:36] Unknown:
Yeah. That's a great question. And the mission for us building Meltano hasn't changed because we're now a company; we just have more resources to actually make that mission come true. We are not planning to start selling any particular product for at least another year, because we really want to build Meltano into the best possible open source data tooling for data teams to build their projects on top of, and to get all of those advantages of software development best practices in the way that they build their data products and solve their data problems. So our goal is still to make this a beloved tool for data engineers that solves problems like data integration and data transformation better than a lot of the tools today. We think Singer adds a lot to that by offering this massive ecosystem of connectors, instead of being limited to the handful of connectors that can be supported by the non open source data integration vendors. And we think that, in general, bringing together different best-in-class open source tools into a single platform in the shape of Meltano allows data teams to collaborate more. And then down the line, we can follow in GitLab's footsteps with the buyer-based open core model, where the core functionality that is used by the actual data engineers and software developers on the ground will forever be open source and free, and we will figure out a way to build a business around that with proprietary functionality that might be more interesting to the decision makers at the manager-and-up level within organizations.
So that anyone who is using Meltano in a smaller team setting, or with fewer enterprise needs, will be happy with the open source version forever, just as is the case with GitLab today. But once you are a big enterprise and you want to bring this in and integrate it with other enterprise-level tooling, and get things like audit logging and single sign-on, that's the kind of functionality that we might end up charging for on a subscription basis. Today the recommended way of running Meltano is to self-host it: set up your own infrastructure, create a Docker container out of your project, and put it up somewhere that can run Docker. Down the line, we will probably also have a hosted version where you can upload your Meltano projects to our platform and we will host them for you, make sure they have the uptime, make sure you get the SLAs you're looking for, in case you do not have the in-house capabilities to handle those aspects.
So we now have the resources to build out this team and continue building out this product in this community. But ultimately what we are trying to accomplish here hasn't changed. And that is to build really great developer tools for people in the data profession.
[00:48:24] Unknown:
In terms of the overall use of the singer tools and Meltano and people using it for building these different data integration systems, I'm wondering if you can just share some of the most interesting or innovative or unexpected ways that you've seen Meltano and Singer and the SDK and Meltano Hub used.
[00:48:42] Unknown:
This is something, again, we learn a lot about when people come and talk about their use cases, what they're building, and what challenges they're seeing. One of the ones we learned about through office hours was that AutoIDM is actually using Singer to load into and integrate with their directory service, I think it's Google Directory rather than Active Directory, actually using it for account maintenance, with the account objects being the target of an EL pipeline. So I think that's really interesting. We're also finding more people who are using this for the kind of reverse ETL, or ELT-P, process of actually publishing to, like, Salesforce or the like. And so those are really novel. We also have our own Douwe, who has a hobby project, a cool integration that takes data from one personal SaaS down to another. Douwe, do you wanna mention the Lunch Money app?
[00:49:32] Unknown:
Another example of where Meltano and Singer can be useful, which is not the traditional scenario where you're at a company on a data team working with company data, is in personal projects, where data is in some SaaS or service you're using and you want to have it end up in some other place. So one example of this is budget tracking software, like You Need A Budget or the specific tool I use, called Lunch Money, where you can keep track of your transactions across your different accounts and credit cards. But these tools do not always have integrations for every single bank, or investment portfolio tracking app, or other kinds of assets that are not currently automatically integrated with that tool. So you can use Meltano, and I am using Meltano, to have an automatic syncing pipeline from various sources, banks in Mexico and banks in the Netherlands, and to automatically, on a day-to-day basis, sync these transactions into Lunch Money. I used the SDK we've built for targets to build a Lunch Money specific target, and the tap SDK to build connectors for the sources I'm talking about. And then I'm actually using GitLab CI and its built-in scheduled pipelines functionality to run Meltano within the repository's CI infrastructure, to sync over that data on a day-by-day basis. So if you are looking to sync data from any source to any destination, think outside the box. Don't just limit yourselves to the tools your employer uses, because you probably have more use cases for syncing data from A to B in your regular day-to-day life than you might realize. And I think those are some of the more interesting use cases of Meltano as well, because it kind of helps you realize that data driven
[00:51:14] Unknown:
life, rather than only limiting that to your business and your work. And one more thing I wanted to add: I think there's a lot of potential here for hobbyists and personal projects, because you're getting, like, an enterprise data engineering product, essentially, a best-practices DataOps platform that you can run for free on any hardware you want. So why not use it for quantified-self or hobby projects, or any kind of school project where you wanna use these practices,
[00:51:41] Unknown:
not just limited to enterprise environments, to Douwe's point. And as you have been working on the Meltano product, working with the Singer community, and building out the additional tooling and the Meltano Hub platform, I'm wondering what are some of the most interesting or unexpected or challenging lessons that you've all learned in the process?
[00:51:58] Unknown:
I think for me, coming into the Meltano project, having seen it through its iterations and more recent growth within the Singer ecosystem, I've really come to appreciate the value of the positioning of a technical project. When Douwe made the initial pivot back in May of 2020 to really focus on Meltano being a fantastic way to run Singer taps and targets, nothing changed in the code base once that blog post was published. But that's when the traction started to happen, and people started to understand: oh, this is solving a particular need within the Singer ecosystem. And, of course, since that day, we've continued to iterate on the code base to make it even better. But even with the interpretation of the Singer spec that I mentioned, and some of the blog posts that we've had, the value of documenting things and communicating about them in a way that makes sense and connects with the problems that people are having in their day-to-day work is really powerful. And I think I underappreciated
[00:52:58] Unknown:
it as a data engineer, and I'm valuing it a bit more. In my case, a year ago, when I decided to kind of pivot to Meltano being specifically a product for building, running, and deploying Singer pipelines, I knew that there was more to the Singer community and ecosystem than people gave it credit for. But over the last months, since we have been building out the SDK and the hub, we've been getting involved with organizations that, unbeknownst to us, have been using Singer for years, maybe inside their products where no one would ever know, or consulting firms that have been building products on top of it. I've been pleasantly surprised to find a lot of consulting firms that have been building their kind of internal tooling around Singer already, and are now rallying around this toolset that we are building, pooling their efforts, bringing in their experience, and bringing contributions to the SDK and the hub. And these were all people that I didn't really realize were a part of the Singer ecosystem until a couple of months back. Similarly, when we started scraping GitHub to find out how many connectors there really are, we were pleasantly surprised to find that there are over 200 unique connectors for different sources and destinations that, technically, any platform that supports Singer connectors supports from day one. Of course, we don't know the relative quality levels of all of these, and there might be a couple that haven't been updated for the latest API versions of various things. But if you realize that most closed source data integration tools have libraries that top out around 150, the fact that the greater Singer community, which has been written off for dead by some people, has managed to build more than 220, I think, by our latest count,
different connectors over the last two years is really impressive. And it speaks to the fact that there's something there with the Singer protocol, the Singer community, and this way of addressing the data integration problem, where almost despite the lack of attention it has received over the last two years, it has still become as widely deployed and successful as it is today. And to us, that gives us a lot of confidence that we made the right decision to bet on Singer, embrace Singer, and use that energy that still existed within the community and ecosystem.
[00:55:03] Unknown:
And then run with it, basically, because we think it has not lived up to its full potential yet, but that potential is bigger than I even could have predicted a year ago. What are the cases where the Singer ecosystem or the Meltano platform are the wrong choice for data integration, and you might be better served using either a bespoke solution or some other, you know, maybe stream-oriented approach?
[00:55:28] Unknown:
As we've been talking through Singer and Meltano and talking about the fact that it is this decentralized ecosystem of individual open source projects, and that the quality isn't always consistent, even though this is something we're addressing, all of this means that you will want to be relatively technical for Singer and Meltano to be a good solution to the data integration problem you have. You need to be comfortable working with individual open source projects on GitHub. You'll run them locally. You might need to install some dependencies. You might run into a Python stack trace, and you'll want to debug it to find out that you misconfigured something.
Right now, the level of polish that we know we can accomplish with the SDK and with Meltano is not quite there yet in the case of all connectors. There's a number of connectors that are supported out of the box by Meltano that have been certified and vouched for by our user community, and you should expect relatively few issues using those. But in general, you want to be at least software development inclined for this to be a great fit. But then if you are, it's gonna be an amazing fit, because you'll feel comfortable forking something that doesn't currently have the functionality you need, updating some library to add support for this endpoint or this property that hadn't been implemented yet. And you'll feel comfortable using the SDK to build new connectors on top of this platform that gives you incremental replication and stream selection and logging and metrics.
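To make concrete what that shared "surrounding functionality" looks like at the protocol level: a Singer tap is just a program that writes newline-delimited JSON messages (SCHEMA, RECORD, STATE) to stdout, and the STATE bookmark is what makes incremental replication possible. The following is a minimal illustrative sketch, not a real connector; the `users` stream, its fields, and the `last_id` bookmark key are invented for the example:

```python
import json


def tap_messages(rows, start_after_id=0):
    """Build the Singer messages a toy tap would print, one JSON object
    per line: a SCHEMA describing the stream, a RECORD per row, and a
    final STATE bookmark that enables incremental replication."""
    messages = [{
        "type": "SCHEMA",
        "stream": "users",
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
        "key_properties": ["id"],
    }]
    last_id = start_after_id
    for row in rows:
        if row["id"] <= start_after_id:
            continue  # already replicated on a previous run
        messages.append({"type": "RECORD", "stream": "users", "record": row})
        last_id = max(last_id, row["id"])
    # A runner persists this value and passes it back on the next run.
    messages.append(
        {"type": "STATE", "value": {"bookmarks": {"users": {"last_id": last_id}}}}
    )
    return messages


if __name__ == "__main__":
    for msg in tap_messages([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]):
        print(json.dumps(msg))  # real taps write these lines to stdout
```

Because every message is plain JSON on stdout, any tap can be piped into any target, which is what the episode means by connectors working "from day one" on any Singer-compatible platform.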
While if you were to write a custom Python script for every combination of source and destination, that's gonna be a lot of work to replicate that kind of surrounding functionality. And you're going to have that internal, in-house maintenance burden, as opposed to building against a standard, which means you can open source it and crowdsource that maintenance support from the wider community. If you're not comfortable self-hosting a platform like this, if you're not comfortable setting it up locally with a CLI and some YAML files that you'll check into a git repo, it's not the right choice right now. We think that as the polish in the ecosystem increases, we will also start focusing more on the UI aspect of Meltano for those users who are more comfortable pointing and clicking in a browser rather than running CLIs locally. And like I mentioned earlier, a year or so from now, we'll probably look into having a hosted edition as well. And at that point, even a less technical data analyst, someone who is more comfortable pointing and clicking reports and dashboards together, will also be able to use Meltano and Singer and get access to that massive library of connectors.
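The flip side of building against the standard is the consuming end: any target can load the output of any tap, which is why the maintenance effort can be crowdsourced rather than duplicated per pipeline. Here is a toy sketch of that consuming end, with a dict standing in for a real warehouse (all names hypothetical, not a real target implementation):

```python
import json


def run_target(lines):
    """Consume newline-delimited Singer messages, as any tap emits them,
    and 'load' the records. A real target would write to a warehouse;
    here tables is just a dict of stream name -> list of rows. Returns
    the loaded tables plus the last STATE seen, which a runner persists
    so the next tap invocation can resume incrementally."""
    tables = {}
    last_state = None
    for line in lines:
        msg = json.loads(line)
        if msg["type"] == "SCHEMA":
            tables.setdefault(msg["stream"], [])
        elif msg["type"] == "RECORD":
            tables.setdefault(msg["stream"], []).append(msg["record"])
        elif msg["type"] == "STATE":
            last_state = msg["value"]
    return tables, last_state


if __name__ == "__main__":
    demo = [
        '{"type": "SCHEMA", "stream": "users", "schema": {}, "key_properties": ["id"]}',
        '{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Ada"}}',
        '{"type": "STATE", "value": {"bookmarks": {"users": {"last_id": 1}}}}',
    ]
    tables, state = run_target(demo)
```

In practice this pairing is what a runner orchestrates when it pipes a tap process into a target process and stores the emitted state between runs.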
Today, those users are probably better off with some of the more traditional hosted tooling that gives you way less flexibility, with the advantage that less technical expertise is needed to get started.
[00:58:03] Unknown:
Are there any other aspects of your plans for the near to medium term of Meltano and the Singer SDK and Meltano Hub and the overall growth of the ecosystem that we didn't discuss yet that you'd like to cover before we close out the show? One of the things that I wanna highlight is we published a blog post about our vision for the future of Meltano. And there are some high level things within Meltano itself that we wanna do, and I won't go over all of those, but it covers some of the obvious things, like monitoring, observability, and data lineage. We have that published. We also have a roadmap on the website as well, highlighting month by month and quarter by quarter, as time goes on, what we're gonna focus on. And I think we highlighted in this conversation the big things around Singer and the ecosystem. We're gonna keep putting resources into developing the hub to be a fantastic way to, you know, discover these taps and targets. And we're gonna do that in a way that's really collaborative with the community. If there's something that they wanna see, you know, people have suggested a rating system, potentially even comments, and we'll see how that goes. But we wanna make it the destination to find out what exists out there. And then, tied in with the SDK, that's the benefit of us being able to build these things together: we can make them have a strong connection, and one really benefits the other. So we mentioned the batch message, or, you know, high throughput extension to the protocol.
But, really, it's just that we're a separate company, but we have the resources to really make this ecosystem what it should be and what it was promised to be, in a way that supports the ecosystem and also enables Meltano to be a really successful open source, open core project. So I would definitely just highlight the vision blog post on our blog, as well as the roadmap within our documentation on the website. One thing I wanna add to that, just to make it clear, because we're talking about the Singer ecosystem, and of course also talking about Meltano, which we believe is the best way to run Singer based pipelines. But Singer still exists as kind of an independent
[00:59:58] Unknown:
standard that a lot of these connectors have been written to support. And a lot of different systems exist that behind the scenes use Singer to power their data integration. And we want it to continue to be that way. We don't necessarily need to own Singer. We want to improve the Singer ecosystem, but the hub has been written also with a programmatic interface for getting metadata and information on these various taps and targets that can be consumed by any platform that supports Singer based connectors. We want Singer to be, and continue to be, the de facto standard for open source connectors, whether you end up running those with our orchestration and running solution, Meltano, or whether you implement them somehow into your own infrastructure, or you use some competing product that has also decided to use Singer, so that this connector library is available whatever your choice of tooling to run them becomes. Ultimately, we all benefit from having a massive, massive library of high quality connectors for data sources and destinations.
[01:00:58] Unknown:
And we don't want that to be Meltano specific, which is part of the reason why we have decided to focus on lifting up Singer rather than establishing our own protocol from the start. Well, for anybody who wants to follow along with the work that you're all doing and get in touch and contribute to the ecosystem, I'll have you each add your preferred contact information to the show notes. And as a final question, I would like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:01:25] Unknown:
So recently, I gave a talk at Coalesce, which is the user conference that the Fishtown team that manages dbt puts on. And that talk was focused on running your data team like a product team. And the idea behind that talk originally was that data teams are, you know, kind of stuck in this service model that doesn't really serve them well as data professionals, and it really doesn't enable data teams to show the value that they truly can generate for a business or for an organization. And this ties into technology, because I think a lot of the tooling that exists for data teams today is very focused on one specific aspect that solves an immediate problem, and doesn't really take a full look across the organization at how everyone is using data and interacts with information within an organization.
So one of the reasons I was excited to join Meltano is we believe Meltano can be a tool that helps data teams to run better, to run more like product teams, and to collaborate better across the entire organization. We believe a lot of that comes down to having a lot of software development best practices integrated into the product. So having, you know, continuous integration based on a file format that everybody can collaborate on, whether it's in, you know, GitHub or GitLab. Having different environments where you can test out whether your data transformation pipeline is gonna work.
Having good observability across the entire data life cycle. These are problems that are being solved in isolation, but it creates this very disconnected set of different pieces of software that everybody in the company is using. And maybe the manager uses this one tool to look at health, but, really, the data engineers are in Airflow, and they're not really talking to each other. And it creates this kind of mess. So we wanna be kind of a broad tool to help level up data teams across an organization, to help them think about how they are delivering for their, you know, internal customers, their colleagues.
Are you delivering insights that help move your organization forward? So it's partly a people problem, and I think we hope, through Meltano being a fantastic product, to really change the landscape of how data teams are organized and funded, as well as, you know, the tooling that they use, to really just level up the data profession so that data professionals, one, have the respect that they deserve because of their capabilities and their skills, but also are providing the value that they absolutely can, delivering great insights for organizations to make better decisions. That's something that I'm super excited about, and I think we can really achieve that with Meltano.
[01:04:05] Unknown:
Alright. Well, thank you very much for taking the time today to join me. It's definitely a very interesting ecosystem and an important project that you're all working on, and I'm definitely excited to see the amount of energy and excitement that's been growing around it, and the improvements that you've been able to make that are going to benefit the entire ecosystem. So I appreciate all of the time and energy that you've all put into this, and I hope you enjoy the rest of your day. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Welcome
Meet the Guests: Douwe Maan, Taylor Murphy, and AJ Steers
Overview of the Singer Ecosystem
Challenges and Improvements in the Singer Ecosystem
Meltano Hub: Goals and Discoverability
Meltano SDK: Simplifying Connector Development
Future Enhancements and Extensions to the Singer Spec
Meltano's Mission and Business Model
Innovative Use Cases for Meltano and Singer
Lessons Learned and Community Insights
When Meltano and Singer Might Not Be the Right Choice
Future Plans and Roadmap
Final Thoughts and Closing Remarks