Summary
Cloud data warehouses and the introduction of the ELT paradigm have led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silos. The dlt project was created to eliminate that overhead and bring data integration under your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- Your host is Tobias Macey and today I'm interviewing Adrian Brudaru about dlt, an open source python library for data loading
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what dlt is and the story behind it?
- What is the problem you want to solve with dlt?
- Who is the target audience?
- The obvious comparison is with systems like Singer/Meltano/Airbyte in the open source space, or Fivetran/Matillion/etc. in the commercial space. What are the complexities or limitations of those tools that leave an opening for dlt?
- Can you describe how dlt is implemented?
- What are the benefits of building it in Python?
- How have the design and goals of the project changed since you first started working on it?
- How does that language choice influence the performance and scaling characteristics?
- What problems do users solve with dlt?
- What are the interfaces available for extending/customizing/integrating with dlt?
- Can you talk through the process of adding a new source/destination?
- What is the workflow for someone building a pipeline with dlt?
- How does the experience scale when supporting multiple connections?
- Given the limited scope of extract and load, and the composable design of dlt it seems like a purpose built companion to dbt (down to the naming). What are the benefits of using those tools in combination?
- What are the most interesting, innovative, or unexpected ways that you have seen dlt used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?
- When is dlt the wrong choice?
- What do you have planned for the future of dlt?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- dlt
- Our guiding product principles
- Ecosystem support
- From basic to complex, dlt has many capabilities
- Singer
- Airbyte
- Meltano
- Matillion
- Fivetran
- DuckDB
- OpenAPI
- Data Mesh
- SQLMesh
- Airflow
- Dagster
- Prefect
- Alto
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date.
With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey. And today I'm interviewing Adrian Brudaru about DLT, an open source Python library for data loading. So, Adrian, can you start by introducing yourself?
[00:01:33] Unknown:
Sure. Thank you, Tobias. So, my name is Adrian. I studied business administration some 15 years ago and got into the data field some 10 years ago. I did 5 years of startups, tried the corporation, really hated it, got depressed, lost my faith in employment, and did 5 years of freelancing. And finally, after 5 years of freelancing, I started the company called DLT. So here we are. Here's why we're talking.
[00:01:57] Unknown:
And do you remember how you first got started working in data?
[00:02:00] Unknown:
Yeah. Actually, you know, coming from economics, I was actually interested in user behavior a lot. I wanted to understand that better. So I found a startup where I took a role as an analyst and, you know, that took me slowly towards being the data engineer, because ultimately, without autonomy, you cannot do much unless you can download your own data. So over 10 years ago, I was the one-man data team.
[00:02:24] Unknown:
And so as you mentioned, you had this long career path that ultimately led you to building DLT. I'm wondering if you can talk a bit more about what it is that you're building and some of the story behind how it came to be and why you decided that this was the problem that you wanted to spend your time and energy on.
[00:02:40] Unknown:
Of course. So the problem that I originally faced was, I guess, sometime in my freelancing career, about 5 years back, I was freelancing for a company called Urban Sports Club. And there, I built their entire data warehouse and the data team. So I ended up being the bottleneck for a team of 6 people. And the reason I say bottleneck is because in data engineering, the role is often the single point of failure. So you have stakeholders demanding more data sources, pipelines breaking all the time, bugs infesting the reports, and people needing more data than they have. Right? So my solution back then was to load the data without curating it up front.
And this way, enabling the team there to actually work on the data directly. So back then, that wasn't a good solution. You know, it was just what I came up with in that project. But 3 years ago, I tried to create a freelance co-op, because I realized most freelancers are building the same thing over and over, just slightly different. So I tried to get the most senior people here in Berlin together to build a co-op, but I realized it's actually a lot of work, and it cannot be done as a side project, basically, for a freelancer. So what I tried to do with them was to build a unified way of doing things.
And since I was the only one really interested in investing effort in it, I decided to look for a way to align commercial incentives with this so I could do it full time. So 2 years ago, I ran into a team that had previous founding experience. So we took a good look at the DLT idea. We had built something before that was kind of similar. In one year, we had something that was already quite good, but not for everybody. It was something that was solving more of my problems. So back to your question also about the target audience. Basically, what we're building is something for the data engineer and for the data user that are Python-first and want to build pipelines with low friction and maintenance.
[00:04:49] Unknown:
And for the target audience that you're focusing on, I'm curious how you think about the particular problems that they're faced with and how DLT is focused on solving the use cases for those particular personas?
[00:05:05] Unknown:
Okay. So it really depends on the persona. Right? Because if you are a very senior data engineer, then you're building a platform for other data engineers to use. And this is where choosing DLT really helps a lot, because if you choose the paradigm of schema evolution, you don't do any more schema maintenance. You have schemas automatically generated and versioned, and you can then push them downstream into your applications if you want column- or row-level lineage and all that. If you are a data engineer, then probably you don't have time to worry too much about all the ecosystem, but you're building lots of pipelines. So because DLT automates everything from passing data to it to loading, you can just pass, for example, your response JSON from an API straight to DLT and declare how it should be loaded in terms of incremental loading, and it will just be handled. So it reduces development time, I would say, about 5-fold, and similarly for maintenance, because now the schema is evolved and maintained.
And when it comes to the data analyst, I would say this is the base of the pyramid of the data user. And one of the things that is hardest for these people is to self-serve. Basically, you know, a data engineer might create all kinds of pipelines, but when they want anything changed, it's hard for these people to understand what's going on and change it. However, we've managed to actually simplify the code so much that there's no real glue code happening anymore. It's just, you know, accessing some API. And this enables the data user as well to just make simple changes.
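To make the "pass your response JSON straight to DLT" idea concrete, here is a minimal sketch using the public dlt API. The GitHub issues endpoint, the pipeline and dataset names, and the choice of DuckDB as destination are illustrative stand-ins rather than anything discussed in the episode:

```python
import dlt
import requests

# fetch some JSON from an API; this endpoint is just a convenient public example
response = requests.get("https://api.github.com/repos/dlt-hub/dlt/issues")
response.raise_for_status()

pipeline = dlt.pipeline(
    pipeline_name="github_issues",
    destination="duckdb",   # any supported destination would work here
    dataset_name="github_data",
)

# pass the raw JSON in and declare how it should be loaded; dlt infers and
# evolves the schema, and "merge" deduplicates records on the primary key
load_info = pipeline.run(
    response.json(),
    table_name="issues",
    write_disposition="merge",
    primary_key="id",
)
print(load_info)
```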
[00:06:35] Unknown:
And the obvious comparison for DLT is with some of these other extract and load tools that have been gaining ground recently in the open source space. There is Singer, which is one of the earlier ones that has some interesting community mechanics there, and then Meltano and Airbyte in particular. And then for commercial offerings, there are Fivetran, Matillion, a whole suite of others. I'm curious if you can talk to the complexities or limitations that those tools have and some of the ways that you're trying to address those shortcomings in DLT.
[00:07:11] Unknown:
Sure. So I love that you bring up Singer because Singer is actually part of the reason why I wanted to build this solution. So what I mean specifically is, if you know the story of Singer, Singer was actually released by Stitch Data when Fivetran came to market. So I guess they felt threatened or they thought, okay, maybe we can lower our maintenance or we can do some kind of open core offering. And what happened was, basically, they released the system. So first of all, Singer was built for software engineers, not for data people. So they released this source and target framework.
So first of all, data people are not used to working with frameworks. Most of them are not object-oriented developers, except maybe the Java ones that are working in enterprise teams and corporations. So, basically, the problem that Singer has comes from the way that Stitch maintained their monetization, and that is: it's really, really hard to run. Not only is it hard to run, but it's also not designed for the data users. So this just adds to the complexity. So then, you know, personally, I tried to use Singer, and I realized that while I can get away with creating sources and using it, my team wouldn't. So it's as simple as that. You can't use it in a team, and not without external maintenance anyway. And Meltano does a great job at improving the Singer standard. So they built on it. They did a fantastic job. But they're still running on top of the Singer standard. And that is fundamentally one of the core problems.
And they're basically a solution to the missing orchestration, you could say. And they've brought a lot of improvements, but still not quite for the data user. Finally, Airbyte. You know, we call it open source, but we typically have different expectations from open source software than what Airbyte is doing. So Airbyte is more like rebuilding things that exist under their umbrella. And, you know, they try to say that they're open source to attract a lot of community. But, really, if you need the more advanced features, you need to pay. So in essence, all the above are limited by some need to plug in a monetization hook. And I would say this is where we're fundamentally also different. So it's not just the architecture, but it's also how we make our money. Right?
So you can think of Singer as making money through orchestration being really hard. You can think of Airbyte as making money by keeping their open source software incomplete. The way that we plan to make money is more like the Kafka and Confluent pairing. So this is the model of an open core company, which means we have 2 products, not 1. So we don't cannibalize the open source product for the closed source product, but rather we have a really good product in the open source. And it has to be so good that it becomes standard. And once that standard exists, you can offer things around operationalizing the standard. So, in our mind, there will be a DLT Hub, which will be, you know, a place for the community to use and share the pipelines.
[00:10:16] Unknown:
And so now digging into DLT itself, I'm wondering if you can give an overview about how it's implemented, some of the architectural considerations, and the philosophical design aspects of
[00:10:29] Unknown:
it? Oh, I would love to. So we're actually quite philosophically guided. We created our product principles from the very start to understand who we are, what we're trying to build, and who we will be to other people. And really what we're trying to do is create a product that makes a lot of sense, helps with your sanity and with your work, doesn't have any kind of monetization hooks to stress you out, and is the right fit for the data person. And what I mean is that none of the products that we had before were really fit for dropping into various situations. Right?
Whereas DLT is a library. So this is the first time something like this happens in the data field. What this fundamentally means is that you can choose various components of this library to use or not use. So for example, you could use the loader that has the schema evolution, or you could not use it. You could just dump your data out of it to standard output and use something else. Or, for example, the way that we create a source: we simply decorate functions that produce data that you already have. Right? So you get to choose what you use. If you want to use the state that DLT helps manage, it can save a state at the destination that is committed atomically with your data. So you could have some Python dictionary, basically, that's keeping the state of your pipeline. You can use it implicitly by declaring what your primary key is and what kind of loading you want to do, or you can use it explicitly. So to give you an example, there are many CRM apps that have custom fields.
So for example, what we do for them, so the user doesn't have to rename these fields, is we get the field mapping. We cache it in the state. So if it changes, it doesn't change the column names later. And then we reuse it for naming, so people don't end up having to rename fields that have hashes in the names. Right? So it's all kinds of things that, you know, by being a library, it really helps people use it, drop it in, and it's just a productivity boost.
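A minimal sketch of those two ideas, decorating a function as a source and caching a custom-field mapping in pipeline state, might look like the following. The CRM client class and its methods are hypothetical stand-ins, and the exact state accessor can differ slightly between dlt versions:

```python
import dlt

class FakeCRMClient:
    """Hypothetical stand-in for a real CRM API client."""
    def get_custom_field_names(self):
        return {"cf_8a3f": "lead_source"}

    def iterate_contacts(self):
        yield [{"id": 1, "name": "Ada", "cf_8a3f": "webinar"}]

@dlt.resource(name="contacts", write_disposition="merge", primary_key="id")
def contacts(client):
    # cache the CRM's custom-field mapping in pipeline state so column names
    # stay stable even if the mapping changes later
    state = dlt.current.resource_state()
    field_map = state.setdefault("custom_field_map", client.get_custom_field_names())

    for page in client.iterate_contacts():
        for record in page:
            # rename opaque custom-field keys to their human-readable names
            yield {field_map.get(key, key): value for key, value in record.items()}

pipeline = dlt.pipeline(pipeline_name="crm", destination="duckdb", dataset_name="crm_data")
print(pipeline.run(contacts(FakeCRMClient())))
```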
[00:12:29] Unknown:
Another aspect of this project is that you very unapologetically built it in Python. I'm curious what the decision process was for choosing that language and runtime and the overall environment and ecosystem that that brings along.
[00:12:45] Unknown:
Absolutely. So I think, you know, in this ecosystem, most of the data engineers are really fed up with toy tools or experimental tools or things that people are doing that never translate into production. So what we try to create is something that really works for people. And what really works right now for most data people is Python as a first language. It's just a simple choice. If we build it for people to use it simply, then this is the way.
[00:13:09] Unknown:
And one of the aspects of Python, obviously, is also the performance characteristics that have long been one of the complaints that people will levy against Python. So I'm curious how the use of Python factors in from a performance consideration, particularly given the fact that this is in the critical path of managing data loads, which can be quite high volume, which can have some intense performance characteristics, and some of the ways that you are working to mitigate some of those performance considerations.
[00:13:41] Unknown:
Right. So Python indeed is not one of the fastest languages, but neither does data loading need to be, as an expectation in general. Right? What we're doing is not something transactional where you click a button and 0.1 seconds later something happens. It's rather, you know, we expect that you have jobs running on a schedule in the night. If you're using Airflow, you probably don't even notice if they start one minute late. So to get back to your question, it's actually quite performant for what it does. And if you do have to process a large amount of data, as I was saying, you can actually separate out the library and its components. For example, one of our users was sending the processing load to about 30 machines in parallel. He was using the extraction and putting this data somewhere in a storage buffer, then using the normalization component on a different set of machines and running them in parallel to normalize the data, and then finally using a different set of machines to do the loading. So you can reach really high performance.
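The extract, normalize, and load stages he describes can be invoked separately on a dlt pipeline, which is what makes that kind of fan-out possible. A minimal single-machine sketch of the staged calls, with names and the DuckDB destination chosen purely for illustration:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="staged_load",
    destination="duckdb",   # illustrative; a warehouse destination is more typical
    dataset_name="raw",
)

data = [{"id": i, "value": i * i} for i in range(1000)]  # stand-in for extracted records

pipeline.extract(data, table_name="numbers")  # 1. pull data into local load packages
pipeline.normalize()                          # 2. infer/evolve the schema, shape the packages
print(pipeline.load())                        # 3. push the packages to the destination

# in the multi-machine setup described above, each stage would run on its own
# workers against shared storage, with worker counts tuned via configuration
# (see the dlt performance docs for the exact settings)
```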
This is currently running, for example, in a company with 1,000 people, ingesting all kinds of typical data needs. And from when you first started
[00:14:53] Unknown:
down the path of building this tool and you had certain ideas about how it was going to work, I'm wondering what are some of the aspects of the design and goals that have changed in that process.
[00:15:05] Unknown:
Right. So, you know, first, I was trying to solve my own problem, which was the large amount of time spent playing with JSON coming out of APIs to figure out what it is so I can type it and put it in a database. And this process is kind of nasty, because due to the difficulty of the technical process, you might want to restrict the scope. So you're gonna ask the stakeholder, hey, what do you really want? Do I really need to type all this data? So this now brings in a lot of complexities that I wanted to get off my table. So the first version was, you can imagine it as a simple engine. And when we brought this to other Python users, they were like, What is incremental loading? How does this work? It's too hard. I don't wanna use it. So we realized, okay, this is not the tool that's gonna win the hearts and minds of people. So what we did was we doubled down on the user feedback that we got and came up with a simple declarative interface. We took a few months to develop that, and we created a workshop with 60-plus people. Half of them were aspiring data professionals and the other half senior.
We tested this interface, and they were all able to build an incremental data pipeline in 6 hours. So for the next 6 months, we thought, okay, this is good. So what we're gonna do is just bring it to the market. So we spent a lot of time doing documentation, adding a docs assistant to help people with questions. We did lots of ecosystem integration. So you have, for example, command line, Airflow deployment, or GitHub Actions deployment if you wanna deploy it there. Yeah. So this is what we've been working on, and now we have a tool that's actually usable by juniors to do really cool things.
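That declarative interface for incremental loading looks roughly like the sketch below: the high-water-mark column is declared as a default argument and dlt keeps track of it between runs. The API URL and field names here are hypothetical:

```python
import dlt
import requests

@dlt.resource(primary_key="id", write_disposition="merge")
def tickets(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2023-01-01T00:00:00Z")
):
    # 'updated_at.last_value' holds the high-water mark from the previous run,
    # so only newer records are requested; the endpoint is purely illustrative
    params = {"updated_since": updated_at.last_value}
    response = requests.get("https://api.example.com/tickets", params=params)
    response.raise_for_status()
    yield response.json()

pipeline = dlt.pipeline(pipeline_name="helpdesk", destination="duckdb", dataset_name="support")
print(pipeline.run(tickets()))
```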
[00:16:46] Unknown:
And so in terms of actually using DLT and incorporating that into different data flows and workflows, I'm wondering, when people first come across DLT, what are the problems that they're trying to solve when they actually reach for it, and then some of the ways that they're able to go from solving that initial problem to scaling it to managing all of their data loading?
[00:17:06] Unknown:
So it really depends who the persona is that first encounters this. In my experience, you maybe have seen this meme with a distribution with the idiot at the low end, the Jedi master at the high end, and normal people in the middle. And my experience is that the people who are really at the start of their data journey, they're quite open minded and they will try things. And they will basically play and think, how can I use this? It's similarly true for the people who are at the end of the data journey, and they're a staff engineer, and they need to worry about what their team should use. But the rest of the people usually are quite stressed out, and they just have problems they need to solve. So usually, they go for the verified pipelines because we offer some pipelines we've already created.
And they try to, you know, figure out if this is something for them. This is a moment, you know, when they might look into the code, might look into the Slack. And finally, if they get that this is a pipeline-building tool before it's a pipeline, they will typically try it. And because the process of trying is really simple, we have Colabs. The first thing that you can try is even just a one-liner. Right? And this enables people to understand a little bit about how easy this makes data loading. So big shout out to DuckDB, actually, for enabling us to create the modern data stack in a notebook.
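That notebook-sized "modern data stack" amounts to a few lines of Python; a minimal sketch, with the sample data and names standing in for whatever you would pull from a real API:

```python
# pip install "dlt[duckdb]"
import dlt

# any iterable of dicts works; this stands in for a real API response
data = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

pipeline = dlt.pipeline(pipeline_name="notebook_demo", destination="duckdb", dataset_name="demo")
print(pipeline.run(data, table_name="people"))
```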
Since, you know, we can just demonstrate DLT in a Colab notebook, load to DuckDB, and this shows you the power a little bit. So a typical path would be that somebody comes, looks at the sources, considers them, and if they try to change something, then they will see that this is a tool that is really good for maintaining your code and keeping your code in
[00:18:53] Unknown:
it. One of the interesting aspects of DLT is that because it is a library, it is easy to bring in for a very lightweight experimental workflow. One of the capabilities that systems such as Fivetran and Matillion and Airbyte are focused on is the scale-out: I need to be able to manage lots of different pipelines continuously. I'm curious how you are addressing both ends of that spectrum and some of the operational and deployment aspects of DLT when you do move beyond that initial experimental pipeline to, I need this to be the backbone of my critical data infrastructure?
[00:19:30] Unknown:
So that's actually part of the beauty of DLT: an experimental pipeline already brings to the table scalability, schema evolution, alerts, and, you know, all the basic things you need. So the process of transitioning from experimental to production is just a matter of deployment, really, if you're happy with your extraction code as well. So to deploy a DLT pipeline, depending on what you're deploying on, it might be as simple as a command line. Right? For example, you can do dlt deploy, script name, airflow, and it will generate an Airflow DAG file that, in case you have an extraction DAG that has some, let's say, dependent logic, will run the tasks in the right order. It will create tasks in Airflow so you actually understand what's going on. Yeah. So it's actually quite a simple process.
I would say sometimes the experimental thing just stays as production. Right? So to give you an idea, this would be simple webhooks or webhook integrations, you know, because this is the first time, kind of, that you can have a loading library run in a cloud function. So, for example, you can just simply set up in 5 minutes a webhook on GCP, on the lowest requirements, drop in a piece of DLT code to accept whatever events are coming. And then you could even decide if you're putting this into some kind of bad-events table or just in your main table, based on the schema. Sorry, I realized I didn't answer your question about Matillion and Fivetran.
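A webhook receiver along those lines could be sketched as a GCP Cloud Function like the following; the function name, table names, and BigQuery destination are assumptions for illustration, and destination credentials would come from the function's environment or dlt configuration:

```python
# main.py for a Python Cloud Function acting as a webhook receiver
import dlt
import functions_framework

@functions_framework.http
def receive_event(request):
    event = request.get_json(silent=True)
    if not event:
        return "no payload", 400

    pipeline = dlt.pipeline(
        pipeline_name="webhook_events",
        destination="bigquery",        # assumed destination, configured via env/secrets
        dataset_name="raw_events",
    )
    # dlt infers the schema from whatever the event looks like and evolves it over time
    pipeline.run([event], table_name="events", write_disposition="append")
    return "ok", 200
```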
So the way we think about them is, Fivetran, at least, I'm not very familiar with Matillion, but Fivetran spends lots of resources on creating really elegant, really nice pipelines with really good schemas on the destination. So I would not challenge what they're doing right now. I think they're doing a good job. The data that comes out of Fivetran is quite usable. If you can afford the price tag, it's not particularly a problem, but it's not for everybody. It's not for small startups that are trying to save a buck. Yeah. So the way that we think about Fivetran right now is, how to put it best, there will always be a segment of pipelines that are, let's say, short tail, that people will typically use in any company, and there will always be a part of pipelines that you cannot find on Fivetran or other such tools.
So a typical usage pattern is that people would use something like Fivetran and custom code, or that they would first use Fivetran and then migrate to custom code. We are that custom code.
[00:22:03] Unknown:
And to that point of custom code, the open source nature of the project, the focus on it being more of a library than a full-on end-to-end product: I'm curious how you have thought about the experience of extending the project, adding new sources, adding new destinations, and making that developer experience of customizing it to their particular workloads as seamless as possible?
[00:22:32] Unknown:
So we're still working on those aspects. And I think there are many ways to make mistakes here. Depending on how much support you offer, you might get bogged down in destinations. The best way is to get in our Slack and ask, get in touch with our senior engineers that have done it before. You can also, you know, do it on your own, but it's going to be harder. When it comes to sources, really, DLT is meant for you to create your own sources ad hoc. So, you know, it's literally the easiest tool for doing this out there. You can even get snippets from ChatGPT. And on our build pipeline page, actually, there are even some experimental ways of creating pipelines. So for example, we have a ChatGPT assistant that you can tell which API you want, and it will try to build a pipeline for you. It's reasonably okay, provided that there is documentation.
And another interesting way to approach it would be, you know, maybe half the pipelines nowadays are running on the OpenAPI standard, and this OpenAPI standard describes the API. So this OpenAPI standard is used to generate Swagger documentation or Python packages, like Python wrappers for the API. What we can also generate is a DLT pipeline. So we already have a demo proof of concept for this that you can use to generate a pipeline from an OpenAPI spec. And we're working on simplifying this even further.
[00:24:10] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs into your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. As far as the collaboration aspect, I imagine that most of that is managed through whatever software tools you're using. An interesting element of some of these more product-focused data integration tools is the idea that you want to bring in people who aren't part of the core data team, who aren't necessarily engineers. Wondering how you're thinking about that aspect of bringing everybody in the organization into this work of data integration and the overall workflow of adding value to the business through data, given the fact that it is a very code-first interface?
[00:25:31] Unknown:
That is a very good question, and this is something that we've considered quite deeply. You might be aware of, of course you're aware of, the data mesh concept. Right? And this is trying to push, let's say, responsibility over the data product to the rest of the company and also, let's say, empower them. But in my experience, there hasn't really been any technological advance made to enable this. Right? So it's nice that you'd like the teams to now autonomously manage data sources. But if you don't give them the tools to do it, it's just wishful thinking. So, to answer your question, the way we think about it is you have data as a product, and you have people that have roles around this product. So there's a data engineer that's responsible for transporting that data into the data warehouse.
There's an analyst that is responsible for curating this data. And there's a data producer responsible for the data source, who is responsible for maintaining its consistency and communicating changes. In my experience of over 10 years as a data professional, this doesn't happen. Right? So you need some kind of technical enforcement for it. And DLT allows you to basically detect schema changes when you're loading data and notify the producer and the consumer. And, additionally, because DLT is creating schemas for every version, keeping track of which load package had which version of which schema, you're actually able to push this information downstream into other tools if you want. So you could create a dashboard that is telling you what changed when, which columns were introduced by what pull request, for example.
Or, as one of our users is doing it, they're passing this data to SQLMesh, and SQLMesh has additional data contracts and functionality around using this metadata to allow understanding the impact that changes have on the things downstream.
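One way to act on that schema metadata is to inspect the load info that a pipeline run returns and alert on any schema updates it reports. A hedged sketch, where the attribute names follow the pattern in recent dlt documentation and should be checked against the version you run:

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="contracts_demo", destination="duckdb", dataset_name="raw")
data = [{"id": 1, "email": "a@example.com", "plan": "a column that appeared today"}]

load_info = pipeline.run(data, table_name="users", primary_key="id", write_disposition="merge")

# each load package reports the tables and columns it created or changed;
# attribute names may differ slightly across dlt versions
for package in load_info.load_packages:
    for table_name, table in package.schema_update.items():
        changed_columns = list(table["columns"].keys())
        print(f"schema change in '{table_name}': {changed_columns}")
        # this is where you could notify the data producer and consumers,
        # or push the versioned schema into downstream tooling such as SQLMesh
```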
[00:27:36] Unknown:
For the production runtime context, I imagine that you're largely delegating that to whatever orchestration system the users happen to have employed. I'm curious about some of the ways that you're thinking about integrating with those runtime contexts and the different hooks that you have available for being able to provide useful information to that runtime, whether that be Airflow or Dagster, Prefect, etcetera?
[00:28:02] Unknown:
I would say the state of the tooling around this area is moving faster than my knowledge around it. So what I mean is that my team is doing a lot of things, around this that I'm not up to date with. But the way we think about it is DLT doesn't want to replace anything. It wants to fit into your ecosystem. So right now, we're working on creating a really nice Airflow integration. So it's already quite nice, but it can be better. We would love to support the other orchestrators as well. So what we're lacking is, let's say, requirements. So, you know, if you want your orchestrator supported, then come to us and tell us what that would look like and what's important to you as a feature.
[00:28:42] Unknown:
And you spoke to this a little bit with your business model around DLT Hub. But because of the fact that it is an open source project and it's also being largely driven, at least for the moment, by you and your team, what are your overall philosophies and your current approach to the governance and future sustainability of the project?
[00:29:02] Unknown:
It's a very good question. So, in my experience, we've done quite a bit of analysis of open source projects to try to understand what works and what doesn't. And what doesn't work is just relying on the goodwill of people. What works is when you're doing 99% of the work and you have the community to guide you. So what we are intending to do is basically build a community product that will encourage the community to use DLT more. So the way we see it, it would be DLT Hub. There's likely some way for users, let's say, to help each other on there and share content. And the way that we think about it is that this would allow us to monetize around various areas. Right? So for example, some people just have trouble working with Airflow or don't like to use some orchestrator. They might just want to use some simple web-based tool that doesn't require any effort from their side. They might want to easily access community resources in that tool, or community maintenance or upgrades or whatever. So the way that we think about it is that this is a symbiotic relationship between the two. This enables us to basically maintain the DLT project with the revenue coming from the other stream. And, at the same time, because they're connected at the hip, it also doesn't allow us to say, okay, we'll forget about DLT. Right? Because DLT has to become a world standard before we can be successful with the platform.
[00:30:29] Unknown:
Another interesting aspect of the evolution of open source projects, particularly when you're trying to be a de facto standard for a given space, is the question of moving beyond the boundaries of a particular language community. I'm curious if you have thought at all about some of the potential for future progression of DLT as a specification or as an interface and being able to have language bindings with other runtimes, particularly if somebody wants something in the Java ecosystem or something performance-intensive focused on something like Rust or Golang? So that's a good question. We haven't
[00:31:07] Unknown:
dug into the space, and we haven't really considered any plans in that direction. And the explanation is simple. For us, we need to retain focus. And when it comes to, let's say, support for data tooling, there are data users and there are lots of engineers. And we think that we're serving the vast majority of data users with this. I think the people that are doing some kind of Java stuff are rather MLOps, DataOps, or data engineers in corporations. They have different problems.
[00:31:41] Unknown:
And in your work of building DLT, getting it in front of people, trying to help them with adoption and usage, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:31:53] Unknown:
I have quite the story about that, actually. And, you know, this is my first open source company that I work for, so I wasn't aware of some of the interactions that can happen in open source. It completely blew my mind what the community can do. To give you a very specific example, we have one early adopter that built their entire platform with DLT and actually gave us so many product ideas and so much contribution that he's probably helped us further our product in some areas by a year. Right? And to give you the exact case, this was a quite senior engineer that was looking to build a data stack for a 1,000-person company.
And this person first considered Meltano. But Meltano didn't meet their requirements because it was too hard to use for their team and required too much maintenance for them. So they first replaced the orchestrator but still used Singer. And finally, they stumbled into our library, and they realized that if they put the data coming out of Singer into DLT, this enables them to have schema evolution and robust pipelines without worrying about them breaking. But then he looked into the way that we do our sources and the way that you can customize this. Because it's a library, you can literally do whatever you want and then pass the data to DLT.
And this kind of customization is not possible in other SDKs or frameworks. And this fundamentally enables you to have maybe 5 times less code with custom functionality doing whatever you want. So what he did was rewrite his 15 to 30 sources with our library, and, yeah, he basically just runs the entire company on that. He has a web UI on top of the solution that enables his analytics engineers to turn specific endpoints from the integrations on or off. And he passes this data to SQLMesh, which gives
[00:34:05] Unknown:
him this exact functionality that allows you to see, okay, where did the data really come from? Where did the change come from? So the way that he gives us requirements is really pushing us forward a lot. Yeah. One of the great benefits of open source is that once you put something out into the world, there's no telling what somebody's going to do with it. So it's great to hear that you've already got some adoption and some useful feedback. And in your work of building DLT and trying to address this particular need within the community and the ecosystem, what are the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:34:41] Unknown:
One of the most challenging lessons that I've had to learn was around code sharing. So this is a problem that we haven't yet solved, but it's on our list to tackle ASAP. We think this is fundamental to getting a community flywheel going. So, you know, people want to contribute and they want to share. But what we found is that people also don't produce code of great quality, especially when they don't really allocate the time for it. So there has to be some kind of balance between code quality and accepting contributions. We looked at how other projects did it, and we saw that for many of them, basically Singer, Meltano, Airbyte, you name it, when it comes to actually accepting the contribution, this is really hard.
Luckily, you know, we don't rely on that, because DLT is a pipeline-building tool first. Alright. So to sum up the unexpected challenge around accepting contributions, there are 3 tiers of quality. One is the software engineering. Another is, let's say, what a senior data engineer will do. And the other is what we would call snippets. And we realized that, you know, if we try to get data engineers to build software-engineering-complete things, it's not gonna work. If we try to accept snippets, we realize that code quality is quite bad, so we're better off letting ChatGPT offer them. So we need to find some balance that allows people to really contribute the important information. So some kind of, you know, describing the business cases and implementing them well, or things like this. So we're still working on it, and it's going to be interesting how we can do this
[00:36:15] Unknown:
scalably. Yeah. It's interesting to draw parallels again to, in particular, Singer and Airbyte, opposite ends of the spectrum: from leaving it up to the user to go and actually find all the different implementations and whether or not there is a connector for the thing that they're trying to connect to and whether it's any good or if it solves their particular use case, versus the Airbyte approach of everything goes into a monorepo, everything goes through that central organization to manage the quality of the code, which then brings in scaling challenges on the business side, because you can only have so many people reviewing so many pull requests before you get completely overwhelmed, which is something that they're currently going through.
And I'm wondering how you're thinking about some of that aspect as well of, do we allow people to contribute packages into sort of a marketplace where we have a hub along the lines of what Meltano has built for the Singer ecosystem? Do we want to have a monorepo where we are maintaining control over the entire code base? Just curious what your thoughts are along the continuum of that balance.
[00:37:30] Unknown:
So we don't really want to have a monorepo that locks us into maintenance, because it's not feasible. And another problem, actually, that Meltano has: while it's easy to go and create your own connector, it's not so easy to edit something that already exists. Right? Because your contribution has to be accepted. So you can fork. But if you really want to use the same version and have people build on the same thing, it adds complexities. So I don't really have a good answer for that. We are actually doing this project with OpenAPI, which might, you know, cover half the pipelines out there. There's ChatGPT
[00:38:06] Unknown:
coming in close. I think the user is going to be rather a director of how things should happen, but we'll see exactly what happens when we get there. And for people who are interested in what you're building at DLT and are considering whether to adopt it for their data integration use cases, what are the situations where it's the wrong choice and they're better served with one of the other offerings of the ecosystem?
[00:38:31] Unknown:
So I would say, if you want really good quality pipelines from the start, you might as well just use an existing solution and not start building your own. So what I mean is, if you want to use a SaaS ETL tool, you know, we're not trying to advocate DLT to people who aren't trying to solve a problem for which DLT works. And I would say there is another corner case where DLT is not the right choice, and that is when you need to load really high-throughput data with a stable schema. So if you have the stable schema and you know it ahead of time, you shouldn't be iterating through the document row by row to infer it. You should just declare your schema to your loader and just load your data.
[00:39:12] Unknown:
And as you continue to build and iterate on the DLT project and build up the DLT Hub business around that, what are some of the things you have planned for the near to medium term or any particular problems or projects you're excited to dig into?
[00:39:26] Unknown:
Yes. So, basically, on our road map for, let's say, this year, we want to push this OpenAPI topic further. We want to look into how we can modularize the pipelines a little bit to make extensibility and maintenance much, much easier for the user. Because fundamentally, you know, every API has authentication, pagination, and endpoints. If you can put these together, you have your pipeline. What else? Yeah, the contribution part is definitely top of the list. And I would say we will start working on the paid platform, but I don't think it will land this year.
[00:40:05] Unknown:
Are there any other aspects of the work that you're doing on DLT and DLT Hub or the overall space of data integration and extract and load that we didn't discuss yet that you would like to cover before we close out the show? I think we covered it. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. So I think the biggest gap is really education.
[00:40:37] Unknown:
To be honest, I think that in my generation, when we were learning how to do our work, unfortunately, most of the resources we had were just vendors trying to sell us stuff, and it was really terrible. And they were all telling us to create these half-baked solutions that bring a lot of pain. So what I would really love to see is more education around this. Maybe you don't need Spark to load 50 rows. Maybe existing solutions are already working.
[00:41:08] Unknown:
Well, thank you again for taking the time today to join me and share the work that you're doing at DLT. It's definitely a very interesting project and one that I'll be keeping a close eye on. So I appreciate the contributions that you and your team are making to the ecosystem, and I hope you enjoy the rest of your day. Thank you. Have a nice day as well. Pleasure being on the show.
[00:41:32] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Adrian Brudaru and DLT
Building DLT: Challenges and Solutions
Target Audience and Use Cases for DLT
Comparison with Other Data Loading Tools
Philosophical and Architectural Design of DLT
User Adoption and Scaling with DLT
Collaboration and Data Integration in Organizations
Governance and Sustainability of DLT
Future Directions and Language Bindings
Lessons Learned and Community Contributions
Balancing Contributions and Code Quality
When DLT Might Not Be the Right Choice
Future Plans and Roadmap for DLT