Summary
In this episode of the Data Engineering Podcast, Adrian Brudaru and Marcin Rudolf, co-founders of dltHub, delve into the principles guiding dlt's development, emphasizing its role as a library rather than a platform, and its integration with lakehouse architectures and AI application frameworks. The episode explores the impact of the Python ecosystem's growth on dlt, highlighting integrations with high-performance libraries and the benefits of Arrow and DuckDB. The episode concludes with a discussion on the future of dlt, including plans for a portable data lake and the importance of interoperability in data management tools.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
- Your host is Tobias Macey and today I'm interviewing Adrian Brudaru and Marcin Rudolf, cofounders at dltHub, about the growth of dlt and the numerous ways that you can use it to address the complexities of data integration
- Introduction
- How did you get involved in the area of data management?
- Can you describe what dlt is and how it has evolved since we last spoke (September 2023)?
- What are the core principles that guide your work on dlt and dlthub?
- You have taken a very opinionated stance against managed extract/load services. What are the shortcomings of those platforms, and when would you argue in their favor?
- The landscape of data movement has undergone some interesting changes over the past year. Most notably, the growth of PyAirbyte and the rapid shifts around the needs of generative AI stacks (vector stores, unstructured data processing, etc.). How has that informed your product development and positioning?
- The Python ecosystem, and in particular data-oriented Python, has also undergone substantial evolution. What are the developments in the libraries and frameworks that you have been able to benefit from?
- What are some of the notable investments that you have made in the developer experience for building dlt pipelines?
- How have the interfaces for source/destination development improved?
- You recently published a post about the idea of a portable data lake. What are the missing pieces that would make that possible, and what are the developments/technologies that put that idea within reach?
- What is your strategy for building a sustainable product on top of dlt?
- How does that strategy help to form a "virtuous cycle" of improving the open source foundation?
- What are the most interesting, innovative, or unexpected ways that you have seen dlt used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?
- When is dlt the wrong choice?
- What do you have planned for the future of dlt/dlthub?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- dlt
- PyArrow
- Polars
- Ibis
- DuckDB
- dlt Data Contracts
- RAG == Retrieval Augmented Generation
- PyAirbyte
- OpenAI o1 Model
- LanceDB
- QDrant Embedded
- Airflow
- GitHub Actions
- Arrow DataFusion
- Apache Arrow
- PyIceberg
- Delta-RS
- SCD2 == Slowly Changing Dimensions
- SQLAlchemy
- SQLGlot
- FSSpec
- Pydantic
- Spacy
- Entity Recognition
- Parquet File Format
- Python Decorator
- REST API Toolkit
- OpenAPI Connector Generator
- ConnectorX
- Python no-GIL
- Delta Lake
- SQLMesh
- Hamilton
- Tabular
- PostHog
- AsyncIO
- Cursor.AI
- Data Mesh
- FastAPI
- LangChain
- GraphRAG
- Property Graph
- Python uv
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Imagine catching data issues before they snowball into bigger problems. That's what DataFold's new monitors do. With automatic monitoring for cross database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time right at the source. Whether it's maintaining data integrity or preventing costly mistakes, DataFold monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production?
Learn more at dataengineeringpodcast.com/datafold today. Your host is Tobias Macey, and today I'm interviewing Adrian Brudaru and Marcin Rudolf, cofounders at dltHub, about the growth of DLT and the numerous ways that you can use it to address the complexities of data integration. So, Adrian, can you start by introducing yourself?
[00:01:04] Adrian Brudaru:
Sure. So I'm a data professional. I got into the data field 12 years ago. I built data platforms for start ups and enterprises. And in my last 5 years, I would say I was doing lots of building higher projects, consulting, and data engineering. And I guess, you know, the reason why I'm here is because I saw a need for people like us, for data engineers. Namely, we didn't have dev tools,
[00:01:28] Marcin Rudolf:
and, I decided to do something about it. And, Marcin, how about yourself? Yeah. So, my background is actually software engineering, and I've been doing this for a really long time, for 30 years. In that time, I did a lot of different things, from telco software, some, you know, data factories in the early 2000s, search engines, blockchain, and now I'm doing this. And I think it's the best gig in my life. Yes. So we do a lot of open source, a lot of coding, which I really love. And we are also helping people to build and automate stuff. So, you know, you apply your engineering. This is extremely fulfilling, and I really like it. And going back to you, Adrian, for people who haven't listened to your previous appearance on the show, if you can just
[00:02:10] Adrian Brudaru:
bring us back to how you first got started working in data. Yeah. So I don't know. 12 years ago, I started as an analyst. I did 5 years of startups. I quickly started building things end to end. And the thing about startups is you don't really have a budget, so I was doing a lot of hands-on building, really. After about 5 years, I basically switched to consulting because it gave me the chance to actually work more, as strange as that sounds. So it's more about the work, less about, let's say, the social contract. And here is where, basically, I found the need for DLT, and Marcin here is helping build it. So, yeah,
[00:02:47] Marcin Rudolf:
if I can chime in, like, I was doing a lot of machine learning before. So, actually, I had a startup which was doing a search engine for mobile applications. We did a lot of machine learning, topic inference. We also built a vector database without even knowing it. So, yes, I did a lot of data before, but maybe I was not aware that it's actually data engineering at the time. It was 2009, so quite long ago.
[00:03:13] Tobias Macey:
And bringing us to the conversation today, you've both been working very hard on building the DLT framework, the DLT hub business around it. For folks who want to get a bit more into some of the core of what is DLT, how does it work, I'll refer them back to the previous episode we did back in September of 2023. And so for people who haven't listened to that yet, if you could just give a quick overview about what is DLT and then talk a bit about
[00:03:40] Adrian Brudaru:
some of the notable ways that it has evolved since we last spoke. Yeah. So, basically, DLT is the first pip-install dev tool for data engineers, by data engineers, to build pipelines fast, easy, and robust. It's just everyday boilerplate code for things like incremental loading, automatic schema inference, schema evolution. This is where we started. But in one year, our vision has expanded a lot. So with the continuous support of the community, we actually evolved into a very comprehensive Python library for moving data. We are well integrated with the modern data stack components. It works with high performance Python data libraries like PyArrow, Polars, Ibis, DuckDB, Delta Lake.
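As a rough illustration of that pip-install experience (a minimal sketch, not code from the episode; the data and names are made up):

```python
# A minimal sketch: nested JSON-ish records loaded into DuckDB with automatic schema
# inference. The dataset and table names here are illustrative.
import dlt

data = [
    {"id": 1, "name": "alice", "plan": {"tier": "pro", "seats": 3}},
    {"id": 2, "name": "bob", "plan": {"tier": "free", "seats": 1}},
]

pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",
    dataset_name="demo",
)

# dlt infers the schema (flattening the nested "plan" dict into columns) and evolves it
# on later runs if new fields show up.
load_info = pipeline.run(data, table_name="users", write_disposition="append")
print(load_info)
```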
And, yeah, it actually works very well at industrial scale even in constrained environments. We added things like data contracts, parallelism. But I guess one of the biggest things that we're seeing is an explosion in lakehouse adoption. This is something that we are seeing quite a bit of user pull from. Yeah. And as for adoption, I would say we reached 600 k monthly downloads, which is, I guess, 10 times higher than any other competitor in our space. And our users built over 10 k private sources by this time. Yeah. And they're using it for all kinds of things, including building RAGs, things like that. As you have been going along this journey,
[00:05:00] Marcin Rudolf:
building DLT, continuing to invest in and to evolve it, what are some of the core principles that help to guide your work on that project? So alright. So we actually have very clear principles for how we build the product, how we operate in this, like, open source ecosystem. Yes? So one core principle is that DLT is a library. It's not a platform. Yeah. So what does it mean? That if someone writes code, they're using a library, our library, to write this code. So you add a library to your code. You don't add your code to someone else's platform. Yes? So the outcome of this is that we are trying to fit into the existing ecosystem. We are trying to work with everyone. We are not replacing anyone. Yes? Which is what a typical platform does: it replaces some other platform. We are trying to fit in. Yes? So this is the first principle. So when you look at the other projects, yes, you always look for ways to interact, to cooperate. The other one is that we are trying to automate everything. Yes. So you should do things once. And this is, like, a principle of efficiency that applies both to us and how we think about our users. This is also an important thing. Then there is the no-black-box principle.
Everything should be customizable. You should be able to change everything you want. So we also expect autonomy from the users. Yes. So, like, by letting them hack, letting them change everything and, you know, also look into the code. So I think this is very important. And, I mean, we also go to great lengths for our users to do less work. We do more work in order for our users to do less. Yes. So it's, for me, a very important thing. So if I'm an engineer, I should have some empathy toward other engineers. And this is one of the principles. So you need to be really empathetic, and then you can write really good code and really help other people. And from that point of platforms and the impact that they have on the overall
[00:07:00] Tobias Macey:
architecture and approach that teams take to their data management, as a framework and as a library, and in some of your blog posts, you've taken a very opinionated stance against the idea of using these managed extract and load services. I'm wondering if you can talk a bit more about what you see as the shortcomings of those platforms and what are the situations where you would actually argue in favor of their use. Yeah. I would say this is a question about somebody else that you're asking here,
[00:07:33] Adrian Brudaru:
And the way I would answer it is that we're actually very focused on our own principles, and we don't worry too much about what other people do. Simply put, what we want to achieve, we haven't seen anyone be able to manage that by offering managed extract/load services. So, basically, the whole concept, I would say, is competing with the openness that we offer. And at the same time, for us to become a standard, we want to be adopted freely by other vendors as well. Right? So, you know, this means that we just cannot compete there. I would say this is a shortcoming for us. When it comes to the end user, you know, it very much depends on the persona. If you're building something large with custom requirements, you're probably better off having customizable solutions. Yeah. So we are coming from, like, a very different, let's say,
[00:08:22] Marcin Rudolf:
space. So, actually, DLT comes from the same market as all this machine learning revolution, AI revolution, and this ecosystem of Python libraries. So actually what we are looking at is to have this pip-install experience. So you can actually install a data platform and you are autonomous. You run on your own premises. It's often a local workflow or single-machine workflow. So actually we are quite orthogonal to this thing that you call managed services. We are trying to follow a very different pattern. Yes. Very different usage patterns, very different workflows that are possible for this Python ecosystem and impossible for the managed solutions. And of course, vice versa: there are certain workflows that are way easier when you have a managed solution, but we don't think we compete with this. And I think that in that context as well, the places where I would say that the managed platform
[00:09:16] Tobias Macey:
does make more sense is if you don't have a team that has the engineering acumen to build those custom options, and you just need to be able to pull data from one place to another, particularly if it's a widely used pattern where you have good support for the sources and destinations that you're working with. Absolutely. So we actually sometimes see users saying, like,
[00:09:41] Adrian Brudaru:
why do I need code in DLT to do this or that? You know, we tell them, go use Fivetran. From that perspective too, as you said, you're not looking to necessarily
[00:09:49] Tobias Macey:
replace anyone. So I'm curious what you have seen as far as the types of teams and engineers that are using DLT. What is the overlap that you've observed as far as people who are using both DLT and another option where they use that other option for those, well paved paths and they use DLT for the more custom requirements?
[00:10:12] Adrian Brudaru:
So I would say there are a couple of patterns that I see, and that is what I call first time data platform and second time data platform. So the first time data platform is something that people just build quick and dirty. They just whatever. They put it together. It works kind of like what you're talking about, the tried and true patterns. But then there comes the point when something doesn't work. Then you start looking for a solution. You find DLT, and some people stop there, and they just, you know, have a DLT pipeline running alongside other things. But many people, at some point, they reach a point where they go like, okay. But why do I have 2 solutions? I could just use one solution. Or they could get to the point where why am I paying this much for event ingestion on Fivetran or something like that or SQL copy on Fivetran. Right? And then they start migrating more. And then we see the pattern of the 2nd time build or the 2nd time data platform where, let's say, these people have already experienced what the 1st time data platform is like. And let's say reining in the entropy that is created in such places is very difficult. So they just start with engineering best practices from the start, and then, you know, DLT is a no brainer choice kind of for these situations.
[00:11:17] Tobias Macey:
Since the last time that we spoke, there has been a lot of evolution in the space of data movement, some of the most notable pieces being what you mentioned earlier about the increased growth of AI and the requirements of data movement and customization as far as how that data is moved. And from the competitive landscape, PyAirbyte has seen a lot of investment since the last time that we spoke. And I'm wondering if you can talk to some of the ways that those pressures and those evolutions in the broader ecosystem have informed the way that you think about the development and positioning of DLT? So, yes, this is a very good question. And,
[00:11:59] Marcin Rudolf:
so maybe I could give some background on where we are coming from. So, actually, DLT is coming from this revolution that is happening right now. When we started the project, our way to convince people that, actually, we are doing something interesting, something new, was telling everyone, you know, there is some revolution happening. This revolution happens in ML. Yes. So there are people using these libraries, doing these new workflows, very autonomous, local workflows, for example. Yes. And now we are sure this revolution is gonna come to the data space. At some point, people will realize, actually, you can put together a lot of Python stuff and build your own data platform, you can install it, and you can have the same kind of experience that these data scientists have. So, our feeling is we are just going with this evolution. Yes. And we are there for the whole time. Yes. So, you were asking about what changed in data. So we made the typical transition from, like, you know, this very nice JSON parser that is creating relational structures into a data movement library that is, like, integrating with all this ecosystem of other libraries.
And yeah. So that's my feeling. Yes. We are simply there. Whatever happens in the Python space that is beneficial to some part of the ecosystem, we are also benefiting. So I don't know. Like, if there is a new version of an LLM from OpenAI, like this o1 stuff recently, we also benefit. Yes. It knows DLT and it helps our users right away to write the code. So this is actually very interesting, and we love this kind of revolution happening. This is what we are betting on. Also, if I can build on
[00:13:46] Adrian Brudaru:
the LLMs part, there's a little joke I like to make. While other people add LLMs or AI to their products, we added our product to LLMs. So, basically, if you go now on the newest LLM models, you don't need any plug-ins, any RAGs. You can just ask it for a DLT pipeline. It will go online, search for documentation, and build it for you. Yeah. But another good example is how we interact with the vector databases. So, actually, you have LanceDB, for example,
[00:14:14] Marcin Rudolf:
or you have embedded Qdrant. And, you know, this is another library for us, and we are so tightly integrated. It's like, you know, one thing that you interact with together. It's very different from being a destination, you know, on some SaaS platform. It's your workflow, your one notebook, and you do it in the same way you interact with the other libraries. Pushing a little bit more on the differentiation
[00:14:40] Tobias Macey:
between DLT and some of the platform approaches, I think one of the things that those platforms offer is the state management for things like incremental loads, where you can say, okay, it's going to maintain the checkpoint information about what was the last thing that I loaded and where do I pick up from there. What is your approach for being able to manage some of that state storage and incremental
[00:15:05] Marcin Rudolf:
and kind of resuming incremental loads for people who are building with DLT? Yeah. This is a very good question, and I think we have a very smart way of doing that. So simply, for us, the state is a part of the data. So if you have any kind of destination, I mean, your destination is able to store some kind of state. Yes? So we ship the data together with the metadata to the destination, and we also load it in, let's say, an atomic way. So I think it's a very robust way of handling state. So if your data loads and all the checks are passing, also your state loads. So if everything works okay and you resume your load, you're gonna get the state that's always matching your data. So we actually use the destination to store the state. If you want to abstract and store the state somewhere else, you can. Of course, it's a library. So the state is like a kind of context that you can swap. But for the average user, it's a seamless experience. You don't need any kind of additional setup. You just load to the destination, to Postgres, to a file system, even to vector databases, and you get the state automatically stored. Yeah. And if I can,
[00:16:14] Adrian Brudaru:
you know how, for example, with Airflow, you would manage your state in Airflow, which means Airflow has to be running and functional for your pipeline to run. What we do with DLT, because we persist the state at the destination,
[00:16:28] Marcin Rudolf:
is we support serverless cases very well. Right? So if you wanna run on GitHub Actions, if you wanna run on serverless functions, anything like that, you don't need anything local to persist the state. Yeah. So our preferred way of working with incremental loads actually is to wipe out everything after every run. And we're gonna restore the pipeline very quickly from the destination, a clean state with the last state, and run it for the new, let's say, set of data. You mentioned also the growth and evolution of the Python ecosystem
[00:17:00] Tobias Macey:
and how, because you're just a Python library, you get to benefit from that as well. That ecosystem has also seen a lot of growth and investment, particularly in the data-oriented set of libraries and frameworks. So I'm thinking, in particular, a lot of the rustification with things like the DataFusion library, and PyArrow has seen a lot of growth and investment. And I'm curious if you can talk to some of the developments in that overall ecosystem of libraries, frameworks, and the Python runtime itself that you have been able to benefit from and you're most excited about. Yeah. So,
[00:17:37] Marcin Rudolf:
this is a really interesting question. So probably you need to stop me at some point because there are so many libraries that we are, like, integrated to work with that, you know, I can try to go for a long time. But we see a few trends, if I can group them. So one trend is, like, single-node or single-machine computing and portable data engines, like DuckDB or this DataFusion stuff or even LanceDB, you could call them data engines. Then we have open storage formats and associated libraries like, you know, PyIceberg or delta-rs.
And we also have this development with Arrow and PyArrow. So it's like you standardize this in-memory table format and also compute. So those are the trends that I see. And now if you go through this, we of course have DuckDB. Yes. Which we use everywhere. Like, first of all, it's a way to onboard our users, to give them this local experience that doesn't need any kind of credentials. They can try everything. They can develop programs, improving the developer experience so much that it's hard to explain how much of a change it was for us when DuckDB appeared. Yes? Then we have, of course, I mentioned Arrow. I mentioned Polars, pandas, and all this, like, way to work with the tabular data. So those are primary citizens for us. So we can load them directly. Everything that works for, you know, dictionaries or JSON works also for them, incremental loading, SCD2, merges, and so on. So it's like a native citizen. Then you have these libraries for the open table formats, and we actually use them a lot. We build Delta Lake using delta-rs, serverless.
We have some big deployments already in production. And then there are some other libraries. So maybe less mentioned, but super important. And I think they really show how much we benefit by being a part of the ecosystem, by trying to fit in, not to replace. Yes. So even if you look at something super old like SQLAlchemy, we recently added a SQLAlchemy destination and a source. And now we have, like, hundreds of databases. And, you know, we put some work to do the merges, SCD2, and incremental loading. And, you know, even obscure databases right now support that. So I could enumerate a lot. We benefit from SQLGlot, and we benefit from fsspec, just the abstraction of the file system. A lot from Pydantic, for example, to create data contracts that people actually understand from other work. There are really, you know, libraries for entity recognition that we use to build data contracts for PII data. I don't know. Like spaCy. Yes, so you can detect all the entities that are PII, or Microsoft Presidio. Yeah. I agree. It's very easy to wax poetic about all the interesting things that are going on in that space. So it's definitely great to see the amount of investment and integration that you've done. Yeah. And this is, I must say, super natural for us. Yes? So, it's not that we are doing this integration for the integrations. Like, we see a value coming from that. We see how this works together and it's like a multiplier.
Yes. We are not adding, we are multiplying. So that's one of our, let's say, core principles.
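To make the "Arrow tables as primary citizens" point concrete, here is a hedged sketch of a merge/incremental load over a PyArrow table; the column names and dates are illustrative, and the incremental cursor value ends up in the pipeline state that, as described earlier, ships to the destination together with the data:

```python
# A hedged sketch: an Arrow table yielded from a resource with a "merge" write
# disposition and an incremental cursor. The data here is hard-coded for illustration;
# a real resource would fetch from an API or a database.
import dlt
import pyarrow as pa


@dlt.resource(primary_key="id", write_disposition="merge")
def users(updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01")):
    # dlt tracks the highest "updated_at" seen so far in the pipeline state
    yield pa.table(
        {
            "id": [1, 2],
            "name": ["alice", "bob"],
            "updated_at": ["2024-06-01", "2024-06-02"],
        }
    )


pipeline = dlt.pipeline("arrow_demo", destination="duckdb", dataset_name="demo")
pipeline.run(users())
```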
[00:21:06] Tobias Macey:
On that point too of being able to customize and build all kinds of use cases around DLT. One of the things that helps most with adoption for a project like DLT is the overall developer experience, the onboarding. I'm wondering if you can talk to some of the notable investments that you've made in that user experience of building with DLT composing pipelines and some of the ways that you think about the interfaces for source and destination development.
[00:21:40] Marcin Rudolf:
Yes. So, I mean, we are by nature very close to the code. Yes. We are a library, so we interact with the code. And I think that it's in our DNA to make this development experience easy. Yes. I'm a software engineer, so I'm also trying to keep this thing good for people that want to build, that want to write code. So I mentioned this DuckDB thing. It improved onboarding a lot. Yes. Like, it shortens the time to learn stuff. You can run all the examples. You can get into things and really, very quickly build your local, very quick and low latency environment with DuckDB and use it for testing. We put a lot of attention actually into our destinations all behaving the same way. There are, like, thousands of tests that we wrote to make sure that every destination, starting from the vector databases, even to, like, files on a bucket, from the point of view of our users, they behave in the same way. They're gonna have the same schemas. If DBT is supported for this destination, it's gonna be the same, you know, set of transformations, so you can develop locally, and then run the same code on CI, and then deploy it. Yes. And this is really good for data quality. Yes. Because you can test it. There are many other mechanisms like that. Okay. We have a pretty strong software engineering approach, as I said. So we really pay a lot of attention to follow this Python intuition. It's to not really invent new stuff, but to use actually existing building blocks. So for example, all our sources are generators. Yes? Like Python generators.
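A tiny sketch of that "sources are generators" idea; the generator below stands in for real paginated API calls:

```python
# Any plain Python generator can be handed to a pipeline; no special object model.
import dlt


def events():
    # stand-in for paginated API calls
    for page in range(2):
        yield [{"id": page * 10 + i, "type": "push"} for i in range(10)]


pipeline = dlt.pipeline("generator_demo", destination="duckdb", dataset_name="raw")
pipeline.run(events(), table_name="events")
```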
And you don't need to create some obscure object model. People know them. Yes. They know what they are, and they can use them right away. The same with destinations: you can now build your own destinations that are like sinks that consume data. Yes. So you use this something called a Python decorator. You just decorate the function, and you have a reverse ETL thingy that, you know, took you maybe one hour to produce. So now we are already coming to the second part of your question, our investments into sources and destinations. We actually built a very interesting thing, like really good support for creating REST API pipelines. We have, like, support for the imperative mode when you write code, with pagination, authentication, and so on. So it's like a low-level toolkit. But there is also a higher-level toolkit, like a declarative mode. And it's combined with a code generation tool, which is converting OpenAPI definitions directly into pipelines and datasets. But, actually, you declare what you want. You build your tree of what we call resources.
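As a hedged sketch of that declarative mode (the config shape follows the REST API toolkit as I understand it; the API and resource names are just the public PokeAPI used for illustration):

```python
# The whole source is declared as a Python dictionary of resources.
import dlt
from dlt.sources.rest_api import rest_api_source

source = rest_api_source(
    {
        "client": {"base_url": "https://pokeapi.co/api/v2/"},
        "resources": [
            "berry",  # simplest form: just the endpoint name
            {
                "name": "pokemon",  # explicit endpoint settings
                "endpoint": {"path": "pokemon", "params": {"limit": 100}},
            },
        ],
    }
)

pipeline = dlt.pipeline("rest_api_demo", destination="duckdb", dataset_name="pokeapi")
pipeline.run(source)
```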
And very quickly, you can define your pipeline and just run it. Yes? So we know that people love it. We are getting a lot of feedback. We see hundreds of those in production. Yes? So actually, that was, like, our big achievement. We did similar things for, like, database and file system sync. We standardized this. We let people declare tables, databases, combine them, and very quickly,
[00:24:49] Adrian Brudaru:
create pipelines that now sync databases. I think this is the most popular source, the database one, right now. Yeah. I'll say REST API is the second one, and I want to put a particular emphasis on this one because it was built with lots of, let's say, community pull. So the REST API declarative source is just a Python dictionary, and it was originally, you know, somebody donated some code to us, their take on this. So we used it as inspiration, then we had more people from the community basically asking that we build something like this. And, eventually, we actually built it with community members. Right? They did, like, half the work, I guess. And it's literally our 2nd most used source, so it's a big success. Yeah. On that point of developer experience,
[00:25:36] Tobias Macey:
onboarding, speed of experimentation, one of the ways that, in my experience, has always been most effective to encourage that is having sane and useful defaults. And I know that a lot of the investment in this overall space of building libraries of sources and destinations for data movement is to have some standardized protocol for the data interchange, whether that's going back to the Unix shell with the pipe operator and just being able to operate on arbitrary strings, or the Singer specification, the Airbyte specification. I'm wondering how you've approached that aspect of this space and how you think about the data interchange protocol between the source and destination, particularly given that you're trying to move large volumes of data so you don't want to have to spend a lot of time on serialization and deserialization?
[00:26:28] Marcin Rudolf:
Alright. It's a very good question. And, actually, our users interact mostly with the code. Yes? The internals are also available. So we have this hacking principle that you always can get into the internals and use the lower-level stuff. But maybe I'll start with the defaults. Yes? Sane defaults, as you mentioned. So we actually put a lot of attention into letting people use some really minimalistic code without setting any kind of options. We spent a lot of time to figure out what are the best settings for different things, and we are trying to set this up in a way that works. Sometimes it's a little conservative, but we are, like, making sure that you can very quickly start and then improve your thing. Yes. So you can start almost without learning. Yes. When you need something special, then you need to learn. Yes. So that's our approach. We don't want you to learn some object model before you even use it. So that's one thing. Second thing, actually, we do not have any kind of protocol. We are using these open formats everywhere we can. Right now, our primary citizen is Parquet.
So when we extract, and that's another interesting story, a year ago people really had serious doubts. Can you really do any kind of bigger loads with Python? And we realized, yes. But you need to use Arrow. You need to use ConnectorX. You need to use Polars. And then we did it. And now, you know, our interchange protocol is a Parquet file. And, internally, we form simply, like, you know, repositories of Parquet files with a manifest, which is a schema that can be a YAML file, can be a Pydantic model, whatever people prefer. And this is the exchange between different stages, among different stages of DLT. Yes. So the last stage is a load stage. It's a destination, and it looks into this. It looks at what is the schema and loads the Parquet files into the destination. Yes. So ideally, we are not doing any conversion. If you are working with the Arrow tables, you just save it once and then pass it to the load stage and it's not deserialized.
[00:28:39] Adrian Brudaru:
It's deserialized by the engine at the end. So I would also try to answer this in a way that data engineers would relate to more easily, maybe. And that is, you typically would have a source-to-destination protocol because the source is passing data and metadata. The way DLT works is a little different. You basically have a component in the middle that is doing this metadata inference for you. So what this means is the source is only emitting data. You don't need to worry about metadata. If metadata is available, we can capture that, but it's not necessary. So what this means is you're just yielding JSON or data frames or, you know, things like that. Yeah. The fact that you're able to use some of those native constructs,
[00:29:20] Tobias Macey:
PyArrow in particular, I can see as being immensely valuable from a performance perspective because if you're using Arrow for the source and the destination,
[00:29:30] Marcin Rudolf:
then they can just operate on the same block of memory. You don't have to deal with that save and load step in the middle. Yeah. That's true. I think this is part of your question about these high-performance libraries. This is one of the things that we see. I also see things getting standardized. Maybe it's like a de facto standard, but, you know, this table format, which is Arrow, is a huge benefit for the ecosystem. Another interesting aspect from that performance
[00:29:54] Tobias Macey:
perspective is the ability to parallelize, and in particular with Python 3.13 having the no-GIL option and being able to do free threading. I'm wondering what you've seen as far as experimenting with that and some of the ways that that impacts the ways that you think about building and deploying these pipelines.
[00:30:15] Marcin Rudolf:
Yes. So, actually, we were experimenting a lot, maybe not with this no-GIL option. But, you know, most of the Rust-based libraries, they release the GIL immediately. So, we were recently building Delta Lake on delta-rs, and we were, like, checking if this is really parallel processing. Even if you have one process in Python and you have many threads, of course, they're gonna get serialized because there is the GIL. But delta-rs is nice and it's releasing the GIL. I think it's giving you a little bit of this new Python experience, like Python 3.13. And it really works in parallel. Yes. We checked that, and you can actually write to many tables at once, do merges to many tables at once, which is using DataFusion. So it's a lot of processing as well, like CPU processing, and we see that it works.
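A rough sketch of that kind of experiment (not the team's actual benchmark): because delta-rs releases the GIL while it works, plain Python threads can write several Delta tables at once; the paths and data below are placeholders:

```python
# Threads writing two Delta tables concurrently; most of the time is spent in Rust
# (delta-rs / DataFusion), outside the GIL, so the writes genuinely overlap.
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
from deltalake import write_deltalake

tables = {
    "orders": pa.table({"id": [1, 2], "amount": [10.0, 20.0]}),
    "customers": pa.table({"id": [1, 2], "name": ["a", "b"]}),
}


def write_one(name: str, data: pa.Table) -> None:
    write_deltalake(f"/tmp/lake/{name}", data, mode="append")


with ThreadPoolExecutor(max_workers=2) as pool:
    for name, data in tables.items():
        pool.submit(write_one, name, data)
```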
So, really, we think we're gonna benefit more from these high-performance libraries than from the changes in Python itself. Python is getting more and more a glue code. Very nice abstraction for the user as well. It's like an interface to the deeper things, and our task is to hide it and just expose what people can really understand and,
[00:31:23] Tobias Macey:
you know, and and interact. As I was preparing for this conversation and reading through your various blog posts, one of the things that captured my attention the most is this idea of a portable data lake and the impact that it has on going from local development to production, the challenges that exist on that journey. That's something that has been true for years, probably decades at this point. I'm wondering if you can summarize the ideas in that post and maybe talk to what are the missing pieces that would make that fully portable data lake something that can be properly realized.
[00:32:01] Marcin Rudolf:
So I could start with, like, the more technical perspective on this. What had to come together in order to enable this? I mean, we talked about this already. Yes. So let's start with the high-performance libraries. Yes. So you need actually to somehow create, maintain, vacuum these data lakes. Yes. For that, these libraries are there to do it. Then a lot of stuff is standardized. I mean, this is what I mentioned, it's probably de facto standards, not really official standards, starting from, like, a table format in memory, which is Arrow. And we also have, finally, we have a working Iceberg. Yes. We have Delta. So those things are standardized, and you can interact with this. Another thing that happened, you have these portable query engines. Yes. So, you know, delta-rs is DataFusion.
And now you have DuckDB. With DuckDB, you can connect to any kind of store via so-called scanners and, you know, read data from Delta, data from Iceberg, data from Postgres, from bucket files. Yes? So you can move this engine close to your data. This is a big benefit. Yes? Without this benefit, there could be little reason probably to build these lakes. Yeah. And, you know, you also have these transformation engines that are trying to replace DBT, maybe like SQLMesh, or transformation engines that are working on data frames, like Hamilton, for example. They are also, like, you know, making this experience with the portable data lakes way better and also way cheaper. Yes. Because you don't need to transform on Snowflake. Yes. You can now transform with data frames, or you can transform with DuckDB plus dbt if you want the old style. I think this has to come together in order to create enough value for people to adopt this stuff. And I would say there is also a community aspect that is important, and that is demand.
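Before moving on to adoption, here is a hedged illustration of that "bring the engine to the data" idea, assuming DuckDB's delta extension and its delta_scan and read_parquet scanners; the bucket paths are placeholders:

```python
# DuckDB querying a Delta table and raw Parquet files in place via scanner functions.
import duckdb

con = duckdb.connect()
con.sql("INSTALL delta; LOAD delta;")

# query a Delta table where it lives (credentials/httpfs setup omitted)
print(con.sql("SELECT count(*) FROM delta_scan('s3://my-bucket/lake/orders')"))

# plain Parquet files on a bucket work the same way
print(con.sql("SELECT count(*) FROM read_parquet('s3://my-bucket/raw/*.parquet')"))
```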
[00:33:55] Adrian Brudaru:
So something that I think many of us have noticed is there's been a lot of talk about Iceberg, a lot of talk about Delta, but limited adoption. So, of course, it's more in some areas than others. But, for example, if you look at Iceberg, it really exploded this year. So, particularly, I think in January, just before the acquisition of Tabular by Databricks, Iceberg was about 2 times the search volume compared to Delta. Right? So something is happening. And now with the recent acquisition, that's even more publicity. I would say, by now, Iceberg is a bit of a trigger word for many. You say Iceberg, everyone's gonna tell you all the new things they're working on, kind of. Yeah. I think that the growth of Iceberg
[00:34:40] Tobias Macey:
has definitely accelerated a lot, which is great to see. It's definitely excellent that there has been a lot of investment as far as the tooling to be able to integrate with that effectively. So DuckDB, as you mentioned, the fact that there is PyIceberg to be able to directly interact with the tables without having to have a query engine in the middle and then also just all of the different query engines integrating with that. So Trino, even Snowflake has invested in Iceberg support for being able to either query across Iceberg tables or use that as a native ingest path. And then in terms of your experience of building DLT, the fact that it's open source is great. That obviously helps with community adoption. But at the end of the day, you also have to have some path to sustainability.
I'm wondering as you have continued to build and invest in the tooling and grow the community, how have you worked on formulating the strategy for being able to build a sustainable product on top of that foundation?
[00:35:38] Adrian Brudaru:
So in essence, you're asking where the money is coming from. Right? I would say, you know, for the last 6 months, we've been quite successful at doing support. So we've had several types, let's say, of support that we see Fortune 500 companies ask us for, from, let's say, consulting to classic support. Second, we have a very successful OSS motion. And right now, as we were talking about the portable data lake earlier, what this is, it's basically a dev environment for people who just want to go from local to production and easily develop these data platforms. And right now, we're in design partnerships with multiple customers, and we're building this together. And we can see in more detail that there is this movement in the market towards open compute, and we think this is something that will be ready very soon. Yes. We learn a lot from what our customers are building as well. Yes. So,
[00:36:31] Marcin Rudolf:
you can actually build a lot of things on top of DLT as a library. Yes. So this is where we go product-wise: observing and building, reproducing
[00:36:42] Tobias Macey:
certain solutions. In your experience of building the project, working with end users, and growing that community, what are some of the most interesting or innovative or unexpected ways that you've seen DLT used? Yeah. So, actually, this is a really good question. Yes.
[00:36:58] Marcin Rudolf:
Our users are extremely smart. Yes. They are extremely smart. The people that are using DLT are typically builders. They build their own data platforms, and we somehow got used to the fact that our users are ahead of us, like, in the ways that they think about DLT, how they can apply it. Yes. I can give you a few examples. Yes. So, our earliest big production deployment was at Harness. This is like a CI company. Yes. And a pretty big one. And, actually, that was more than a year ago. It was a year and a half ago. DLT was already used to create a data platform that got integrated. Like, Harness has its own object model on top of which it can be utilized. So it got integrated into UIs. Yes. And, automatically, certain, you know, user interfaces were generated.
And people that needed data, they could just interact with this interface, and DLT would produce the dataset on demand. Yes. So it's like a data democracy movement. So then we realized, yes, this is really new, and this is, you know, where DLT can go. Yes. Then we also have, like, a team of data scientists. A team of data scientists that use DLT plus spaCy plus pandas plus Monday.com, which is like a list. You can maintain lists of things in there to be the whole CMS, yes, with the machine learning component that is serving millions of users. They are not engineers. They are very smart data scientists. And, like, stitching the stuff together, they built a true data platform that does the work. Yes. So that was really amazing to see that, you know, the people that are not engineers can also, you know, build this kind of stuff. User-facing products. Yes. Absolutely impossible without Python and these libraries and so on. Then recently, we built our first big Delta Lake with PostHog.
And, actually, this is a really interesting case where the Delta Lake is not done for internal use. It's actually a way to interact with the customers. So what you can learn from it, okay. Typically, you would build a REST API and some OpenAPI interface, and people would build REST clients to take data. Now you can just interact with the data. You can interact with the lake. There are schemas. There are tables. You can use whatever engine you want and bring the engine close to your data. It's extremely effective. So I think there is really something new, and we want to learn from it. I could continue, really. Like, we have users that were building asynchronous destinations and asynchronous sources before us. So, like, taking data from Postgres extremely quickly via asyncio.
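A minimal sketch of what such an asynchronous source can look like, assuming dlt's support for async generator resources; the sleep stands in for an async Postgres or API fetch:

```python
import asyncio

import dlt


@dlt.resource(table_name="rows")
async def rows():
    for i in range(3):
        await asyncio.sleep(0)  # pretend we awaited a query here
        yield {"id": i}


pipeline = dlt.pipeline("async_demo", destination="duckdb", dataset_name="demo")
pipeline.run(rows())
```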
So this is really a lot. And the most recent development is people are using Cursor AI to automate code generation, and they are, like, feeding it the documentation, creating some set of rules, and, you know, the pipelines get generated for them. So, I mean, we expected that, but users did that way before we could even, you know, make a first proof of concept, and it's already in production. So this is amazing. Yes. It's like having people that are using your stuff. For me as an engineer, it's extremely fulfilling. But, you know, using it in a way that I didn't expect, this is the best part. Yes. And this is happening every day. You mentioned being able to build a data platform,
[00:40:11] Tobias Macey:
and that's another topic area that you focused on in your blog posts and your messaging and positioning is the idea of DLT being a core tool in the toolbox of data platform engineers. And I'm wondering if you can talk to some of the ways that you think about what does a data platform engineer do, how does DLT help them, and what are some of the ways that they expose
[00:40:36] Adrian Brudaru:
those platform capabilities to the consumers of the platform to be able to build on top of that and the role that DLT plays in that relationship? So I would say data platform is something that gets thrown around a lot, but, really, it's just a technical representation of a data stack, be it the data warehouse or a data lake or whatever you're doing. Basically, where this data platform engineer role is new is that what they're doing is they're enabling other data people to do their job better. So, basically, they put together systems that essentially create ways for data developers to create pipelines faster, easier, cleaner, less, let's say, boilerplate code, all kinds of things, right, from governance to engineering.
[00:41:18] Marcin Rudolf:
Yeah. This is what we basically observe all the time. I mentioned these, like, unexpected things that happen. And the pattern that we see is that people are building, we call it, like, data platforms in a box, that are like a way to bundle certain entities that you can create with DLT. So, you know, sources, destinations, datasets, schemas, contracts, altogether create a package, yes, which can be, you know, developed locally on DuckDB, but deployed on CI in some, you know, test environment and then deployed by the infra people. They can hand over the same package to the infra people, and they can deploy exactly the same thing to the production environment. Yeah. So this is what we see people are building. So, like, portable data platforms. Yes. As Adrian mentioned, that have the whole stack inside. But this is also the way people expose this to the data users. This is very interesting. Of course, this is not in every company, but in the companies that have a strong data science team, they typically bundle this kind of platform and expose just certain things as a Python interface. And those people can interactively, for example, work with the data frames, but the data frames are coming from the data lake, for example. Yes. And then you can train your model directly on the data that is stored somewhere without even using SQL or doing this kind of stuff. And it's the same code. It's the same setup. It's like a portable platform, as we call it. Yeah. And I would say deeper than the technical aspect is also governance. Right? Because when you have a uniform way of ingesting things, you have standardization.
[00:42:51] Adrian Brudaru:
You have schemas up front. And, basically, I don't know, you're probably familiar with the concept of data mesh. What data mesh is advocating is, basically, that domains are more self-sufficient. So domain knowledge is basically captured and fed into metadata for pipelines that allows, basically, the organization to understand what this data is. So the way you can think about DLT with its schema, with its data contracts, it's quite close to that, and we're working on semantic capabilities that will basically allow these data contracts to do way more advanced things such as data meshing
[00:43:25] Tobias Macey:
or, let's say, PII data contracts. Yeah. Definitely very excited to see the continued evolution in that regard as well. So I'll be keeping a close eye on your activities there. In your experience of building this tool, building a business, building a community, what are some of the most interesting or unexpected or challenging lessons that you've each learned in that process? I can start with a challenging lesson, and this one is a little painful.
[00:43:50] Adrian Brudaru:
So we were working on pipeline generation because, you know, LLMs, it seemed to all make sense. And this is how we came upon the OpenAPI idea. So the challenge there is that you have a number of, let's say, pieces of information that you need in order to generate the pipeline. And if you are using an LLM to guess them, then the error rate will compound. So, essentially, by the time you have a finished pipeline, it's probably not correct. So we realized the technology is not there for this approach, so we went down an algorithmic path. So we figured, hey, we have almost everything in OpenAPI. We can infer the rest, and what cannot be inferred can be manually tweaked by the user. So we created this generator that basically scans an OpenAPI spec and creates a REST API source from it, but nobody cared. So we were expecting, you know, that maybe people that are working with FastAPI that basically use this standard, or maybe people from the community, but literally,
[00:44:51] Marcin Rudolf:
we couldn't find people who care. Yeah. And that's also a little bit unexpected. Yes. From the technical point of view, it's like an amazing thing. Yes. You have information. You can create some kind of thing automatically. But actually, people prefer it's so easy without that, so they probably don't want to learn another tool. It's easier for them to just write a simple Python dictionary. There is one very interesting thing already mentioned. It was when we were doing this production Delta Lake with PostHog. We realized, we had this for a long time, this core idea of our product, that, you know, we can convert any REST API into a dataset. Yes. Because people that interact with REST APIs, if they are consumers of the data, they don't want to make HTTP calls to some endpoint. They want to interact with datasets. And now we see something really new, like the user interface is not even gonna be the API. It's gonna be some kind of data catalog, yes, with a very different schema, that is supposed to be auto-generated because it's a lot of work to generate the catalog. And this is how we're gonna interact, how the companies are gonna interact with the other companies or with the customer. This is super new. Yes. We are thinking how to make it easier, how to automate it. Yes. We think it's gonna be a part of this lake revolution that's happening. One thing that I was just realizing that we didn't touch on explicitly
[00:46:18] Tobias Macey:
is for a lot of this conversation, we've been leaning more towards the idea of structured data sources and destinations. We mentioned the ability to integrate with these AI stacks. I'm wondering what are some of the ways that you think about being able to easily address unstructured data sources and be able to either consume that as is or turn that into a structured representation. And some of the experiments that you've done internally and some of the ways you've seen teams addressing that challenge in their own work that are using DLT for that? Yes. This is a good question. Yes. So, obviously, DLT, like, the,
[00:46:54] Marcin Rudolf:
the biggest value you get is when you can somehow take something that is unstructured and convert that to something that is structured at the other end. Because this is why DLT exists. Yes? So, of course, we started with the messy JSON files, and we did it a really long, long time ago. You can also use any of these libraries that are, let's say, parsing and converting PDFs into some kind of meaningful structures, or things that convert any messy files and create a schema on top of them, and then you can convert them into datasets. So you can actually plug any kind of Python library, or even a platform that does it, into DLT as a source. Yes? We call it an unstructured data source. We have it for people that want to use it, in our verified sources. They can try it out. So that's the one thing. Second thing, like, we also integrate with these frameworks that people typically use, like LangChain. It's not even integration.
No. Our sources are generators. Yes. So you can just take them and use them with LangChain. You don't need to do anything. Or you can pass LangChain documents into DLT and it's gonna also, you know, automatically parse them. So all of the things that, let's say, have, like, LangChain apps or, like, LangChain plugins for such data, you can interact with DLT through it almost automatically.
[00:48:16] Tobias Macey:
The other style of data sources and destinations in particular that has been gaining a lot of attention right now are property graphs because of the renewed interest in knowledge graphs, because of the advent of GraphRAG. I'm curious how you have seen people using DLT in that context as well for being able to populate and maintain some of these property graphs. So, yes,
[00:48:40] Marcin Rudolf:
We know that people are using, you know, GraphQL to query this as a data source. I'm not aware personally of people doing these
[00:48:50] Tobias Macey:
things with DLT right now, mainly. Alright. Well, something to keep your eye on. So for people who are working in data teams, they have a need to be able to move data from point A to point B. What are the cases where DLT is the wrong choice?
[00:49:05] Adrian Brudaru:
Like we touched upon before, if you're not a Python-first person and if you don't care about software development best practices and this kind of stuff, then don't use DLT. It's not for you. DLT is for Python-first data teams. Outside of this, I would say not much to worry about. Right? It's literally by Python data people for Python data people. And as you continue to
[00:49:29] Tobias Macey:
build and invest in DLT and DLT Hub, what are some of the things you have planned for the near to medium term and areas that you're excited to explore further and invest in? So like we were telling you, this pip install, portable, Pythonic data lake,
[00:49:45] Adrian Brudaru:
we think this is a tectonic shift. So there's going to be lots of work that can be done there, I would say. There are a few major areas from open compute to clean, easy dev environment and so on. But, notably, I guess, besides a list of features, one thing that we'll be doing short term is actually some customer roadshows in November, showcasing this, data lake product. So we'll be in San Francisco, New York City, Paris, Berlin. If anyone is interested, just get in touch with us on Slack. We might also add other locations.
[00:50:17] Tobias Macey:
Alright. Are there any other aspects of the work that you're doing on DLT and this overall space of data movement that we didn't discuss yet that you would like to cover before we close out the show? I think that your questions are were very really very interesting and comprehensive, and we touched the things that we we considered the most important. So how
[00:50:37] Marcin Rudolf:
you really being a part of this ecosystem, being a part of this AI evolution. Yes? And the way these new workflows that, are there. Can we benefit from everything that is beneficial to others? We also benefit. Yes. So if people are building a new library, yes, there is a, for example, a new, way to manage Python dependencies when you can ship a script as an executable. It's called UV, some model after the cargo from Rust. And now we are using this for a few days, and it's amazing. And it's adding so much power to the the product that we are building when portability, people start the ability is important.
[00:51:13] Tobias Macey:
So I'm I'm no. I'm amazed by this ecosystem simply and this way of of doing stuff. Not sure this DLT related. It's more like a being an engineer in in the space for a long time. And seeing this kind of thing working, it's this is this is fulfilling. Absolutely. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:51:44] Adrian Brudaru:
I wouldn't say it's a gap that we have today. I would say it's a shortcoming that is coming from the way we design data stacks, and it's, maybe here to stay. Maybe one day it will move away. But, I think the biggest problem that we have right now in the data space is the lack of interoperability of tools. Right? And the fact that when you're building a data stack, you're literally just human middleware stitching together some technologies, and, you know, your documentation is probably going to be a little bit outdated. There's gonna be gotchas. There's going to be all kinds of things that maybe other people stumbled into, and it's your first time. But, essentially, if you look even at the way tools interact, they interact by looking at data in a database.
This doesn't have metadata, which means that the amount of things that you can do with it are, by nature, very limited. And I think once we can get away from this concept that metadata just needs to be added to every tool or created every time and we only move data around, a major change can occur. Before this, I think, you know, we're just all stitching together vendor tools.
[00:52:52] Tobias Macey:
Well, thank you both very much for taking the time today to join me and share the work that you've been doing on DLT. It's definitely a very interesting project. Definitely excited to see the ways that it's evolving. Definitely going to be playing around with that and experimenting with how it fits into some of my new projects that are coming. So appreciate all the time and energy that you've both put into that and the rest of your team, and I hope you enjoy the rest of your day. Thank you. Thank you. Thank you for having us, and have a great day as well. Thank you for listening, and don't forget to check out our other shows.
Podcast.net covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host at data engineering podcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Imagine catching data issues before they snowball into bigger problems. That's what Datafold's new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it's maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production?
Learn more at dataengineeringpodcast.com/datafold today. Your host is Tobias Macey, and today I'm interviewing Adrian Brudaru and Marcin Rudolf, cofounders at dltHub, about the growth of DLT and the numerous ways that you can use it to address the complexities of data integration. So, Adrian, can you start by introducing yourself?
[00:01:04] Adrian Brudaru:
Sure. So I'm a data professional. I got into the data field 12 years ago. I built data platforms for startups and enterprises, and in my last 5 years, I would say I was doing lots of building projects, consulting, and data engineering. And I guess, you know, the reason why I'm here is because I saw a need for people like us, for data engineers. Namely, we didn't have dev tools,
[00:01:28] Marcin Rudolf:
and I decided to do something about it. And, Marcin, how about yourself? Yeah. So my background is actually software engineering, and I have been doing this for a really long time, for 30 years. In that time, I did a lot of different things, from telco software, some, you know, data factories in the early 2000s, search engines, blockchain, and now I'm doing this. And I think it's the best gig of my life. We do a lot of open source, a lot of coding, which I really love, and we are also helping people to build and automate stuff. So, you know, you apply your engineering. This is extremely fulfilling, and I really like it. And going back to you, Adrian, for people who haven't listened to your previous appearance on the show, if you can just
[00:02:10] Adrian Brudaru:
bring us back to how you first got started working in data. Yeah. So, I don't know, 12 years ago I started as an analyst. I did 5 years of startups, and I quickly started building things end to end. The thing about startups is you don't really have a budget, so I was doing a lot of hands-on building, really. After about 5 years, I basically switched to consulting because it gave me the chance to actually work more, as strange as that sounds. So it's more about the work, less about, let's say, the social contract. And here is where, basically, I found the need for DLT, and Marcin here is helping build it. So, yeah,
[00:02:47] Marcin Rudolf:
if I can chime in, I was doing a lot of machine learning before it was cool. I had a startup that was doing a search engine for mobile applications. We did a lot of machine learning, topic inference. We also built a vector database without even knowing it. So, yes, I did a lot of data before, but maybe I was not aware that it was actually data engineering at the time. It was 2009, so quite long ago.
[00:03:13] Tobias Macey:
And bringing us to the conversation today, you've both been working very hard on building the DLT framework, the DLT hub business around it. For folks who want to get a bit more into some of the core of what is DLT, how does it work, I'll refer them back to the previous episode we did back in September of 2023. And so for people who haven't listened to that yet, if you could just give a quick overview about what is DLT and then talk a bit about
[00:03:40] Adrian Brudaru:
some of the notable ways that it has evolved since we last spoke. Yeah. So, basically, DLT is the first pip-install dev tool for data engineers, by data engineers, to build pipelines fast, easy, and robust. It's just the everyday boilerplate code for things like incremental loading, automatic schema inference, and schema evolution. This is where we started. But in one year, our vision has expanded a lot. With the continuous support of the community, we actually evolved into a very comprehensive Python library for moving data. We are well integrated with the modern data stack components, and it works with high-performance Python data libraries like PyArrow, Polars, Ibis, DuckDB, and Delta.
And, yeah, it actually works very well at industrial scale, even in constrained environments. We added things like data contracts and parallelism. But I guess one of the biggest things that we're seeing is an explosion in lakehouse adoption. This is something that we are seeing quite a bit of user pull from. As for adoption, I would say we reached 600k monthly downloads, which is, I guess, 10 times higher than any other competitor in our space. And our users have built over 10k private sources by this time. And they're using it for all kinds of things, including building RAGs, things like that.
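As a minimal sketch of that pip-install experience (the sample records, pipeline name, and dataset name are made up, and DuckDB stands in for any supported destination):

```python
# pip install "dlt[duckdb]"
import dlt

# hypothetical nested records standing in for a real source
rows = [
    {"id": 1, "name": "alice", "tags": ["a", "b"], "address": {"city": "Berlin"}},
    {"id": 2, "name": "bob", "tags": ["c"], "address": {"city": "Paris"}},
]

pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",      # swap for bigquery, snowflake, filesystem, ...
    dataset_name="demo",
)

# schema inference, normalization of nested fields, and child tables happen here
info = pipeline.run(rows, table_name="users")
print(info)
```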
[00:05:00] Marcin Rudolf:
As you have been going along this journey, building DLT, continuing to invest in it and evolve it, what are some of the core principles that help to guide your work on that project? Alright. So we actually have very clear principles for how we build the product and how we operate in this open source ecosystem. One core principle is that DLT is a library, not a platform. So what does that mean? If someone writes code, they are using our library to write that code. You add the library to your code; you don't add your code to someone else's platform. The outcome of this is that we are trying to fit into the existing ecosystem. We are trying to work with everyone; we are not replacing anyone, which is what a typical platform does: it replaces some other platform. We are trying to fit in. So that's the first principle: when you look at other projects, you always look for ways to interact and cooperate. The other one is that we are trying to automate everything, so you should do things once. This is the principle of efficiency, and it applies both to us and to how we think about our users. Then there is the no-black-box principle.
Everything should be customizable. You should be able to change everything you want. So we also expect autonomy from the users, by letting them hack, letting them change everything, and, you know, also look into the code. I think this is very important. And we also go to great lengths so that our users do less work: we do more work in order for our users to do less. For me, this is a very important thing. If I'm an engineer, I should have some empathy toward other engineers. That is one of the principles: you need to be really empathetic, and then you can write really good code and really help other people.
[00:07:00] Tobias Macey:
And from that point of platforms and the impact that they have on the overall architecture and approach that teams take to their data management: as a framework and as a library, and in some of your blog posts, you've taken a very opinionated stance against the idea of using these managed extract and load services. I'm wondering if you can talk a bit more about what you see as the shortcomings of those platforms and what are the situations where you would actually argue in favor of their use. Yeah. I would say this is a question about somebody else that you're asking here,
[00:07:33] Adrian Brudaru:
And the way I would answer it is that we're actually very focused on our own principles, and we don't worry too much about what other people do. Simply put, what we want to achieve, we haven't seen anyone be able to manage by offering managed extract/load services. So, basically, the whole concept, I would say, is competing with the openness that we offer. And at the same time, for us to become a standard, we want to be adopted freely by other vendors as well. Right? So, you know, this means that we just cannot compete there; I would say that is a shortcoming for us. When it comes to the end user, it very much depends on the persona. If you're building something large with custom requirements, you're probably better off having customizable solutions. Yeah. So we are coming from, like, a very different, let's say,
[00:08:22] Marcin Rudolf:
space. So, actually, DLT comes from the same market as all this machine learning revolution, AI revolution, this ecosystem of Python libraries. What we are looking at is having this pip-install experience, so you can actually install a data platform and you are autonomous. You run on your own premises. It's often a local workflow or a single-machine workflow. So actually we are quite orthogonal to this thing that you call managed services. We are trying to follow very different patterns: very different usage patterns, very different workflows that are possible for this Python ecosystem and impossible for the managed solutions. And, of course, the reverse is also true: there are certain workflows that are way easier when you have a managed solution, but we don't think we compete with those. And I think that in that context as well, the places where I would say that the managed platform
[00:09:16] Tobias Macey:
does make more sense is if you don't have a team that has the engineering acumen to build those custom options, and you just need to be able to pull data from one place to another, particularly if it's a widely used pattern where you have good support for the sources and destinations that you're working with. Absolutely. So we actually sometimes see users saying, like,
[00:09:41] Adrian Brudaru:
why do I need code in DLT to do this or that? You know, we tell them: go use Fivetran. From that perspective too, as you said, you're not looking to necessarily
[00:09:49] Tobias Macey:
replace anyone. So I'm curious what you have seen as far as the types of teams and engineers that are using DLT. What is the overlap that you've observed as far as people who are using both DLT and another option where they use that other option for those, well paved paths and they use DLT for the more custom requirements?
[00:10:12] Adrian Brudaru:
So I would say there are a couple of patterns that I see, and that is what I call first time data platform and second time data platform. So the first time data platform is something that people just build quick and dirty. They just whatever. They put it together. It works kind of like what you're talking about, the tried and true patterns. But then there comes the point when something doesn't work. Then you start looking for a solution. You find DLT, and some people stop there, and they just, you know, have a DLT pipeline running alongside other things. But many people, at some point, they reach a point where they go like, okay. But why do I have 2 solutions? I could just use one solution. Or they could get to the point where why am I paying this much for event ingestion on Fivetran or something like that or SQL copy on Fivetran. Right? And then they start migrating more. And then we see the pattern of the 2nd time build or the 2nd time data platform where, let's say, these people have already experienced what the 1st time data platform is like. And let's say reining in the entropy that is created in such places is very difficult. So they just start with engineering best practices from the start, and then, you know, DLT is a no brainer choice kind of for these situations.
[00:11:17] Tobias Macey:
Since the last time that we spoke, there has been a lot of evolution in the space of data movement, some of the most notable pieces being what you mentioned earlier about the increased growth of AI and its requirements around data movement and customization in how that data is moved. And from the competitive landscape, PyAirbyte has seen a lot of investment since the last time that we spoke. I'm wondering if you can talk to some of the ways that those pressures and those evolutions in the broader ecosystem have informed the way that you think about the development and positioning of DLT? So, yes, this is a very good question. And,
[00:11:59] Marcin Rudolf:
so maybe I can give some background on where we are coming from. DLT is coming from this revolution that is happening right now. When we started the project, our way to convince people that we were doing something interesting, something new, was to tell everyone: you know, there is a revolution happening. This revolution happened in ML. There are people using these libraries, doing these new workflows, very autonomous, local workflows, for example. And we were sure this revolution was going to come to the data space. At some point, people would realize that you can actually put together a lot of Python stuff and build your own data platform, you can install it, and you can have the same kind of experience that these data scientists have. So our feeling is that we are just going with this evolution, and we have been there the whole time. You were asking about what changed in data. We made the typical transition from, you know, this very nice JSON parser that creates relational structures into a data movement library that integrates with this whole ecosystem of other libraries.
And yeah, that's my feeling: we are simply there. Whatever happens in the Python space that is beneficial to some part of the ecosystem, we also benefit from. So, I don't know, if there is a new version of an LLM from OpenAI, like this recent mini stuff, o1, we also benefit. It knows DLT and it helps our users right away to write the code. So this is actually very interesting, and we love this kind of revolution that's happening. This is what we are betting on. Also, if I can build on
[00:13:46] Adrian Brudaru:
the LLMs part, there's a little joke I like to make: while other people add LLMs or AI to their products, we added our product to LLMs. So, basically, if you go now to the newest LLM models, you don't need any plugins, any RAG. You can just ask it for a DLT pipeline. It will go online, search the documentation, and build it for you. Yeah. But another good example is how we interact with the vector databases. So, actually, you have LanceDB, for example,
[00:14:14] Marcin Rudolf:
or you have embedded Qdrant. And, you know, this is just another library for us, and we are so tightly integrated; it's like one thing that you interact with together. It's very different from being a destination on some SaaS platform. It's your workflow, your one notebook, and you interact with it in the same way you interact with the other libraries.
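As a hedged sketch of that embedded, library-style integration (the qdrant_adapter helper and field names follow the dlt docs, but the data and local configuration here are assumptions; check the docs for your version):

```python
import dlt
from dlt.destinations.adapters import qdrant_adapter

# hypothetical documents to embed and store
docs = [
    {"doc_id": 1, "text": "dlt ships data and metadata together"},
    {"doc_id": 2, "text": "embedded destinations run in the same process"},
]

pipeline = dlt.pipeline(
    pipeline_name="vector_demo",
    destination="qdrant",   # embedded/local mode is configured via dlt secrets/config
    dataset_name="docs",
)

# mark which fields should be embedded before loading
info = pipeline.run(qdrant_adapter(docs, embed=["text"]), table_name="documents")
print(info)
```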
[00:14:40] Tobias Macey:
Pushing a little bit more on the differentiation between DLT and some of the platform approaches: I think one of the things that those platforms offer is the state management for things like incremental loads, where you can say, okay, it's going to maintain the checkpoint information about what was the last thing that I loaded and where do I pick up from there. What is your approach for being able to manage some of that state storage and resuming incremental loads for people who are building with DLT?
[00:15:05] Marcin Rudolf:
Yeah. This is a very good question, and I think we have a very smart way of doing that. Simply, for us, the state is a part of the data. If you have any kind of destination, your destination is able to store some kind of state. So we ship the data together with the metadata to the destination, and we also load it in, let's say, an atomic way. I think it's a very robust way of handling state: if your data loads and all the checks are passing, your state loads too. So when you resume your load, you're going to get state that always matches your data. So we actually use the destination to store the state. If you want to abstract that and store the state somewhere else, you can; of course, it's a library, so the state is like a kind of context that you can swap. But for the average user, it's a seamless experience. You don't need any kind of additional setup. You just load to the destination, to Postgres, to a file system, even to vector databases, and you get the state automatically stored.
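A sketch of that pattern with dlt's incremental cursor, whose value lives in the pipeline state and is loaded together with the data (the API endpoint and field names are hypothetical):

```python
import dlt
import requests

@dlt.resource(table_name="events", write_disposition="append")
def events(updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01")):
    # On the next run, updated_at.last_value is restored from the destination,
    # so only newer records are requested.
    resp = requests.get(
        "https://api.example.com/events",            # hypothetical endpoint
        params={"since": updated_at.last_value},
    )
    resp.raise_for_status()
    yield resp.json()

pipeline = dlt.pipeline(pipeline_name="events", destination="duckdb", dataset_name="raw")
print(pipeline.run(events()))
```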
[00:16:14] Adrian Brudaru:
Yeah, and if I can add: you know how, for example, with Airflow, you would manage your state in Airflow, which means Airflow has to be running and functional for your pipeline to run. What we do with DLT, because we persist the state at the destination,
[00:16:28] Marcin Rudolf:
is that we support serverless cases very well. Right? So if you want to run on GitHub Actions, if you want to run on serverless functions, anything like that, you don't need anything local to persist the state. Yeah. So our preferred way of working with incremental loads is actually to wipe out everything after every run. We restore the pipeline very quickly from the destination, with a clean working directory and the last stored state, and run it for the new, let's say, set of data. You also mentioned the growth and evolution of the Python ecosystem
[00:17:00] Tobias Macey:
and how, because you're just a Python library, you get to benefit from that as well. That ecosystem has also seen a lot of growth and investment, particularly in the data-oriented set of libraries and frameworks. I'm thinking, in particular, of a lot of the rustification with things like the DataFusion library, and PyArrow has seen a lot of growth and investment. I'm curious if you can talk to some of the developments in that overall ecosystem of libraries, frameworks, and the Python runtime itself that you have been able to benefit from and that you're most excited about. Yeah. So,
[00:17:37] Marcin Rudolf:
this is a really interesting question. You'll probably need to stop me at some point, because there are so many libraries that we are integrated with that I could go on for a long time. But we see a few trends, so I can group them. One trend is single-node or single-machine computing and portable data engines, like DuckDB or this DataFusion stuff, or even LanceDB; you could call them data engines. Then we have open storage formats and the associated libraries, like, you know, PyIceberg or delta-rs.
And we also have this development with Arrow and PyArrow, which standardizes the in-memory table format and also the compute. So those are the trends that I see. And now, if you go through them: we of course have DuckDB, which we use everywhere. First of all, it's a way to onboard our users, to give them this local experience that doesn't need any kind of credentials. They can try everything, they can develop programs, and it improved the developer experience so much that it's hard to explain how much of a change it was for us when DuckDB appeared. Then I mentioned Arrow, and Polars, pandas, and all these ways to work with tabular data. Those are primary citizens for us, so we can load them directly. Everything that works for, you know, dictionaries or JSON also works for them: incremental loading, SCD2, merges, and so on. They are native citizens. Then you have the libraries for the open table formats, and we actually use them a lot. We build Delta Lakes using delta-rs.
We have some big deployments already in production. And then there are some other libraries, maybe less mentioned, but super important. I think they really show how much we benefit from being a part of the ecosystem, from trying to fit in, not to replace. Even if you look at something as, let's say, standard as SQLAlchemy: we recently added a SQLAlchemy destination and a source, and now we support hundreds of databases. We put in some work to do the merges, SCD2, and incremental loading, and, you know, even obscure databases support that right now. So I could enumerate a lot. We benefit from sqlglot, and we benefit from fsspec, which is just the file system abstraction. A lot from Pydantic, for example, to create data contracts that people actually understand from their other work. And there are libraries for entity recognition that we use to build data contracts for PII data, like Microsoft Presidio, so you can detect all the entities that are PII. Yeah. I agree. It's very easy to wax poetic about all the interesting things that are going on in that space. So it's definitely great to see the amount of investment and integration that you've done. Yeah. And I must say this is super natural for us. It's not that we are doing these integrations for the sake of integrations; we see the value coming from them. We see how this works together, and it's like a multiplier.
We are not adding; we are multiplying. That's one of our, let's say, core principles.
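As a hedged sketch of the Pydantic-based data contracts mentioned above (the model and the schema_contract settings are illustrative, and exact option names may vary between dlt versions):

```python
import dlt
from pydantic import BaseModel

class User(BaseModel):
    # the model documents the expected shape of a row
    id: int
    email: str
    country: str | None = None

@dlt.resource(columns=User, table_name="users")
def users():
    yield [{"id": 1, "email": "a@example.com", "country": "DE"}]

pipeline = dlt.pipeline(pipeline_name="contracts", destination="duckdb", dataset_name="crm")

# "freeze" rejects unexpected new columns instead of silently evolving the schema
info = pipeline.run(users(), schema_contract={"columns": "freeze"})
print(info)
```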
[00:21:06] Tobias Macey:
On that point too of being able to customize and build all kinds of use cases around DLT. One of the things that helps most with adoption for a project like DLT is the overall developer experience, the onboarding. I'm wondering if you can talk to some of the notable investments that you've made in that user experience of building with DLT composing pipelines and some of the ways that you think about the interfaces for source and destination development.
[00:21:40] Marcin Rudolf:
Yes. So, I mean, we are by nature very close to the code. We are a library, so we interact with the code, and I think it's in our DNA to make this development experience easy. I'm a software engineer, so I'm trying to keep this thing good for people who want to build, who want to write code. I mentioned this DuckDB thing: it improved onboarding a lot. It shortens the time to learn stuff. You can run all the examples, you can get into things, and you can very quickly build a local, very fast and low-latency environment with DuckDB and use it for testing. We put a lot of attention into all of our destinations behaving the same way. There are thousands of tests that we wrote to make sure that every destination, starting from the vector databases and going even to files on a bucket, behaves the same way from the point of view of our users. They're going to have the same schemas. If dbt is supported for a destination, it's going to be the same set of transformations, so you can develop locally, then run the same code on CI, and then deploy it. And this is really good for data quality, because you can test it. There are many other mechanisms like that. We have a pretty strong software engineering approach, as I said, so we really pay a lot of attention to following the Python intuition: to not invent new stuff, but to use existing building blocks. So, for example, all our sources are generators, like Python generators.
And you don't need to create some obscure object model. People know generators, they know what they are, and they can use them right away. The same goes for destinations: you can now build your own destinations, which are like sinks that consume data. You use something called a Python decorator; you just decorate a function, and you have a reverse-ETL thingy that took you maybe an hour to produce. So now we are already coming to the second part of your question: our investments into sources and destinations. We actually built really good support for creating REST API pipelines. We have support for an imperative mode where you write code, with pagination, authentication, and so on; it's a low-level toolkit. But there is also a higher-level toolkit, a declarative mode, and it's combined with a code generation tool which converts OpenAPI definitions directly into pipelines and datasets. You declare what you want; you build a tree of what we call resources.
And very quickly, you can define your pipeline and just run it. We know that people love it; we are getting a lot of feedback, and we see hundreds of these in production. So that was, like, our big achievement. We did similar things for database and file system sync. We standardized it.
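A sketch of those two building blocks: a plain Python generator as a source, and a decorated function as a custom, reverse-ETL-style destination (the CRM call is a placeholder side effect; treat the exact decorator parameters as assumptions for your dlt version):

```python
import dlt

@dlt.resource(table_name="signups")
def signups():
    # any generator works as a source
    for i in range(3):
        yield {"id": i, "plan": "free"}

@dlt.destination(batch_size=10)
def crm_sink(items, table) -> None:
    # items is a batch of rows for the given table; push them wherever you need
    for item in items:
        print(f"would send to CRM ({table['name']}): {item}")  # placeholder

pipeline = dlt.pipeline(pipeline_name="reverse_etl", destination=crm_sink)
pipeline.run(signups())
```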
[00:24:49] Adrian Brudaru:
We let people declare tables or whole databases, combine them, and very quickly create pipelines that sync databases. I think the database source is the most popular one right now. Yeah, and I'd say the REST API source is the second one, and I want to put a particular emphasis on this one because it was built with lots of, let's say, community pull. The REST API declarative source is just a Python dictionary, and originally somebody donated some code to us, their take on this. We used it as inspiration, then we had more people from the community basically asking that we build something like this, and eventually we actually built it with community members. They did, like, half the work, I guess. And it's literally our second most used source, so it's a big success.
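To give a flavor of that "just a Python dictionary" style, here is a hedged sketch of a declarative REST API source (the API, endpoints, auth, and the resolve-parameter wiring are made up; check the rest_api docs for the exact config keys in your dlt version):

```python
import dlt
from dlt.sources.rest_api import rest_api_source

source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/v1/",     # hypothetical API
        "auth": {"type": "bearer", "token": "YOUR_API_TOKEN"},  # load from dlt secrets in practice
    },
    "resources": [
        "projects",                                     # simple endpoint, defaults applied
        {
            "name": "issues",
            "endpoint": {
                "path": "projects/{project_id}/issues",
                "params": {
                    # child resource: project_id is resolved from the parent "projects" rows
                    "project_id": {"type": "resolve", "resource": "projects", "field": "id"},
                },
            },
        },
    ],
})

pipeline = dlt.pipeline(pipeline_name="example_api", destination="duckdb", dataset_name="api_data")
print(pipeline.run(source))
```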
[00:25:36] Tobias Macey:
Yeah. On that point of developer experience, onboarding, and speed of experimentation, one of the ways that, in my experience, has always been most effective to encourage that is having sane and useful defaults. And I know that a lot of the investment in this overall space of building libraries of sources and destinations for data movement has been in having some standardized protocol for the data interchange, whether that's going back to the Unix shell with the pipe operator and just being able to operate on arbitrary strings, or the Singer specification, or the Airbyte specification. I'm wondering how you've approached that aspect of this space and how you think about the data interchange protocol between the source and destination, particularly given that you're trying to move large volumes of data, so you don't want to have to spend a lot of time on serialization and deserialization.
[00:26:28] Marcin Rudolf:
Alright, it's a very good question. Actually, our users interact mostly with the code. The internals are also available; we have this hacking principle that you can always get into the internals and use the lower-level stuff. But maybe I'll start with the sane defaults that you mentioned. We actually put a lot of attention into letting people use really minimalistic code without setting any kind of options. We spent a lot of time figuring out what the best settings are for different things, and we try to set them up in a way that works. Sometimes it's a little conservative, but we are making sure that you can start very quickly and then improve your thing. So you can start almost without learning; when you need something special, then you need to learn. That's our approach: we don't want you to learn some object model before you even use it. So that's one thing. Second thing: we actually do not have any kind of protocol. We use open formats everywhere we can. Right now, our primary citizen is Parquet.
So when we extract — and that's another interesting story: a year ago, people really had serious doubts whether you can do any kind of bigger loads with Python, and we realized yes, but you need to use Arrow, you need to use ConnectorX, you need to use Polars, and then we did it — our interchange protocol is a Parquet file. Internally, we simply form, you know, repositories of Parquet files with a manifest, which is a schema that can be a YAML file or a Pythonic model, whatever people prefer. And this is the exchange among the different stages of DLT. The last stage is the load stage, the destination: it looks at what the schema is and loads the Parquet files into the destination. So, ideally, if you are working with Arrow tables, you just save them once and then pass them to the load stage without deserializing; they are only deserialized by the engine at the end.
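A sketch of that Arrow/Parquet path: yield a PyArrow table and dlt writes it out columnar rather than converting rows one by one (the destination and bucket path here are assumptions, not the episode's exact setup):

```python
import dlt
import pyarrow as pa

@dlt.resource(table_name="trips")
def trips():
    yield pa.table({
        "trip_id": [1, 2, 3],
        "distance_km": [1.2, 5.4, 3.3],
    })

# a filesystem destination with Parquet keeps the data columnar end to end
pipeline = dlt.pipeline(
    pipeline_name="arrow_demo",
    destination=dlt.destinations.filesystem(bucket_url="file:///tmp/arrow_demo_lake"),
    dataset_name="lake",
)
print(pipeline.run(trips(), loader_file_format="parquet"))
```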
[00:28:39] Adrian Brudaru:
So I would also try to answer this in a way that data engineers might relate to more easily, and that is: you typically would have a source-to-destination protocol because the source is passing data and metadata. The way DLT works is a little different. You basically have a component in the middle that is doing this metadata inference for you. So what this means is the source is only emitting data; you don't need to worry about metadata. If metadata is available, we can capture it, but it's not necessary. So what this means is you're just yielding JSON or data frames or the like. Yeah. The fact that you're able to use some of those native constructs,
[00:29:20] Tobias Macey:
PyArrow in particular, I can see as being immensely valuable from a performance perspective, because if you're using Arrow for the source and the destination,
[00:29:30] Marcin Rudolf:
then they can just operate on the same block of memory. You don't have to deal with that save-and-load step in the middle. Yeah, that's true. I think this is part of your question about the high-performance libraries; this is one of the things that we see. I also see it as good standardization. Maybe it's a de facto standard rather than a formal one, but, you know, this table format, which is Arrow, is a huge benefit for the ecosystem. Another interesting aspect from that performance
[00:29:54] Tobias Macey:
perspective is the ability to parallelize, and in particular, with Python 3.13 having the no-GIL option and being able to do free threading. I'm wondering what you've seen as far as experimenting with that and some of the ways that that impacts the ways that you think about building and deploying these pipelines.
[00:30:15] Marcin Rudolf:
Yes. So, actually, we were experimenting a lot, maybe not with this no-GIL option, but, you know, most of the Rust-based libraries release the GIL immediately. We were recently building Delta Lake support on delta-rs, and we were checking whether this is really parallel processing. Even if you have one process in Python and you have many threads, of course they're going to get serialized because there is the GIL. But delta-rs is nice and releases the GIL, so I think it already gives you a little bit of this new Python experience, like Python 3.13, and it really works in parallel. We checked that: you can actually write to many tables at once and do merges to many tables at once, which uses DataFusion, so there is a lot of CPU processing as well, and we see that it works.
So, really, we think we're going to benefit more from these high-performance libraries than from the changes in Python itself. Python is becoming more and more glue code: a very nice abstraction for the user as well, like an interface to the deeper things, and our task is to hide that and just expose what people can really understand and interact with.
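An illustration of the point about Rust-based libraries and the GIL: delta-rs (the `deltalake` package) does its heavy work outside the GIL, so plain Python threads writing several tables really do run in parallel. The paths and data here are made up:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
from deltalake import write_deltalake

def write_table(name: str) -> str:
    data = pa.table({"id": list(range(100_000)), "source": [name] * 100_000})
    # the heavy I/O and encoding happen in Rust, which releases the GIL
    write_deltalake(f"./lake/{name}", data, mode="overwrite")
    return name

with ThreadPoolExecutor(max_workers=4) as pool:
    for done in pool.map(write_table, ["orders", "customers", "events", "payments"]):
        print(f"wrote {done}")
```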
[00:31:23] Tobias Macey:
As I was preparing for this conversation and reading through your various blog posts, one of the things that captured my attention the most is this idea of a portable data lake and the impact that it has on going from local development to production, the challenges that exist on that journey. That's something that has been true for years, probably decades at this point. I'm wondering if you can summarize the ideas in that post and maybe talk to what are the missing pieces that would make that fully portable data lake something that can be properly realized.
[00:32:01] Marcin Rudolf:
So I could start with the more technical perspective on this: what had to come together in order to enable it. I mean, we talked about this already. Let's start with the high-performance libraries. You actually need to somehow create, maintain, and vacuum these data lakes, and these libraries let you do that; it was hard to be ready for this before. Then, a lot of stuff is standardized. This is what I mentioned: it's probably de facto standards rather than formal ones, starting from the in-memory table format, which is Arrow. And finally we have working Iceberg, and we have Delta, so those things are standardized and you can interact with them. Another thing that happened: you have these portable query engines. You know, delta-rs is DataFusion.
And now you have DuckDB. With DuckDB, you can connect to any kind of store via so-called scanners and read data from Delta, from Iceberg, from Postgres, from bucket files. So you can move the engine close to your data. This is a big benefit; without it, there would probably be little reason to build these lakes. And, you know, you also have these transformation engines that are trying to replace dbt, like SQLMesh, or transformation engines that work on data frames, like Hamilton, for example. They are also making this experience with the portable data lakes way better and also way cheaper, because you don't need to transform on Snowflake. You can now transform with data frames, or with DuckDB plus dbt if you want the old style. I think all of this had to come together in order to create enough value for people to adopt this stuff.
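A sketch of "bring the engine to the data" with DuckDB as the portable query engine (the paths are placeholders; read_parquet is core DuckDB, and the delta extension availability depends on your DuckDB version):

```python
import duckdb

con = duckdb.connect()

# plain Parquet files on local disk or object storage
print(con.sql("SELECT count(*) FROM read_parquet('./lake/events/**/*.parquet')"))

# Delta tables via DuckDB's delta extension (install/load once per environment)
con.sql("INSTALL delta")
con.sql("LOAD delta")
print(con.sql("SELECT count(*) FROM delta_scan('./lake/orders')"))
```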
[00:33:55] Adrian Brudaru:
And I would say there is also a community aspect that is important, and that is demand. Something that I think many of us have noticed is there's been a lot of talk about Iceberg, a lot of talk about Delta, but limited adoption; of course, more in some areas than others. But, for example, if you look at Iceberg, it really exploded this year. In particular, I think in January, just before the acquisition of Tabular by Databricks, Iceberg was about 2 times the search volume compared to Delta. Right? So something is happening. And with the recent acquisition, there's even more publicity. I would say, by now, Iceberg is a bit of a trigger word for many: you say Iceberg, and everyone's going to tell you all the new things they're working on. Yeah. I think that the growth of Iceberg
[00:34:40] Tobias Macey:
has definitely accelerated a lot, which is great to see. It's definitely excellent that there has been a lot of investment as far as the tooling to be able to integrate with that effectively. So DuckDB, as you mentioned, the fact that there is PyIceberg to be able to directly interact with the tables without having to have a query engine in the middle, and then also just all of the different query engines integrating with that. So Trino, and even Snowflake has invested in Iceberg support for being able to either query across Iceberg tables or use that as a native ingest path. And then in terms of your experience of building DLT, the fact that it's open source is great. That obviously helps with community adoption. But at the end of the day, you also have to have some path to sustainability.
I'm wondering as you have continued to build and invest in the tooling and grow the community, how have you worked on formulating the strategy for being able to build a sustainable product on top of that foundation?
[00:35:38] Adrian Brudaru:
So, in essence, you're asking where the money is coming from. Right? I would say, you know, for the last 6 months, we've been quite successful at doing support. We've had several types of support, let's say, that we see Fortune 500 companies ask us for, from consulting to classic support. Second, we have a very successful OSS motion. And right now, as we were talking about the portable data lake earlier, what this is, it's basically a dev environment for people who just want to go from local to production and easily develop these data platforms. Right now we're in design partnerships with multiple customers, and we're building this together. And we can see in more detail that there is this movement in the market towards open compute, and we think this is something that we will be ready with very soon. Yes. We learn a lot from what our customers are building as well. So,
[00:36:31] Marcin Rudolf:
you can actually build a lot of things on top of DLT as a library. And this is where we go product-wise: observing and building, reproducing
[00:36:42] Tobias Macey:
certain solutions. In your experience of building the project, working with end users, and growing that community, what are some of the most interesting or innovative or unexpected ways that you've seen DLT used? Yeah. So, actually actually, this is a really good question. Yes.
[00:36:58] Marcin Rudolf:
Our users are extremely smart. The people that are using DLT are typically builders: they build their own data platforms, and we have somehow gotten used to the situation that our users are ahead of us in the ways that they think about DLT and how they can apply it. I can give you a few examples. Our earliest big production deployment was at Harness; this is a CI company, and a pretty big one. DLT was already being used there more than a year ago, a year and a half ago, to create a data platform that got integrated; Harness has its own object model on top of which it can be utilized. So it got integrated into UIs, and certain user interfaces were generated automatically.
And people that needed data could just interact with this interface, and DLT would produce the dataset on demand. So it's like a data democracy movement. Then we realized: yes, this is really new, and this is where DLT can go. Then we also have a team of data scientists that uses DLT plus pandas plus Monday.com, which is like a list tool where you can maintain lists of things, to be the whole CMS, with a machine learning component, serving millions of users. They are not engineers; they are very smart data scientists. And by stitching this stuff together, they built a true data platform that does the work. That was really amazing to see: people that are not engineers can also build this kind of stuff, user-facing products, which would be absolutely impossible without Python and these libraries and so on. Then, recently, we built our first big Delta Lake with PostHog.
And this is a really interesting case, because the Delta Lake is not for internal use; it's actually a way to interact with their customers. So what can you learn from it? Typically, you would build a REST API and some OpenAPI interface, and people would build REST clients to take the data. Now, instead, you interact with the data. You interact with the lake: there are schemas, there are tables, you can use whatever engine you want and bring the engine close to your data. It's extremely effective. So I think there is really something new there, and we want to learn from it. I could continue, really: we have users that were building asynchronous destinations and asynchronous sources before us, taking data from Postgres extremely quickly via asyncio.
So this is really a lot. And the most recent development is that people are using Cursor AI to automate code generation: they are feeding it the documentation, creating a set of rules, and the pipelines are being generated for them. I mean, we expected that, but users did it way before we could even make a first proof of concept, and it's already in production. So this is amazing. Having people use your stuff is, for me as an engineer, extremely fulfilling; but having them use it in ways that I didn't expect, that is the best part. And this is happening every day. You mentioned being able to build a data platform,
[00:40:11] Tobias Macey:
and that's another topic area that you focused on in your blog posts and your messaging and positioning is the idea of DLT being a core tool in the toolbox of data platform engineers. And I'm wondering if you can talk to some of the ways that you think about what does a data platform engineer do, how does DLT help them, and what are some of the ways that they expose
[00:40:36] Adrian Brudaru:
those platform capabilities to the consumers of the platform to be able to build on top of that, and the role that DLT plays in that relationship? So I would say data platform is a term that gets thrown around a lot, but, really, it's just the technical representation of a data stack, be it a data warehouse or a data lake or whatever you're doing. Where this data platform engineer role is new is that what they're doing is enabling other data people to do their jobs better. So, basically, they put together systems that create ways for data developers to build pipelines faster, easier, and cleaner, with, let's say, boilerplate code and all kinds of things, right, from governance to engineering.
[00:41:18] Marcin Rudolf:
Yeah. This is what we basically observe all the time. I mentioned the unexpected things that happen, and the pattern that we see is that people are building what we call data platforms in a box: a way to bundle certain entities that you can create with DLT. So, you know, sources, destinations, datasets, schemas, contracts, all together create a package which can be developed locally on DuckDB, deployed on CI in some test environment, and then deployed by the infra people. They can hand over the same package to the infra people, and they can deploy exactly the same thing to the production environment. So this is what we see people building: portable data platforms, as Adrian mentioned, that have the whole stack inside. But this is also the way people expose this to the data users, which is very interesting. Of course, this is not in every company, but companies that have a strong data science team typically bundle this kind of platform and expose just certain things as a Python interface. And those people can work interactively with data frames, for example, but the data frames are coming from the data lake. And then you can train your model directly on the data that is stored somewhere, without even using SQL or doing that kind of stuff. And it's the same code, the same setup; it's a portable data platform, as we call it. Yeah. And I would say deeper than the technical aspect is also governance. Right? Because when you have a uniform way of ingesting things, you have standardization.
[00:42:51] Adrian Brudaru:
You have schemas up front. And, basically, I don't know, you're probably familiar with the concept of data mesh. What data mesh advocates is, basically, that domains are more self-sufficient. So domain knowledge is captured and fed into metadata for pipelines, which allows the organization to understand what this data is. The way you can think about DLT, with its schema and its data contracts, is that it's quite close to that, and we're working on semantic capabilities that will allow these data contracts to do way more advanced things such as data meshing
[00:43:25] Tobias Macey:
or, let's say, PII data contracts. Yeah. Definitely very excited to see the continued evolution in that regard as well. So I'll be keeping a close eye on your activities there. In your experience of building this tool, building a business, building a community, what are some of the most interesting or unexpected or challenging lessons that you've each learned in that process? I can start with a challenging lesson, and this one is a little painful.
[00:43:50] Adrian Brudaru:
So we were working on pipeline generation because, you know, LLMs, it seemed to all make sense, and this is how we came upon the OpenAPI idea. The challenge there is that you have a number of pieces of information that you need in order to generate the pipeline, and if you are using an LLM to guess them, then the error rate compounds. So, essentially, by the time you have a finished pipeline, it's probably not working. So we reconsidered this approach — basically, the technology is not there for it — and we went down an algorithmic path. We figured, hey, we have almost everything in OpenAPI, we can infer the rest, and what cannot be inferred can be manually tweaked by the user. So we created this generator that basically scans an OpenAPI spec and creates a REST API source from it, but nobody cared. We were expecting, you know, maybe people that are working with FastAPI, who basically use this standard, or maybe people from the community, but literally,
[00:44:51] Marcin Rudolf:
we couldn't find people who cared. Yeah. And that was also a little bit unexpected. From the technical point of view, it's an amazing thing: you have the information, you can create this kind of thing automatically. But actually, it's so easy without it that people probably don't want to learn another tool; it's easier for them to just write a simple Python dictionary. There is one other very interesting thing I already mentioned. When we were doing this production Delta Lake with PostHog, we realized something. For a long time, we had this core idea of our product that we can convert any REST API into a dataset, because people that interact with REST APIs, if they are consumers of the data, don't want to make HTTP calls to some endpoint; they want to interact with datasets. And now we see something really new: the user interface is not even going to be the API. It's going to be some kind of lake, with a schema, with some data catalog that is supposed to be auto-generated, because it's a lot of work to generate a catalog. And this is how companies are going to interact with other companies or with their customers. This is super new. We are thinking about how to make it easier, how to automate it. We think it's going to be a part of this lake revolution that's happening. One thing that I was just realizing that we didn't touch on explicitly
[00:46:18] Tobias Macey:
is that for a lot of this conversation, we've been leaning more towards the idea of structured data sources and destinations. We mentioned the ability to integrate with these AI stacks. I'm wondering what are some of the ways that you think about being able to easily address unstructured data sources and be able to either consume that as-is or turn it into a structured representation, and some of the experiments that you've done internally and some of the ways you've seen teams that are using DLT addressing that challenge in their own work? Yes, this is a good question.
[00:46:54] Marcin Rudolf:
The biggest value you get with DLT is when you can somehow take something that is unstructured and convert it to something that is structured at the other end, because this is why DLT exists. So, of course, we started with the messy JSON files, and we did that a really long time ago. But you can also use any of these libraries that, let's say, parse and convert PDFs into some kind of meaningful structure, or things that take any messy files and create a schema on top of them, and then you can convert them into datasets. So you can actually plug any kind of Python library, or even a platform that does it, into DLT as a source. We have such a source, for people that want to use it, in our verified sources; they can try it out. So that's one thing. The second thing is that we also integrate with the frameworks that people typically use, like LangChain. It's not even an integration.
Our sources are generators, so you can just take them and use them with LangChain; you don't need to do anything. Or you can pass LangChain documents into DLT, and it's also going to, you know, automatically parse them. So for all of the things that have, let's say, LangChain apps or, like, LangChain plugins for such data, you can interact with DLT through them almost automatically.
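A hedged sketch of plugging an unstructured-data parser into DLT as a source: here pypdf extracts text per page and DLT structures the result (the PDF path is a placeholder, and the verified unstructured-data source in dlt works differently):

```python
import dlt
from pypdf import PdfReader

@dlt.resource(table_name="pdf_pages")
def pdf_pages(path: str = "./docs/report.pdf"):
    reader = PdfReader(path)
    for page_number, page in enumerate(reader.pages):
        yield {
            "path": path,
            "page": page_number,
            "text": page.extract_text(),
        }

pipeline = dlt.pipeline(pipeline_name="docs", destination="duckdb", dataset_name="unstructured")
print(pipeline.run(pdf_pages()))
```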
[00:48:16] Tobias Macey:
The other style of data sources, and destinations in particular, that has been gaining a lot of attention right now is property graphs, because of the renewed interest in knowledge graphs with the advent of GraphRAG. I'm curious how you have seen people using DLT in that context as well, for being able to populate and maintain some of these property graphs. So, yes,
[00:48:40] Marcin Rudolf:
we know that people are using, you know, GraphQL as a data source to query. I'm not aware personally of people doing these
[00:48:50] Tobias Macey:
things with DLT right now. Alright. Well, something to keep your eye on. So, for people who are working in data teams and have a need to be able to move data from point A to point B, what are the cases where DLT is the wrong choice?
[00:49:05] Adrian Brudaru:
Like we touched upon before, if you're not a Python-first person and if you don't care about software development best practices and that kind of stuff, then don't use DLT. It's not for you. DLT is for Python-first data teams. Outside of this, I would say there's not much to worry about. Right? It's literally by Python data people, for Python data people. And as you continue to
[00:49:29] Tobias Macey:
build and invest in DLT and DLT Hub, what are some of the things you have planned for the near to medium term and areas that you're excited to explore further and invest in? So like we were telling you, this pip install, portable, Pythonic data lake,
[00:49:45] Adrian Brudaru:
we think this is a tectonic shift. So there's going to be lots of work that can be done there, I would say. There are a few major areas from open compute to clean, easy dev environment and so on. But, notably, I guess, besides a list of features, one thing that we'll be doing short term is actually some customer roadshows in November, showcasing this, data lake product. So we'll be in San Francisco, New York City, Paris, Berlin. If anyone is interested, just get in touch with us on Slack. We might also add other locations.
[00:50:17] Tobias Macey:
Alright. Are there any other aspects of the work that you're doing on DLT and this overall space of data movement that we didn't discuss yet that you would like to cover before we close out the show? I think that your questions were really very interesting and comprehensive, and we touched on the things that we considered the most important.
[00:50:37] Marcin Rudolf:
I think that your questions were really very interesting and comprehensive, and we touched on the things that we considered the most important. So, how I see it, it's really being a part of this ecosystem, being a part of this AI evolution. Yes? And with these new workflows that are out there, we can benefit from everything that is beneficial to others. Yes? So if people are building a new library, for example, there is a new way to manage Python dependencies where you can ship a script as an executable. It's called uv, somewhat modeled after Cargo from Rust. We've now been using it for a few days, and it's amazing. It's adding so much power to the product that we are building, where portability is important.
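As an illustration of the uv workflow Marcin describes (shipping a script with its dependencies declared inline and running it as if it were an executable), here is a minimal sketch assuming a recent uv release with inline script metadata (PEP 723) support; the pipeline itself is a made-up example:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["dlt[duckdb]"]
# ///
# Run with `uv run hello_dlt.py`: uv resolves the inline dependencies above
# into an isolated environment before executing the script, so the single
# file is all you need to ship.
import dlt

pipeline = dlt.pipeline(
    pipeline_name="hello_dlt",
    destination="duckdb",
    dataset_name="demo",
)
print(
    pipeline.run(
        [{"id": 1, "message": "hello from a self-contained script"}],
        table_name="greetings",
    )
)
```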
[00:51:13] Tobias Macey:
I'm amazed by this ecosystem, simply, and this way of doing things. I'm not sure it's dlt related. It's more about being an engineer in the space for a long time and seeing this kind of thing working. This is fulfilling. Absolutely. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:44] Adrian Brudaru:
I wouldn't say it's a gap that we have today. I would say it's a shortcoming that comes from the way we design data stacks, and it's maybe here to stay. Maybe one day it will go away. But I think the biggest problem that we have right now in the data space is the lack of interoperability of tools. Right? And the fact that when you're building a data stack, you're literally just human middleware stitching together some technologies. And, you know, your documentation is probably going to be a little bit outdated. There are going to be gotchas. There are going to be all kinds of things that maybe other people stumbled into, and it's your first time. But, essentially, if you look even at the way tools interact, they interact by looking at data in a database.
This doesn't have metadata, which means that the number of things you can do with it is, by nature, very limited. And I think once we can get away from this pattern where metadata has to be added to every tool or recreated every time, and we only move data around, a major change can occur. Before then, I think, you know, we're all just stitching together vendor tools.
[00:52:52] Tobias Macey:
Well, thank you both very much for taking the time today to join me and share the work that you've been doing on dlt. It's definitely a very interesting project, and I'm excited to see the ways that it's evolving. I'm going to be playing around with it and experimenting with how it fits into some of my new projects that are coming. So I appreciate all the time and energy that you and the rest of your team have put into it, and I hope you enjoy the rest of your day. Thank you. Thank you for having us, and have a great day as well. Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to DLT Hub and Guests
Overview of DLT and Its Evolution
Core Principles of DLT Development
Managed Services vs. DLT Approach
Impact of AI and Ecosystem Developments
State Management and Incremental Loads
Python Ecosystem and Performance Enhancements
Developer Experience and Onboarding
Data Interchange Protocols
Portable Data Lakes and Industry Trends
Building a Sustainable Open Source Product
Role of Data Platform Engineers
Handling Unstructured Data
When DLT is Not the Right Choice
Gaps in Data Management Tooling