Summary
Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. Despite that, it is still a complex set of capabilities. To bring streaming data within reach of application engineers, Matteo Pelati helped to create Dozer. In this episode he explains how investing in high-performance, operationally simplified streaming with a familiar API can yield significant benefits for software and data teams alike.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team!
- Your host is Tobias Macey and today I'm interviewing Matteo Pelati about Dozer, an open source engine that includes data ingestion, transformation, and API generation for real-time sources
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Dozer is and the story behind it?
- What was your decision process for building Dozer as open source?
- As you note in the documentation, Dozer has overlap with a number of technologies that are aimed at different use cases. What was missing from each of them, and from the center of their Venn diagram, that prompted you to build Dozer?
- In addition to working in an interesting technological cross-section, you are also targeting a disparate group of personas. Who are you building Dozer for and what were the motivations for that vision?
- What are the different use cases that you are focused on supporting?
- What are the features of Dozer that enable engineers to address those uses, and what makes it preferable to existing alternative approaches?
- Can you describe how Dozer is implemented?
- How have the design and goals of the platform changed since you first started working on it?
- What are the architectural "-ilities" that you are trying to optimize for?
- What is involved in getting Dozer deployed and integrated into an existing application/data infrastructure?
- How can teams who are using Dozer extend/integrate with Dozer?
- What does the development/deployment workflow look like for teams who are building on top of Dozer?
- What is your governance model for Dozer and balancing the open source project against your business goals?
- What are the most interesting, innovative, or unexpected ways that you have seen Dozer used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dozer?
- When is Dozer the wrong choice?
- What do you have planned for the future of Dozer?
Contact Info
- @pelatimtt on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Dozer
- DataRobot
- Netflix Bulldozer
- CubeJS
- JVM == Java Virtual Machine
- Flink
- Airbyte
- Fivetran
- Delta Lake
- LMDB
- Vector Database
- LLM == Large Language Model
- Rockset
- Tinybird
- Rust Language
- Materialize
- RisingWave
- DuckDB
- DataFusion
- Polars
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Hex: ![Hex Tech Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zBEUGheK.png) Hex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at [dataengineeringpodcast.com/hex](https://www.dataengineeringpodcast.com/hex) and get 30 days free!
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. Modern data teams are using Hex to 10x their data impact. Hex combines a notebook-style UI with an interactive report builder.
This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format with the whole org. In Hex, you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain, and document your code. The best data teams in the world, such as the ones at Notion, AngelList, and Anthropic, use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate and produce work that has an impact.
Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team. Your host is Tobias Macey, and today I'm interviewing Matteo Pelati about Dozer, an open source engine that includes data ingestion, transformation, and API generation for real-time sources. So, Matteo, can you start by introducing yourself?
[00:01:51] Unknown:
Hello. I'm Matteo. I'm originally from Italy, based in Singapore. I've been in engineering for the last 20 years, and for about 12 or 13 years working on data. I've worked for startups and mostly financial institutions, was part of DataRobot quite early, and set up the data team at DBS in Singapore. And right before starting Dozer, I was leading data engineering for Goldman Sachs in Singapore, covering Asia Pacific.
[00:02:24] Unknown:
And do you remember how you first got started working in data?
[00:02:27] Unknown:
Well, I first started working in data when I was at Nokia, actually. I think that was around 2008. There was no big data at that time, but we were processing log data from the telco equipment. So in reality, it was big data, even if the term "big data" hadn't been invented yet.
[00:02:51] Unknown:
And now bringing us up to what you're building at Dozer, can you give a bit of an overview about what it is, some of the story behind how it came to be, and why you decided that you wanted to invest your time and energy in it?
[00:03:03] Unknown:
Yeah, sure. So the problem that we're solving at Dozer is fundamentally exposing data for integration into customer-facing applications. How I encountered the problem was when I was at DBS, where we had to build an entire data infrastructure layer for serving APIs to mobile applications. It was across multiple countries and multiple products, and it was a massive project with a lot of moving parts, a lot of custom code to be built, a big team, and it took quite a bit of time. So I solved this problem at DBS.
I saw a similar problem when I was at Goldman Sachs. Then, talking with friends in the data space, I noticed that it's apparently a simple problem, but in reality there is a lot of complexity behind it. That's how my cofounder Vivek and I started thinking about Dozer. We started doing some research, and we found other projects that tried to address similar problems. One of them, which is pretty well known, is a project from Netflix called Bulldozer that solves a similar problem; that's where we got the inspiration. And then we started thinking, okay, this problem has legs. So we started fundraising, we started the company, we started the open source project, and here we are.
[00:04:43] Unknown:
And in terms of making it open source, what was your decision process that led you to that conclusion?
[00:04:49] Unknown:
Well, you know, open source is getting more and more popular, and if you look at the data space, pretty much every project is open source. I don't believe today it's possible to make a project successful in the data space without being open source. So for us, it was kind of a given
[00:05:16] Unknown:
that this had to be open source. In terms of the actual problem space that you're addressing, as you noted, there's overlap with a number of different technologies and use cases, and I'm wondering what it is about Dozer, at the overlap of the Venn diagram across those different technologies, that makes it a better fit for the specific problems that you're focused on versus any of those other solutions. In particular, things like Flink for streaming, or Airbyte and Fivetran for data integration, or things like CubeJS for data APIs. I'm just wondering what it is about Dozer and that particular intersection of problem spaces that made it useful and necessary to build this product.
[00:05:59] Unknown:
Yeah, that's a really good question. So, the way we position the product is, as you mentioned, if you want to bring up an entire infrastructure for serving data, you have to put together multiple technologies. Then you have the problem of real-time data versus batch data. You have different types of caching, different types of lookups that you have to perform. So the technologies keep multiplying, and the infrastructure and the architecture become more and more complex. Now imagine you have a product engineering team that needs to integrate data with a customer-facing application. They face the problem of talking with the data engineering team, and they basically talk a different language. So what we want to enable with Dozer is the ability for a product engineering team to be independent and say: okay, I can source my data from my data lake, from my operational systems.
I can source everything in real time. I can do some transformation, because what is available in the source systems is not exactly what I want, and I can easily expose it as an API and integrate it with my customer-facing application. And I want to do all this without bringing up an entire infrastructure; I can have a simple tool to do that. Another consideration is the fact that we feel the world of data engineering is changing a lot. It has been primarily dominated by JVM-based tools, and most of them are, as you say, distributed architectures. In many situations, you don't really need that kind of complexity.
With new languages like Rust, and new processors coming out, ARM-based processors with a large number of cores, we feel it's possible today to achieve on a single machine what before needed a distributed architecture. So you put everything together and say: okay, Dozer is a tool that allows product engineers to build these entire end-to-end experiences.
[00:08:34] Unknown:
And because of the fact that there are multiple steps in building that experience, and that has largely been addressed by a number of different technologies that have to get stitched together, what are the major points of friction that you have seen teams run into as far as building those solutions with those different technologies? And what are some of the elements of the user experience that you're designing into Dozer to reduce that overall friction?
[00:09:00] Unknown:
Yeah. So the friction that I've seen is the typical friction that you have in an enterprise between your product team and your data engineering team. The data engineering team focuses more on the big picture, and they want something like the perfect data platform to be implemented in the enterprise. The product team, meanwhile, wants something that is readily available to be consumed via API. That's a friction that I've seen. And sometimes pulling data from your data platform might not even be enough, for example because you need real-time data and your data platform doesn't have a full real-time streaming infrastructure yet. That's a challenging problem for the data engineering team.
The other aspect that is very important here is that we focus on serving data to customer-facing applications, so you have a different set of problems from the data engineering team's. Data engineering has primarily been used for internal consumption, so if a report is not ready one day, that's a problem, but it's not a critical issue. When you start putting that data in front of the customer, you obviously need another level of reliability, because it's data that needs to be available 24/7 at a particular SLA.
So that's what we also address in Dozer: the monitoring of the system, or let's say the data observability from this point of view, and understanding when problems happen and being able to resolve them, because you're serving customers. That's the philosophy behind it. Obviously, we are very early right now, but that's our vision.
[00:11:14] Unknown:
Another aspect of this cross-section is that there are a number of different personas or roles that are typically involved in some of the different desired end goals. For building a data API, it might be an application engineer who just needs to integrate data into their application. It might be a data engineer who's trying to provide output to a machine learning or analytics use case. It might be a data scientist who's trying to build some experimental system to determine whether or not the direction they're going is the right path. And I'm wondering, because you do have so many different target audiences, how that also factors into the way that you think about the design and usage of Dozer to be able to address the needs of each of those personas?
[00:12:03] Unknown:
Yeah. Fundamentally, the whole idea goes back to simplicity of usage. You just connect your data sources, you express your transformation using SQL that is translated into a streaming pipeline, and the data is cached behind the API. These are usually three distinct components: a streaming ingestion component, a transformation engine, and a caching layer plus API. Our philosophy, because we want to give a single developer or a small team the possibility to orchestrate all of this, is to say: okay, this is just a configuration file.
We use YAML, and we're going to have our cloud offering with a UI available really soon. But fundamentally, you can express ingestion, transformation, and caching all in a single configuration, so that you can bring up the entire infrastructure. Some people call it "data as code", actually, which is comparable to the philosophy we have.
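To make that "data as code" idea concrete, here is a minimal sketch of what such a single-file configuration could look like, assuming a Postgres source. The key names and connector settings are illustrative assumptions rather than Dozer's exact schema:

```yaml
# Hypothetical sketch of a single-file "data as code" configuration:
# ingestion, transformation, and API endpoints declared together.
app_name: orders-api

connections:
  - name: orders_db            # source ingested as a change stream (CDC)
    config:
      Postgres:
        host: localhost
        port: 5432
        user: app
        database: shop

sql: |
  -- transformation expressed as streaming SQL
  SELECT id, status, amount
  INTO orders_view
  FROM orders;

endpoints:
  - name: orders_view          # exposed as a low-latency API
    path: /orders
    table_name: orders_view
```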
[00:13:23] Unknown:
And digging now into Dozer itself, can you talk through some of the ways that it's implemented and the architectural elements of it that make it a better fit for this end to end data integration and delivery use case?
[00:13:39] Unknown:
Dozer is fundamentally a self-contained tool. We didn't want dependencies on external tools like Kafka or distributed key value stores, so we wanted to keep it very simple. It's all implemented in Rust, which gives us the performance. And it's fundamentally three pieces. One piece is the ingestion part. We always treat data coming in as streaming data, whether it's actually streaming or not. To give you an example: for relational databases, we use CDC to capture all the changes. But even when we are dealing with Snowflake or Delta Lake, which are not strictly real time, we basically capture change streams to detect what has been changed, and we update the state.
After the ingestion, we fundamentally have a streaming SQL engine that has been built from the ground up. This is, let's say, more or less ANSI SQL; we support pretty much all the complex operations: aggregations, joins, etcetera. This allows a user to create a model, by joining and aggregating sources, that will then be stored in the caching layer. Once the data is transformed, everything is stored in our caching layer, which is based on an embedded database; we leverage LMDB, actually, which is a very fast memory-mapped key value store. And on top of that, we have the query API. Now, this is the execution engine. Obviously, on top of it there is a lot of machinery to handle APIs, like, for example, automated API versioning. So what happens when something changes in the source?
We automatically detect that change. The typical problems that engineers have been dealing with around APIs, we are bringing the solutions to those problems to the data world. Now, all of this can be run as a single binary, or it can be run as different binaries. But the fundamental idea is that you can just do a brew install, put down a YAML configuration file, and you have one process connecting to the database, doing the real-time processing, and exposing low-latency APIs. That's how it works. Another thing to note about the philosophy, the approach that we have taken, is that everything is pre-aggregated. The idea is that complex aggregations and joins you do in the SQL layer, and the simple operations you do directly on the cache. The cache allows you to do filtering, sorting, basic operations. If you have anything more complex than that, you do it upstream. That's the fundamental idea. That allows us to guarantee millisecond, even sub-millisecond, latency on every query.
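As a sketch of that pre-aggregation philosophy: the heavy operations (joins, aggregations) are expressed upstream in the streaming SQL, and the cached endpoint only answers simple filter and sort lookups. The dialect details and endpoint options below are assumptions for illustration, not Dozer's precise syntax:

```yaml
# Illustrative only: complex work happens upstream in the streaming SQL
# engine; the LMDB-backed cache serves simple lookups at low latency.
sql: |
  SELECT c.customer_id, c.country,
         COUNT(p.payment_id) AS payment_count,
         SUM(p.amount)       AS total_paid
  INTO customer_totals
  FROM customers c JOIN payments p ON p.customer_id = c.customer_id
  GROUP BY c.customer_id, c.country;

endpoints:
  - name: customer_totals
    path: /customer-totals
    table_name: customer_totals
    index:
      primary_key: [customer_id]   # cache-side access stays to filtering/sorting
```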
[00:17:14] Unknown:
And building up a streaming engine and a SQL engine in the same unit is definitely a very complex set of engineering challenges. I'm curious what are some of the sources of inspiration that you drew from to be able to understand how to address this problem, and some of the off-the-shelf components that you were able to use to build out this overall architecture?
[00:17:38] Unknown:
Okay. Yeah. So, we didn't use many off-the-shelf components. We use LMDB, definitely; that's the caching layer that we're leveraging. For the caching layer, on top of LMDB, we've built some indexes, an additional layer. And, you know, there is a lot of literature about that and a lot of projects doing that as well. On the streaming SQL engine, obviously, we got some inspiration from the tools that are out there. For the caching layer, we leveraged LMDB; it's a database we like a lot. We actually did a lot of testing comparing RocksDB versus LMDB.
I used LMDB a lot in the past, and I like it a lot. RocksDB is very good for heavy writes, while LMDB is very good for heavy reads, and that's why we decided to go in that direction. Talking about the streaming SQL, obviously, that's a complex piece of code, but other companies are embarking on this as well. We got inspiration from the traditional streaming SQL tools like Flink, for example. The value for us is being able to give the end-to-end integration of all of this without letting the user
[00:19:17] Unknown:
handle all these moving parts. As far as the evolution of the project, I'm wondering what are some of the, design and goals of the project that have changed since you first started working on it and some of the assumptions that you had going in that have been updated in the process.
[00:19:35] Unknown:
Yeah. So, I would say that when we started out, we had this idea of a single-binary deployment, actually. That was how the project started, and that's how the first version was. Later on, we started to work on our cloud version, and, you know, not everything is best suited to a single-binary deployment. Our cloud deployment is entirely based on Kubernetes, so we realized the need to separate out some components, and that's what we are actually doing right now. So I would say that we still want to be able to run it as a single binary, but sometimes, for operationalization, that's probably not the best. That's why we are starting to separate it into various components. That's probably the biggest change that, architecturally,
[00:20:58] Unknown:
we have done. And so for people who are getting started with Dozer, who want to deploy it and integrate it into their existing infrastructure and start pulling data from their various sources, I'm wondering if you can talk through the overall process of getting that set up, integrated, and configured.
[00:21:15] Unknown:
Yeah. So the process is documented on our website. Fundamentally, you can download the binary or use a Docker image. And we have a couple of samples as well, where we show how we connect to a database, how we pull data from different sources, how we combine them together, etcetera; we are constantly adding samples. It's fundamentally writing a YAML file where you define the sources you want to connect, the SQL transformations that you want to run, and the endpoints that you want to expose. And that's pretty much it, actually. Then you start Dozer, and automatically you have APIs.
We are working on more details on the deployment. As I mentioned, we are also working on our cloud version, which we plan to release next month, and on more and more samples. If you want to get started, it's straightforward: it's just two lines of code. And we also have a couple of videos with tutorials showing how to do it.
[00:22:48] Unknown:
And once it is deployed and you've got it connected up to some of your different sources, can you talk through the developer workflow and the process involved for teams who are building out some of these different data applications, and some of the elements of access control, multi-tenancy, things like
[00:23:08] Unknown:
that? Yeah, and I forgot to mention that we have client integrations as well. We have a JavaScript client and a Python client. So if you are consuming the data from your web application, or from your data science application using Python, we provide a client that easily integrates with the API and with authentication. On the authentication side, we don't really handle authentication ourselves; we let you handle authentication yourself. It's all JWT based, and we handle authorization at the role level.
Field-level authorization is coming soon. And that's how easy it is to integrate. As I mentioned before, we handle deployment of new APIs and versioning of APIs. That means that whenever you change your model, say you update your SQL because something changed, or even the source changed, we detect that something is wrong and notify you. At that point, you can update your SQL and deploy a new version of the API, and we let you switch over the API when you've done the integration. So, fundamentally, it's kind of like a blue-green deployment for your data. That's how the workflow looks from, let's say, a product engineer's point of view.
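As a rough illustration of the JWT-plus-roles model described here, an API security section could look something like the sketch below. These field names are hypothetical, meant only to show the shape of "you bring authentication, authorization is enforced at the role level":

```yaml
# Hypothetical sketch, not Dozer's actual schema: authentication is
# delegated to your own identity provider (JWT), while authorization
# is applied per role at the endpoint level.
api:
  security:
    jwt:
      issuer: https://auth.example.com   # tokens minted by your auth system
      audience: data-apis
  authorization:
    roles:
      - name: mobile_app                 # role claim carried in the JWT
        endpoints: [customer_totals]
      - name: internal_dashboard
        endpoints: [customer_totals, orders_view]
```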
[00:24:53] Unknown:
Other aspects that are interesting to talk through are some of the elements around testing, validation, working in preproduction environments, and promoting changes. I'm wondering if you can talk through some of those aspects of the development workflow and the maintenance and evolution of these applications that are built on Dozer?
[00:25:14] Unknown:
Yeah, right. So this is something that is not much addressed in the open source version; it's something we are implementing in our cloud version. Our cloud version will have the full life cycle of API management, allowing you to deploy in, let's say, a preproduction environment, test the API there, and then promote it to a production environment. This is not going to come out next month; it will likely come out around September, but that's what we're working on.
[00:25:58] Unknown:
Another aspect of any project that is open source but has a business behind it is the question of governance and being able to balance the needs of the open source project with the goals of the business and sustainability of both. I'm wondering if you can talk through some of the ways that you're thinking about those problems.
[00:26:17] Unknown:
Yeah, right. So the philosophy that we have is that open source will always be there, and we will always support it. Our idea is that the open source version will have all the features, but the scalability of the APIs is up to the user deploying it. So all the features will be available; there will be no features that are cut down in terms of, for example, SQL, connectors, or APIs. What really is the added value of the cloud is the peace of mind: API auto-scaling, global data distribution, etcetera. That's the way we are thinking about it.
[00:27:12] Unknown:
And recognizing that it's still fairly early in the project, I'm wondering if you can talk to some of the most interesting or innovative or unexpected ways that you have seen Dozer used. Yeah, sure. So as I mentioned, we are very early. But, you know, we started having different
[00:27:26] Unknown:
types of users leveraging Dozer for different applications. I would say that one application that came up in multiple situations is, for example, payments, where typically, to run a payment system, you have to run multiple transactional systems. And one of the requirements that you have in a payment system is being able to aggregate all this data and give a unified view to your user. That's where we have seen Dozer being used by multiple companies, because payments is a scenario where you need real-time information. Once customers have made a payment, they immediately want to see what's going on, and querying all these transactional systems in the back end is not really feasible.
So there is the need for a unified view that is updated in real time. That's one of the interesting use cases we have seen. There is another use case that came up that is more experimental: integration with LLMs, for example. A lot of people are experimenting with vector databases, using them as a knowledge base for an LLM; imagine building a chatbot answering questions about the knowledge base. But if you take, for example, a bank, a knowledge base alone is not really enough. You need to have a customer 360 in order to personalize the LLM responses for that user. Now, that customer 360 needs to be built and maintained in real time; it needs to be up to date. And in order to build it, data comes from multiple sources. So you need a system that allows you to collect all this data, create a unified profile, and be able to serve that profile at low latency when it is provided as context to the LLM. That's another interesting use case that came up. It's very experimental, but we thought it was quite unique and quite creative, actually.
And in your experience of building Dozer and trying to grow the business and the community around it, what are the most interesting or unexpected or challenging lessons that you've learned in the process? So, from a technology point of view, I don't think there have been many surprises, I would say, because I've been in technology for more than 20 years, so that part is easier. I think most of the surprises came from the community aspect and community building, because I'm basically a newbie there, and I'm learning my way. It's incredible, when you put something out, the curiosity of people and how much they are willing to help and contribute to the project, even if the project is very early. As an engineer, you always want to wait to put out a project that is almost ready, so as not to burn yourself.
But many people say that's not right; I mean, it's debatable, actually. Even putting the project out relatively early, I believe you get a lot of support from the community. They want to be involved; they want to learn more about it. And it's incredible how, when you put out a project that maybe is not fully ready yet, people still want to try it, still want to solve the problem and contribute back, instead of saying, okay, this doesn't work, and dropping it five minutes later. That is quite incredible.
[00:32:14] Unknown:
And for people who are interested in building APIs on top of real time data sources, what are the cases where Dozer is the wrong choice?
[00:32:24] Unknown:
Dozer, you know, integrates an engine that pre-aggregates data. So Dozer is a perfect choice when you know exactly what you want from the API and you know the consumption pattern. If you have a situation that is more exploratory, where you don't know exactly what you need and you want to run different queries, most of them OLAP queries, Dozer is definitely not the choice. There are other tools that allow you to ingest all the data and run OLAP queries on top of it. You can think about tools like Rockset, for example, or Tinybird. The philosophy of those tools is that you take all your data, you pump all the data into the system, and then you run queries in semi real time there.
Dozer, on the other side, is much more efficient when you say: I know exactly what I want, I know what kind of queries I'm going to be doing, and I want the lowest possible latency. That's when Dozer fits very well.
[00:33:47] Unknown:
And as you continue to build and iterate on Dozer, what are some of the things you have planned for the near to medium term?
[00:33:55] Unknown:
So, near to medium term, to be honest, we are now fully focused on the cloud deployment, actually, and, at the same time, on enabling more scalability in our open source release as well. Another aspect is on the connectors side: expanding the connectors. We have a limited set of connectors now, and that's what we're working on, expanding them. And third is the UI. Right now, Dozer is fully based on a CLI and a configuration file, so we are working on a UI layer to make it easier to use for people who prefer that. Obviously, there are also longer-term items that we're working on: what I mentioned before, the monitoring and observability, and more work on the security aspects, as I said, like field-level authorization.
So this is what is on our plate in the coming months.
[00:35:22] Unknown:
And to the point of the source and destination integrations, and also just the overall extensibility of Dozer, what are some of the interfaces or plug-in options for people who are using Dozer to be able to extend it and innovate on it for themselves?
[00:35:41] Unknown:
Yeah, okay. In terms of sources, currently we support Snowflake, Delta Lake, Postgres, and S3 files. We have a connector for the Ethereum blockchain as well. So these are the connectors we currently support. Another one we support, which I forgot to mention, is not strictly a connector, but is basically a gRPC endpoint. So you can actually pump in data from your own code as well. The connector model is very easy to implement. In fact, we have a couple of contributors who are working on a MySQL connector implementation, and another who is working on a MongoDB connector implementation.
So that's pretty straightforward to implement. On the sink side, we don't really advocate for having multiple sinks. I mean, we have a streaming SQL engine behind it, but we don't sell ourselves as an agnostic streaming SQL engine. Our sink is our caching layer, which is tightly integrated, because our value proposition is being able to serve APIs. There is one thing that we are exploring, actually, because it came up a couple of times: the ability to plug in an external key value store.
That's something we have started looking at, but we are not sure about it yet. Fundamentally, though, most of our extensibility will be on the source side.
[00:37:38] Unknown:
Are there any other aspects of the Dozer project in this space of end to end data integration and delivery that we didn't discuss yet that you'd like to cover before we close out the show?
[00:37:50] Unknown:
I think we discussed pretty much everything. I mean, I cannot think of other things that we didn't cover.
[00:38:01] Unknown:
Alright. Well, for anybody who wants to get in touch with you or follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:38:17] Unknown:
Well, as I mentioned before, I think that there is a revolution in progress in the data management space that is driven by Rust. That's what I truly believe. We are going into a direction, a shift, where we used to have fully distributed systems, and now we have started realizing that maybe that's not needed. Maybe let's go back to the roots: back to the single machine with many cores, and with much more efficient languages like Rust. And that is happening both in the streaming and in the batch space. In streaming, there is Dozer, and there are a lot of other projects like Materialize, RisingWave, and, as you mentioned, CubeJS, all written in Rust. But if you look at the batch landscape as well, there is a lot of stuff happening there. A lot of people are excited about DuckDB, DataFusion, Polars, and I believe these tools will completely
[00:39:33] Unknown:
change the landscape of data engineering. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on Dozer. It's definitely a very interesting project and platform. I appreciate the time and energy that you and your team are putting into bringing it into the world. So thank you again for taking the time today, and I hope you enjoy the rest of your day. Thank you very much.
[00:40:02] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and just tell your friends and coworkers.
Introduction to Matteo Pelati and His Background
Matteo's Journey into Data Engineering
Overview of Dozer and Its Origins
Decision to Make Dozer Open Source
Dozer's Unique Position in the Data Ecosystem
Challenges in Data Integration and Dozer's Solutions
User Experience and Personas for Dozer
Technical Implementation of Dozer
Evolution and Design Changes in Dozer
Getting Started with Dozer
Developer Workflow and Access Control
Testing and Validation in Dozer
Governance and Open Source Strategy
Interesting Use Cases for Dozer
Lessons Learned from Building Dozer
When Dozer is Not the Right Choice
Future Plans for Dozer
Extensibility and Integration Options
Conclusion and Final Thoughts