An Opinionated Look At End-to-end Code Only Analytical Workflows With Bruin

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Imagine catching data issues before they snowball into bigger problems. That's what DataFold's new monitors do. With automatic monitoring for cross database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies

and anomalies in real time right at the source.

Whether it's maintaining data integrity or preventing costly mistakes, DataFold monitors give you the visibility and control you need to keep your entire data stack running smoothly.

Want to stop issues before they hit production?

Learn more at data engineering podcast.com/datafold

today. Your host is Tobias Macey, and today I'm interviewing Burak Karakan about the benefits of building code only data systems and the work that he's doing at Bruin. So, Burak, can you start by introducing yourself? Hi, Tobias. Thanks to you. Thanks a lot for having me. My name is Burak, born and raised in Turkey. I'm a software engineer by background.

I studied computer science, then I moved to Berlin in 2018

to work for a company called HelloFresh.

I was thinking I would work there for, like, a year or 2. I ended up working a little bit more than 5. And,

end of 2023,

I quit my job to build Bruin together with my cofounder, Sabri. We are building, like like you said, code only

data platform and tooling around all of that.

And do you remember how you first got started working in data?

Of course. Well, I mean, when I was at HelloFresh, it was, I was working I was leading one of the core core teams on the on the software side. We were building a lot of, we were generating a lot of data, a lot of a lot of events, and and and data that we stored and published was was being used by the data team. So I was part of those conversations a lot. But more interestingly, my cofounder and I, we started doing some consultancy

for for small and mid midsized companies in,

beginning of 2021,

let's say. And, that was, like,

the first

direct experience to see what are some problems that people are are so

hurt that they're willing to pay money for. You know? And, and that's when that's when we started working a lot more with data. That's when we we got exposed to some of the problems,

and, eventually, we ended up, building some solutions,

around those problems.

So this is, like, I think, my relationship

with data. Before that, I did, like, a bit of bit of ML, but I was primarily working on the software side and then gradually transitioned to data.

And so digging into Brewin, I'm wondering if you can describe a bit about what it is that you're building, some of the story behind, how it came to be, and who you're focused on as the target audience. Like, who are you solving these problems for?

Absolutely. So a little little bit of backstory around this. So when we started working with some of these companies,

these were, like majority were gaming companies, mobile gaming companies that that collect a lot of events. They have, like, a lot of clickstream data. They their business relies on being able to analyze that data. Sort of one of the first hires they're gonna be making is gonna be a data scientist or a data analyst. You know? That's like a naive way of, okay. We have a lot of data. Let's make sense out of it. But very quickly, they start running into, data engineering challenges, especially when you think about, like, the lack of infrastructure in, like, a gaming company that has never worked with data before.

It becomes pretty painful for these companies to sort of make any sense of that data and bring couple of different pieces of data together to make any analysis.

This is when we first got exposed to these problems these companies were having when as part of that consultancy that I was talking about. Basically, there was a lot of companies that had data analysts, data scientists, and maybe data engineers

that did not have the infrastructure in place that needed to do a significant investment into their data infrastructure

to basically being able to run, like, a regular query in a data warehouse. You know? So this ended up pushing us towards thinking that, okay. Maybe there's a better way. We could build a solution that takes care of the problems of these small teams

without having to invest a significant

amount of time, effort, and and money to making the data usable.

That's that's how Brew and born. When we looked at the market at the time, and then I think it still is, like, the the heydays of modern data stack, you had a lot of different tools for a lot of these different jobs.

You if you wanted to ingest data, you had to host a tool or you pay for Fivetran or Stitch or whatever, then you have to figure out a way to transform this data. Then you have to figure out a way to orchestrate all of these. Then comes data quality, then comes governance, then comes observability.

All of these were things that, in the end, the same team had to deal with. You know? So when we were thinking about, okay, we're gonna solve these problems for these teams, these analyst scientist heavy teams, at the same time, well, it's gonna be the same people that has to jump through all of these different tools. Could we sort of build a

a consolidated

way of

dealing with data? And that's that's how Bruin born in the end. Bruin is a company. We have to we have a couple of products.

We are we're open core product, so,

we have

a command line application called ingestor. It allows

copying data from one place to another. It's open source. It's on GitHub.

We

I think by the time the the the conversation will be released, it will be live. We have a command line application that allows building and turn data pipelines, which is also open source. And then we have, like, a commercial cloud platform as well. But, basically, we're trying to build all of these to be to be built around open open primitives so that people can get the benefits of all of these, different tooling we build and at least run them in their own infrastructure if they needed to. To your point of having this single tool chain that goes end to end,

I guess it's also worth calling out what is the beginning and what is the end of those points. Like, what are what are the pieces that you're actually focused on solving for? Is it analytical use cases? Is it, you know, reverse ETL? Are you solving for AI applications?

There's there's a wide scope of what a pipeline might do these days.

Exactly. I would say it's primarily the analytical use cases. So what what Bruin does is, effectively,

it is a tool chain and a platform that brings together data ingestion, data transformation,

data quality, and governance.

So, basically, if I were to sort of try to visualize

the data stack, so at the bottom, you have, like, the your data warehouse. And vertically on top of that, you have your on the right side, you have your BI tool. And on the far left, you have, like, bunch of different sources, external source of data. You have Facebook, Google, databases, whatever. So Bruin aims to fill that gap between tries to close that rectangle, if that makes sense. So you have, like, your BI tool. We don't touch we don't do anything with regards to, like, data visualization. We integrate with these BI tools. But, effectively, anything before that, Bruin tries to,

simplify

and and integrate with existing tooling there. And another interesting aspect of the approach that you're taking, in addition to being code only, is that you are trying to be this unified interface for being able to work across these systems, whereas a few years ago, there was the disaggregation of all of the different pieces, which gave rise to the ill fated modern data stack that is that's faded from common parlance.

And I'm curious if you can talk to some of the lessons learned in that phase of disaggregation

and every tool for itself, and these are the different interfaces. And so everything can be composed however you want

to where we are now, where you're building this single tool chain to try and have that end to end workflow.

And maybe some of the feedback that you've gotten from people who are starting to use that more unified approach and maybe in juxtaposition

to their previous experiences of trying to cobble together all these different tools? Yeah. So

I think, like, this this bundling and unbundling cycle was, like, very common in every industry, and I think we're going through a a rebundling phase in in the data industry. But I don't think it's repeating itself in that sense. I think there's a lot of learnings that are being carried over and over. With the rise of modern data stack and individual narrowly focused tools getting deep into what they're doing was that, okay, you get the best of breed and and you figure out, like, whatever fits into your exact use case the best, but it also came with the with a couple of problems. So first of all, with that many systems being involved and that many ways of doing things, it becomes pretty hard for anyone to track what's happening to to their data. You know? But, like, who's where where do I get this data from? Who touches it? Who can access it? Where do I use this in in in the rest of my architecture, etcetera? So, like, that observability was starting to become a major problem. And also, like, maintaining and building all of these tools and getting them to talk to each other is, I mean, it's an engineering heavy process. Like, there's no way to walk around that. Let's see. It requires a lot of knowledge around all of these tools to to get them to to to work together. You know? Another, like, good aspect of that was it was composable. Right? It allowed you to I don't know much in practice if anyone were able like, were taking up one of them and putting another one back once the stack was built. But, effectively, you had the understanding that, okay, you can use one tool for data quality or you could exchange it for another another one, and that gave some flexibility. Now coming into the new way of working, which is, like, basically trying to understand if if we could build a consolidated experience without any sort of lock in around this, a couple of things we learned was, first of all, code driven is very successful. Like, look at dbt. Right? It's like it brings the best practice around software engineering into into data workloads. Well, a lot of it is, like it comes from, like, the fact that you treat it as a software. You version control it. You apply all of your regular checks and validations onto onto that. So that comes to this world compared to the other bundled examples of, like, Oracle tools and whatnot. Together with that, I think a big chunk of what we're trying to do today is while building that cohesive experience,

how do we make the infrastructure and and the tooling you build around them to be composable still. Right? It does not have to

support every possible

way of working, but it should allow flexibility for people to go outside. And the example I always give when I'm talking about this is the drag and drop UIs for for building data platforms.

It basically forces you into working with the way of the tool, and it practically locks you in if you wanted to move out a year from now. And one of the things that we took as, like, a core principle when we were building Bruin is, okay, we are gonna build an opinionated way of working, but it should also be extensible enough. For instance, we do a lot of stuff around SQL pipelines,

but we also allow people to run Python workloads natively. You know? We have prebuilt connectors for a lot of sources and destinations,

but if there's anything we do not support or people are not happy with the way we're ingesting data, they could always write their own connectors in Python Python where they could write any of the logic they want. Or if they wanted to run a very simple regression model, they could write a few lines of Python in the same pipeline without having to figure out or learn a completely different tool. You know? So long story short, as part of that, like, all of those learnings, I think we try to take, the composability,

the extensibility

into the new world while also having, like, a strong focus on they need to talk to each other kind of mentality. You know? That's why, like, with Bruin, you can run, your data ingestion. You can run your transformations. You can run Python. You can run ML models. You can run data quality stuff, all as part of the same pipeline, same lineage, same observability principles.

In terms of, like, our users, what what do they do they get out of that is basically the amount of time they spend on infrastructure. The amount of time they spend on getting these tools to work with each other is close to 0 at this point. And I don't want to I don't mean this just for, like, the cloud and whatnot, also for the command line application. You just throw it in negative action. It just does as it does on the cloud platform anyway, you know, which means that you just push. It works natively.

You don't need to think about any other bit of infrastructure.

It does data ingestion. It does Python. It does SQL, you know, so you get all of those benefits.

And all open source, so you're not locked into anything. All lives in your repo.

All is based on Git. So, basically, try to hit a sweet spot of these, these different concerns.

And to some degree, it brings to mind the

original vision of Meltano before they pivoted to focusing more on the singer ecosystem

sort of as it existed when it was still within the bounds of GitLab of being this end to end tool chain,

code first, very DevOps and CICD friendly.

And as opposed to what you're saying a lot of these UI driven workflows or the the disaggregated

stacks where you have to stitch together everything by yourself where even though there are these standard interfaces of, oh, you just load everything into your data warehouse, and then you go over to your data warehouse, and you do something else over there, and then you go to your BI tool and do something else over there. That led to a lot of edge cases where if you're on the happy path of what everybody else is doing, then, sure, it's fine. But if you're trying to do anything even slightly

divergent from that, then you're on your own and you're trying to bushwhack through the jungle. I think a lot of it there are a couple of things that I think makes,

like, bundling

solutions like like ours

a good time now. There's a couple of advancement, but the the most important one I see is I think personas

and and people are changing over time. You know? One of the things that we're betting on, one of the things that we see with our our users is there is a new role, new persona emerging, kind of like analytics engineers, right, which is like a good good middle point between, like, your data analysts and your data engineers. And these people are going further full stack if they have the right tools. So now these these people are highly technical. They can write code. They're not necessarily software engineers. They they might not wanna build full blown infrastructure, but give them their ability, and they're gonna write to whatever they they're gonna write to code for whatever they need.

And I think so we're we're betting on the growth of this kind of, like, user persona

within the industry where kind of like what happened to software engineers. Give them these right primitives, and they're gonna start going further to to different extents of this. And I think things like AI models getting better at writing code and helping people,

data warehouses, cloud data warehouses, standardizing on different different languages, solutions like SQL got existing, for instance, sort of all contribute to making it, like, the perfect bet for growing a new sort of an integrated way of working with data. You know? And that's what what we're focusing on. That that's why I think it's it's a good time to to build, like, a solution compared to maybe 10 years ago.

And continuing on that discussion of when you're within the bounds of what the tool creator envisions as the happy path, then things generally work. But as soon as you wanna do something custom, then you have to try and cobble together a whole mess of other solutions.

How does that principle factor into the way that you're thinking about the primitives within BRUIN, the technologies that you're building on top of, and some of the ways that you have that escape hatch for people who do need to do something custom but don't wanna have to throw away the whole tool chain to do it? So we had a couple of ways that we could go with building these kinds of, tools, which

the first one was like, okay, focus on transformations, focus on transformation pipelines,

focus on SQL. Right? And we see this with dbt. We see the SQL mesh. We see this with SDF. We see it like with a lot of other tools.

Basically, treat SQL as a first class citizen, and then and then the rest follows. Right? But in our experiences, it it never worked that way. And we saw that, okay, the moment that you needed to introduce a little bit of Python or a little bit of another language into your pipeline, you have to figure out something completely different, which means you had to host an airflow cluster or you need to run DAXTER or you need to run Cron jobs elsewhere. You lose the data lineage. You have a lot of these complexities. Right? So we basically

built Bruin around the primitives that it's all about an asset, and an asset can be a SQL

model. It could be a Python asset. It can be a machine learning model, but it also can be an Excel file. It could be a parquet file that lives in s 3. You know? So BRIN is basically built around the assumption that anything that generates value using data could be a data asset, and it should be possible to represent those assets as part of your pipeline regardless where the execution happens. Right? So this allows you to sort of separate those definitions from the executions,

and it also allows you to expand your pipeline to cover further use cases. And I think a great example of that is is Python. Right? So you can write whatever you can do with dbt, you can do with Bruin as well. But you can also step outside of SQL

with Bruin Bruin's built in, like, Python abilities, for instance. We when we're running Python locally in your computer, we we install the Python distributions.

We set up a virtual environment. We install isolated dependencies there. We take care of all of that complexity, so you just write a few lines of Python script. This allows you to easily escape into whatever you can do with the regular programming language without having to build out a a completely different infrastructure to deal with that. And I think, like, it's it's also like, the industry is going towards there with things like,

Snowpark, like being able to run Python models and

and writing some some some table functions and whatnot in different programming languages. But basically,

I think those

approaches into bringing programming languages into these pipelines is sort of patching the existing ways of working. Whereas if you were to design them as, like, equally important ways of working, equally important types of primitives

in the tool, it opens up different different patterns and different possibilities.

So it was basically a core approach that in the first version of Bruin, we had to support both SQL and Python

as, like, first class citizens. You know? In the future,

I can definitely see us bringing even more programming languages into these pipelines and also more tools. Maybe there's gonna be a like, ingestor is a good example of this. It's like a standalone command line application, but we also integrated it as a type of an asset in Bruin CLI, which means data ingestion jobs are first class citizens in the same pipeline as well. This allows

enforcing data quality checks on the data ingestion jobs. This allows

using them as part of the same data lineage. This allows ensuring that data is always accurate and and they run-in the right order without having to configure different schedules elsewhere. So long story short, treating these different technologies as as first class primitives

allowed us to sort of build an abstraction that brings all of these together within the same same graph, if that makes sense. Absolutely. And digging a bit more into that transformation ecosystem, you mentioned SQL mesh, SDF, DBT. I think SQL mesh in particular stands out as

supporting

Python and SQL on

a level playing field where,

from my understanding,

as long as your Python returns a data frame,

then it treats it the same as if you were doing a SQL transformation, and so you can actually combine those different modeling patterns. I don't believe that they have facilities to be able to extend beyond

Python as a full on programming language, so it's, I believe, fairly

locked into SQL and Python being

those interfaces. But I'm curious if you can talk a bit more to the transformation engine that you're building, some of the ways that it's designed, and some of the juxtaposition

of how it fits within that overall ecosystem of these SQL focused

transformation

tools

that are targeting data warehouse environments?

First of all, I I love SQL mesh as well. You know? I I think the founders are pretty cool people as well, and I think it's a it's a really good piece of technology. And not SQL mesh, but we use SQL GloT in in Bruin CLI as well. So I think they brought in a lot of new stuff that that were not there, especially with DBT before.

I think an important difference that we

sort of position ourselves around is there are

transformation

tools or transformation

frameworks,

whereas what we're building is

a framework for

working with data end to end. Right? So this means that data transformation

is part of what we do with Bruin CLI as well, but same for for data ingestion, same for data quality, same for,

data governance and observability.

Right? So I think an important difference in terms of Bruin CLI is it's more like a a data stack in a box compared to

a data transformation tool as well. Now this obviously comes with a different trade off. For instance, the fact that I'll give an example with SDF. I think they're building, like, a a a really interesting product with very

deep understanding of what happens within SQL,

and I think they would they do this better than we do, for sure. But what we're trying to do sort of is to go to the end to end of that experience

to be able to cover more use cases

with the good primitives

and and a framework around working with all of these in a in a different way. Right? So I think, like, comparing

transformation

tools with, like like, dbt or SQL Mesh with Bruin sort of, in that sense, becomes

apples versus oranges. You know, it's they they sort of aim for different,

different purposes.

One of the things we try we are gonna be doing in the in the next month is gonna be being able to run dbt models as part of proven pipelines as well in the CLI in open source. Right? So this allows

sort of stepping a lay layer above the transformation tools themselves

and still being able to give those primitives of having the execution graph, having the lineage, having data lineage, having data quality

without having to change the transformation tool. Right? So I think, like,

we try to go to the breadth

where they they go to the depth,

And, and we sort of at the moment,

we focus further on making that that breadth experience as seems as possible.

Before we dig too much more into the architecture of Bruin and some of the ecosystem that you're thinking about around that. I'm also interested in touching a little bit more on that code only philosophy.

You mentioned who you're targeting, but I'm also interested in some of the ways that that code only approach maybe acts as a gating factor for certain types of organizations, certain audiences,

and especially as you expand beyond a single team and into the organization level. How you think about brewing existing in a broader organizational ecosystem

where maybe there are heterogeneous

tools used? How do you allow Bruin to be able to

exist within a certain microcosm of the organization

without requiring everyone to be on the same tool stack?

So

one of the things that

we like, in our first

experiences working bringing Bruin to customers was, like, that exact the wall that you mentioned. There are people that that work with data,

but they haven't

used Git much, you know, or they would feel much much more comfortable if they had a UI that allowed them to do certain things in a certain way. But on the other hand, I genuinely believe

that you cannot scale a UI driven

way of working to to an organization, you know, which which just doesn't work. You run into a lot of troubles.

So what we are trying to do to sort of reduce that barrier in terms of what we're doing is, obviously, we target people that that still can can write some amount of code. Like, they can they they should be able to write SQL or Python. We're not I don't think we are the right tool for people that have never worked with these technologies before. But in order to to reduce that barrier of entry, what we are doing is, for instance, we have also an open source

DS code extension

that

visualizes everything that's in your code and gives you a UI in your local device that sort of allows you to edit different parts of things. So maybe you write your SQL and you wanna add define, like, different types of columns or you wanna add some metadata.

You get a little UI that helps you do these, and that UI still translates all of these different things into into code in your repository. Right? So

we try to hit that hit that balance of reducing that barrier to a certain extent and giving people the user interface that can that they can work with

without

losing the benefits of of code first solution. So everything you do in the UI is actually driven by the code. Everything you change is reflected immediately in the code. Code is the source of truth. It does not be the layer that you work necessarily in. Right? But it is the thing that drives the rest of the infrastructure.

So I think in terms of that, like, sort of barrier of entry, we're trying to complement our CLI applications with little open source UIs around them as well. So that definitely helps with the accessibility of seeing kind of what is happening without having to try and parse the code. And so now focusing a bit more on

Bruin itself and the technologies,

I know that Ingestor is a wrapper around the DLT framework. I'm wondering if you can just talk through the overall

architecture and design of the Bruin tool chain and some of the ways that the design and scope have evolved since you first started working on it. Absolutely. So well, I mean, let me start with with Ingestor first because that was like it was a very quick thing that we ended up building and releasing as a stand alone tool. So we started with transformation pipelines and then and then Python and then Bruin was Bruin CLI was doing all of these different bits, but we did not have proper support for native data ingestion. And this was around the time when when DLT was was being released, also a Berlin team, and then we have some some, close contact with the founders there. And, we ended up giving you the spin, and it turned out to be

exactly what we needed. You know? So we were thinking of of bringing,

DLT into Bruin.

Now little complexity there. Bruin CLI is written in Golang, which is not a

very popular choice for for writing data application, I guess. But we we wrote everything in Go there, and and we needed to bring those different Python libraries into that. And while looking into that, we thought, okay. It would have been amazing if there was a

stand alone executable

that we could run just to ingest data so that Bruin CLI could orchestrate it in a platform agnostic way, in a language agnostic way so we didn't have to write Python a lot of Python for

integrating

it. That's how ingestor was born. We took DLT.

We sort of wrapped a command line interface around it. We added, like I mean, we made it a bit more opinionated,

and, and we built it to solve our own problems with proven CLI, but ended up releasing that, and people people liked it. You know? So that's like we sort of wrapped it around that, but I think that helped us a lot when integrating that with into Bruin CLI as well because, essentially, Bruin CLI has, like, a mini orchestrator inside of it. It's written in Golang. It's highly parallelized,

and it can execute these different

binaries or executables

as as, like, Docker images or, like, binaries or

or natively, like like Python or or SQL, etcetera.

Right? So long story short, Ingestor ended up being built as, like, a stand alone tool because we needed a way to run ingestion jobs without bringing Python code into the CLI, and Bruin CLI takes care of takes advantage of ingestor being a standalone tool and runs all of the other parts natively

in Golang. That sometimes introduces a bit of bit of duplicate work or hard hard work. If you use Go for a lot of data workloads, there's, like, there's not gonna be libraries. There's you're gonna have to jump through a lot of hoops. But so far, we've been,

we've been managing that fine. So that's like a that's a brief integration story between them.

Digging more into ingestor and harking back to that escape hatch conversation that we had, I'm wondering

how the work that you did on ingestor allows for

extending the different

connections that you're able to build or maybe being able to write a custom DLT pipeline

and still be able to execute that either via ingestor or via the brew and CLI?

So ingestor itself

has, like, certain sources and destinations it supports. We constantly try to expand that. But if ingestor does not support a source and you still need to get that data,

you don't,

extend ingestor itself. Instead,

Bruin CLI allows you to do all of these different things, but also in in Python as well. I'll give you an example. Let's say you want to ingest data from

Shopify,

and,

Ingestor supports Shopify directly.

Brew and CLI supports Shopify through Ingestor as well, which means you can write a few lines of YAML

that will bring whatever data you need from Shopify into your data warehouse. But then let's say you want to bring data from your

internal

accounting platform.

Then for that, you write a few lines of Python scripts that returns a data frame, and then you put it as a as a Python asset in Bruin,

which means if you run both of these assets with Bruin CLI,

Bruin will run them in parallel. It will use Ingestor for Shopify integration,

and then run your own ingestion logic and then upload the resulting data frame into your data warehouse

for you without you having to to write any extra code for it. Right? So the moment that you want to step outside of just, like, a single data ingestion

job and you want to build that pipeline,

that escape hatch lives inside Bruin CLI.

In practice, it means that your pipeline is gonna be usually a combination of ingestor tasks,

SQL assets, and Python assets that either bring data in or even do reverse ETL at the end of your pipeline. And I think it's a it's a good way to good way to show that that I think we're on the right path in terms of the primitives because

using the same Python assets and Python primitives,

you can use them both for data ingestion, but you can also push data

to to another API that you need to integrate with. Right? Or you can train a machine learning model or you can do predictions or you can

communicate with external APIs. You can send notifications and all that kind of stuff. Basically, having that flexibility

right within your pipeline. I think that that's how I would think about that escape hatch and expanding the ways of working there. Another aspect to talking to being able to move beyond the bounds of what is explicitly in scope of the tool design.

You mentioned, Bruin is focused primarily on that analytical use case. And so for teams who they've used Bruin, they've got their analytics in a good space, and now they are starting to explore more sophisticated applications of data, or maybe they're doing

reverse ETL for data operationalization,

or they're starting to look more into machine learning workflows that require

a different element of orchestration. I'm wondering how you see Bruin sitting in that overall

data ecosystem for that company

and some of the ways that the insights that you get from Bruin are able to be brought into those other contexts where

maybe you have to use the lineage

information from Bruin and populate that into another more overarching

metadata tool to be able to get that full end to end lineage as you go beyond those analytical

use cases and just some of the ways that you're thinking about those design challenges?

I think, like, one of the things that we took I mean, we we try to make sure that Bruin never locks anyone in. And while it has ways of working, it did take rates with with other tools as well. I think you gave a good example with with metadata store. Right?

Let's say you you have certain Bruin assets,

but you also use

Outland for your as your data catalog because you have, like, I don't know, tens of different

sources of, like, types of data, and you catalog everything there. So

we have

built a dedicated

metadata push feature into Bruin CLI,

which allows taking all of those definitions

in your assets

and pushing them to configured external metadata stores. You know?

I'll I mean, the first thing we built for this was was BigQuery integration because customers wanted to replicate the documentation they have in their assets on BigQuery as well so that they had some other analysts that can they don't have to leave their BigQuery UI.

That means that

without having to duplicate any

any documentation

effort or doing any manual work with a single checkbox in the in the Versus Code UI or a flag in the command line application, they could simply push the same metadata and replicate them in an external platform.

Another way that we unlock this is,

Bruin CLI has internal commands that allow you to have a JSON representation

of the whole pipeline and assets.

Right? Which means if you wanted to run Bruin CLI elsewhere,

like, let's say you start with GitHub Actions, now you're moving towards a more a different different architecture. You wanna run it as part of another orchestrator

anywhere else. You could simply

parse all of these definitions, convert them into the primitives of the orchestrator you're using, like Airflow tasks

or DAX or assets,

that you can get all of those benefits without losing the benefits of of quickly working and iterating on these workflows locally. I don't think we're there that we can integrate with every tool out there, but we're simply trying to build things in such a way that if we're not integrated yet, it should still have the

possibility to integrate in the future, just building that missing piece of integration.

In terms of that metadata push, for instance, we support BigQuery at the moment, but we're there's nothing limiting us to support

Outlook or

Confluence

even or any other any other data catalog anyone might be using. Thinking of of, like, ML workflows and things like that, we are currently working on expanding Bruin's Python abilities where

you do not have to run Python

in the same device that you're running Bruin CLI.

Right? So for instance, you could still have

your Python assets,

but you can you can run them as as ECS tasks on your AWS account remotely.

Still, you get to work locally.

If you want, you can run those locally. Or if you have a dedicated infrastructure that that you need more, I don't know, you need GPUs, you need bigger instances, you can still run them remotely. Right? So goes goes back to the goes back to the first principles we built, Brew, and CLI around. We do not lock anyone in. I cannot hold people locked in. It just doesn't make sense for for anyone, and I hate that personally as well. So that's why we try to build those, like, little integration points, I would say.

Now for people who are

looking at Bruin, they're trying to figure out, okay,

this looks like a useful tool. How do I build with it? Can you just talk us through a typical workflow of going from I have some data that I have in a source system. I want to build a dashboard based on some transformed version of it or,

I have multiple sources. I need to aggregate them together, build some sort of a, you know, maybe a snowflake model or a star schema in my data warehouse, and then start generating dashboards on top of that, how that looks in the Bruin native workflow.

Yeah. So one of the things that I think it's it's I I find very helpful when I'm working is to have, like, those start starting points, like templates. That's why we have we have built bunch of templates into Bruin. I'll give an example.

I think Shopify is a good example. Let's say you have a

a Shopify store that you want to pull data into

Snowflake,

and then you have an internal Postgres database that you want to bring that data into Snowflake as well, and then you're gonna mix and combine these bits of data. Right? The first thing you do is you set up you install Brewin in your in your computer,

and then you can simply run Brewin in it, Shopify, BigQuery,

and it's gonna bring a

prebuilt

Shopify pipeline with data ingestion

assets defined,

which will automatically bring that data from Shopify into your data warehouse.

It will have a few steps of instructions in them, so you set the connections here. Here's where you put the here's how you run the asset. Here's how you can run the whole pipeline. And, that simply allows within seconds

to have a pipeline that brings Shopify data

into your BigQuery data warehouse

with a single command execution.

Right? And then let's say you want to bring your Postgres data as well. You can simply add like, copy one of those Shopify assets. Just change the source to be Postgres instead of Shopify.

Add those credentials into your, brew and YAML definitions,

and, you simply have your data being replicated as well. As part of these templates, you also get some SQL assets,

which sort of show you

how exactly you're gonna how you can you can run SQL assets in that pipeline.

In

the examples of prebuilt sources, we try to model the data. We try to clean the data and give you, like, a full end to end pipeline.

But if you want to build your own transformations,

effectively, you write your SQL.

You put a few lines of metadata at the top of the same file. You can use these templates as, like, a starting point. And then the moment that you hit run, it's gonna run them as part of the pipeline again. You have the ability to to validate all of these configurations.

For SQL assets, we do extra validations using the data warehouse, sort of running them as part of a dry run execution to ensure that there's no semantic errors or there's no data errors

in the assets itself.

And then this already becomes a production ready pipeline for you. Right?

Run it in in GitHub actions using using our, GitHub actions

scripts,

or you can run them in airflow.

You can run them in in Daxter.

You can run them in a cron job in an EC two instance.

You could use Burrow and Cloud.

Just pick your

pick the best tool that you're familiar with, basically.

So this would be the this would be the way of working, and then gradually expand from there. As you

continue building and investing in Burwin and building the open source components of it. I'm wondering what your thoughts are as to how you are considering

the overall ecosystem around it. Are you looking to build your own

ecosystem

for Bruin? Are you looking for Bruin to be part of another ecosystem and just some of the ways that

the sustainability

and the multiplicative factors of being able to take advantage of other components

in other tool chains,

how that factors into what you're designing with Brewin?

That's a tough, tough question because

we want to build an ecosystem around Bruin, but we do not want

to having to build everything from scratch. Right? We want to be able to benefit from

existing

advancements in in other industries as well. I'll give you a couple examples.

Data ingestion. Ingestor just wraps DLT,

and we try to contribute back to DLT as well, which means any advancements in DLT itself

in terms of making data ingestion faster, more efficient,

supporting more sources and destinations,

proven users start getting those benefits right away as well without having to

rebuild everything from scratch. Another example we try to make use of those existing stuff is SQL parsing. It's a very nasty job. You have, like, tons of dialects that you have to support.

And, and in Go, there's not really a solid library that does that. You know? We we thought, okay. Well, we could build this from scratch. We can do do all of that thing, but that would be, like, a a a big undertaking. And, well, maybe we're not the best programmers out there. Yeah? So there's there's some more clever people than than us that did that before. So we figure out a way of integrating Python libraries

into our Golang application

in an embedded way. So without anyone having to install anything extra,

we can use SQL Gloat as part of Bruin CLI,

which means any advancements we get out of those, like, different tools,

we replicate them, in in Bruin as well. Now this opens up the question that, okay, are we gonna have to build everything from scratch, or are we gonna have to like like, the community has to figure out all of these different

additional new features that need to be brought into Bruin, or could we integrate the existing

ecosystems like dbt packages, for instance? I don't know. I think I think we'll see over time. I do not see a very straightforward way that we can bring

things like dbt packages into it. But I think the fact that we can run everything as as native Python assets

means that people can use any library they want, and they're not limited with what we can do. You know? And that's one of the things that I like about running Python natively instead of relying on things like Snowpark. You're not limited to what that environment can run. You can take what you're running. You can build your own

environment to run all of these

different assets

that gives you full control

in terms of the ability and and and the escape hatch, the depth of the escape hatch that you want

to you want to build. You know? So long story short, couple of ways that we integrate.

Longer term, I don't know. We'll see. We'll see. But if anyone has any ideas, I'm I'm definitely love to hear.

And on that note, from the open source and community building

side of things,

what are the main opportunities for contributions

to Brew and an ingestor and some of the ways that you're thinking about the developer onboarding experience for people who want to

build onto or extend those utilities?

So for for all of our open source projects, we try to we try to make getting started that simple,

especially with, like, some pre build

commands that they can simply run to install dependencies

and and run tests and build the application, etcetera.

For around ingestor,

the first thing could be and should be that we recommend people to do is if this is like a a common source or a destination that would benefit everyone, let's contribute back to DLT itself instead. So the wider community gets the benefit,

of that new integration as well, and anyone can use them in their infrastructure, then we can

replicate that in in ingestor as well. For Bruin CLI,

I think what we are just opening up the tool to to public use, like releasing it in the open source, so there's probably gonna be a lot of missing pieces around

documentation

and sort of, like, getting started bits. We try to improve on all of those fronts, but it could be painful for first time contributors to understand what they do. But at least I'm I'm more than happy to, you know, give people support when they they wanted to contribute. We can we do this in a couple of ways, but the primary way we communicate with the community is we have, like, a public Slack channel that people can join, that that we try to have personal touch. The other day, I hopped on a call with a contributor that I've never met before.

We just did a quick fix together, so I really like that kind of stuff. And I'm more than happy

to have any any contributors in, you know, what we're doing, what we're building there as well.

And in your work of building Bruin, building Ingestor,

and building the business around them, I'm curious to hear about some of the most interesting or innovative or unexpected ways that you've seen those tool chains and workflows applied.

Well, there was there was one person that built, like, a

COVID

analysis,

like, a full pipeline

using using Brew in in in his free time to to build, like, an like, a case analysis and building summaries around that.

Someone

did a case study for one of our customers. Like, the the case study was, like, a pipe asking them to build, like, a a data transformation

pipeline,

and they built it in, even though we have not launched any of these in publicly, but they found some some public repositories, and they figured out all of these by themselves.

And I think, generally speaking,

anything that sort of goes beyond regular analytical jobs and people that I see like, seeing that people

use Burund stuff in outside of those,

I think, has been pretty interesting. One example that pops to mind is one of our customers started using Ingestor

as, like, a reverse CTO.

So they load data from Postgres to BigQuery. They do a lot of analysis on on on BigQuery. They do, like, build a lot of other data transformations.

But at the end of it, they use Ingestor again. This time, BigQuery being the source, Postgres being the destination.

They replicate that data into their production database so their API can,

serve this this new transformed way of data, which is something

sounds so obvious looking back now, but was not what I had in mind when we were first releasing

Progenster.

So

I

really enjoy discovering these kinds of clever uses of what we're doing. And it also gives me hope that these are these are primitives that allows people to to build those, like, flexible workflows that allows them to, like, run their all of their production workloads on on Bruin while also doing, like, fun analysis projects over the weekend using the same same primitives. You know? And in your work of building these tool chains, working with the community, understanding the problem space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

One of the things that that I found challenging was I I come from software background. A lot of the practices that we built into

into was

like, had its roots in in in our software experience.

And some of those patterns are not very obvious to people that have been working with data, especially people that have just spent, like, a year or 2 being as like, working as a data analyst.

And I have initially, I had this this reaction that, okay. We're gonna help you

figure out how to work with with these tools, you know, and that will help you get faster. In the end,

we ended up learning a lot from them where

they

figured out they they sort of came up with such interesting problems that we were like, oh, shit. We have not thought about that at all. We had no idea that

this was a problem. And this has been overall an eye opening experience and and has significantly

changed the direction for us in terms of the type of tool we're building. Initially, it was very focused on data analysts, data scientists, but command line applications, people that are comfortable with terminals and stuff. But over time, it significantly shifted our our our focus to, okay,

still those same primitives,

but make it as easy as possible, possible. Give them the right extensions.

Meet them what they need to where they need to where they need to go. Yeah. So sort of give them those those tools that or patterns that are that are used to it. I don't know if this answers your question. But Yes. Well, it it's more just a matter of your own your own learnings, your own kind of takeaways from going along this journey to where you are now. A very important learning in this process for me was

people when they built, like, any sort of workflow where where they have to do the same job repeatedly,

but, like, in slightly different ways, and they start introducing a lot of tech debt into the process.

I I always thought that it was like, okay. They they don't care. They don't care about building sustainable stuff or they don't want to do the right thing. It has been an important learning for me that that has nothing to do with whether or not they care. It's just as simple as they do not have the right tools to do the right thing. So the moment that you give them the right guidance

and you make doing the right thing also the easy thing, it significantly changes the way people work. And I think that that reflects itself in in a lot of these different tools that I think pretty much all of us are trying to do the same thing, give these people

the way to do the right thing and also make it the easy way to do that, the same thing so they do not pick pick another way and make things worse off. This that has been a that has been an important learning for me that shifted how we think about our products

constantly.

And for people who are

looking for

a tool stack, They're trying to manage their analytical workflows. What are the cases where Brewin is the wrong choice?

Very good question.

So one of the decisions we made early on was

dynamic pipelines

where,

let's say, you you ingest your own customers' data. And as you bring in new customers,

you're gonna have to run the same workflow but in parallel for, like, 100 customers, 200 customers,

basically replicating, dynamically generating pipelines.

Those use cases, Bruin is not the right tool, is not the right solution for that. So we do have certain primitives around

dynamic pipelines,

but dynamically generating assets, for instance, like you could do with Airflow or Daxter,

I think Bruin would not be the be the right tool, and instead, you should go for a full blown

orchestrator that has those capabilities

for those.

The second use case, I think, that we are not a a good fit is ML only workflows.

We have a lot of focus on on analytical pipelines

where, you see, like,

roughly

equal, if not more SQL heavy,

distribution among SQL and Python assets.

But in in ML world, that would be a completely different,

different thing where

data quality concerns do not really apply exactly. You have much much different observability

requirements.

So for those, like, ML heavy pipelines

and and ML workflows primarily,

I do not think BRUIN is the BRUIN is the right solution.

And,

yeah, I think these are the these are the ones that I think people should be

considered of when when picking. You mentioned a little bit about some of the future direction that you're thinking about, but I'm wondering if you can talk to some of the near to medium term

plans that you have for the project, for the business, and some of the projects or problem areas you're excited to explore? So a couple of different different avenues we're going down are, first of all, the like, the templates that I mentioned. You know, getting started is usually the hardest part. And the fact that we have those primitives for data ingestion transformation means that we could actually build full blown templates that allows

analyzing data, building data models within seconds for any platform out there. Right? So we're gonna

be investing heavily into expanding

our sets of templates

to sort of make it brew in, like, a a tiny marketplace of of templates that people can choose and and start building upon. The second avenue that we're chasing at the moment is getting deeper into

into understanding the the semantics of all of these assets. So for instance, Bruin CLI parses your SQL

definitions, and it's gonna get columns out of that. It's gonna it's gonna build your dependency tree, etcetera.

But I think there's a lot more static analysis capabilities that that we could introduce into those workflows

that expands beyond SQL, but also

combining them with data ingestion. So those kinds of, like, lineage slash

static analysis capabilities is is one of those things that we're gonna be investing further into. And the third avenue is is execution freedom,

where if you want to execute your code elsewhere when you're running your Pruend pipeline, those transformations that needs to run closer to the data or they need to run-in a beefier machine or or elsewhere, simply we want to be able to support those remote execution capabilities

in Burund,

pipelines. And,

beyond all of those, I think we want to make sure that

we also offer the best experience as, like, an offering like a managed platform to to run all of these as well, where people can decide where do they want to run run their transformation workloads, and they don't have to think about

any infrastructure to to manage all those. But that's a bit more on the on the business side of things. You know? So I would say these are the these are the current things that we are looking into in the next months.

Are there any other aspects of the work that you're doing at Bruin, the overall space of code only,

data pipelines,

the

problem areas of building these analytical systems that we didn't discuss yet that you'd like to cover before we close out the show? Well, there's one little piece where I always wished people built, like, code driven

BI tools as well, right,

where we see

new developments around there with tools like Evidenced

or or some of the tools are going, into that space as well. But

I would really love to see more

developments there so that we can integrate with those kinds of solutions to to give, like, a real end to end experience there with regards to from ingestion till

the Insight being actually used elsewhere. I'm hoping someone

builds

more solutions in that space. Other than that, really appreciate,

you having me. Thanks for having me on the show. And, yeah, it was really lovely to chat. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling our technology that's available for data management today.

I think the biggest gap the thing that frustrates me the most especially is data quality being treated as a second class citizen

where people run all of their transformations

in somewhere. They have their orchestrators and data ingestion jobs, and then you have a completely separate place that runs data quality,

which means it becomes an afterthought.

And, and I think data quality,

especially as people organizations

use data more, teams get bigger, teams rely on the data,

more and more, data quality needs to be a first class citizen.

I think software

teams sort of internalized

this by now where, like, companies with with solid engineering cultures in the software side are gonna be always implementing some form of testing and validation and integration tests and whatnot. I am really looking forward to that being a common approach in in data teams as well, where data quality is a first class citizen in every workflow that people are are building, and, and it becomes, like, one of the first things people talk about. Alright. Well, thank you very much for taking the time today to join me and share work that you're doing on Bruin and some of the ways that you're thinking about how to

simplify and consolidate the workflows of people building analytical systems. I appreciate all the time and energy that you're putting into that and making the tooling available for folks to be able to build on top of it, and I hope you enjoy the rest of your day. Tovas, thank you very much. I appreciate you having me here, and, and thank you all for listening as well.

Thank you for listening, and don't forget to check out our other shows. Podcast.net

covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host at data engineering podcast.com

with your story. Just to help other people find the show, please leave a review on Apple Podcasts and and tell your friends

and coworkers.