Summary
The first stage of every good pipeline is data integration. With the increasing pace of change and the demand for up to date analytics, the need to integrate that data in near real time keeps growing. With the improvements and increased variety of streaming data engines and better tools for change data capture, it is possible for data teams to make that goal a reality. However, despite all of the tools and managed distributions of those streaming engines, it is still a challenge to build a robust and reliable pipeline for streaming data integration, especially if you need to expose those capabilities to non-engineers. In this episode Ido Friedman, CTO of Equalum, explains how they have built a no-code platform to make integration of streaming data and change data capture feeds easier to manage. He discusses the challenges that are inherent in the current state of CDC technologies, how they have architected their system to integrate well with existing data platforms, and how to build an appropriate level of abstraction for such a complex problem domain. If you are struggling with streaming data integration and change data capture then this interview is definitely worth a listen.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Your host is Tobias Macey and today I’m interviewing Ido Friedman about Equalum, a no-code platform for streaming data integration
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you are building at Equalum and how it got started?
- There are a number of projects and platforms on the market that target data integration. Can you give some context of how Equalum fits in that market and the differentiating factors that engineers should consider?
- What components of the data ecosystem might Equalum replace, and which are you designed to integrate with?
- Can you walk through the workflow for someone who is using Equalum for a simple data integration use case?
- What options are available for doing in-flight transformations of data or creating customized routing rules?
- How do you handle versioning and staged rollouts of changes to pipelines?
- How is the Equalum platform implemented?
- How has the design and architecture of Equalum evolved since it was first created?
- What have you found to be the most complex or challenging aspects of building the platform?
- Change data capture is a growing area of interest, with a significant level of difficulty in implementing well. How do you handle support for the variety of different sources that customers are working with?
- What are the edge cases that you typically run into when working with changes in databases?
- How do you approach the user experience of the platform given its focus as a low code/no code system?
- What options exist for sophisticated users to create custom operations?
- How much of the underlying concerns do you surface to end users, and how much are you able to hide?
- What is the process for a customer to integrate Equalum into their existing infrastructure and data systems?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Equalum used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing the Equalum platform?
- When is Equalum the wrong choice?
- What do you have planned for the future of Equalum?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Equalum
- Change Data Capture
- SQL Server
- DBA == Database Administrator
- Fivetran
- Singer
- Pentaho
- EMR
- Snowflake
- S3
- Kafka
- Spark
- Prometheus
- Grafana
- LogMiner
- OBLP == Oracle Binary Log Parser
- Ansible
- Terraform
- Jupyter Notebooks
- Papermill
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes. And their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's I-M-M-U-T-A. Your host is Tobias Macey. And today, I'm interviewing Ido Friedman about Equalum, a no-code platform for streaming data integration. So, Ido, can you start by introducing yourself?
[00:01:54] Unknown:
Hi, Tobias. I'm Ido Friedman from Equalum and CTO of the company for the last few years. Been in the data domain for the last 15 years or so, doing roles from operations, database management, all the way to ETL, architecture, and just about any related roles in the data domain. I've been involved in Equalum for the last few years, again doing architecture and designing the product.
[00:02:16] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:19] Unknown:
Well, I initially started around SQL Server quite a few years back, doing a lot of DBA stuff and ended up doing architectural and complex projects, integrations, and all sorts of stuff around SQL Server and relational databases. The last few years, I've moved from relational to other databases and other data systems, as all of us did. So, yeah, I started there.
[00:02:42] Unknown:
Can you give a bit of an overview about what you're building at Equalum and how it got started? And you mentioned that you've been there for a few years, so maybe a bit of how you got involved with the business as well.
[00:02:53] Unknown:
OK. So we are building an end to end platform for data ingestion, basically an ETL system, with the goal of providing open source benefits to the enterprise domain. I'm sure that everybody who's tried to use open source in an enterprise has found all the difficulties and hardships around implementing it. And we are trying to bridge the gap and get the best of open source into an enterprise ready application and product. So that's for our goals. As for myself, I've been working for about 3 years, almost 3 years, in Equalum. I've started, I would say, halfway.
Equalum actually started with that goal, but had quite a few steps around it. We started with a very simple system and ended up with a full stack of Spark, Kafka, and all of the adjacent components, and a full system. So we started very simple.
[00:03:47] Unknown:
The overall space of data integration and ETL has become relatively crowded in the market, and there are a number of different approaches where some people are advocating for ELT, where you just do extract and load, using something like maybe Fivetran or the Singer set of tools, or some people are focused on batch oriented workflows using more traditional ETL approaches. And I'm wondering if you can give a bit of context as to how Equalum fits in that overall market and some of the differentiating factors that engineers should consider when they're debating what tools to use and what approach to take?
[00:04:25] Unknown:
The main differentiator for us is a customer that wants a system that is mature. A lot of integrations of open source products and open source capabilities are still in the making and are still growing as you grow with them. And we are aiming at providing the whole system end to end to someone who wants ETL rather than a lot of moving parts. I think that's the main thing. We don't want to provide yet another software that relies on 5 to 10 different vendors. So, for example, if you are doing streaming, you might implement Kafka, and you might need ZooKeeper around it, and you might need to monitor it. And I'm sure that everybody who's done that has seen the amount of components that you end up with. And I would say that when you start with a data engineering project, you usually end with data engineering plus a whole division of DevOps. We want to end that mess and provide 1 product with 1 vendor that gives you the whole thing end to end, everything monitored and relevant to the use case rather than to the technology.
[00:05:28] Unknown:
In terms of the overall ecosystem of data, you mentioned wanting to be able to bring the benefits of open source to the enterprise. And I'm wondering for people who might already have started down the journey of building out a data platform, they might have some capacity for data integration in place. What are the components of the overall ecosystem that Equalum is designed to replace outright, and which are the ones that it is designed to integrate with and augment?
[00:05:54] Unknown:
Let's start with replace. We are looking at ourselves as an ingestion system. So the word replace depends on what you are doing. I can certainly give you a few examples from implementations. We have replaced a system using open source Pentaho on top of EMR and a lot of orchestration around building the flow in Pentaho, executing the flows in EMR, monitoring, and getting everything working together. We have replaced the whole thing with just 1 system. And, again, 1 vendor for the whole thing and not mingling too many components and getting them to work. So I would say that for that use case, the replacement would be for the end to end system. So it depends on the use case itself, but we are aiming to replace the whole integration end to end from source to target, to get the data transformed, enriched, and managed, to the level of: I read it from the source, whether it is a streaming source, a batch source like S3, or even a CDC source, and write it to whatever the data target might be, Snowflake or S3, for data warehousing or data lakes. So it is an end to end solution that is aimed at providing the full stack that you require for data integration.
As for integrating with other products in the enterprise, I can give a few examples, like a data catalog that we are aiming to integrate with. We have not fully integrated yet. But we are not aiming to consume the whole domain of data. We are aiming at integrating with quite a few products that are still in the domain. So currently, we are aiming only at the data integration and data transformation and enrichment area.
[00:07:32] Unknown:
Particularly for the cases of Kafka and Spark where people are using them for a data integration component, they might also be using them downstream from that for being able to power machine learning workloads. And I'm wondering if Equalum is able to leverage the existing clusters that people have running for being able to automate some of the flows through those systems, or if it's a matter of they would just use their pipelines specifically for those downstream use cases and use Equalum entirely for the integration?
[00:08:05] Unknown:
Again, it depends. We can use an existing platform and existing Kafka and Spark. But in most cases, we found that 1 of the biggest benefits we provide is we wrap the whole thing together suited to the use case. And a lot of times when we use other platforms, we lose that benefit. We have to mitigate and work around with other components to get things working as we are aiming at. And I think that, in the end, it is possible. But in most cases, it would be good to give the domain of data integration to Equalum and have it manage the whole thing by itself. But it is possible, and we do have users doing that.
[00:08:44] Unknown:
And as I mentioned at the beginning, the Equalum platform is designed to be low code or no code. So I'm wondering if you can just talk through the overall workflow of somebody who's using Equalum for the simple use case of doing a direct point to point data integration without necessarily having any transformations in flight?
[00:09:03] Unknown:
I can actually give 2 use cases. So let's just start with a simple batch from S3. So you would define a source, provide the relevant credentials on the source. On top of that source, you would define a stream, so a flow. And in the flow, whether it's needed or not, you can add the source plus transformations, define the target, and that's just about it. It's as simple as that. Define source and target, connect the dots in a flow, and you're done. So for a very simple flow, I would say, 5 minutes from the point of starting, you'll probably have a running flow. So I think that that is 1 very simple flow. Another interesting flow is the use case of replication. We do a lot of data replication, and by that I mean, you might have an online Oracle system used for something in your business, and you wanna get all the data into a data lake in the cloud, for example. Doesn't have to be that. Of course, there's a lot of options.
You can create the source again, and we have an object called replication groups where instead of creating a flow per table, you actually create an object called a replication group that holds all the tables that you have selected, and you're gonna get all those tables into your target with, I would say, 3 or 4 more clicks. But it would be for a 100, a 1000, or even more tables without you going into too much detail. So those would be 2 examples of simple flows.
[00:10:27] Unknown:
For the case where you then have an existing data flow of doing this point to point integration, you then want to be able to add in transformations to maybe overwrite or occlude PII, so maybe mask the beginning of a credit card number or remove the street number from an address field, or if you want to say that for a particular source or a particular subset of records from a source, you want to actually route those to a different destination point. What are the capabilities in Equalum for being able to handle those use cases and how that manifests in the workflow and the interface that's exposed to the end user?
[00:11:08] Unknown:
So I think the interface is 1 of our key points, and we've invested a lot in getting just a few clicks to do a lot of work. The interface itself is translated eventually to Spark code. So, generally, everything you can do in Spark, we allow you to do in our canvas. For that example, just masking data would mean instead of just doing source to target, you would add a transform operator. The transform operator will show you the whole schema, and you can do whatever you want on that schema. So double clicking on the flow, selecting the right function, and you're masked. Same goes for filtering, so it's pretty simple. You simply select a filter operator, or a split operator if you wanna divide data, and provide a super simple expression, at the level that any Excel user who can write a very simple formula in Excel would be very comfortable writing. Even the word writing makes it sound more complex than it actually is. It's just double clicking on fields and selecting the functions you wanna use for filtering.
And, again, in probably 2 minutes, you can get filters in place and route data to the right point.
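To make the kind of transformation described above concrete, here is a minimal PySpark sketch of masking a card number and routing rows to two different targets. This illustrates the underlying pattern only; it is not Equalum's generated code, and the column names and paths are hypothetical.

```python
# Illustrative sketch only -- not Equalum's implementation.
# Masks a credit card column and routes rows to different targets.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask-and-route-example").getOrCreate()

df = spark.read.parquet("s3a://example-bucket/orders/")  # hypothetical source path

# Mask all but the last four digits of the card number and drop the street number.
masked = (
    df.withColumn("card_number", F.concat(F.lit("************"),
                                          F.substring("card_number", -4, 4)))
      .withColumn("address", F.regexp_replace("address", r"^\d+\s+", ""))
)

# Route: domestic rows to one target, everything else to another.
domestic = masked.filter(F.col("country") == "US")
other = masked.filter(F.col("country") != "US")

domestic.write.mode("append").parquet("s3a://example-bucket/clean/domestic/")
other.write.mode("append").parquet("s3a://example-bucket/clean/international/")
```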
[00:12:16] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool Water Flask. And then as far as being able to handle things like versioning or updates to existing workflows, how do you handle being able to stage the rollout of that from a staging to a production environment, for example, or testing out a flow with a subset of data and being able to validate that and then roll back in the event of errors?
[00:13:34] Unknown:
So we provide 2 areas around this. First of all, our flows are versioned. That means any edit you do on a flow does not interfere with anything that is currently running. When you click edit on a flow, you basically get a new version. So you would never have to edit a running flow. Once you have the version and get to the point where you are okay with it, you can publish it and replace the running version. So it's very easy to, on 1 end, publish a new version, and since we are doing versioning of everything, it is super simple to roll back. So that's inside 1 system. For migrating between systems, we provide quite a few options, using the CLI to export the whole thing. You can export via the UI, and we have a full list of APIs that you can use to export. For that example, we have quite a few users exporting everything into files and managing the whole thing in Git type solutions, so in source code management. But you can also avoid this and just use the versions that are in the product itself.
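One generic way to manage exported flow definitions under source control, as mentioned above, is to snapshot the exported files into git on each release. The sketch below is hypothetical: the export command name and its flags are placeholders, not Equalum's actual CLI, so substitute whatever export mechanism your installation provides.

```python
# Illustrative sketch only: keeping exported flow definitions in git.
# The "equalum-cli export" command and its flags below are hypothetical placeholders.
import subprocess
from pathlib import Path

EXPORT_DIR = Path("flows")          # directory that will hold exported flow files
VERSION_TAG = "flows-release-1"     # hypothetical tag name for this snapshot

def snapshot_flows() -> None:
    EXPORT_DIR.mkdir(exist_ok=True)
    # Hypothetical export step; replace with your real export command or API call.
    subprocess.run(["equalum-cli", "export", "--out", str(EXPORT_DIR)], check=True)
    # Commit and tag the snapshot so rolling back is just checking out an earlier tag.
    subprocess.run(["git", "add", str(EXPORT_DIR)], check=True)
    subprocess.run(["git", "commit", "-m", f"Snapshot flows for {VERSION_TAG}"], check=True)
    subprocess.run(["git", "tag", VERSION_TAG], check=True)
```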
[00:14:35] Unknown:
And can you dig a bit more into the Equalum platform itself and how it's implemented and some of the ways that the design and architecture has evolved since it first began?
[00:14:44] Unknown:
As we spoke about in the beginning, we are using quite a few open source products stitched together. So Spark is the processing engine. We are using Spark Streaming and standard Spark for batch. 1 cool thing about that is when you're writing a flow, you don't really care how it's gonna be executed. We decide for you based on the objects you've selected. If the objects require streaming, we'll simply do Spark Streaming. If it's not required, we'll do Spark batch. So we are using Spark for the processing engine. We have fully integrated Kafka inside the system. So for the example of CDC, when you're pulling data out of Oracle or SQL Server or whatever relational database it is, you are actually pulling the data out of it, and we are putting it into our internal Kafka and using Spark Streaming on it.
And you're benefiting from the whole thing without touching anything underneath, of course. So for that matter, we'll create the topics. We configure everything, including partitions and all the configuration around Spark, to get the whole thing to work together. I think 1 key, very important point is we're doing exactly once. I'm sure that anybody who's done that knows how difficult that is. We're doing end to end exactly once on streaming and batch. So as for the architecture itself, Spark is the processing engine. Kafka is our internal buffer. We have quite a few other components in the system, like Prometheus for monitoring. We provide dashboards in Grafana. All the metrics you see in our UI are based on Prometheus, to the level that you can use the data in Prometheus, the metrics, to integrate with whatever system. Just today, I had a conversation about integrating into New Relic. So we provide quite a few components that you would expect and you would probably implement yourself if you did a similar project, but the whole thing is already stitched together into 1 solution.
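For readers unfamiliar with the underlying pattern, the stock Kafka-to-Spark Structured Streaming pipeline that a platform like this manages on your behalf looks roughly like the sketch below. The topic name, record schema, and storage paths are hypothetical; this is the generic pattern, not Equalum's internal code.

```python
# Illustrative sketch of the generic Kafka -> Spark Structured Streaming pattern.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("cdc-stream-example").getOrCreate()

# Hypothetical shape of a change event on the topic.
change_schema = StructType([
    StructField("op", StringType()),       # insert / update / delete
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "oracle.orders.changes")   # hypothetical CDC topic
            .load())

# Parse the Kafka value bytes into structured columns.
changes = (raw.select(F.from_json(F.col("value").cast("string"), change_schema).alias("c"))
              .select("c.*"))

# Write to the lake with checkpointing so the stream can recover after failures.
query = (changes.writeStream
                .format("parquet")
                .option("path", "s3a://example-bucket/lake/orders/")
                .option("checkpointLocation", "s3a://example-bucket/checkpoints/orders/")
                .outputMode("append")
                .start())
query.awaitTermination()
```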
[00:16:31] Unknown:
And in terms of being able to orchestrate all those elements together, and in particular, being able to handle the exactly once processing all the way through, what have you found to be some of the most complex or challenging aspects of being able to build out and maintain the platform, particularly as the capabilities and versions of those underlying components change and evolve?
[00:16:52] Unknown:
First of all, exactly once, I think, is 1 of the hardest aspects of the domain, especially if you have distributed systems where you need to design for failure. That means that failure is a part of your life, and you need to mitigate it. So we found ourselves doing years of code around exactly once. Even in areas you wouldn't think of: you get an ack from Kafka, but that doesn't really mean that the ack is fully good enough for you. In some cases, you might wanna actually check that the data is there. So exactly once is a deep and very complex problem, and to say that you have mitigated it means you've done a lot of work. So I think it is a big area in our product, and we are aiming to provide exactly once at an enterprise level. That means you can get your financial data through Equalum without worrying about it getting lost or being duplicated.
So exactly once has been quite an effort. The other part is the underlying components. And as you understand, we have, as you would expect, quite a few components. And we see these components as part of Equalum. That means that when we see it fit, we will upgrade a component. It's something that you should not care about as a customer, and we have done upgrades of Spark, Kafka, ZooKeeper, and whatever component it is in the system without interfering with the customer itself. We are doing it seamlessly as part of an upgrade of a version. And I think this is a very important part because these types of activities usually take a lot of time and a lot of DevOps effort to automate and orchestrate, and we've done a lot of work to get that seamlessly into our upgrades.
So that's just the orchestration of it. On the other hand, we try to keep as up to date as we can. We are not able to keep up with every small change in any specific component, because we are doing thorough testing internally before we release anything. And since we're developing a generic platform, we do need to check quite a few use cases. We are not at the latest version at any point of time, but we are very keen to upgrade and provide the new capabilities and new stability that these products provide. And, again, it's provided as part of an upgrade of Equalum. You don't need to upgrade the actual component.
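As a rough illustration of the "an ack is not enough" point, one common generic approach to approximating end-to-end exactly-once delivery is to make the target write idempotent and commit consumer offsets only after the write is confirmed. The sketch below uses kafka-python with hypothetical names; it is a simplified pattern, not Equalum's mechanism.

```python
# Illustrative sketch: idempotent writes + manual offset commits as one generic
# approximation of exactly-once delivery. Not Equalum's implementation.
import json
from kafka import KafkaConsumer  # pip install kafka-python

applied = {}  # stand-in for the target table, keyed on the record id

def upsert(record: dict) -> None:
    # Idempotent write: replaying the same id overwrites rather than duplicates.
    # In practice this would be a MERGE/upsert into the real target.
    applied[record["id"]] = record

consumer = KafkaConsumer(
    "orders.changes",                  # hypothetical topic
    bootstrap_servers="broker:9092",
    group_id="example-loader",
    enable_auto_commit=False,          # offsets are committed manually, below
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    upsert(message.value)
    # Commit the offset only after the write succeeds: a crash in between replays
    # the record into an idempotent sink instead of losing or duplicating it.
    consumer.commit()
```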
[00:19:09] Unknown:
The subject of being able to ensure that your customer pipelines continue running and that you're not introducing any regressions is definitely challenging. And I'm wondering how you address issues such as data quality, or doing early identification of potential failures, or maybe being able to do things like static code analysis of customers' compiled pipelines to see what potential errors those might surface in your test environments before you actually roll them into production?
[00:19:37] Unknown:
Yeah. It certainly is quite an area that you need to validate. It caused us to create an actual Python library that automates the whole Equalum pipeline creation. By the way, we can provide it if needed for automation. Just about any action in a flow that is possible, we have automated, and we are testing it on every release. And even before releases, of course, we have customers assisting us with testing, with very strict scenarios and very odd scenarios. We are actually using customers' flows to test, when they allow us. So it is something that we invest a lot of time and effort around: checking that when you upgrade something, for example, just upgrading Spark, you have not hurt any data types or any transformation that has changed between the versions.
So we are simply doing thorough testing with as much data as we can generate, plus automation on top of it.
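A generic flavor of the regression checks described here, comparing a flow's output before and after an upgrade for schema drift and row-level differences, could look like the following PySpark sketch. The paths are hypothetical, and this is not the internal Equalum test library.

```python
# Illustrative regression check: compare the output of a flow before and after
# an upgrade for schema drift and row-level differences. Not Equalum's tests.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flow-regression-check").getOrCreate()

baseline = spark.read.parquet("s3a://example-bucket/test/baseline/")    # hypothetical paths
candidate = spark.read.parquet("s3a://example-bucket/test/candidate/")

# Schema drift: catches an upgrade silently changing a data type or column.
assert baseline.schema == candidate.schema, "schema changed between versions"

# Row-level drift in both directions.
missing = baseline.exceptAll(candidate).count()
extra = candidate.exceptAll(baseline).count()
assert missing == 0 and extra == 0, f"{missing} rows missing, {extra} rows added"
```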
[00:20:33] Unknown:
1 of the elements of the pipeline and being able to stay up to date with changes in source systems is the concept of change data capture, which you handle throughout your workflow. And that's definitely an area that is becoming increasingly relevant and becoming table stakes for any sort of data integration capabilities for a platform. And I'm wondering what you have seen as some of the challenges and edge cases that you run into for being able to support that change data capture for the variety of systems that your customers might be working with?
[00:21:08] Unknown:
So, yes, CDC, change data capture, is certainly a key factor, and we're doing a lot of it. I mentioned replication groups before. It's all based on CDC, of course. And it is a key element. We have a dedicated team just for connectivity to sources and executing CDC concepts. We have gotten to the level of, hopefully the Oracle guys are listening, I'm sure that everybody knows LogMiner and the benefits and drawbacks of it. So we have actually developed an alternative to LogMiner. We call it OBLP, Oracle Binary Log Parser, which means we are able to read Oracle logs ourselves. We don't need any external components.
So we are doing a lot of work in that area. We are investing a lot of time on performance around that, and I think this is 1 of the key factors around CDC. People wanna see the data being updated with a millisecond delay without affecting production at all. And I think that is the hardest point there is in CDC when you go to scale. For that example, LogMiner is not as good as it should be, and we found that it does have limits, and we simply avoided it by writing our own solution. Not many companies have done that, by the way, for Oracle. So we've seen that for quite a few databases, and I think we are excelling in that area, and we are providing a lot of databases with CDC capabilities with that performance approach at hand. So I think that's 1 area that is very important, performance.
The other area is dealing with schema changes, which, in relational databases, doesn't sound like the most important problem, since the schema does not change that much compared to other data sources. But when it does, you want to have a perfect solution that does whatever you need end to end. So we provide, on top of CDC, end to end schema changes. That means you've added a column in your Oracle database and you're writing to Snowflake; you're gonna get the new column in Snowflake seamlessly. You don't need to do anything. You can actually decide in some cases that a schema change is something you wanna get involved in. You can actually get a notification that tells you we have identified a schema change. Do you want to do something with it, or do you want us to automate it all the way? So I do think these 2 aspects are the most important: performance and schema change management, schema evolution.
I think these are the 2 important parts, and we've invested a lot of time there.
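The schema evolution behavior described above boils down to a simple pattern: when the source gains a column the target does not have, add it (or ask for approval) before applying changes. Here is a minimal, generic sketch of that pattern, not Equalum's implementation, using a plain DB-API cursor on the target.

```python
# Illustrative sketch of schema evolution: add source columns missing from the target.
def evolve_target_schema(cursor, table: str, source_columns: dict, target_columns: set) -> None:
    """source_columns maps column name -> SQL type, e.g. {"discount": "NUMBER(10,2)"}."""
    for name, sql_type in source_columns.items():
        if name not in target_columns:
            # A real system could notify and wait for approval here, mirroring the
            # "do you want to automate it all the way" choice described above.
            cursor.execute(f'ALTER TABLE {table} ADD COLUMN "{name}" {sql_type}')
```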
[00:23:40] Unknown:
Yeah. The other element of change data capture that I've seen as being challenging is identifying issues with transactionality, where I know that not all of the systems that support change data capture are able to operate at the higher level to understand beyond just, you know, these are the specific changes that are being written. And then if a transaction fails, having to then be able to roll that back. And so I'm curious how you handle situations like that and not having to replay all of those events or, you know, destroy and rebuild a table because of a transaction rollback, particularly when you're dealing with immutable data systems on the receiving end.
[00:24:20] Unknown:
I agree that it is a challenge. A lot of times, if you wanna wait till the end of the transaction for it to be fully committed, that means you wanna start reading it only when it's committed, and that means you're gonna have a lot of delay. And 1 of the performance improvements we've done is we're gonna read a transaction before it's committed, but we're not gonna process it until we get the commit for that transaction. That means you get almost 0 delay. So you might have a transaction that does a 1,000,000 record change, and we're gonna read it along as it goes into our system.
But the apply or commit will only be executed when we receive the commit from the source. So, yes, it is a big problem, certainly, and we had to specifically develop a component just for that, to deal with it and still be relevant and avoid, as you said, rollbacks, so we will not apply anything that might be rolled back. But you still wanna be on top of the changes and not wait for the end of the transaction.
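The approach described here, reading change events as soon as they appear in the log but applying them only once the commit arrives, can be illustrated with a small generic buffer keyed by transaction id. This is a conceptual sketch, not Equalum's component.

```python
# Illustrative sketch: buffer change events per transaction, apply on COMMIT,
# discard on ROLLBACK. Not Equalum's implementation.
from collections import defaultdict

class TransactionBuffer:
    def __init__(self, apply_fn):
        self._pending = defaultdict(list)   # transaction id -> buffered change events
        self._apply = apply_fn              # callable that applies a committed batch

    def on_change(self, txn_id: str, event: dict) -> None:
        # Read eagerly as the log is written, but do not apply yet.
        self._pending[txn_id].append(event)

    def on_commit(self, txn_id: str) -> None:
        # Apply the whole transaction in one go, then forget it.
        self._apply(self._pending.pop(txn_id, []))

    def on_rollback(self, txn_id: str) -> None:
        # Nothing was applied downstream, so rolling back is just dropping the buffer.
        self._pending.pop(txn_id, None)
```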
[00:25:20] Unknown:
And the other interesting aspect of your platform is with the focus on the no code approach. I'm wondering what your heuristics and design priorities have been for being able to surface some of these complex data management challenges and issues to users who don't necessarily have a deep background in it and making it accessible and understandable to them, while also having the flexibility for engineering oriented teams to be able to implement custom logic that they can embed within that workflow?
[00:25:53] Unknown:
So, as I said, we've implemented over Spark. So as for complexity, I think we've done flows with thousands of transformations overall, and I think we can probably get to any transformation that is possible in the world. We'll be able to do it. And since we are using Spark, it is going to be super fast as well. We actually have quite a few optimizations around how you have built your flow to be readable versus how we execute it in Spark. So you can actually design your flow to be super readable. You should not care about how it's gonna be executed. And we are actually compressing operators.
That means if we can combine 2 operators during the execution into 1, or a 100 into 10, for example, we will do that. That means that when you design your flow, you can design a flow that is super easy to read. And we've seen that in transforming flows from other systems to Equalum. We've seen a reduction in the amount of operators on the flow, sometimes by a factor of 10. So we've seen a flow that has a 100 operators in Pentaho, for that example, going down to 10 operators in Equalum because of the abilities of the transformations and the ability for you to ignore any performance aspects. So that is 1 key point. I think that it means that for you as a developer or an ETL developer, you should not care about the performance aspects of how you build the flow. You should just make it readable.
As for the actual writing of the transformations themselves, we allow for just about any function possible, whether it is text or dates or numbers or whatever transformation you've done or you need to do, including JSON, XML, and parsing whatever type of data you want inside the flow or even before the data lands in the flow. So it is possible to do just about anything with our built in functions. But if you get to a limit for some reason, or you want to reuse code that you have already written, we allow you to integrate Java code as a JAR into our system, which will be exposed as a function. It's pretty easy to use as well. We provide a sample project to do that. We also provide JavaScript as part of the flow, so you can also use that for custom coding. And I think the third thing, which is very important, is we have actually migrated Java code that has been written for MapReduce type jobs into flow operators without any coding. So I think that's the best example of how complex you can get. We have integrated Java into a graphical flow.
And, of course, you're gaining a lot of performance since you are executing this on Spark, and since we are implementing the built in Spark functions rather than custom code, you'll get better performance when you use the built in functions.
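The operator compression idea mirrors what Spark's own optimizer already does for chained narrow transformations: many small, readable steps collapse into a single physical stage. Here is a quick stock-PySpark illustration of that general behavior (not Equalum's flow compiler).

```python
# Illustrative sketch: many small, readable transformation steps that Spark's
# optimizer fuses into a single scan + projection + filter.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("operator-fusion-example").getOrCreate()
df = spark.range(1_000_000)

# Ten tiny, readable steps...
for i in range(10):
    df = df.withColumn(f"step_{i}", F.col("id") + i)
df = df.filter(F.col("id") % 2 == 0)

# ...show the physical plan to see them collapsed into one stage.
df.explain()
```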
[00:28:37] Unknown:
For the cases where you do need to be able to access some of the underpinnings of the system to be able to understand where there might be an edge case or diagnose errors or be able to add in additional quality checks that are custom to your business domain, what is your approach for being able to hide the underlying concerns from the end users? And what are the cases where you are forced to expose them in some manner and how you make them accessible to your customers?
[00:29:06] Unknown:
So I think we can take the example of the way we implement Spark. The usual way of implementing Spark is submitting jobs, submitting jobs into Spark and waiting for the job to execute, and that means that you need to know what the correct characteristics of your job are, and you need to configure the submission of the job to your flow and to your specific code that you wrote. The way we have actually obfuscated that from the user is we have an application running in Spark constantly, and we are receiving flows that have been written, and we optimize them. And we actually do not require the user to know anything about Spark. We get the flow diagram and convert it into an execution in Spark based on our knowledge and how we analyze the needs inside the flow. So on 1 hand, you don't need to know anything about Spark, and you don't need to understand anything about it, and you can work just building the flow in your graphical interface.
On the other hand, we do have users that have gone to the level of, yes, I wanna see what's going on there and why is it slow or maybe I wanna change something. We do provide all the abilities including monitoring and metrics on our flows in Prometheus and in Grafana. So you are very welcome to go into Spark and see whatever you need, whether it is directly on the Spark UI or through metrics that we provide in Grafana and Prometheus.
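The "constantly running application" model described here contrasts with submitting a new Spark job per flow. A highly simplified sketch of a resident driver that keeps one SparkSession alive and executes flow definitions as they arrive might look like the following; the queue and the flow format are hypothetical stand-ins, not Equalum's engine.

```python
# Illustrative sketch of a long-running driver that executes flow definitions
# as they arrive, instead of a spark-submit per job. Not Equalum's engine.
import queue
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("resident-flow-runner").getOrCreate()
flow_queue: "queue.Queue[dict]" = queue.Queue()   # populated by an API/UI layer (hypothetical)

def run_flow(flow: dict) -> None:
    # A "flow" here is just {source_path, target_path}; a real engine would
    # translate a full operator graph into DataFrame operations.
    df = spark.read.parquet(flow["source_path"])
    df.write.mode("append").parquet(flow["target_path"])

while True:
    run_flow(flow_queue.get())   # blocks until the next flow definition arrives
```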
[00:30:36] Unknown:
Another interesting thing to dig into is that at the beginning, you mentioned that you're working on simplifying the process of bringing in these open source capabilities to the enterprise. And beyond the challenge of just being able to integrate these systems and run them at scale, what are some of the other challenges that you face in bringing those capabilities to the enterprise and some of the special considerations that you've had to build in or the communication patterns that you've had to use to convince the enterprise users of either the value or the utility of these systems and kind of sell into those channels?
[00:31:11] Unknown:
I think 1 of the key aspects that we heard from our customers that have benefited, especially customers that have migrated from their own DIY solution, is, I think, that the number of moving parts and the integration between them is a lot of the time very hard. And by that, I mean, if you have an ETL system that you write flows in, then you need to execute them on a Spark cluster or whatever type of processing engine. And then you might wanna scale it up, so you might need Kafka as well. You find yourself integrating the whole thing.
So if you wanna get enterprise level solutions for each of those, you might find yourself with 3 or 4 vendors just for that use case. And we sort of call it vendor hell, because with whatever problem you have, you're gonna ask each of those vendors whose fault is it. So it goes into very different areas. Identifying problems, you will always have to identify whether it is the ETL system, whether it is Kafka, whether it is Spark, or whatever engine you're using. So monitoring and identifying issues is always falling between these.
There are other elements of managing multiple solutions that are not meant to be integrated externally, I would say. Usually, you would need to integrate it yourself and have hands on all of them together. I think that's a very important area that we've seen a lot of companies deal with, trying to get all of these working together and finding themselves failing, or finding themselves in years of a project without getting the real benefit from it. And we've invested a lot of time in thinking about it as 1 system rather than a bunch of components that need to talk to each other. It is 1 system, so we will never get the user to check something on Kafka. They don't care about it, and they just need to use it. So I think that's a very important thing around enterprises. They don't wanna deal with the underlying components. Another interesting area is security. So we've invested quite a bit around it. We're still investing in it. It's always a growing area. For example, we've implemented multi tenancy. You can actually create multiple tenants inside 1 Equalum installation. That means you have full separation and isolation from 1 tenant to another. We have a customer that is doing OEMing on us and providing us as part of their platform.
Their customers are actually competing, and they don't wanna see each other. And doing multi tenancy on these components is quite an extensive job, and it's quite hard to do. And we've invested a lot of time in isolating the components. So these are 2 examples.
[00:33:56] Unknown:
For customers who are interested in deploying Equalum and integrating it with their systems, what is involved in actually getting that set up and being able to point that at their various data sources?
[00:34:08] Unknown:
So as for the actual deployment, we provide an end to end, 1 click Ansible installation. You can ignore the word Ansible if you want to; we have just, as I said, 1 click. The whole work is being done by Ansible, and we are maintaining that playbook, or set of playbooks, to the level that users are customizing on top of us and doing their own Terraform or Ansible on top of us. But the installation itself is basically 1 click to get a cluster running. Once the cluster is running, it's just 1 click to create a source and integrate with just about any system that's needed. We support quite a few sources and targets. And in those, we support a lot of options and capabilities, to the level of encrypted Oracle connections, Kafka over SSL, and that sort of stuff that is very, very common in enterprises.
So installation and deployment is super simple.
[00:35:00] Unknown:
In terms of the ways that you've seen your customers using Equalum, what are some of the most interesting or unexpected or innovative implementations that you've seen?
[00:35:09] Unknown:
I think 1 of them, I'm not sure it's interesting. I think it's strange, maybe. But we have a customer that has analyzed the largest XMLs I've ever seen, and we had to deal with a lot of optimization around how to get an XML to work correctly and parse it correctly. And I'm talking about hundreds of megabytes of a single XML, and we've done that on streaming. In some parts, not hundreds of megabytes in streaming as 1 XML, but we've done it on streaming and batch in different variations. And we've gotten to the level of joining 10 XMLs or more and generating data that can actually be used in a data warehouse or a data lake, in that case using a structure that is just unreadable.
And it was done with a relatively simplistic flow, because it's still complex work, but I was amazed at the level of complexity that user got out of Equalum, and the use case and the complexity towards the target was just amazing to see.
[00:36:09] Unknown:
And in your own experience of building the Equalum platform and growing the business around it and being able to fulfill the needs of your end users, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:36:23] Unknown:
I think 1 lesson was we had a lot of thought about performance as a key factor for everybody. And we found that a lot of users don't really care about performance, and usability sometimes is 10 times more important than performance. We were quite surprised about that. So we found ourselves weighing performance and scalability versus usability and functionality. It's always been sort of like a fight during the development in Equalum between these 2. How do you get something super functional and super usable for a very extreme case, but still be performant? So we started by saying performance is the most important thing. But over the years, we found that that is not always the case, and I think that surprised us a bit.
[00:37:10] Unknown:
For people who are considering using Equalum, and they're interested and excited by its capabilities, what are the cases where it's the wrong choice?
[00:37:18] Unknown:
So I think the wrong choice would be if you are aiming at sort of playing around with the actual components themselves. If you wanna write Spark code, it is possible in Equalum, and you can use our own Spark. You can certainly use our Kafka. We have customers that are using our own Kafka as the enterprise Kafka. It is possible, but I would say that you are not benefiting from the full system. So if you wanna write specific Spark code for specific use cases, and you've got just a few of them, a lot of times you will not benefit from Equalum. But if you wanna get an end to end system working and benefit from Spark, you are certainly in the right place. So I would say that the developer oriented, I-wanna-write-my-own-code users are usually not the real orientation for what we're looking for. We're looking for ETL guys, business BI guys, that sort of area. They do wanna touch code, but that's not their main approach to solve things. They wanna get things done rather than writing their own code.
[00:38:21] Unknown:
As you continue to evolve the platform and bring on new customers, what are the capabilities that you have planned for the future of Equalum or new changes or new features?
[00:38:31] Unknown:
We have a few areas. 1 of the most interesting areas we are developing is smart pipelines. And by that, I mean, we wanna benefit from, first of all, pipelines that you have, sorry, flows that you've already written in your environment. And we are actually suggesting, or working on, it's still in progress, suggesting transformations and enrichments based on other flows in your environment. And that means that if you have a situation where you're writing a flow that is similar or might be similar to another flow, we'll suggest a set of operators that might help you to do the transformation.
And that means that if you have similarity between your flows, you might end up writing very little of your flows the more you write them. You might start with more complex ones, but along the way, you're actually getting suggestions from other stuff you've already done. So that's an area we're developing. We call this SmartEQ, and we're developing it in different areas, not only on the actual flow itself. We're also developing crawling the source and providing some insights on possible transformations at the source level.
So, sort of finding fields that might be interesting for transformations, finding tables that might be interesting for some stuff. So that is 1 area. Another area which I really like is we are looking at ETL as the base of data science. Of course, everybody is looking at the same approach, I would say. And 1 of the things we've done is we want to get the data engineer or ETL developer sitting with a data scientist, either on the same computer or on the same flow, and developing it together. And by that I mean, we want the data scientists to benefit from the data that is being ingested the second it is ingested and help the data engineer to write the flow. So for that, we are actually providing Jupyter notebooks inside our flows.
So you can actually open a Jupyter notebook on a preview of an operator and see the data. We actually provide a notebook that analyzes the data and gives you some concepts of what the data looks like, maybe some statistics on it. And you can write your own notebooks in Jupyter inside the flow. Once you've done that, you can work with the data engineer to integrate that together. So you can have a data engineer on 1 hand and a data scientist on the other hand writing the same flow without iterating too much. They are writing the flow together.
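The embedded-notebook idea amounts to profiling a sample of the in-flight data right next to the flow. A typical notebook-style cell for that kind of preview, using a hypothetical sample file rather than Equalum's preview mechanism, might look like this.

```python
# Illustrative notebook-style cell: profile a small sample of in-flight data.
# The sample path is hypothetical, not Equalum's preview API.
import pandas as pd

sample = pd.read_parquet("s3://example-bucket/preview/orders_sample.parquet")

sample.head()                    # eyeball a few rows
sample.describe(include="all")   # basic statistics per column
sample.isna().mean()             # null ratio per column, a quick data-quality signal
```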
[00:41:05] Unknown:
And for the notebook capabilities, are you leveraging anything such as papermill for being able to then actually just embed that notebook as the operator within the flow?
[00:41:15] Unknown:
Currently, no. We are using Jupyter. It is something we're thinking about, but it's currently more sort of, like, developer type activities. So we are, at the moment, with Jupyter only.
[00:41:28] Unknown:
Are there any other aspects of the work that you're doing on Equalum and the use cases that it enables, or the underlying tooling that we didn't discuss yet that you'd like to cover before we close out the show?
[00:41:40] Unknown:
I think 1 interesting area is, I think, the width of the solution. A lot of the competition in this area is very focused on 1 aspect. For example, there are many solutions that do data replication based on CDC, and they would usually not be focused on ETL. On the other hand, you might see ETL tools that do CDC, but they are not focused on it, so they don't do it well. For example, they would only do LogMiner. And I think we provide a very good combination of both. You can do binary log parsing from Oracle and combine that with files from S3, Kafka, and write the whole thing into Snowflake while transforming and enriching the data, plus aggregating it in streaming and doing streaming joins on that thing, and actually writing the data exactly in the way you want it to be in Snowflake without other systems. So you can get data from Oracle with all the integrations to Snowflake in 1 system. So I think the combination of these 2 is a very strong offer.
[00:42:47] Unknown:
Well, for anybody who wants to get in touch with you or follow along with work that you're doing, I'll have you add your preferred contact information to the show notes. And as my final question, I'd like to get your perspective on what you see as being the biggest gap of the tooling or technology that's available for data management today?
[00:43:02] Unknown:
I think we discussed it a bit. I think that the gap between designing a solution and getting it running in production is sometimes, or most of the time, harder than it looks. It's very easy to start. It's very easy to do something very small. But when you get to production, you have a million aspects and a million things you need to look at, and a lot of the tools do not provide this as part of them. So I think production readiness is a key thing that is missing as I see it. You need to do a lot of work to get there.
[00:43:34] Unknown:
Thank you very much for taking the time today to join me and discuss the work that you've been doing with Equalum and the capabilities that it provides. It's definitely a very interesting system, and the overall space of data integration is challenging. And I think that the approach that you've taken is very interesting and well engineered. So thank you for all the time and effort you've put into that, and I hope you enjoy the rest of your day. Thank you, Tobias, for hosting us. It's been a very interesting conversation. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Ido Friedman: Introduction and Background
Overview of Equalum and Its Goals
Market Position and Differentiation
Integration and Replacement Capabilities
Leveraging Existing Platforms and Low Code Workflow
Handling Data Transformations and Routing
Versioning and Rollback Mechanisms
Equalum Platform Architecture and Evolution
Challenges in Orchestration and Exactly Once Processing
Ensuring Data Quality and Testing Pipelines
Change Data Capture and Schema Evolution
Transactionality and Rollback Handling
No Code Approach and Custom Logic Integration
Enterprise Integration and Security
Deployment and Setup
Interesting Customer Use Cases
Lessons Learned and User Priorities
When Equalum is the Wrong Choice
Future Capabilities and Features
Conclusion and Contact Information