Summary
The first stage of every good pipeline is data integration. With the increasing pace of change and the demand for up to date analytics, the need to integrate that data in near real time keeps growing. With the improvements and increased variety of streaming data engines and better tools for change data capture, it is possible for data teams to make that goal a reality. However, despite all of the tools and managed distributions of those streaming engines, it is still a challenge to build a robust and reliable pipeline for streaming data integration, especially if you need to expose those capabilities to non-engineers. In this episode Ido Friedman, CTO of Equalum, explains how they have built a no-code platform to make integration of streaming data and change data capture feeds easier to manage. He discusses the challenges that are inherent in the current state of CDC technologies, how they have architected their system to integrate well with existing data platforms, and how to build an appropriate level of abstraction for such a complex problem domain. If you are struggling with streaming data integration and change data capture then this interview is definitely worth a listen.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Your host is Tobias Macey and today I’m interviewing Ido Friedman about Equalum, a no-code platform for streaming data integration
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you are building at Equalum and how it got started?
- There are a number of projects and platforms on the market that target data integration. Can you give some context of how Equalum fits in that market and the differentiating factors that engineers should consider?
- What components of the data ecosystem might Equalum replace, and which are you designed to integrate with?
- Can you walk through the workflow for someone who is using Equalum for a simple data integration use case?
- What options are available for doing in-flight transformations of data or creating customized routing rules?
- How do you handle versioning and staged rollouts of changes to pipelines?
- How is the Equalum platform implemented?
- How has the design and architecture of Equalum evolved since it was first created?
- What have you found to be the most complex or challenging aspects of building the platform?
- Change data capture is a growing area of interest, with a significant level of difficulty in implementing well. How do you handle support for the variety of different sources that customers are working with?
- What are the edge cases that you typically run into when working with changes in databases?
- How do you approach the user experience of the platform given its focus as a low code/no code system?
- What options exist for sophisticated users to create custom operations?
- How much of the underlying concerns do you surface to end users, and how much are you able to hide?
- What is the process for a customer to integrate Equalum into their existing infrastructure and data systems?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Equalum used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing the Equalum platform?
- When is Equalum the wrong choice?
- What do you have planned for the future of Equalum?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Equalum
- Change Data Capture
- SQL Server
- DBA == Database Administrator
- Fivetran
- Singer
- Pentaho
- EMR
- Snowflake
- S3
- Kafka
- Spark
- Prometheus
- Grafana
- LogMiner
- OBLP == Oracle Binary Log Parser
- Ansible
- Terraform
- Jupyter Notebooks
- Papermill
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes. And their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's I-M-M-U-T-A. Your host is Tobias Macey. And today, I'm interviewing Ido Friedman about Equalum, a no-code platform for streaming data integration. So, Ido, can you start by introducing yourself?
[00:01:54] Unknown:
Hi, Tobias. I'm Ido Friedman from Equalum and CTO of the company for the last few years. Been in the data domain for the last 15 years or so, doing roles from operations, database management, all the way to ETL, architecture, and just about any related roles in the data domain. I've been involved in Equalum for the last few years, again doing architecture and designing the product.
[00:02:16] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:19] Unknown:
Well, I initially started around SQL Server quite a few years back, doing a lot of DBA stuff and ended up doing architectural and complex projects, integrations, and all sorts of stuff around SQL Server and relational databases. The last few years, I've moved from relational to other databases and other data systems, as all of us did. So, yeah, I started there.
[00:02:42] Unknown:
Can you give a bit of an overview about what you're building at Equalum and how it got started? And you mentioned that you've been there for a few years, so maybe a bit of how you got involved with the business as well.
[00:02:53] Unknown:
OK. So we are building an end to end platform for data ingestion, basically an ETL system, with the goal of providing open source benefits to the enterprise domain. I'm sure that everybody who's tried to use open source in an enterprise has found all the difficulties and hardships around implementing it. And we are trying to bridge the gap and get the best of open source into an enterprise ready application and product. So that's for our goals. As for myself, I've been working for about 3 years, almost 3 years, in Equalum. I've started, I would say, halfway.
Equalum actually started with that goal, but had quite a few steps around it. We started with a very simple system and ended up with a full stack of Spark, Kafka, and all of the adjacent components, and a full system. So we started very simple.
[00:03:47] Unknown:
The overall space of data integration and ETL has become relatively crowded in the market, and there are a number of different approaches where some people are advocating for ELT, where you just do extract and load, using something like maybe Fivetran or the Singer set of tools, or some people are focused on batch oriented workflows using more traditional ETL approaches. And I'm wondering if you can give a bit of context as to how Equalum fits in that overall market and some of the differentiating factors that engineers should consider when they're debating what tools to use and what approach to take?
[00:04:25] Unknown:
The main differentiator for us is a customer that wants a system that is mature. A lot of integrations of open source products and open source capabilities are still in the making and are still growing as you grow with them. And we are aiming at providing the whole system end to end to someone who wants ETL rather than a lot of moving parts. I think that's the main thing. We don't want to provide yet another software that relies on 5 to 10 different vendors. So, for example, if you are doing streaming, you might implement Kafka, and you might need ZooKeeper around it, and you might need to monitor it. And I'm sure that everybody who's done that has seen the amount of components that you end up with. And I would say that when you start with a data engineering project, you usually end with data engineering plus a whole division of DevOps. We want to end that mess and provide 1 product with 1 vendor that gives you the whole thing end to end, everything monitored and relevant to the use case rather than to the technology.
[00:05:28] Unknown:
In terms of the overall ecosystem of data, you mentioned wanting to be able to bring the benefits of open source to the enterprise. And I'm wondering for people who might already have started down the journey of building out a data platform, they might have some capacity for data integration in place. What are the components of the overall ecosystem that Equalum is designed to replace outright, and which are the ones that it is designed to integrate with and augment?
[00:05:54] Unknown:
Let's start with replace. We are looking at ourselves as an ingestion system. So the word replace depends on what you are doing. I can certainly give you a few examples from implementations. We have replaced a system using open source Pentaho on top of EMR and a lot of orchestration around building the flow in Pentaho, executing the flows in EMR, monitoring, and getting everything working together. We have replaced the whole thing with just 1 system. And, again, 1 vendor for the whole thing and not mingling too many components and getting them to work. So I would say that for that use case, the replacement would be for the end to end system. So it depends on the use case itself, but we are aiming to replace the whole integration end to end from source to target, to get the data transformed, enriched, and managed, to the level of: I read it from the source, whether it is a streaming source, a batch source like S3, or even a CDC source, and write it to whatever the data target might be, Snowflake or S3, for data warehousing or data lakes. So it is an end to end solution that is aimed at providing the full stack that you require for data integration.
As for integrating with other products in the enterprise, I can give a few examples, like a data catalog that we are aiming to integrate with. We have not fully integrated yet. But we are not aiming to consume the whole domain of data. We are aiming at integrating with quite a few products that are still in the domain. So currently, we are aiming only at the data integration and data transformation and enrichment area.
[00:07:32] Unknown:
Particularly for the cases of Kafka and Spark where people are using them for a data integration component, they might also be using them downstream from that for being able to power machine learning workloads. And I'm wondering if Equalum is able to leverage the existing clusters that people have running for being able to automate some of the flows through those systems, or if it's a matter of they would just use their pipelines specifically for those downstream use cases and use Equalum entirely for the integration?
[00:08:05] Unknown:
Again, it depends. We can use an existing platform and existing Kafka and Spark. But in most cases, we found that 1 of the biggest benefits we provide is we wrap the whole thing together suited to the use case. And a lot of times when we use other platforms, we lose that benefit. We have to mitigate and work around with other components to get things working as we are aiming at. And I think that, in the end, it is possible. But in most cases, it would be good to give the domain of data integration to Equalum and have it manage the whole thing by itself. But it is possible, and we do have users doing that.
[00:08:44] Unknown:
And as I mentioned at the beginning, the Equalum platform is designed to be low code or no code. So I'm wondering if you can just talk through the overall workflow of somebody who's using Equalum for the simple use case of doing a direct point to point data integration without necessarily having any transformations in flight?
[00:09:03] Unknown:
I can actually give 2 use cases. So let's just start with a simple batch from S3. So you would define a source, provide the relevant credentials on the source. On top of that source, you would define a stream, so a flow. And in the flow, whether it's needed or not, you can add the source plus transformations, define the target, and that's just about it. It's as simple as that. Define source and target, connect the dots in a flow, and you're done. So for a very simple flow, I would say, 5 minutes from the point of starting, you'll probably have a running flow. So I think that that is 1 very simple flow. Another interesting flow is the use case of replication. We do a lot of data replication, and by that I mean, you might have an online Oracle system used for something in your business, and you wanna get all the data into a data lake in the cloud, for example. Doesn't have to be that. Of course, there's a lot of options.
You can create the source again, and we have an object called replication groups where instead of creating a flow per table, you actually create an object called a replication group that holds all the tables that you have selected, and you're gonna get all those tables into your target with, I would say, 3 or 4 more clicks. But it would be for a 100, a 1000, or even more tables without you going into too much detail. So those would be 2 examples of simple flows.
[00:10:27] Unknown:
For the case where you then have an existing data flow of doing this point to point integration, you then want to be able to add in transformations to maybe overwrite or occlude PII, so maybe mask the beginning of a credit card number or remove the street number from an address field, or if you want to say that for a particular source or a particular subset of records from a source, you want to actually route those to a different destination point. What are the capabilities in Equalum for being able to handle those use cases and how that manifests in the workflow and the interface that's exposed to the end user?
[00:11:08] Unknown:
So I think the interface is 1 of our key points, and we've invested a lot in getting just a few clicks to do a lot of work. The interface itself is translated eventually to Spark code. So, generally, everything you can do in Spark, we allow you to do in our canvas. For that example, just masking data would mean instead of just doing source to target, you would add a transform operator. The transform operator will show you the whole schema, and you can do whatever you want on that schema. So double clicking on the flow, selecting the right function, and you're masked. Same goes for filtering, so it's pretty simple. You simply select a filter operator, or a split operator if you wanna divide data, and provide a super simple expression, at the level that any Excel user who can write a very simple formula in Excel would be very comfortable writing. Even the word writing makes it sound more complex than it actually is. It's just double clicking on fields and selecting the functions you wanna use for filtering.
And, again, in probably 2 minutes, you can get filters in place and route data to the right point.
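To make the kind of transformation described above concrete, here is a minimal PySpark sketch of masking a card number and routing rows to two different targets. This illustrates the underlying pattern only; it is not Equalum's generated code, and the column names and paths are hypothetical.

```python
# Illustrative sketch only -- not Equalum's implementation.
# Masks a credit card column and routes rows to different targets.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask-and-route-example").getOrCreate()

df = spark.read.parquet("s3a://example-bucket/orders/")  # hypothetical source path

# Mask all but the last four digits of the card number and drop the street number.
masked = (
    df.withColumn("card_number", F.concat(F.lit("************"),
                                          F.substring("card_number", -4, 4)))
      .withColumn("address", F.regexp_replace("address", r"^\d+\s+", ""))
)

# Route: domestic rows to one target, everything else to another.
domestic = masked.filter(F.col("country") == "US")
other = masked.filter(F.col("country") != "US")

domestic.write.mode("append").parquet("s3a://example-bucket/clean/domestic/")
other.write.mode("append").parquet("s3a://example-bucket/clean/international/")
```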
[00:12:16] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool Water Flask. And then as far as being able to handle things like versioning or updates to existing workflows, how do you handle being able to stage the rollout of that from a staging to a production environment, for example, or testing out a flow with a subset of data and being able to validate that and then roll back in the event of errors?
[00:13:34] Unknown:
So we provide 2 areas around this. First of all, our flows are versioned. That means any edit you do on a flow does not interfere with anything that is currently running. When you click edit on a flow, you basically get a new version. So you would never have to edit a running flow. Once you have the version and get to the point where you are okay with it, you can publish it and replace the running version. So it's very easy to, on 1 end, publish a new version, and since we are doing versioning of everything, it is super simple to roll back. So that's inside 1 system. For migrating between systems, we provide quite a few options, using the CLI to export the whole thing. You can export via the UI, and we have a full list of APIs that you can use to export. For that example, we have quite a few users exporting everything into files and managing the whole thing in Git type solutions, so in source code management. But you can also avoid this and just use the versions that are in the product itself.
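One generic way to manage exported flow definitions under source control, as mentioned above, is to snapshot the exported files into git on each release. The sketch below is hypothetical: the export command name and its flags are placeholders, not Equalum's actual CLI, so substitute whatever export mechanism your installation provides.

```python
# Illustrative sketch only: keeping exported flow definitions in git.
# The "equalum-cli export" command and its flags below are hypothetical placeholders.
import subprocess
from pathlib import Path

EXPORT_DIR = Path("flows")          # directory that will hold exported flow files
VERSION_TAG = "flows-release-1"     # hypothetical tag name for this snapshot

def snapshot_flows() -> None:
    EXPORT_DIR.mkdir(exist_ok=True)
    # Hypothetical export step; replace with your real export command or API call.
    subprocess.run(["equalum-cli", "export", "--out", str(EXPORT_DIR)], check=True)
    # Commit and tag the snapshot so rolling back is just checking out an earlier tag.
    subprocess.run(["git", "add", str(EXPORT_DIR)], check=True)
    subprocess.run(["git", "commit", "-m", f"Snapshot flows for {VERSION_TAG}"], check=True)
    subprocess.run(["git", "tag", VERSION_TAG], check=True)
```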
[00:14:35] Unknown:
And can you dig a bit more into the Equalum platform itself and how it's implemented and some of the ways that the design and architecture has evolved since it first began?
[00:14:44] Unknown:
As we spoke about in the beginning, we are using quite a few open source products stitched together. So Spark is the processing engine. We are using Spark Streaming and standard Spark for batch. 1 cool thing about that is when you're writing a flow, you don't really care how it's gonna be executed. We decide for you based on the objects you've selected. If the objects require streaming, we'll simply do Spark Streaming. If it's not required, we'll do Spark batch. So we are using Spark for the processing engine. We have fully integrated Kafka inside the system. So for the example of CDC, when you're pulling data out of Oracle or SQL Server or whatever relational database it is, you are actually pulling the data out of it, and we are putting it into our internal Kafka and using Spark Streaming on it.
And you're benefiting from the whole thing without touching anything underneath, of course. So for that matter, we'll create the topics. We configure everything, including partitions and all the configuration around Spark, to get the whole thing to work together. I think 1 key, very important point is we're doing exactly once. I'm sure that anybody who's done that knows how difficult that is. We're doing end to end exactly once on streaming and batch. So as for the architecture itself, Spark is the processing engine. Kafka is our internal buffer. We have quite a few other components in the system, like Prometheus for monitoring. We provide dashboards in Grafana. All the metrics you see in our UI are based on Prometheus, to the level that you can use the data in Prometheus, the metrics, to integrate with whatever system. Just today, I had a conversation about integrating into New Relic. So we provide quite a few components that you would expect and you would probably implement yourself if you did a similar project, but the whole thing is already stitched together into 1 solution.
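For readers unfamiliar with the underlying pattern, the stock Kafka-to-Spark Structured Streaming pipeline that a platform like this manages on your behalf looks roughly like the sketch below. The topic name, record schema, and storage paths are hypothetical; this is the generic pattern, not Equalum's internal code.

```python
# Illustrative sketch of the generic Kafka -> Spark Structured Streaming pattern.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("cdc-stream-example").getOrCreate()

# Hypothetical shape of a change event on the topic.
change_schema = StructType([
    StructField("op", StringType()),       # insert / update / delete
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "oracle.orders.changes")   # hypothetical CDC topic
            .load())

# Parse the Kafka value bytes into structured columns.
changes = (raw.select(F.from_json(F.col("value").cast("string"), change_schema).alias("c"))
              .select("c.*"))

# Write to the lake with checkpointing so the stream can recover after failures.
query = (changes.writeStream
                .format("parquet")
                .option("path", "s3a://example-bucket/lake/orders/")
                .option("checkpointLocation", "s3a://example-bucket/checkpoints/orders/")
                .outputMode("append")
                .start())
query.awaitTermination()
```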
[00:16:31] Unknown:
And in terms of being able to orchestrate all those elements together, and in particular, being able to handle the exactly once processing all the way through, what have you found to be some of the most complex or challenging aspects of being able to build out and maintain the platform, particularly as the capabilities and versions of those underlying components change and evolve?
[00:16:52] Unknown:
First of all, exactly once, I think, is 1 of the hardest aspects of the domain, especially if you have distributed systems where you need to design for failure. That means that failure is a part of your life, and you need to mitigate it. So we found ourselves doing years of code around exactly once. Even in areas you wouldn't think of: you get an ack from Kafka, but that doesn't really mean that the ack is fully good enough for you. In some cases, you might wanna actually check that the data is there. So exactly once is a deep and very complex problem, and to say that you have mitigated it means you've done a lot of work. So I think it is a big area in our product, and we are aiming to provide exactly once at an enterprise level. That means you can get your financial data through Equalum without worrying about it getting lost or being duplicated.
So exactly once has been quite an effort. The other part is the underlying components. And as you understand, we have, as you would expect, quite a few components. And we see these components as part of Equalum. That means that when we see it fit, we will upgrade a component. It's something that you should not care about as a customer, and we have done upgrades of Spark, Kafka, ZooKeeper, and whatever component it is in the system without interfering with the customer itself. We are doing it seamlessly as part of an upgrade of a version. And I think this is a very important part because these types of activities usually take a lot of time and a lot of DevOps effort to automate and orchestrate, and we've done a lot of work to get that seamlessly into our upgrades.
So that's just the orchestration of it. On the other hand, we try to keep as up to date as we can. We are not able to keep up with every small change in any specific component, because we are doing thorough testing internally before we release anything. And since we're developing a generic platform, we do need to check quite a few use cases. We are not at the latest version at any point of time, but we are very keen to upgrade and provide the new capabilities and new stability that these products provide. And, again, it's provided as part of an upgrade of Equalum. You don't need to upgrade the actual component.
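As a rough illustration of the "an ack is not enough" point, one common generic approach to approximating end-to-end exactly-once delivery is to make the target write idempotent and commit consumer offsets only after the write is confirmed. The sketch below uses kafka-python with hypothetical names; it is a simplified pattern, not Equalum's mechanism.

```python
# Illustrative sketch: idempotent writes + manual offset commits as one generic
# approximation of exactly-once delivery. Not Equalum's implementation.
import json
from kafka import KafkaConsumer  # pip install kafka-python

applied = {}  # stand-in for the target table, keyed on the record id

def upsert(record: dict) -> None:
    # Idempotent write: replaying the same id overwrites rather than duplicates.
    # In practice this would be a MERGE/upsert into the real target.
    applied[record["id"]] = record

consumer = KafkaConsumer(
    "orders.changes",                  # hypothetical topic
    bootstrap_servers="broker:9092",
    group_id="example-loader",
    enable_auto_commit=False,          # offsets are committed manually, below
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    upsert(message.value)
    # Commit the offset only after the write succeeds: a crash in between replays
    # the record into an idempotent sink instead of losing or duplicating it.
    consumer.commit()
```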
[00:19:09] Unknown:
The subject of being able to ensure that your customer pipelines continue running and that you're not introducing any regressions is definitely challenging. And I'm wondering how you address issues such as data quality, or doing early identification of potential failures, or maybe being able to do things like static code analysis of customers' compiled pipelines to see what potential errors those might surface in your test environments before you actually roll them into production?
[00:19:37] Unknown:
Yeah. It certainly is quite an area that you need to validate. It caused us to create an actual Python library that automates the whole Equalum pipeline creation. By the way, we can provide it if needed for automation. Just about any action in a flow that is possible, we have automated, and we are testing it on every release. And even before releases, of course, we have customers assisting us with testing, with very strict scenarios and very odd scenarios. We are actually using customers' flows to test, when they allow us. So it is something that we invest a lot of time and effort around: checking that when you upgrade something, for example, just upgrading Spark, you have not hurt any data types or any transformation that has changed between the versions.
So we are simply doing thorough testing with as much data as we can generate, plus automation on top of it.
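A generic flavor of the regression checks described here, comparing a flow's output before and after an upgrade for schema drift and row-level differences, could look like the following PySpark sketch. The paths are hypothetical, and this is not the internal Equalum test library.

```python
# Illustrative regression check: compare the output of a flow before and after
# an upgrade for schema drift and row-level differences. Not Equalum's tests.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flow-regression-check").getOrCreate()

baseline = spark.read.parquet("s3a://example-bucket/test/baseline/")    # hypothetical paths
candidate = spark.read.parquet("s3a://example-bucket/test/candidate/")

# Schema drift: catches an upgrade silently changing a data type or column.
assert baseline.schema == candidate.schema, "schema changed between versions"

# Row-level drift in both directions.
missing = baseline.exceptAll(candidate).count()
extra = candidate.exceptAll(baseline).count()
assert missing == 0 and extra == 0, f"{missing} rows missing, {extra} rows added"
```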
[00:20:33] Unknown:
1 of the elements of the pipeline and being able to stay up to date with changes in source systems is the concept of change data capture, which you handle throughout your workflow. And that's definitely an area that is becoming increasingly relevant and becoming table stakes for any sort of data integration capabilities for a platform. And I'm wondering what you have seen as some of the challenges and edge cases that you run into for being able to support that change data capture for the variety of systems that your customers might be working with?
[00:21:08] Unknown:
So, yes, CDC, change data capture, is certainly a key factor, and we're doing a lot of it. I mentioned replication groups before. It's all based on CDC, of course. And it is a key element. We have a dedicated team just for connectivity to sources and executing CDC concepts. We have gotten to the level of, hopefully the Oracle guys are listening, I'm sure that everybody knows LogMiner and the benefits and drawbacks of it. So we have actually developed an alternative to LogMiner. We call it OBLP, Oracle Binary Log Parser, which means we are able to read Oracle logs ourselves. We don't need any external components.
So we are doing a lot of work in that area. We are investing a lot of time on performance around that, and I think this is 1 of the key factors around CDC. People wanna see the data being updated with a millisecond delay without affecting production at all. And I think that is the hardest point there is in CDC when you go to scale. For that example, LogMiner is not as good as it should be, and we found that it does have limits, and we simply avoided it by writing our own solution. Not many companies have done that, by the way, for Oracle. So we've seen that for quite a few databases, and I think we are excelling in that area, and we are providing a lot of databases with CDC capabilities with that performance approach at hand. So I think that's 1 area that is very important, performance.
The other area is dealing with schema changes, which, in relational databases, doesn't sound like the most important problem, since the schema does not change that much compared to other data sources. But when it does, you want to have a perfect solution that does whatever you need end to end. So we provide, on top of CDC, end to end schema changes. That means you've added a column in your Oracle database and you're writing to Snowflake; you're gonna get the new column in Snowflake seamlessly. You don't need to do anything. You can actually decide in some cases that a schema change is something you wanna get involved in. You can actually get a notification that tells you we have identified a schema change. Do you want to do something with it, or do you want us to automate it all the way? So I do think these 2 aspects are the most important: performance and schema change management, schema evolution.
I think these are the 2 important parts, and we've invested a lot of time there.
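The schema evolution behavior described above boils down to a simple pattern: when the source gains a column the target does not have, add it (or ask for approval) before applying changes. Here is a minimal, generic sketch of that pattern, not Equalum's implementation, using a plain DB-API cursor on the target.

```python
# Illustrative sketch of schema evolution: add source columns missing from the target.
def evolve_target_schema(cursor, table: str, source_columns: dict, target_columns: set) -> None:
    """source_columns maps column name -> SQL type, e.g. {"discount": "NUMBER(10,2)"}."""
    for name, sql_type in source_columns.items():
        if name not in target_columns:
            # A real system could notify and wait for approval here, mirroring the
            # "do you want to automate it all the way" choice described above.
            cursor.execute(f'ALTER TABLE {table} ADD COLUMN "{name}" {sql_type}')
```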
[00:23:40] Unknown:
Yeah. The other element of change data capture that I've seen as being challenging is identifying issues with transactionality, where I know that not all of the systems that support change data capture are able to operate at the higher level to understand beyond just, you know, these are the specific changes that are being written. And then if a transaction fails, having to then be able to roll that back. And so I'm curious how you handle situations like that and not having to replay all of those events or, you know, destroy and rebuild a table because of a transaction rollback, particularly when you're dealing with immutable data systems on the receiving end.
[00:24:20] Unknown:
I agree that it is a challenge. A lot of times, if you wanna wait till the end of the transaction for it to be fully committed, that means you wanna start reading it only when it's committed, and that means you're gonna have a lot of delay. And 1 of the performance improvements we've done is we're gonna read a transaction before it's committed, but we're not gonna process it until we get the commit for that transaction. That means you get almost 0 delay. So you might have a transaction that does a 1,000,000 record change, and we're gonna read it along as it goes into our system.
But the apply or commit will only be executed when we receive the commit from the source. So, yes, it is a big problem, certainly, and we had to specifically develop a component just for that, to deal with it and still be relevant and avoid, as you said, rollbacks, so we will not apply anything that might be rolled back. But you still wanna be on top of the changes and not wait for the end of the transaction.
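The approach described here, reading change events as soon as they appear in the log but applying them only once the commit arrives, can be illustrated with a small generic buffer keyed by transaction id. This is a conceptual sketch, not Equalum's component.

```python
# Illustrative sketch: buffer change events per transaction, apply on COMMIT,
# discard on ROLLBACK. Not Equalum's implementation.
from collections import defaultdict

class TransactionBuffer:
    def __init__(self, apply_fn):
        self._pending = defaultdict(list)   # transaction id -> buffered change events
        self._apply = apply_fn              # callable that applies a committed batch

    def on_change(self, txn_id: str, event: dict) -> None:
        # Read eagerly as the log is written, but do not apply yet.
        self._pending[txn_id].append(event)

    def on_commit(self, txn_id: str) -> None:
        # Apply the whole transaction in one go, then forget it.
        self._apply(self._pending.pop(txn_id, []))

    def on_rollback(self, txn_id: str) -> None:
        # Nothing was applied downstream, so rolling back is just dropping the buffer.
        self._pending.pop(txn_id, None)
```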
[00:25:20] Unknown:
And the other interesting aspect of your platform is with the focus on the no code approach. I'm wondering what your heuristics and design priorities have been for being able to surface some of these complex data management challenges and issues to users who don't necessarily have a deep background in it and making it accessible and understandable to them, while also having the flexibility for engineering oriented teams to be able to implement custom logic that they can embed within that workflow?
[00:25:53] Unknown:
So, as I said, we've implemented over Spark. So as for complexity, I think we've done flows with thousands of transformations overall, and I think we can probably get to any transformation that is possible in the world. We'll be able to do it. And since we are using Spark, it is going to be super fast as well. We actually have quite a few optimizations around how you have built your flow to be readable versus how we execute it in Spark. So you can actually design your flow to be super readable. You should not care about how it's gonna be executed. And we are actually compressing operators.
That means if we can combine 2 operators during the execution into 1, or a 100 into 10, for example, we will do that. That means that when you design your flow, you can design a flow that is super easy to read. And we've seen that in transforming flows from other systems to Equalum. We've seen a reduction in the amount of operators on the flow, sometimes by a factor of 10. So we've seen a flow that has a 100 operators in Pentaho, for that example, going down to 10 operators in Equalum because of the abilities of the transformations and the ability for you to ignore any performance aspects. So that is 1 key point. I think that it means that for you as a developer or an ETL developer, you should not care about the performance aspects of how you build the flow. You should just make it readable.
As for the actual writing of the transformations themselves, we allow for just about any function possible, whether it is text or dates or numbers or whatever transformation you've done or you need to do, including JSON, XML, and parsing whatever type of data you want inside the flow or even before the data lands in the flow. So it is possible to do just about anything with our built in functions. But if you get to a limit for some reason, or you want to reuse code that you have already written, we allow you to integrate Java code as a JAR into our system, which will be exposed as a function. It's pretty easy to use as well. We provide a sample project to do that. We also provide JavaScript as part of the flow, so you can also use that for custom coding. And I think the third thing, which is very important, is we have actually migrated Java code that has been written for MapReduce type jobs into flow operators without any coding. So I think that's the best example of how complex you can get. We have integrated Java into a graphical flow.
And, of course, you're gaining a lot of performance since you are executing this on Spark, and since we are implementing the built in Spark functions rather than custom code, you'll get better performance when you use the built in functions.
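The operator compression idea mirrors what Spark's own optimizer already does for chained narrow transformations: many small, readable steps collapse into a single physical stage. Here is a quick stock-PySpark illustration of that general behavior (not Equalum's flow compiler).

```python
# Illustrative sketch: many small, readable transformation steps that Spark's
# optimizer fuses into a single scan + projection + filter.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("operator-fusion-example").getOrCreate()
df = spark.range(1_000_000)

# Ten tiny, readable steps...
for i in range(10):
    df = df.withColumn(f"step_{i}", F.col("id") + i)
df = df.filter(F.col("id") % 2 == 0)

# ...show the physical plan to see them collapsed into one stage.
df.explain()
```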
[00:28:37] Unknown:
For the cases where you do need to be able to access some of the underpinnings of the system to be able to understand where there might be an edge case or diagnose errors or be able to add in additional quality checks that are custom to your business domain, what is your approach for being able to hide the underlying concerns from the end users? And what are the cases where you are forced to expose them in some manner and how you make them accessible to your customers?
[00:29:06] Unknown:
So I think we can take the example of the way we implement Spark. The usual way of implementing Spark is submitting jobs, submitting jobs into Spark and waiting for the job to execute, and that means that you need to know what the correct characteristics of your job are, and you need to configure the submission of the job to your flow and to your specific code that you wrote. The way we have actually obfuscated that from the user is we have an application running in Spark constantly, and we are receiving flows that have been written, and we optimize them. And we actually do not require the user to know anything about Spark. We get the flow diagram and convert it into an execution in Spark based on our knowledge and how we analyze the needs inside the flow. So on 1 hand, you don't need to know anything about Spark, and you don't need to understand anything about it, and you can work just building the flow in your graphical interface.
On the other hand, we do have users that have gone to the level of, yes, I wanna see what's going on there and why is it slow or maybe I wanna change something. We do provide all the abilities including monitoring and metrics on our flows in Prometheus and in Grafana. So you are very welcome to go into Spark and see whatever you need, whether it is directly on the Spark UI or through metrics that we provide in Grafana and Prometheus.
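The "constantly running application" model described here contrasts with submitting a new Spark job per flow. A highly simplified sketch of a resident driver that keeps one SparkSession alive and executes flow definitions as they arrive might look like the following; the queue and the flow format are hypothetical stand-ins, not Equalum's engine.

```python
# Illustrative sketch of a long-running driver that executes flow definitions
# as they arrive, instead of a spark-submit per job. Not Equalum's engine.
import queue
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("resident-flow-runner").getOrCreate()
flow_queue: "queue.Queue[dict]" = queue.Queue()   # populated by an API/UI layer (hypothetical)

def run_flow(flow: dict) -> None:
    # A "flow" here is just {source_path, target_path}; a real engine would
    # translate a full operator graph into DataFrame operations.
    df = spark.read.parquet(flow["source_path"])
    df.write.mode("append").parquet(flow["target_path"])

while True:
    run_flow(flow_queue.get())   # blocks until the next flow definition arrives
```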
[00:30:36] Unknown:
Another interesting thing to dig into is that at the beginning, you mentioned that you're working on simplifying the process of bringing in these open source capabilities to the enterprise. And beyond the challenge of just being able to integrate these systems and run them at scale, what are some of the other challenges that you face in bringing those capabilities to the enterprise and some of the special considerations that you've had to build in or the communication patterns that you've had to use to convince the enterprise users of either the value or the utility of these systems and kind of sell into those channels?
[00:31:11] Unknown:
I think 1 of the key aspects that we heard from our customers that have benefited, especially customers that have migrated from their own DIY solution, is, I think, that the number of moving parts and the integration between them is a lot of the time very hard. And by that, I mean, if you have an ETL system that you write flows in, then you need to execute them on a Spark cluster or whatever type of processing engine. And then you might wanna scale it up, so you might need Kafka as well. You find yourself integrating the whole thing.
So if you wanna get enterprise level solutions for each of those, you might find yourself with 3 or 4 vendors just for that use case. And we sort of call it vendor hell, because with whatever problem you have, you're gonna ask each of those vendors whose fault is it. So it goes into very different areas. Identifying problems, you will always have to identify whether it is the ETL system, whether it is Kafka, whether it is Spark, or whatever engine you're using. So monitoring and identifying issues is always falling between these.
There are other elements of managing multiple solutions that are not meant to be integrated externally, I would say. Usually, you would need to integrate it yourself and have hands on all of them together. I think that's a very important area that we've seen a lot of companies deal with, trying to get all of these working together and finding themselves failing, or finding themselves in years of a project without getting the real benefit from it. And we've invested a lot of time in thinking about it as 1 system rather than a bunch of components that need to talk to each other. It is 1 system, so we will never get the user to check something on Kafka. They don't care about it, and they just need to use it. So I think that's a very important thing around enterprises. They don't wanna deal with the underlying components. Another interesting area is security. So we've invested quite a bit around it. We're still investing in it. It's always a growing area. For example, we've implemented multi tenancy. You can actually create multiple tenants inside 1 Equalum installation. That means you have full separation and isolation from 1 tenant to another. We have a customer that is doing OEMing on us and providing us as part of their platform.
Their customers are actually competing, and they don't wanna see each other. And doing multi tenancy on these components is quite an extensive job, and it's quite hard to do. And we've invested a lot of time in isolating the components. So these are 2 examples.
[00:33:56] Unknown:
For customers who are interested in deploying Equalum and integrating it with their systems, what is involved in actually getting that set up and being able to point that at their various data sources?
[00:34:08] Unknown:
So as for the actual deployment, we provide an end to end, 1 click Ansible installation. You can ignore the word Ansible if you want to; we have just, as I said, 1 click. The whole work is being done by Ansible, and we are maintaining that playbook, or set of playbooks, to the level that users are customizing on top of us and doing their own Terraform or Ansible on top of us. But the installation itself is basically 1 click to get a cluster running. Once the cluster is running, it's just 1 click to create a source and integrate with just about any system that's needed. We support quite a few sources and targets. And in those, we support a lot of options and capabilities, to the level of encrypted Oracle connections, Kafka over SSL, and that sort of stuff that is very, very common in enterprises.
So installation and deployment is super simple.
[00:35:00] Unknown:
In terms of the ways that you've seen your customers using Equalum, what are some of the most interesting or unexpected or innovative implementations that you've seen?
[00:35:09] Unknown:
I think 1 of them, I'm not sure it's interesting. I think it's strange, maybe. But we have a customer that has analyzed the largest XMLs I've ever seen, and we had to deal with a lot of optimization around how to get an XML to work correctly and parse it correctly. And I'm talking about hundreds of megabytes of a single XML, and we've done that on streaming. In some parts, not hundreds of megabytes in streaming as 1 XML, but we've done it on streaming and batch in different variations. And we've gotten to the level of joining 10 XMLs or more and generating data that can actually be used in a data warehouse or a data lake, in that case using a structure that is just unreadable.
And it was done with a relatively simplistic flow, because it's still complex work, but I was amazed at the level of complexity that user got out of Equalum, and the use case and the complexity towards the target was just amazing to see.
[00:36:09] Unknown:
And in your own experience of building the Equalum platform and growing the business around it and being able to fulfill the needs of your end users, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:36:23] Unknown:
I think 1 lesson was we had a lot of thought about performance as a key factor for everybody. And we found that a lot of users don't really care about performance, and usability sometimes is 10 times more important than performance. We were quite surprised about that. So we found ourselves weighing performance and scalability versus usability and functionality. It's always been sort of like a fight during the development in Equalum between these 2. How do you get something super functional and super usable for a very extreme case, but still be performant? So we started by saying performance is the most important thing. But over the years, we found that that is not always the case, and I think that surprised us a bit.
[00:37:10] Unknown:
For people who are considering using Equalum, and they're interested and excited by its capabilities, what are the cases where it's the wrong choice?
[00:37:18] Unknown:
So I think the wrong choice would be if you are aiming at sort of playing around with the actual components themselves. If you wanna write Spark code, it is possible in Equalum, and you can use our own Spark. You can certainly use our Kafka. We have customers that are using our own Kafka as the enterprise Kafka. It is possible, but I would say that you are not benefiting from the full system. So if you wanna write specific Spark code for specific use cases, and you've got just a few of them, a lot of times you will not benefit from Equalum. But if you wanna get an end to end system working and benefit from Spark, you are certainly in the right place. So I would say that the developer oriented, I-wanna-write-my-own-code users are usually not the real orientation for what we're looking for. We're looking for ETL guys, business BI guys, that sort of area. They do wanna touch code, but that's not their main approach to solve things. They wanna get things done rather than writing their own code.
[00:38:21] Unknown:
As you continue to evolve the platform and bring on new customers, what are the capabilities that you have planned for the future of Equalum or new changes or new features?
[00:38:31] Unknown:
We have a few areas. 1 of the most interesting areas we are developing is smart pipelines. And by that, I mean, we wanna benefit from, first of all, pipelines that you have, sorry, flows that you've already written in your environment. And we are actually suggesting, or working on, it's still in progress, suggesting transformations and enrichments based on other flows in your environment. And that means that if you have a situation where you're writing a flow that is similar or might be similar to another flow, we'll suggest a set of operators that might help you to do the transformation.
And that means that if you have similarity between your flows, you might end up writing very little of your flows the more you write them. You might start with more complex ones, but along the way, you're actually getting suggestions from other stuff you've already done. So that's an area we're developing. We call this SmartEQ, and we're developing it in different areas, not only on the actual flow itself. We're also developing crawling the source and providing some insights on possible transformations at the source level.
So, sort of finding fields that might be interesting for transformations, finding tables that might be interesting for some stuff. So that is 1 area. Another area which I really like is we are looking at ETL as the base of data science. Of course, everybody is looking at the same approach, I would say. And 1 of the things we've done is we want to get the data engineer or ETL developer sitting with a data scientist, either on the same computer or on the same flow, and developing it together. And by that I mean, we want the data scientists to benefit from the data that is being ingested the second it is ingested and help the data engineer to write the flow. So for that, we are actually providing Jupyter notebooks inside our flows.
So you can actually open a Jupyter notebook on a preview of an operator and see the data. We actually provide a notebook that analyzes the data and gives you some concepts of what the data looks like, maybe some statistics on it. And you can write your own notebooks in Jupyter inside the flow. Once you've done that, you can work with the data engineer to integrate that together. So you can have a data engineer on 1 hand and a data scientist on the other hand writing the same flow without iterating too much. They are writing the flow together.
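The embedded-notebook idea amounts to profiling a sample of the in-flight data right next to the flow. A typical notebook-style cell for that kind of preview, using a hypothetical sample file rather than Equalum's preview mechanism, might look like this.

```python
# Illustrative notebook-style cell: profile a small sample of in-flight data.
# The sample path is hypothetical, not Equalum's preview API.
import pandas as pd

sample = pd.read_parquet("s3://example-bucket/preview/orders_sample.parquet")

sample.head()                    # eyeball a few rows
sample.describe(include="all")   # basic statistics per column
sample.isna().mean()             # null ratio per column, a quick data-quality signal
```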
[00:41:05] Unknown:
And for the notebook capabilities, are you leveraging anything such as papermill for being able to then actually just embed that notebook as the operator within the flow?
[00:41:15] Unknown:
Currently, no. We are using Jupyter. It is something we're thinking about, but it's currently more sort of, like, developer type activities. So we are, at the moment, with Jupyter only.
[00:41:28] Unknown:
Are there any other aspects of the work that you're doing on Equalum and the use cases that it enables, or the underlying tooling that we didn't discuss yet that you'd like to cover before we close out the show?
[00:41:40] Unknown:
I think 1 interesting area is, I think, the width of the solution. A lot of the competition in this area is very focused on 1 aspect. For example, there are many solutions that do data replication based on CDC, and they would usually not be focused on ETL. On the other hand, you might see ETL tools that do CDC, but they are not focused on it, so they don't do it well. For example, they would only do LogMiner. And I think we provide a very good combination of both. You can do binary log parsing from Oracle and combine that with files from S3, Kafka, and write the whole thing into Snowflake while transforming and enriching the data, plus aggregating it in streaming and doing streaming joins on that thing, and actually writing the data exactly in the way you want it to be in Snowflake without other systems. So you can get data from Oracle with all the integrations to Snowflake in 1 system. So I think the combination of these 2 is a very strong offer.
[00:42:47] Unknown:
Well, for anybody who wants to get in touch with you or follow along with work that you're doing, I'll have you add your preferred contact information to the show notes. And as my final question, I'd like to get your perspective on what you see as being the biggest gap of the tooling or technology that's available for data management today?
[00:43:02] Unknown:
I think we discussed it a bit. I think that the gap between designing a solution and getting it running in production is sometimes, or most of the time, harder than it looks. It's very easy to start. It's very easy to do something very small. But when you get to production, you have a million aspects and a million things you need to look at, and a lot of the tools do not provide this as part of them. So I think production readiness is a key thing that is missing as I see it. You need to do a lot of work to get there.
[00:43:34] Unknown:
Thank you very much for taking the time today to join me and discuss the work that you've been doing with Equalum and the capabilities that it provides. It's definitely a very interesting system, and the overall space of data integration is challenging. And I think that the approach that you've taken is very interesting and well engineered. So thank you for all the time and effort you've put into that, and I hope you enjoy the rest of your day. Thank you, Tobias, for hosting us. It's been a very interesting conversation. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Ido Friedman: Introduction and Background
Overview of Equalum and Its Goals
Market Position and Differentiation
Integration and Replacement Capabilities
Leveraging Existing Platforms and Low Code Workflow
Handling Data Transformations and Routing
Versioning and Rollback Mechanisms
Equalum Platform Architecture and Evolution
Challenges in Orchestration and Exactly Once Processing
Ensuring Data Quality and Testing Pipelines
Change Data Capture and Schema Evolution
Transactionality and Rollback Handling
No Code Approach and Custom Logic Integration
Enterprise Integration and Security
Deployment and Setup
Interesting Customer Use Cases
Lessons Learned and User Priorities
When Equalum is the Wrong Choice
Future Capabilities and Features
Conclusion and Contact Information