Summary
Data management is hard at any scale, but working in the context of an enterprise organization adds even greater complexity. Infoworks is a platform built to provide a unified set of tooling for managing the full lifecycle of data in large businesses. By reducing the barrier to entry with a graphical interface for defining data transformations and analysis, it makes it easier to bring the domain experts into the process. In this interview co-founder and CTO of Infoworks Amar Arsikere explains the unique challenges faced by enterprise organizations, how the platform is architected to provide the needed flexibility and scale, and how a unified platform for data improves the outcomes of the organizations using it.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Free yourself from maintaining brittle data pipelines that require excessive coding and don’t operationally scale. With the Ascend Unified Data Engineering Platform, you and your team can easily build autonomous data pipelines that dynamically adapt to changes in data, code, and environment — enabling 10x faster build velocity and automated maintenance. On Ascend, data engineers can ingest, build, integrate, run, and govern advanced data pipelines with 95% less code. Go to dataengineeringpodcast.com/ascend to start building with a free 30-day trial. You’ll partner with a dedicated data engineer at Ascend to help you get started and accelerate your journey from prototype to production.
- Your host is Tobias Macey and today I’m interviewing Amar Arsikere about the Infoworks platform for enterprise data operations and orchestration
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you have built at Infoworks and the story of how it got started?
- What are the fundamental challenges that often plague organizations dealing with "big data"?
- How do those challenges change or compound in the context of an enterprise organization?
- What are some of the unique needs that enterprise organizations have of their data?
- What are the design or technical limitations of existing big data technologies that contribute to the overall difficulty of using or integrating them effectively?
- What are some of the tools or platforms that Infoworks replaces in the overall data lifecycle?
- How do you identify and prioritize the integrations that you build?
- How is Infoworks itself architected and how has it evolved since you first built it?
- Discoverability and reuse of data is one of the biggest challenges facing organizations of all sizes. How do you address that in your platform?
- What are the roles that use Infoworks in their day-to-day?
- What does the workflow look like for each of those roles?
- Can you talk through the overall lifecycle of a unit of data in Infoworks and the different subsystems that it interacts with at each stage?
- What are some of the design challenges that you face in building a UI oriented workflow while providing the necessary level of control for these systems?
- How do you handle versioning of pipelines and validation of new iterations prior to production release?
- What are the cases where the no-code, graphical paradigm for data orchestration breaks down?
- What are some of the most challenging, interesting, or unexpected lessons that you have learned since starting Infoworks?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Infoworks
- Google Bigtable
- Apache Spark
- Apache Hadoop
- Zynga
- Data Partitioning
- Informatica
- Pentaho
- Talend
- Apache NiFi
- GoldenGate
- BigQuery
- Change Data Capture
- Slowly Changing Dimensions
- Snowflake DB
- Tableau
- Data Catalog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances.
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
[00:00:51] Unknown:
Free yourself from maintaining brittle data pipelines that require excessive coding and don't operationally scale. With the Ascend unified data engineering platform, you and your team can easily build autonomous data pipelines that dynamically adapt to changes in data, code, and environment, enabling 10 times faster build velocity and automated maintenance. On Ascend, data engineers can ingest, build, integrate, run, and govern advanced data pipelines with 95% less code. Go to dataengineeringpodcast.com/ascend to start building with a free 30 day trial. You'll partner with a dedicated data engineer at Ascend to help you get started and accelerate your journey from prototype to production.
Your host is Tobias Macey. And today, I'm interviewing Amar Arsikere about the Infoworks platform for enterprise data operations and orchestration.
[00:01:40] Unknown:
So, Amar, can you start by introducing yourself? Yeah. Absolutely. Thank you for having me, Tobias. My name is Amar Arsikere. I'm the founder and chief technology officer at Infoworks. My background is building large scale data systems. I started my career as a software engineer, and I built large scale data systems at Google. I built the first data warehouse on Bigtable at Google, and then an analytics platform that ran all of the internal analytics there. And that's really how I got started in data management. There's a lot of inspiration from that work in starting Infoworks.
[00:02:16] Unknown:
So what was your best resource for learning about the best practices and the discovery involved in building out that data warehouse on top of Bigtable, and some of the challenges that you faced in the process?
[00:02:31] Unknown:
So building a big data warehouse and a petabyte scale system on top of Bigtable, when big data technologies were pretty brand new. We were doing the world's first data warehouse on top of Bigtable, and the resources were really the original inventors of Bigtable and the big data technologies inside Google. In fact, a lot of the open source systems, you know, Hadoop and Spark, came out of the original papers that were published out of Google. And I had access to those resources inside Google, so to speak.
And that's how I was able to build it and run a pretty successful analytics platform that had thousands of use cases, thousands of users, and petabyte scale datasets.
[00:03:13] Unknown:
And now you've built up the Infoworks platform to be able to solve some of the challenges that you've experienced in the big data space and some of the problems that are inherent to larger organizations. I'm wondering if you can give a bit more detail into what you've built at Infoworks and some of the story behind how it got started?
[00:03:34] Unknown:
Yeah. As I said earlier, my first exposure to large scale data systems was at Google, where I built the world's first data warehouse on Bigtable. After Google, I joined a company called Zynga, a social gaming company. They had a unique challenge of large scale datasets, but also needed to make predictions and analytics on top of them very quickly. Over there, I built an in memory database supporting about 100 million players, which became the world's largest in memory database at that time. And the unique thing that I did there was to build thousands of analytics pipelines that used to feed everything from gaming dashboards to gamer behavior and recommendations and so on. So the lesson that I learned building these two systems was that there was a need to automate a lot of the data operations in lots of enterprises.
And the born-digital companies like Google and Zynga and Facebook and Amazon had built all these platforms for their internal consumption, so it made sense that this is something that's going to be required by any data driven organization. That was sort of the inspiration to start Infoworks. And the engineering and product team that has built Infoworks essentially has the same background that I have, which is running large scale analytics pipelines. And when we talk about large scale, I'm talking about thousands of analytics pipelines being built and managed on this platform.
[00:04:58] Unknown:
And what are some of the fundamental challenges that often plague organizations that are dealing with, quote, unquote, big data? And how do those challenges change or compound in the context of an enterprise organization and the organizational complexities that manifest there? You know, this is a great question.
[00:05:21] Unknown:
If you look at the usage of data and how people are managing their data assets, you can really segment the world into two sets of companies. There are the born-digital companies, like Google, Facebook, Amazon, Netflix, and so on, where they have a foundational platform on which they are essentially always building out a 360 degree view of their business. On the other side, you have all these other companies who are in various stages of data maturity, where the approach is a use case by use case build out. They are gathering data for every use case, and as a result there is a fragmentation of their data assets within the company, and gathering a 360 degree view of the business becomes pretty challenging. And in this world, it's not built on a foundational platform so much; it's built on point tools, and there is a lot of glue code that dominates. That becomes very challenging as well. The data gets fragmented, teams get fragmented, the skill sets get fragmented. So these are some fundamental challenges that many companies are facing when they have to deal with data that represents their business. You can call it big data because the amount of data is large and the complexity is pretty large.
[00:06:30] Unknown:
And how you manage all of this becomes very critical. And in terms of the enterprise organizations, what are some of the unique needs that they have of their data that aren't necessarily going to manifest at either smaller scales or for newer companies that are maybe large, but don't necessarily have as much of that legacy infrastructure and legacy data that they need to be able to support? Yeah. So in enterprises, number one, every enterprise is now a data driven organization, so the need for data itself is accelerating.
[00:07:01] Unknown:
Every department is becoming data driven, which means they need data to make decisions. And as a result, in order to become data driven, every organization really has to build thousands of use cases where they're essentially using their data, and fresh data becomes very important to making their decisions. Today's state of the art is to have 10 or 20 use cases built at best, because in today's architecture, you have to hire a team of engineers who are coding those pipelines, and it takes an amount of time to operationalize them in production. There is a limit to what you can do, versus automating a lot of the data operations, which is going to get you the agility that you need to run an organization and become much more of a data driven organization. So that's a unique set of challenges that enterprises face. They also have the legacy tooling and the point tools and the glue code that you mentioned earlier, which becomes a challenge to maintain when you're introducing new use cases.
[00:08:12] Unknown:
And the big data technologies that we have now are generally fairly built for purpose, either by the original organization that used it and then open sourced it, or by the academic institution that was using it for a particular area of research, which can lead to some sharp edges or difficulties in integrating it into the larger ecosystem. But from your perspective, what have you found to be some of the design or technical limitations of those existing technologies? And how does that contribute to the overall difficulty of using or integrating them effectively into an enterprise organization?
[00:08:48] Unknown:
Yeah. I think the legacy integration technologies have limitations. Number one, it is still pretty coding heavy; there is a lot of coding that you have to do. There are many tools which are visual programming based, but it's still programming, which means you are still programming and building out those pipelines. What happens when there are schema changes? What happens when you have to do time series analysis? None of those things are automated. You're building out all of this manually.
So the automation part is very critical to achieving these thousands of use cases and becoming a data driven organization. That's one limitation in the existing legacy technologies. Also, many of the legacy integration tools were built for a SQL based database, and as a result they become SQL centric. In a distributed world, whether you are using big data technologies like Hadoop or Spark, it's important for the tooling layer, the data engineering layer, to know about the distributed infrastructure. One example is that the distributed technologies have parallelism built in, but your data needs to be organized to make use of that parallelism, which means the data needs to be partitioned rightly. And SQL based technologies do not necessarily deal with that. So that's one of the technical challenges we solve with Infoworks: if you have a partition of one, you get a parallelism of one. If you have a partition of 100, you get a parallelism of 100. Organizing the data into the right partitions is something that we do; we support what is called hierarchical partitioning to get the benefit of the underlying technologies. And that becomes important as you are dealing with larger volumes of data, more use cases, and so on. You need to get the power of your distributed big data architecture into your tooling layer.
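To make the partitioning point concrete, here is a minimal PySpark sketch of the general idea: the number of partitions bounds the parallelism you get, and a hierarchical (multi-column) layout organizes the data on disk so later queries can prune to just the slices they need. The column and path names are hypothetical, and this illustrates the technique, not Infoworks' implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

events = spark.read.parquet("/data/raw/events")  # hypothetical source path

# A single partition means a parallelism of one: every downstream task
# runs as one unit of work, no matter how large the cluster is.
single = events.coalesce(1)

# Repartitioning by a well-chosen key gives the engine 100 units of
# parallel work instead of one.
parallel = events.repartition(100, "customer_id")

# A hierarchical partition layout (e.g. region, then date) lets queries
# that filter on those columns skip everything else.
(parallel.write
    .mode("overwrite")
    .partitionBy("region", "event_date")  # hierarchical partitioning
    .parquet("/data/curated/events"))
```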
[00:10:50] Unknown:
So for an organization that might be looking to use Infoworks to solve some of their data challenges, what are some of the signs or symptoms that would lead them in your direction?
[00:11:03] Unknown:
So one of the challenges that we see is with companies that have already built a do it yourself platform, especially on the new technologies, like the Hadoop and Spark based systems. One of the challenges that they face is the continuing investment of engineering in maintaining those do it yourself platforms. And some of the do it yourself platforms are essentially built on point tools with a lot of glue code. So when maintenance becomes a challenge, many of the enterprises have come to us, and we have successfully replaced their in house, do it yourself platform with the Infoworks platform, which has given them the agility to run their organization. So that's one of the successful journeys for an enterprise.
[00:11:51] Unknown:
And so for the organizations that are integrating Infoworks into their systems, what are some of the tools or technologies that they might be replacing? And what is the process of actually integrating the Infoworks platform into the data technologies that are already running?
[00:12:09] Unknown:
So from a replacement standpoint, there are a number of legacy integration tools. It could be the ETL tools that you may be familiar with, like Informatica, Talend, Pentaho; these are legacy integration tools we have replaced in many instances. It could be ingestion tools, like Sqoop or NiFi, or GoldenGate in some cases for doing CDC. It could be cloud native tools, whether for ETL or ingestion, or orchestration tools. So these are some of the tools that we have replaced. And with Infoworks, just to first talk about what it is that we do: Infoworks is an enterprise data operations and orchestration platform. It spans all of the data operations functionality: ingestion, CDC, merge, building a time series of your datasets as data comes in, data transformation, data modeling, and then orchestrating all these pipelines in production. So we span the whole gamut of data operations.
[00:13:15] Unknown:
So these are the technologies and the tools that we replace in an enterprise setting. And then as far as the integrations that you build to make sure that you can work with all the systems that are preexisting at these organizations, how do you identify and prioritize your work to make sure that you have sufficient support for being able to provide the functionality that is needed?
[00:13:36] Unknown:
Yeah. So there are three layers of integration that you can categorize in this world of data operations. On one side, you have the data sources. We support integrations into most of the enterprise data sources; we have 30 plus data sources, all the way from Oracle, Teradata, SQL Server, file systems, streaming data sources, mainframe connectors, API based data sources, and so on. And typically that's sufficient in many cases to ingest data, onboard data sources, and so on. We also have a connector framework.
In case one of your data sources is not supported, you can easily add support for the new data source, customize it, and build it out. And those connectors can be added as an enhancement into the base platform, building on top of the existing platform. The other category of integration is if you have your own custom code for data transformations and things like that. We support two kinds of integration there. You can bring your own code and make it a node inside Infoworks. Or we also support a loosely coupled integration, where you can call into your system, pass parameters, and so on, using our orchestrator mechanism to manage those dependencies, parameter passing, and fault tolerant execution of those pipelines, and then pass control back into Infoworks for further processing. So that's much more of a loosely coupled integration. And the third one is connecting to different layers for data consumption, whether it's cloud endpoints.
It could be BigQuery or some other technologies that you want to deliver the data to. So we support a number of those integrations as well. And the way we prioritize these integrations is really that we are very tightly connected to our customers and the use cases they are building. We are essentially supporting those initiatives and prioritizing based upon where their use cases are going.
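The episode doesn't show Infoworks' actual connector SDK, but a connector framework like the one described usually reduces to a small plugin contract: describe your schema, then yield batches of records. A hypothetical sketch in Python:

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List
import csv

class SourceConnector(ABC):
    """Hypothetical connector contract: anything that can describe its
    schema and yield batches of records can be onboarded."""

    @abstractmethod
    def discover_schema(self) -> Dict[str, str]: ...

    @abstractmethod
    def read_batches(self, batch_size: int) -> Iterator[List[dict]]: ...

class CsvConnector(SourceConnector):
    """Toy connector for a local CSV file, standing in for a real source."""

    def __init__(self, path: str):
        self.path = path

    def discover_schema(self) -> Dict[str, str]:
        # Crawl just the header; a real connector would also sample types.
        with open(self.path, newline="") as f:
            return {col: "string" for col in next(csv.reader(f))}

    def read_batches(self, batch_size: int = 1000) -> Iterator[List[dict]]:
        with open(self.path, newline="") as f:
            batch: List[dict] = []
            for row in csv.DictReader(f):
                batch.append(row)
                if len(batch) == batch_size:
                    yield batch
                    batch = []
            if batch:
                yield batch
```

A new source would then plug into the platform's crawl-and-ingest machinery by implementing those two methods.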
[00:15:36] Unknown:
And what are some of the key principles on which the Infoworks platform is built that have guided your development and improvements of the overall capabilities of the system?
[00:15:52] Unknown:
Yeah. Absolutely. So at Infoworks, we call them the three pillars; the platform is built on these three pillars. The first one is deep automation. That's our background: anything and everything that we can automate in data operations, we have built in automation for, which enables people to build out these thousands of use cases. The second one is infrastructure abstraction. Because there are going to be different kinds of data computation technologies, distributed execution engines, and so on, we have built this on an infrastructure abstraction so we can deploy it in any sort of environment, whether it's on premise, in the cloud, or in a hybrid setting. And the third principle on which we have built this platform is to make all of the data operations available in a single place.
So you can think of it as an integrated solution, whether it's for onboarding a data source, transforming your data, or operationalizing it. All three of those functions are available in a single place, so it's easy for you to adapt to changes as things are evolving in your use case. And a fourth one, which is important, I'm just going to add: it's built natively for these big data systems, the distributed architectures. That is very important in the sense that these parallel engines provide a lot of power, and unless you provide a native integration into these parallel engines, you're not going to get the benefit of them. And that's what we have done with the Infoworks platform. Can you talk a bit more about how the Infoworks platform itself is architected and some of the ways that that design has evolved since you first began working on it? Yeah. Absolutely. Our origins were in automating the data operations, and that's where we came from. The way the Infoworks system was orchestrated was essentially that it provided automation for all of the data operations, and it encompasses the entire gamut of the data operations. So we start with crawling of a data source. We fetch everything about a data source, like the metadata, the relationships between tables, and things like that. And once the crawling of the metadata is completed, you have the option of ingesting the data, from a historical load standpoint, where you need to load all of the data in the beginning, to doing CDC, the change data capture, where, as data is changing in the source systems, you can bring in just the changed data. And we support something called change schema capture. So if there is a change in schema, it automatically adapts to those changes. It does things like backfills and all the things that you need to do to support the new schema. And one of the unique things about big data systems is that they are typically immutable, which means if there are records updated in the source system, there is no easy way for you to change that on your big data system, where you're building your analytics.
So we support a continuous merge process. The continuous merge process automatically syncs the data and gives you an updated view into the data. And that's very important in this new architecture, where you are dealing with immutable systems. And as the data is coming in, we are auditing it with a time series. The data is organized on a time series with current and history tables, which is something that lets you do a lot of time series analysis very quickly. And that essentially builds your data catalog, which is continuously refreshed and maintained, and it provides the basis for the rest of the processing.
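Because the underlying storage is immutable, a continuous merge like the one just described is typically implemented as a join-and-rewrite that maintains a current table and appends superseded rows to a history table. Here is a minimal PySpark sketch under those assumptions; the keys and paths are hypothetical, and this is not Infoworks' actual code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("continuous-merge").getOrCreate()

base = spark.read.parquet("/data/current/customers")      # current table
updates = spark.read.parquet("/data/incoming/customers")  # CDC batch

# Rows in the current table that were superseded by an update move to
# the history table, stamped on a time axis for time series analysis.
superseded = (base.join(updates, "customer_id", "left_semi")
                  .withColumn("valid_until", F.current_timestamp()))
superseded.drop("valid_until").unionByName(updates)  # schemas must align
superseded.write.mode("append").parquet("/data/history/customers")

# The new current table: untouched base rows plus the fresh records.
# Write to a new versioned path and swap pointers afterwards, since the
# source path is still being read lazily.
untouched = base.join(updates, "customer_id", "left_anti")
merged = untouched.unionByName(updates)
merged.write.mode("overwrite").parquet("/data/current/customers_v2")
```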
And we have this concept of data domains, where the data catalog can be provisioned into data domains, and data admins and data scientists can come in and perform transformation operations and data modeling operations on top of that data catalog. They're only seeing a slice of the data catalog, so that way you can have control over the data access and who sees what. The data transformation pipeline is also heavily automated. When you compare Infoworks against the visual programming approaches, something like an incremental pipeline is a single click operation in Infoworks, whereas in a legacy ETL tool you have to deal with things like auditing the tables on a time axis and capturing watermarks so that you can do incremental processing later, and so on. All those things have to be done manually; those are the kinds of automation that are built in. Slowly changing dimensions are a click of a button: you click a button, and it organizes the data in the data model into current and history tables and so on. Audit capture, lineage, and versioning are all performed on those data transformation pipelines. And once the data transformation logic is built out, the final step is your data model. The data model you can also accelerate into an in memory data model, so you're not only shaping the raw data into the right format in a data model, but also serving the data model at the right speed; a data model in an in memory system is going to be 8 to 10 times faster than on your base system. Those kinds of accelerated data models you can build right in the platform itself. It also provides access to integrate into your ML and AI algorithms, and supports things like training of your data models. And once the data models are built out nicely and represent your business, you can export them into other consumption endpoints. It could be BigQuery in the cloud, or Snowflake, or some other technology that you're using to consume those data models, while you are continuously refreshing, maintaining, and delivering the data models for consumption. Or it could be individual tools like Tableau and Qlik and so on. And the last stage is to orchestrate these pipelines so that you're running them every 15 minutes or every hour or every day, managing dependencies and parameter passing, making them fault tolerant with retry and restartability, and managing it all on a day to day basis. So we have an orchestrator that can be used to orchestrate these pipelines and deliver the end result. And all of this is built in a single platform.
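The "single click" incremental pipeline contrasted with legacy ETL above is hiding exactly the watermark bookkeeping Amar mentions. A hand-rolled sketch of what that automation has to track, with hypothetical table and column names:

```python
import json
from pathlib import Path

STATE = Path("watermarks.json")  # hypothetical watermark store

def load_watermark(table: str) -> str:
    state = json.loads(STATE.read_text()) if STATE.exists() else {}
    return state.get(table, "1970-01-01 00:00:00")

def save_watermark(table: str, value: str) -> None:
    # Persist the new high-water mark only after the load commits.
    state = json.loads(STATE.read_text()) if STATE.exists() else {}
    state[table] = value
    STATE.write_text(json.dumps(state))

def incremental_query(table: str, ts_col: str = "updated_at") -> str:
    # Only rows changed since the last successful run are pulled.
    return (f"SELECT * FROM {table} "
            f"WHERE {ts_col} > '{load_watermark(table)}' "
            f"ORDER BY {ts_col}")

print(incremental_query("orders"))
```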
[00:22:38] Unknown:
So that way, you can have data engineers, data scientists, data analysts, and production support or production admins all work on a single collaborative system. There are a few things that I want to pull out from there. One of those is the concept of change schema capture. One of the canonical problems that data engineers are faced with is how the evolution of data in the source systems can be reflected downstream. And I'm wondering what you have found to be some of the useful strategies on that point, and some of the challenges that you've faced in terms of being able to reflect those schema changes in a way that is non destructive at the destination systems.
[00:23:19] Unknown:
Yeah. Absolutely. This is a great question. One of the things that we did, as a strategy that we have followed in our system, is we provide automated change schema capture, but at the same time, we provide it in a way that you can be notified, and there can be human involvement and authorization before those changes are automatically applied. That is a strategy we have applied across the entire system, especially for schema change capture. That way, you can automate certain kinds of changes into the system, and for other kinds of schema changes, you can also make it manual, in the sense that you get notified and the change needs to be authorized or approved before it is percolated into the system. So that strategy has helped, because there is no one-size-fits-all automation story, especially when it comes to schema changes.
And there are, in many cases, changes in schema that should not really have been performed in the source systems, and that shows up in your data warehouse systems. In many cases that requires human intervention. So that's one of the things that we have learned that has worked out pretty well.
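That notify-and-authorize strategy can be approximated as a policy split: apply additive schema changes automatically, and queue destructive or type-changing ones for human approval. The particular split below is an assumption for illustration, not Infoworks' exact rule set:

```python
from typing import Dict, List, Tuple

def diff_schemas(old: Dict[str, str],
                 new: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split schema drift into auto-applicable and approval-required changes."""
    auto, needs_approval = [], []
    for col in new.keys() - old.keys():
        auto.append(f"ADD COLUMN {col} {new[col]}")        # additive: safe
    for col in old.keys() - new.keys():
        needs_approval.append(f"DROP COLUMN {col}")        # destructive
    for col in old.keys() & new.keys():
        if old[col] != new[col]:
            needs_approval.append(
                f"ALTER {col}: {old[col]} -> {new[col]}")  # type change
    return auto, needs_approval

auto, pending = diff_schemas(
    {"id": "int", "name": "string", "zip": "string"},
    {"id": "int", "name": "string", "email": "string", "zip": "int"},
)
print("apply now:", auto)         # ['ADD COLUMN email string']
print("await approval:", pending) # ['ALTER zip: string -> int']
```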
[00:24:33] Unknown:
And then another element is the data cataloging. I know that discoverability and reuse of data assets in general is one of the biggest challenges facing organizations of all sizes, and I'm sure that that is compounded with the different business units that exist within an enterprise organization. I'm wondering how you address that discoverability in your platform to make sure that you can cut down on rework and duplicated effort between different silos.
[00:25:03] Unknown:
Yeah. The data catalog is something that is central to the organization. It truly represents the data assets that an enterprise needs to manage, and it has to have a lot more metadata information for the managers of the data. So one of the things that we have done is make the data catalog searchable and taggable; that's something most systems have. We also have a mechanism to enhance the data catalog with both a technical glossary as well as a business glossary. So you can upload Excel files that represent your business metadata, and then the system tags the business metadata onto those columns and tables and so on. And everything becomes searchable as you're uploading and enhancing the system. That's one part of it. The other thing that we have also done is what we call the data engagement dashboard.
So as you're using this data day in and day out, whether by running the pipelines, building data models, orchestrating and running them in production, and so on, we are also capturing and showing what are the most critical and important data assets within your organization. We are calling it the data engagement dashboard. So you can see what are the top five data models, what are the top five raw data sources, and who are the heaviest users of your data, and so on. So you get a view of what the critical data assets in an organization are. And data assets are just like any other physical assets that a company may have, whether it's stores or machinery or equipment and so on; there has to be a management layer, and that's how we are viewing it. And we are putting a lot of emphasis on that data engagement dashboard for the managers of data.
[00:27:05] Unknown:
For the different roles that are interacting with Infoworks, I'm wondering if you can give a high level view of the different responsibilities within an organization that might be using Infoworks in their day-to-day, and the different workflows that they might engage in with your platform?
[00:27:14] Unknown:
Yeah. So the platform supports role based access control. Typically, what we have seen is that data engineers are mostly interested in onboarding their data sources and building out their data catalog; that's where they are focused. So data engineers build out the data catalog. They then provision what we call data domains, which is basically a subset of the data catalog provisioned for a certain set of users, for them to build transformation pipelines and so on. The second set of users are data analysts and data scientists. So the data scientists and data analysts are working with a partial set of the data catalog and building out transformation pipelines.
They're shaping the data and creating the data models within what we're calling data domains. And when the final data models are ready for consumption, they're typically working with production admins. The production admins then take the pipelines and the artifacts they have created and migrate them from dev to production. The production systems are typically managed by production admins, who are then orchestrating them, deploying them in production, and also monitoring them in production. So Infoworks provides an orchestrator where they can monitor and see how things are going. They can also look at the SLAs once it is deployed in production and make certain changes if needed, whether by passing parameters or by increasing capacity. If a process is going very slowly and taking a long time, they can add more compute into that process and get the SLAs to run within the thresholds that they have. So these are the three different kinds of roles: the data engineers, who are interested in building the data catalog; the data analysts and data scientists, who are interested in building out the data models that are shaped for business consumption; and the production admins, for orchestrating the workflows in production. And throughout those different stages, can you talk through the overall life cycle of a unit of data, whether that's a single record or a large batch of data, and how that flows through the Infoworks platform and the different subsystems that it touches on at each stage of its life cycle? So the first part of the life cycle of a unit of data in Infoworks is that when a data engineer sets up a connection to a data source, Infoworks is crawling and understanding the metadata of that data, so all the structure and the information and the data pattern is pulled in. The second part of it is to actually ingest the raw records themselves. So the raw data is ingested. The first load is the historical load, where you are bringing in all of the datasets for a certain table or a certain schema and so on. And then the second phase of the ingestion is the CDC, where you're bringing in the changing datasets.
Now if this data represented a new record, it'll be handled one way. If it was an updated record, then it gets processed in our merge process. So we have a continuous merge process which essentially takes the updated record and merges it into the base set of tables on the big data environment. And this data gets tagged; there is a time axis on which this data gets tagged. It may also be pushed into current tables or history tables, based upon whether this was an old record versus the current record represented on the source side. So that's all the things that happen to a unit of data when you're building out a data catalog. That's the first phase of this whole thing. The second part is the data transformation. So once the data is brought in and nicely organized on a time axis and cataloged, there is a transformation pipeline, which is essentially applied on top of your data.
And that transformation pipeline has certain kinds of operations, like join nodes and unions and other kinds of transformations that you may be applying before that data shows up in a data model. There is a lot of management that happens in a transformation pipeline, like if you need to know who made what changes to the transformation logic, lineage, versioning, and all those things. That's orthogonal to the flow of the data itself; we are keeping track of everything that happened to the transformation logic. And once the data shows up in the data model, the final step is that it can then be exported into a consumption layer, whether that's some cloud endpoint or some other SQL engine or things like that. And then, as we discussed earlier, the final part of this whole thing is orchestrating it in production. So this is one part of the data flow, but once you have this operationalized in production, you're running these things every hour, every 10 minutes, or whatever frequency you need to manage.
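The retry and restartability behavior described for the orchestrator can be sketched as a small wrapper around each stage. Real schedulers handle dependencies and parameter passing far more richly, so treat this as illustrative only:

```python
import time

def run_with_retries(task, params, retries=3, backoff_s=30):
    """Run one pipeline stage; retry on failure so a transient error
    doesn't force a manual restart of the whole workflow."""
    for attempt in range(1, retries + 1):
        try:
            return task(**params)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff_s * attempt)  # simple linear backoff

# Stages run in dependency order, passing parameters downstream.
def ingest(source):
    return {"rows": 1000, "source": source}

def transform(rows, source):
    return {"model_rows": rows // 2}

batch = run_with_retries(ingest, {"source": "orders"})
model = run_with_retries(transform, batch)
print(model)
```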
[00:32:22] Unknown:
The interface that is available for the data engineers and the analysts to be able to build out these different flows and interact with their data is largely UI oriented with a low code approach and being able to click and drag different components. I'm wondering what you have found to be some of the design challenges that you face in being able to provide an appropriate level of control and expressivity while still being able to make it accessible to people who don't either have the technical capacity or the time to be able to dig deep into the code level aspects of the systems.
[00:32:58] Unknown:
So the interesting thing is that our UI subsystem is built upon our API layer. So you're right: all three different personas or roles who are using Infoworks are using this collaborative UI platform and managing their work, whether it is to build out the data catalog, do transformations, or operationalize the data. And in some cases, there are customers that are also using this platform without a UI, using an API to drive the pipelines in production. So that's another usage category as well. One of the things that we have found is that there is a challenge in making it completely UI based, because in some cases, you need to have programmatic access to various points within the platform. And we have an ability where, since our entire platform has APIs, you can tap into different parts of our system and do an integration.
One example is if you have custom transformation code and you want to be able to use that: you can drag and drop, put the custom code inside of Infoworks, and make it into a node. That becomes a reusable node, and you can pass parameters to it and so on. This is one approach to reusing some of the things that you may already have. We also support integration using our orchestrator, which you can call the loosely coupled integration, where, as part of your transformation, you can pass control to a preexisting system, pass parameters and so on, and then, once it is completed, have it come back into Infoworks and continue on. This way, you have a mechanism for integrating into existing systems.
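That loosely coupled pattern, hand control to an external system, wait for completion, then resume the pipeline, looks roughly like the following; the external job API here is entirely hypothetical:

```python
import time

def call_external_system(submit, poll, params, timeout_s=3600, interval_s=15):
    """Hand control to a preexisting system, then block until it reports
    success so the calling pipeline can safely continue."""
    job_id = submit(**params)                 # pass parameters out
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = poll(job_id)
        if status == "SUCCEEDED":
            return job_id                     # control returns to the pipeline
        if status == "FAILED":
            raise RuntimeError(f"external job {job_id} failed")
        time.sleep(interval_s)
    raise TimeoutError(f"external job {job_id} did not finish in time")

# Stand-ins for a real system's API, just to make the sketch runnable:
def fake_submit(dataset):
    return f"job-{dataset}"

def fake_poll(job_id):
    return "SUCCEEDED"

call_external_system(fake_submit, fake_poll, {"dataset": "orders"})
```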
[00:34:50] Unknown:
And then in terms of the workflow of the people who are defining those pipelines, what do you have available for handling versioning of them, releasing them to different environments to validate and test them, or looking back at past versions and doing historical analysis?
[00:35:08] Unknown:
There is built in version control for the pipelines. So you can have data analysts work on a new version while there is an existing version being run in production, and once the development is complete, you can move or migrate the new version into production. So you have the ability to tag it, version it, and all those things. We also support integrating into your existing config management tools, such as GitHub. So those are things that are available. And one of the unique things is that, since this is a platform that's API driven, you can also build CI/CD on top of it and perform things like data validation and data reconciliation. We support things like record count based data reconciliation out of the box. So if you need to run a reconciliation job at the end of every week, you just click a button and those reconciliation processes will run.
And you can also perform custom data validation. You can specify slices of data. For example, if you want to take a slice of your data, you can say, show me the number of sales that happened in this ZIP code, and that should be a slice of your data that you always want to validate at the end of a certain day; you can specify those slices. And then it generates the reports and statistics to make sure that everything is running fine. That way, you can version the pipelines, validate them, and manage this on an ongoing basis.
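Record count reconciliation and slice validation reduce to a couple of comparisons once the counts are in hand. A sketch of the two checks just described, with made-up numbers and a made-up slice:

```python
def reconcile_counts(source_count: int, target_count: int,
                     tolerance: float = 0.0) -> dict:
    """Fail the weekly reconciliation job if the target drifted from the source."""
    drift = abs(source_count - target_count)
    if drift > source_count * tolerance:
        raise AssertionError(
            f"reconciliation failed: source={source_count}, target={target_count}")
    return {"source": source_count, "target": target_count, "drift": drift}

def validate_slice(rows: list, zip_code: str, expected_min: int) -> int:
    """Validate a pinned slice of data, e.g. sales in one ZIP code."""
    sales = [r for r in rows if r["zip"] == zip_code]
    if len(sales) < expected_min:
        raise AssertionError(f"only {len(sales)} sales in {zip_code}")
    return len(sales)

print(reconcile_counts(10_000, 10_000))
print(validate_slice([{"zip": "94105"}, {"zip": "94105"}], "94105", expected_min=1))
```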
[00:36:40] Unknown:
And then the graphical paradigm can sometimes break down because of the fact that there are elements that are difficult to translate into a UI paradigm, or that require some specific custom development to be able to handle. And I'm wondering what you have seen as being some of those edge cases where there's the necessity to be able to drop down to the code level, and what you have as far as an escape hatch in your platform.
[00:37:09] Unknown:
In cases where customers have to perform a custom transformation, and they either have an existing code base they want to use or they want to write code in some cases, what we support is the ability to create a custom transformation node. So we have a mechanism by which we can support adding custom transformation code, whether it's in Java, Scala, or Python. You can code that in and then drop it in as a droppable node in a certain format, and you can attach that within the Infoworks platform. You can drag and drop it, build it out, and reuse it in the various ways that you may want to use it. That way, you are leveraging your existing assets, so it's not something where you have to recode everything in this new environment.
And the other thing that we also do is support integrating Java based transformation nodes with Scala or Python based transformation nodes without having to go back to disk. In other words, we are doing in memory transformations; you can say we are passing data frames between these nodes. So there is no penalty for writing it in different languages.
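The key point is that DataFrames move between nodes in memory, with no round trip to disk. A PySpark sketch of two custom transformation nodes chained that way (the node functions are illustrative, not Infoworks APIs):

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("custom-nodes").getOrCreate()

def clean_node(df: DataFrame) -> DataFrame:
    """Custom node one: normalize a column."""
    return df.withColumn("name", F.lower(F.trim("name")))

def enrich_node(df: DataFrame) -> DataFrame:
    """Custom node two: derive a new column. The DataFrame arrives in
    memory from the previous node; nothing was written to disk between them."""
    return df.withColumn("name_len", F.length("name"))

df = spark.createDataFrame([(" Ada ",), ("Grace",)], ["name"])
result = enrich_node(clean_node(df))
result.show()
```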
[00:38:19] Unknown:
And what are the cases where Infoworks is the wrong choice of platform for handling data orchestration and integration?
[00:38:29] Unknown:
Yeah. This is a great question. So Infoworks is a platform for building out data operations and data orchestration for batch style use cases and micro batch style use cases. If you're looking to do real time analytics with millisecond response times, then this is not the right choice of platform. In other words, if you are looking to perform a use case which requires data models to be updated and made available in a few minutes, I would say two minutes and above, then this would be the right choice of technology. And in your experience of building and growing Infoworks as a business and as a technical platform, what are some of the most challenging or interesting or unexpected lessons that you've learned in the process? Yeah. One of the things, as we have discussed, is that Infoworks is a highly automated system. And one of the things that we have learned is that automation by itself is not sufficient. As a result, we have invested in customer success with our enterprise customers. We have customers who have built thousands of pipelines and run 400 plus use cases in production in 12 months. So they're able to gain an agility that they did not have before Infoworks.
And that we were able to achieve. Automation was definitely the reason that agility came into play, but we have also invested in training the customer and advising them in this new paradigm of doing data operations and orchestration. And I think that was a lesson learned in terms of how automation plays out: you still need that education, that advisory sort of role, and we have heavily invested in that. We're also investing in a lot of self-service capabilities, like tutorials, the video tutorials, and other things, and also recommendations. When data analysts and data scientists are performing certain operations, we are essentially analyzing what they are doing and recommending, in the platform itself, to guide them. These are things we have learned as a result of working with a large number of customers. And all those heuristics, the rules of how people are using our system, we are applying machine intelligence to, to recommend what they should be doing. And I think those things are having a big impact on usage as well. What are some of the most notable ways that the overall big data landscape has evolved since you first began working on Infoworks? And what are some of the industry trends that you're most excited for? Yeah. Since we started Infoworks, I think one thing we have seen is that there is a secular movement towards more efficient technologies when it comes to distributed execution engines for data processing.
So what I mean by that is we had open source Hadoop, which started with Apache Hadoop, and after that the Spark technologies, which essentially made MapReduce work in memory. And now we are seeing ephemeral clusters, which means you don't need to have one big, static cluster; you can have a cluster start up and be taken down for each workflow. All these things are essentially solving two things. One is the complexity of managing a cluster, and the second is the cost, or the efficiency, of data processing. Being in this world of data for a while now, I think one thing that has become clear is that there is more data today than all of our data computation ability as human beings. So as a result, there is a need for much more efficient data processing technologies, and that's a movement that's going to keep happening. And what we have done at Infoworks is to build this data operations and orchestration platform with an infrastructure abstraction, so we can run on any kind of execution engines and storage technologies.
And that's one of the things that we see that's very exciting. We're going to see many more technologies for data processing.
[00:42:46] Unknown:
And are there any new features that you have planned for the near to medium term, or overall improvements or enhancements to the platform that you'd like to share before we close out the show?
[00:43:02] Unknown:
Yeah. I think one of the things that we are working on in our roadmap is the self-service initiative, where there is going to be a guided tutorial as you are using the product, and recommendations for various things you can do. We have learned a lot from many large customers using the product. Our customers are running thousands of pipelines in production, and we have a lot of heuristics and a lot of things that we have learned from this, and we are using machine intelligence to surface it as people are building out pipelines and using the product.
[00:43:32] Unknown:
Well, for anybody who wants to follow along with the work that you're doing or get in touch or get some more information about the platform that you've built out, I'll have you add your preferred contact information to the show notes. And so with that, I'd like to ask you a final question: what do you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:43:59] Unknown:
Yeah. One of the biggest gaps that I see in the tooling for data management today is that data operations are fragmented, because there are a lot of point tools and you have to write a lot of glue code to stitch them together. And doing automation of data operations in such a setting becomes very challenging. That's one of the biggest gaps that I see. And there was one research paper from Google which said that in a typical ML use case, about 5% tends to be ML code and 95% is glue code.
[00:44:29] Unknown:
And I think having a platform that integrates all of these technologies and all of the functionality that is required, from onboarding a data source to transforming it and operationalizing it, becomes very important. Alright. Well, thank you very much for taking the time today to join me and share your experience of building out the Infoworks platform. It's definitely a very interesting system, and one that solves a very complicated challenge for larger organizations, and really for any organization. So I definitely appreciate the work that you're doing there, and thank you again for taking the time. I hope you have a good rest of your day. Thank you for having me.
[00:45:07] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Amar Arsikere and Infoworks
Building Large Scale Data Systems
Challenges in Big Data for Enterprises
Technical Limitations of Existing Technologies
Key Principles of Infoworks Platform
Orchestrating Data Pipelines
Lifecycle of Data in Infoworks
Lessons Learned and Industry Trends
Future Enhancements and Closing Remarks