Summary
There is a wealth of tools and systems available for processing data, but the user experience of integrating them and building workflows is still lacking. This is particularly important in large and complex organizations where domain knowledge and context are paramount and there may not be access to engineers for codifying that expertise. Raj Bains founded Prophecy to address this need by creating a UI-first platform for building and executing data engineering workflows that orchestrates Airflow and Spark. Rather than locking your business logic into a proprietary storage layer and only exposing it through a drag-and-drop editor, Prophecy synchronizes all of your jobs with source control, allowing easy bi-directional interaction between code-first and no-code experiences. In this episode he shares his motivations for creating Prophecy, how he is leveraging the magic of compilers to translate between UI and code oriented representations of logic, and the organizational benefits of having a cohesive experience designed to bring business users and domain experts into the same platform as data engineers and analysts.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Your host is Tobias Macey and today I’m interviewing Raj Bains about Prophecy, a low-code data engineering platform built on Spark and Airflow
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are building at Prophecy and the story behind it?
- There are a huge number of tools and recommended architectures for every variety of data need. Why is data engineering still such a complicated and challenging undertaking?
- What features and capabilities does Prophecy provide to help address those issues?
- What are the roles and use cases that you are focusing on serving with Prophecy?
- What are the elements of the data platform that Prophecy can replace?
- Can you describe how Prophecy is implemented?
- What was your selection criteria for the foundational elements of the platform?
- What would be involved in adopting other execution and orchestration engines?
- Can you describe the workflow of building a pipeline with Prophecy?
- What are the design and structural features that you have built to manage workflows as they scale in terms of technical and organizational complexity?
- What are the options for data engineers/data professionals to build and share reusable components across the organization?
- What are the most interesting, innovative, or unexpected ways that you have seen Prophecy used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Prophecy?
- When is Prophecy the wrong choice?
- What do you have planned for the future of Prophecy?
Contact Info
- @_raj_bains on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Prophecy
- CUDA
- Apache Hive
- Hortonworks
- NoSQL
- NewSQL
- Paxos
- Apache Impala
- AbInitio
- Teradata
- Snowflake
- Presto
- Spark
- Databricks
- Cron
- Airflow
- Astronomer
- Alteryx
- Streamsets
- Azure Data Factory
- Apache Flink
- Prefect
- Dagster
- Kubernetes Operator
- Scala
- Kafka
- Abstract Syntax Tree
- Language Server Protocol
- Amazon Deequ
- dbt
- Tecton
- Informatica
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Raj Bains about Prophecy, a low code data engineering platform built on Spark and Airflow. So, Raj, can you start by introducing yourself? Hi, Tobias. I'm
[00:02:06] Unknown:
super excited to be here and looking forward to the conversation. I started off in engineering, mostly in compilers. So I've worked at Microsoft on Visual Studio. And then, you know, GPU programming was starting to become a thing. So I moved to NVIDIA, spent 4 years working on CUDA, one of the early engineers who built that. And then actually for a while, CUDA was not going anywhere, not being used as much, and Bitcoin mining and deep learning had not picked up yet. So I was like, let's move to the data space. Maybe there's more action there. Went to a startup building an operational database, worked there as an engineer, but then figured out the problem was in product market fit, moved to marketing, moved to product management.
And then from there, I moved to be the product manager of Apache Hive at Hortonworks through the IPO. And a lot of it was about, you know, growing and scaling the Hortonworks revenue and making sure that ETL could be done at scale. And then from there, you know, I found it was a little hard to use. So I was like, okay. Let's build Prophecy, make
[00:03:07] Unknown:
ETL and data engineering, you know, bring it into the modern age. Make it easy. Out of all that, you have kind of danced around data. Now you're kind of wholly in it. I'm wondering if you can just share a bit more about how you first got introduced to the overall space of data management and what it is about it that is attracting you to it and makes you want to spend your time and energy building a company to help people who are working in the space.
[00:03:30] Unknown:
I first got interested, I studied systems and found, you know, while I was at NVIDIA, I was taking these advanced systems courses at Berkeley, where I went through the history of databases and a lot about the history of systems and then working through, you know, up to the latest technology. So I got interested in databases and data management. NoSQL was big. I was really excited about NewSQL, that is, distributed transactional systems, more for the technical problems. Right? We're using Paxos and distributed query processing. So I got super excited there. But from there, I was in the startup working on engineering. I found that, you know, the company kind of got stuck in what you would call the classic chasm, right? There was some initial traction, but not that much traction. And then I started going out to the market and trying to see how customers are solving, you know, the problem with NewSQL and which problems we were best suited for, and kind of got a taste for that. So I was like, okay, this, you know, going to customers, talking to them, seeing their problems, you know, that became really fascinating. And then, you know, being able to bring the technological skill and merging both of them to build the right product. But then when I was at Hortonworks, right, seeing Apache Hive, you know, at that point we were competing with Impala. We did make Hive quite dominant. Right. But talking to customers, they were really struggling to use this product. And then Palantir came up. And to me, it seemed that, you know, our tooling still looks like it's from the nineties. Right? It's like we have seen SpaceX going up. We have electric cars, and then you look at your data engineering tooling and you're using Hadoop from a command line shell. You know, you have SQL scripts with configuration in them. It just looks so old school.
Like, the difference between what was there and what could be built is just so huge. I looked at this and said, you know, we can make data management so much easier, and it's a massive opportunity to build very, very interesting things.
[00:05:25] Unknown:
As you said, you're now building the prophecy platform to help make it a little easier to do data engineering and data management on top of some of the powerful foundational tools that we've built out over the past several years. But wondering if you can just give a bit more of an overview about what it is that you're building and some of the story behind what made you focus on this particular problem space as something that you wanted to build a company to address.
[00:05:50] Unknown:
It helps to understand a little bit of the context of where our customers are coming from. So I had been working with the very large enterprises. Think, you know, top Fortune 50 banks, financial services, companies you would associate with your credit cards. And what their on prem infrastructure looks like is they actually have a product called Ab Initio, which is the performance leader in the last generation of ETL. And then in the last generation also there is Informatica. That is the leader in volume. That's a $10 billion tools industry. Now then, you know, they do a lot of ETL there or transformations there, and then they're moving their data into a data warehouse. Right. But then in this ETL tool space, right, they got a lot of things. They got visual drag and drop development.
They could drop down to the code when they wanted. They got tests, source code management, you know, metadata search, column level lineage, scheduling, deployment, all of that. Right? So these customers were looking to move to the cloud, and if you look at the cloud, Apache Spark seems like a successor to Ab Initio, a similar distributed processing engine used heavily for data engineering, and Snowflake, perhaps, the replacement for Teradata. When you move to this new cloud infrastructure, the processing engines are there. But on top of the processing engines, you need a lot of data engineering products, right, to schedule, to deploy, for metadata management.
So what Prophecy is focused on is saying, okay, as we move to the cloud, you know, we have to reimagine what these data engineering products and tools will look like for the new world. And then another big change in this new world is a move towards software engineering practices in data engineering. So this is saying you want tests, you want Git, CI/CD, and all of these best practices applied to data engineering. So what Prophecy has tried to do is to merge the best of the old and the new world. And then, you know, we can talk about the specific features that we provide.
And there are, you know, quite a few of them as well. As you
[00:07:59] Unknown:
mentioned, there are still a lot of challenges in being able to build a data engineering platform and workflow for an organization. There are a lot of concerns that have to be tied together. There are a lot of different tools that you have to select and integrate. And I'm wondering why it is that after, you know, 2 decades at this point of, quote, unquote, big data engineering, it's still such a challenge to be able to go from I have a new organization, I have a, you know, a database, to I now have a, you know, data platform, and my analytics engineers and business users can self serve and answer their own questions.
[00:08:39] Unknown:
There are a couple of reasons for this. One reason is that when you have a platform shift, right, so the ETL or data engineering products on premise were built by companies over a couple of decades. So now you have a platform shift from on premise to cloud. There is usually a disaggregation followed by an aggregation. Right? Because the full product is really hard to build. There's a lot of moving parts to it, or it's a really big product. So, you know, one startup will go and say, hey, I'll build a scheduler. The second person will say, I'll build a development environment. Somebody will say, I'm building some lineage. Right? So basically you get all these startups, right, which are not really companies, right, and not even full products. They are features. There is tremendous pressure to go to market with the new bottoms up thing and quickly get some traction. So they build these small little features. And then when a larger business looks at this, it's a mess. And the message from the businesses is very clear. Right? They want one platform.
Databricks is quite dominant. Snowflake is dominant. But unlike Hadoop, where there were, like, 20 different engines, people are not picking that anymore. They're saying we want the data to go in one place. You know, right now, it is still going to two, but that's what they like. Right? So on top of that, to manage it, they want a single product. Right? And that's just a lot of product to build. Now that's reason one. Reason two might be a little bit of how the venture industry works. Right? So if you look at it, most of the processing engines are already funded today. If you look at Confluent and Kafka, or if you look at Spark, they were funded in a university or funded inside LinkedIn. So most of the processing engines are today being funded by these large companies.
They are in open source. You can see Presto and also, of course, Hadoop. And then the VCs come in and then they say, okay. Now I'm willing to make a bet on this because I already know that product works and somebody else has paid for the R&D, which is not me, the VC. Then on top of that, for the management of this, you need products that make it easy for everybody to manage. Unfortunately, the engineers in Google and LinkedIn prefer code. Right? So they are not going to invest in building this tooling. So now somebody has to invest. And who's going to invest? You know, the VCs have to invest. And will the VCs invest? No. They will not invest because they want, before a Series A, traction that is up and to the right. So while the companies need a product that does 10 things, a VC will not fund anything that big because they want, you know, the person to build this one quick thing, run to the market, come back and show traction. So because of that, over the last 5 years, 7 years, there has been hardly any funding in the data engineering tooling space. And that's the second structural reason. You know, one, it's the time and, you know, the disaggregation and aggregation, and it's a lot of product to build.
[00:11:32] Unknown:
Second, you know, it's very hard to fund an effort like that. Yeah. It's definitely a very interesting insight because from the 30,000 foot view, you look at the data space and the, you know, venture capital funding that's flowing into it, and it seems like the VCs are just throwing money at data. But to your point, it's point solutions, not a horizontal, you know, fully integrated system. And as you said, it's very difficult to find something that is a well integrated entire platform that you can just take and be able to use end to end, and you still end up finding all of these: you know, this solves all of my problems around being able to build and execute data quality tests, but now I have to buy this other service that's going to handle my data management and access control.
And I also have to go buy this thing over here that's going to manage all of my scheduling and workflow management. So it's definitely an interesting point that I hadn't really sort of thought to put together in that way. To add to that, right, it would at least be okay if these point solutions work.
[00:12:32] Unknown:
Now I'll give you an example. We had a company come in, an insurance provider, Fortune 10, that we are working with. They are like, okay. You know, the first piece of workload that I want to move to the cloud from Ab Initio to Databricks is this set of workflows. These are 20,000. And they're like, okay. You know, how do we move them? And Prophecy, you know, apart from our data engineering product, which is on the cloud, you know, we also can convert all the legacy formats to Spark. So we are like, okay. We can convert that to Spark. But then they're like, okay. How am I going to schedule this? And they're like, oh, you know, most people use cron. Advanced people use Airflow.
But the customer comes back and says, you want me to write 20,000 Airflow jobs, with each one having its own recovery and, you know, handling the downstream, the upstream, handling cost management? Should I put these 3 workflows on the same cluster? Should I put these on different clusters? And you know what? Like, Astronomer is putting money into Airflow and it's the most used product, so we support it. Right? But just saying, like, Airflow till 6 months back, I think, couldn't even support a thousand workflows. So the needs that are there in the startup space are just of a very different scale than for the large enterprises. And the amount of open space that remains in data engineering is huge. Like, there is not one scheduling solution in the market that can solve the problems of the customers we are talking to.
[00:13:50] Unknown:
As you mentioned, you're working primarily with these large enterprises, and you're trying to solve the needs that they have around their complex data workflows and being able to integrate across multiple different business units, a number of which maybe didn't even start in the same company, and so you're trying to deal with all of these complex organizational challenges. So what are some of the features and capabilities that you're building into Prophecy to help address some of those issues and complexities that arise in these larger organizations?
[00:14:19] Unknown:
I love the question. There's actually a couple of very interesting things behind it. So the first thing is in a large enterprise, there are many users. And you cannot have a system which is specialized for each user because otherwise your users can't talk to one another, can't work on the same platform. So you have some people who will write code. You also have, you know, hundreds of visual ETL developers who were developing ETL workflows in the last generation. They're used to visual drag and drop. You have the data analysts who can write some SQL or have been using a product like Alteryx, right, to do development.
So you have all these users that all need to be supported. All of them need to put workflows into production. One customer we talked to says, hey, the data analysts keep writing workflows in Alteryx and then give us an XML and say, can you put it in production in Spark? And it's a big pain point for them. Right? So the first thing you realize is that for an enterprise, there's many users that need to be supported in the same interface. Right? So what Prophecy has done very uniquely is that we have an interface where you have this canvas. We call them gems, where you can drag and drop gems. So you have a source gem, a transformation, a target. So you just stitch them together in a visual data flow. You know, you hit play and it runs on Spark, so you can see the data after every stage. You can see the data as it flows through. So very simple to develop. But then as you're developing that, you press a tab there and you switch to the code view. And as you've been doing the visual development, Prophecy has been writing code, very well structured, very readable, on Git. Right? And then you can go back and edit the code. And when you go back to the visual interface, the code changes will be parsed back and you can see it. So in the same interface, a visual drag and drop developer who might have been using an Informatica or an Ab Initio or an Alteryx can work with the same person who's been writing code. So we support both. Right. So that's one thing we had to do: bring everybody along.
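To make the round trip Raj describes a little more concrete, here is a minimal sketch of how a visual pipeline might be emitted as readable, per-gem Spark code. This is an illustration only, not Prophecy's actual generated output; the function names, dataset, and paths are hypothetical, and the choice of PySpark rather than Scala is just for familiarity.

```python
# Hypothetical shape of generated code for a three-gem pipeline:
# source -> filter -> target. One small function per gem keeps the code
# readable on Git and easy to map back onto the visual canvas.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def source_customers(spark: SparkSession) -> DataFrame:
    # Source gem: read the raw customers file (path is illustrative)
    return spark.read.option("header", True).csv("s3://example-bucket/customers.csv")


def filter_active(customers: DataFrame) -> DataFrame:
    # Filter gem: keep only the active customers
    return customers.filter(F.col("status") == "active")


def target_active_customers(active: DataFrame) -> None:
    # Target gem: write the result out as Parquet
    active.write.mode("overwrite").parquet("s3://example-bucket/active_customers")


def pipeline(spark: SparkSession) -> None:
    # The pipeline function mirrors the edges drawn on the canvas
    target_active_customers(filter_active(source_customers(spark)))


if __name__ == "__main__":
    pipeline(SparkSession.builder.appName("customers_pipeline").getOrCreate())
```

Because each gem is a named function with a predictable shape, hand edits that preserve that shape can still be parsed back into the visual view, which is the bi-directional property described above.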
The second thing is companies no longer want to be locked into proprietary formats. And some of the people who have tried to replicate the old ETL tools in the cloud, they are like, you know, the StreamSets and some others. StreamSets is mostly in moving data, and I don't want to single them out, but there's a few other players. That's what comes to mind. They will say, okay, you write your 10,000 workflows in my XML format. Right? And Azure Data Factory will do the same. And the thing is the enterprises are no longer interested in doing that. They want their code in open source format on Git. And they want to run their tests, and they want to run their CI/CD pipelines. So a lot of the data platform teams are moving toward code.
So the second thing we've had to do, right, one is to support, like I said, an interface that supports various users. The second is we've had to move to code. Now that has meant also having a metadata system that can support code. Right? The third thing, and this might be a little longer answer, the third thing is, you know, we have to support the entire lifecycle. Otherwise, you have to integrate with 10 other tools. So we have development. We have deployment. So you can use low-code Airflow for your scheduling and just say, hey, run this workflow, then this workflow, then this workflow. Right? And then you don't have to figure out about JARs and everything. We'll figure that stuff out. And then the third, we have metadata search and column level lineage, because you need governance in a large organization. So there is a wide feature set like that. The final thing I would say is that there is a need for extensibility, and that has become very, very central to Prophecy, where we are asking our users to standardize, but we are not telling them what to standardize on.
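As a rough illustration of the kind of component a team might standardize on and roll out, the sketch below shows a column-masking transformation in plain PySpark. The gem packaging, UI generation, and rollout mechanics are Prophecy-specific and not shown; the function and column names are made up. Raj gives the concrete examples (encryption and decryption, a data quality library) just below.

```python
# A team-standardized transformation expressed in plain PySpark. The gem
# wrapper that would expose this in the visual editor is not shown here.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def mask_columns(df: DataFrame, columns: list[str]) -> DataFrame:
    """Replace sensitive columns with a SHA-256 digest so downstream
    workflows never see the raw values."""
    for name in columns:
        df = df.withColumn(name, F.sha2(F.col(name).cast("string"), 256))
    return df


# usage inside any workflow (hypothetical column names):
# customers = mask_columns(customers, ["ssn", "email"])
```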
So we might give you some operators, you know, a scanner, reader, join, and aggregate. But then, you know, we go to a user and they say we want our own encryption and decryption, and we want to roll it out across the company. Somebody else says we want a data quality library. This is our data quality library. We want everybody to use it. So we've had to build it at a higher level of abstraction where, essentially, you can think of us as a tool generator. Right? If you come from compilers, it's like compiler generators. So what we can do is people can come and extend and create their own gems and roll them out to their teams. So they standardize on whatever they want to standardize on. You mentioned that Airflow and Spark are sort of the core of this. You have this extensibility
[00:18:33] Unknown:
layer for being able to bring in other workflows that aren't directly supported by what Prophecy is doing. For somebody who already has some existing data infrastructure, they're maybe already using some Spark or some Airflow, or maybe they don't have any of that at all, and they're using a different scheduler and execution engine. What are some of the pieces of the sort of canonical data stack, or some of the common systems that people might be using, that can be replaced by Prophecy, and what are the pieces that Prophecy is going to integrate with and play nicely with? Prophecy,
[00:19:08] Unknown:
as far as Spark is concerned, first, Prophecy primarily focuses on Spark today. And then, you know, we're slowly extending into Snowflake where, you know, even as you are supporting Spark, you load it into Snowflake at the end, but then you might run a few queries after that on that. So we're going to extend on top of Snowflake because most of our customers are Spark and Snowflake or another data warehouse. So that's the primary computation engine that we support for processing of data. Now when we go to a customer, they already have a footprint: Ab Initio workflows. They might have Informatica, Hive, and Spark.
Right? So what we do is we provide cross compilers. What we have done is gone in and reverse engineered all these legacy formats. And so we can read their proprietary formats in, compile the code, run a whole bunch of compiler analysis and optimizations, and write out high quality Spark code that is very readable and very performant. And then we've done this for some of the large enterprises. So if you have a footprint in other things and you want to move it to Spark, we will consolidate it all on top of Spark. Right? So this is Spark with a data warehouse. Right? So it can be Spark and Snowflake. It can be Spark and Redshift and so on. So that is one thing we do. On top of that, for scheduling, right now we have added support for Airflow, but, you know, we are not as closely tied to Airflow as we are to Spark. Like, Spark we really care about and the data warehouse we really care about, and we're gonna put those two together. And so that's the stack we care about. Not too many others: like, we are not going to support Flink. Right? On the other hand, you know, Airflow, we're a little less fond of. So Airflow is fine, but if somebody has Prefect or somebody has Dagster, we'll support that. Right? So we can add support for new schedulers very quickly.
And then for the metadata system, we will provide our own. Right? So we'll provide the metadata system because traditionally that is something that has lived within the ETL tools or data engineering tools. It's something for which we have not found any good solution in the market. So, for example, we provide, you know, metadata search, column level lineage. So let's say, somewhere in the middle of the code, you call a user defined function. You can just search by it. Right? You can search by a column, find where all that column was used, then click on that, go across workflows, you know, and within workflows and find exactly where that column was last modified. Right? So that kind of search, lineage, and metadata management we provide ourselves. So we'll work with the scheduler you have. We'll work with the Spark you have. And then the metadata we'll provide our own.
And yeah. So that's kind of the stack we have.
[00:21:49] Unknown:
Digging more into the actual platform of Prophecy that you've built, can you talk through some of the architectural elements of how it's implemented and how you're able to manage this translation from visual to code based interaction and some of the complexities that come about trying to manage the sort of dual use case and the range of end user capabilities that you're working with?
[00:22:14] Unknown:
So let me first give a quick summary of how Prophecy is built. So at the bottom layer, you know, we are built as a Kubernetes operator. So we have our own operator written in Go. And, you know, that provides, you know, security and disaster recovery and backup and restore and all of those things. On top of that, we have our microservices layer, which is written in Scala. And this has, you know, a metadata service, code generation service, lineage computation service, scheduling, and so on. So all of them are tied to Kafka, which is, you know, the backbone of the metadata service. And everything is based on events, and we can go a little bit into that. And on top of that, we have our user interface. The core structure of it, the canvas, the connection to Spark, those are hard coded as you would develop them normally. But the gems in the middle, they are generated from a spec, and their UI is generated from a spec. And we do it where the users can add their own spec and say, you know, I want to generate, you know, this new gem that does data quality. Right? And they can just write 100 to 150 lines of code that will generate a UI, etcetera. So now with this, one very interesting thing is that we've had to merge metadata and code.
And that's been tricky because as you're doing metadata, you have a workflow that's an entity, and then you have many things you know about the workflow. Now you might have code for the workflow that'll go on Git. You might have the user or some basic information about the workflow that will go on Postgres. Right? So we've had to design this hybrid metadata model, where for the same entity, different, what we call, aspects of those entities could go to Git or could go to Postgres. So that's the metadata management. Now the other thing is as the user is writing the code visually, they have a gem, and in the gem, they're writing or developing their code. So they say, okay. I'm going to do some reformatting. They start to type. They type c-o and, you know, the column expression builder will come up and say, looks like you're trying to write a concat, to do an auto complete for you. Right? So there is an expression builder that's helping you do that. As you're doing that, you know, every time you hit save, we take that and from that draft generate code.
And that code has multiple files on Git. Now there are some bounds on what the basic structure of each gem looks like. So now I can go to the code and I can edit it. I will change some expressions. I can introduce a few variables. I can introduce a few comments. But then as long as, in compiler terms, the abstract syntax tree of that code broadly matches the spec that has been given, we can parse it back. So we have parsers that will parse back that code and get all the core elements out of it. And so I can show that to you back in the UI. So we have the code generation. We have the parser. Similarly, the lineage will compute lineage by parsing the code. Now where it becomes really tricky is when you go to the gem builder. Because now it's not like we can handwrite a parser, because we don't know what the spec is going to be like. Right? So now the user is saying, this is what I want my Spark function to be like, and in the middle, I want you to fill in A, B, and C. There's sophisticated technology behind that. One other thing on the code side, we use the Monaco editor. So that's Visual Studio Code, essentially the same editor.
And we are running the language server protocol, LSP, behind that. So as you're writing your code, you are getting autocomplete that understands the language. It understands Spark. It understands the columns that are flowing at this point in time. And the same in the visual one. As you're writing in the SQL editor, it'll know what your UDFs are. Plus, it'll understand Spark and the columns and the current schema flowing through. So we have our own language server protocol. So it's designed to look extremely simple.
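The parse-back idea Raj describes can be illustrated with a toy example. The sketch below uses Python's ast module (3.9+) to pull the "core element", the returned expression, out of a gem-shaped function even after a hand edit added a comment and an intermediate variable. It is a simplified analogy, not Prophecy's parser, and the function shape is the hypothetical one from the earlier generated-code sketch.

```python
# Simplified analogy of parsing edited gem code back into its core elements.
# Requires Python 3.9+ for ast.unparse.
import ast

EDITED_SOURCE = '''
def filter_active(customers):
    # a user added this comment and an intermediate variable by hand
    active_status = "active"
    return customers.filter(F.col("status") == active_status)
'''


def extract_gem(source: str):
    """Return (gem_name, returned_expression) if the function still has the
    expected shape: a def whose last statement is a single return."""
    module = ast.parse(source)
    for node in module.body:
        if isinstance(node, ast.FunctionDef) and node.body:
            last = node.body[-1]
            if isinstance(last, ast.Return) and last.value is not None:
                return node.name, ast.unparse(last.value)
    return None


if __name__ == "__main__":
    # prints the gem name and the expression it returns, despite the edits
    print(extract_gem(EDITED_SOURCE))
```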
[00:26:23] Unknown:
But behind that, you know, we're working like crazy to make sure the experience of the user is good. Yeah. Definitely always fascinating, the amount of different things that you can do with compiler technology because at face value, it's, oh, it's a compiler. I just run my code through it, and at the other end, I get something useful. But when you actually start to dig into compilers and think about the different ways that they can be used and the different ways that they manifest, it's pretty astounding.
[00:26:46] Unknown:
Yeah. It's really fascinating. It's been quite interesting for us and, yeah, at least 3 different ways we are using them. Right? So going from legacy to the current ones, going between code and visual, and then definitely, you know, auto completing the AST as the user is writing incomplete code.
[00:27:03] Unknown:
And for the language server protocol, is that something where you have a service that somebody can run on their local machine that will connect up to Prophecy to be able to pull in some of the code intelligence and metadata intelligence so that they can maybe integrate it with Emacs or whatever their IDE happens to be without necessarily having to work within the UI of Prophecy?
[00:27:22] Unknown:
So that is something we currently do not have. The reason why we've gone for that, so one is that we think that Visual Studio Code has become decently good. So in the web, you can edit your code. Now one thing is all our code is stored in Git. So you can always go to your Git repository, pull it to make your changes, and recommit the code, and Prophecy will pull up the new code. Right? The thing that you're missing is that when you're developing with Prophecy, we are actually instrumenting the Spark run. So think of it this way. Right? You have a pipeline that has 10 steps. And now when you say you run it on Spark, Spark will run the entire thing as a single query.
But what we give you is that after every step, you can see your data as it is being transformed. We're, you know, listening in and doing instrumentation and pulling the data out. So that makes it super easy. So for data engineering, right, the programming, it's super easy if at every point in the code, you can see exactly what the data flowing in is, what does the data look like, what does the schema look like? And then, you know, in Visual Studio Code or, sorry, in IntelliJ, you're not going to get that because you're not connected to a live cluster and running it. So the answer is, one, it's a much better experience, we think, when you are connected to a cluster, and we provide that. Second, we think there is Visual Studio Code.
It's a pretty good IDE these days, so you kind of have that. And then the third thing is we also aspire to move toward collaboration. You see, like, this is something which we haven't gotten to, but we're thinking very deeply about. You know, we use Figma a lot for our design. It's just great. Like, 2 people can join into the same design and do it. And we're like, can 2 people go in? Can somebody just say, hey, I'm building my data engineering pipeline. I'm having some trouble here. Send a link to another user in the team and say, hey, can you help me? And the other person joins in right in the middle of that same workflow and says, oh, let me add this transform for you. Can that collaboration be done? So that's something we have our eye on. Right? Because right now, you know, everybody gets their individual Git branches, but, you know, now we are working toward that. In a few months, if you work on a shared branch, 2 people can edit the code at the same time, and especially visually as well. So for that purpose, we think that staying with the web and the shared nature is probably going to be pretty important for us. Yeah. Absolutely. Definitely a lot of
[00:29:56] Unknown:
utility in being able to have a paved path where everybody who's using it is going to benefit from the user experience development and the capabilities that are built into it rather than having to try and build to the lowest common denominator and sacrifice a lot of potential functionality.
[00:30:14] Unknown:
Yes. I have nightmares of supporting all the different, like, Hive Metastore supported so many databases. Then every customer had their own branch. And then they said, oh, I want just my branch with these 3 patches. And then there were people who were, like, spending their entire life figuring out, you know, for every patch, which of the 48 branches it would go into. So, yeah, there's definitely some nightmare there of, you know, this explosion that happens when you try to support everything. It just takes away from it. No, we very much, at least, you know, in the founding team, we really idolize Apple.
Like, okay. We are going to build one way and try to make sure it's really good instead of giving a million options, which is more Android. And then the second thing, you know, is that we care about what the user needs. We're not building technology where we'll, you know, say, okay, you have a problem. Like, let's say what Google would do. Google would say, you have a machine learning problem. We will build TensorFlow. Now, you know, nobody can use it. Okay. Somebody built something good, then they pick Keras and say, okay. Let me put that on. Or you have Kubernetes where it's like, here is a very hard problem. Here is a very hard solution.
Hey, that's one way to do engineering, and that's not a product mindset. Right? Then you go to Apple and Apple's like, hey, we are going to give you this easy interface. Here is your laptop. Oh, it's having heating problems. Let us go fix it. Now they're gonna go build a new microprocessor from scratch, right, which is going to outperform Intel. Like, who knew you could do that? Right? So a lot of deep engineering, a lot of stuff, but all
[00:31:47] Unknown:
giving a simple user interface. Right? So we kind of like that. We more want to be a product company than an engineering company because our job is to make the lives of our users easier. Yes. Definitely a very useful perspective to have, particularly, as you said, if you're trying to help the end user and not just build a fancy tool that you want to sort of sell to whomever is technically minded enough to wanna use it. And so in terms of the platform, you mentioned that you're very invested in Spark, and you settled on Airflow because it's sort of the good enough solution that's widely used and widely supported. And I'm wondering if you can just talk through what your selection criteria was that led you to this ultimate decision of Spark being the core engine and Airflow being the initial first class supported orchestrator.
[00:32:34] Unknown:
In that choice, right, I have a very strong opinion about Spark. I really like it. And the reason is the following. Right? If I were an enterprise customer and I wanted a data engineering system, I would want it to be Spark. And here is the reason. Right? Because first thing, it's open source. Right? We have customers who come to us using some of the products I've mentioned who are like, we are paying 15 to 20 million dollars a year just in licensing cost to our, you know, ETL product. And so the enterprises don't have the appetite, especially the larger ones, to get stuck into a proprietary format again. So one good thing is with Spark, I know that I won't be paying beyond a certain limit because somebody will always have, like, an Amazon Basics of Spark. Right? If you don't wanna go with Databricks, there's always an EMR. It's going to be good enough. Right? And if the prices of Databricks go up, another competitor will come in and say, I'll give you a better Spark. So that's one thing. The second thing is if you look at the programming model that needs to be supported. Right? So SQL is great as a productivity layer for the things that can be handled in SQL, and that is great. But then the essential complexity of data engineering is more than SQL.
Right? So our customers do encryption, decryption. You know, they want to run statistics. They want to figure out data quality. They want to see, hey, is the pattern in my data today different from the pattern in my data yesterday? And, you know, is there a big enough difference for me to start worrying about it? Right? So in the middle of their workflow, they make a REST call and look up a key. So for a million rows, I'm not going to make a million calls. Right? For one partition, I might say, okay, let me look up 1,000 keys at a time. Right?
So SQL doesn't capture it all. So a data warehouse can do some of it, but then there's things it can't do. So what I like about Spark is I have a SQL productivity layer when I want it, and it gives me the productivity. But when I need to drop down further and do something that SQL does not support, I have the ability to do that. So I have a general processing engine, so I will never get stuck trying to do something that I'm not able to do. So I like that. Right? Because as an enterprise, you are trying to make a decision for 10 years, 15 years. Your workflows are gonna run there for that long. So you don't want to get stuck with a technology which will be limiting. Right? And, of course, with new machine learning use cases coming up and others, we just like the fact that it is that powerful and open.
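The per-partition batched lookup Raj mentions is worth seeing in code, since it is exactly the kind of step that falls outside SQL. This is a generic PySpark sketch, not taken from any customer workflow; the lookup endpoint, payload shape, and column names are hypothetical.

```python
# Enrich rows via a REST service without making one call per row:
# collect keys per partition and make one call per 1,000 keys.
import requests
from pyspark.sql import Row, SparkSession

BATCH_SIZE = 1000
LOOKUP_URL = "https://example.internal/lookup"  # hypothetical service


def enrich_partition(rows):
    rows = list(rows)
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        keys = [row["customer_id"] for row in batch]
        response = requests.post(LOOKUP_URL, json={"keys": keys}, timeout=30)
        segments = response.json()  # assumed to map each key to a segment
        for row in batch:
            yield Row(**row.asDict(), segment=segments.get(row["customer_id"]))


if __name__ == "__main__":
    spark = SparkSession.builder.appName("batched_lookup").getOrCreate()
    customers = spark.read.parquet("s3://example-bucket/customers")
    enriched = customers.rdd.mapPartitions(enrich_partition).toDF()
    enriched.write.mode("overwrite").parquet("s3://example-bucket/customers_enriched")
```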
So that's the story of Spark. Right? And then it's going to become quite obvious quite quickly as Snowflake starts to work with larger customers. Today, they don't. They work on the data warehousing side. Right? I was reading through the Snowflake S-1, and they have somewhere between 50 and 60 customers who pay them more than a million dollars a year. Now no enterprise can do any serious ETL or data engineering in a million dollars a year, especially given that Snowflake also adds the cost of hardware into their revenue. So not happening. So no large enterprise is doing ETL on Snowflake.
Right? So as they try to go there, they're building Snowpark and other things, they will also have to add other things, things other than SQL, and they are moving in that direction as well. So I'm just saying that Snowflake will end up reinventing Spark and Spark will end up adding Snowflake. There's no other way out for these guys. So that is the right model, and both platforms will build that. Now coming to scheduling, the usability, like it happens with most of the Hadoop products, the usability of the products is quite low. And the market's response to that has been that, yes, we will, you know, buy Hadoop, but Hadoop is like, what, like, 5 companies merged. I don't know. Like, Hortonworks merged with Cloudera, and then they bought Arcadia Data, the BI company. So they bought the entire stack on top of them. Right? And they're still, what, 3, 4 billion dollars.
They might have just taken two and a half billion dollars of funding, plus the R&D done by Google and Yahoo, plus 15 years of effort, to end up at $3 billion. That's not great. Right? And my read of that is the market clearly says usability matters. Right? And Snowflake, so quickly, so narrowly, just saying I'm going to build a data warehouse, a much lesser product than Hadoop, you know, became so big. So usability matters. So now if you go to Airflow, you know, again, it's one of those things where the usability is quite low. Before Airflow 2.0, we couldn't even support it because there was no API. Even now the API is not complete. So it's like I have to actually go inside the UI and click some things to get some things done because there's not a proper REST
API. It didn't scale beyond a thousand workflows at a time. Now Astronomer is putting an effort behind that. They've got, you know, goals. Like, now we crossed a thousand. Now we have to go to 30,000. Right? And so on. So that's one thing. Still early days for Airflow, I would say. It's used a lot. So if you go to a customer and they already have Airflow, we can support it. Right? So that's good for us. I wonder if it's the long term good scheduling solution. Now the other thing, right, and this is maybe a slight tangent, is, again, coming back to that customer who had 20,000 workflows.
Right? They're like, you know my 20,000 workflows. You know which one depends on which one. So you already have, like, this large dependence graph. Right? Let's call it a plan. Right? You have a workflow that feeds into the other one, to the other one, to the other one. They're like, why do you need me to manually tell you now how to run this and in what order? You already know it. Right? So they're like, can you just run it in this order? And can you also figure out which ones to run together on one cluster and minimize my cost? Right? And then can you also add in recovery code and other stuff so I don't have to do it per workflow?
So then on top of the Airflow, the Prefect, the Dagster, there is this other layer of smartly scheduling things, which is where I think a lot of the value finally is going to be, in smartly scheduling stuff. And then going back to the metadata and feeding information from scheduling, saying, hey, this dataset is 2 hours old. Right? And this one is 20 hours old, or this one is 2 months old and they don't use it. Right? So keeping that scheduling information going back, feeding it to the users of the dataset so they know, there is value in, you know, in scheduling them. And then the last mile of actually scheduling it and getting the answer back, that's the lowest value thing, and that's kind of the layer at which we are stuck in the cloud. So a lot more to be built on top.
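The "you already know the dependency graph, so derive the order yourself" idea can be shown with a toy example. The sketch below uses Python's standard graphlib to compute which workflows are ready to run together at each step; the workflow names and dependencies are invented, and real smart scheduling (cost, recovery, cluster packing) is obviously far more involved than this.

```python
# Derive execution order, and batches that can run together, from a
# dependency graph instead of hand-writing the schedule for each workflow.
from graphlib import TopologicalSorter

# hypothetical dependencies: each workflow maps to the workflows it needs first
dependencies = {
    "load_raw_trades": set(),
    "load_reference_data": set(),
    "clean_trades": {"load_raw_trades"},
    "enrich_trades": {"clean_trades", "load_reference_data"},
    "daily_risk_report": {"enrich_trades"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
while sorter.is_active():
    batch = sorter.get_ready()               # workflows whose inputs are all ready
    print("run together:", sorted(batch))    # candidates for sharing one cluster
    sorter.done(*batch)
```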
[00:39:13] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
In terms of the actual process of designing and building a workflow with prophecy, can you talk through just the overall process and some of the steps involved and the ways that you have designed the overall user experience to be able to span the multiple different roles and stakeholder positions and levels of experience that people are going to have, particularly at some of these larger organizations that you're targeting?
[00:40:13] Unknown:
Sure. Definitely. The first thing with Prophecy is, like, you can go try it out at app.prophecy.io. Sorry for the short plug. Now once you go there, right now the version in the cloud connects only to Databricks, though our enterprise version connects to every Spark. We have Cloudera customers and EMR customers. So now all you have to do is you add your credentials. You say, here is my Databricks workspace. Here is my token. Connect me. And now you come inside Prophecy. You see a project with some example workflows. And now you double click that and you see on your visual canvas a workflow, which has, you know, a couple of gems which are like, I'm reading some data, you know, a few gems which are like, I'm filtering the data, I'm transforming the data, reformatting it, aggregating it, and then writing it out to this other place. Right? So now once you go there, you say, okay. I need to run this. So you go to the workflow and click a button to say, I want a Spark cluster attached to this workflow.
And then you'll say what size. So you say, okay. I want a Spark cluster with 10 cores, and it's like, just go. Or, usually, we call them small, medium, large. So you say, okay. Attach this medium cluster to me. The medium cluster will come attached. It takes a couple of minutes. Then you hit play, and it'll run through. And after every step, you'll be able to see the data. You click on the data. It'll open up as a table. You can see what sample data looks like at each step of the transform, what the stats look like. So very easy to visually follow the data flow from beginning to end. Now if you wanted to do a new workflow, all you do is you say, okay, I'm going to have a source. So you click a source. You say, okay, what kind of file do you wanna read? You say file from S3, and it will just say, okay. And the S3 browser will come up, and you just double click the file and say, I want to read this file. It's a CSV file.
Hit play. It'll read it, say, life is good. Then you move on, and then you can see the data coming through. And then you can say, okay. I think I need to filter this data. Right? Just drop a filter. Double click that. It will say, okay. Write your expression. You'll start writing an expression. An expression builder will come up and say, looks like you're trying to write this. Let me help you. And you write it in SQL. Actually, you can also write it in the UI in Scala or Python, but most people use SQL. So you write a SQL expression, and then as soon as you've written that, you hit play, and it'll run again. An interactive run usually takes less than 3 seconds.
So in 3 seconds, you'll have the new data show up. Then you look at it and say, did I get it right or not? And then you can continue step by step by step building it. By the time you've added 10 gems, which, you know, you should be able to build a workflow in a couple of hours, your workflow is there. You know it works on Spark. It's already tested, at least in the sense that you know it works. So if you are a user who's a visual ETL developer, if you are a data analyst, right, that's all you need to do. Right? You needed some visual drag and drop, and you needed to write some SQL expressions. And that's all you need to care about.
Once you're happy with this workflow, you go to the project that the workflow is in and say commit this, and it will automatically get committed to Git. Now if you want to go schedule this, by the way, if you are a visual ETL developer as opposed to an analyst, you probably want to write a test. You can also say, hey, generate a few tests, and for every transform it'll generate a here's-the-data-in, here's-the-data-out kind of test, or you can say here is the data in and here are the predicates that should hold, and you can generate these tests. And these tests will also go on Git. And then if you do, like, a Maven test, it'll run. Now you've done this. Right? So now you've visually done this.
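The generated tests Raj describes presumably pin a transform's output to a known input. A minimal, hypothetical version of that data-in, data-out style in plain PySpark and pytest might look like the following, reusing the filter_active function from the earlier sketch; the assertions are predicate-style checks, not Prophecy's actual generated test code.

```python
# Hypothetical "data in, data out" test for one transform, runnable with pytest.
from pyspark.sql import SparkSession, functions as F


def filter_active(customers):
    # the gem under test, from the earlier sketch
    return customers.filter(F.col("status") == "active")


def test_filter_active_keeps_only_active_rows():
    spark = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    data_in = spark.createDataFrame(
        [("c1", "active"), ("c2", "churned"), ("c3", "active")],
        ["customer_id", "status"],
    )
    data_out = filter_active(data_in)
    # predicates that should hold on the output
    assert data_out.count() == 2
    assert data_out.filter("status <> 'active'").count() == 0
```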
If you don't care to go under the hood and see the code, you don't have to. But for the users who are so inclined, they can go click and see the code, edit it some. Now yeah. So now you come, commit your code to Git, then you say I want to run it at 9 AM tomorrow. Right? So one is you ran it. You got the target data where you want it. As an analyst, if it was ad hoc, you are happy. Right? Because you read the data from here, you wrote it into Snowflake into your table or into a file, and now you have the data as you want it. You're all good. If you want to schedule it, then you can just go to the scheduler and say, okay, you know, run this workflow 9 AM every day. Point it to this location.
And if it fails, send me an email. And you just add 3 of those gems and say, okay. Deploy it. You can say test it. If you do a test, it'll run it right then. And at each step, you'll see the log and you'll be able to see if it succeeded or not and how much time it's taking. And then you're like, okay. This looks good. Deploy it. One click, it gets deployed, and then it'll run. And then every 9 AM, it'll run. And if it fails, it'll send you an email. So you don't need to worry about that. So, basically, with visual drag and drop and some SQL skills, you can build a full ETL workflow on Spark and build a whole schedule on Airflow, and all of that will run without you ever having to even look at code.
So you can go end to end, starting from development, you know, from just a CSV file or a SQL table, all the way to deployment, you know, in one pass.
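For reference, the "run this at 9 AM every day and email me if it fails" schedule described above boils down to a small Airflow DAG. The sketch below is a generic hand-written version, not the DAG code Prophecy generates; the operator choice, job name, email address, and paths are assumptions.

```python
# A hand-written Airflow DAG equivalent to "run at 9 AM daily, email on failure".
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="customers_pipeline_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 9 * * *",      # every day at 9 AM
    catchup=False,
    default_args={
        "email": ["data-team@example.com"],
        "email_on_failure": True,       # send me an email if it fails
        "retries": 1,
    },
) as dag:
    run_pipeline = SparkSubmitOperator(
        task_id="run_customers_pipeline",
        application="s3://example-bucket/pipelines/customers_pipeline.py",
        conn_id="spark_default",
    )
```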
[00:45:08] Unknown:
In terms of being able to scale, in terms of the cognitive and organizational complexity, as you go from I have this one simple workflow, it's maybe, you know, 6 gems long, so, you know, the 6 different steps, to I now need to be able to, you know, hook into the midway point of this workflow to tee off and go a different direction, and then the output of that workflow is actually a dependency of a third one, and just being able to manage all of these different interconnects and reusable components and, you know, the complex web of requirements that an organization is going to need. How do you handle some of those aspects of being able to scale organizationally?
[00:45:49] Unknown:
So scaling for a workflow, you know, handling multiple cases, can happen at different levels. So first is that within a workflow, we have subgraphs. Right? So you can say here is my subgraph that does this, here is my subgraph that does this. So you don't want to have a canvas with, like, you know, a hundred gems. Right? So you can break it down into particular pieces of the problem. And then after every piece, you can just see the data go in and the data come out and say, is this piece doing the right transformation as I expected? That is great. If not, I can go into it and just handle that. Otherwise, I come out and don't care about it again. So that handles breaking down the complexity that way. The second thing is, you know, organizations need to develop their own transformations, and they can do it at 2 levels. So one is gems, which is saying, here is my data quality library, and you have a data quality thing that will say, okay. Do I have these many nulls? Is this a unique key? Is this this? So you can create your own data quality library, roll it out to the team. And then when other people go in their visual editor and look at gems, some of those gems in the menu are their own ones that are rolled out in their company, and then they can use that. So that means that you can standardize on any Spark code or, coming soon, SQL code that you want for these reusable pieces.
The other thing that people ask for is reusable subgraphs. So they have patterns. They're like, okay. When you're getting data from 1 of these sources, let's say, you know, they read 20 different external systems, and a lot of them are somewhat similar. The names might be different. The column ordering is different. The table structure is different. And then they're consolidating all of it into 1 final target. Then they say, okay. Here is my template of how to read 1 of these sources. And, you know, another 10 sources are gonna come up, so they can create that template. And in that template, think of it as a subgraph with many different gems, each 1 having a different transform. They might have something like, here is an auditing thing that makes sure that the number of rows that came in were the number of rows that went out. We didn't lose any. You need to have auditing in financial services.
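A rough sketch of such an ingestion template with a row count audit, assuming a parameterized PySpark function where the analyst only fills in a few values, could look like this. The function name, source format, and parameters are illustrative assumptions.

```python
# Hypothetical sketch of a reusable ingestion template with a row count audit.
from pyspark.sql import DataFrame, SparkSession


def ingest_source(spark: SparkSession, path: str, rename: dict, target_table: str) -> DataFrame:
    """Read one external source, normalize its column names, audit, and load it."""
    df = spark.read.parquet(path)                 # assumed source format; could be CSV, JDBC, etc.
    rows_in = df.count()

    for old_name, new_name in rename.items():     # sources use different column names
        df = df.withColumnRenamed(old_name, new_name)

    rows_out = df.count()
    if rows_in != rows_out:                       # audit: no rows lost along the way
        raise RuntimeError(f"Audit failed: {rows_in} rows in, {rows_out} rows out")

    df.write.mode("append").saveAsTable(target_table)
    return df
```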
So the idea is that you have these highly reusable templates, right, that you can use. And this is very important for users, and especially, you know, a platform team can build this for the data analysts and say, hey, if you're reading from 1 of those sources and targeting there, here is your template of, you know, 15 steps, and you just need to fill out these 3 values and you're good. Right? So stuff like that. So building these reusable templates is another thing. So in terms of capturing the complexity, right, there is complexity in terms of scale, in terms of, like, I just have a workflow that's too big. Then there is complexity in terms of, you know, I want to do a lot more kinds of operations. There's complexity in terms of, I want to standardize. And then there is complexity where people want to take, you know, either within the same workflow, or you can break a workflow and say, this workflow writes to this dataset.
And then from there, these 3 different workflows read from that dataset in the project. Right? So there, what you can do is you write multiple workflows. Right? You can say, hey. My project 2 reads from project 1's dataset and starts from there and does the next set of processing. So people break it up into multiple projects sometimes. So what you would see is you would see all those workflows in your schedule. In your Airflow schedule, you'll say run workflow 1, then run workflow 2 and 3 together, then run workflow 4, so you can do that. Also, when you are trying to understand it, when you're trying to debug it, you can go back into lineage and see your data flow through. So you can see, you know, here is a dataset. It was read by this workflow, which wrote this and this dataset.
These 3 workflows read those after that. So you can monitor and see your values flow through all of these, you know, complex, you know, merges and diverges. Right? So some things are handled at workflow level, some things are handled at schedule level, and then, you know, the lineage will show you everything that's happening across both levels.
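The schedule described above, run workflow 1, then workflows 2 and 3 together, then workflow 4, can be expressed as ordinary Airflow task dependencies. This is a minimal sketch with hypothetical job names and spark-submit commands, not the code Prophecy emits.

```python
# Hypothetical sketch of the cross-workflow schedule described above:
# workflow 1, then workflows 2 and 3 in parallel, then workflow 4.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="project_schedule",
    schedule_interval="0 9 * * *",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    w1 = BashOperator(task_id="workflow_1", bash_command="spark-submit /jobs/workflow_1.py")
    w2 = BashOperator(task_id="workflow_2", bash_command="spark-submit /jobs/workflow_2.py")
    w3 = BashOperator(task_id="workflow_3", bash_command="spark-submit /jobs/workflow_3.py")
    w4 = BashOperator(task_id="workflow_4", bash_command="spark-submit /jobs/workflow_4.py")

    w1 >> [w2, w3]
    [w2, w3] >> w4
```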
[00:49:54] Unknown:
1 of the things that you mentioned too that you ended up deciding to build into Prophecy, because you didn't find an open source solution that fit your particular needs, is the metadata layer. And 1 of the pieces that is tightly coupled to metadata in a lot of ways is data governance and sort of policy enforcement. And I'm curious how you're able to factor that into Prophecy, particularly for organizations that are going to have complex and sometimes conflicting requirements as far as regulatory and compliance restrictions and business rules around how data is used, by whom, and where?
[00:50:32] Unknown:
So we have not gotten as deep into it, and there is a reason for that. We've kind of been able to skirt the issue because when Prophecy is set up in an environment, typically, the customer will have their Active Directory or LDAP. So their identity is coming from there. Maybe, you know, there's an SSO provider, and that's the identity that is used by the customer. When they log into Prophecy, they use the same identity. Now when they go to Amazon and spin up a cluster, that cluster is spun up with their identity. And therefore, that cluster has all the same restrictions or capabilities. They can read only the data that they are, you know, allowed to. So in that sense, what we are doing is we are passing the identity and the restrictions to the processing engine. And at the processing engine level, they are being enforced.
So far, we've been able to skirt that issue. But at some point, we might have to do stuff around it, but it's been working so far. Yeah. Definitely reduces the overall scope creep if you're able to say that's somebody else's problem right now.
[00:51:41] Unknown:
You can handle it this way. That's not our job.
[00:51:45] Unknown:
We are actually very happy about that. We're, like, you know, we just pass your permissions through to the engine. The engine has it, and the engine must have it. Right? That enforcement happens at the engine level. So, yeah, we are pretty happy with that. Another
[00:51:57] Unknown:
question that I had as you were discussing some of the workflow elements and being able to build these different pipelines is the availability of integrating things like dbt, for instance, for an analytics engineering workflow, or bringing in other sorts of frameworks and libraries to be able to execute as part of the overall structure.
[00:52:17] Unknown:
Bringing in other libraries, 1 way to look at this is we have these gems where you can say, like I said, here is a templated code, these are the fill in the blanks, and that templated code can use other libraries. And so you can say, I have a data quality library based on Amazon Deequ, or you can say, I have a data quality library based on Great Expectations, and that's what I want to roll out in my organization. So that is very doable. Now in terms of dbt, we today support Python and Scala as back ends. Right? So for Scala, we support Maven as the build system and SBT, right, which is the Scala build tool. And then the question is, is there, you know, space for a SQL back end, right, which is SQL plus plus? Because you do need the functions. You need the configurations.
And so dbt has added some of those things. So 1 of the thoughts we've had is, can we add dbt as 1 of the back ends where we generate the dbt code? We are looking at it. Some of the things are challenging, and actually we've very recently, as in last week, been thinking, how can we make a SQL back end as well? In dbt, maybe with a few structural changes, we can get there. Because here's what happens, right? When you look at the code that Prophecy generates, there is a main file that has the graph, which is showing you the control flow: here is my first source, data frame 1 reads from it, data frame 2 reads from source 2, the data frame join is the join of source 1 and source 2. Right? So at the top level, you have the control flow, which is kind of copying the visual flow. And then within each of them, there is the actual transformation code in multiple files.
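A rough sketch of what such a generated top level graph file might look like in PySpark, with hypothetical paths, join keys, and function names that simply mirror the description above:

```python
# Hypothetical sketch of a generated top level "graph" file. The control flow mirrors
# the visual flow; in a real generated project each transform lives in its own file.
from pyspark.sql import DataFrame, SparkSession


def read_source_1(spark: SparkSession) -> DataFrame:        # gem: Source 1
    return spark.read.parquet("/data/source_1")             # hypothetical path


def read_source_2(spark: SparkSession) -> DataFrame:        # gem: Source 2
    return spark.read.parquet("/data/source_2")             # hypothetical path


def join_sources(df1: DataFrame, df2: DataFrame) -> DataFrame:  # gem: Join
    return df1.join(df2, on="id", how="inner")               # hypothetical join key


def pipeline(spark: SparkSession) -> DataFrame:
    df_source_1 = read_source_1(spark)
    df_source_2 = read_source_2(spark)
    return join_sources(df_source_1, df_source_2)
```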
Now what dbt does is it hard codes references. They call them models, and so there is a SQL file. The SQL file is the model. And in the model, I'm using some other SQL file, and that's a reference in there. So now the problem we are running into is that if I have 40 files and 40 models, you know, for me to build the visual structure of where the data came from and where it went, I have to open each 1 of those 40 files and then see, for each model, which other models it is referencing. And only then can I build the visual picture in my graph. We're just looking at it and saying, god. Right? So we are trying to use dbt.
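The graph reconstruction problem described here, opening every model file and finding which other models it references, amounts to something like the following sketch, assuming the models are .sql files that call dbt's ref() macro.

```python
# Rough sketch of reconstructing a model dependency graph from a dbt project
# by scanning every .sql model file for ref('...') calls.
import re
from pathlib import Path

REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def model_graph(models_dir: str) -> dict:
    """Map each model name to the list of models it references."""
    graph = {}
    for sql_file in Path(models_dir).rglob("*.sql"):
        graph[sql_file.stem] = REF_PATTERN.findall(sql_file.read_text())
    return graph


# Example: model_graph("my_dbt_project/models") might return something like
# {"orders_enriched": ["stg_orders", "stg_customers"], "stg_orders": [], ...}
```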
Yeah. We will see if we are able to. We're trying to find a workaround there. For us, it is not very different from Scala and SBT. It's a build tool. And, you know, they have added a few macros to SQL, so I think that is interesting stuff, but it's great for startups, I think.
[00:54:54] Unknown:
And in terms of the use cases and workflows and systems that you've seen people using Prophecy to build, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:55:07] Unknown:
We have seen interesting and unexpected ways. We have people who've come in and, actually, booked with us. We have 1 company, actually an insurance provider, who's like, we want to write the business rules that the business rules people want to write, and then these business rules must apply to multiple workflows. So it's an interesting use case. We're like, oh, that's very interesting. So what they have is these people who are like, okay. If this guy has a prescription or this lady has a prescription, and they have this and this prescription, but then they have not paid for this, or this prescription is more than 120 days old, then we should do this. So there are all these analysts who have these, you know, rules that are very specific to the insurance industry.
And then they want to go plug it in the middle of their retail workflow and run the retail workflow with the new set of rules. So for them, we had to prototype a full business rules engine, which injects business rules into ETL workflows and runs them, so they can run it with all kinds of different combinations of business rules and find the right set of customers, you know, that they have to send messages to and all of that stuff. So that was a very interesting use case, which was very new to us. Again, we had to work with the customer to achieve it. And then some customers are just working on their own CICD pipeline. They want the development experience from us, but then they go and say, hey, I'll just attach it to my existing CICD pipeline.
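To picture the business rules case described above, the rules are essentially configurable predicates that analysts define and that get injected into the middle of an existing workflow. A minimal PySpark sketch, with made-up column names and rule names:

```python
# Hypothetical sketch of injecting analyst-defined business rules into the middle
# of an ETL workflow as configurable DataFrame predicates.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

# Rules an analyst might define, e.g. "prescription unpaid" or "more than 120 days old".
# Each rule is a callable returning a Column predicate (hypothetical column names).
BUSINESS_RULES = {
    "unpaid_prescription": lambda: F.col("paid") == F.lit(False),
    "stale_prescription": lambda: F.col("prescription_age_days") > 120,
}


def apply_rules(df: DataFrame, rule_names: list) -> DataFrame:
    """Keep only the rows matching every selected rule."""
    for name in rule_names:
        df = df.filter(BUSINESS_RULES[name]())
    return df


# In the workflow, something like:
# flagged = apply_rules(prescriptions_df, ["unpaid_prescription", "stale_prescription"])
# would run before the downstream retail steps.
```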
Some customers are like, oh, I have an older version of Airflow. So they just take the code from our Airflow project on Git, copy it to their old Airflow, which does not have an API, stick it there, and run it. I wouldn't say I find that many cases where I'm surprised. We had this 1 customer who's like, I am buying these feature stores like Tecton and all of these. But if you add this and this on top of Prophecy, I've got halfway to a feature store. If you just store this thing like this in the metadata, I will get a complete feature store. Can I use you as a feature store? And we're like, okay. That's very interesting. We are finding people, you know, extending what they can do. But just the normal use cases are so many. They're keeping us busy.
[00:57:14] Unknown:
In terms of your experience of building this company and working with your customers and doing a lot of the engineering and architecture and design work, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:57:28] Unknown:
Yes. The interesting challenge is that it's very, very hard to replace a product that has been built over 20 years. And then you go to the customer, and when you go to the large enterprises, they are using every single feature of their product. Right? Not every single, but so, so many features. And everybody has a different feature list. So coming back and being able to rationalize and say what can be built, you know, that has been very challenging. The other thing is that tastes are evolving. Actually, we are learning from customers, because the first go we said, here is the visual development environment, and customers were like, no. No. No. Unless you have code on Git, it doesn't work for us. We're like, okay. And now we're like, oh, here are the gems. They generate code on Git. But then they're like, we want to add this. We want to add this. We want to add this. We have these another 50 transforms. Right? So between the number of operations that they did on premise and the number of new things they want to do on the cloud with data, the scope of data engineering is just so enormous. We were like, okay, this approach of us building the gems ourselves is not doable. We have to throw this code away and rebuild at a different layer of abstraction where we generate these. Right? So a lot of this has come from, you know, iterating with the customers and finding their use cases and saying, okay. The use cases are just so varied that what they need is a framework, right, where they can build their own things and roll out their own things. So some very interesting challenges. And then the other thing is getting a team together, because we have people who are in the UI, who are generating user interfaces from specs and then, you know, WebSockets and language server back ends. And then we have compiler people who are cross compiling stuff and writing parsers and tests. We have people who are making fast execution run on Spark, you know, modifying, like I said, Spark's execution query plan, instrumenting it, getting the data out. We have people who are running on Kubernetes and writing their own operator.
So as a startup, the number of technologies that we've had to get good at to support a use case this large, it's just been tremendous. We're very lucky to have a big and strong team that can do this, but, you know, we have got 2 to 3 people in each 1 of the areas, and there is no way you can even supervise a team like that. Right? You're like, guys, you are the experts. This is what the product needs. Go build what you think is right. So building it has been quite a challenge as well. So lots of interesting challenges, very demanding customers, and a lot of product to build. We're building it as fast as we can, and, you know, it's just, like, never enough.
[01:00:13] Unknown:
And so for people who are interested in what you're building and want to be able to handle the combination of visual and code based execution and have a single platform for being able to build their data workflows, what are the cases where Prophecy is the wrong choice and they might be better served by, you know, 1 of the other visual ETL management systems, or by doing everything in code and just sort of picking and choosing best of breed solutions and integrating it themselves?
[01:00:40] Unknown:
1 is that if you already have a coding culture and you have a really small team, then it's probably not for you. You don't care about things like standardization and data quality, where we are saying, hey, enterprise customers care about getting their unit test coverage up to a certain point and their data quality checks in place, because these businesses are sending data to other businesses. They're relying on that quality of data. They want to have standardization because they have 200 engineers, and every 3 years somebody's gonna go and a new person's coming in. And suddenly, you know, if your code is written as, you know, Spark code or these long SQL scripts, right, the next person has to understand them. So, you know, if you don't have those kinds of problems, if you have a tight knit team of 5 people who are writing code and, you know, are gonna stick around for many years,
you don't need Prophecy. That's overkill for you. The other thing is if you don't have workflows that require scale. Right? Because if your workflows are just going to run on a single node, use Informatica. It'll do the job. Or, you know, write some SQL. Or do what the startups are doing. Right? If you look at the startup workflow, they have Fivetran and they have dbt. Right? You go straight to Snowflake. Why? Because you don't have any of your data in open source formats, so you can't do machine learning. You're not reading something from files in the middle of the pipeline and doing this. It's a simple use case. You know, use Snowflake for it. If you are an enterprise, don't. Right? You're going to run into a lot of trouble. So it just depends. So small startups, small teams, we are not the right fit for sure.
It's only when you get into larger, you know, midsize businesses, larger businesses, or teams which have, you know, a lot of data engineers, like, if you have, you know, 10 or more, that it makes sense. It also makes sense if you have engineers who are already using a visual drag and drop ETL tool, or have been using the last generation of them. It's just painful for them to go to Spark code or to SQL scripts, and, you know, their first scripts are going to be bad, and they're gonna be left with a mess. So don't do that. Don't try to become, well, let me put it in an interesting way.
So if you are a business or an enterprise, you are not going to become a Google by reinventing data engineering. So please don't do that, because that seems like the most approachable thing: I'm going to be like Google. I'm going to write all these scripts to do the same thing I could do easily with a tool. There's a bunch of companies doing that. Please stop, and please don't build in house frameworks. They are the worst thing ever.
[01:03:21] Unknown:
It's the data version of don't build your own crypto.
[01:03:25] Unknown:
Yes. So think of it this way, right? Let's say you had Informatica. What will it give you? Right? It's a great tool. It'll give you visual drag and drop development. That's a plus. It'll give you metadata search and lineage. That's a plus. But what's the negative? You are stuck in some arbitrary XML or JSON format in which all your workflows live. Right? Now what is an in house framework? An in house framework is an arbitrary JSON or XML format that your team came up with, so it has all the downsides. Right? You're not writing in standard SQL or standard Spark. You're writing in this arbitrary format, and you don't have visual development, and you don't have lineage. So it's like, what would I do if I wanted all the bad things of Informatica and none of the good things? I would build an in house framework where you read and write JSON or XML. So that's, like, the worst thing ever.
[01:04:18] Unknown:
And as you continue to build out the product and work with customers and scale your company, what are some of the things that you have planned for the near to medium term?
[01:04:29] Unknown:
I think in the near term, we are very focused on making the development experience, along with, you know, like I said, the gem builder and the template builder, excellent. That's in the very near term. It's like we have to get it in there. Then the next thing is, we already have a metadata system that works and does the basic job, but we want to go to 1 higher layer of abstraction there as well and be able to generate the metadata. So the user should be able to say, I have these entities and these things, and then the metadata system should be able to store that. So that means that you can see in the metadata system what we have stored, but then you can store anything you wanna store. So extending the metadata system, definitely. Also in the near term, the second thing we are doing is SQL and making sure we also cover the data warehouse, because the lifetime of ETL, right, includes when the data lands in S3. For enterprises, it usually lands in S3 from on prem systems.
Then you do all the transforms, then you load it into the data warehouse, but then you build some materialized views. And you're going to build some cubes and reports or whatever. So we want to cover that entire lifetime. Right? So you don't have to go to a second tool. So yeah. So it's covering the whole lifetime and making sure our development is in. That's number 1. And number 2 is probably going to be, you know, metadata, and being able to do the best in class job in metadata. And metadata is another completely open market. There is no dominant product yet. A lot of players, but none of them are dominant. A lot of players just got funded based on all these LinkedIn systems. Right. I was talking to Mars, and then Mars got funded. And then I had a chat with Shirshanka, and then Shirshanka got funded. Both of these guys are on, you know, 2 different startups, both based on LinkedIn's DataHub, because we were talking to them since we took some of the inspiration from them. Very good quality system. Then whoever missed out on that went and funded the Amundsen and Marquez companies. Right? So, yeah. But here is the problem with an independent metadata company. Right? In the very abstract, you can say, I have a metadata company. What does it do? A simplified version of that is, you come and store your metadata into my system, then I will show it back.
The basic problem then becomes, why would anybody put their metadata into somebody else's system? If I have an ETL tool, I already have all the workflows. I already have all the datasets. I will put them in my metadata system and get everything else there. So being an independent metadata provider is a very hard thing, because how do you seed the system? It's a tricky problem for these guys. I think they are being funded, but they don't really have much in terms of customers. I attend their town halls. These guys are making some good progress for sure. We like them. But, yeah, the market still has to play out. Early days. Absolutely.
[01:07:20] Unknown:
Are there any other aspects of the work that you're doing at Prophecy or of the overall space of building data engineering tooling that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. So I wanted to talk about the larger space per se. I feel like
[01:07:34] Unknown:
1 last thing is just that people underestimate how big this space is, and investors don't understand how big this space is. So, basically, if you go to on prem, right, there, Ab Initio was the performance leader and Informatica the volume leader. That is a $10 billion a year market, 1 of the biggest ones. Then the even bigger 1 is the $20 billion enterprise data warehousing market, and let's say Teradata was the biggest player. Right? They became dominant in, like, '85. Now when I was a product manager for Hive, we used to go and say, hey, 70% of the capacity of Teradata is actually being used for ETL. Why don't you move that to Hive, an engine that is focused on higher throughput? Because Teradata is expensive, because it's focused on, you know, low latency with caching and all that stuff. You don't need that for ETL. Why don't you move that to Hive? And that was the repeatable product market fit on which, you know, Hortonworks and Cloudera both had their IPOs.
Now, if I were to apply my correction to the enterprise data warehouse market, $10 billion is ETL tools and at least $10 billion of enterprise data warehousing is data engineering. So the $20 billion market, as large as markets used to be, you know, before cloud, is actually data engineering. The majority of the revenue of Snowflake and the majority of the revenue of Databricks, I will wager above 70%, is data engineering. And it's a beautiful workload. Right? Because as a company, once somebody puts a data engineering workload in, it's gonna run at 9 AM every morning till the end of time, eating compute resources. Right? So first, the data engineering market is much, much bigger than what people have counted on premise.
With a lot of machine learning being more data engineering than anything else, and data volumes increasing, it is growing really, really fast. And it's amazing that on top of Snowflake and Databricks, in terms of management of data engineering, there is just nothing. It's such an open, massive space, and nobody's building anything worthwhile in it. Right? And everybody's struggling, like you said, with data engineering, trying to put things together. So we're very excited. A lot of action is going to happen in this space. It's a massive space. So yeah. So we are super excited to see it. I think the next 4, 5 years will see, you know, a lot of innovation happening
[01:10:04] Unknown:
in this area, and we are, like, super excited to be at the beginning of it. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing and follow along with the progress of your company, I'll have you add your preferred contact information to the show notes. And so with that, I'd like to ask you a closing question. From your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today?
[01:10:27] Unknown:
I think there is no suite of data management technologies today. I mean, like we've been talking about all along, right, whether you want to do development, deployment, or monitoring and management, there is no tool that you can go to to do all of that. So I feel like the whole tooling layer, the whole product layer, just doesn't exist. And if you look at what historically has been there, every enterprise or every business has gone with a single tooling layer, not with 20 different things. What has also happened is in the processing engines, people have converged to 2 platforms, Databricks and Snowflake. So this notion that there's going to be 20 things that everybody will stitch together, it's just a phase. It will go away. So I think the largest thing that's missing is the suite
[01:11:16] Unknown:
that solves your end to end data engineering problem. And because that's the biggest gap we see, that's why we are building a company to fill in that gap, and we're super excited about it. Well, thank you very much for taking the time today to join me and share the work that you're doing at Prophecy. It's definitely a very interesting product and an interesting problem space, so I'm definitely excited to see where that takes you. So thank you again for all of the time and energy you're putting into it, and I hope you enjoy the rest of your day. Thank you, Tobias. Loved being here. Thanks for the conversation. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Raj Bains
Raj Bains' Background and Journey
Overview of Prophecy Platform
Challenges in Data Engineering
Features and Capabilities of Prophecy
Integration with Existing Data Infrastructure
Architectural Elements of Prophecy
Selection Criteria for Core Technologies
Designing and Building Workflows with Prophecy
Scaling Organizational Complexity
Metadata and Data Governance
Integration with Other Frameworks
Interesting Use Cases of Prophecy
Lessons Learned in Building Prophecy
When Prophecy is the Wrong Choice
Future Plans for Prophecy
The Data Engineering Market
Closing Remarks