Summary
The core mission of data engineers is to provide the business with a way to ask and answer questions of their data. This often takes the form of business intelligence dashboards, machine learning models, or APIs on top of a cleaned and curated data set. Despite the rapid progression of impressive tools and products built to fulfill this mission, it is still an uphill battle to tie everything together into a cohesive and reliable platform. At Isima they decided to reimagine the entire ecosystem from the ground up and built a single unified platform to allow end-to-end self service workflows from data ingestion through to analysis. In this episode CEO and co-founder of Isima Darshan Rawal explains how the biOS platform is architected to enable ease of use, the challenges that were involved in building an entirely new system from scratch, and how it can integrate with the rest of your data platform to allow for incremental adoption. This was an interesting and contrarian take on the current state of the data management industry and is worth a listen to gain some additional perspective.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to go.datafold.com/dataengineeringpodcast to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Your host is Tobias Macey and today I’m interviewing Darshan Rawal about Isima, a unified platform for building data applications
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you are building at Isima?
- What was your motivation for creating a new platform for data applications?
- What is the story behind the name?
- What are the tradeoffs of a fully integrated platform vs a modular approach?
- What components of the data ecosystem does Isima replace, and which does it integrate with?
- What are the use cases that Isima enables which were previously impractical?
- Can you describe how Isima is architected?
- How has the design of the platform changed or evolved since you first began working on it?
- What were your initial ideas or assumptions that have been changed or invalidated as you worked through the problem you’re addressing?
- With a focus on the enterprise, how did you approach the user experience design to allow for organizational complexity?
- One of the biggest areas of difficulty that many data systems face is security and scaleable access control. How do you tackle that problem in your platform?
- How did you address the issue of geographical distribution of data and users?
- Can you talk through the overall lifecycle of data as it traverses the bi(OS) platform from ingestion through to presentation?
- What is the workflow for someone using bi(OS)?
- What are some of the most interesting, innovative, or unexpected ways that you have seen bi(OS) used?
- What have you found to be the most interesting, unexpected, or challenging aspects of building the bi(OS) platform?
- When is it the wrong choice?
- What do you have planned for the future of Isima and bi(OS)?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Isima
- Datastax
- Verizon
- AT&T
- Click Fraud
- ESB == Enterprise Service Bus
- ETL == Extract, Transform, Load
- EDW == Enterprise Data Warehouse
- BI == Business Intelligence
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data level security, auditing, and de identification features eliminate the need for time consuming manual processes. And their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's I-M-M-U-T-A. Your host is Tobias Macey. And today, I'm interviewing Darshan Rawal about Isima, a unified platform for building data applications. So, Darshan, can you start by introducing yourself?
[00:01:56] Unknown:
Thanks, Tobias, for having me on the show. Yeah. I'm the CEO and founder of Isima. We are a startup out of Silicon Valley. We came out of stealth about 3 months ago. And I've been in the data management space for about a couple of decades, I would say. I was head of products for DataStax before this and was in engineering roles at Yahoo, at telcos, and at D. E. Shaw before that. And I saw firsthand how the best technology companies achieve impact with data, which is nothing like what enterprises have been able to do for the last decade, and that's how we started this. So that's us.
[00:02:33] Unknown:
And do you remember how you first got involved in data management?
[00:02:36] Unknown:
That's a great question. I tell folks that I've been in data management just under the auspices of working in telcos or at Yahoo or at D. E. Shaw. It all really started when I first arrived in Silicon Valley in 2000. We were at a telco company building appliances for folks like Verizon and AT&T, and the likes of Oracle, MySQL, and an in-memory database called TimesTen just weren't cutting it. Some research was coming out of Berkeley at that time on NoSQL and such, and then the dot-com crash happened. Our team had to huddle to build some core components of a database that frankly resembled Apache Cassandra, which was open sourced by Facebook in 2008. So we kind of pioneered a bunch of NoSQL technologies in 2001.
And then I went to Yahoo, where that data management foundation was solidified as a consumer of data trying to put machine learning in production. And this was for a use case as difficult as click fraud. You know, I'm not sure if you're familiar, but clicks are what make Google a $100,000,000,000 company, and there's a lot of fraud in there. So how do you put machine learning in production for click fraud? This is before Hadoop. This is before big data as a term was coined. So I got that first-hand experience. And then it was cemented when I went to D. E. Shaw to build a high-frequency trading platform, ironically, in the middle of 2008.
And then fast forward to 2013, I was head of products for DataStax, which allowed me to build empathy for the enterprise market. And I essentially concluded that the pitch of "we built it at Facebook, Netflix, and LinkedIn, so you should buy it from me" just doesn't work outside a 50-mile radius of Silicon Valley. Technology is obviously a strength, but expecting an enterprise to hire the kind of skills that Silicon Valley has just doesn't work. So I've been in data management, seeing it as a consumer and as a builder of these technologies, for the last, I would say, almost 2 decades.
[00:04:22] Unknown:
And so from all of that experience, you mentioned that you've just launched out of stealth with this new company in the form of Isima, and you have a product that you're calling bi(OS). And I'm wondering if you can just start by giving a bit of an overview about what it is that you're building there and your motivation for creating this brand new data platform for managing applications that are oriented around derived data and gaining value from raw data assets.
[00:04:50] Unknown:
That's a brilliant question. So let's start with the latter. What was the motivation? The motivation came, first, from my experience at DataStax, where I was a huge proponent of trying to sell this outside of Silicon Valley, you know, to the Home Depots of the world, to the JP Morgans of the world. And what I came to realize about the last decade of big data, which we might as well call the lost decade of big data, is this: we have lots of open source, lots of scale technologies, AI, ML, cloud native, you name it. The buzzwords are there. But for data management, you really have to rewind the clock 4 decades. It's been fragmented into the camps of what's called ESB, the enterprise service bus; ETL, extract, transform, load; enterprise data warehouses, EDW; and business intelligence, BI. Right? There have been these 4 camps in some ways. And there have been amazing innovations within each of these categories, but not across all of them. We still glue these 4 technologies together exactly like we used to do it, I would say, even in the Oracle days.
That model breaks down as you try to handle this new velocity of data, new data types, new data structures, and AI/ML trying to monetize it. Right? So today, for every data scientist, there are 10 data engineers, which is unsustainable. And that's what's really holding back the impact of machine learning going live in enterprises. And that's what we aspired to change. The way we are gonna change that is through our product called bi(OS). It's a hyperconverged data platform. It's converging these 4 boundaries into a single platform, looking at everything from ingest to insight. And who are the consumers of data? Specifically, there are people who build microservices and APIs, there are AI explorers, who are exploring data to figure out patterns, and there are business intelligence seekers. Right? Business users trying to get insights.
We want to make those use cases live in weeks and deliver it in a self-serve manner to solve real-time needs. And to do that, you have to look at all aspects of data management across ingestion, storage, processing, and querying, and we provide a single platform to collapse the entire data supply chain into one.
[00:06:52] Unknown:
For organizations that are already struggling with being able to deliver value from data, there's a high probability that they've already got a certain number of systems that are actually functioning as intended and that they have these capabilities built out. What do you see as being the trade offs between having a fully integrated solution along the lines of what you've built with BIOS versus a modular approach where you're using best of breed for the specific point solutions that you want and then integrating across those to allow for evolution of use case and evolution of data capabilities?
[00:07:28] Unknown:
That's a great question, Tobias. And I always give this example. The difference is really between taking a photo with an iPhone versus having an SLR camera to take your photo, a Sony Walkman to listen to your music, and a real phone to make phone calls. Right? And, you know, I enjoy taking photos with an SLR camera, and I have a significantly expensive SLR camera. But I am in the minority. I mean, most people are using iPhones today, and iPhones are getting better and better and better at that. So it really boils down to that difference, and it goes to where the effort is spent in the enterprise. Right? A modular approach will require you to integrate these best-of-breed solutions. You have to manage the OpEx, the CapEx, the governance individually.
And then there is a promised benefit that, you know, each of these individual modular components will give you lots of features, for the off chance that you would need those features. Right? So there are scenarios where that might be worthwhile, but 9 times out of 10, as the world has shown us, when you start converging, you're really looking at the 20 to 30% of the features across the value chain rather than specific features within each specific modular component. So that's really where the trade-offs boil down to. And because of the convergence that we bring to the table, we can provide very clever capabilities that were unheard of before.
For example, you can do real-time things that traditionally you couldn't do because of the silos between these 4 components. So that's what it boils down to.
[00:09:01] Unknown:
For people in organizations who do have those existing data solutions deployed and in production, what are the capabilities for the BIOS platform for being able to integrate with that existing infrastructure, and what are the pieces of that platform that are going to be directly replaced versus just augmented?
[00:09:24] Unknown:
So first, let's start with what we augment. Right? We do not claim to build any smart machine learning model to make your business run better. We don't believe that is right. We believe that the customers know how to run their business models better. So we are very complementary to how you build your ML model, deploy it in production, and impact your business case. Right? So the data app building portion of it is up to you, the customer. But what we claim to be is the best central nervous system for data from ingest to insight. Right? What that means is, whether it's an IoT sensor, whether it's an app, whether it's files from where you're sourcing your data, all the way to getting this data into the hands of a data scientist or an API developer or a business user, that's what we take pride in. So that's what we replace versus complement. And even in the data management space, it's not like we do it right off the bat. We take a lot of care in terms of going with greenfield use cases. None of the technologies that enterprises have in place will get replaced overnight.
We start with a greenfield use case, we show the value upfront, then we go to the next use case and next use case. And then if IT wants to look at it as a general purpose architecture for their IT architecture, they're more than welcome to. But we start with a use case specific approach rather than a tool specific approach.
[00:10:45] Unknown:
And so in terms of the use cases that Isima is going to enable, what are some examples of workflows that would previously be impractical, or require too many resources or too much time to implement, with alternative architectures?
[00:11:01] Unknown:
The key use case that it enables, and even we are surprised by this, is that one data analyst can deliver real-time machine learning for use cases like churn prediction, supply chain optimization, and the like. We are use case agnostic. We are industry agnostic. We are a horizontal data platform. But what we really address is that enterprises do not have the skill sets that a Facebook or Google has. And while there are lots of aspirations to train people to become data scientists and whatnot, what is really necessary today is: how do we deliver impact? Right? And what we make possible is that one data analyst, somebody who has 3 to 4 years of experience out of college, somebody who knows SQL, somebody who is able to look at patterns in the data, can do real-time data engineering: ingest the data, clean the data, curate the data, make it ready to be used by himself or herself in a self-serve manner, build a model, tweak the model, change schemas, do all the data operations part of it, and then deploy that model in production. Right? So that's what we really make possible. And by the way, we make it possible in a real-time manner. Right? So you don't need to worry about long delays and whatnot. And this has been impractical before. First of all, real time itself is considered very expensive or hard.
And then saying that you would be able to do it with one data analyst, not even a data engineer, is something that we truly unblock.
[00:12:27] Unknown:
So I'm wondering if you can dig a bit more into the actual architecture of Isima and some of the ways that it's able to achieve that self-service capacity.
[00:12:37] Unknown:
So it all boils down to looking at the architecture end to end. Right? As I said, you really have to rewind the clock back through the 3 eras of data. Right? We are living in the cloud native world. Before that, we had the open source big data world. And before that, we had the pre-big-data world. And we looked at all 3 of them, and we said, okay, how does data get transformed and moved around for the consumers of data? How have the consumers and the applications of data changed? And finally, how is the deployment paradigm changing? Right? Data platforms don't live in their own vacuum; they are impacted by, and they impact, these other 3 ecosystems. And so we looked at all 3 of them, and we said, let's start at the source of the data. From the source of the data, what are the set of steps that I have to do to make that data available across a variety of use cases? And we distilled that into a bunch of common primitives that allow people to onboard data sets within an enterprise or outside an enterprise fairly quickly.
And it requires you to think about these 4 fundamental paradigms. There is an enterprise service bus, which is used for certain use cases. There is ETL, which is used for certain use cases. And some folks like to talk about ETL versus ELT, but we believe it's a lie because you are doing some other things within the transform stage. Then you have the EDW, the data warehouse, where you structure the data for analytical queries, and then you have BI. And when you collapse all of them together, what you see is this amazing convergence emerge for most customers.
And for them, this is the perfect solution. This also means that we are not trying to compete on every single capability of every single component. What we're trying to do is provide a single layer across all 4 of them.
[00:14:27] Unknown:
Because you went back to square one and reimagined the way the overall end-to-end experience should flow, what are some of the other systems that you looked at as positive or negative inspiration for how to approach the overall implementation?
[00:14:50] Unknown:
That's a great question. Right? So we did something pretty unique. While we have a very strong team in terms of technology and everything else, and we have built our own NoSQL databases and all of it, we listened to the market. We spent a lot of time speaking to customers across the spectrum. And the inspiration that we really got was Nutanix, which is an infrastructure company. And what they did was, ironically, they started in 2008, they went against the cloud world, and they said they were just gonna converge, you know, compute, storage, and network into a simple platform, because an enterprise doesn't need scale. An enterprise needs a simple button. So that was our positive inspiration.
The inspiration that we did have to calibrate was, obviously, because we came from a world of scale, we started out talking to lots of big customers, and this is where you have to be careful about who you speak with. The big customers obviously were interested in scale problems and, you know, all of it. But what we realized is that scale is an illusion. Right? At least when you think from a hardware and systems perspective. When we spoke with lots of customers, we realized that a single customer doesn't have as much scale as the cloud guys hype it up to be. Right? In fact, if you think for a moment, if you take away the top 20 Internet providers, then for the 21st enterprise company, maybe it's a Walmart or Target or whoever, the scale problem becomes a long-tail one. It's not like everybody has a scale problem. They have a scale problem compared to what they had, but in terms of what the cloud can deliver, they don't really have a scale problem. Right?
So that was the thing that we had to calibrate, but we are very fortunate because we believe that the market drives product and the product drives technology, not the other way around. So we calibrated our product to meet the needs of the 90% of the market rather than trying to solve it for the esoteric 10%.
[00:16:39] Unknown:
And as you have been working through building this platform and recently with the launch and bringing people on board, what are some of the ways that your overall design and implementation have evolved and some of the initial ideas or assumptions that you had, which have had to be updated or eliminated in that process?
[00:16:59] Unknown:
As I said, I think one of them goes to the scale thing that I just described. We started out speaking to lots of big enterprise customers, and what we realized is that for them, scale is a big deal. But scale is sort of this illusion that we needed to calibrate in terms of where the market is. Right? So one thing we realized was that most folks don't have a scale problem. On the other side, most enterprises actually have a governance problem. So we prioritized the governance capabilities, and we are very proud that we have spent a lot of time thinking about it and then building those capabilities and putting it all together.
The second one, which was interesting for us, was this big hype around real time, and I keep telling folks that, you know, real time is in the eye of the beholder. When you go to D. E. Shaw, microseconds is real time. When you go to Google, milliseconds is real time. When you go to a supply chain system, 30 minutes is real time. Right? So real time is really in the eye of the beholder of the use case. And one lesson we learned was that just telling customers, hey, we have a real-time platform, so you should start thinking real time, was probably not the right thing to do; we had to calibrate our product to meet the market's needs. So that's essentially what we focus on. Now, to be honest, we can solve a lot of these problems in real time, and we're very competent at solving many of these real-time problems.
But what we learned was we should not force the product capability on the customer and make them think about their business problem in those terms.
[00:18:31] Unknown:
Because you have this strong focus on solving the problems of the enterprise, I'm wondering how you approached the overall user experience in terms of addressing issues around organizational complexity and cognitive complexity, and hiding some of the technical complexities that arise, particularly in these modular architectures?
[00:18:52] Unknown:
Yeah. This was a hard one. I'll be honest with you, Tobias, because, as you rightly said, this modular approach has given rise to lots and lots of tooling. I mean, the number of tools you need to manage a cloud native data architecture is actually astronomically complex, and you probably need a tool to manage those tools. And it's even harder for us because our platform serves multiple personas. Right? We serve the API developer, we serve the AI explorer, and we serve the business intelligence seeker, which is why that unified experience is harder.
So we spent countless hours thinking about these multiple personas. Are these different experiences? Is this a unified experience? And what we came down to is that there is no way the impact of data can be delivered unless these personas come together. Right? This organizational complexity has to be simplified. And anytime we were at a crossroads of features versus simplicity, we chose simplicity. Right? And we are very proud today that our UX actually allows for about 5 different personas to collaborate on a single platform, because we believe that those personas have to come together under a single organizational umbrella, with a single vocabulary, to deliver that symphony of data-driven impact. Right?
And that's what we focused on. But this was a hard 1. We spent countless hours on making it possible. And I would say now our platform is extremely simple and easy to use.
[00:20:24] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. Another challenging element of managing data systems is the idea of authentication and authorization and access control, particularly as you scale across a larger number of users and different use cases and roles within the organization. So I'm wondering how you addressed that problem and what your thinking was in terms of how to implement it.
[00:21:45] Unknown:
Yeah. So, governance with data. Right? And this is, again, where you have to focus on the enterprise, and you have to look not at how Netflix does it or how Google does it, but you should really ask how a Home Depot or a health care provider does it. Right? So we thought about governance end to end, in a multi-tenant environment. And what we said was that we need hierarchical controls over who can ingest, extract, and change data. We have to think about data-at-rest and data-in-motion encryption. We have audit logs built in, so, for example, who did what when. And we have monitoring built in to say who is abusing your system more than others.
So we thought about all of these problems. We didn't just throw the product over the fence and have people think about it after the fact. So governance is not a bolt-on. It's a built-in capability, and that's what we really take pride in. Frankly, all you need is commodity infrastructure on the cloud, you know, machines and local storage, and bi(OS) takes care of everything else from a governance perspective.
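To make the governance model Darshan describes a little more concrete, here is a minimal, hypothetical sketch of the pattern: hierarchical role-based controls over who can ingest, extract, or change data, with every attempt recorded in an audit log of who did what when. The class, role names, and permission names are illustrative assumptions for this sketch, not Isima's actual API.

```python
# Hypothetical sketch of built-in governance: role-based permissions
# plus an always-on audit trail. Not Isima's API; names are invented.
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hierarchical roles: each level adds permissions on top of the last.
ROLE_PERMISSIONS = {
    "viewer": {"extract"},
    "analyst": {"extract", "ingest"},
    "admin": {"extract", "ingest", "change_schema"},
}

@dataclass
class GovernedStore:
    audit_log: list = field(default_factory=list)

    def perform(self, user: str, role: str, action: str) -> bool:
        """Check the role's permissions, then audit the attempt."""
        allowed = action in ROLE_PERMISSIONS.get(role, set())
        # Audit every attempt, permitted or denied: who, what, when.
        self.audit_log.append({
            "user": user,
            "action": action,
            "allowed": allowed,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return allowed

store = GovernedStore()
assert store.perform("dana", "analyst", "ingest") is True
assert store.perform("dana", "analyst", "change_schema") is False
assert len(store.audit_log) == 2  # denials are logged too
```

The key design point, matching the "built in, not bolted on" framing above, is that the permission check and the audit write happen in the same code path, so no caller can act on data without leaving a trace.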
[00:22:46] Unknown:
Can you talk through the overall workflow of somebody who is using bi(OS) and what's involved in getting it set up and integrated with their platforms?
[00:22:56] Unknown:
So bi(OS) encourages more work on design than on implementation, which is the way it should be. Right? The users of bi(OS) are bootstrapped in a few hours. We have a self-serve system; you'll be up and running in a few hours. Then they pick some use case, let's say churn prediction or supply chain management, and model its ontology, which is frankly done outside of bi(OS). They can do it on an Excel sheet if they want. The implementation in bi(OS) takes hours: implement the schema, implement the ontology, what enrichments you want, what features you want. Then they focus on the ingestion of the data, which probably takes a day or 2. And as soon as they start ingesting real data, they start getting insights in bi(OS), which leads them to make further changes, which is also encouraged. Right? We encourage people to tinker with the data model and ingestion and features and everything else.
The whole process takes about a few days. And then once everything is flowing in, they get their real-time BI, they get the data coming in, models, features, everything is there. Then they start extracting value out of the data. So, for example, somebody starts building a churn prediction model on top of it. Right? And this might take a week or 2, and they will do some A/B testing and ask, okay, will this model deliver value or not? Within a week or 2, they come to that conclusion, and then it's time to productionalize. Net net, it takes about 4 weeks for this journey, from ingestion to where the insight is delivered across BI, AI, and API. Right? And that is where bi(OS) is very powerful.
[00:24:21] Unknown:
For the overall life cycle of data, how does BIOS handle the overall management of that? Is it something where once it enters into BIOS, it then becomes owned by that platform? Or is it something where you leverage other source systems and then BIOS is there for being able to handle all of the integration and analytics?
[00:24:42] Unknown:
Yeah. So we are an end to end system. Right? The ingestion part of it, the storage, the processing and query, everything is built into BIOS. You send data to us. We can pull data, or you can push data to the system from files, from other data sources like data warehouses, from real time data sources like IoT sensors. You send data to this platform, you do all of these things within the platform, and then you can extract all the insights from this platform into your data warehouse or anything else. So we really sit between your structured and semi structured data sources on the left, BIOS in the middle, and the enterprise data warehouse on the right.
And 1 of the unique things that BIOS does is, because it does a lot of the heavy lifting of curating these datasets and preparing these datasets, the downstream queries in the data warehouse become a lot faster, because now the data warehouse is not dealing with unstructured data. It's actually dealing with structured and semi structured data. So we take care of all of it for you, but we are not an enterprise data warehouse. People can extract all the data out of BIOS and store it in an S3 bucket or wherever.
[00:25:50] Unknown:
Another aspect of data storage and data access that gets very challenging is when you have to deal with geographies, and you have multiple different units within an organization who have their own compliance issues, where maybe some data has to be homed in the European Union or other data has to be homed in China, and then being able to access all of that information by somebody who's in the United States, handling access controls, but then also just being able to handle scalability of data across geographies. I'm curious how you manage that challenge, particularly because you are focused on the enterprise, and this is something that happens generally when you have organizations that span geographies.
[00:26:32] Unknown:
Great question again. So we don't just deal with kind of geographical distribution of data. We deal with geographical distribution of data across multiple public clouds. Right? So in fact, we're doing that right now, where half of the data is being ingested on Google Cloud and the other half is being analyzed on Azure. So there are multiple levels built in. It's a multi tenant system, so different tenants can have different accesses, and each tenant has a Chinese wall between them, which is very important for multiple lines of business within an enterprise. You can route your users to the closest data location. For example, we dealt with a telco where a lot of data was coming from an on prem Ericsson switch, and it's actually sending all this data, sometimes in files, sometimes in real time. But the data scientists want to work with the tool chain of Google Cloud and TensorFlow.
So we actually had BIOS deployed in this interesting deployment where half of it can be on prem, and the other half can be on the public cloud, where the data analysts and data scientists are working because they like Google's tool chain. But the ingestion is happening on AWS, Azure, or your own data center. So we let you choose where to place your data, closest to the workload of the user that you are focused on right now.
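The placement idea described here — route each user to the copy of the data closest to their workload — can be illustrated with a toy latency lookup. All deployment names and latency numbers below are invented for the sketch:

```python
# Toy router: pick the deployment closest to a user's region.
# Deployment names and latencies (in ms) are hypothetical, purely illustrative.
DEPLOYMENTS = {
    "gcp-europe-west": {"eu": 10, "us": 110, "apac": 190},
    "azure-east-us":   {"eu": 95, "us": 15,  "apac": 210},
    "on-prem-telco":   {"eu": 30, "us": 120, "apac": 160},
}


def closest_deployment(user_region: str) -> str:
    """Return the deployment with the lowest latency for this user's region."""
    return min(DEPLOYMENTS, key=lambda d: DEPLOYMENTS[d][user_region])


print(closest_deployment("us"))  # the US-based analyst lands on azure-east-us
```

A real system would of course also weigh data residency rules and access controls, not just latency; the sketch only shows the "route to the closest copy" decision.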
[00:27:50] Unknown:
For people who are using BIOS, what have you found to be some of the most interesting or innovative or unexpected ways that they've employed it?
[00:27:58] Unknown:
Yeah. The biggest 1 for us was when we started the journey, we wanted to make the life of a data engineer simple. Right? So 2 years ago, when we started this, we were working with an ecommerce player, and a couple of data engineers, you know, very young folks, 2, 3 years out of college, onboarded 80 data sources in a matter of 8 weeks, and we were completely blown away, you know. And they had everything on the planet. Right? Like, they had the Kafkas of the world and enterprise data warehouses. They had columnar data stores, everything. We wanted to make this simple, so we were very surprised: wow, we could actually achieve that impact with BIOS. And this was 2 years ago. But what totally blew us away was what happened recently, where instead of a data engineer, it became a data analyst. Right? Initially, this tool was for engineers who are hardcore software engineers.
But now this tool became something for a data analyst. It was so easy that a product person and a data analyst could onboard data sources, curate them, clean them, prepare them, consume them, build a model, put it in production, and tweak it constantly, a fully self serve experience. And for us, it was very surprising to see someone who barely knew basic Python perform all of this real time data engineering, ML exploration, deployment of the model, all of it in a matter of weeks, and actually show some value to the business and say, here, I changed your supply chain and it became better in a matter of 4 weeks. That is what we achieved, and that totally surprised us, personally.
[00:29:29] Unknown:
Another interesting aspect of the work that you're doing is that you have a fairly small team, and you've been working on it for what could be seen as a relatively short period of time, given that for most databases, the general guidance is that it takes about 10 years before they reach maturity. I'm wondering what your approach has been in terms of how to identify what to focus on, what the immediate value is, and how you're able to compete in the enterprise market against entrenched players like IBM or Microsoft?
[00:30:04] Unknown:
It's not for the faint of heart. Let's start with that. And you're right. I think solving database problems is a decade long fight. It's not a small fight. And to be fair, we have just begun. We have a quality product. We have validated it at big enterprises. But it all boils down to showing outcomes. I think 1 of the biggest problems in the data management space has been this kind of affinity to tools, with each tool doing its own job in a small space, purely because the market is pretty massive. But we believe that outcomes matter. Right? At the end of the day, outcomes matter, and that problem has become worse in the age of AI and machine learning. But we keep on building these ETLs, ESBs, EDWs, and BI to put ML in production. And that just doesn't cut it, because ML is a different beast.
And that's why I go back to asking folks: ask your vendor how many people they have who have put ML in production. Not ML in theory, not experimented with it on Coursera or wrote a blog on it, but actually put ML in production which saved or made money. And there are very few who have done that extremely well at scale, and we are fortunate to have that kind of a team who has done this. And after seeing that, we realized that what is required in this market is simplicity. It's not yet another stream processing, real time ETL tool.
What is required is absolute simplicity for applications to go live. And so we always focus on listening to the market and the customer, figuring out a way to make their life better, to make data driven impact in front of the business. Right? And that's what has been our strength, and that's the only way that you can compete against the entrenched base, as you rightfully said, of IBM and Microsoft. Now we have to be cautious here, because obviously we can't compete with them on the same turf. You know, they have infinite marketing dollars and we don't, even though we arguably have a better product.
That's why we are focused on use cases and saying, let's make use cases live. Let's not fight on technology outcomes. Let's fight on business outcomes. And if the business outcomes win, then that's the only way that we can succeed.
[00:32:16] Unknown:
It's also worth digging more into the fact that because of the way that you've architected this system and some of the user experience, it's allowed you to target folks like data analysts or data scientists, who have a lot of context on what problems to solve and what information they might need to solve them, but not necessarily as much expertise in terms of optimizations for data formatting, or handling things like sharding or partitioning, or being able to maintain the overall health of the source data and the aggregated data over a longer period of time. So I'm wondering how you have approached education of those end users to be able to surface those types of problems at the appropriate time, and what the capabilities are for being able to hide some of those elements because of the way that the system is structured?
[00:33:13] Unknown:
Yeah. That's a brilliant question. So number 1, you need to hide a lot of those things. Right? Because the consumers of data don't care about how data is sharded, or is my 3rd copy of data not available, and, you know, all of that. The system has to become as autonomous as it gets from an operational perspective, the security, the governance, and all of it. So that's the first part. The second part, which we are discovering as we go along, is that because we have made it so simple for the consumers of data to onboard data sources and play with it and, you know, all of that, sometimes with great power comes great responsibility.
Right? So you need to build the checks and balances into the product to say, I'm not gonna let you do this, and here is why. So we recently started introducing some of those capabilities, because some of our customers were using this in an extremely abusive way. And, you know, our system was able to handle it. The upstream system, they had a Kafka cluster which was sending data to us, and, you know, the Kafka cluster would keep on going down because of ZooKeeper going down and all of it. And then we had to take 10x the load on top of us. Now, our system was architected to take that 10x load, but it was so good that they almost got used to the fact that it's okay, you know, if the upstream system goes down, because BIOS can handle it. So what we did was we introduced some alerting capabilities where we would actually tell the customers, hey, you know, your upstream system is down.
And it's okay if you want to hit us hard, and I know we can take on 3x, 4x load. But if you're gonna take this and make it a norm, then you probably want to think about it from that 10x perspective. That was 1 thing. And the second thing was they started using our real time system more like a warehouse to do exploratory queries, which is also fine in our system. We just kind of quarantine that SLA across a different copy of data. But we realized that they were doing it and creating a lot of cost issues for themselves. Right? Less so for us, frankly; we don't care. If they run a complicated query, it's good for us, because we use more compute and storage and we get paid more money for it. But we realized that it was not necessary. Right? It was not the right thing to do. So we worked with them and said, do you really want to extract raw data for the last 10 days in a real time manner?
And what if you extracted only 6 days of data or even 7 days of data? Right? The price difference would be about a 10x difference. So those are a couple of things that come to mind: a, protecting the consumers of data from worrying about the underlying semantics of the operations, but also, b, at the same time, making them aware when something is not right.
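The guardrail described in this answer — absorb a transient 10x burst from a failing upstream system, but alert when the spike becomes the norm — could be sketched as a simple check over recent ingest rates. The baseline, burst factor, and window below are illustrative assumptions, not BIOS parameters:

```python
# Toy guardrail: tolerate short bursts above baseline, but flag sustained abuse.
BASELINE_EPS = 1_000   # events/sec the tenant normally sends (invented number)
BURST_FACTOR = 10      # spikes up to 10x baseline are absorbed silently
SUSTAINED_WINDOW = 3   # alert if the last N samples all exceed the burst line


def should_alert(recent_eps: list[float]) -> bool:
    """Alert only when every recent sample exceeds BURST_FACTOR x baseline,
    i.e. the spike has become the norm rather than a transient failure."""
    window = recent_eps[-SUSTAINED_WINDOW:]
    return len(window) == SUSTAINED_WINDOW and all(
        eps > BASELINE_EPS * BURST_FACTOR for eps in window
    )


print(should_alert([900, 12_000, 950]))        # one-off spike: no alert
print(should_alert([11_000, 12_500, 13_000]))  # sustained 10x+: alert
```

The design choice mirrors the conversation: the system keeps accepting the load either way; the check only decides when to tell the customer their upstream behavior has become a habit rather than a failure.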
[00:35:59] Unknown:
And in terms of your own experience of building the BIOS platform, onboarding users, seeing how they're taking advantage of it and the value that they're able to get, and just the technological and business challenges that you're facing, I'm wondering what you have found to be some of the most interesting or unexpected or challenging lessons that you've learned in that process.
[00:36:19] Unknown:
Yeah. The biggest lesson we have learned is that the way Silicon Valley has sold all of these things is really like a hammer looking for a nail, rather than figuring out what's the right tool to solve the problem. It starts with empathy. You have to build empathy for the enterprise market. You have to understand that, no matter what anybody says, these are cost centers within the CIO organization. There's a huge hyped up demand for data being able to deliver impact, but at the same time, there is also a huge amount of caution, which is what happened with the Hadoop ecosystem. And we need to learn to navigate it. So the biggest lesson was: try to solve a problem for them. Don't try to sell your technology, which has to be your strength, but try to solve a real problem for them. Show them that it actually works.
And then, you know, you have the chance of helping them move up the value chain. Right? But that was a big lesson for us. It took us some time. I'm very happy that we have reached there, where we are very outcome focused. We are not technology focused.
[00:37:19] Unknown:
With all of this discussion about the value that you've been able to create and that end users have been able to realize, it's often useful to understand: what are the cases where BIOS and Isima are the wrong choice and you would be better served with a different technology stack?
[00:37:36] Unknown:
I would say, if you have the same constructs as the top 20 to 30 Internet companies in the world, then BIOS isn't for you. Right? So if you can hire 1, 000 plus engineers to do your data engineering, if you can build your own technology stack, if you can integrate that and somehow glue it together to keep it running, then it's not for you. The honest reality is that most of us are not there, not even Walmart, not even Target. Although they think that they are, they're not. So BIOS is really for folks who want to achieve data driven impact in a very quick period of time, rather than getting married to the tool chain. And it all boils down to what we used to say to Netflix in 2013, when they were a consumer of Cassandra.
We would say, are you the consumer, the watcher of TV, or are you the builder of TV? Right? Which business are you in? And chances are, enterprises are not in the business of data engineering, although they would like to say that. They're in the business of using the data to deliver some business outcome for their business, whether it's supply chain, churn, or whatever. And so for them, BIOS is the right thing. If you are in the data engineering business yourself, if that technology is your product, then BIOS is not right for you.
[00:38:46] Unknown:
As you continue to bring on more customers and evolve the system, what are some of the plans that you have for the future of the business and the technology?
[00:38:56] Unknown:
I always start with: listen to the market, listen to the market, listen to the market, and really make customers successful. Those are the 2 key themes that we are squarely focused on right now. From a product perspective, our goal is to become 10x better than where we are. I believe we have just begun. And this 10x has to be in multiple dimensions. The 3 that we look at are ease of use, to make it even easier to use; time to value, instead of 4 weeks, can we make it a few days; and the resources required. Right? Can we make ourselves 10x better for our customers and deliver that value to the market?
Because the market is desperate for that outcome.
[00:39:33] Unknown:
And are there any particular technical aspects or specific use cases that you have actively decided not to pursue, at least for the time being?
[00:39:44] Unknown:
Any use cases which require you to look at, how should I say, 5 years of multi petabyte data, or something of that sort, are something that we are a little lukewarm on, because we believe that the half life of data is shrinking. Right? We believe that the value of data atrophies fairly quickly. So any use cases which require long form historical analysis are probably the use cases that we would not want to focus on, at least in the short term. And that's essentially what we have decided not to focus on right now.
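The "half-life of data" framing can be made concrete with the usual exponential-decay formula; the 7-day half-life below is purely an illustrative assumption, not an Isima figure:

```python
# Illustrative decay of data value: value(t) = v0 * 0.5 ** (t / half_life).
def data_value(initial_value: float, age_days: float,
               half_life_days: float = 7.0) -> float:
    """Value of a record after age_days, if it loses half its value every
    half_life_days. The 7-day default is purely illustrative."""
    return initial_value * 0.5 ** (age_days / half_life_days)


print(data_value(100.0, 0))   # fresh data: full value
print(data_value(100.0, 7))   # one half-life later: half the value
print(data_value(100.0, 28))  # four half-lives later: mostly atrophied
```

Under this framing, long-form historical analysis spends most of its effort on records whose value has already decayed, which is the intuition behind the decision not to chase those use cases.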
[00:40:17] Unknown:
1 other element of data management and analysis that we didn't dig into quite yet, and that we should probably touch on briefly, is what the capabilities are as far as being able to address the range from fully structured data that's largely textual, through to unstructured data, and on into binary data.
[00:40:44] Unknown:
We focus on structured and semi structured data. We are not focused on these blob store style binary data. Although we are working with a health care provider where they're looking at oncology images and everything else, even they are realizing that the value of that blob of data is in the metadata, which is kind of semi structured data anyway. So effectively, BIOS is a platform for structured and semi structured data and for the metadata of binary data, but it's not a platform to store blobs and blobs of data.
[00:41:14] Unknown:
Are there any other aspects of the work that you're doing at Isima with BIOS, and the challenges that you're looking to solve, that we didn't discuss that you'd like to cover before we close out the show?
[00:41:24] Unknown:
Yeah. The only thing I would say is that whenever we put this proposition in front of our customers, 1 of the questions we always get asked, almost with disbelief, is: you're a 16 person company, and as you rightfully pointed out, it takes a decade for data management technology to mature. How are you gonna succeed in that world? And I answer them: always pick a use case. Let's not get into technology. Pick a use case. Assign 1 engineer who has 2 years of experience out of college, and in 4 weeks, we'll show you the impact. Right? And that is what has worked for us. And our self serve platform is live. You can go to isima.io, register yourself, play by yourself, and you'll see the difference between what was possible before and what's possible now.
[00:42:09] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:42:24] Unknown:
Yeah. I personally think that data management has been split into these 2 worlds: if you want ease of use, you have to give your data to companies, which will make them smarter before they make you smarter. Right? The example is SaaS or cloud native subscription technologies. And I believe that equation needs to reverse. The equation needs to reverse so that tooling becomes so easy to use that it makes you smarter before it makes the tool company smarter. And that's something that I'm looking forward to achieving in the next decade.
[00:42:56] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing. It's definitely an interesting product and an interesting challenge. So I appreciate you taking the time today, and I hope you enjoy the rest of your day.
[00:43:09] Unknown:
Thanks a lot, Tobias, for having me on the show.
[00:43:16] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introduction: Darshan Rawal
Darshan's Background in Data Management
Introduction to Isima and BIOS
Modular vs. Integrated Data Solutions
BIOS Integration with Existing Systems
Real-Time Data Processing and Use Cases
BIOS Architecture and Design
Market Calibration and Customer Feedback
User Experience and Simplification
Governance and Access Control
BIOS Workflow and Setup
Geographical Data Management
Unexpected Use Cases and Customer Impact
Competing in the Enterprise Market
User Education and System Autonomy
Lessons Learned and Market Empathy
When BIOS is Not the Right Choice
Future Plans for Isima and BIOS
Use Cases Not Pursued
Data Types and Storage Capabilities
Closing Remarks and Contact Information