Summary
Data lakehouse architectures have been gaining significant adoption. To accelerate that adoption in the enterprise, Microsoft has created the Fabric platform, built on its OneLake architecture. In this episode Dipti Borkar shares her experiences working on the Fabric product team and explains the various use cases for the Fabric service.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Dipti Borkar about her work on Microsoft Fabric and performing analytics on data without having to move it to various locations
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Microsoft Fabric is and the story behind it?
- Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. What are the motivating factors that you see for that trend?
- Microsoft has been investing heavily in open source in recent years, and the Fabric platform relies on several open components. What are the benefits of layering on top of existing technologies rather than building a fully custom solution?
- What are the elements of Fabric that were engineered specifically for the service?
- What are the most interesting/complicated integration challenges?
- How has your prior experience with Ahana and Presto informed your current work at Microsoft?
- AI plays a substantial role in the product. What are the benefits of embedding Copilot into the data engine?
- What are the challenges in terms of safety and reliability?
- What are the most interesting, innovative, or unexpected ways that you have seen the Fabric platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data lakes generally, and Fabric specifically?
- When is Fabric the wrong choice?
- What do you have planned for the future of data lake analytics?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Microsoft Fabric
- Ahana episode
- DB2 Distributed
- Spark
- Presto
- Azure Data
- MAD Landscape
- Tableau
- dbt
- Medallion Architecture
- Microsoft OneLake
- ORC
- Parquet
- Avro
- Delta Lake
- Iceberg
- Hudi
- Hadoop
- PowerBI
- Velox
- Gluten
- Apache XTable
- GraphQL
- Formula 1
- McLaren
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. Go to [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for. Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake. And Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey, and today I'd like to welcome back Dipti Borkar to talk about her work on Microsoft Fabric and performing analytics on data without having to move it to various locations. So, Dipti, can you start by introducing yourself for anybody who hasn't heard you in previous episodes?
[00:01:08] Dipti Borkar:
Yeah. Absolutely. Well, great to be back. I was remembering the previous two times, so third time's the charm. Excited to be here. I'm Dipti Borkar. I'm part of the Azure Data leadership team, VP and GM at Microsoft, responsible for our strategic ISVs, including Azure Databricks, as well as our Fabric developer experiences. So, data and beyond, Tobias.
[00:01:35] Tobias Macey:
And for folks who haven't heard you previously, if you can just remind us how you first got started working in data.
[00:01:41] Dipti Borkar:
Yeah. It's been a while now, 20-plus years. I started off in the database lab at UC San Diego doing research on, at that point, semi-structured data. The world was moving from SQL to semi-structured, and interactive applications were starting to go big. So that's where I started my journey, in DB2 Distributed on the Linux side of things, in the core engine, the storage and indexing kernel. And I've been pretty lucky to be involved in pretty major step functions in data. As we were talking earlier, it hasn't been a dull day since I started.
Relational to NoSQL with JSON was a pretty big switch, and I was at the forefront of that. Then the disaggregated stack with a lot of open source tech, Spark, Presto; I founded a company in that space to drive simplicity. And now Gen AI needs data, and that's the next 5 to 10 years of innovation. So it's been a fun journey in data management, and, honestly, it's still evolving. I feel like I'm constantly learning and transforming myself every time there's a new technology, trying to stay relevant.
[00:03:02] Tobias Macey:
So recently, you moved to the team that supports the Microsoft Fabric product. I'm wondering if you can give some sense of what it is, some of the story behind how you came to be involved in it, and some of the ways that people should be thinking about Microsoft Fabric as it relates to their own work and their data systems?
[00:03:22] Dipti Borkar:
Yeah. Absolutely. I'd love to share. There are a lot of people working on Fabric, right? It's a pretty major investment at Microsoft, within Azure Data itself. And if you look away from Fabric for one minute and just look at data platform teams and what they've had to deal with, there's a lot of complexity. Over the years, many, many products, tools, and services have emerged with the goal of solving a problem, but that has added even more complexity and fragmentation. When you look at data platform teams and all the services and products that they use, you think of the MAD landscape, right, the machine learning, AI, and data landscape.
And, in fact, this year was the 10th year of publishing that landscape, and it's getting more and more complex. The logos keep shrinking in size every year, and you can barely recognize them at this point. And if data platform teams are dealing with that level of complexity, it essentially just takes a lot more time to innovate, time to get insights, time to market. And as the Fabric teams talk with customers, whether it's a CIO or a data platform leader, organizations want to simplify. They want to simplify and reduce costs at the base of it, and then, obviously, be faster at innovating and differentiating on top of that. The complex architectures that have been driven out of so many products in the space have made it much harder. And in some ways, that's what Fabric envisions and promises to simplify. There are many platforms that call themselves unified, with this promise of simplicity across all of these components.
But as a cloud, we do have all the pieces. We have many, many pieces that are involved in a data platform, from business intelligence with Power BI, to pipelines and data flows with Azure Data Factory, to our warehouse, lakehouse, real-time intelligence. We have event streams, event hubs, etcetera. All of this really can come together, and that's what Fabric is. It combines the best of all of these core products and services into a single unified platform, which is a SaaS platform, software as a service, so that it's a really seamless experience, low code, no code. And what we've envisioned is to have as minimal knobs as possible, so that everything is auto-managed for you, AI-powered underneath the covers. That's the unified platform that customers can use. So, essentially, taking away that complexity and helping our customers get to market faster is the eventual promise that Fabric's making.
And with a name like Fabric, the vision is to encompass everything, and it's been phenomenal since the GA last November. At Ignite, we announced GA, and we just announced a couple weeks ago that we have about 11,000 customers on Fabric already. So it's very, very great to see the momentum, and I think it's just the beginning. We'll evolve over time. There's lots to build, and there's lots of adoption that will happen as well.
[00:07:02] Tobias Macey:
With the fact that Fabric is focused on being this unified location for data across an organization, particularly for larger businesses that have multiple business units that each have their own little silos of data, what is your heuristic for when somebody asks, is Fabric the tool for me? And what are the cases where you can say, Fabric is a bazooka to your fly, and you're actually better suited using some other tool?
[00:07:35] Dipti Borkar:
Yeah. It's a good question, and one that we get in a slightly different form is: where do I start? What's the best place to start off? What we've seen with Fabric, number one, is that the business model is very different from a range of other services. If you take away the Microsoft stack for a minute and think about a stack that a data engineer might leverage in their data platform: you might have Snowflake. You might have Fivetran to do some of the ETL, ELT. You might have Tableau to do the BI. You might have dbt and other products, Spark and other technologies.
Each of these ends up being provisioned, and oftentimes the users overprovision. And each of these needs to be bought separately and managed separately, a separate line item in the budget. So when you look at the complexity across these, some of it is actually the business model and the cost. With Fabric, you buy one product, and you can use all of these experiences within that product. So the business model itself is pretty transformational. Once you start off with Fabric, you might begin with Data Factory, where you pull in data using a data pipeline or data flows, which have 200, 300 connectors. And then what we've seen is that once you get started with that, you end up using more and more workloads, because it's a pretty seamless experience.
You don't have to go and ask someone for permission to buy from a marketplace, or go back to your admin and say, hey, can you enable this, or we're running out, can you overprovision? All of it is serverless. So, first thing, you only pay for what you use within each of these, and it's all at your fingertips. Typically, we see users start off with one workload and then use 3 workloads, 4 workloads across Fabric, and over time more and more get used. So the simplicity of usage is what we're seeing. Going back to the question of where people get started and what it's good for: what we're seeing is that pipelines with Data Factory get used quite a bit, lakehouses and the data engineering workload get used quite a bit, and then the warehouse. For your medallion architecture, you typically have your raw data and then your bronze, silver, gold layers, etcetera. So customers are starting off with one and starting to leverage others. Depending on your need, you should only buy the SKU that makes sense for you, and it starts with a 2-month free trial, each month at $17,000, I believe. So that's pretty great to get started. And then you can start off with really small SKUs; the SKUs start at F8 and so on, so you can pay very little to begin with and then grow over time to larger SKUs, if that's where the product takes you and you start using more workloads.
Now, in terms of starting points, they tend to be pipelines and ingestion. So you start with ingesting into OneLake, which is the foundation of Fabric, and then using other engines on top of it.
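The medallion flow mentioned above, raw data landing in a bronze layer and being refined into silver and then gold, can be sketched in plain Python. This is a conceptual illustration only; the function names and record shapes are invented, and none of this is a Fabric or Spark API:

```python
# Toy medallion pipeline: bronze (raw) -> silver (cleaned) -> gold (aggregated).
# Layer names follow the medallion convention; nothing here is a Fabric API.

def to_bronze(raw_rows):
    """Land raw records as-is, tagging each with its layer."""
    return [dict(row, _layer="bronze") for row in raw_rows]

def to_silver(bronze_rows):
    """Clean: drop rows missing required fields, normalize types."""
    silver = []
    for row in bronze_rows:
        if row.get("order_id") is None or row.get("amount") is None:
            continue  # a real pipeline would quarantine these rows
        silver.append({"order_id": row["order_id"],
                       "region": str(row.get("region", "unknown")).lower(),
                       "amount": float(row["amount"])})
    return silver

def to_gold(silver_rows):
    """Aggregate: business-ready revenue per region."""
    totals = {}
    for row in silver_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

raw = [{"order_id": 1, "region": "EU", "amount": "10.5"},
       {"order_id": None, "region": "EU", "amount": "3.0"},  # rejected in silver
       {"order_id": 2, "region": "US", "amount": "7.5"}]
print(to_gold(to_silver(to_bronze(raw))))  # {'eu': 10.5, 'us': 7.5}
```

In Fabric, each layer would typically be a lakehouse table in OneLake, with pipelines or Spark notebooks performing the bronze-to-silver-to-gold transformations.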
[00:11:10] Tobias Macey:
On the point of data lakes and lake houses, there has definitely been a significant trend towards that as the default location for data for an organization. Data lakes in general have come to be more well understood from their first introduction where there was all the talk about data swamps, and the lakehouse architecture in particular has helped to give people a common architectural perspective on how to approach that. And I've seen in recent years less emphasis on the data warehouse, more on the data lake or data lake house. I'm wondering what are some of the motivating factors that you see for that overall trend towards lakes being a more widely accepted and more widely adopted approach to how to manage data at the organizational scale?
[00:12:00] Dipti Borkar:
Yeah. Absolutely. I mean, I feel like I'm still learning. I've been in this space for many, many years; 2018, 2019 is when I got involved, working with a lot of the Internet companies at that point. And since then, a lot has changed. In terms of motivating factors, the early one was the disaggregation of the stack. Separation of storage and compute was a big deal, because they could scale at different levels, and the cost is different for each one too. Compute in general was a little more expensive. Storage, with object stores like ADLS and S3 and others, got significantly cheaper for storing larger and larger amounts of data. So I would say the starting point, the disaggregated stack with separation of storage and compute, is proven. It's absolutely proven, and that continues to be one of the motivating factors.
The other one that emerged over time is no lock-in, right? Open formats. It started off with the file formats, which at that point were ORC, Avro, Parquet. Parquet kind of emerged as the more popular one, and then table formats layered on top of it to make it more structured, to simulate in some ways a warehouse. All the metadata, the table schema, got layered on top. And, of course, Delta Lake, Iceberg, and Hudi and others emerged. But foundationally, the motivation is open formats, which means that customers are not locked in. And this has allowed for a lot more interop across platforms. You can store data in one, but you may have your own internal engine; a lot of enterprises innovate internally. They might have their own Python libraries or their own engines, whether it's a query engine or a batch engine, that they could run on top of it. And, of course, the interop across even the commercialized software on top of these open formats is pretty significant now. So that continues to be a huge driver: no more walled gardens. Truly open data.
And this is one that Fabric has adopted, where storage costs are significantly lower now. Warehouses, your more traditional warehouses, are still monoliths, so you still have that coupling of storage and compute, which becomes expensive. Here, you pay for it in different tiers, so costing becomes significantly cheaper than some of the on-prem data warehouses that customers still have, a large amount of them. And with innovative models like Fabric's capacity unit model, where you have virtual currency that you can spend across all of these different workloads, that will help reduce cost. You're not overprovisioning for everything; you have one product that you can use with the same bucket of virtual currency.
Now, the other thing that's emerging is when we look at verticals. You have retail, financial services, travel, telco, media, lots and lots of verticals. Each of them in some ways has become a technology company now; they have some part of tech that is driving their innovation and their new products. But even then, it's very hard to keep up with new skills and new products that come up, maybe a warehouse here or a new tool there. Data lakehouses essentially create that foundation of one lake, which gives you visibility across all the different aspects of data you might have. So from a team skill perspective, you at least have some base level of skill that you can build on top of; you're not bringing in new tech, which means new skills, every 3 years. Hopefully, this stays for a long period of time. It started off with Hadoop, which was extremely complex, and people didn't see much value in it, to a point where it's completely SaaSified now and seamless; you don't need a PhD in data engineering to start using this. So you can start off with a base level of skill now that data lakes are everywhere, and everyone sort of has a common understanding of what a data lake or a lakehouse means. And then, finally, I would say just Gen AI and the innovation that's ongoing: your models are only as good as your data. So increasingly, whether you're using the core foundational models, with OpenAI, Azure OpenAI, or other open source ones, or you're training your own, a lot of this is going to happen on open lakes.
And so that's the other reason. Increasingly, we're seeing a lot more enterprise customers coming to us saying, hey, this combination of Gen AI, open data lakes, and OneLake, tell us more. We believe in this foundational architecture, because that's going to be the next set of innovation that continues, and we want to make sure we're starting off at the right point, with an open data lake. So, looking forward, it's going to be about Gen AI and custom models, enterprises creating their own on top of their own proprietary data.
All of this data will be sitting in open formats in data lakes.
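The idea Dipti describes, a table format layering metadata and a table abstraction over bare Parquet files, can be illustrated with a toy commit log. This mimics the concept only; the JSON shape below is invented and does not match the real Delta, Iceberg, or Hudi log schemas:

```python
import json

# Toy illustration of what a table format (Delta/Iceberg/Hudi) layers over
# bare Parquet files: an append-only commit log that records which files
# make up the table at each version.

log = []  # in a real lakehouse this is a directory of JSON/Avro files

def commit(added, removed=()):
    """Append an atomic commit: files added to / removed from the table."""
    log.append(json.dumps({"version": len(log),
                           "add": list(added), "remove": list(removed)}))

def snapshot(version=None):
    """Replay the log up to `version` to get the live file set."""
    live = set()
    upto = len(log) if version is None else version + 1
    for entry in log[:upto]:
        c = json.loads(entry)
        live |= set(c["add"])
        live -= set(c["remove"])
    return sorted(live)

commit(["part-000.parquet", "part-001.parquet"])            # initial load
commit(["part-002.parquet"], removed=["part-000.parquet"])  # update/compaction
print(snapshot())            # ['part-001.parquet', 'part-002.parquet']
print(snapshot(version=0))   # time travel: ['part-000.parquet', 'part-001.parquet']
```

Because the log, not the file listing, defines the table, readers get consistent snapshots and time travel even while writers add or remove Parquet files, which is exactly the warehouse-like behavior table formats simulate on a lake.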
[00:18:04] Tobias Macey:
On that note of open and open formats, open lakes, Microsoft has been investing substantially into the open source ecosystem in recent years. And Fabric, I noted, relies at least in part on various open source components. I'm wondering if you can talk to some of the benefits that the product team has seen on layering on top of those existing open source solutions, and what are the elements that were built specifically for fabric that are proprietary and built in house?
[00:18:36] Dipti Borkar:
Yeah. Absolutely. You know, personally, it's been fantastic for me to see, and to continue to be involved in, open source. I've been involved in many projects over the past 12 to 15 years. And it's great to see the Microsoft data stack not only adopting open source, which is the beginning, but also giving back and being more involved in the community, which not a lot of companies necessarily do. It started off with Delta Lake. That was the foundation, the primary format that we adopted. And one of the big reasons was that we have a lot of customers that are already on Delta, given our Azure Databricks service. It's a mission-critical service for us and our customers.
We have a lot of data in ADLS Gen2, which is Delta. So that's where we started off. In fact, to embrace Delta, all our engines were redone in large part, because they needed to natively support Delta. It's a pretty big change moving from a proprietary format, especially for Microsoft, to a totally open format: rewriting SQL, rewriting the pipelines, rewriting even Power BI, so that it can directly access Delta without an engine in the middle, without a SQL engine. Power BI now supports Direct Lake, where you can directly read Delta Lake files, which are based on Parquet, and it saves a lot of cost, and it's faster because we've optimized on top of that. There's a sorting order that we add on top for compression as well, which is called VertiPaq. So we've taken the base open source, but we've optimized on top of it. We continue to add our own innovation and IP on top of the core open source. And so it started off with Delta.
A few weeks ago, we announced Iceberg support. Iceberg is coming to OneLake as well, so customers now have the option of picking which format they want, which means that not only do we have full interop with Azure Databricks, which is our service, but also with Snowflake, which we announced. I was at the Snowflake Summit last week. So Snowflake Iceberg data, which might be sitting in S3, for example, can now be shortcutted into OneLake and used as well. We truly believe that storage should be open and formats should be open.
This allows for massive interop across platforms. And with Fabric and OneLake, there are capabilities like shortcutting and mirroring that really get you a single view of your data estate. Wherever your data might be, it might be in a different cloud, it might be in a different format, all of that is accessible through Fabric. So formats are the foundation, as we all know and as you're very familiar with. But on top of that, the engines are now also embracing open source. Last week, we announced our native Spark engine, which is now built with Gluten and Velox, a composable stack underneath for Spark. And we've seen significant improvements in performance, as well as reductions in the amount of CPU used; you're moving from Java to C++, which is a pretty big optimization, and we've seen this with the standard benchmarks, TPC-DS and TPC-H, in terms of how much it benefits. Velox, for example, is essentially a set of operators, and it can be used within any engine. In fact, it's fantastic to see my old team's innovation being used within Fabric, and we're building on top of it and collaborating with Intel and Meta and others. That has been fantastic to see. The other project that we are extremely involved with, and founding members of, is Apache XTable. All of this interop needs to be seamless; otherwise, you're adding yet another layer of complexity for our users, and we don't want to leave it to them to deal with this. That's where XTable comes in. We are partnering with Onehouse and GCP and others.
And GSL, the Gray Systems Lab that's part of the Azure Data product team, is extremely involved with this effort, and that's going to be the core of the interop between the formats. In Iceberg, Delta, and others, the metadata layer is separate; they handle metadata differently. So XTable is used at runtime to essentially create the metadata layer of one format from the other if it does not exist, so that over time there will be bidirectional access: you can read and write both Delta and Iceberg. So it's exciting to see all of this open source. Honestly, it's very good to see projects that I've been involved with being leveraged, and that I continue to get to be involved. And at the end of the day, customers benefit. Customers win when we use more open source. Of course, we have to be thoughtful and careful and make sure enterprise-grade security is added on top, etcetera.
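The XTable idea, that the Parquet data files are shared and only the per-format metadata needs translating, can be sketched conceptually. The dict shapes below are invented for illustration; real Delta logs and Iceberg manifests are far richer:

```python
# Conceptual sketch of cross-format metadata translation (the Apache XTable
# idea): the data files are shared, only the metadata layer differs, so
# interop means deriving one metadata shape from another. These structures
# are made up for illustration, not real Delta or Iceberg schemas.

def delta_to_iceberg(delta_log):
    """Fold a list of delta-style add/remove actions into one
    iceberg-style snapshot listing the live data files."""
    live = []
    for action in delta_log:
        if action["op"] == "add":
            live.append(action["path"])
        elif action["op"] == "remove":
            live.remove(action["path"])
    return {"snapshot-id": len(delta_log), "manifest": {"data-files": live}}

delta_log = [{"op": "add", "path": "a.parquet"},
             {"op": "add", "path": "b.parquet"},
             {"op": "remove", "path": "a.parquet"}]
print(delta_to_iceberg(delta_log))
# {'snapshot-id': 3, 'manifest': {'data-files': ['b.parquet']}}
```

The key design point is that no data is copied: both formats' readers end up pointing at the same Parquet files, which is why the translation can run cheaply at read time.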
And going back to your point on layering, the core open source is great, and we need to innovate on top of that, making sure that our customers are well protected as well.
[00:24:25] Tobias Macey:
In that stack of open source and proprietary, particularly when you get into the security layering and making something enterprise accessible and enterprise ready, what are some of the most interesting or complicated engineering and integration challenges that you're running into?
[00:24:40] Dipti Borkar:
Yeah. You know, I'm not involved with every aspect of Fabric, of course. But in terms of integration, honestly, the first big jump, rewriting the engines, was pretty transformational. It took some time to get there, and not just to support it, but to actually get it performant and optimize the stack in various ways. We now have caching built in on top of OneLake, for example, to optimize that as well. Then there's security layering, and making sure it covers every workload that we have in Fabric, because the scope and the breadth are pretty wide. It's a pretty wide surface area. So making sure that everything is protected, and that there's consistency across each of these workloads from a security perspective, VNet support, for example. And we are building OneSecurity on top, which continues to be a journey. It takes time; it's not that you start off with one thing and you can get it done. We're adding OneSecurity as a layer on top of OneLake, where you have extremely fine-grained access control for every workload.
The way that it's emerging is that some workloads, like SQL, are obviously extremely mature, and we have fine-grained access control within them. But to have this consistency in security across all the workloads, that's taken time. So it's been interesting to see: that unification, in principle, is really great for customers, but it is actually very hard to develop and get there. The SaaS experiences as well: it's simple to say, okay, no knobs, minimal knobs. But what that means underneath the covers is that you really need to ensure you're optimizing for a range of use cases. Typically, with straight-out-of-the-box open source, you might say, hey, set this config to x and set that config to y and that to z, and for this use case, that's the best config. Well, if there are no knobs, you have to do everything underneath the covers. So that's the other area that's challenging: how do you auto-optimize for a range of workloads, and use AI and ML techniques underneath the covers to actually make it happen?
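The "no knobs" approach described here amounts to replacing user-facing configs with heuristics driven by observed workload statistics. A minimal sketch of that idea, where every rule, threshold, and parameter name is invented for illustration and does not reflect Fabric's actual tuning logic:

```python
# Sketch of the "no knobs" idea: instead of exposing tuning parameters,
# the platform derives them from observed workload statistics. All rules,
# thresholds, and setting names here are hypothetical.

def auto_config(stats):
    """Pick engine settings from simple workload heuristics."""
    cfg = {}
    # Small scans favor low parallelism; large scans fan out.
    cfg["parallelism"] = 4 if stats["scan_gb"] < 10 else 64
    # Hot, repeatedly-read tables are worth caching.
    cfg["cache"] = stats["reads_per_hour"] > 100
    # Many tiny files suggest compacting before query time.
    cfg["compact_small_files"] = stats["avg_file_mb"] < 8
    return cfg

print(auto_config({"scan_gb": 250, "reads_per_hour": 500, "avg_file_mb": 4}))
# {'parallelism': 64, 'cache': True, 'compact_small_files': True}
```

A production system would replace these fixed thresholds with learned models and feedback loops, but the shape is the same: statistics in, settings out, with no knob ever surfaced to the user.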
[00:27:14] Tobias Macey:
On that note of AI, and embedding generative AI and other forms of AI into the data layer, I'm wondering if you can talk to some of the benefits that you see of having the Microsoft Copilot functionality as an integrated piece of the Fabric product, versus something that people can use out of band and then apply those changes. Also, some of the ways that that changes the thinking around how the data platform is supposed to work, how people are supposed to interact with these data engineering and data analytics use cases, and some of the potential risks and problems around reliability, given the fact that generative AI is nondeterministic.
[00:28:01] Dipti Borkar:
Yes. Yeah. I mean, we're still all, you know, Microsoft is obviously leading that that, the the revolution or, you know, this, this era in some ways, and it's so great to be part of that. If you just if you if you leave data out for a minute and just talk about, you know, the promise of Jenny and what it, you know, what it allows us to do, productivity gain is a big 1. Right? Whether it's data engineering, whether it's, you know, a doctor typing in, their notes, whether it's a, a financial analyst trying to figure out, the market, productivity and, you know, reducing the amount of effort that anyone does is really 1 of the biggest promises of Gen AI. Right? That's why you call it a copilot. You're not it's not the pilot yet. It is a copilot, and, it assists, you know, humans in some ways and data engineers in our world, to do, more with less, to do things faster, to do things with less errors if you know, and in some ways and so that because you have more data at hand, because it's trained on a lot more, information.
And so productivity, I think, is the number one thing most enterprises are starting with. That means it needs to be part of the experience, because otherwise it's a behavioral change: if it's not part of the product experience, you have to switch contexts, go somewhere else, try something out, and then come back. If Copilot and other AI experiences live natively within the product experience, they become seamless, easier to use, and eventually a habit. That's why Copilot is part of so many Microsoft products.
GitHub, for one, has its own Copilot. VS Code has its Copilot. Fabric has its Copilot, which is now generally available, starting with our Power BI Copilot experience. Essentially, it gives people a much broader view of their data. You can ask about any dataset, even if you don't have that information or the tribal knowledge doesn't exist within your team; the Copilot will know it. It lets you create complex dashboards and go from natural language to SQL, essentially from natural language to whatever the output format is. We've seen that become a big driver, especially for nontechnical users: they may not know SQL, they may not know how to create complex dashboards, but they now have the power to do a lot more with the data they have access to. So every experience within Fabric, whether it's the warehouse experience, the data science experience, or BI, as I mentioned, has a Copilot, and there will be more as we continue to innovate. Now, to your question about safety and reliability: the model also doesn't have all the information. You might still have a lot of tribal knowledge within the company that needs to be taught to the Copilot in some way. So we have the notion of AI skills within Fabric, which means you can update the responses for certain prompts based on what you know to be accurate. A good example: the project name for Fabric before it was announced was Trident.
There's a lot of information about Trident everywhere, and the correlation between what Trident is and what Fabric is may or may not exist. So updating prompt responses and making sure you give correct ones to users is part of the product, and AI skills is a good example of that. But hallucinations are still real with these LLMs, and there's still a lot of work to do in general on safety and reliability. We take it extremely seriously. There are limits, throttling, and so on built in, and the hope is that over time it will keep getting better and better. And now multimodal responses are possible as well.
So it's not just text; there will be visuals, audio, and so on. With every step, we have to build in that responsibility, the responsible AI practices, as part of it.
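To make the grounding idea above concrete for readers: one simple way to teach a model internal tribal knowledge is to rewrite known internal aliases before a prompt ever reaches the model. This is a hypothetical sketch of that pattern, not how Fabric's AI skills are actually implemented.

```python
# Hypothetical sketch of prompt grounding: rewrite internal codenames
# (tribal knowledge) to their public names before calling an LLM, so the
# model doesn't conflate "Trident" the codename with other meanings.
# Illustrative only; not Fabric's AI-skills mechanism.
import re

ALIASES = {
    "trident": "Microsoft Fabric",  # pre-announcement project codename
}

def ground_prompt(prompt: str) -> str:
    """Replace known aliases (case-insensitive) with canonical names."""
    def substitute(match: re.Match) -> str:
        return ALIASES.get(match.group(0).lower(), match.group(0))
    return re.sub(r"[A-Za-z]+", substitute, prompt)

print(ground_prompt("How do I load data into Trident?"))
# → How do I load data into Microsoft Fabric?
```

Real systems do this with curated prompt/response overrides or retrieval over internal docs rather than string substitution, but the goal is the same: inject what the organization knows to be accurate.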
[00:32:48] Tobias Macey:
On that note of modality, and also on the subject of AI, that brings us back around to the idea of the data lake as a de facto, default storage interface. Until recently, the primary type of data everybody was interacting with was textual or tabular. Now, with AI, vector embeddings, and multimodality, we're bringing in more unstructured and semi-structured data, whether in the context of LLMs and retrieval augmented generation or in the context of more detailed search functionality, being able to search audio and images. I'm wondering how that element of multimodality, AI, and retrieval augmented generation is being factored into the ways the Fabric team is thinking about the role they play in this ever-expanding data ecosystem.
[00:33:44] Dipti Borkar:
Yeah, it's a great question. We're seeing a lot of innovation, both on the analytics side and on the operational side. If you start with the simpler Gen AI applications, the easiest way to get going is to embed a bot into a website that does a variety of things: buying recommendations, customer support, handling a range of tasks. For those, the vector index obviously comes in very handy. Cosmos DB, which is extremely widely used and is actually one of the back-end databases used by ChatGPT, offers vector indexing. Our Postgres offering supports vector indexing. Essentially, all of our databases offer support for a vector index, so that context is stored and you can build very rich Gen AI applications on top. Now, on the other hand, you have your lakes, as we've been discussing. You obviously have structured data on top, but there's a lot of semi-structured data as well: you might have images, company logos you want to analyze, and so on.
And currently there's deep integration between Fabric and AI Studio and our Azure AI services. That's one piece, and vector search is part of our Azure AI experience, so you can support vector search on top of the lake. But stay tuned: these experiences are getting more and more integrated. Today you might have to use two or three services to pull it together; tomorrow it's all built in and it's a seamless experience. One thing we're seeing is that customers want this available as easily as possible, so they're not stuck doing all the integration work across these services themselves. So while the pieces exist today, whether it's vector indexes as part of our databases or vector search as part of Azure AI on top of OneLake and so on, it will become even more seamless over time as more of the Gen AI stack comes together.
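For readers newer to vector indexes: at its core, vector retrieval means embedding items as vectors, then returning the stored vectors nearest to a query vector. The dependency-free sketch below uses brute-force cosine similarity; the managed services discussed here use approximate-nearest-neighbor indexes instead, and the document names and vectors are made up.

```python
# Minimal vector retrieval: score every stored vector against the query
# by cosine similarity and return the top-k ids. Real vector indexes
# (e.g. HNSW-based) avoid this full scan; this is just the core idea.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """corpus: list of (id, vector); returns ids of the k nearest by cosine."""
    ranked = sorted(corpus, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

docs = [("faq", [1.0, 0.0]), ("pricing", [0.9, 0.1]), ("blog", [0.0, 1.0])]
print(top_k([1.0, 0.05], docs, k=2))  # → ['faq', 'pricing']
```

In a RAG application, the returned documents would then be stuffed into the LLM prompt as context; the retrieval step itself is nothing more exotic than this ranking.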
[00:36:19] Tobias Macey:
And then bringing it around to your experience in this space: as we've noted, you previously worked at Ahana, which was very focused on the lakehouse paradigm, and you have a lot of context on data lakes in general. I'm wondering how your prior experience at Ahana, working in the Presto ecosystem with cloud storage and data lakehouse architectures, has informed the work you're doing at Fabric, and in what ways it's a continuity versus a new experience.
[00:36:52] Dipti Borkar:
Yeah, it's been quite some time in the lakehouse space. Founding a company on lakehouses itself shows how bought in to the tech I was. There are a few different aspects. I get to influence and help the teams on the open source front, for sure, still actively participating in a lot of the projects we talked about. The other aspect is the entrepreneurial nature and attributes that have been core to what I've done in the past: bringing in new ideas and starting new things. Zero-to-ones are always hard, whether at a startup or within a large company with deep pockets. Resources are scarce everywhere; you always want to do a hundred things and you can do five. So in some ways that entrepreneurial characteristic has helped me drive new areas.
For example, with Fabric, we are now opening up the platform. We obviously have our six core workloads, the Microsoft workloads like Power BI and others within Fabric. But there are narrower gaps that we may not invest in, and it makes sense for our partners to come in, build on top of Fabric, and integrate into it natively, so they can offer those experiences as well. So we have opened up the platform: not only open data and open APIs, but the platform itself, through the workload development kit that I announced at Microsoft Build, our big developer conference a few weeks ago. Any ISV, any startup can come in and build a native experience on top of Fabric. So in some ways, Fabric is an OS, a data operating system.
Microsoft has its set of apps, and now we're bringing ISVs in to build their own native apps inside Fabric and really make it a strong ecosystem overall. At the end of the day, customers win: they have a lot more experiences and capabilities to choose from. A couple of examples we announced were SAS, Informatica, Esri for geospatial, Neo4j for graph, and Teradata. Yes, we have a warehouse, but now the Teradata engine runs on OneLake. Also Profisee for master data management. This is just the beginning; we'll have more and more ISVs come in. The other area where I've been able to drive zero-to-one experiences on Fabric is bringing in technologies like GraphQL, which is essentially an API on top of data services. So we added GraphQL support.
We added user data functions natively within Fabric. These are all zero-to-one experiences that it's been great to work with the teams to drive and deliver, and they're now publicly announced, so I can finally talk about them. Being involved in completely new areas and driving the ecosystem has been great, because I'm fairly connected with the data space and with the products and leaders within it. And I would still say that having a customer-first mindset matters: as a founder, you're constantly thinking about the customer base. You talk to hundreds of customers, and prospects, with the hope of them becoming customers. Here I can leverage a lot of that background and the rich solution architectures I've been exposed to in the past, now with Fabric, and really innovate at massive scale. So it's been really fun: health care, financial services, telco, a lot of adoption happening with very large ISVs as well as enterprises.
And so it's been great to influence all of these areas.
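The appeal of GraphQL as "an API on top of data," mentioned above, is that the caller names exactly the fields it wants back. Here is that core idea stripped down to a field projection in plain Python, with a made-up dataset and no GraphQL library; a real GraphQL layer adds a schema, a query language, and resolvers on top of this.

```python
# Illustrative only: the essence of GraphQL-over-data reduced to a
# field projection. The dataset and field names are hypothetical.

DATA = {
    "customers": [
        {"id": 1, "name": "Contoso", "region": "EU", "arr": 120000},
        {"id": 2, "name": "Fabrikam", "region": "US", "arr": 95000},
    ]
}

def query(entity: str, fields: list) -> list:
    """Return only the requested fields for each row of `entity`,
    mimicking a GraphQL client choosing the response shape."""
    return [{f: row[f] for f in fields} for row in DATA[entity]]

print(query("customers", ["name", "region"]))
# → [{'name': 'Contoso', 'region': 'EU'}, {'name': 'Fabrikam', 'region': 'US'}]
```

The design win is that over-fetching disappears: the same endpoint serves a dashboard that needs two fields and a sync job that needs all of them, without separate bespoke APIs.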
[00:41:24] Tobias Macey:
And in your experience of working in the space and working on Fabric, what are some of the most interesting or innovative or unexpected ways that you've seen the platform applied?
[00:41:33] Dipti Borkar:
Yeah, it's interesting; there are always great stories. A recent one: I like speed, I like driving fast, and I follow Formula 1. McLaren is one of the users we highlighted; they're using our Real-Time Intelligence stack, which we recently announced, to store their data, because they obviously need very high ingestion speeds and the ability to analyze the data at great speed. That's a really good one, and a personal favorite of mine. More broadly, it's been interesting to see how creative our customers are getting with Fabric across verticals; it's really a horizontal platform. It's helping medical researchers, given electronic medical records, for example, and bringing financial data, benchmark data, composite data, and so on from various different sources into OneLake so there's a single view of that data. And I think the multicloud aspect, the shortcut aspect of OneLake, has truly been something a lot of customers are very interested in. We've had virtualization for a long time; the concept itself isn't new. But the ease with which we enable it, and the fact that we write everything into Delta Lake by default so you can immediately access that data through various other workloads, that part is quite revolutionary.
So I think that is going to drive a lot more data gravity into OneLake, and it will be interesting to see more and more use cases on top.
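The "shortcut" idea described above, where data stays where it lives but appears under a single logical namespace, can be illustrated with a toy path resolver. The logical paths, schemes, and target URIs below are invented for the example and are not OneLake's actual layout or API.

```python
# Toy illustration of virtualization via shortcuts: entries in one
# logical namespace resolve to data held in other clouds or accounts.
# Paths and URIs are hypothetical, not OneLake's.

SHORTCUTS = {
    "/lake/sales": "s3://partner-bucket/sales/",        # lives in S3
    "/lake/events": "abfss://telemetry@corp/events/",   # lives in ADLS
}

def resolve(logical_path: str) -> str:
    """Map a logical lake path to the physical URI it shortcuts to."""
    for prefix, target in SHORTCUTS.items():
        if logical_path.startswith(prefix):
            return target + logical_path[len(prefix):].lstrip("/")
    return logical_path  # not a shortcut: treat as a local/physical path

print(resolve("/lake/sales/2024/06.parquet"))
# → s3://partner-bucket/sales/2024/06.parquet
```

The point of the pattern is that every engine reading through the logical namespace sees one lake, while the bytes are never copied; that is what makes the multicloud "single view of the data" story work without an ETL step.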
[00:43:35] Tobias Macey:
In your experience of working in this space, transitioning into the Fabric team, and understanding the use cases and the customer base, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:48] Dipti Borkar:
Yeah, I think there are a lot of good lessons learned. In some ways, as we've evolved over the last five years in data lakes, it's good to see people taking the mistakes made in the past, fixing them, and building on those lessons. In terms of lessons learned, I would say the last mile of making it enterprise grade is one: there are many fundamentals we need to get to for GA, whether it's security, which is a big one, or scale testing, and so on. That continues to be hard, and it takes time.
There's no free way to get there faster. You still have to go through that process of building good enterprise software; there's no way around it. And honestly, it's very important. For the mission-critical applications we're involved with, it's absolutely critical that we go through that process and make sure not only that the capabilities are there, but that the service is reliable and stable, with high availability built in. So many foundational aspects that you just expect out of a Microsoft service take time, and it's still hard, because everything is so tightly integrated that you have to make sure the solution architecture is right.
Internally, the design has to be correct, with no holes. So that process continues to be hard. I always feel like building the next piece of software should get easier, but at least that part still takes time. In terms of adoption, whenever there's a new product, and I've gone through this process many times in the past, adoption itself takes time, and education is a big part of the learning experience for customers. And of course, Microsoft is an extremely well-known brand, and there are a lot of expectations that come with that brand. That said, it's been phenomenal to see how fast Fabric has grown and been adopted.
But even then, it is a process. Adoption takes time. Education, community building, all of these things remain critical to getting customers and end users comfortable with the tech, and training is a big part of it. So we're putting a lot of these things in place, because at the end of the day it's a new product, and we need to make sure our customers have all the tools available to make them successful. Otherwise, in some cases, it may not leave a good taste in their mouths.
[00:46:36] Tobias Macey:
As you continue to invest in Fabric, continue to explore the ecosystem of data lakes and data lakehouses, and see the ways that people are applying them, what are some of the projections that you have for the future of data lake analytics, data lakehouse architectures, and the role of AI in the data engineering and data analytics ecosystem?
[00:46:58] Dipti Borkar:
Yeah, it's going to be a fun next five years, Tobias. It'll be interesting to see how it all continues to come together. I'm seeing a lot of consolidation. We've seen consolidation first in the analytics stack, where with Fabric a lot of these pieces are consolidated, and Gen AI is now being consolidated in as well. Increasingly, applications have their operational or OLTP side, a real-time interface, an analytics interface, and a Gen AI interface, which means that the more consolidated the platforms behind them are, the easier it will be for customers to build not only the back-end analytics services but also the front-end customer-facing services. So I do see more consolidation happening in the space. Now, Gen AI itself will evolve to become a lot more robust. It's been mind-blowing already, but you can only imagine the next five years as more GPUs become available, as more training happens, and as more custom models start evolving on proprietary datasets; there's still a lot of proprietary data that enterprises hold. As these evolve, interesting copilots will emerge over time and get built in.
So Gen AI itself will keep evolving. And the hope, at the end of the day, is that as the people building these platforms, we keep this promise of simplification and unification, reduce confusion for customers, and help customers be successful, because as we've seen many times, people adopt technologies and then may not see the benefits of them.
With this phase of innovation, as everyone invests in Gen AI and open data, we really want them to get the most out of it. So anything that speeds that process will, over time, make its way into our products, and it's great to be part of that journey.
[00:49:30] Tobias Macey:
Are there any other aspects of Microsoft Fabric, data lakes and lakehouses, or your experience working in this ecosystem that we didn't discuss yet that you'd like to cover before we close out the show?
[00:49:41] Dipti Borkar:
I think we've covered a lot of different aspects, Tobias, so thank you so much for all the interesting questions, and, in some ways, for the ordering you had as well. Honestly, it's a great platform, but we're still just getting started, and I'm looking forward to seeing where it goes and, hopefully, in the process, making a lot of customers successful and getting them to innovate on top of Fabric.
[00:50:14] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:31] Dipti Borkar:
Yeah, good question. And absolutely, everyone's free to reach out to me on LinkedIn and various other channels. As for gaps: there is a lot more technology now, so I think the gap is really in adopting that technology, understanding it, and embracing it. We still see so many enterprises that are behind the curve. I think it's really time now: data lakes are mainstream, so if you're not using this stack in some form, you are falling behind. There's a lot of great tech available today, whether from the clouds or from other companies. As CIOs and CDOs, it's really important to embrace it, and in some cases maybe stop investing in old technologies and instead look forward and innovate, because in some cases it's existential now.
There will be companies that out-innovate others with Gen AI, whether through LLMs, SLMs, or whatever that might be. It is really important to have that stack. Open data with Gen AI on top is the next-generation stack. It's extremely proven now, so it's time to start getting there soon if you haven't done so already.
[00:52:06] Tobias Macey:
Well, thank you very much for taking the time today to join me and share your experience working in this ecosystem and your insight into the Microsoft Fabric product. It's definitely a very interesting platform, and it's great to see more investment in open data lake infrastructure and its entree into the enterprise. So I appreciate the time and energy you're putting into this space, and I hope you enjoy the rest of your day.
[00:52:30] Dipti Borkar:
Absolutely. Thank you so much for having me again. Take care. Thanks, everyone. Bye bye.
[00:52:42] Tobias Macey:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts, and tell your friends and coworkers.
Introduction and Welcome
Guest Introduction: Dipti Borkar
Overview of Microsoft Fabric
Simplifying Data Management with Fabric
Trends in Data Lakes and Lakehouses
Open Source and Proprietary Elements in Fabric
Engineering and Integration Challenges
AI and Copilot Integration in Fabric
Multimodal Data and AI in Fabric
Dipti Borkar's Experience and Insights
Innovative Applications of Fabric
Future of Data Lakes and AI
Closing Thoughts and Contact Information