Summary
Gartner analysts are tasked with identifying promising companies each year that are making an impact in their respective categories. For businesses working in the data management and analytics space, they recognized the efforts of Timbr.ai, Soda Data, Nexla, and Tada. In this episode the founders and leaders of each of these organizations share their perspectives on the current state of the market and the challenges facing businesses and data professionals today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3,000 on an annual subscription
- Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
- Your host is Tobias Macey and today I’m interviewing Saket Saurabh, Maarten Masschelein, Akshay Deshpande, and Dan Weitzner about the challenges facing data practitioners today and the solutions that are being brought to market for addressing them, as well as the work they are doing that got them recognized as "cool vendors" by Gartner.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you each describe what you view as the biggest challenge facing data professionals?
- Who are you building your solutions for, and what are the most common data management problems that you are all solving?
- What are different components of Data Management and why is it so complex?
- What, if anything, will simplify this process?
- The report covers a lot of new data management terminology – data governance, data observability, data fabric, data mesh, DataOps, MLOps, AIOps – what does this all mean and why is it important for data engineers?
- How has the data management space changed in recent times? Describe the current data management landscape and any key developments.
- From your perspective, what are the biggest challenges in the data management space today? What modern data management features are lacking in existing databases?
- Gartner imagines a future where data and analytics leaders need to be prepared to rely on data management solutions that make heterogeneous, distributed data appear consolidated, easy to access and business friendly. How does this tally with your vision of the future of data management and what needs to happen to make this a reality?
- What are the most interesting, innovative, or unexpected ways that you have seen your respective products used (in isolation or combined)?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on your respective platforms?
- What are the upcoming trends and challenges that you are keeping a close eye on?
Contact Info
- Saket
- @saketsaurabh on Twitter
- Maarten
- @masscheleinm on Twitter
- Dan
- Akshay
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Saket Saurabh, Maarten Masschelein, Akshay Deshpande, and Dan Weitzner about the challenges facing data practitioners today and the solutions that are being brought to market for addressing them, as well as the work that they are doing that got them recognized as cool vendors by Gartner. So going in order of introduction, Saket, if you wanna go first and just give us a brief introduction.
[00:02:19] Unknown:
Hi. My name is Saket Saurabh. I'm cofounder and CEO at Nexla, and we are a unified data operations platform.
[00:02:27] Unknown:
And Maarten, how about you? Yes. I'm Maarten, cofounder and CEO of Soda, and we're all about data observability.
[00:02:33] Unknown:
And Akshay, how about yourself? My name is Akshay Deshpande. I'm the chief technology officer at TADA. We are a data fabric platform specializing in the supply chain and retail CPG industries, providing ontology- and semantics-based solutions for supply chain systems.
[00:02:50] Unknown:
And Dan, how about you? Hey. My name is Dan Weitzner. I'm the VP of R&D and cofounder at Timbr.ai. We bring knowledge graphs and SQL to the masses for enterprises and organizations.
[00:03:04] Unknown:
Going in the same order again, Saket, if you wanna share how you first got introduced to the area of data management.
[00:03:10] Unknown:
So before starting Nexla, I had built a company in the mobile advertising space, creating one of the earliest mobile ad servers, and advertising tends to be a very high data volume space, which is why you've seen a lot of good data tech come out of companies like Google, Facebook, and others. So in the process of building that, we created a data platform which was processing over 300 billion records of data a day and, you know, doing some very interesting machine learning on that. This was around the 2012 to 2014 time frame. And, yeah, I could see that the data challenges we were solving there were coming to more industries as more people wanted to work with data.
I'd come to this world of data through the world of compute, having worked at NVIDIA for several years, so I could see a combination of compute efficiency and performance
[00:03:58] Unknown:
and how that applies to data, and bringing that together here at Nexla. And Maarten, how did you get introduced to data? 11 years ago now, I think. I was an early employee at a company called Collibra. They were all in deep. So they were actually building the first data catalog, data discovery platform. That was about at the time that there were no chief data officers or heads of data. So that was a very interesting experience. And I was there for about 7 years before venturing into data observability and data quality management. And, Akshay, how did you get involved in the area of data management? Our team specializes
[00:04:32] Unknown:
in strategic consulting in supply chain for our customers. And as part of these engagements, we always found a common set of problems across the ecosystem: disparate data sources, different processes, different types of data. And putting it all together to get the right business insight was always a challenge. So we actually inverted this problem, pulled all the common problems together, and created a solution which will help accelerate giving the right insights
[00:05:02] Unknown:
to the end users. Dan, how did you get involved in data management?
[00:05:06] Unknown:
Well, it all started as a developer. Right? You can't really decouple data from computers and development, so that's where my journey started. I noticed that there are many logic compositions and large SQL files that someone once wrote in an organization, and now I have to maintain and understand the logic and follow it. And I saw the problems and complexity of doing that. And at the time, my brother, with whom I cofounded Timbr.ai, had a lot of experience in the database management world.
And we talked about good solutions, and one of our principles for driving a good solution, and providing an organizational structure that can scale and deliver the value of data, is to minimize the logic complexity on the data consumers' side and move it more to the data modelers' side. And that's just one of the small benefits that
[00:06:05] Unknown:
the Timbr.ai platform gives as part of creating the knowledge graph. You're the second person I've spoken to recently who was brave enough to start a business with their family. So I'm curious how that's going.
[00:06:16] Unknown:
Well, we have no problems. We already know how to argue and disagree from home. So when stuff arises, we already know how to follow through. And
[00:06:28] Unknown:
we know how to collaborate. We collaborate very well together, and it has proven very useful so far. And so each of you has been recognized respectively as one of the cool vendors by Gartner in their recent report talking about the problems that are facing data professionals. And so I'm wondering if we can quickly go over what you view as some of the biggest challenges that are facing people who are working in the data ecosystem. And so, Akshay, why don't you start us off? If you look at the current situation, the biggest problems which data engineers face are data acquisition,
[00:07:04] Unknown:
preparation of the data, cleansing the data, and getting it in a format which can be easily understood. No matter how you look at it, this is a big, tedious task. You need to have an understanding of the business systems, the processes. So getting it and having that right nomenclature is the biggest thing. Once you have that in place, accessing that data becomes easy. And, Dan, what are your views on the biggest challenges that are facing data professionals?
[00:07:32] Unknown:
There are many challenges, more so as we all go into data-driven development methodologies. There are many different challenges. One of them is, for example, the collaboration between different people that are owners of data, whether it's a data engineer, data scientist, data analyst, or just a developer, or maybe just a platform that consumes the data. There are many, many different data providers and data consumers. We identify that a lot of the bottlenecks are, like Akshay said, in the data preparation, data cleansing space. So we're trying to bring order back from the chaos of all the different data sources that the business has.
So they can create their own data fabric or data mesh, depending on their use case or their needs, as the best approach to have, like, a holistic business
[00:08:25] Unknown:
view of the data. Maarten, I'm sure you have some opinions on the view of the data and some of the challenges related to that. I don't know if you wanna follow on with your views on
[00:08:36] Unknown:
the challenges that data professionals are dealing with in these recent years. I think as companies come to more heavily rely on data, the biggest thing now is that we're very often flying blind. There's not really a structural process or system in place to get ahead of problems with data. And I think as we all know, there's a lot of things that can go wrong when it comes to data. The data source, the ingestion, schema changes, transformation errors, concept drift, you name it. There's tons of things that can go wrong. And very often, we feel a bit out of control because of that.
So there's a clear lack of observability, and that's both kind of software and tooling to do that, but it's definitely also a process. Now that we found an issue, how are we going to deal with that? That, I think, is today probably the biggest problem in productionizing and automating the data. And Saket, how about yourself? Yeah. We call ourselves unified data operations, and, you know, the first challenge we see is operations, which is about bringing scale,
[00:09:35] Unknown:
not just in data volume, but in terms of how things are run. So to scale, I think of how can more people work with data and how can data come to more applications. So I think the first part is a people challenge: there are people of different skill levels, people who are experts in data, people who are experts in data systems. How do they work together so that more people can be working with data? So I agree with the aspect of collaboration that was mentioned. The other part is that of data being available, or high value data being there, both in internal systems in the company as well as outside the organization.
And that creates the challenge of, you know, working with data in terms of integration, preparation, and so on. So we see those two as the people-level challenge and the technology challenge.
[00:10:20] Unknown:
The other interesting element of the data ecosystem, particularly in recent years, is that, you know, the overall base principles, the foundational layers, have remained largely the same conceptually, where we have storage, we have compute, we have processes and mathematical, statistical analyses that we're doing on the underlying data. But as the overall ecosystem matures, there is a whole slew of new terms and new practices and new ways of thinking about these that, you know, if you're new to the space, can be incredibly overwhelming if you're trying to figure it all out from scratch. And each of you has been recognized, as I mentioned at the outset, by Gartner as one of the cool vendors in the space. And I'm wondering if you can just give a bit of an overview of what it is that you're helping to tackle and helping to simplify for data practitioners, so that they don't have to be so overwhelmed by everything that's happening and stay up to date with all of the latest trends, versus being able to lean into the platforms and tools that address the space of complexity that data professionals are dealing with.
[00:11:27] Unknown:
Yeah. Sure. I think it's a good idea that we kind of lay out that terminology as we dive into this. So I think we're all very familiar with data infrastructure. Right? So the ingestion, storage, compute, CI/CD, transformation. That's, I think, a big block if I would have to draw it out. And then there's the observability piece. It's all about, like, metrics, testing, monitoring, lineage, to be able to do things like root cause analysis. So that's the space that we play in. And then we've definitely heard about data discovery, which is ultimately metadata search and business concepts and definitions, to really have people be able to find the data that's available in the organization and then get access to it. And then I think the last big one also in data management is governance.
Typically, only for larger companies, that's more about the ownership, policy, and access management. Think, for example, of GDPR that we have here in Europe. That's giving access to certain people or certain datasets for a certain purpose, a certain time. It's a broad set of complexities before we finally get to, you know, the insights and the automations, where we, of course, have dashboarding tools, automation tools. So those are a bit the components, I think, overall, and we focus somewhere in between, in front of discovery, on observability.
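To make the observability piece concrete, here is a minimal sketch of the kinds of checks Maarten describes, written by hand in Python. The table and column names are hypothetical, and tools like Soda express checks like these declaratively rather than in application code:

```python
import sqlite3
from datetime import datetime, timedelta

def run_basic_checks(conn, table="orders"):
    """Illustrative data quality checks: row count, null rate, freshness."""
    cur = conn.cursor()

    # Check 1: the table should not be empty.
    row_count = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    assert row_count > 0, f"{table} is empty"

    # Check 2: no more than 1% of rows may be missing a customer_id
    # (customer_id is a hypothetical key column).
    nulls = cur.execute(
        f"SELECT COUNT(*) FROM {table} WHERE customer_id IS NULL"
    ).fetchone()[0]
    assert nulls / row_count <= 0.01, "too many NULL customer_ids"

    # Check 3: the newest record should be less than a day old,
    # which catches a silently stalled ingestion pipeline.
    latest = cur.execute(f"SELECT MAX(created_at) FROM {table}").fetchone()[0]
    age = datetime.utcnow() - datetime.fromisoformat(latest)
    assert age < timedelta(days=1), f"{table} is stale by {age}"
```

In practice, checks like these run on a schedule, and a failure routes an alert to whoever owns the dataset rather than raising an exception.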
[00:12:44] Unknown:
And, Akshay, how are you helping to reduce the space of complexity and the overall understanding of the broad ecosystem that is necessary for data professionals to be able to do their jobs? So, Tobias, we start this with a bit of a different approach.
[00:12:58] Unknown:
We start this entire process with creating a semantic model. So we really don't go and start hacking at the physical structure, whether it's coming from SAP, whether it's coming from JD Edwards. We start with the language of the business. Every business has their language, and we start with the semantic model definition of the language which is used by the business. On top of it, we have a domain registry which actually helps in identifying the different business processes which are part of this entity. And we then help by creating a unified data fabric which helps create or map that data from the different disparate sources to the semantic model.
We have more than 200 different connectors, which are part of our data fabric platform, which help ingest data from traditional ERP systems, SaaS applications, some of the newer applications through streaming. There is information which can be ingested from IoT devices and sensors. And we also have the ability of getting data from end users via collaboration and forms. This goes through a set of data cleansing, which is actually enhanced by AI algorithms leveraging fuzzy logic to cleanse some of the data. It also helps in duplicate detection. So as the data moves from one stage to the other, the data by itself is clean when it really populates or gets created into that semantic model. On top of this, we have the rights in terms of the data, who gets access to what, which addresses some of the common challenges between the different disparate systems.
And there's a toolkit of more than 400 different widgets and components, which you can quickly drag and drop through a low-code, no-code solution to give you that end visibility across it. So we solve this by creating the semantic model, making a data fabric, making it tied to a domain, and then giving it the visibility layer, which you can drag and drop through self-serve mechanics.
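As a rough illustration of the fuzzy-logic duplicate detection Akshay mentions, here is a sketch using simple edit-distance similarity; the threshold and the supplier-name matching rule are assumptions for the example, not TADA's actual algorithm:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_duplicates(records, threshold=0.9):
    """Flag pairs of records whose names are nearly identical."""
    dupes = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i]["name"], records[j]["name"]) >= threshold:
                dupes.append((records[i]["id"], records[j]["id"]))
    return dupes

suppliers = [
    {"id": 1, "name": "Acme Manufacturing Inc."},
    {"id": 2, "name": "ACME Manufacturing, Inc"},
    {"id": 3, "name": "Globex Corporation"},
]
print(find_duplicates(suppliers))  # [(1, 2)]
```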
[00:15:07] Unknown:
And, Saket, if you wanna give your overview of how you're helping to reduce and manage the complexity that data professionals are dealing with in their day to day. Yeah. I think, you know, to your question earlier about the terminology, there's a lot of it. Nexla is recognized by Gartner as a cool vendor for data fabric.
[00:15:25] Unknown:
To break down the concept of data fabric, it's really, you know, looking at the metadata to understand what the data is and to automate or simplify the data management. So, you know, until now, data had been serving, you know, AI. Now you can apply AI back to data for the management purposes. That's the Nexla approach. What that means for the end user is we observe the data that we can connect with, and based on that, automatically organize that data into what we call Nexsets, logical units of data that you can work with. And this brings me to the other concept that's been out there, which is that of a data mesh. The concept of having data products that domain users can get access to and use on their own. So by applying this data fabric approach to observing and learning data, we end up automatically creating these Nexsets, these data products that help the domain user, the end user, get to data, easily collaborate on that, and so on. So what it means for the user of data is, first of all, they can connect to a wide variety of systems. We can automatically observe, learn, understand that, then present to them the data products that they can then prepare, integrate with, use, collaborate on, and then use this metadata intelligence, you know, what does the data look like, the schema, when does it come, other characteristics like attribute-level behavior, to then provide a certain degree of monitoring on that data, so that you know that, okay, the data is working as expected.
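A toy version of the metadata-driven observation Saket describes, inferring a schema from sample records and flagging drift, might look like the following; the field names are hypothetical, and Nexla's actual automation is far richer than this sketch:

```python
def infer_schema(records):
    """Infer a simple {field: type name} schema from sample records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema

def detect_drift(expected, observed):
    """Report fields that appeared, disappeared, or changed type."""
    return {
        "added": set(observed) - set(expected),
        "removed": set(expected) - set(observed),
        "changed": {f for f in set(expected) & set(observed)
                    if expected[f] != observed[f]},
    }

baseline = infer_schema([{"order_id": 1, "amount": 9.99}])
today = infer_schema([{"order_id": "A-1", "amount": 9.99, "currency": "USD"}])
print(detect_drift(baseline, today))
# {'added': {'currency'}, 'removed': set(), 'changed': {'order_id'}}
```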
[00:16:51] Unknown:
And, Dan, how are you helping to reduce the complexity and challenges for data professionals in your work at Timbr? There are many ways that I believe we help data professionals
[00:17:01] Unknown:
tackle the problems. One of the best things about us is that we stay in the SQL world. We do not touch the physical data layer at all. We're virtual. We just map from all the data sources to our abstract model, to our knowledge graph. And so you stay in the SQL space that your infrastructure is running on. You can do the cleaning and preparation of the data in the mapping stage while you're mapping the data, but that's just one part of it. Another big part of what we do to help professionals is giving a semantic, holistic view of the business entities.
Anyone can create an abstract view that's detached from the data itself. Business users and domain experts can come in, and they can model the business world as they see it. And then an engineer can come in and map the data. He knows what each column means. He knows what needs to be standardized, which types should be used, and so on. And at the end, you get a data fabric or a data mesh, depending on the use case that you're looking for. And you can create all of them using the guidelines of the semantic web, where you have hierarchies between different concepts and different entities. We also facilitate the use of joins and unions. Since we are an abstract virtual layer, that means that we do have, like, a graph representation of the data, and you can do many graph traversals or segmentations or granularities for the different types of entities that you have in the data.
And when you query the data, you have the different columns exposed to you as virtual columns, so you don't really have to write any joins or unions. You can traverse your data through those virtual columns, and many other things. We also bring graph algorithms to the space. And of course, with the data virtualization, like everyone said, you can bring the data in from many different data sources. And at the end, we expect our data consumers to query the knowledge graph instead of querying those data sources directly.
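To picture the virtual-column idea Dan describes, compare a conventional query against physical tables with a query against a semantic layer. The second snippet is illustrative pseudo-SQL for the concept, not Timbr's actual dialect, and the table and column names are hypothetical:

```python
# Conventional SQL: the consumer must know the physical tables
# and the join keys that connect them.
physical_query = """
SELECT c.name, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.name
"""

# Semantic-layer style: the relationship between orders and customers
# is modeled once in the knowledge graph and exposed as a dotted
# "virtual column", so the consumer writes no join at all.
# (Hypothetical syntax for illustration only.)
semantic_query = """
SELECT customer.name, SUM(amount)
FROM order_entity
GROUP BY customer.name
"""
```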
[00:19:08] Unknown:
And I'm gonna open this up for everyone to speak as they choose. And so one of the core themes that I'm hearing from each of you is that a big source of complexity is that there are all of these different systems that we need to work with, all of these different sources of data, types of data, and we're trying to figure out how to map this to a logical or semantic layer that professionals can work across without necessarily having to worry about doing all of the integration work themselves. And I'd say that that's a big part of the current theme of data management: trying to abstract away the integration elements of data management and data infrastructure. That's where a lot of the previous, you know, decade or two decades' worth of work has been going.
And that now we're looking to figure out how we can work at a higher level to solve the problems that the business cares about, and not have to spend so much time worrying about the problems that the infrastructure cares about. I'm wondering if any of you care to comment or provide input on that thought. I think this is a good approach that is happening. You know, if you think of it, on one hand, you have so much complexity of data.
[00:20:24] Unknown:
On the other hand, because more people need to work with data, of course, you know, an average person is not going to write code. So in between, a lot of things have to be figured out. Right? The variety of data formats, connectors, and so on. So it's good that we are looking at something that understands the data and makes it easier. Because without that, you have to deal with everything in code. So the concept of the semantic layer, I think, you know, makes a lot of sense overall to simplify things. I do think that the complexity of data systems is not to be underestimated.
Anybody who has worked in the enterprise in complex systems, from, you know, mainframes to batch systems to real-time transactional systems, knows that. And therefore, the other thing that has come up in this conversation is collaboration. Right? Making sure that a good chunk of problems can benefit from the semantic layer and become easier, you know, ideally no code. A set of things can be done in low code, because you do need a little bit of that flexibility sometimes. And then for the hardest of the hard problems, have some sort of layer that engineers can get into directly and do what they need to do. But putting those three pieces together is, I think, what is helping companies scale up their teams and the amount of data they work with and the applications they want to enable. My perspective on this is
[00:21:43] Unknown:
having each of these entities within a business, you can call them divisions or departments, own the aspect of data which belongs to their business process or their part of the value chain helps in reducing some of the complexity. Trying to create one big unified model across an enterprise is, as it seems, very complex. But the owners of that business function have a very detailed understanding of it. So them creating this network model and then federating it together as part of the enterprise representation is one way of simplifying that problem. In addition to this, leveraging some of the low-code, no-code tools in conjunction with some of the cognitive metadata management systems helps reduce some of the complexity. Right? Normally, you would have people who are the data analysts or the data engineers map the fields in the end systems to your metadata model, to say this is how the transformation and mapping is to be done. Leveraging some of the cognitive metadata management systems actually accelerates that, or helps reduce some of the complexity. So that's the second aspect which we see. And the third is having this set of connectors with a collaboration framework where, when you cannot go and access the systems directly, you can then leverage some of the form-based mechanisms to fill the gap in the data which is there. And then you combine that with a lightweight registry to do the validation of the data. That will solve some of the problems which have traditionally plagued the data warehouse kind of systems. I totally agree as well. I think the most valuable piece, really, if we think again about, like, infrastructure, observability,
[00:23:34] Unknown:
and then discovery: if you wanna start sharing more data, if you wanna have more people use it, it's really important that it's properly described and defined and searchable, and that there are operational metrics around that data so that you understand if it's usable. Those are really, I think, the technologies today that really help simplify sharing that tacit knowledge, because very often we have a lot of tacit knowledge in the organization that would be super helpful if it were available on an on-demand basis. So I do think that motion of getting that tacit knowledge out and bringing it to the people that want to use data or access data is really key in increasing productivity.
[00:24:12] Unknown:
Absolutely. I totally agree with everything that was said. And, also, I want to add to it that I expect to see more positions in the data world being filled. 20 or 30 years ago, we only had a DBA and maybe a system admin that took care of the data aspects. And today, we have data engineers and data analysts, data scientists. We even have data product managers. I foresee many more positions in the data world opening in the future. And with that also come new platforms and new specialized systems for specific uses in the data world.
[00:24:50] Unknown:
That's definitely another trend that's been happening a lot, as you mentioned, that there has been further specialization and segmentation of responsibilities in the data ecosystem as it becomes more of a first class concern within the organization, where 20 years ago, it was just the database administrator who made sure that the data was available and that somebody could write some SQL queries. And so there was the long process that everybody dreaded of: I have a question, it goes through the DBA, and the answer comes back out the other side in the business intelligence dashboard. And, hopefully, you're able to iterate on that, you know, once a week or, you know, a couple times a month. And now everybody needs to be able to access data, because everybody wants to be able to understand more about how the organization is operating, how their customers are interacting with them, the needs that exist in their specific ecosystem or industry vertical. And so a lot more of the business needs to work with the data on a regular basis. And so some of these roles that use the data are now starting to morph into data oriented roles, where instead of having a product manager that was responsible for the application that generated the data, there's now the data product manager who works closely with the entire team to make sure that the information that's being collected is being organized and structured in such a way that internal customers can actually use it effectively without having to pass it through the proverbial DBA.
And so I'd be interested in some other feedback on some of the ways that sort of vendor platforms and the sort of higher level abstractions that you're each working on can help to support these evolving roles and some of the specific roles that you see coming about as a result of some of the technologies
[00:26:38] Unknown:
and principles that you're all putting forth? Yeah. I mean, I see an ops role coming in almost every function. Right? I mean, you have sales ops, marketing ops, now HR ops, and everything. And what are these ops roles? These ops roles are sort of saying that, hey, I have to use data to run this particular function better. And clearly, these have become the consumers of data, and you can expect, you know, the people in these roles to be pretty data savvy. They're not gonna be writing your code or creating, you know, pipelines by code and all that stuff, but they know that data really well. They're pretty good at figuring out how to stitch things together. And these vendor platforms, you know, like what we are building and what several other companies are bringing, are saying, you know, hey, here's a no code way of getting to this. There's an easier way to get that. In fact, you know, you don't necessarily have to be a DBA to get access to a database. You know, the cloud warehouse world is like, hey, you can have your zone and, you know, have your tables and stuff like that.
So, clearly, I think it's bringing more power to each of these functions, because they all have to work with data. Just wanted to highlight or stress the aspect of what Saket said. Right? The democratization
[00:27:42] Unknown:
of each of these functions. As much as we have these different roles which are coming in, we are also looking at the end user being the most powerful person in terms of how they get access to their data. The traditional way, you would have to create your schema, identify the different dimensions, and you could only then use those many dimensions in terms of doing an analysis. With the way some of the newer platforms and our platform do this, when you create that semantic model, it is essentially a traversable network. You can traverse the entire network from one end of the network to the other end without having to constantly create new sets of schema objects to do this. You start at one end and drill down, which we call an impact line drill-down mechanism, and you can go and analyze this data in a different way. The different connectors which are there, they are all hooked within that data fabric through a low-code, no-code mechanism, drag and drop. So all of that effort is actually done under the hood, which makes it easier. Though there are these different systems, the end user can access it without having to rely on three or four different types of people to really get access to that data. And the last thing is the governance which is there. When you're moving data from different systems, you want the permissions and the rules to travel upstream to the actual end user who's using it, so that you don't lose the rights and access permissions because you created a unified data lake. The ability to traverse that security profile to the end user also ensures that though you're giving the access to the users, your data is still secure and is given to the right roles and responsibilities.
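One way to picture permissions traveling with the data, as Akshay describes, is to carry each source system's access policy into the unified layer and enforce it at read time. A minimal sketch, with hypothetical datasets, roles, and masking rules:

```python
# Each dataset in the unified fabric keeps the access policy it had in
# its source system instead of inheriting the data lake's defaults.
POLICIES = {
    "hr.salaries": {"allowed_roles": {"hr_analyst"}, "masked_fields": {"ssn"}},
    "sales.orders": {"allowed_roles": {"hr_analyst", "sales_ops"},
                     "masked_fields": set()},
}

def read(dataset, role, rows):
    """Enforce the source system's policy when serving unified data."""
    policy = POLICIES[dataset]
    if role not in policy["allowed_roles"]:
        raise PermissionError(f"role {role!r} may not read {dataset}")
    # Mask sensitive fields rather than dropping the whole row.
    return [
        {k: "***" if k in policy["masked_fields"] else v
         for k, v in row.items()}
        for row in rows
    ]

rows = [{"employee": "j.doe", "ssn": "123-45-6789", "salary": 90000}]
print(read("hr.salaries", "hr_analyst", rows))
# [{'employee': 'j.doe', 'ssn': '***', 'salary': 90000}]
```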
[00:29:26] Unknown:
If I take a look at our domain, it's all about data observability and issue management. The first question that always needs to be answered is, who will now look at all these alerts? Who will prioritize what we work on, what we fix? Because not all data is created equal. Not everything is as important. So that's a very interesting question and topic, because this goes into really the operations. Right? How do you deal with this on a day to day basis? And what we've seen is that when there are technical issues, very often either related to ingest or to transformation, we have more and more people that become on call for addressing and handling these problems. So reliability engineering is, of course, the first thing that comes up.
However, when these problems are more linked to what's in the data, think of that as, like, a record level, then we see the role of the data product manager pop up quite heavily. In the kind of operational data world, that's sometimes also called the data owner; in the analytical data world, it's more the data product manager. These are then the people who become more concrete on what is the SLA really that we expect for building a great product. These are also the people who could go to, for example, a data vendor and say, hey, you've delivered data that was not great. These are our expectations. Can you live up to those? Or to internal data producers, so maybe other data domain teams that, you know, designed or have created a dataset, but you have a different purpose, you wanna do something new with it, so you have new data requirements. So starting those conversations.
Those two major roles, I think, are something that we definitely see large growth in, and I think it's a very interesting, you know, area. From our perspective, we built our platform
[00:31:08] Unknown:
as agnostic as possible to the domain, to any domain, so you can create different types of abstractions, as abstract as you want. Right? You can model whole worlds, whether it's like a process type of worldview that you want to see for your data, maybe a decision tree, or maybe just a categorization and classification of your data, the different types of data, but we need to be able to provide the ability to build this abstract world. And that's why we decided to stick to the guidelines and principles of the semantic web. This work has already been done for us, and there are many different ontologies that are already prepared by domain experts, whether it's in the marketing world or the finance world or the health care world, you name it.
And every business that operates in such a vertical can make use of any ontology that's already been built for that area, chop it and slice it in its own particular way to fit its own business world, and then just map its data sources to those concepts, those entities that it created. This also gives a lot of power to the data consumer, in that they can leverage the hierarchy and the relationships that you create in your data. Everything makes sense all of a sudden. For example, if you have, like, different sorts of clients or different sorts of marketing campaigns, you can classify them maybe by their source, maybe marketing campaigns from Google or from Facebook or YouTube or Twitter. It doesn't matter.
They're all marketing campaigns. When a human being looks at a marketing campaign, he expects to see certain base-type properties that are related to the idea of a campaign. And maybe a Google campaign has some properties and columns that the Facebook campaign does not have, like average view time, or keywords, or demographics. It can be many different things. But then you can see it in a hierarchy worldview, as your business sees the world, correctly.
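Dan's campaign example can be pictured as a small concept hierarchy: shared properties live on the base concept, and source-specific properties on the children. The attributes here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Campaign:
    # Base properties every campaign shares, regardless of source.
    name: str
    budget: float
    clicks: int

@dataclass
class GoogleCampaign(Campaign):
    # Source-specific property only Google campaigns carry.
    keywords: list

@dataclass
class FacebookCampaign(Campaign):
    # Facebook exposes audience demographics instead of keywords.
    demographics: dict

# A consumer can treat all campaigns uniformly through the base concept...
campaigns = [
    GoogleCampaign("brand-search", 500.0, 1200, keywords=["data", "fabric"]),
    FacebookCampaign("retargeting", 300.0, 800, demographics={"age": "25-34"}),
]
print(sum(c.clicks for c in campaigns))  # 2000
# ...while still reaching source-specific properties where they exist.
```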
[00:33:19] Unknown:
I was gonna share a quick anecdote, actually. You know, when we were building this no code platform, right, we said, hey, you can point Nexla to a database, and it creates an integration. And I remember that, you know, we made the product choice that in that UI, you can choose an existing table, but you can't create one. And our thinking was that a DBA would never allow somebody who is not a DBA to go and do a create table. And as soon as we brought out that UI, the first feedback we got was, like, hey, I need a create table button. And what we also realized is that many tools have now become advanced enough for the non-engineer to work with data in that way. So in a modern data warehouse, you don't have to think about indexing and rebalancing and vacuuming. This is what the DBAs would have to think of, but because you don't have to do that, you can allow an analyst user to sometimes go create their own table. And on the flip side, I find that it's almost impossible for a data engineer, as, you know, we're hearing, to become an expert in the data itself, like, understand sales data or HR data or advertising or performance data at the level that those ops teams or those individual users do. So, you know, once the integration and the preparation and some of those sort of technical aspects are taken care of, then it's that user who can determine, you know, like, hey, what do I need to observe? You know, how do I find if my data is actually looking good? How do I wanna transform this field or enrich that? So I think a lot of tools and technologies have come together very nicely to make it possible for a lot more people in the team to work collaboratively.
[00:34:53] Unknown:
Have you ever had to develop ad hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps platform that streamlines data access and security. Satori's DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift, and SQL Server, and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access.
Go to dataengineeringpodcast.com/satori, that's s a t o r i, today and get a $5,000 credit for your next Satori subscription. Another element of the challenge that exists for data professionals, and data engineers in particular, is understanding what all of the options available to them are for solving their particular problem. How do those different systems and platforms and tools fit together to be able to form a sort of cohesive and well structured platform for the organization? And how do they handle the long tail of edge cases that the platform wasn't necessarily engineered to be able to support out of the box? And so I'm wondering if each of you can speak to some of the challenge that exists and some of the ways that data professionals should think about the selection of the tools and the design of their platform to be able to facilitate these operationally oriented roles, and how you're able to build a platform that is internally well integrated
[00:36:37] Unknown:
but is able to either be extended or gracefully degrade for this long tail of edge cases that nobody can possibly plan for? My recommendation would always be to, you know, keep it simple initially as you're building it out. For example, what is available in open source? Make sure you build your data platform by definitely looking at what is available in open source. Through the use of open source, you can get your first experience with it. The team can kinda start interacting with it. So you learn, and as you do that, your thinking also evolves. And then at some point, kind of making further investments and then also looking at commercial is, I think, a good strategy. So that's what I typically recommend. I would definitely look into the open source world and what's available there, because in data technology, there is a lot of it. Absolutely. I agree. It's always helpful to see the tools that you want. And in the open source world, it's
[00:37:31] Unknown:
very nice to have all these repositories where you can fetch and pull and play with different types of solutions. I would recommend starting with the architecture, checking your own architecture, to plan correctly: what are you trying to achieve? How are you trying to achieve it? What are the bottlenecks that you foresee when you're going to adopt this new technology? Also, take notice of your own infrastructure that you may or may not have in place yet: what connects to it, what are the limitations, what are the boundaries. These are all the things that you have to take into consideration before adopting a new product.
[00:38:09] Unknown:
So plan well. It's never bad to emphasize the planning step. So, Dan, I definitely acknowledge that aspect. Right? Our approach is to start with a capability-based reference architecture, agnostic of a specific product, platform, or solution. Have that capability architecture to really identify what are my endpoints, what are the capabilities exposed, how do I need to pull that data in, what are the different supported models. And then try a very loosely coupled or a decoupled architecture where you can swap in and swap out the capabilities which can be there, so existing investment is not lost.
You can build on top of it. You can remove those components when the new components come in. As long as you adhere to these open standards, you will be able to address the aspect of some of the platforms no longer being in existence. So that's the approach: start with the capability-based reference architecture, use open standards, use decoupling mechanisms, use an environment which will scale up and scale out. Because the biggest challenge with data is that when something starts working, it starts to explode. You have a lot of data coming in, so the ability to scale has to come in. And the last thing is, how do you actually keep everything in control? Because data in the wrong hands is really a problem. So you need the right security mechanisms to control the access, who accesses it, and how it actually gets used. If you put all of these things in play, this would be a reproducible way in which the problems can be resolved. I think, you know, I'd say that in the software world, we generally know that, you know, when things are easy to use,
[00:39:55] Unknown:
they almost can never solve the hardest of the hard problems, and then the tools that can solve the hard problems are not easy to use. Right? So data problems span the whole gamut, you know, very straightforward to extremely complicated. The approach that we took was that we have to somehow bring these paradigms together. So, you know, we give a no code solution that's easy to use and can solve, you know, 75% of the problems that people work with. A low code extensible thing that, you know, lets you handle the next sort of mid-complex things. But then for the engineers, we give an SDK, a full set of APIs, the command line interface to our product. And then, you know, our goal always is that a combination of these three, across different skill levels of people, will, you know, handle pretty much any problem that an enterprise can face. The other thing I would say is that, you know, contrary to, I would say, popular opinion, we think that a SaaS-only solution is not one that works in data. We work with some of the largest financial institutions, and it's impossible for them to put certain types of data in a SaaS environment on somebody else's server. So the approach we have taken there is a SaaS offering, which most of our customers use, with the option for people to deploy, you know, either just our backplane or maybe even our entire system on prem or in the cloud, and kind of run that as a federated system in a hybrid environment. So those are the two things that we have done to say, you know, how can we handle the variety of complex problems that people will be solving. In terms of the
[00:41:25] Unknown:
challenges that people are faced with, and the ways that you've each seen your products used by your customers and internally to help support your own businesses, what are some of the most interesting or innovative or unexpected ways that you've seen your products used and these problems being solved? And, Saket, why don't you start us off? I think that when you give people a lot of flexibility
[00:41:45] Unknown:
and, you know, things that you can put together in a composable way, they come up with many creative solutions. I'll give a couple of quick examples. We had a customer that had extremely large tables, so their dashboards were performing slowly. Right? So they wanted to create these aggregate intermediate tables and improve the performance. What they realized was that they could use Nexla to, you know, query that big table, transform it, aggregate it, and create a smaller sort of aggregate table. But then once they had created that one data flow, they were able to use our command line interface to iterate through that and create, you know, several hundred such configurations. And, you know, something that they were thinking was gonna be a 6 month project for them turned out to be something that they could do in a couple of weeks, and make many, many tables much more performant. Right? We have also seen people look at the Nexset sort of technology and say, hey, I have a data product, but I can modify it to create a new Nexset and then modify that to create another new Nexset. And they have done that to say, hey, I'm gonna take this Nexset, it is pointing to an image file, and then that image, I'm gonna push through an API call to an OCR system. It will get more data. The resulting Nexset is richer, and I can use that data. And this is where one of the freight brokers used us to gather data from images of, you know, driver licenses and insurance policies, and automate the onboarding of drivers, which otherwise somebody had to look at, you know, stuff and do manually. So many, many useful and interesting use cases that we find and are surprised by and excited about. And Maarten, how about yourself? Because we're open source and we're in data observability, right, one of the interesting things that happened recently was that one of the European countries, their health ministry, all of a sudden, we got an email with a request for a quote. Can we buy your software?
[00:43:31] Unknown:
And that was very interesting. We started learning about their use case. It was all about checking all of the COVID data before it went to all the officials, ultimately, for all decision making in that country. What they liked about it was that they didn't have to really set up a lot themselves; it looked at the data and started suggesting some of the potential issues that would be in there, or some checks and validations that they should be doing. That's one of those very feel-good moments. You're like, yes, this country is using it, and they're having benefits, and it's having impact. And it's something we can all relate to. So that was a very nice thing that happened a couple of weeks ago. Akshay, what are some of the interesting ways that you've seen your platform used? Sure. I'll take a very recent example
[00:44:14] Unknown:
because of the COVID challenges. Right? So we were helping one of the largest automotive manufacturers in the US. When COVID hit, everyone went into lockdown. The visibility from an inventory perspective, from your suppliers, from an end to end supply chain, was immediately disrupted. They did not have the suppliers onboarded giving visibility across what was their ability to fulfill a specific demand. So they had a challenge where they got back to saying, okay, I'm gonna start a war room, call every supplier, find out what the inventory was, what they could commit to, what their problems were.
So this was turning out to be a nightmare for them. They had more than 40 different planners trying to really identify what was going to be happening with their production line. So with our platform, we basically quickly put together a full form-based solution to collaborate between the different suppliers. It was built on top of this semantic model, which was trying to mimic the supply chain control tower. If there were any anomalies which they were identifying in terms of saying, okay, I cannot fulfill this demand, we were able to also take in information on risk related to some of the COVID challenges and find alternative tier 2 and tier 3 suppliers.
How would you actually address all of that aspect of a supply chain? Normally, if we had done this in a traditional way, this would have been like a 12 to 16 month exercise. We worked with this automotive manufacturer, and within the first 6 weeks, we were able to get the solution up and running. Instead of them having to start this war room with the suppliers, we were able to have one single visibility dashboard and get them up and running. They were able to plan for the next 3 months, which was not possible before. They were planning literally a day to 2 days in advance, which meant they could not understand what was gonna happen as a result of it. This basically addressed millions of dollars of revenue leakage as well as improving visibility. That's how we really then started to enhance the collaboration aspect and the security layer, which came up with the platform side of things. And, Dan, what are some of the interesting ways that you've seen your tools used? Well, when you give the users the ability to build the world from abstract space in any way that they see fit, you suddenly see
[00:46:38] Unknown:
unexpected things that the user might model. But one of the things that always strikes me is the feedback from the users on how quickly they can get up and running, since there are no ETL processes. You just connect your data and start modeling your view and the results that you want very easily. So, for example, you know, a report that usually took 3 or more months to produce suddenly gets produced in less than a week, and that's a huge win and validation for us when we get that feedback from the poor analyst that had to create a very, very big, lengthy SQL query and understand where the data is, and where to fetch what, and how to prepare and clean it correctly. He suddenly can do it in one line of SQL. And if he wants, he can see an explanation and see what is the translated query that gets sent to the underlying databases.
That's something that always surprises me, although it shouldn't: the feedback that we get from our users.
[00:47:39] Unknown:
In terms of each of your experiences of building these products and building these companies and helping your customers try to address some of the complexity and challenges that exist in the overall data ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've each learned? Dan, we'll start with you. It's something that I knew, but I had to relearn from the beginning
[00:48:02] Unknown:
is that the user won't follow the best practices. No matter how much you emphasize or recommend, the user will always try to find the easiest and laziest approach to model the data, and we try to facilitate and spoon-feed as much as we can. But when you're dealing with that high level of abstraction, you really should emphasize, and sometimes help users understand, what is the correct way, and why it's the correct way, to model your data differently than what they see for the specific use case.
[00:48:36] Unknown:
Akshay, what are some of the lessons that you've learned while building out your project?
[00:48:39] Unknown:
One of the things which we recently faced was: we built the semantic model, we have got all the data from different systems inside, and the customer is actually utilizing or using this platform as the single source of truth. The customer suddenly asks and says, hey, I have these 12 different ERPs across my different locations. Can I actually get your data and write back to my source systems to cleanse the data? Because I'm using your system as a single source of truth, would you be able to write this back to my system? So we basically looked at it and said, hey, the write back is going to be a challenge, because you're gonna have a whole bunch of complex logic which is already existing in these ERP systems. And how do I actually then go and write to each of these systems? There are different permissions. How do I take care of all of these things? It's not just permissions; there are compliance and regulatory aspects which come into the picture. So from there, we actually came up with this whole writeback module, which gives the user the ability of a webhook or an endpoint which they can consume in their applications.
So with the cleansed and the right data, they can make their own calls, pull from each of these systems, give them the control of how this is being done, and then get these systems back up and running with the right data. That is a real good problem which we've seen, quite unexpected,
[00:50:05] Unknown:
but definitely something which we helped solve. And we're seeing a lot of traction as a result of that as well. And, Saket, what are some of the interesting lessons that you've learned while building out your product? We started Nexla thinking that streaming would be the best way to work with data and could handle, you know, batch and, you know, other use cases as well. We eventually got pulled in both directions, towards batch processing as well as real time processing. Fortunately for us, our data processing engines are pluggable. And we saw some very interesting use cases of that. You know, people were like, hey, the Nexsets have an API to them. If I call that API and if the data is actually coming from a database, we'd like the data in real time, and, yes, that's how it works. You know, you end up automatically generating APIs to your database with, you know, the dynamically rewritten SQL queries that we do. So those became interesting use cases where streaming wasn't fast enough and real time is what was needed. And then on the other end, we saw that, hey, I'm migrating huge amounts of data from my on prem data lake to my cloud data lake, and it didn't really make sense to do that as a stream process, and we are enabling batch processing engines as well. So quite a few lessons. You know, I think the only thing that is a constant in the data world is that people keep coming up with new ideas, and they will throw new challenges.
And, Maarten, what are some of the interesting lessons that you've learned while building out your product?
[00:51:39] Unknown:
Slightly different. I came from more of a commercial, closed-source software world, and my cofounder, Tom, came from open source; he has spent practically his entire career building open source communities and products that are used by hundreds of thousands of users globally. For me, one of the biggest learnings was how fast you can learn from your users in open source. It's a whole different world, because you have direct contact and direct access to a growing community of users who are asking you questions practically all the time. They're testing out everything new, and they can validate it immediately. The speed you get from that was an incredible learning for me, and I'm extremely fortunate to have been able to experience it. That would be my biggest lesson learned over the last year and a half. As you each continue to
[00:52:29] Unknown:
iterate on your products, work with your customers, and explore the overall space of the data ecosystem, data processing, and data management, what are some of the upcoming trends and challenges that you're keeping a close eye on? Akshay, why don't we start with you?
[00:52:50] Unknown:
One of the common trends we are seeing is leveraging AI- and ML-based algorithms to make recommendations and surface business insights, and the need to have that engine plugged into your data fabric. That's one thing we are definitely building on top of: we've had a base engine, and now we're scaling it up for the volumes of data we're getting in. That trend runs across the ecosystem. The other thing we are seeing is conversational UI. People definitely use dashboards and drill-down mechanisms, but they also want access to the data through conversational interfaces. So that's another trend we're seeing and adding to our capabilities: giving business users the ability to converse with the system and get access to the data.
The other aspect is collaboration across enterprises, where two different networks have been set up and you're trying to exchange information between them. Traditionally, this would have been done through mechanisms like EDI and other data exchanges. Now it's about exchanging information as part of a large federated ecosystem, which we expose through a set of federated API access mechanisms, driven through self-serve access with some sort of workflow for approval and provisioning. Those are a few of the challenges and new capabilities people are exploring and asking for.
[00:54:23] Unknown:
And, Dan, what are some of the things that you're keeping an eye on as you continue to build out and iterate on your product, work with customers, and explore the data space?
[00:54:37] Unknown:
As has been said before, there's a big bump in the machine learning and deep learning space. Everyone dealing with data knows it's the hot topic right now; everyone wants it and wants to use it, but nobody quite knows how. We at Timbr already have graph algorithms for community detection, similarity, and link prediction to create relationships, all running automatically to enrich the data that an organization already has. We're planning to enhance that, make more algorithms available, and perhaps add scheduling. We're also looking at implementing a natural language interface, as we have more and more data consumers coming in who want to use the knowledge graph to get data out of the organization.
We know that SQL has a very impressive track record, maybe 50 years now and still going strong, and I don't see that changing anytime soon. But we still see data consumers for whom some sort of natural language interface would make it easier to query their data. And since we model an abstract view of the business world in an ontological manner, it's relatively easy for us to implement that capability, so we want to focus a bit on that as well.
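For a flavor of the graph enrichment Dan mentions, here is a small sketch using networkx on a made-up graph. The algorithms are standard library ones, not Timbr's engine, and the node names are hypothetical.

```python
# Toy sketch of graph enrichment: community detection plus link
# prediction to suggest new relationships. Uses networkx on a made-up
# graph; this is not Timbr's implementation.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("customer_a", "order_1"), ("customer_a", "order_2"),
    ("customer_b", "order_2"), ("customer_b", "order_3"),
    ("customer_c", "order_4"),
])

# Community detection: group entities that cluster together.
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(community)}")

# Link prediction: score a candidate relationship that doesn't exist yet.
for u, v, score in nx.jaccard_coefficient(G, [("customer_a", "customer_b")]):
    print(f"suggested link {u} -- {v}: jaccard = {score:.2f}")
```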
[00:55:58] Unknown:
Maarten, what are some of the trends that you're keeping an eye on as you continue to iterate on the technologies that you're working with?
[00:56:10] Unknown:
I think there are a couple of transformations going on in the broader space. Maybe not so much in pure infrastructure today; I see it more around discovery and cataloging, and there I see a big change happening. Cataloging used to be predominantly focused on harvesting metadata, on governance, and on sharing knowledge. But the requirements of the data and analytics engineer are growing in importance, and the existing players have, to a large extent, failed to address them. So you hear things in the market like "we need an operational catalog" or "we need an active catalog." This is all about making metadata a more active part of the everyday workflow. For example, if somebody searches for data, they want to know whether there have been any recent problems with that data. That's one major evolution: metadata is becoming more real time.
The second one is the advent of the observability category, and specifically data testing. Data testing is something I expect will become an integral part of the best practice of any data or analytics engineer: their toolkit, their way of working. It's much like the DevOps principles. We're close to a point-of-no-return moment in which every data engineer and every analytics engineer will test data as they're transforming or ingesting it. I think those are the two most important ones that I'm tracking very closely.
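As a minimal illustration of what such data tests look like, here is plain Python and SQL over a hypothetical orders table. Observability tools typically express checks like these declaratively and run them on a schedule or inside the pipeline, but the assertions are the same in spirit.

```python
# A minimal sketch of data testing: assertions that run against a table
# as part of ingestion or transformation. Plain sqlite3 over a
# hypothetical "orders" table, not any particular tool's check language.
import sqlite3

def test_orders_table(conn: sqlite3.Connection) -> None:
    # Volume check: the table must not be empty.
    (row_count,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
    assert row_count > 0, "orders is empty"

    # Completeness check: every order must have a customer.
    (missing,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE customer IS NULL"
    ).fetchone()
    assert missing == 0, f"{missing} orders have no customer"

    # Validity check: status must come from a known set of values.
    (invalid,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE status NOT IN ('open', 'shipped', 'closed')"
    ).fetchone()
    assert invalid == 0, f"{invalid} orders have an unknown status"
```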
[00:57:35] Unknown:
Saket, what are the trends that you're keeping an eye on? I think it's more and more apparent that data is indeed a team sport. So first of all, I keep an eye on community capabilities: how are people going to find a connector and then share it with the community, or find a publicly available dataset and make it available to everybody else, maybe inside the organization, maybe publicly outside it? The second thing I look at is the paradox of convergence and fragmentation.
Both are increasing at the same time. On the convergence side: finding the data, integrating it, preparing it, monitoring it, running it. These have been maybe 8 to 10 separate tools, but a convergence is happening and more is coming, and we're trying to play a role in that. At the same time, there's a fragmentation going on. You have your Snowflake data warehouse, but then you have Firebolt and Databricks for performance; you have your time series databases, your graph databases, your traditional transactional databases, and your data lake. All of these will coexist, because specialization will kick in for each of these areas for the specific application.
Both of those things will happen together, and I think standards will start to play a role. Of course, there are pretty much no standards right now when it comes to interoperability between tools or companies. The coexistence of this convergence and fragmentation will happen, or should happen, through standards. We have done our part by opening up the specs of the things that we do, but we'll have to work together as a community and as an industry to make it possible and easier for the end user. The final point I'll make is applying machine learning back to data management itself. That's the whole concept of the data fabric, which is why we're a cool vendor: take the passive metadata about the data itself, take the active metadata about how it runs and operates live, bring them together, make data management easier, and figure out for the end user the right way to run this data and these processes. Those are some of the things I'm keeping an eye on, but it's an exciting space, and it throws surprises all the time. So who knows what's coming?
We'll find that out in our next conversation.
[00:59:49] Unknown:
Alright. Well, thank you all for taking the time today to join me. For anybody who wants to get in touch with each of you, I'll have you add your preferred contact information to the show notes. And as a final question, I'm wondering if you can just give your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. And why don't we start with you, Akshay?
[01:00:09] Unknown:
The biggest gap in the tooling right now is the ability to plug in the different components that are needed. Data quality is still one of the biggest challenges, with all the disparate data, and then there's the whole aspect of metadata management. Those are the two most common challenges that still exist, even with a lot of tools available in the market. And, Saket, how about yourself?
[01:00:33] Unknown:
I think the biggest gap, at a high level, is at the people level. We're asking data engineers to give up control of things they used to do themselves and let other people do them, which means that from the tooling perspective we need to bring greater trust and greater flexibility, but also give more control back to the users. What that means in practice is that it almost becomes a must-have to be SOC 2 Type 2 compliant, and to have the ability to run data in silos in different regions and countries. So the gap in the tooling, I would say, is largely in making that possible from a people perspective: giving users the trust and the confidence. From a feature perspective, a lot is happening, and I think we'll take care of the feature-level things as an industry for sure.
[01:01:20] Unknown:
And, Dan, what do you see as being the biggest gap?
[01:01:25] Unknown:
The biggest gap, I think, is that it's all very new, and there isn't a set of well-defined principles for what exactly encompasses the role of a data engineer, or a data modeler, or a data analyst for that matter. Many people get tasks that are not directly related to what they do, and they're expected to know things they never saw in their employment contract, and that creates problems. Since the data space involves so many people who work with data, there should be better collaboration between the different types of experts dealing with it. I think that's a gap that will be closed over the next few years; all the platforms will have to somehow integrate those features into their tools.
[01:02:13] Unknown:
And, Maarten, what's your view of the biggest gap? I think people and process; I would definitely agree there. We have all these new specializations and new roles, so how do we work together in an optimized way? I think that's the core challenge. If I had to pick a technology gap, I would say quality and observability are being tackled; the next one, then, is policy. We're moving all of this data into the cloud from traditional systems, and there's a lot of policy that already applies to it. How do we scale that, rather than setting policies at the individual table level? I think that is still a big piece for the years to come. But that's what keeps this space so exciting. Alright. Well, thank you all for taking the time today to join me and for all the work that you've put into your respective platforms. It definitely shows the amount of dedication,
[01:03:04] Unknown:
both in terms of the quality of what you've built and the fact that you've all been recognized by Gartner. So thank you again for taking the time today, and I hope you enjoy the rest of your day. Thanks for having us. Thank you, Tobias. It was a pleasure. Thank you, Tobias. Really fun talking to you every time. And, yeah, hopefully we'll sync up again in 6 months or so and see how things have changed.
[01:03:22] Unknown:
Absolutely. I would love that as well.
[01:03:29] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
Introduction and Sponsor Messages
Meet the Guests
Guests' Journey into Data Management
Biggest Challenges in Data Management
Simplifying Data Complexity
Abstracting Data Integration
Evolving Roles in Data Management
Tool Selection and Platform Design
Innovative Uses of Data Platforms
Lessons Learned in Data Management
Upcoming Trends and Challenges
Biggest Gaps in Data Management Tools
Closing Remarks