Summary
The problems that are easiest to fix are the ones that you prevent from happening in the first place. Sifflet is a platform that brings your entire data stack into focus to improve the reliability of your data assets and empower collaboration across your teams. In this episode, CEO and founder Salma Bakouk shares her views on the causes and impacts of "data entropy" and how you can tame it before it leads to failures.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Salma Bakouk about achieving data reliability and reducing entropy within your data stack with Sifflet
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Sifflet is and the story behind it?
- What is the motivating goal for the company and product?
- What are the categories of errors that you consider to be preventable?
- How does the visibility provided by Sifflet contribute to those prevention efforts?
- What are the UI/UX patterns that you rely on to allow for meaningful exploration and analysis of dependency chains/impact assessments in the lineage graph?
- Can you describe how you’ve implemented Sifflet?
- How have the scope and focus of the product evolved from when you first launched?
- What is the workflow for someone getting Sifflet integrated into their data stack?
- What are some of the data modeling considerations that need to be considered when pushing metadata to Sifflet?
- What are the most interesting, innovative, or unexpected ways that you have seen Sifflet used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Sifflet?
- When is Sifflet the wrong choice?
- What do you have planned for the future of Sifflet?
Contact Info
- @SalmaBakouk on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Sifflet
- Data Observability
- DataDog
- NewRelic
- Splunk
- Modern Data Stack
- GoCardless
- Airbyte
- Fivetran
- ORM == Object Relational Mapping
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Ascend: ![Ascend](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/E4SVhJLU.png) Ascend.io, the Data Automation Cloud, provides the most advanced automation for data and analytics engineering workloads. Ascend.io unifies the core capabilities of data engineering—data ingestion, transformation, delivery, orchestration, and observability—into a single platform so that data teams deliver 10x faster. With 95% of data teams already at or over capacity, engineering productivity is a top priority for enterprises. Ascend’s Flex-code user interface empowers any member of the data team—from data engineers to data scientists to data analysts—to quickly and easily build and deliver on the data and analytics workloads they need. And with Ascend’s DataAware™ intelligence, data teams no longer spend hours carefully orchestrating brittle data workloads and instead rely on advanced automation to optimize the entire data lifecycle. Ascend.io runs natively on data lakes and warehouses and in AWS, Google Cloud and Microsoft Azure. Go to [dataengineeringpodcast.com/ascend](https://www.dataengineeringpodcast.com/ascend) to find out more.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. DataFold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Your host is Tobias Macey. And today, I'm interviewing Salma Bakouk about achieving data reliability and reducing entropy in your data stack with Sifflet. So, Salma, can you start by introducing yourself?
[00:01:53] Unknown:
Sure. Hi, Tobias. I'm very happy to be here. So my name is Salma. I'm CEO and one of the cofounders here at Sifflet.
[00:01:59] Unknown:
And do you remember how you first got started working in data?
[00:02:02] Unknown:
I do. It wasn't the most traditional way. That's for sure. So I have an applied math and statistics background. I went to an engineering school here in Paris. And then after graduation, I moved to Asia. I joined one of the largest US banks as an analyst in equity sales and trading. So as you can imagine, the trading floor is a very dynamic, very data intensive environment, heavily relying on large amounts of data to make decisions in real time, and where a lot of things can go wrong and a lot of things can break. So naturally, having, you know, the background that I had and that kind of curiosity, I had to get my hands dirty very quickly and try to understand where we got our data from, what the ingestion patterns were, whether it was reliable, up to date, etcetera, because it was just vital to my day to day. And at the time, we had a central team of data engineers that was responsible for creating and maintaining the data infrastructure.
But a lot of the data consumers, myself included, were constantly struggling to get access to reliable data, and this led me eventually to lead some key initiatives at the firm that were aimed at putting the data in the hands of the data consumers, making both the data and the infrastructure behind it more accessible to more people, and, in general, creating more autonomy around working with data so that collectively as a firm, we could become more data driven. So that's how I became a data practitioner.
[00:03:26] Unknown:
That brings us to what you're building at Sifflet. So I'm wondering if you can just share a bit more about the product focus and the goals of what you're building there and why you decided that this is where you wanted to spend your time and energy. So back when I was working in trading,
[00:03:40] Unknown:
we had a lot of data quality issues. We used to rely on data to make decisions in real time, and we had a very complex and big data infrastructure. And despite all the investments that were made by the bank into going in the direction of becoming data driven, it was just inevitable for us to get data incidents, which, you know, more often than not turned into big business catastrophes. And so initially, I was just curious, looking for solutions we could adopt for my team, trying to get better at how we, you know, look at data quality, how we manage it, how we can improve overall trust in data assets. And especially as, you know, we moved into a fully decentralized kind of environment where we made data consumers more responsible for the data assets that they were using, easing the access to the data is great, but without proper governance and proper data quality management, things can quickly get out of hand. And that's the situation I was in, being responsible for a team that was generating revenue for the company. And so, initially, I was looking for a solution.
I did a lot of research. Nothing really was, you know, available or mature at the time. This was 2018, 2019. And then I reconnected with two of my very good friends that I've known for more than a decade, we went to school together, who were working in the data and analytics space at some of, you know, the big tech companies. And I was hoping initially that they would have answers for the problems that I was having. Turned out that it was a problem that any company striving to become data driven was facing, and I got even more passionate about the topic. And then we ended up, you know, leaving our full time jobs and starting the company. So Sifflet is a data quality management, or data reliability, solution.
We, I guess, today fall into the category of what we collectively call data observability. It's a new and emerging and fast growing category. I'll spare you the definition because I'm sure you had quite a few guests that, you know, tackled the topic in the past. But, essentially, our approach with Sifflet is to come in and sit on top of an existing data stack. We have connectors that go from ingestion all the way to consumption. We are agnostic as far as the consumption use case goes, and we ensure that the data is reliable and trustworthy at any stage of the pipeline. That's the story behind Sifflet. As you mentioned, you're working in the data observability
[00:05:58] Unknown:
ecosystem, which is, as you said, emerging, and so everybody has a slightly different definition of what it is that they're actually addressing. And I'm wondering if you can share your working definition of what data observability means in the context of what you're trying to achieve with Sifflet?
[00:06:14] Unknown:
So at Sifflet, we have this concept that we call data entropy, which kind of symbolizes all the chaos and the disorder that a lot of data practitioners have to deal with, especially with the growing complexity of the data platforms and the growing expectations from the business as far as data and data infrastructure go. And so Sifflet is on a mission to reduce data entropy and to help companies embrace the inevitable complexity that they have to deal with as far as data and data platforms go so that they can make smarter, faster, better decisions.
Now, Sifflet's approach specifically goes back to me and my cofounders and our backgrounds. So when we initially sat down and we started thinking about the product, obviously, our inspiration comes a lot from software and from, you know, companies like Datadog, New Relic, Splunk, etcetera, and what they were able to create over the past decade or so. But data is a much more complex environment, and it comes with its own, you know, complexity and use cases and different personas. And so we knew that, you know, we needed something that could speak to data and speak to data practitioners.
And so the funny thing about us as a founding team, which actually ended up being our superpower, is that I come more from, like, a business leader, data consumer standpoint, whereas my cofounders come from pure software and data engineering backgrounds. So when we were brainstorming about the product, this was the beginning of 2021, and at the time, you know, data observability was just taking off. We knew from the get go that we wanted a solution that could solve problems for both data engineers and data consumers. That's what led us to build Sifflet, which is, you know, what we call a full data stack observability platform. So we didn't wanna dissociate between data engineers or data producers and data consumers.
We didn't want to exclude one, you know, type of user from the equation, because everybody cares about data reliability and everybody wants to have good quality data. And the only way to do that, as we say, embrace the complexity in the data infrastructure, is to have that, you know, observability or monitoring layer be as present and as deep as the data is in the ecosystem. And so practically speaking, our approach is to come and sit on top of the data stack. We have connectors from ingestion through the transformation, ETL, ELT, the modeling layer, data warehouse, all the way to consumption, whether that's reverse ETL, BI analytics, machine learning use cases, etcetera.
And we collect signals across the whole stack, and we use those signals to provide a set of metrics that give an idea of the health status of data across stages. So, you know, in simple words, we have an anomaly detection framework that looks for anomalies in the data, the metadata, as well as the infrastructure. And this is also one of the key differentiators, specifically because we wanna be in a position where we prevent data quality issues and not just solve them after the fact. That's the idea. And we complete this, you know, overseeing layer or overarching anomaly detection framework by giving context to the anomalies, because saying that something is breaking or noticing the data incident is great, but the most difficult and the most complicated and time consuming part is the troubleshooting and trying to figure out how serious the data incident can be and how it's likely to propagate downstream.
And so we have a whole stack of metadata that we compute that helps tackle those challenges and gives context for root cause analysis, incident management, correlation between data assets, and stuff like that. So that's the approach.
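To make the metadata-level anomaly detection described above concrete, here is a minimal sketch (not Sifflet's actual implementation; the threshold and row counts are illustrative) that flags a table whose daily row count drifts far from its recent history:

```python
from statistics import mean, stdev

def is_anomalous(history: list[int], observed: int, threshold: float = 3.0) -> bool:
    """Flag `observed` if it deviates from the history by more than
    `threshold` standard deviations (a basic z-score test)."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > threshold

# Hypothetical daily row counts for a table over the past week.
row_counts = [10_120, 10_340, 9_980, 10_210, 10_150, 10_290, 10_060]

print(is_anomalous(row_counts, 10_200))  # normal day -> False
print(is_anomalous(row_counts, 4_300))   # sudden drop -> True
```

Real observability tools layer seasonality handling, multiple metrics (freshness, schema, volume), and alert routing on top, but the core idea is comparing fresh metadata against a learned baseline.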
[00:09:47] Unknown:
One of the concepts that you mentioned in your definition of what observability means in the context of what you're achieving at Sifflet is this concept of data entropy. And I'm wondering if you can talk to some of the ways that that manifests, some of the contributing factors to how it grows, and some of the ways that data teams can help to mitigate the proliferation of that entropy?
[00:10:12] Unknown:
So entropy in data can manifest itself in a few different ways. You know, entropy can mean for an organization that they can't find their data assets, that they have, you know, a discoverability issue or a traceability issue where data consumers cannot easily find the data assets they need to use, etcetera. It can mean that trust in data is completely lost, that there are a lot of, you know, processes that rely on data but that are completely broken because the data quality is so bad. It can mean that, you know, data engineers spend half of their time troubleshooting and stuff like that. And more importantly, it means that a company is at a state where they wanna use the data, they keep making investments in data platforms, in infrastructure, tooling, and people, but yet struggle to see the results and struggle to become data driven. And so a lot of elements in data entropy are a mix of, you know, technology but also people and culture. And so the reality is, in the current environment where every company has very low tolerance for data incidents, where every company is downsizing one way or another, where companies are faced with a lot of difficult decisions they need to make, it becomes extremely crucial to remove everything that's not backed by data, and by reliable data, from the equation when making decisions.
And so the expectations, in my opinion, around data and data platforms and data teams are now higher than ever. And so entropy, or data entropy, and that uncertainty and that disorder that is, you know, constantly growing within the data team needs to be reduced in order to, you know, optimize the full potential of what data can do and nurture and foster a data driven culture. Now, is entropy, you know, something that is inevitable in a data platform? I think so, a hundred percent, because in a way, you know, that also means that the company has made commitments and investments towards being data driven, has invested in a variety of tools. You know, in the current ecosystem that is what we call the modern data stack, I'm using air quotes, everyone wants the best of breed tool for transformation. They want the best data warehouse. They wanna have data lakes. They want to have, you know, a BI tool or two. They wanna do reverse ETL. So the whole evolution that took place in the data engineering space and the data engineering infrastructure and tooling space allowed for the complexity and the entropy to increase. It's positive in a way. It's giving more flexibility to companies and more flexibility to data practitioners to do more with data. But in order to do more with data, you need to control or try to reduce the entropy that surrounds the whole, you know, workflows around data. And the way to do that, first of all, starts, you know, with people and making sure that everybody's aligned around objectives and stuff like that, but also adopting best practices and tooling that's becoming more and more available to help navigate through the mess. And data observability is a very good example of, you know, technologies that help reduce data entropy.
[00:13:11] Unknown:
As far as the prevention of errors and being able to proactively identify issues that haven't happened yet, but that are likely to happen either because of changes in the data or because of stresses on the infrastructure and the processing systems or because of code changes that are being introduced. I'm wondering if you can just talk to the types of errors and issues that you see as being preventable and how you're working at Souffle to be able to bring visibility to those potential impacts.
[00:13:43] Unknown:
In my opinion, I don't think it's so much about errors that are preventable, but rather about the stage where you want to catch the errors and prevent them from propagating further downstream and causing, you know, bad outcome for the business and the data team. Because depending on the stage where you are able to catch these anomalies, the outcome can be materially different. Let me explain what I mean here. So in an ideal world, obviously, you want to make everything preventable. You want to have, you know, a complete 360 view over all of your data assets. You wanna know who's using what, you know, who changed what, who's touching what, etcetera.
But the reality is, in most current organizations, or even the most modern data platforms, that actually becomes extremely complex, and it becomes harder and harder to get ahead of data incidents. I guess the most basic way to go about it is to implement manual checks to get ahead of data incidents. So, you know, you could start by implementing testing at the orchestration layer. You wanna check the ingestion patterns. You wanna look at schemas. You wanna look at the data itself, etcetera. Obviously, the earlier in the data value chain, the better, because that gives you the ability to catch the problem at the source and avoid further propagation.
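The manual checks mentioned here, testing at the orchestration layer, checking ingestion patterns, schemas, and the data itself, can start as simple assertions run before downstream jobs fire. A rough sketch, where the column names (`user_id`, `email`) are hypothetical:

```python
def check_batch(rows: list[dict], expected_columns: set[str], min_rows: int = 1) -> list[str]:
    """Run basic quality checks on an ingested batch; return a list of failures."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        # Schema check: are all expected columns present?
        missing = expected_columns - row.keys()
        if missing:
            failures.append(f"row {i} is missing columns: {sorted(missing)}")
        # Data check: a key field must never be null.
        if row.get("user_id") is None:
            failures.append(f"row {i} has a null user_id")
    return failures

batch = [
    {"user_id": 1, "email": "a@example.com"},
    {"email": "b@example.com"},  # missing user_id -> two failures
]
print(check_batch(batch, expected_columns={"user_id", "email"}))
```

An orchestrator task would typically fail the pipeline (or route an alert) when the returned list is non-empty, which is exactly the "catch it at the source" behavior described above.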
But that is not always straightforward, and more importantly, it's actually not enough, because as we know, especially in a world where data consumers want to have more and more control over their data assets, catching problems at the source is only half of the job. You still have to follow the whole workflow and the whole flow of the data and how it's likely, you know, to propagate downstream. So the answer to the question really depends on how you wanna go about it, what's available in terms of resources, and the complexity of the data platform. In an ideal world, you wanna have checks everywhere. You wanna have, you know, a full stack observability framework. You want to be able to see your lineage everywhere and stuff like that, but it doesn't always make sense for all organizations. And more importantly, not every organization has the resources to implement something like this. Another interesting element to this question is
[00:15:47] Unknown:
understanding who's responsible for identifying when a problem has occurred or when a problem might occur.
[00:15:53] Unknown:
Yeah. This is the million-dollar question. Everybody's asking, you know, who should be responsible for data quality? And I think there are a few approaches that are emerging. And, again, although there might not be a right or wrong approach, I think there are certain things that work for a certain type of organization. I very recently had the chance to sit down with Andrew at GoCardless, who coined the data contract concept and successfully implemented it at GoCardless. And it was such an interesting discussion because, for example, at GoCardless, they, you know, adopted the concept of data contracts, which essentially is like an API for data, if you wanna oversimplify it, but it's codifying, you know, certain expectations around the data to ensure the quality of the data asset at kind of the ETL part or the ingestion layer. So that's an example of, you know, data quality checks implemented early on in the process. Now, we also see other types of examples where people are going fully decentralized and fully, you know, data mesh and whatnot, and where keeping track of data quality becomes not only harder, but becomes almost, you know, the responsibility of the consumer in a way, because, yes, you are getting a certain level of, you know, SLAs and SLOs around the data from the data producer, but you're still, in a way, somewhat responsible for what you do with those data assets and how you are likely to, you know, impact the quality of the data assets and stuff like that. So there's really no right or wrong approach here. I think, you know, depending on the resources and depending on how the team is set up, whether it's, you know, fully decentralized or whether there's a higher ratio of data engineers versus data consumers, stuff like that, I think different approaches work for different teams. But, unfortunately, I don't have a perfect answer for you here. I've seen all sorts of scenarios.
I think the best way, or what works the best, is when data engineers, data consumers, and data producers are all involved around data quality and data observability, because that helps, you know, alleviate some of the pressures around expectations, but there's really no right or wrong answer here.
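To make the data contract idea discussed above concrete, "an API for data" that codifies the producer's expectations, here is a hand-rolled sketch. The field names and rules are hypothetical, not GoCardless's actual contracts, and real implementations typically use schema registries or typed serialization formats rather than plain dicts:

```python
# A contract codifies the producer's promises about a dataset:
# required fields, their types, and simple constraints.
CONTRACT = {
    "order_id": {"type": int, "required": True},
    "amount":   {"type": float, "required": True, "min": 0.0},
    "currency": {"type": str, "required": True, "allowed": {"USD", "EUR", "GBP"}},
}

def validate(record: dict, contract: dict = CONTRACT) -> list[str]:
    """Check a record against the contract; return human-readable violations."""
    violations = []
    for field, rules in contract.items():
        if field not in record:
            if rules.get("required"):
                violations.append(f"missing required field '{field}'")
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            violations.append(f"'{field}' should be {rules['type'].__name__}")
            continue
        if "min" in rules and value < rules["min"]:
            violations.append(f"'{field}' below minimum {rules['min']}")
        if "allowed" in rules and value not in rules["allowed"]:
            violations.append(f"'{field}' not in {sorted(rules['allowed'])}")
    return violations

print(validate({"order_id": 42, "amount": 9.99, "currency": "USD"}))  # []
print(validate({"order_id": 42, "amount": -1.0, "currency": "JPY"}))  # two violations
```

The point of the contract is that it runs at the producer's boundary, so a breaking change is rejected before it ever reaches the consumers downstream.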
[00:18:05] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and to build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up for free today at dataengineeringpodcast.com/rudder. As far as the end user experience of being able to view and understand the end to end state of the data platform, that can be a very overwhelming prospect, particularly if you aren't intimately familiar with every detail of what's being done by whom and why.
And I'm curious. What are some of the user interface and user experience paradigms and patterns that you have tested and that you're leaning on to be able to make this a tractable and an approachable system for being able to view and understand what the important elements are to be able to pick out and being able to dig through and debug or, you know, raise alerts and things like that?
[00:19:22] Unknown:
Yeah. That's a really good question. And it actually brings me to a very important pillar in our product, which is data lineage. So the short answer to your question is you need to have perfect, amazing, fully automated data lineage. The reality is that's much, much easier said than done, and it's a concept that is often misunderstood. But it is essential in a data quality framework and a data observability framework. Why? Because not only do you want to keep full visibility over how the data assets interact with each other, but also how the data is propagating within the ecosystem. Especially if you wanna embrace, you know, either, like, a fully decentralized model, or you wanna give more power to the business users and put data more into the hands of data consumers, which I think is something, you know, a lot of companies are striving to do today, it becomes extremely important to know or to follow the flow of that data within the organization and have a whole, like, traceability of the enterprise data pipeline.
At a minimum, a lineage framework should be able to map out the dependencies between the data assets to kind of, you know, get ahead of how things are likely to propagate and keep track of, you know, the dependencies between data assets. But more importantly, and this is the approach that we took with Sifflet, lineage should be more than just dependencies. It should also ideally include, you know, all the events, the jobs, how data is transitioning from one point to another. And back to the point I made earlier about, you know, creating more of, like, a preventative approach around data observability and data quality monitoring rather than a reactive one, this is a part where lineage becomes very crucial. Because if an anomaly is detected, you know, users wanna be able to rely on lineage to scope the issue, see how, you know, it's likely to impact assets downstream, like tables, dashboards, etcetera. But more importantly, you want to also add that infrastructure element to know if the issue could have been prevented, or could be prevented, or, you know, could be really solved at the source. Back to what you mentioned earlier about, you know, something as simple as a data engineer changing code, or a software engineer that's not even on the data team changing something that's likely to cause a lot of headaches for the data team. That's something that should be captured by lineage and should give that visibility to both the data producers but also the data consumers so that they can take appropriate action.
[00:21:52] Unknown:
And to your point of a change being introduced by a software engineer and that impacting the downstream lineage, that also brings in the question of where does that lineage need to start, where a lot of lineage solutions will focus on lineage maybe from the Airbyte or the Fivetran layer into the data warehouse and maybe even into the reverse ETL destinations. But a lot of times, they will elide that initial scope of in the application where maybe the schema is being mutated by the object relational mapper and their migration framework. And I'm wondering what are some of the ways that you are potentially looking into bringing the lineage view earlier in the chain and into that software engineer, application engineer tool chain so that you can get an early warning system of when those schema changes are going to be coming down the pipeline?
[00:22:42] Unknown:
I love this question. And, actually, we have a concept at Sifflet that we call static lineage and dynamic lineage. So what you described initially, that lineage from, you know, an Airbyte pipeline all the way to the consumption layer, that is computed by looking at metadata, looking at the logs, reverse engineering SQL, and stuff like that. That's what you call static lineage. But then you start to go into the application layer and, you know, you collect signals from the infrastructure itself. Because if you zoom out, a data asset is the result of a job that runs on an application. Right? And so if you just look at the dependencies between the data assets, you're still in that kind of static, reactive lineage approach.
However, if you're able to go and collect signals from the infrastructure layer of the data asset, then you can take more of a preventative approach, and your lineage becomes not just something that is horizontal, but something that also goes vertically into the application layer and feeds into making sure that the whole workflow is more preventative than reactive. I could go off about this topic for hours. Lineage is really the backbone of our product.
[00:23:50] Unknown:
Yeah. And I guess digging a little bit more into that lineage question, there's also the level of detail in that lineage graph, where you can have table level lineage or column level lineage. And I'm wondering what are some of the attributes and pieces of metadata that you find to be crucial for providing useful context in that lineage view, so that it's not just a bare bones factual "this column came from this column in this table, and it was executed by this transformation," but can also provide maybe some of the reasons why it was transformed in this way, or some of the semantic attributes and how they might change and mutate as they propagate through that lineage graph.
[00:24:33] Unknown:
Today, the way we do lineage, as I said, we collect information about the data, the metadata, the logs. We reverse engineer the code, etcetera, but we also go into the application layer to make sure that we collect useful information from there. And so it's not only about dependencies, as you said, like table to table or column to column, but also views, models, DAGs, dashboards, etcetera. And, actually, more recently, we launched what we call metrics observability by pushing our lineage all the way to the semantic layer. So we went from having a static lineage to having a lineage that goes to the semantic layer and to the application layer. It all falls under what we call the full data stack observability framework. And in an ideal scenario, as I said, you want your lineage graph to go as deep as possible into each stage of the data pipeline.
So you wanna go very deep into the orchestration layer, you wanna go very deep into the modeling layer, and you wanna have dynamic updates of that lineage view, so that if something changes, the changes are captured and logged in the lineage graph as well.
[00:25:36] Unknown:
Another interesting element of what you were pointing out is that infrastructure layer question, where a lot of the data observability systems will focus purely on that layer of data movement and data manipulation, but there is definitely a whole host of errors that can result from resource constraints at the hardware level or from a lack of capacity, either storage or compute. And I'm curious how you are integrating some of those infrastructure details into the rest of the context that you're providing in Sifflet, how you're able to correlate the resource constraints at the infrastructure layer with some of the failure modes, and how you're thinking about surfacing that information in a way that is understandable to people who aren't necessarily cloud or infrastructure engineers?
[00:26:28] Unknown:
Yeah. That's a really good question. So, obviously, we're not gonna go and reinvent the wheel. There are already a lot of very mature tools in the space that do software observability in a really great way. But the idea is to take the input from them and make it make sense for data practitioners. Right? Because if you present a log report, or if you look at the output that you would get from something like Datadog, that might not necessarily make sense for an analytics engineer at an ecommerce company. So the way we do that with Sifflet is we have connectors with those tools, and we pull information from them that provides observability over the infrastructure. But we take that information, and we use our own machine learning algorithms to predict what could lead to an anomaly at the data level. So that's how we've been going about it, and it comes back to that preventative approach again. But more importantly, it bridges the gap between software and data, and I think that's also a big gap in the data engineering space broadly speaking.
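As a purely illustrative sketch of turning infrastructure signals into a data-level warning — not Sifflet's actual algorithm, which the conversation describes only as machine learning based — even a simple z-score check over a pipeline's recent job runtimes can flag a run that is likely to produce late or broken data downstream:

```python
from statistics import mean, stdev

def infra_warning(history, latest, threshold=3.0):
    """Flag a metric reading (e.g. job runtime in seconds) whose
    z-score against recent history exceeds `threshold`.

    A crude stand-in for the predictive models described above.
    """
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # no variance: any change is suspicious
    return abs(latest - mu) / sigma > threshold

runtimes = [61, 59, 63, 60, 62, 58, 60]  # recent pipeline runtimes
print(infra_warning(runtimes, 240))  # a 4-minute run stands out
print(infra_warning(runtimes, 61))   # a normal run does not
```

The point of the approach described in the interview is that this kind of signal, surfaced in terms a data practitioner understands ("your orders table may be stale"), is more actionable than the raw monitoring output.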
[00:27:31] Unknown:
And so digging into the Sifflet product a bit more, I'm curious if you can talk to some of the elements of product design that you focused on, who the target end users are, and some of the ways that that's influenced your overall approach to prioritizing integrations and the underlying architecture, as well as how you approach the product delivery aspect, whether it's SaaS or on premise or hybrid cloud.
[00:28:00] Unknown:
We built this in from the very early days of the product. As I told you in the origin story, we knew we wanted to be full stack. We knew we wanted to have a tool that could help both data engineers and data consumers reduce data entropy, better navigate data within their system, and improve the overall data experience. I think that's something that's not often talked about. Now back to product design specifically, with that in mind, and with the intention of wanting to avoid our users falling into the trap of alert fatigue, or having alerts that are not very actionable.
We knew from the get go that we needed to have a fully extensive lineage that can help users navigate through the alerts they get and actually take action on those alerts. Because the most difficult part is not knowing when something breaks. The most difficult part is knowing what to do after something breaks. And so with that idea in mind, we continue to invest a ton in our lineage capabilities, and we continue to invest heavily in our suite of integrations. And back to your question about what type of people we speak to, what organizations, and what kind of customers we work with: Sifflet fits any data team of a certain size. It addresses data engineers, data producers, and data consumers alike.
With each one of them, we're providing insights that make sense to them. What makes sense to a data producer may not necessarily make sense to somebody who sits on the business side, or to a machine learning engineer. And that's the whole beauty of Sifflet, because not only do we alert when something is wrong, but we also present it to you in a way that can help you take action on it. And that's how a lot of our product design and product roadmap has been shaped.
[00:29:45] Unknown:
In that journey of going from the initial idea to where you are now, I'm curious what are some of the assumptions or predictions about the specific pain points or the ways that the product would be used or the capabilities that it would offer, how those have been either validated or challenged along the journey?
[00:30:06] Unknown:
This is a really good question, and it's particularly relevant in the case of such a new and emerging category. Because if you're doing a good job, you wanna be solving problems that data practitioners are suffering from today, sure, but you also want to solve problems that people might not have even thought about yet. And the only way to do that is to be, and, again, I'm gonna say something that makes me cringe, customer obsessed. Customer obsession is not just something you put on a deck. It's actually going and spending time with customers, seeing how they interact with the tool, seeing how they interact with each other, asking them questions about their pain points, how they use data, who uses it, what the expectations from management are around it, how much they spend on data infrastructure, etcetera. Really trying to come up with a story in your head about the day to day of that data practitioner.
And I think that's how you become so, quote unquote, obsessed with them that you start to think like them, you start to feel empathy for them, and you start to incorporate that into your product design and your product roadmap. And I think that's how you get to a point where you can actually innovate and not just build something that everybody else is building. That's also one of the biggest challenges when starting a company. Back to your question specifically about the early feedback and how we were able to incorporate it, we always think about our product roadmap as a forever evolving thing, a living and breathing document.
And we constantly have sessions with our partners and with our customers and prospects. We're not just looking for compliments about our product and our product roadmap; more importantly, we're asking: what is it that you think we're missing? What are the challenges that you think you will be facing a year from now? Where do you think we're not doing such a great job? And when you use that and you're very critical about how you incorporate it, I think that's where you get to a point where not only do you validate whether what you initially thought was right, but more importantly, you start to bring something new.
[00:32:21] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer. In terms of that conversation around how the end users incorporate Sifflet into their day to day work, and particularly around that communication and collaboration element.
I'm wondering if you can talk to the overall workflow and how it fits into that day to day effort of data engineers, data practitioners, and the consumers of those data products.
[00:33:50] Unknown:
So we've actually seen the tool being used in a lot of different ways. We have customers that only log into the platform when something is wrong and they want to do causal analysis, incident management, troubleshooting, and so on. But we also provide 360-view dashboards on the health status of the data assets, which are often used as general updates on how the platform is behaving. We have a customer in the UK, and you know how in the UK they're quite obsessed with their tea in the morning. Right? We have a customer in the UK who told us once that he wakes up in the morning and thinks, okay, let me check Sifflet before I make myself a cup of tea.
And that's an interesting take on the product, because you use it as a comfort, as something that lets you actually go about your day without so much anxiety about the entropy that's going on in your data platform.
[00:34:48] Unknown:
Another interesting aspect of building a product, particularly in the space that you're focused on, is that there are a lot of potential tangents that you could go off on, and also potential areas of overlap with other products. And I'm curious how you think about which are the core strengths that you really want to build on, and what are the elements of data observability, metadata management, and data governance that you consciously choose to leave to some of those existing products in the space without necessarily trying to compete with them?
[00:35:23] Unknown:
It's a very interesting question. I'll give you the answer from a pure Sifflet perspective, but also from my view on how the space and the category of data observability are likely to evolve over time. Our approach with Sifflet is relatively straightforward. We wanna reduce data entropy, and we do that by giving full end to end visibility over data assets, their dependencies, and their health status at the data, metadata, and infrastructure level. That's the approach we've taken. Now back to the category of data observability more specifically, I think as the category continues to gain maturity and legitimacy, and that's thanks to the collective efforts of all the amazing technologies and companies emerging in the space, I wouldn't be surprised if it starts to borrow from adjacent categories, simply because the idea is to provide more reliability and more trust around data in general within the organization.
And I think not only are the companies emerging in this space to do data observability still figuring it out, but so are the end users. I think even the end users are still realizing: okay, we need a data catalog, we need data lineage, we need to go data mesh, we need to do data contracts. There are a lot of buzzwords that have been thrown around, and there's still no consensus on what the go to approach is. And I don't think there will be. I think it will always be a question of what is more adaptable for my use case and what is more beneficial for my organization, and then making the choices based on that and ultimately going and investing in technology.
Buying software is the easiest thing you can do as a data leader. The hardest part is making sure that people will actually use that software and that you will get your return on investment. I know people like to talk about consolidation and bundling and unbundling of the data stack; I think you and I can both agree that that's a topic for another day. But I think, ultimately, it should be left to the data practitioners and the data leaders to decide what's more adequate for their use cases.
[00:37:37] Unknown:
Yeah. Particularly in the space of products that operate on the metadata of the underlying systems, I'm definitely keeping a watchful eye on how that evolves, because there are so many potential areas of overlap and kind of collaboration or competition. And I think what will be interesting is to see how that space evolves into a set of products that are composable and interoperable, more so than having to be an all or nothing solution.
[00:38:06] Unknown:
Yeah. I agree. I agree with what you say.
[00:38:09] Unknown:
In terms of your experience of building Sifflet and working with your customers and seeing how it's being applied, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:38:19] Unknown:
We have a customer in Europe that's a platform for creative workers, so, like, Instagram influencers and people like that. And they use Sifflet to monitor their data quality, lineage, and so on. In a recent catch up I had with them, they told me that they were using Sifflet to monitor creator churn. So they put in what we call business or custom rules, and they started to monitor, say, if the number of creators drops, then they know that somebody poached their creators, or the people working with their agency, or something like that. I thought that was quite interesting.
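A hedged sketch of what such a custom business rule might look like — the rule name, parameters, and thresholds here are hypothetical illustrations, not Sifflet's actual API:

```python
def creator_churn_rule(previous_count, current_count, max_drop_pct=5.0):
    """Alert when the active-creator count drops by more than
    `max_drop_pct` percent between two monitoring runs."""
    if previous_count == 0:
        return False  # no baseline yet, nothing to compare against
    drop_pct = (previous_count - current_count) / previous_count * 100
    return drop_pct > max_drop_pct

print(creator_churn_rule(1000, 990))  # 1% drop: within tolerance, no alert
print(creator_churn_rule(1000, 900))  # 10% drop: fire an alert
```

The interesting part of the anecdote is that the monitored quantity is a business metric (active creators), not a pipeline health metric — the same alerting machinery serves both.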
Another interesting anecdote: when we initially built Sifflet, we thought it was gonna be a tool for data engineers and data analysts. Right? With time, we started realizing that our customers were starting to open Sifflet up to business users and to some of the less technical data consumers. And especially recently, with the whole hype around the semantic layer and how it's taken off, we released a feature that we call metrics observability. Immediately after we announced that, a lot of our customers were calling us and saying, oh, can we have that for our marketing team or our CFO or our CEO? So that's something we weren't expecting. And it goes to show how much organizations and data teams are now willing to open up access to data, and to the data infrastructure more broadly, internally. It's very refreshing to see.
[00:40:11] Unknown:
In your own experience of helping to found and build and grow the company and the product and maintain its direction, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:40:23] Unknown:
We have challenges and problems every day. Otherwise, it wouldn't be fun. I underestimated, in the beginning, the importance of storytelling in hiring. We're still trying to learn that, because, like many startups, we initially went and hired a bunch of software engineers, and I have an amazing cofounder and CTO who knows how to speak to them. All good. But then we started hiring salespeople, and that was a whole different challenge and a whole different kind of vision you need to sell them. I guess that's something I wasn't necessarily prepared for.
[00:40:55] Unknown:
And for people who are running their own data platforms and data infrastructure and trying to understand how to make them more reliable, what are the cases where Sifflet is the wrong choice?
[00:41:09] Unknown:
Oh, I love this question. Let me preface this by saying that, in my opinion, from what I see from my vantage point, the number one reason why data quality initiatives fail is that there is no buy in from the business, and there is no alignment between engineering and the business when it comes to data quality. Right? To answer your question, I think Sifflet is definitely the wrong choice for any company that is not willing to go the extra mile and align data engineers, data practitioners, data analysts, data consumers broadly speaking, and the business as well in the whole process.
Because at the end of the day, Sifflet is a technology solution. Right? It's a software you buy and add to your existing data stack. Sifflet cannot solve your cultural problems, and it cannot solve your internal communication problems. And if you don't solve those first, then there is no data quality solution or framework that's going to improve the quality of your data. I think that's where Sifflet would definitely be the wrong choice for an organization like that.
[00:42:20] Unknown:
And as you continue to build and grow the product and scale the capabilities and team and add new features, what are some of the things you have planned for the near to medium term or any project or problem areas that you're excited to dig into?
[00:42:34] Unknown:
We have a lot of exciting things coming up, with a couple of big announcements coming up as well, so stay on the lookout for those. More on the product specifically: our DNA is, and will never change, observability and monitoring as a full stack solution. We sit on top of the existing data stack, and we monitor data through its full life cycle. In the coming months and coming product releases, we will go a bit deeper in each of the compartments that we cover. I mentioned the semantic layer; that's something that came out recently with our metrics observability framework. The infrastructure component is also something that we continue to invest in. Obviously, lineage is a big topic for us. Today, we have lineage from ingestion all the way to consumption, plus some of the aspects around infrastructure, so we wanna continue to build on that. And generally speaking, we'll just continue to grow with the category and with this thriving ecosystem that is data infrastructure. It's extremely exciting, and I feel extremely fortunate to be building a company in this space.
[00:43:37] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:43:52] Unknown:
I think over the past few years, there have been a lot of new technologies and a lot of developments that took place to address problems for data engineers. So, things like integration, automating pipeline creation, making real time something that is actually feasible and doable, etcetera. And at the same time, there are a lot of tools and technologies that were aimed at making data more accessible to the consumer. So, both ends of the data warehouse. What in my opinion is still lagging is kind of that bridge between the two. And in a way, the modeling layer should play that role, in the sense that it should enable certain data consumers to get a bit more autonomous around creating data products, and at the same time, it should bring more data producers around on the business side, etcetera.
And granted, there are a lot of amazing technologies, and dbt is pioneering the whole thing. But, especially if you take the dbt example, you still need to know SQL to use dbt. Right? And if you're a vanilla business user that's just getting started, I feel like there's still not enough development or technology that has emerged in the space of data management that can make that transition from data production to data consumption easier and smoother for business users. We keep hearing about self serve analytics and giving autonomy to data consumers, but I still don't think that's a thing. I think it's an area that's still really underserved.
[00:45:29] Unknown:
Absolutely. Yeah. It's definitely exciting to see more investment in that area of making data consumable and understandable to people who don't live and breathe it every day. Alright. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing at Sifflet. It's definitely a very interesting product, and it's great to see more investment and activity in this space of data observability and helping to push the frontier forward. So I appreciate all the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Thank you, Tobias. I appreciate that. Thank you.
[00:46:06] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and just tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Salma Bakouk: Introduction and Background
Building Sifflet: Product Focus and Goals
Defining Data Observability and Data Entropy
Manifestations of Data Entropy and Mitigation Strategies
Preventing Data Errors and Anomalies
Responsibility for Data Quality
User Experience and Interface Design in Sifflet
Static and Dynamic Data Lineage
Integrating Infrastructure Details into Data Observability
Product Design and Target Users
Validating Product Assumptions
Incorporating Sifflet into Daily Workflows
Core Strengths and Adjacent Categories
Interesting Use Cases and Customer Stories
Lessons Learned in Building Sifflet
When Sifflet is the Wrong Choice
Future Plans and Upcoming Features
Biggest Gap in Data Management Tooling
Closing Remarks