Summary
Data observability is a term that has been co-opted by numerous vendors with varying ideas of what it should mean. At Acceldata, they view it as a holistic approach to understanding the computational and logical elements that power your analytical capabilities. In this episode Tristan Spaulding, head of product at Acceldata, explains the multi-dimensional nature of gaining visibility into your running data platform and how they have architected their platform to assist in that endeavor.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
- Your host is Tobias Macey and today I’m interviewing Tristan Spaulding about Acceldata, a platform offering multidimensional data observability for modern data infrastructure
Interview
- Introduction
- How did you get involved in the area of data?
- Can you describe what Acceldata is and the story behind it?
- What does it mean for a data observability platform to be "multidimensional"?
- How do the architectural characteristics of the "modern data stack" influence the requirements and implementation of data observability strategies?
- The data observability ecosystem has seen a lot of activity over the past ~2-3 years. What are the unique capabilities/use cases that Acceldata supports?
- Who are your target users and how does that focus influence the way that you have approached feature and design priorities?
- What are some of the ways that you are using the Acceldata platform to run Acceldata?
- Can you describe how the Acceldata platform is implemented?
- How have the design and goals of the system changed or evolved since you started working on it?
- How are you managing the definition, collection, and correlation of events across stages of the data lifecycle?
- What are some of the ways that performance data can feed back into the debugging and maintenance of an organization’s data ecosystem?
- What are the challenges that data platform owners face when trying to interpret the metrics and events that are available in a system like Acceldata?
- What are the most interesting, innovative, or unexpected ways that you have seen Acceldata used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Acceldata?
- When is Acceldata the wrong choice?
- What do you have planned for the future of Acceldata?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Tristan Spaulding about Acceldata, a platform offering multidimensional data observability for modern data infrastructure. So, Tristan, can you start by introducing yourself?
[00:02:07] Unknown:
Sure. Thanks for having me on. My name is Tristan Spaulding. I'm the head of product at Acceldata, and excited to talk through data management, data observability, and everything we've got going on today. And do you remember how you first got involved in the area of data? Well, for me, it was very lucky and strange and coincidental, because I was actually a philosophy major in college, and I happened to find probably one of the two companies that was looking for philosophy majors in the data world. This was in the day of the semantic web, where people were building ontologies and things like that. So, you know, let's go do some ontology with these big Department of Defense clients and pharma companies and things like this. And that's how I got in. That, for me, turned out to be a lot of SPARQL, back when R2RML was the standard, and basically just getting pretty deep into databases, things like that. So that was my entry point. I ended up working at Oracle, with sort of the Endeca group after that acquisition. That brought me into analytics, search, big data, Hadoop, and ultimately machine learning. And then the last step for me before Acceldata was about four and a half years at DataRobot, really building that up and helping to grow it, learning a lot about machine learning and ultimately building out kind of our MLOps offering. So for me, I've seen many sides of it across the enterprise data life cycle.
And, you know, I'm excited to jump into one of the messiest parts of it, I would say, which is trying to make sense of all the innovation and all the technologies across it. I think that's one of the big purposes of data observability as a category.
[00:03:37] Unknown:
And so that brings us to where you are now with Acceldata. I'm wondering if you can describe a bit about what it is that you're building there and some of the reasons
[00:03:46] Unknown:
that you think it's worth spending your time and energy on. Yeah, for sure. So Acceldata, or data observability really, I think, is this layer that's starting to emerge, where people are looking at data assets as assets and basically saying, we have observability for software, and some very sophisticated and powerful companies and tools have been built up around that. When you apply that to data, I think most people in the data world would not feel incredibly confident about any aspect of their data in many cases. And especially when you scale it up beyond the one or two highlight, sort of golden assets to the wide range of assets that people are using, I think that confidence gets a bit lower. And so with data observability, there's this race, really, to figure out how do we provide that foundation where we can actually trust our data assets as much as, for example, we trust the home page of our website. Just to pick an example, you would not deploy that without extensive monitoring and a change review cycle, a code review cycle, in many cases. So why would you do that for some of the data that's powering not just your reports, but the products and services that you have? So I think ultimately it's about bringing reliability to something that's increasingly important as an asset, which is basically your data pipelines.
For Acceldata, I think we have a quite unique angle on this. The mini background story on Acceldata is that the core founding team came out of the Hortonworks engineering and data engineering group. So this was a group that was deep in the middle of some of the most innovative technologies and big data projects of the era. I think it's a group that's deeply familiar with the operational challenges of actually running these projects, with the organizational goals and problems and structures that are around them, and with the actual data problems on top of it. So I think it's something where everyone sees this opportunity. It's like, wow, this data is important. We don't trust the data. How do we solve it? For me, when I was still fresh enough to be thinking about the options and the market and things like that, what stood out to me with Acceldata is really what we would call the multidimensional aspect. So going very deep, not only into the data itself, but actually into the systems powering that data. For example, if you have a data pipeline you've written in Python, and it's using Databricks extensively, and you get an alert, hey, this pipeline's slow. Well, one thing to do is look at the data and what's happening with that. The other thing is to actually dig in and try to resolve the problem and figure out, how do we tune this? How could this run more effectively?
And I think the value is in doing those together. There are very few people that are able to really deeply understand data distribution and drift and anomalies and the controls that get set up for that, setting aside the actual domain expertise of the data, which is a whole other aspect that's quite critical, and also go into the various technologies being used to actually process that data. And so with Acceldata, I've seen the ambition to tackle both sides of that, which is very, very difficult to do. And precisely because it's difficult to do, I think there are going to be some really interesting opportunities and great product opportunities.
[00:06:54] Unknown:
So when I introduced the topic of conversation for today, there was quite a mouthful that I actually pulled from your website about multidimensional data observability for modern data infrastructure. So there are a few different things to unpack there. Data observability, you touched on that a bit, and I've done some other shows about that. The modern data platform, modern data infrastructure, that's another area that I've spent a decent amount of time on, where it's this reorientation of the data warehouse as the focal point of all data operations and data analytics. But multidimensional is something that I'm definitely interested in digging into and getting your sense of what it means for a data observability platform to be multidimensional and some of the ways that that manifests.
[00:07:36] Unknown:
Yeah. Well, I think part of it actually comes back to the distinction people sometimes draw between monitoring and observability, setting aside data for a moment. There have been monitoring tools, and now people contrast themselves by saying, we don't just provide monitoring, which is kind of telling you what's going on and giving you that signal; we provide observability, which is basically understanding how you can map the external behavior and signals that you're seeing to the internal operations of the system and applications that you maintain, so that you can actually fix them. And for whatever reason, at this point, I don't think there was ever an era of data monitoring that I'm aware of, where people were really intense on that. People went very quickly for the term data observability. For whatever reason, it sounds cool. People do ML observability. Everything's observability now in name. I think when we say multidimensional, we're trying to double down on basically, hey, we're gonna let you understand up and down, left and right, how these things relate to each other. We're not going to be content to just tell you something very useful, by the way, which is like, hey, your data is bad quality, it's late, things like that. What we wanna do is basically say, hey, what happened here was that your data drifted, and that caused your Spark configuration to be suboptimal, which caused the job to take longer, which meant the data wasn't there, which meant the join you had downstream was joining on old data, which means your results are bad. There's a big difference between taking an individual user through that entire journey, which mixes a ton of technologies potentially and a ton of angles, and just doing something, again, useful, which is telling you, hey, something's gone wrong here. So with multidimensional, it'd be better if we could say, hey, those things are data monitoring and we're data observability, because we go deeper and we tell you why and we help you fix it. I think multidimensional is just the way to remind people, hey, there's more here: the life of a data engineer these days is quite a complicated one and requires dealing with multiple aspects. So I think that's what that gets at, I would say.
[00:09:42] Unknown:
Digging into the modern data infrastructure aspect of it, you touched on this a little bit with the variety of systems that you might need to jump across to be able to actually diagnose the true cause and effect of an error. But what are some of the requirements and implementation details that are brought in as a necessity because of the fact that we do have this modern data stack and modern data infrastructure, where we will likely have a variety of different components of tooling and different vendors and different concerns that we need to be able to traverse while understanding the entire life cycle of a data asset?
[00:10:20] Unknown:
Yeah. I mean, I think the diversity is one of the big aspects of this. As things have moved from, you know, getting one drag-and-drop tool from a big vendor, where we used their databases and their BI tools and things like that, into this world now where we're writing code and using the best of breed, where there's a new library every week that does something super powerful and takes off, the first requirement, I guess, sounds simple enough to say, but it's to have eyes on all of that, and so to be able to actually instrument code-based data pipelines in a meaningful way. I think this is where some of the existing tools get a little lost, because you could have great analysis.
But if you're only analyzing 5% of the actual assets that are out there, and importantly, not the ones that are maybe the newer applications, the ones that are tackling new revenue generating initiatives, there's a limit to what you can do there. So I think the first aspect is basically breadth of integrations, to be able to actually meet the workflow of a data engineer. I think the other interesting aspect with this is the difference between data warehouses and data pipelines. And what I mean by that is, obviously, BI is a very well established use case, and the data warehouse has a lot of uses. That's not going away. But what's changing, I think, is people are dealing more with raw data earlier. So data isn't necessarily born in a data warehouse. There's something emitting it. There might be a stream taking it in, but it might be a file sitting somewhere that then you're gonna process, and ultimately you're gonna load that somewhere else into a different database. And the great thing today is you can use the best of breed for that. From an operational standpoint, or I should say a performance standpoint, it's a good thing to be able to use the best tool for this.
But from a complexity point of view, it's not a great thing. And so I think one of the big aspects of data observability is to ask, are you looking at this basically at the point where it initiates, where it starts in a file? Or are you looking at the point where it ultimately ends up in the data warehouse and we run reports off of it? And now we wanna know if those reports are good or bad, and can I trace back this thing in the UI to that? That's a super valuable use case, but it is a limited segment of what I would say the modern data engineer looks at. So sometimes we borrow the phrase shift left from DevOps and things like this to say, hey, if you're drawing the diagram of the life cycle of data, shift your observability and your monitoring left to the point where the data actually originates. And that's actually quite difficult to do. That's a quite difficult technical problem because, one, that data is raw.
Two, it's not necessarily indexed for a query. You can't necessarily write SQL against it that's gonna capture it. So even expressing the types of things you're looking for is harder. And I think the other aspect is it's gonna jump across three different things along the way. It might jump from a streaming system to Databricks and ultimately end up in a data warehouse somewhere. So I think those are some of the key technical aspects. It's quite challenging to do across the whole lifespan. And personally, for me, that was what was quite interesting about Acceldata. It was like, wow, this is a harder problem than I thought. There wasn't an out-of-the-box solution to this, and there still isn't an out-of-the-box solution. Acceldata can do it with all these things, but it's not a simple task. And certainly one of the reasons I joined the company is to help us make this into something that's incredibly powerful and easy to use that covers all of it. But it's a big challenge to do it all with the real complexity of the modern data stack.
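To make the shift-left idea concrete, here is a minimal sketch of what a check against a raw landing file might look like before the data reaches anything queryable. The file path, expected schema, and null conventions are illustrative assumptions, not details from the conversation or Acceldata's product.

```python
# Hypothetical "shift left" check: validate a raw CSV drop before it is
# loaded anywhere that supports SQL. Path and schema are illustrative.
import csv
from collections import Counter

EXPECTED_COLUMNS = ["event_id", "user_id", "event_time", "payload"]

def check_raw_file(path: str) -> dict:
    """Run basic structural checks on a raw landing file."""
    null_counts = Counter()
    row_count = 0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # Catch schema drift at the point of origin, not in the warehouse.
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError(f"schema drift: got {reader.fieldnames}")
        for row in reader:
            row_count += 1
            for col, val in row.items():
                if val in ("", "NULL", None):
                    null_counts[col] += 1
    return {"rows": row_count, "nulls": dict(null_counts)}

if __name__ == "__main__":
    print(check_raw_file("/landing/events/2022-01-15.csv"))
```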
[00:13:53] Unknown:
One of the continued pieces of feedback that I get when I talk to people about the complexities that are inherent to the space of working in data is the fact that there are so many different stakeholders with different perspectives and different priorities and different backgrounds in terms of their technological depth or focus. And I'm wondering what you view as the core personas for Acceldata and how that influences the areas of focus in terms of feature development and user experience design and just your overall approach to priorities, and how you think about collaborating across all these different stakeholders throughout the life cycle of data?
[00:14:33] Unknown:
It's interesting. And I think we've taken a bifurcated approach, I would say. So it really starts with the data engineer and saying, you want to be a high performing data engineer. You want to be building these pipelines. You wanna be tackling new use cases, and you wanna be doing it better, faster, more reliably than anyone else. And so how can we, on the one hand, help you achieve that goal and, on the other hand, not give you too much extra work? If we were to say, hey, guess what, we'll give you that stuff, but first you need to write a hundred lines of code to set up tests on every component of this, and by the way, you need to learn what population stability index is and how to compute anomalies, that would be a bad user experience that would lead, potentially, to a good result. So just saying, all right, we want you to get to reliability, but we don't want you to have to change your workflow that much, I think, is one guiding goal for us. That's, again, easy to state but very hard to pull off. And this is very much something that I thought about quite a lot at DataRobot, where it was very much around, hey, for data scientists, how are we going to take some aspects of your job and automate that in a great way, or even a superhuman way? So it's a lofty goal to hit, but I think there's an aspect of that level of automation and meeting the data engineer where they are today that we have. I would say the other approach, the secret master plan, of course, is that there's another persona that's really missing a view on what's happening today. And that's the CDO and the data executive, who is charged ultimately with making some really smart choices on where they invest across different tools. These budgets and this spend is going way up on a lot of these things, and they have to navigate across cloud migrations, things like that. And so one of the things we try to do is say, as we make it easier for data engineers to make their pipelines more robust and monitored and to respond to incidents faster, we're also building up a map of how data and, basically, how data technologies are being used across an organization. And that map can be very helpful for really every sort of manager up the chain in the data executive flow, because it lets you know not just, on paper, what's the relationship between the systems or, on paper, which things are good, but in reality, who is using what technology?
Is that data actually any good? What did we spend on that? What did we spend on that relative to how much our data is growing? Things like that. And, ultimately, map that into revenue. So it's a long term thing here, basically, to make it easy for individual data engineers to achieve their goals, and in the meantime, inform the broader organization: which technology should we invest in, how should we adjust our spend? Especially with, you know, I feel like sometimes I don't even wanna mention the word data mesh, because it can be used lots of different ways. But I feel like sometimes that's used in a way where it means, hey, different teams are building up and managing their own, not infrastructure, but their own cloud services that they're buying. And so you need a map of that, and you need guidance on how to control and manage it. I think it starts with understanding what data engineers are doing, then helping them with the aspects they need help with, and then bubbling that up all the way. One of the big distinctions we draw that does help clarify things for us, along the lines of the shift left philosophy I mentioned earlier, is that we view the ability to write code and use APIs and add customization and scale things up as a good thing for us. Our users view it as a good thing. So that's a nice division for us in certain ways, where it's like, look, if we wanna do point-and-click data quality policy creation and review on data warehouse data, or do sort of data steward type actions where we're trying to understand and govern data, those are super valuable capabilities, but they're also things where we say, that's a great thing for us to work with our partners on. That's not really our focus. Our focus is, you are a data engineer. In many cases, you're writing code. You're writing Python, Java, Scala.
You're using these technologies, and you're solving these complex problems on pretty raw data. That's the type of persona and data engineer that we really have success with, I would say.
[00:18:40] Unknown:
And given the very deep and detailed and technical nature of the product that you're working on, and your focus on helping customers understand its value and on deciding what direction to take the work that you're doing, I'm curious how you have approached that as head of product, where you need to have an understanding of the underlying fundamentals of what it is that you're doing and how it's operating, but you also need to be able to work at the level of the customer, and how you're able to thread that needle given the complexities of the space.
[00:19:15] Unknown:
So I think, generally, it's a good signal for product managers and product leaders to be confronting an area where there's a ton of technical depth and challenges going on. Because I think that means that as you're able to work with your team all across the board, including UX designers, product design, all of us, as well as the hardcore engineers, once you're able to come in and understand that and put it into something that's easy to use, you've added value for the customer. Now someone who didn't understand how to tune a Databricks job can do it very quickly in one or two clicks. That's not where we are today, though.
What we have now still requires thinking through, but we've captured the signals, and we have ideas and data on how to improve and to get to those suggestions. So at one level, we're trying to build something that has superhuman knowledge embedded in it, in the sense that no one is gonna be an expert in all 20 of the technologies that you might look at. Even within streaming systems, you could rattle off 25 streaming systems that are out there that people may be using, that all are tuned and optimized in their own way and have their own concepts. So as we try to embed all of those in the product, that's gonna be something that no individual, certainly not me, is gonna understand all the details of. But I think it's a good signal. And so the question is basically always to keep coming back to the outcome that we're driving the user towards. What are we noticing when something goes wrong? What are the ways it could have gone wrong? Now let's drill into specific metrics and line those up together. So it's an interesting design challenge and product challenge to think through how you prioritize this and how you set up the hierarchy to drill through. I don't know if that's what you meant by dealing with a technical topic, but I think that is one of the things that we do a lot. Yeah. It's just always interesting to get some perspective on the different roles and responsibilities
[00:21:09] Unknown:
that are required to be able to build these different data products. And so just curious, given your view as somebody who's leading product direction and still needs to be able to understand all of these technical details, how you approach your particular job.
[00:21:25] Unknown:
Yeah. I mean, it's a lot of asking questions and a lot of learning. That's one of the fun parts, really: by being at one of these companies where you're taking on this type of challenge, you basically, by osmosis, pick up some quite useful things. Just as an example, since I've joined Acceldata, I've learned so much about Snowflake, and even Snowflake in its time has evolved in terms of the capabilities that it offers and how to track usage and how to optimize things. And I think that'll continue. So it's a lot of learning from this and then winding it back to, basically, what are we trying to accomplish here? It's like, well, people don't wanna be spending too much on their Snowflake instances and then get some surprise bill. I think many people have been burned by whichever cloud provider; it's not a Snowflake-specific thing. So how does this tool help with that? And, by the way, it also helps that we have engineers who are incredibly deep on Snowflake and understand it, and the same for other data platforms. It helps to have someone like me coming in and asking questions: I don't understand what that is; how would we do that? And that helps bring it to a place where we're making sure the value comes through clearly, or the key metric you need to get value comes through clearly. So it's always a work in progress and always something we refine. That's why you wanna have great engineers in your company.
[00:22:41] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free, or just get the free t-shirt for being a listener of the Data Engineering Podcast, at dataengineeringpodcast.com/rudder. In terms of the specifics of what you're building at Acceldata, can you give a bit of an overview about how the platform is implemented, some of the architectural elements, and the specific technical challenges that you've faced in being able to service such a complex space?
[00:23:33] Unknown:
One of the interesting things about data observability specifically is that it has a huge amount of operational concerns embedded within it. And what I mean by that is that the datasets people are dealing with vary enormously: you could have 20 gigabytes that you're building a dashboard on that's updated once a month, or you could have 4 terabytes coming in because you're getting, let's say, gaming data sent in, and that's 4 terabytes per day for your data engineering team to deal with. Those are huge differences in the volume and velocity of data, and so it's unlikely that the same technical solution is going to be optimal for both. There might be one that's more convenient. There might be one that's more powerful. And then you have to think about, when you start with some of these things, how is it going to scale up? I think one of the big architectural choices that has given Acceldata the opportunity to service some of these huge use cases is basically the decision to use Spark as the underlying processing engine. In some cases, you'll find tools are doing things in-database, which has its own advantages. You can write SQL. That's a little bit cleaner, and that can be good. Acceldata has made a different choice, where typically we're not running in-database. We're actually running our profiling and our analysis and our aggregation in Spark alongside that. And that's given us a number of advantages. One is obviously cost and price performance. If you're running in-database and your database charges by usage, then that's a factor in the total cost of ownership that's gonna burn you a little bit there. And as you go from early scale, or one BI use case, to, oh, now I'm processing tons of data here and I need to scale all that stuff up, that cost curve might get a little bit steep for some people's taste, especially with some of the cloud data warehouses today that people may or may not be as experienced at tuning.
I would say the other important aspect of this is taking load off of the database. The last thing you wanna do when you're trying to optimize your data pipelines is actually put a lot of load on those data pipelines by running queries at precisely the same time that you are doing your workload. Right? You've got your normal workload that you're trying to make sure is good, and then you've got your monitoring workload. And if those are running on the same system at the same time, a quite intense monitoring workload is gonna interfere with the actual workload you're trying to protect, which is really the last thing you wanna do. So that's something we've taken pains to avoid, because we wanna do a very intense analysis of the workflow while it's happening, and have that happen quickly. Then I would say the last bit that's important ties back to what I mentioned earlier around the expressiveness of what you're able to do. For us, we are trying to handle very complex use cases. We're skewing towards that side, and we expect and encourage people to write code and write custom rules, custom expressions, complex multi-table analysis, multi-data-source analysis, to ensure data quality. There's a number of use cases for that that we have customers using. Those things are really not very straightforward, or even possible in many cases, if you're just living in a relational database, but they are very possible to do in Spark. So those are all key elements of the Spark choice: price performance as you scale up to really high volumes, avoiding stressing your operational workloads with your monitoring workload, and then the expressiveness. I would say the other big technical choice, or technical challenge, which I think is common across many companies in the data world now, is your cloud architecture. How are you gonna do this in a way that's going to be efficient, not move lots of data, and also be sensitive to privacy considerations? Are you gonna move everyone's data in and then analyze it? Are you gonna push that down to the databases? How do you manage that? This is something where we actually have a lot of in-VPC deployments and fully customer-hosted environments, things like that, which avoid that problem. That's not an easy option to support, but one that we have a lot of. And we'll also be launching a cloud service that has a hybrid deployment model in the coming months.
So I don't know if that answers it, but I would say the number one dimension here is that we've chosen to use open source. It's not just Spark; we use Kafka as well. We scale a lot on this, but we've chosen not to use the more proprietary technologies as the underlying part of our stack.
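As a rough illustration of that architectural choice, here is a minimal PySpark sketch of profiling a dataset on a separate Spark cluster rather than pushing profiling SQL into the warehouse itself. The export path and the particular metrics are illustrative assumptions, not Acceldata's actual implementation.

```python
# A minimal sketch of the idea above: run profiling on Spark's own compute,
# against a snapshot of the data, so the monitoring workload never competes
# with the production workload it is meant to protect.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()

# Read a snapshot of the table (e.g., an export or a connector copy).
df = spark.read.parquet("s3://exports/orders/latest/")

# Compute basic profile metrics in one pass: row count, null counts,
# and approximate distinct counts per column.
profile = df.select(
    F.count(F.lit(1)).alias("row_count"),
    *[F.sum(F.col(c).isNull().cast("long")).alias(f"{c}_null_count") for c in df.columns],
    *[F.approx_count_distinct(c).alias(f"{c}_approx_distinct") for c in df.columns],
)
profile.show(truncate=False)
```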
[00:28:01] Unknown:
In the evolution of the product and the project, I'm curious how the design and goals of the system have changed since it first began and since you became involved with it?
[00:28:13] Unknown:
Yeah. So I think a lot of the origin of the company had to do with understanding deeply the data processing technologies, whether that's Kafka or Hive or Snowflake or Databricks, things like that. It was quite deep on that area and able to serve those things up. I think there was very much also a way to connect some of the data profiling aspects across there. And so the big thing that I've been focused on in the time that I've been here is bringing these sides together, and making sure the crossover between them is incredibly clean, so that when you do have an incident on one side, you're able to understand not just, hey, this data is late, sorry, go ping IT, but here's what was happening with the database this was using at that time. That's been an interesting development here, bringing these aspects together, which has both technical implications for how you set up the various agents as well as significant user experience implications. I would say the other aspect is that I bring a little different perspective, having worked much more on the last stage of the data life cycle, where you've got this clean data, these clean data pipelines that work perfectly, and now you're building and running machine learning models on top of it. So there's a lens that comes with that. Some of the things you look at when you're powering that type of use case are a little different from things that we traditionally think of as data quality or data reliability. There's a little more focus on things like data drift, basically how data is changing over time. Is that expected? Is it unexpected? Is it a real change in the world, or is it a problem, and how do you attribute that? And how do you actually get to the root cause of which thing is performing more or less well or reliably?
So one of the things I've started to bring a little bit, and that I think we'll start to see show up in the offering quite a bit more, is enhancing the embedded smarts of the product. How much of this machine learning aspect is it doing for you, both in terms of actual embedded machine learning and embedded best practices? That becomes an exciting area to look at for a couple of reasons. One is, obviously, we're in the business of trying to make this easy for users, and using data to inform that is one way. But in the bigger picture, I think there still is, and will continue to be, an evolution in the roles of data engineers, ML engineers, data scientists, MLOps engineers, DataOps engineers, data reliability engineers. All of those job titles exist, but the responsibility isn't clear or solidified as to who's gonna do what.
My view, having seen a lot of sophisticated ML organizations, is that in many cases data engineers have a primary role to play in serving any data that's going to the outside world and controlling that. Today, I think a lot of that tends to be a little gated, where it's like, well, you need to be a data scientist; you need to understand these metrics and the life cycle and things like that. I think it's possible, with the right tool and the right things built in, for data engineers to catch that stuff and manage it much earlier in the life cycle. Data engineers have a lot on their plate already, but they are the right people, with the right controls and the right knowledge and basically the right mindset, to be able to play a bigger role in some of this. I don't know if that's a consensus view now, but that's something that we're exploring: how much can we make that happen, and what effects will that have in terms of unlocking some of these use cases beyond data warehousing?
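Since population stability index came up earlier as one of the drift measures a tool might compute for you, here is a minimal sketch of the metric itself; the quantile bucketing and the 0.2 rule of thumb are common conventions, not details from the conversation.

```python
# Sketch of the population stability index (PSI), a common drift metric
# comparing a column's current distribution to a baseline snapshot.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, buckets: int = 10) -> float:
    """PSI over quantile buckets of the baseline distribution.

    Assumes a continuous column; heavy ties in the baseline can produce
    duplicate bucket edges, which a production version would need to handle.
    """
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, buckets + 1))
    # Clip both samples into the baseline's range so nothing falls outside.
    base_counts, _ = np.histogram(np.clip(baseline, edges[0], edges[-1]), bins=edges)
    curr_counts, _ = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)
    # Floor the proportions to avoid log(0) and division by zero.
    base_pct = np.clip(base_counts / len(baseline), 1e-6, None)
    curr_pct = np.clip(curr_counts / len(current), 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# A common rule of thumb: PSI above ~0.2 signals significant drift.
rng = np.random.default_rng(0)
print(psi(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000)))
```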
[00:31:51] Unknown:
And as far as the actual adoption and usage and workflow of Acceldata, what are the steps from saying, okay, I need to be able to understand at a deeper level what's happening in my data platform and my organization's data ecosystem, to actually installing and integrating Acceldata and being able to use it to figure out what are the actual pieces of information that I need to do my job better, improve my operations, and improve my reliability?
[00:32:22] Unknown:
So I think there are organizational aspects as well, for any sort of data project, where you're trying to prioritize which are the things that are actually going to be important to look at and that are gonna benefit from this. In terms of the product itself, to the extent possible, we try to automate this stuff, so that by plugging in, by installing the agents that we have, we're able to collect the data, analyze it, and basically suggest rules, identify patterns, and set those up. So it becomes a matter of reviewing these suggestions to say, okay, are these patterns of usage plausible? This is saying I don't need this much capacity; should I turn that down? Oh, it's saying the data is bad in these ways; is that true, or do I wanna dismiss that, or do I wanna add extra coverage? So those are the main aspects when you're going directly to the datasets and data platforms themselves, like going directly to Snowflake or Databricks or Kafka, whatever it might be. I think there's a little bit different adoption path for data pipelines themselves. In that case, there's a step of instrumentation.
So, basically, going into the Python code that you're orchestrating with Airflow and adding a few lines identifying, hey, this operation is starting now, or is doing this, or this chunk refers to this. And one of the things that lets us do, even though your pipeline was built in code rather than graphically, is build a visual representation of it, with things lit up indicating the quality across those steps. So that's a little different angle, and certainly one that we're trying to refine and make as simple as possible. There's a lot to do on that, but those are basically the three paths: install an agent, similar to APM tools, to collect this information; review what's happening; and then, for the data pipelines, add a bit of code in order to keep track of what's happening.
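To give a feel for what "adding a few lines" of instrumentation can look like, here is a sketch built around an entirely hypothetical observability client. Acceldata's actual SDK and its API are not shown in this conversation, so every name below is an assumption.

```python
# Hypothetical instrumentation pattern: a few lines inside a pipeline task
# announce span boundaries so an observability platform could reconstruct
# the pipeline visually. The client class is a made-up stand-in.
from contextlib import contextmanager
import time

class ObservabilityClient:  # hypothetical stand-in for a vendor SDK
    @contextmanager
    def span(self, pipeline: str, step: str):
        start = time.time()
        print(f"[{pipeline}] step '{step}' started")
        try:
            yield
            print(f"[{pipeline}] step '{step}' ok in {time.time() - start:.1f}s")
        except Exception:
            print(f"[{pipeline}] step '{step}' FAILED")
            raise

obs = ObservabilityClient()

def transform_orders():
    # Wrapping existing logic in a span is the whole instrumentation step.
    with obs.span("orders_pipeline", "transform"):
        ...  # existing transformation logic goes here
```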
[00:34:16] Unknown:
In terms of the tracking of events in a data platform, being able to pull out the useful metrics, I'm curious how you've approached the design and structuring of those events and being able to correlate them across the disparate systems that are needed to be able to understand what the actual transformations and usages of these data assets are, particularly as you move into things like machine learning training systems and being able to say, okay. Well, there's concept drift in this model because of the changes in the data schema that were introduced to this application database and being able to traverse that whole graph of operations and transformations and usage?
[00:35:03] Unknown:
Yeah. Well, I think there's a couple of senses of correlate that are important here. One is basically, once you have the map of what's there and you understand what the data pipeline looks like, you can always look upstream and say, okay, what was the dataset that went into this Databricks job? And you can go deeper into that and say, oh, this had drift in these columns, and I didn't expect that. So that's why this job took longer, which is why these other things are late, which is why this happened. So part of it is having the map, and that's sort of the human sense of correlation that lets you do this, and being able to flip back and forth between these lenses so you can actually understand what happened in each one and what effects that could have downstream. Then, in the statistical sense of correlation, ultimately, many things in the data world are time series. So if you have a sequence of time series of metrics, or rather thousands of them emitted by all of the systems that you're using, and you also have a time series of the data itself, in terms of what the values were at a particular moment, or certain aggregates, like what was the standard deviation at this moment, just picking random examples, you can build time series of each of these. And once you have a time series with a value and a timestamp in each of these layers, there are many things you can do on top of that to cluster them, compare them, find similar ones, or detect what's anomalous at the same time as something else, things like that, or decompose them and see what you're trying to gather. So this is one of the ways that we think about it and why we're excited about being able to capture both sides together, because we do have both sides together. You can see what spikes at the same time and what spikes before something else.
And then as you do that across multiple use cases and multiple systems, that starts to become a really exciting capability. I don't think anyone in the APM world has taken it to the fullest extent yet. Maybe they have, but certainly there are parallels from more mature companies, like New Relic and so on, that do some type of this time series analysis that we can learn from in the data world. I just think the data world can be a little more advanced about it, because we live in data, and there are things you can generalize across datasets that you can't generalize across applications.
That gives us another tool for finding ways to connect these. But the fact is, things were already incredibly complex for data engineers before there started to be this explosion of tools, and it's only gonna get more so. So we think there's value to add in being able to actually connect these pieces and line them up.
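As a rough sketch of that statistical sense of correlation, the fragment below flags windows where a data-layer metric and a compute-layer metric go anomalous together. The file, column names, and z-score threshold are all illustrative assumptions.

```python
# Sketch: flag time windows where anomalies in a data metric and a compute
# metric co-occur, as candidates for a shared root cause.
import pandas as pd

def zscore_anomalies(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Mark points more than `threshold` rolling standard deviations out."""
    mean = s.rolling(24, min_periods=12).mean()
    std = s.rolling(24, min_periods=12).std()
    return ((s - mean) / std).abs() > threshold

# Assumed layout: one timestamped row per run, with a data-layer metric
# (rows processed) and a compute-layer metric (job duration).
metrics = pd.read_csv("metrics.csv", parse_dates=["ts"], index_col="ts")
rows_anom = zscore_anomalies(metrics["rows_processed"])
time_anom = zscore_anomalies(metrics["job_duration_s"])

# Windows where both layers spike together.
print(metrics[rows_anom & time_anom])
```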
[00:37:44] Unknown:
In addition to the data observability aspect, one of the other pieces that you have in the Acceldata platform is being able to do performance monitoring of the compute that is used to do these data manipulations. And I'm interested in digging into some of the ways that the performance data you're able to collect can feed back into optimization and debugging and maintenance of the data pipelines and data transformations that are necessary to actually derive value from the raw information that we store.
[00:38:16] Unknown:
I have to correct you there, though, because you're distinguishing data observability from visibility into the compute of the actual data systems involved, and I think that's the distinction we're trying to reject by saying multidimensional. I think we would consider, yes, if you're relying on a relational database to power your dashboard, and something's just going from your database to your dashboard, maybe you don't need to know that; you just trust someone to handle it. But when you're doing complex data pipelines, there's not really an ability to distinguish between these. So I'll try to give a couple of examples. We actually do have a product tour up on our site now that walks you through exactly this scenario, in the example of a Databricks job. I won't spoil the story more than I already have, but it basically takes you through the experience of finding out what really caused a pipeline to be late, so I'll try to avoid that one. I think a good example of this is around capacity and compute volume. One of the challenges today, as people are adopting these pay-as-you-go data platforms, is basically a capacity problem.
So you might have quotas on it, you might run out of space, you might run into some defaults. And one of the things that can drive excessive usage is bad queries and bad joins. If you have data coming in with duplicate rows, and you end up having a join that should be one-to-one turn into many-to-many, that's something that can multiply and explode your processing time and your compute costs. And if you're coming in from the perspective of someone who's asking, what is going on with my Snowflake bill, or, if I'm on a data mesh team responsible for this service, why am I hitting my cap and not processing the data? Well, the root cause isn't that Snowflake did something wrong or these things broke; these things are pretty solid. What happened was that the data that went into it was bad, or the query that went into it was bad, and that's actually what caused the problem. So that's an example of how these things intertwine, where either your cost or your time to process can be blown up. The other one that comes to mind, which we see, is the aggregate load on the system.
So if you've got a hundred data pipelines going, which is probably an underestimate for most companies these days, if you've got 180 pipelines going, and you're looking at them and asking, why is this slow? Why is it late? What you're seeing is that your dataset isn't where it should be, and so you might look into: what's wrong? Did we not get the data? Is it taking too long? Is it bad? And the reality might be, actually, no, it's not bad. It's just that the underlying system that was processing it was impaired, because it had too many things on it at one time, and that's ultimately the cause. And you would not necessarily be able to perceive even something as simple as that unless you actually have some insight into it. If you're looking at the logical layer, the data layer, and you don't have much ability to zoom in on the actual data processing platform, you can say, hey, IT, data platform team, go look into this. But the real cause was quite evident if you just had a lens into that and could join it up. So I think those are some of the more prosaic examples of how the root cause of these things may not be super deep, but it is a level deeper than just looking at the datasets flowing through the system.
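The duplicate-key join explosion described above is easy to guard against before the join ever runs. Here is a minimal PySpark sketch of such a pre-join check; the table path and key column are illustrative assumptions.

```python
# Sketch of a pre-join guard for the fan-out problem: if a key that should
# be unique has duplicates, a 1:1 join silently becomes many-to-many and
# multiplies both rows and compute cost downstream.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.read.parquet("s3://lake/orders/")

# Any key appearing more than once would fan out a supposedly 1:1 join.
dupes = orders.groupBy("order_id").count().filter(F.col("count") > 1)

if dupes.limit(1).count() > 0:
    raise RuntimeError("order_id is not unique; downstream join would fan out")
```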
[00:41:46] Unknown:
In order to gain value from a system such as Acceldata, it's necessary to be able to view and understand and interpret the information that it's presenting, and to be able to generate alerts without having so many of them that you're subjected to alert fatigue. And I'm curious how you've approached the design of the platform to allow for this interpretability and being able to maintain a healthy balance between visibility and obfuscation, between what is useful information and what's not.
[00:42:19] Unknown:
obfuscation of sort of what is useful information and what's not. Yeah. I mean, alert fatigue, like, we we heard an example of someone who was just getting, like, 7 pages of alerts, like, per day on the stuff, and they're just like, it's just use like, I'd rather have no not an accelerated customer, of course. But they're coming in, you know, and saying, this is not useful. I need something, you know, that's actually gonna prioritize and place in. I think there's a part of this that can be kind of, you know, automated and done with software and a part that is just, like, you know, basically human input. And so we kind of adopt both. So 1 is, like, you know, we make it pretty straightforward to basically go in and prioritize, you know, assets and columns and things like this, you know, in order to say, like, this 1, you know, really matters. This 1 matters less. You know? I would say there's also an ability to come in and, like, you wanna get that by proxy. So if you go in and you're able to analyze the query logs or the usage of things, I think that becomes another tool in the toolbox to say, like, okay. This table is being queried a lot. You know, maybe I should really care about the quality, but not something you could do more or less, you know, automatically if you've got the right right signals. I would say, actually, the compute aspect, you know, this is another aspect where having the 2 layers together is actually quite useful because, like, yes, there are some systems that allow you to parse query logs and analyze that and see what's being used. Some systems do not let you do that, and so you might use basically the cost, how much data it's processing as kind of a cue for, you know, the importance of this pipeline.
So I think that's another element that gives you a bit of a signal on top of it. Ultimately, the dream for everyone, and it's a race to see who can do it, is a very data-informed process where you're sifting through all of this, ranking everything, and having people give feedback: hey, is this useful? Is this not useful? My read is that many organizations are not at the stage where there's actually enough data to do that as effectively as you'd want. Maybe other people disagree; I think it's an awesome feature and an impressive demo, but I don't know that usage is at the stage that would actually let you do real recommendation on top of it. So we try to capture the signals we can, give people a nice alerting framework that lets them rank things, and build from there. That's basically what it is in terms of alerting. The other big aspect ties back to one of the persona questions.
We also look a fair bit at reporting. There may be people who are not looking for alerts, certainly not seven pages of them; I don't think anyone is looking for that, but they might not be looking for any alerts at all. What they might really be looking for is a report on how their spend on data technology is trending. Are we actually using it wisely? I just spent $5,000,000 on this data warehouse; why did I burn through that budget in two months instead of six? Those types of reporting and insights are quite valuable. So we also look at how we bundle these up, give you a dashboard, let you export it, all of those fun things, to help reach other people in the audience who might not be the day-to-day users.
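The budget scenario he describes, $5,000,000 gone in two months instead of six, falls out of a simple burn-rate projection. A minimal sketch, with hypothetical figures:

```python
from datetime import date

def burn_rate_report(budget, spend_to_date, period_start, period_end, today):
    """Project when the budget runs out at the current average daily spend."""
    days_elapsed = max((today - period_start).days, 1)
    daily_rate = spend_to_date / days_elapsed
    days_in_period = (period_end - period_start).days
    projected = daily_rate * days_in_period
    days_until_empty = (budget - spend_to_date) / daily_rate if daily_rate else float("inf")
    return {
        "daily_rate": round(daily_rate, 2),
        "projected_period_spend": round(projected, 2),
        "days_until_budget_exhausted": round(days_until_empty, 1),
        "over_budget": projected > budget,
    }

# The scenario from the conversation: a six-month budget consumed in two.
print(burn_rate_report(
    budget=5_000_000, spend_to_date=5_000_000,
    period_start=date(2022, 1, 1), period_end=date(2022, 7, 1),
    today=date(2022, 3, 1),
))
```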
[00:45:24] Unknown:
In the operation of Acceldata, I'm sure that there is a substantial amount of information and reporting and your own data transformations that you're dealing with. And I'm curious how you are dogfooding the platform to help build the platform.
[00:46:09] Unknown:
If this was a CRM system or a dashboarding system or something like that, we would absolutely do that. But Acceldata is a three-and-a-half-year-old company that does not necessarily have the data volumes or the structured use cases that would support that type of investment. It would be good to do in some ways, but those use cases become valuable when you're a big company, and we're not a big company yet; we serve very large companies. As much as dogfooding is a good practice, it would be a little bit synthetic for us right now. We obviously are exhaustively testing the product, using it every day, banging on it, finding problems and fixing them. But I don't know that there's a huge match between our needs as a three-and-a-half-year-old startup and the gigantic, multibillion-dollar Global 2000 companies that we sell to.
[00:47:01] Unknown:
In your experience of working with Acceldata, helping your customers, and understanding the product to help drive its direction, what are some of the most interesting or innovative or unexpected ways that you've seen Acceldata applied?
[00:47:15] Unknown:
The ones that have been most striking to me are around monetization of data. Many people are familiar with BI and data warehousing, and I was also very familiar with ML-based use cases. One thing I was not as conscious of was data marketplaces, and companies that make their business off of exposing their data or selling analysis on top of it. Since I've been here I've looked into this more, and I've seen a stat that something like 75% of big companies are consumers of assets from data marketplaces, while roughly 25%, projected to grow to 35% in a couple of years, are contributing and monetizing through them. I was a little surprised by the complexity that goes into exposing data as an asset that you run your business off of. It brings a much wider and different set of requirements than I had seen, and it has broadened my understanding of what people are doing in the data world and what products need to do to match that. Just as an example: if you've got a dataset you're monetizing around worldwide business activity, you might be pulling data from a hundred countries about the businesses there. There are going to be huge concerns about how you compare the files coming from different groups and actually standardize them against a canonical database, and they are coming as files; the technology is not super centralized. So that's an example where getting it wrong, or being able to get it to a certain level of quality in a certain amount of time, has a real, tangible value to the business, because they are literally selling their data. And when it's bad, it's not going to be a complaint from a VP that the number was wrong or that they don't believe the number; it's going to be, I'm going to go to your competitor because your data sucks. So that was a little bit new for me. And then, as I mentioned earlier, seeing the richness, the complexity, and the speed of evolution of these data processing platforms, whether databases or other ways of processing data, has been really interesting. It's an awesome area of technological innovation these days, and a lucky perk of Acceldata for me is being able to learn about some of them from people who are using them in anger and from the technical experts on our side.
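The standardization step he describes, heterogeneous country files mapped onto one canonical database, might look something like the sketch below. The schema, column maps, and validation rule are all hypothetical; real feeds would need type coercion, currency conversion, and much richer quality checks.

```python
import csv
import io

# Canonical schema for the merged, sellable dataset (illustrative).
CANONICAL = ["business_id", "name", "country", "revenue_usd"]

# Each country feed arrives as a file with its own column names.
COLUMN_MAPS = {
    "DE": {"firmen_id": "business_id", "name": "name", "umsatz_usd": "revenue_usd"},
    "US": {"biz_id": "business_id", "legal_name": "name", "revenue": "revenue_usd"},
}

def standardize(raw_csv, country):
    """Rename columns to the canonical schema; quarantine rows missing required fields."""
    mapping = COLUMN_MAPS[country]
    good, bad = [], []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        out = {canon: row.get(src, "") for src, canon in mapping.items()}
        out["country"] = country
        (good if out["business_id"] and out["name"] else bad).append(out)
    return good, bad

us_feed = "biz_id,legal_name,revenue\n42,Acme Corp,1000000\n,Missing Id,5\n"
good, bad = standardize(us_feed, "US")
print(f"{len(good)} standardized rows, {len(bad)} quarantined for review")
```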
[00:49:49] Unknown:
In your own experience of working at Acceldata and helping to explore this problem of observability and monitoring and correction of data systems, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:49:59] Unknown:
One aspect that has overwhelmed me has been really trying to understand every tool on offer in the space today. If you're focused on a certain area, you look at this stuff and think, wow, there are a lot of cloud data warehouses, or wow, there are a lot of pipeline orchestration systems, or there are a lot of ML serving tools. Once you're in one area, you're blown away by what's coming out. But with observability, when you're trying to have a view across the evolution in each of these areas, it's crazy: trying to understand the distinctions between all of these data catalogs, data governance tools, data pipelines. What are these things, and how do they relate to each other? The pace was also eye-opening for me. If you look at the rate of growth of some of these tools and technologies, open source or not, it's just incredible. The cycles that might have taken eight or ten years in the past, for something to reach critical mass from when it started, are now more like eight to ten months for some of these projects that are just exploding.
So it's fun to keep track of all these things, but it stretches your mind a little. Another interesting lesson has been how the roles are changing for data practitioners. Some people, if they're focused on BI and analytics, are looking at dbt and tools like Firebolt and all these different ways of doing things, and it's awesome stuff. But there are other people looking at completely different tools who have the same feeling, and the two groups may not be looking at each other's stacks at all. So it's been interesting to see the shifting balance of how people are self-selecting into certain use cases or certain technology stacks. Data observability kind of forces you to look at all of these together and not really take a stance on, hey, this is the one, this is the other one. In the data observability market, people will arguably already have segmented themselves a little bit toward particular use cases or particular personas. But in studying where we are and where we fit, you see a lot of diversity in how people are splitting it out.
[00:52:13] Unknown:
And for people who are trying to get better visibility into the operations and challenges of their data platform, what are the cases where Acceldata is the wrong choice, and they might be better suited with a different vendor, a set of open source tools, or building their own internal capacity for these debugging capabilities?
[00:52:36] Unknown:
Part of the question to ask is around the type of use cases; that's where I would start. You can also look at the type of people you're trying to hire for your initiatives. What I mean by that is, there are absolutely super valuable use cases where people are running ERP systems and focused on reporting and BI, and there's a very mature tool set for that; those cases are well covered. So if someone is looking for, how do I make the quality better on these dashboards: yes, Acceldata can do that, but that's maybe not what data observability, at least in our view, is really for.
When people are saying, I'm using a mix of technologies, I'm moving between an on-premise environment and a cloud environment, or I'm using multiple clouds, that's one kind of cue. Another is when people are saying, I'm not just a reporting center or a cost center, which is unfortunately how data organizations have sometimes been viewed; I'm actually a profit center as far as the CEO is concerned. I'm making money off of our data by offering valuable services or by monetizing it. That's a cue to ask: do I have the systems in place to do that at a level where I'm ready to be out in the market, where my competitors are a click away from doing the same thing, and where part of my differentiation is that my quality is better than theirs and my system is better than theirs? And then the last bit is around the type of people you're hiring and bringing on: teams that say, for our new initiatives, we're trying to use more open source.
We're trying to write more code, take advantage of all the frameworks and libraries out there, and use version control. That's another cue for: I probably want a system that's going to let me instrument that code and those systems, just as I would for normal software, and make sure they're performing well. Those are some of the traits we see that say, okay, this organization is probably looking for something that can help them get reliability. We also see groups whose use case is, I'm trying to figure out where my data is and what this column means, the classic lineage cases, which are super valuable; but in our view the focus is on what lets you deal with these modern data pipelines and the new use cases and new standards.
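Instrumenting pipeline code "just as you would for normal software" can start as simply as wrapping each step with timing and row-count telemetry. A minimal sketch, with hypothetical step names; a production system would ship these metrics to an observability backend rather than a local logger.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def observed(step):
    """Wrap a pipeline step with the timing/row-count telemetry you'd expect from software."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(rows, *args, **kwargs):
            start = time.monotonic()
            try:
                out = fn(rows, *args, **kwargs)
                log.info("%s ok: %d rows in, %d rows out, %.3fs",
                         step, len(rows), len(out), time.monotonic() - start)
                return out
            except Exception:
                log.exception("%s failed after %.3fs", step, time.monotonic() - start)
                raise
        return wrapper
    return decorator

@observed("dedupe_orders")
def dedupe_orders(rows):
    # Keep the last record seen for each order_id.
    return list({r["order_id"]: r for r in rows}.values())

dedupe_orders([{"order_id": 1}, {"order_id": 1}, {"order_id": 2}])
```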
[00:54:50] Unknown:
As you continue to work with the team at Acceldata, work with your customers, and give feedback and direction to the engineers, what are some of the things you have planned for the near to medium term, or any projects that you're excited to dig into?
[00:55:05] Unknown:
Yeah. One of the cues for us is around ease of use and user experience, which maybe sounds strange, but particularly in the data engineering world, speed and performance and scalability tend to be the first things, and they are for Acceldata as well in our product. One of the cues I'm trying to introduce a little more is: let's stand out and make sure we're providing something that is, people use the word delightful, but basically a delightful experience. And that doesn't mean a pretty chart, because that may not be what data engineers are looking for. It might be a really straightforward API, a really easy way to instrument these systems, a really great set of automated rules and alerts. So that's one big aspect we look at. In terms of exciting things we're building now, I do think there's a lot that can be done with data science best practices and applied data science in the data world. The APM vendors have advantages in that they've been around for a long time and have worked on this stuff, but the big advantage of data observability companies is that we get to work with actual datasets. There are techniques you can apply to those that you can't apply to software, and we've got some nice ones cooking that should be exciting to roll out a little later this spring.
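One example of a technique you can apply to datasets but not to software: volume anomaly detection on daily row counts. The sketch below uses a median/MAD score so a single bad day doesn't skew its own baseline; the counts and threshold are made up.

```python
import statistics

def volume_anomalies(daily_counts, z_threshold=3.0):
    """Flag days whose row count deviates sharply from the series' typical level.

    Uses median and MAD rather than mean/stddev so one bad day
    doesn't distort the baseline. Purely illustrative.
    """
    median = statistics.median(daily_counts)
    mad = statistics.median(abs(c - median) for c in daily_counts) or 1
    flagged = []
    for day, count in enumerate(daily_counts):
        z = 0.6745 * (count - median) / mad  # scaled to be comparable to a normal z-score
        if abs(z) >= z_threshold:
            flagged.append((day, count, round(z, 1)))
    return flagged

counts = [1020, 990, 1005, 1010, 130, 1000, 995]  # day 4 lost most of its rows
print(volume_anomalies(counts))  # -> [(4, 130, -58.7)]
```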
[00:56:29] Unknown:
Are there any other aspects of the work that you're doing at Acceldata, or the overall problem space of data observability and data monitoring, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:56:41] Unknown:
I think maybe the cost aspect of some of this stuff. The observation we have is that we meet with a lot of chief data officers and data executives, and we're starting to see something somewhat alarming. There was a time when people were very excited about the potential and the ease of use of a lot of the cloud data processing technologies, and for many companies that time is still here. But from our discussions with data executives, there's starting to be a little bit of a note of worry or fear, because some of these costs really are ramping up pretty quickly. In many cases it happens for small reasons: this connection was left open, or you chose this type of engine instead of that type of engine, and it turns out that engine was three times more expensive this week, so you burned through the budget. So it's an aspect we try to be mindful of in our product and with our customers. People look at this and say, hey, this technology is all awesome, I love that I don't have to wait for things anymore, but how do I get a handle on the cost of these operations, and can you help me optimize that? That's an interesting dimension for people to be thinking about these days. And one of the challenges embedded within it is that the incentives are not identical between the vendor of the cloud data processing platform and the user of it. Cost is a lens. We can have a lot of fun talking about the technology and the ways we work with these things, but it's probably a good time for everyone involved in data engineering and using these new technologies to take a fresh look, and to understand how we're measuring costs, how we're monitoring costs, and how those are growing. Because my sense is that the back office teams are starting to get a little bit surprised by the bills they're getting from some of these great things.
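The failure mode he describes, an engine swap that quietly triples the bill, is exactly what a trailing-average cost check catches. A minimal sketch with invented numbers:

```python
def cost_spikes(daily_costs, lookback=7, ratio=2.0):
    """Flag days whose spend exceeds `ratio` times the trailing average.

    `daily_costs` is a list of (day, dollars) pairs; hypothetical numbers.
    """
    alerts = []
    for i in range(lookback, len(daily_costs)):
        window = [cost for _, cost in daily_costs[i - lookback:i]]
        baseline = sum(window) / lookback
        day, cost = daily_costs[i]
        if baseline and cost / baseline >= ratio:
            alerts.append(f"{day}: ${cost:,.0f} is {cost / baseline:.1f}x "
                          f"the trailing {lookback}-day average")
    return alerts

# An engine swap quietly triples the daily bill on day 8.
costs = [(f"day-{i}", 10_000) for i in range(1, 8)] + [("day-8", 30_000)]
print(cost_spikes(costs))
```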
[00:58:30] Unknown:
Yeah. That's why there's a whole industry now of companies built entirely around cloud cost optimization, helping you understand how to reduce your cloud bills. And it'll be interesting to see how that starts to leak into things like these cloud data warehouses and some of these other SaaS vendors.
[00:58:46] Unknown:
Well, it's kind of the same thing. There's been a market, of course, for application monitoring and software monitoring, and now there's starting to be one for data observability. There's been the same thing for cloud, with all these cloud cost optimization companies, so is there going to be that for data? Are these systems sufficiently complex and varied that you need that expertise to understand how to optimize them? My perspective is certainly yes. Even just trying to master two or three of the leading ones, let alone the 23 different ways you can do streaming data analysis in the cloud, is quite challenging. There are actually companies coming out that are built solely around even one of these technologies: here's how you optimize this particular type of spend. And of course, the vendors themselves have their own interest in this. So it's going to be an interesting area to watch. But again, from the multidimensional data observability perspective, this is a lens that we think people can't really leave to someone else, because whatever analysis you're doing, whether you know what the cost was or not, it did have a potentially significant cost, and I guarantee that someone at your company is going to be looking at that a little more closely in the next few years.

Absolutely.
[01:00:07] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:00:22] Unknown:
I don't know about data management exactly. I'll just say that the reason I joined Acceldata was because I was working at the far end of the spectrum, where you're relying on processed data and doing very cool AI things with it on top, and it's very exciting. Obviously, many people in the data world are listening to talks and reading posts about this awesome AI technology. And I think the biggest problem is that we don't have enough good data pipelines to actually make use of the machine learning technology that has been built all across the world, in open source, at Google, at Microsoft, wherever. That stuff is incredibly cool, but to actually apply it to most of the problems we have, we have to have better data pipelines.
And so for me, I think we're able to move data very effectively, and we're getting better at automated documentation and management for it. But the cost of that has been fragmentation. So the thing we need is something that lets you plug in all of these technologies, and there are going to be more that evolve, and actually do it in a way where you can make enough sense of it to depend on it and run your business off of it. Once we have that, it's not, hey, these are two or three problems we can apply ML to, and it's not, hey, only this one tech giant is able to do this. That's going to be the exciting piece, because then we're going to connect our raw assets on the data side with this amazing technology on the ML side, and actually be able to do that for a hundred use cases at every company instead of one, two, or, in some cases, zero.

Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Acceldata. It's definitely a very interesting and constantly evolving problem domain, and it's always great to see the different perspectives that folks have on how to approach data observability, data quality management, and optimizing our data operations. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.

Thanks so much.
[01:02:31] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Welcome
Interview with Tristan Spaulding
Tristan's Background and Journey into Data
Acceldata and Data Observability
Multidimensional Data Observability
Modern Data Infrastructure
Core Personas and User Experience
Platform Implementation and Architecture
Evolution and Future Directions
Adoption and Integration
Event Tracking and Correlation
Performance Monitoring and Optimization
Balancing Visibility and Alert Fatigue
Dogfooding the Platform
Interesting Use Cases and Applications
Lessons Learned and Challenges
When Acceldata is Not the Right Choice
Future Plans and Exciting Projects
Cost Considerations in Data Engineering
Biggest Gaps in Data Management Tooling
Closing Remarks