Summary
There are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, including observability, discovery, and governance of your data. That’s what the Lenses.io DataOps platform is built for. In this episode CTO Andrew Stevenson discusses the challenges that arise from building decoupled systems, the benefits of using SQL as the common interface for your data, and the metrics that need to be tracked to keep the overall system healthy. Observability and governance of streaming data requires a different approach than batch oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Andrew Stevenson about Lenses.io, a platform to provide real-time data operations for engineers
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Lenses is and the story behind it?
- What is your working definition for what constitutes DataOps?
- How does the Lenses platform support the cross-cutting concerns that arise when trying to bridge the different roles in an organization to deliver value with data?
- What are the typical barriers to collaboration, and how does Lenses help with that?
- Many different systems provide a SQL interface to streaming data on various substrates. What was your reason for building your own SQL engine and what is unique about it?
- What are the main challenges that you see engineers facing when working with streaming systems?
- What have you found to be the most notable evolutions in the community and ecosystem around Kafka and streaming platforms?
- One of the interesting features in the recent release is support for topologies to map out the relations between different producers and consumers across a stream. Why is that a difficult problem and how have you approached it?
- On the point of monitoring, what are the foundational challenges that engineers run into when trying to gain visibility into streams of data?
- What are some useful strategies for collecting and analyzing traces of data flows?
- As with many things in the space of data, local development and pre-production testing and validation are complicated due to the potential scale and variability of a production system. What advice do you have for engineers who are trying to establish a sustainable workflow for streaming applications?
- How do you facilitate the CI/CD process for enabling a culture of testing and establishing confidence in the correct functionality of your systems?
- How is the Lenses platform implemented and how has its design evolved since you first began working on it?
- What are some of the specifics of Kafka that you have had to reconsider or redesign as you began adding support for additional streaming engines (e.g. Redis and Pulsar)?
- What are some of the most interesting, unexpected, or innovative ways that you have seen the Lenses platform used?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on and with Lenses?
- When is Lenses the wrong choice?
- What do you have planned for the future of the platform?
Contact Info
- @StevensonA_D on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Lenses.io
- Babylon Health
- DevOps
- DataOps
- GitOps
- Apache Calcite
- kSQL
- Kafka Connect Query Language
- Apache Flink
- Apache Spark
- Apache Pulsar
- Playtika
- Riskfuel
- JMX Metrics
- Amazon MSK (Managed Streaming for Kafka)
- Prometheus
- Canary Deployment
- Kafka on Pulsar
- Data Catalog
- Data Mesh
- Dagster
- Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
[00:00:17] Unknown:
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster.
With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Andrew Stevenson about Lenses, a platform to provide real time data operations for engineers. So Andrew, can you start by introducing yourself?
[00:01:27] Unknown:
Yeah. Hello. Great that you have me. I'm Andrew. I'm the CTO of Lenses. I've been working in data now for a long time, 20 years. I started off as a C++ developer and ended up being a big data specialist around Kafka, and now the CTO of Lenses. And do you remember how you first got involved in the area of data management? I think I was always involved in this. If I look back, even when I was doing civil engineering, I was still collecting data for a pressure management system, actually. And even doing C++, it was still matching and settling trades at a clearing house.
So it's always been there. I think it really became data management when I was at a high frequency trading firm, and we were doing a lot of big data movement before it was called big data back then, using a lot of the Microsoft stack, actually, providing a lot of real time analytics to the trading. So I think that's where I truly transitioned into a full blown data engineer. But I think it's always been there for me. It's always been a constant. You know, you talk about data being a protagonist; it was always present in everything I was doing, even when I was more of a traditional developer.
[00:02:45] Unknown:
And so in terms of Lenses itself, you mentioned that you happened upon working with Kafka in the streaming space. I'm wondering if you can give a bit of background about what the Lenses product is and some of the story behind how it got started. Yeah. So I think data is a product now, and people and companies are starting to realize that. They're wanting to get value from their data, and more and more in real time. So as I was working as a data contractor
[00:03:11] Unknown:
in various companies, normally in the investment banking scene in London, I came across Antonios, the CEO, and we were trying to implement many projects on top of Kafka and Spark and others, and actually seeing the difficulty that we had there. So we thought, okay, we can address this. So with Lenses, what we're trying to do is take the pain and the cost and the complexity out of doing that, so you can actually build real time data driven applications more easily. And an important aspect of that is actually bringing the business back into these data projects. I think the business got sidelined too much, and all the focus was on the technology. And where I've seen the success was always by being able to bring the business context to any data that we were using. For example, I'm a technologist.
I'm not an expert in market risk, but I saw these people getting sidelined. So we thought, how can we bring the tooling to get these people back involved so we can make projects a success? Open source is great, but if we can't make it a success, and we can't get the tooling around it and bring the business context to it, that's where I've seen the failure. So that's what we ended up building Lenses out of. We first started by building a lot of open source connectors. We have the Stream Reactor, which is an open source project; I started that, and there are 30-odd Kafka connectors in there. From that, we went on to actually turn it into a real product. Again, the focus for us is really at the application layer, at the data layer, making use of that.
So for example, we have one client, Babylon Health. It's a unicorn health provider out of the UK, and their goal is to provide affordable health care for everyone on the planet. And they use tech intensity, commodity software, to do this, so they can actually get back to what's important. Right? There's nothing more important than your health, and in the current climate, you know, that's very important. So I think that's how we ended up here, and that's our key focus with what we're trying to do with Lenses. And to do that, one of the key things we have is our own SQL engine. You know, having our own SQL engine to browse data, to process data, also means that we can bridge this divide and bring all that amazing business context to the problems we want to solve. And in the tagline for your product, it calls out DataOps as a keyword there. And I know that that's a term that's getting used more frequently now with varying degrees of meaning. So you mentioned being able to bring the business into the context of the data and ensure collaboration across the different team boundaries. But can you give your working definition
[00:05:45] Unknown:
of what you think constitutes DataOps, and some of the ways that the Lenses platform helps to support the cross-cutting concerns that come up when trying to bridge the gap between those different roles in an organization for being able to deliver that value from the data? Yeah.
[00:06:01] Unknown:
So I think I tend to agree with your definition of DataOps there. The important bit is that we've had a lot of movement around DevOps, and it's been very successful in how we make sure that we apply operational principles to get software delivered quickly. We now need to make sure that we're applying that at the data level as well: how we manage access to the data, as well as how we actually provide the governance and the data ethics around what we're doing at the data layer, plus combining everything from the DevOps perspective. We want to take all those good parts. So what we do in Lenses is make sure that not only do we have the visibility aspects from the SQL engine, we also have a very strong role based security model, so we apply the governance as well. But we also make sure we incorporate all the good points from the DevOps side of it, such as GitOps: everything in Lenses is an API, so we can version control all the attributes that go into making the data platform successful. We can move topics, we can move processors, we can move connectors, everything we need to get into production quickly, while still providing the visibility and the governance around the system.
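To make the GitOps point above more concrete, here is a minimal sketch of a CI step that applies a version-controlled desired state through a REST API. The endpoint paths, payload shape, and auth scheme are assumptions for illustration only, not the actual Lenses API; in practice the Lenses CLI or your CI tooling would handle this.

```python
# Illustrative GitOps-style "apply desired state" step (hypothetical endpoints).
import json
import requests

PLATFORM_URL = "https://lenses.example.com"      # placeholder
HEADERS = {"Authorization": "Bearer CHANGE_ME"}  # assumed token-based auth

def apply_desired_state(path: str) -> None:
    """Push a version-controlled description of topics and SQL processors,
    so the Git history doubles as the audit trail."""
    with open(path) as f:
        state = json.load(f)

    for topic in state.get("topics", []):
        # Hypothetical endpoint: create or update a topic definition.
        requests.post(f"{PLATFORM_URL}/api/topics", json=topic, headers=HEADERS).raise_for_status()

    for processor in state.get("sql_processors", []):
        # Hypothetical endpoint: deploy a streaming SQL processor.
        requests.post(f"{PLATFORM_URL}/api/processors", json=processor, headers=HEADERS).raise_for_status()

if __name__ == "__main__":
    # e.g. the file that a reviewed pull request just merged to the main branch
    apply_desired_state("platform/desired-state.json")
```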
[00:07:13] Unknown:
You mentioned the fact that you have your own custom SQL engine. I'm wondering what the motivation was for building that out specifically for this product versus using some of the off-the-shelf engines that are readily available for some of these streaming platforms, or leveraging some of the components that exist out there, such as the Calcite project in the Apache ecosystem?
[00:07:38] Unknown:
So when we started out, the first actual bit of SQL we did was a thing called Kafka Connect Query Language. This is arguably the first SQL layer that was there for Kafka, and it was what we introduced into the connectors. We looked at Calcite at the time, but we didn't think we needed that for what we were trying to achieve in the connectors. We did look at Calcite again for the current SQL engine that we have. However, we weren't able to extend the grammar for where we wanted to take the engine, so we decided to build our own. Now, we built it on top of Kafka Streams, so effectively our SQL engine boils down to a Kafka Streams application at the moment, and we chose that route because of some of the advantages we were seeing in it just being a lightweight library. That doesn't mean that we're not interested in bringing in other technology. We will use other things such as Flink or Spark, and we're looking at how we can incorporate them into Lenses, because we want to just take the best technology that's out there. Satya Nadella talks about this: tech intensity, using the commodity infrastructure and platforms that are out there now so we can get back to delivering the business value. I talk now about data intensity quite a lot. So that was the reason why we went our own way with the SQL engine.
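To make the point concrete that the SQL engine "boils down to a Kafka Streams application", here is a minimal Python sketch (using the kafka-python client) of what a simple filtering statement conceptually compiles down to. The topic names and the SQL in the comment are illustrative, and the real engine runs as a JVM/Kafka Streams application rather than a Python loop.

```python
# Conceptual sketch: a statement like
#   INSERT INTO big_payments SELECT * FROM payments WHERE amount > 1000
# ultimately behaves like a consume -> filter -> produce loop over Kafka.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "payments",                                   # source topic (placeholder)
    bootstrap_servers="localhost:9092",
    group_id="sql-processor-sketch",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for record in consumer:
    payment = record.value
    if payment.get("amount", 0) > 1000:           # the WHERE clause
        producer.send("big_payments", payment)    # the INSERT INTO target
```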
Plus, it allows us, if we want to pull in other technologies
[00:08:57] Unknown:
such as Pulsar or Redis Streams, for example, to do so. So in terms of the collaboration aspects of it, SQL has become the universal dialect for working with data, and something that is sufficiently descriptive and approachable that it's possible to get different business users on board with it, to at least be able to view the query and generally understand what's happening there, without having to give them their own computer science course on how to program their own definitions. And I'm wondering what you have found to be some of the ways that that has helped in breaking down those silos between the engineering and business teams, and just some of the overall effectiveness of using that as a means of communication about the intent and purpose of the analysis that's being done and the ways that the data is being processed. So I think you're right.
[00:09:51] Unknown:
The key thing is giving access. SQL is the level playing field. There are many companies that I've worked with in the past that are very highly technical, and their engineers are capable of analyzing data without SQL, but the vast majority aren't, and SQL is a great lever there. What we see as the benefit of SQL, and it goes for both the business users and even the developers, is just the speed with which they can actually move into production, and the saving that they actually have. So we have Playtika. They have 600 developers, and one of their key things, actually just using SQL to be able to debug their messages, saves them about 30 minutes to an hour each day across 600 developers. So it's a big saving there. And we have other examples, for example, Riskfuel. They're a risk calculation startup. What they use the SQL for is to actually feed their machine learning models. So it's helping bridge the divide for everybody, actually. It's enabling the developers to be more productive as well, and also bridging the gap with a more traditional business user.
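The kind of query being described here, a developer debugging a handful of messages with SQL instead of command line tooling, might look roughly like the following. The topic and field names are placeholders, and the dialect is illustrative rather than exact Lenses SQL.

```python
# Illustrative ad-hoc debugging query a developer might run against a topic,
# instead of scripting kafka-console-consumer and grep at the command line.
# Topic and field names are placeholders; the exact dialect may differ.
DEBUG_QUERY = """
SELECT *
FROM payments
WHERE customer_id = '1234'
  AND status = 'FAILED'
LIMIT 50
"""
```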
[00:11:03] Unknown:
And then around the concept of streaming data, what are some of the main challenges that engineers face when working with applications built around this paradigm, and some of the ways that the Lenses platform helps to address those challenges?
[00:11:23] Unknown:
So most of the reasons why we actually built Lenses come down to the tooling. For example, I was on a call today with a large travel company, and their platform becomes a bottleneck because they're stuck at the command line. So it's about providing the tooling so people can actually see their data, check if that data is good, and have a simple, controlled, and governed way to actually deploy their applications. That's the key. If I'm in a smaller company, I can develop a smaller application quite quickly, but when I want to go into a serious multinational company with all the compliance and security and auditing around that, that's where it really becomes difficult to use the DIY tooling. So with Lenses, because it's all APIs, you can automate everything. We give them the visibility, whether that's just looking for data or analyzing data, but also the capability to quickly and easily deploy those flows, the SQL flows, so they can get back to the more interesting thing, maybe the machine learning model, rather than just wrangling data at the command line all the time. And then another interesting element of this is just the overall evolution
[00:12:37] Unknown:
in terms of the capabilities of streaming systems and the overall adoption of them. And I'm wondering what you have found to be some of the most notable elements of that evolution or key sort of inflection points of when things started to take off or ways that streaming is being used?
[00:12:54] Unknown:
So personally, I've always been working in a streaming world, mainly because I've been working in finance. But I think where I saw it going more mainstream was as companies started adopting Hadoop clusters as well. They see the value of it, but there's always the need for more speed, effectively. And Kafka was in a good position, because I saw it installed in a lot of environments because of the Hadoop ecosystem. So it was a natural movement on. And especially from the financial point of view, they're used to this anyway. They're used to having real time data. What they're looking for then is scaling. And we also have a lot of IoT customers now, and that becomes a natural fit as
[00:13:37] Unknown:
they want to progress their data analytics platform. And then in terms of the tooling that you're providing with Lenses, one of the pieces that I know is difficult in streaming contexts is being able to get visibility into the amount of data that's flowing through these pipelines, as well as some of the specifics of that information. So being able to do things like tracing as you would do in a regular distributed system on the software side, but being able to understand how that data is flowing throughout your system and across the different components that are producing and consuming it. I'm wondering what you have seen to be some of the useful metrics for tracking that and some of the ways that you expose that information for engineers and business users to be able to gain some visibility and understanding of how things are progressing?
[00:14:24] Unknown:
So typically, there's the standard JMX metrics, which are very, very useful. However, that only gets you so far; that tells you how an application is performing. What we're actually seeing is that the real value comes when you see how the applications fit together. I call it the application landscape at a high level. For example, we have one customer, Vortexa. They used to manage Kafka themselves, and they used to have what they called Kafka Fridays, when Kafka would just fall down on them all the time. They've now moved to MSK, but from a Lenses perspective, one of the biggest things they realized when they put Lenses on their infrastructure was that they were able to see, effectively, what a mess they'd made. They had no concept of where and how applications were linked together. We actually visualize that for you and then bring the metrics on top of that to show the application performance. But for us, it actually goes a bit beyond monitoring the pure metrics that are coming out. Where is my data? Where is it going to? Who's using it? And certainly in the UK, you know, and Europe, we have the GDPR regulation, so it's very important to be able to just say, show me everywhere that I have credit card information, and who's using that, and where is it going to. So that, I think, was a real eye opener, and it was a real eye opener for our customers, certainly for Vortexa, when they were able to suddenly visualize not just the low level details, but actually how the applications interact with each other, something that you could also get from a distributed tracing framework.
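As a concrete illustration of the "who is reading this data, and are they keeping up" questions described above, here is a small sketch using the kafka-python admin client. The broker address and topic name are placeholders, and a platform would automate and visualize this rather than leaving it to ad-hoc scripts.

```python
# Sketch: discover which consumer groups read a given topic and how far
# behind they are. Broker address and topic name are placeholders.
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "localhost:9092"
TOPIC = "payments"

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)

for group_id, _protocol in admin.list_consumer_groups():
    committed = admin.list_consumer_group_offsets(group_id)
    # Keep only partitions of the topic we care about.
    relevant = {tp: meta for tp, meta in committed.items() if tp.topic == TOPIC}
    if not relevant:
        continue
    end_offsets = consumer.end_offsets(list(relevant))
    lag = sum(
        end_offsets[tp] - meta.offset
        for tp, meta in relevant.items()
        if meta.offset >= 0
    )
    print(f"group={group_id} reads {TOPIC}, total lag={lag}")
```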
[00:16:02] Unknown:
And one of the benefits of streaming, and Kafka in particular, and just the overall PubSub paradigm, is that you can decouple the applications, and you don't necessarily need to care at the point of generation about who the downstream consumers of the messages are. But as you said, being able to gain visibility into all the ways that that information is being used is valuable for governance purposes and for debugging purposes, and to ensure that as you do change the format of the upstream data, you can alert the downstream teams so that they know they might need to make some changes to their applications. And I'm wondering, what are some of the inherent difficulties in being able to generate that overall topology of the application architecture and application environment around those streaming pipelines, and some of the ways that you're addressing that in Lenses?
[00:16:56] Unknown:
Effectively, what you need to do is you need to guard the application deployment or have a way for applications to register themselves. After that, actually, it's not that difficult, as long as you know the inputs and outputs of the thing that you're running and have a way to collect the metrics. So we hook into the standard frameworks, such as Prometheus, to collect the metrics. What we do is we provide a way that we can deploy the applications, so we actually know the inputs and the outputs. If you think about the SQL, right, we know what fields we're selecting from, joining, and writing to. It's the same with the connectors as well, and we also have various APIs and SDKs so that your own custom code can effectively register itself with Lenses. So it's relatively easy to build that up, especially when you have Kafka, as long as we're able to track the inputs and outputs. And, of course, then we can track back even to the git commit for this application, so we can actually get a full lineage. I'm very big on compliance, because I came from a financial background as well: show me what this application was that was deployed, what it was doing, and what it touched.
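A minimal sketch of the registration idea: if every application declares its inputs and outputs (plus, say, the git commit it was built from), the topology and the impact analysis fall out as a graph walk. The descriptor shape here is an assumption for illustration, not the actual Lenses SDK.

```python
# Sketch: applications self-describe their inputs/outputs; the topology is a graph.
from dataclasses import dataclass

@dataclass
class AppDescriptor:
    name: str
    inputs: list          # topics / datasets read
    outputs: list         # topics / datasets written
    git_commit: str = ""  # lineage back to the exact build

REGISTRY = [
    AppDescriptor("payments-enricher", ["payments"], ["payments_enriched"], "a1b2c3d"),
    AppDescriptor("fraud-scorer", ["payments_enriched"], ["fraud_scores"], "9f8e7d6"),
    AppDescriptor("reporting-sink", ["fraud_scores"], ["s3://reports/"], "5c4b3a2"),
]

def downstream_of(dataset: str, registry=REGISTRY, seen=None):
    """Everything transitively fed by `dataset`: the 'who is impacted if this
    changes shape or disappears' question."""
    seen = seen if seen is not None else set()
    for app in registry:
        if dataset in app.inputs and app.name not in seen:
            seen.add(app.name)
            for out in app.outputs:
                downstream_of(out, registry, seen)
    return seen

print(downstream_of("payments"))
# {'payments-enricher', 'fraud-scorer', 'reporting-sink'}
```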
[00:18:03] Unknown:
Once you have that visibility of the application topology, what are some of the other interesting things that you can do given that information that were difficult or impractical prior to having that visibility?
[00:18:19] Unknown:
So a good example was always in the banking industry: show me all the datasets that have credit card information in them and who's touching them. We have that now. We also have a thing called data policies, where we can apply masking on top of it at the presentation layer. That's a good example of showing where data is being used, especially from a data perspective. Where is a credit card number? Not only from a compliance point of view; maybe I'm a data scientist and I'm interested in using it. Where is it? Where's it going to? What applications are already processing it? Maybe I can leverage that and piggyback on that data as well. So we help create a catalog, and we help create a shareable data experience, bound with the ethics and the governance. But there are other aspects we found as well, even in our own internal systems: if we're reading from a database, what's the impact if this database goes down? Which downstream consumers are going to be impacted? So once you have this rich data catalog, in a way, you open up a lot of possibilities in the questions you can ask it, either from a discovery perspective to share data or even from a compliance perspective as well.
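To picture the data-policy idea (masking applied at the presentation layer rather than in the stored data), here is a rough sketch. The policy format and field names are assumptions, not the Lenses policy syntax.

```python
# Sketch: presentation-layer masking driven by a simple field-name policy.
import re

POLICIES = {
    # field-name pattern -> masking function
    r"(card_number|pan)$": lambda v: "****-****-****-" + str(v)[-4:],
    r"email$": lambda v: "***@" + str(v).split("@")[-1],
}

def mask_record(record: dict) -> dict:
    """Return a copy of the record with sensitive fields masked for display."""
    masked = {}
    for key, value in record.items():
        for pattern, fn in POLICIES.items():
            if re.search(pattern, key, re.IGNORECASE):
                value = fn(value)
                break
        masked[key] = value
    return masked

print(mask_record({"customer_id": 42,
                   "card_number": "4111111111111111",
                   "email": "jane@example.com"}))
# {'customer_id': 42, 'card_number': '****-****-****-1111', 'email': '***@example.com'}
```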
[00:19:30] Unknown:
And then another complicated aspect of working with streaming data and large volumes of data in general is that of the development cycle, in terms of ensuring that what you're writing now is going to, one, work in the production environment and, two, scale effectively, because of the fact that it's difficult to replicate the production environment onto a developer's laptop or into a QA environment for being able to do some vetting of it. And I'm wondering what you have found to be some of the useful strategies for creating that development workflow so that it's useful and effective without being cumbersome, so people can get their local environment set up to work together and ensure that the message structures that they're working with locally are going to match what they are going to be consuming once they get deployed?
[00:20:22] Unknown:
Yeah. So what we have is actually an all-in-one Docker, so that's a great test harness for developers to work with. And the nice thing about the Kafka Connect framework and the SQL processors that we have is that it's all config, and if you couple that with the schema registry as well and have Avro in there, then you've got that contract. You've got that API contract for the data as well. So promoting between environments is relatively easy. It's a config file, a YAML file, that we can put in GitHub. We can do a pull request onto the master branch, and Lenses can actually apply the desired state onto the Kafka cluster in whatever environment that may be. We see people use this to go from on prem into the cloud, or to any cluster; it doesn't really matter whether it's on prem to the cloud or from one provider to another. So that's where the DataOps aspect also fits into the DevOps aspect. I can define my entire landscape as configuration and have that applied by Lenses using a GitOps approach into another environment, and that fits very well with a standard developer's workflow. It fits into the CI/CD path that we have. We're even looking to actually wrap the Lenses CLI that we have into Kubernetes operators so we can hook into the Kubernetes toolchain as well.
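The "API contract for the data" point can be made concrete with a pre-promotion compatibility check against a schema registry. The sketch below uses the Confluent Schema Registry REST compatibility endpoint; the registry URL, subject name, and schema are placeholders, and you should treat the details as an assumption to verify against your registry's documentation.

```python
# Sketch: CI step that refuses to promote a change if the new Avro schema is
# not compatible with what is already registered for the subject.
import json
import sys
import requests

REGISTRY = "http://schema-registry.example.com:8081"   # placeholder
SUBJECT = "payments-value"                              # placeholder

new_schema = {
    "type": "record",
    "name": "Payment",
    "fields": [
        {"name": "customer_id", "type": "string"},
        {"name": "amount", "type": "double"},
        # New optional field with a default: normally a compatible change.
        {"name": "currency", "type": "string", "default": "EUR"},
    ],
}

resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": json.dumps(new_schema)},
)
resp.raise_for_status()

if not resp.json().get("is_compatible", False):
    sys.exit(f"Schema change for {SUBJECT} is not compatible; aborting promotion.")
print("Schema is compatible; safe to promote.")
```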
[00:21:38] Unknown:
And another interesting piece from what you're saying there too is the idea of CI and CD for these streaming applications, and also going further into some of the DevOps paradigms. I'm wondering what your thoughts are on the viability of doing things like canary deployments, to ensure that as you're deploying something you can do some sampling to see, is this producing the outputs that I expected, and doing some, you know, maybe feature flagging to ensure that as you roll things out they're not going to break things downstream; you can do some validation, maybe A/B testing of it. And just some of the complexities that exist in that CI/CD and validation phase of building these types of applications.
[00:22:18] Unknown:
So I think with the GitOps approach, and everything being an API in Lenses, we allow people to build out whatever strategy they have there. Again, actually being able to deploy something and sample the data, Lenses also gives you that. You can even do that in an automated fashion, because everything, even the SQL browsing capability we have, is via an API. So deploying, checking, and switching, you know, to be honest, I haven't really thought too much about that, but we enable it by just fitting into standard CI/CD development practices.
So what I actually like, and what I was doing not so long ago, is this type of workflow. Even if I'm, let's say, a Python developer, I could sit in Jupyter, and I can query something inside of Kafka, and I can then go, okay, I'm happy with that, let's deploy SQL processors to do something. I'm happy with the output of that, because I can have that streamed back to my Jupyter notebook, and then maybe deploy a connector to take that off to my environment. And, again, I can version control all that. It's config as code, so I can construct my CI/CD pipeline however I want to. And then in terms of the Lenses platform itself, can you dig a bit more into how it's designed and implemented and some of the evolution that it's gone through since you first began working on it? Yeah. So actually, if you look at what Lenses is, it's actually just a JVM app. It's written in Scala, and underneath the hood, in a way, it's the standard, regular Kafka clients for the consumers and the producers. Obviously, you know, we have some secret sauce in there in how we build out and run the SQL engine, but that's all it is. So it's actually quite a small, lightweight application, and there's a small backing embedded data store we have with it as well. So it just needs connectivity to the cluster. In terms of the evolution, it started off being very Kafka centric. We're trying to pull it away from Kafka, because we maybe want to swap in Pulsar or Redis or any other streaming technology in the future. The biggest evolution that's going on now, actually, is the SQL engine we have. We're about to release a new version of the SQL engine that really opens up what we can do with it, to allow us to plug in other systems if we want to and to bring in Python support for it as well. So pushing beyond Kafka, I think, is where we're moving to.
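Returning to the "deploy it, sample the output, then decide" workflow from a moment ago, here is a rough sketch of the checking half using kafka-python. The broker, topic name, and the expectation being validated are placeholders, and the actual deploy and rollback calls (through whatever API or CLI you use) are omitted.

```python
# Sketch: after deploying a new streaming flow, sample its output topic for a
# short window and decide whether to keep it or roll it back.
import json
from kafka import KafkaConsumer

def sample_looks_healthy(topic: str, n: int = 100, timeout_ms: int = 10_000) -> bool:
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="localhost:9092",
        auto_offset_reset="latest",
        consumer_timeout_ms=timeout_ms,   # stop iterating when the topic goes quiet
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    checked = 0
    try:
        for record in consumer:
            # Example expectation: the new flow must always populate `currency`.
            if "currency" not in record.value:
                return False
            checked += 1
            if checked >= n:
                break
    finally:
        consumer.close()
    return checked > 0   # we saw data, and everything we saw passed the check

if sample_looks_healthy("payments_enriched_v2"):
    print("Canary looks good; promote the new flow.")
else:
    print("Canary failed; roll back to the previous flow.")
```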
[00:24:50] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS based monitoring and analytics platform for cloud scale infrastructure, applications, logs, and more. Datadog uses machine learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt. And in terms of being able to support those different engines, I watched a presentation that you gave recently at the Redis conference about being able to run SQL across Redis Streams. And I know that you also are working on Pulsar support. I'm curious, what are some of the ways that you're approaching the modularity of your system to support those different back end engines, and some of the complexities that arise as far as trying to establish a lowest common denominator of feature parity while still having some capability of being able to take advantage of the specifics of those underlying engines?
[00:26:01] Unknown:
So from a Lenses feature perspective, the hardest part there actually is naming conventions. For example, as we push out and build the data catalog, we're moving away from the notion of topics, because we also now can query Elasticsearch. So finding terminology to model the datasets inside of Lenses is a challenge. But in terms of the actual way that Lenses is implemented, I mean, at the core of it is the SQL engine, and the SQL engine itself is pluggable, certainly for the browsing aspect of it. So it's very easy for us to plug in another system, because internally we call them drivers. It's not that hard for us to do with the way that the guys have architected it. Now, I'll be honest, sometimes their code is way beyond my remit, but it's built on a modular level, so we can plug different systems that we want into Lenses. So for example, we're expanding the SQL browsing not only to Elasticsearch, which we have now; we have Pulsar support that's not turned on yet, very similar to what you saw with Redis Streams. We're pushing that out to every system that we could potentially connect to as well. The SQL engine, we were making that modular so we could run it on top of Pulsar. However, there seems to be a consolidation, I would say, towards the Kafka APIs. For example, StreamNative now have the Kafka-on-Pulsar bridge, so actually they're helping us as well, because we don't have to implement that on the streaming side ourselves. Even such things as Azure Event Hubs, right? It's got a Kafka API. Lenses actually does work against Azure Event Hubs. Yeah. I was gonna ask about whether you were using that compatibility layer that they recently introduced
[00:27:49] Unknown:
and also what are some of the other
[00:27:53] Unknown:
aspects of the ecosystem that you're hooking into for being able to process on top of that. But, yeah, I was pretty excited when I saw the introduction of that protocol compatibility layer in Pulsar, and I'm excited to see where they go with that piece of things. Yeah. I mean, they're very eager for us to work on it, right? Because I think Pulsar is actually a very good technology. The problem is the ecosystem around it. We help build out the ecosystem around Kafka, and as soon as Pulsar gets that, then I think it's a natural fit. Even the client I was talking to today, they're not looking for vendor lock in or technology lock in towards Kafka, so they're also excited about the possibility of Pulsar. Now, we already have it, right? Because the SQL engine's split into two parts.
One part is a more table based engine, which allows the debugging and the querying, like in a relational database format. That's very, very pluggable. We were going to have to do a bit of work to extend the SQL engine to run over Pulsar, but with the bridge, we're going to evaluate now whether we even need to do that work. There's even Vectorized.io; they've got another Kafka API compatible replacement out, which they claim is 10 times as fast. So, you know, we're also looking to see if we can do that. It's the same, I think, with Kubernetes: you end up coming back to a battle of the APIs, and one API becomes dominant, which I think also helps everyone. Yeah. It's definitely an interesting trend that I've seen in a few different places
[00:29:19] Unknown:
of different technology implementations adopting the API of the dominant player in the space, one of the notable ones being S3, where all of the different object stores are adding a compatibility layer. In the Python ecosystem, a lot of different projects are adopting the NumPy API while swapping out some of the specifics of the processing layer. And then in the streaming space, they're working on coalescing around Kafka, but they're also working on the open streaming specification to try and consolidate the specifics of how you work with those systems, so that they can innovate on the specifics of how the system actually functions under the hood. Yeah. So
[00:29:56] Unknown:
the ecosystem's very important; that's where we're seeing the problems. So another example, not necessarily related to Lenses: I still have friends who work in the high frequency trading world. They're adopting Pulsar for the modeling capabilities of it, modeling option derivatives; there are lots of them, millions and millions of them. So Pulsar gives them that ability, but they're using it in their trading cell. They didn't want to open it up to the rest of the company for the day to day integration parts of it, because the ecosystem's not around it as well. So they actually use Kafka for that. They're looking at Lenses to put on there, but for the more bleeding edge work that they're doing, Pulsar's a great fit, and they're also very excited about the bridge as well, because they can bridge that gap.
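Going back to the pluggable "drivers" described for the browsing side of the SQL engine, the modularity can be pictured roughly as an interface with one implementation per back end. This is purely illustrative; the real implementation is JVM code and its interfaces will differ.

```python
# Sketch of a pluggable "driver" interface for browsing different systems
# through one SQL front end. Illustrative only.
from abc import ABC, abstractmethod
from typing import Iterable

class BrowseDriver(ABC):
    """One back end (Kafka, Elasticsearch, Redis Streams, Pulsar, ...)."""

    @abstractmethod
    def datasets(self) -> Iterable[str]:
        """List the datasets (topics, indices, streams) this driver exposes."""

    @abstractmethod
    def sample(self, dataset: str, limit: int = 100) -> Iterable[dict]:
        """Return up to `limit` records for table-style browsing."""

class KafkaDriver(BrowseDriver):
    def datasets(self):
        return ["payments", "payments_enriched"]                 # stub data

    def sample(self, dataset, limit=100):
        return [{"offset": i, "value": "..."} for i in range(min(limit, 3))]

class ElasticsearchDriver(BrowseDriver):
    def datasets(self):
        return ["payments-2020.06"]                              # stub data

    def sample(self, dataset, limit=100):
        return [{"_id": "abc", "_source": {"amount": 12.5}}]

# The SQL layer only ever talks to the interface, so adding a new system means
# adding a driver, not changing the engine.
DRIVERS = {"kafka": KafkaDriver(), "elasticsearch": ElasticsearchDriver()}
```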
[00:30:42] Unknown:
So beyond the specifics of some of the monitoring capabilities that you're building, and the SQL layer to bring everybody on board with a common language for working with data, what are some of the other aspects of operationalizing data, and bringing more people into working on it together and collaborating, that you're looking at, either within Lenses or that you think we should be adopting generally within the industry, to ensure that it's easier to build value more rapidly from the data that's being collected and consumed?
[00:31:15] Unknown:
So one of the things we're looking at is the data catalog. Again, the SQL engine forms a big part of that. It's the discovery of data, and how fast can I share my data experience? I'm actually quite a big fan of the data mesh architecture: delivering data as a product, end to end. So what we're looking at is how we can use the SQL engine not only for the processing and the visibility, but also to build up that rich data catalog, so we can figure out where data is and share it. But one of the big things we're working on is that you've got to do that with some form of compliance. So we actually put a multi tenancy system on top of Kafka; it doesn't have one, okay? Pulsar does. So we can safely onboard people onto these systems, so they can use the data catalog to actually drive back to building a data product. And that's also what I'm seeing. There's a big project from IBM called Egeria around the data catalog, so we can actually make use of all this data that's out there. For me, it always comes back to this: use this tech intensity, the commodity software that's being built, the commodity infrastructure, to go back to actually making use of the data. What am I doing with this data? Why am I here? You know, I'm a little bit controversial sometimes, because I say you're not a technology company, you're a company that's delivering a service, and technology is enabling it, actually. And so in terms of
[00:32:38] Unknown:
the Lenses platform itself, what are some of the most interesting or unexpected or innovative ways that it's being used or insights that you've seen people gain from it as they have adopted it for their streaming architectures?
[00:32:52] Unknown:
Well, what we actually see is that people start off with Kafka and Lenses first, just to get visibility on what's happening with Kafka. And then as they mature, they move into different aspects, using the SQL engine as well to feed different machine learning models. I was speaking to one client, and what they're looking at doing is actually optimizing how they're firing lasers at tin droplets for microchip processing, and they want to feed that back in a real time loop so they can optimize the shape of these tin droplets. So I think there's a wide range of industries we're in. We have everything from standard ETL work to more machine learning driven stream processing. We have IoT, especially in Canada, tracking cows. You know? So there's a range of technologies and use cases.
But, you know, I always go back to Babylon Health, because I think they're using Lenses to help build that tech intensity, to feed their AI models so they can build this health care for everyone across the world. And I think that's always a great use case. You know, the ability to actually run these SQL engines and to feed data into their AI models is pretty cool, with a great outcome, especially in the current climate. And one other aspect of the
[00:34:14] Unknown:
streaming environment is the, at least perceived, dichotomy between streaming processes and batch processes, where a lot of the ETL workflows are operating in more of a batch mode, and then a lot of people are moving to using things like Kafka and Pulsar for streaming ETL. I'm wondering what your thoughts are on some of the benefits of that dichotomy of those different approaches, and maybe some of the ways that they, you know, can and should merge together and just treat everything as a continuous stream. Well, I
[00:34:45] Unknown:
think everything is a stream, right? You know, I think that's been well publicized, and batch is just a subset. The reality is, though, especially as you move into large organizations, everything is still based around a batch system. Their upstream systems are batch, right? So it's very hard to completely move from one to the other. For example, when I was at a Dutch energy company, they couldn't get past asking, where is my FTP? Where's my CSV file? No matter how much we pushed them towards a streaming solution, you know, standing up a Kafka connector to pull in those files from an FTP server and stream them into Kafka, the upstream system is still sending it in batch. And until you actually get the entire landscape moving to a streaming world, you're still going to have these two versions of doing it. It's especially prevalent in the financial industry, where they try to do risk calculations and they need the whole set of data at the end of the day. So there's still this infrastructure layer around it that makes it very hard for some companies that are not streaming native to actually take the plunge and completely move away. I personally see no problem with having the two working together. Right? If you need to have a batch driven world, and you push everything into Kafka and have a trigger or event on the back of that that does something else, that's fine. If you look at the data mesh architecture, you know, push it to something that is suitable for your use case. Kafka's only one part of the solution; it's part of a bigger platform. So I have no problem with them working side by side. It's just very, very difficult to change. If you've got a mainframe and it's spitting out a file, you're not going to get the mainframe changed, so you've got this batch file, and yes, you can feed it into a streaming system. But a lot of the architecture and a lot of the business is driven around batch processing, end of day reporting, especially, you know, in the trading world. What's the end of day risk? What are the end of day positions? That's always the bottom line. They do have intraday trading things, but it's always end of day that's legally binding. So until the business actually shifts as well, it's going to be hard to completely move to a streaming
[00:37:02] Unknown:
architecture. And then another note on the idea of streaming ETL is the thought of workflow orchestration, where we have tools such as Dagster and Airflow that are built around this batch model of take this thing, process it, do the next step. And then there's the concept, as we were discussing earlier, of the topology of applications that are interacting with the streaming system. I'm wondering what you have seen to be some of the crossover of the workflow management for data processing and how that can coexist or function natively with stream processing?
[00:37:39] Unknown:
So what I've generally seen is that the batch processing, whether it's in Dagster or anything else, can be triggered off events, and, you know, they're coming from Kafka. It may be a SQL processor that's running in real time that spits out a result that triggers a job to do something. That's the general pattern we see. So that's why I say they can coexist, and if you need to have a batch process that's triggered off an event, okay, well, the streaming architecture can help you do that, and Kafka can help you do that, but you're reacting to an event to trigger another process. This is quite common. You know, I could write stuff into S3, and there's a Lambda on the other side that maybe triggers an email when a new file rolls in as well. So they live side by side. If I'm writing to S3, I've got to batch it into a file anyway for performance. So it's never clear cut that you're all in on streaming and everything has to be streaming. I think there's always a place for both.
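A minimal sketch of the "batch triggered off an event" pattern described here: a consumer waits for a day-end marker event on a topic and then kicks off the batch job. The topic name and event fields are placeholders.

```python
# Sketch: streaming and batch coexisting. A real-time flow publishes a marker
# event (e.g. "trading day closed"); this consumer reacts by launching the
# end-of-day batch job.
import json
from kafka import KafkaConsumer

def run_end_of_day_batch(business_date: str) -> None:
    # Placeholder for the actual batch work, or a call into an orchestrator
    # such as Airflow or Dagster to trigger the corresponding job.
    print(f"running end-of-day risk batch for {business_date}")

consumer = KafkaConsumer(
    "trading-day-events",                     # placeholder topic
    bootstrap_servers="localhost:9092",
    group_id="eod-trigger",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    event = record.value
    if event.get("type") == "DAY_CLOSED":
        run_end_of_day_batch(event.get("business_date", "unknown"))
```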
[00:38:35] Unknown:
In your own work on the Lenses platform and using it within your own systems, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:38:52] Unknown:
I would actually say discovering data. For example, we collect telemetry for Lenses as well as for our own systems, and it was a bit of an eye opener when we were trying to figure out how much data we were collecting, to actually pull it into Lenses so we could use Lenses internally. It really opened my eyes to the challenges of being able to discover and manage data at scale. I'm not saying, you know, we're anywhere near an investment banking scale, but it's interesting how quickly you accumulate different silos of data, and being able to discover them quickly as well. So this is all feeding back into the product, based on our own experience.
[00:39:38] Unknown:
And then for people who are interested in gaining better visibility into their systems or having a DataOps flow around streaming, what are some of the cases where Lenses might be the wrong choice and they might be better suited by looking to other tools or building them in house? So I think Lenses can generally help, right? Now, I think the question is, do you need Kafka?
[00:39:58] Unknown:
Do you need a distributed streaming platform or not? That's more what I would say there. Now, these come with a cost; it doesn't matter whether it's Pulsar or Kafka, there is a cost to running these applications, and then you have to have all the monitoring and tooling around them, which is why we built Lenses. So I think it's more a question of, do you really need Kafka? Do you really need the scale? You know, we've had potential clients that maybe had one message every hour. Okay, you probably don't need Kafka, right? Maybe you'd be better off with a traditional JMS message broker, something along those lines. So if you do need Kafka, if you do need the scale of Kafka, then we help. But I'm a bit of a pragmatist about it. You know, use the right technology.
If you don't need a big data solution, don't pick a big data solution. And a lot of people don't. I've seen this. I saw it at one company, not necessarily around Kafka: they were adamant, and maybe for a bit of resume plus plus, they were like, no, no, we need a Spark cluster, we've got lots of data. They didn't have lots of data, and they were already familiar with SQL Server; they would have been far better off actually just putting SQL Server in there. So do you need Kafka? Do you need distributed systems?
[00:41:17] Unknown:
That's, I think, ultimately the choice there. And as you look to the future of the Lenses platform, the streaming ecosystem in general, and the DataOps principles that you're trying to operate around, what are some of the things that you have planned for both the technical and business aspects of what you're working on?
[00:41:36] Unknown:
So we are looking at the cloud. You know, we believe in managed solutions, because we're all about focusing on the data and the commodity technology that they offer. I can go to the cloud and get Kafka and Kubernetes now, with all the databases and the key vaults around them that I need. So we're looking at how we further integrate with the cloud as well. We're also continuing to build out the data catalog aspect of it. And on the SQL side, we're continuing to push out what systems we can query, and also looking at what we call SQL Lambdas, although I'm not convinced that's still the correct term: pushing more into the SQL engine so that we can connect and write to other systems. For example, Kafka Connect is great, but do I want to have to stand up a Kafka Connect cluster when I don't need one? Maybe my SQL engine can actually write directly out to Elasticsearch or anything else. So that's where we're looking, along with a lot more we still think we can do around the governance side of it: approval flows, for example, making sure that we get data stewardship in there, and the governance aspects. So that's our general
[00:42:50] Unknown:
focus of where we're going. Are there any other aspects of the Lenses platform itself or DataOps principles in general that we didn't discuss yet that you'd like to cover before we close out the show?
[00:43:06] Unknown:
No, I think the most important thing for us is, you know, having that tooling and the visibility around it. That's where I've seen the success. And I think we have to really start thinking about what we want to do with the data and use the technology to get there. And trying out Lenses is pretty simple. We have a free Box; you can go to lenses.io
[00:43:27] Unknown:
and download the Box as well. We have a hosted online portal if you want to use that. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Visibility.
[00:43:46] Unknown:
It's always been visibility, I would say, for me. I went from using SQL Server, where I was moving more data than I ever did actually with the big data technologies, and having that visibility is the killer. You know, when I was at an investment bank, you try and give the head of trading at a tier 1 investment bank the command line and say, here you go, go and look at your data. It doesn't work. So if you really want to make a success of your platform, you have to provide that visibility, not just into the infrastructure,
[00:44:15] Unknown:
but into the applications and into the data, and get that in the hands of the people who understand it. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing on Lenses and some of the ways that you're helping to empower people who are building streaming infrastructures to gain some of that visibility into the applications that they're building. So thank you for all the effort you've put into that, and I hope you enjoy the rest of your day. Yeah. Okay. Thank you very much for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Call for Contributions
Interview with Andrew Stevenson: Background and Career
Introduction to Lenses and Its Origins
Defining DataOps and Its Importance
Custom SQL Engine and Its Role in Lenses
SQL as a Universal Language for Data Collaboration
Challenges and Solutions in Streaming Data Operations
Evolution and Adoption of Streaming Systems
Metrics and Visibility in Streaming Pipelines
Application Topology and Data Governance
Development Workflow for Streaming Applications
Design and Evolution of the Lenses Platform
Supporting Multiple Streaming Engines
Operationalizing Data and DataOps Principles
Streaming vs. Batch Processing
Workflow Orchestration in Streaming Contexts
Lessons Learned from Developing Lenses
When Lenses Might Not Be the Right Choice
Future Plans for Lenses and DataOps
Biggest Gaps in Data Management Tooling