Summary
The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye-opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch-oriented workflows.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Simon Crosby about Swim.ai, a data fabric for the distributed enterprise
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Swim.ai is and how the project and business got started?
- Can you explain the differentiating factors between the SwimOS and Data Fabric platforms that you offer?
- What are some of the use cases that are enabled by the Swim platform that would otherwise be impractical or intractable?
- How does Swim help alleviate the challenges of working with sensor oriented applications or edge computing platforms?
- Can you describe a typical design for an application or system being built on top of the Swim platform?
- What does the developer workflow look like?
- What kind of tooling do you have for diagnosing and debugging errors in an application built on top of Swim?
- Can you describe the internal design for the SwimOS and how it has evolved since you first began working on it?
- For such widely distributed applications, efficient discovery and communication is essential. How does Swim handle that functionality?
- What mechanisms are in place to account for network failures?
- Since the application nodes are explicitly stateful, how do you handle scaling as compared to a stateless web application?
- Since there is no explicit data layer, how is data redundancy handled by Swim applications?
- What are some of the most interesting/unexpected/innovative ways that you have seen the Swim technology used?
- What have you found to be the most challenging aspects of building the Swim platform?
- What are some of the assumptions that you had going into the creation of SwimOS and how have they been challenged or updated?
- What do you have planned for the future of the technical and business aspects of Swim.ai?
Contact Info
- Wikipedia
- @simoncrosby on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Swim.ai
- Hadoop
- Streaming Data
- Apache Flink
- Apache Kafka
- Wallaroo
- Digital Twin
- Swim Concepts Documentation
- RFID == Radio Frequency IDentification
- PCB == Printed Circuit Board
- Graal VM
- Azure IoT Edge Framework
- Azure DLS (Data Lake Storage)
- Power BI
- WARP Protocol
- Lightbend
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And listen, I'm sure you work for a data driven company. Who doesn't these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is going to fall over at some point? Well, you've got to talk to the folks over at intermix.io. They have built the missing Amazon Redshift console. It's an amazing analytics product for data engineers to find and rewrite slow queries, and it gives actionable recommendations to optimize data pipelines.
WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council.
Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey. And today, I'm interviewing Simon Crosby about Swim.ai, the data fabric for the distributed enterprise. So, Simon, can you start by introducing yourself?
[00:02:28] Unknown:
Hi. I'm Simon Crosby. I am a CTO, I guess, of long duration. I've been around for a long time, and it's a privilege to be with the Swim folks who have been building this fabulous platform for fast streaming data for about 5 years.
[00:02:50] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:53] Unknown:
Well, I have a PhD in applied mathematics and probability. So I am kind of not a data management guy. I'm an analysis guy. I like what comes out of, you know, streams of data and what inference you can draw from it. So my background is more on the analytical side. And then along the way, I started picking up how to build big infrastructure for IT.
[00:03:22] Unknown:
And now you have taken up the position as CTO for swim.ai. I'm wondering if you can explain a bit about what the platform is and how the overall project and business got started.
[00:03:34] Unknown:
Sure. So here's the problem. We're all reading all the time about these wonderful things that you can do with machine learning and streaming data and so on. It all involves cloud and other magical things. And in general, most organizations just don't know how to make head or tail of that, for a bunch of reasons. It's just too hard to get there. So if you're an organization with assets that are churning out lots of data, and that could be a bunch of different types, you probably don't have the skill set in house to deal with a vast amount of information. And we're talking about boundless data sources here, things that never stop. And so you need to deal with the data flow pipelines, to deal with the data itself, to deal with the learning and inference that you might draw from that, and so on.
And so enterprises do have a huge skill set challenge. There is also a cost challenge, because today's techniques related to drawing inference from data in general revolve around, you know, large expensive data lakes, either in house or perhaps in the cloud. And then, finally, there's a challenge with the timeliness within which you can draw an insight. And most folks today believe that you store data, and then you think about it in some magical way, and you draw inference from that. And we're all suffering from the Hadoop, Cloudera, I guess, after effects. And, really, this notion of storing and then analyzing needs to be dispensed with in terms of fast data. Certainly, for boundless data sources that will never stop, it's really inappropriate.
So when I talk about boundless data, we're talking about data streams that just never stop. And we can talk about the need to derive insights from that data on the fly, because if you don't, something will go wrong. So it's of the type that would stop your car before you hit the pedestrian in the crosswalk. That kind of stuff. So for that kind of data, there's just no chance to, you know, store it down on a hard disk and then learn.
[00:06:16] Unknown:
And how would you differentiate the work that you're doing with the Swim.ai platform and the SwimOS kernel from things that are being done with tools such as Flink, or other streaming systems such as Kafka, which has now got capabilities for being able to do some limited streaming analysis on the data as it flows through, or also platforms such as Wallaroo that are built for being able to do stateful computations
[00:06:43] Unknown:
on data streams? So first of all, there have been some major steps forward, and anything we do, we stand on the shoulders of giants. Let's start off with distinguishing between the large enterprise skill set that's out there and the cloud world, and, you know, all the things you mentioned live in the cloud world. So at that rough first distinction, most people in the enterprise, when you said Flink, wouldn't know what the hell you were talking about. Okay? Similarly, Wallaroo or anything else. They just wouldn't know what you're talking about. And so there's a major problem with the tools and technologies that we have built for the cloud, really for, I guess, a lot of cloud native applications, and the majority of enterprises who are stuck with legacy IT and application skill sets, and they're still coming up to speed with the right thing to do. And to be honest, they're getting over the headache of Hadoop.
So then if we talk about the cloud native world, there is a fascinating distinction between all of the various projects which have started to tackle streaming data, and there has been some major progress made there, and I'm delighted to point out Swim being one of them. And I'm happy to go into each one of those projects in detail as we go forward. The key point being that, first and foremost, the large majority of enterprises just don't know what the heck to do. And then within your specific offerings,
[00:08:27] Unknown:
there is the data fabric platform, which you're targeting for enterprise consumers, and then there's also the open source kernel of that in the form of SwimOS. I'm wondering if you can provide some explanation as to what the differentiating factors are between those 2 products and the sort of decision points around when somebody might want to use one versus the other? Yeah. Let's cut it first at the
[00:08:54] Unknown:
distinction between the application layer and the infrastructure needed to run a large distributed data flow pipeline. And so for Swim, all of the application layer stuff, everything you'd need to build an app, is entirely open source. Some of the capabilities that you want to run a large distributed data flow pipeline are proprietary, and that's really just because, you know, we're building a business around this. We plan to open source more and more features over time. And then as far as the primary use cases
[00:09:33] Unknown:
that you are enabling with the Swim platform and some of the different ways enterprise organizations are implementing it, what are some of the cases where using something other than Swim, either the OS or the data fabric layer, would be either impractical or intractable if they were trying to use more traditional approaches such as Hadoop, as you mentioned, or a data warehouse and more batch-oriented workflows? So let's start off describing what Swim does. Can I do that? That might help. In our view, it's our job
[00:10:08] Unknown:
to build the pipeline and indeed the model from the data. Okay? So Swim just wants data, and from the data, we will build, automatically build, this stateful data flow pipeline. And, indeed, from that, we will build a model of arbitrarily interesting complexity, which allows us to solve some very interesting problems. Okay? So the Swim perspective starts with data, because that's where our customers' journey starts. They have lots and lots of data. They don't know what to do with it. And so the approach we take in Swim is to allow the data to build the model. Now you would naturally say that's impossible. In general, what's required is some ontology at the edge which describes the data. You could think of it as a schema, in fact, basically, to describe what data items mean in some sort of useful sense to us as modelers.
But then given data, Swim will build that model. So let me give you an example. Given a relatively simple ontology for traffic, for traffic equipment, so pedestrian lights, the loops in the road, the lights, and so on, Swim will build a model, which is a stateful digital twin, as it were, for every sensor, every source of data, which is running concurrently in some distributed fabric and processes its own raw data and statefully evolves. Okay? So simply given that ontology, Swim knows how to build stateful, concurrent little things we call web agents.
Actually, I'm using that term, I guess, the same as digital twin. And these are concurrent things which are going to statefully process raw data and represent it in a meaningful way. And the cool thing about that is that each one of these little digital twins exists in a context, a real-world context that Swim is going to discover for us. So for example, an intersection might have 60 to 80 sensors, so there's a notion of containment. But, also, intersections are adjacent to other intersections in the real-world map, and so that notion of adjacency is also a real-world relationship.
And in Swim, this notion of a link allows us to express the real-world relationships between these little digital twins. And linking in Swim has this wonderful additional property, which is to allow us to express essentially a sub. So in Swim, there is never a pub, but there is a sub. And if something links to something else, say, if I link to you, then I get to see the real-time updates of the in-memory state stored by that digital twin. So digital twins link to other digital twins courtesy of real-world relationships, such as containment or proximity.
We can even do other relationships, like correlation. Linking allows twins to share data, and sharing data allows interesting computational properties to be derived. For example, we can learn and predict. Okay? So job one is to define the ontology. Swim then goes and builds a graph, a graph of digital twins, which is constructed entirely from the data. And then the linking happens as part of that, and that allows us to then construct interesting computations. Is that useful?
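To make the web agent idea concrete, here is a minimal, self-contained Java sketch of the pattern Simon describes: one stateful object per data source that ingests its own raw readings, plus "links" that let peers observe its in-memory state as it changes. The class and method names are illustrative assumptions, not the actual SwimOS API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Illustrative sketch of a "web agent" / digital twin: a stateful, concurrent
// object that mirrors one real-world data source and shares state over links.
// These names are hypothetical; this is not the SwimOS API.
public class IntersectionTwin {

    public enum Phase { RED, AMBER, GREEN }

    private volatile Phase phase = Phase.RED;

    // Links behave like subscriptions: anything that links to this twin
    // receives every in-memory state change ("a sub without a pub").
    private final List<Consumer<Phase>> links = new CopyOnWriteArrayList<>();

    // Another twin (or an operator) links to observe real-time state.
    public void link(Consumer<Phase> observer) {
        links.add(observer);
        observer.accept(phase);           // a new link sees the current state immediately
    }

    // Raw sensor input arrives here; the twin evolves statefully and
    // streams the update to every linked peer.
    public void ingest(Phase observed) {
        this.phase = observed;
        for (Consumer<Phase> link : links) {
            link.accept(observed);
        }
    }

    public static void main(String[] args) {
        IntersectionTwin a = new IntersectionTwin();
        IntersectionTwin b = new IntersectionTwin();
        // Adjacency in the real world becomes a link between twins.
        a.link(phase -> System.out.println("neighbour A is now " + phase));
        b.link(phase -> System.out.println("neighbour B is now " + phase));
        a.ingest(Phase.GREEN);
        b.ingest(Phase.RED);
    }
}
```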
[00:14:46] Unknown:
Yes. That's definitely helpful to get an idea of some of the use cases and some of the ways that the different concepts within Swim work together. Can you describe what a sort of conceptual architecture would be for an application that would utilize Swim?
[00:15:03] Unknown:
So the key thing here is, let's say I'm talking about an application, and the application is to predict the future traffic in any city, or what's going to happen in the traffic area. Right? Now I could do that for a bunch of different cities. What I can tell you is I need a model for each city. And there are two ways to build a model. One way is I get a data scientist to come in and build my model. Maybe they train it and do a whole bunch of other things. And I'm gonna have to do this for every single city where I want to use this application. The other way to do it is to build the model from the data, and that's the Swim approach. So what Swim does is, simply given the ontology, build these little digital twins, which are representatives of the real-world things, get them to statefully evolve, and then link them to other things to represent real-world relationships.
And then suddenly, hey presto, you have built a large graph, which is effectively the model that you would have had to have a human build otherwise. Right? So it's constructed in the sense that in any new city you go to, this thing is just gonna unbundle, and, given just a stream of data, it will build a model which represents the things that are the sources of data and their physical relationships. Does that make sense?
[00:16:39] Unknown:
Yeah. And I'm wondering if you can expand upon that in terms of the type of workflow that a developer who is building an application on top of Swim would go through as far as identifying what those ontologies are and defining how the links will occur as the data streams into the different nodes in the SwimOS graph.
[00:17:01] Unknown:
So the key point here is that we think we can build, I don't know, like 80% of an app from the data. That is, we can find all of the structural properties of relevance in the data, and then let the application builder drop in what they want to compute. And so let me try and express this slightly differently. Job one, we believe, is to build a model of the stateful digital twins, all of which mirror their real-world counterparts. So at all points in time, their job is to represent the real world as faithfully and as close to real time as they can, in a stateful way, which is of relevance to the problem at hand. Okay? So rather than voltages, I'm gonna have a red light. Okay? Something like that. And the first problem is to build this set of digital twins, which are interlinked, which represent the real world being studied.
Okay? And it's important to separate that from the application layer component of what you want to compute from that. So, frequently, we see people making the wrong decision. That is, hard-coupling the notion of prediction or learning or any other form of analysis into the application in such a way that any change requires programming, and we think that that's wrong. So job one is to have this faithful representation of a real-time world in which everything evolves its own state whenever its real-world twin evolves, and it evolves statefully.
And then the second component of this, which we do on a separate time scale, is to inject operators, which are going to then compute on the states of those things at the edge. Right? So we have a model which represents the relationships between things in the real world. It's attempting to evolve as close as possible to real time in relationship to the real-world twin, and it's reflecting its links and so on. But the notion of what you want to compute from it is separate from that and decoupled. And so the second step, which is building an application right here, right now, is to drop in an operator which is going to compute a thing from that.
So you might say, cool, I want every digital twin of every intersection to be able to learn from its own behavior and predict. That's one thing. Or you might say, I want to compute the average wait time of every car in the city. That's another thing. So the key point here is that computing from these rapidly evolving world views is decoupled from the actual model of what's going on in that world at any point in time. So Swim reflects that decoupling by allowing you to bind operators to the model whenever you want.
Okay? And by whenever you want, I mean you can write them in code, in bits of Java, but also you can write them as blobs of JavaScript or Python and dynamically insert them into a running model. Okay? So let me make that one concrete for you. I could have a deployed system, which is a model, a deployed graph of digital twins, which are currently mirroring the state of Las Vegas. And, dynamically, a data scientist says, let me compute the average wait time of red cars at these intersections, and drops that in as a blob of JavaScript attached to every digital twin for an intersection.
That is what I mean by an application. And so we want to get to this point where the notion of an application is not something deeply hidden in somebody's, you know, Jupyter notebook, or in some programmer's brain before they quit and wander off to the next startup in 10 months. An application is what I want to know right now, dropped into a running model.
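To illustrate the idea of binding an operator to an already running model, here is a hypothetical Java sketch. In Swim itself the operator could equally be a blob of JavaScript or Python attached to each intersection's twin; all of the names and types below are invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ObjDoubleConsumer;

// Illustrative only: attaching an "operator" to twins that are already running,
// without redeploying them. Names are hypothetical, not the SwimOS API.
public class OperatorBinding {

    // A running twin for one intersection; it keeps evolving its own state.
    static final class IntersectionTwin {
        private final String id;
        private volatile double redCarWaitSeconds;     // the twin's own evolving state
        private ObjDoubleConsumer<String> operator;    // dynamically attached computation

        IntersectionTwin(String id) { this.id = id; }

        // An analyst attaches a computation to the live twin at any time.
        void bindOperator(ObjDoubleConsumer<String> op) { this.operator = op; }

        // Called as raw observations stream in from the intersection.
        void observeRedCarWait(double seconds) {
            redCarWaitSeconds = seconds;
            if (operator != null) {
                operator.accept(id, seconds);   // stream the state into the operator
            }
        }
    }

    public static void main(String[] args) {
        Map<String, IntersectionTwin> city = new ConcurrentHashMap<>();
        city.put("main-and-1st", new IntersectionTwin("main-and-1st"));
        city.put("main-and-2nd", new IntersectionTwin("main-and-2nd"));

        // The "application" dropped into the running model: a continuously
        // updated mean wait time for red cars across the intersections.
        final double[] sum = {0};
        final long[] n = {0};
        city.values().forEach(twin -> twin.bindOperator((id, wait) -> {
            sum[0] += wait;
            n[0]++;
            System.out.printf("current avg red-car wait: %.1fs (last update: %s)%n",
                    sum[0] / n[0], id);
        }));

        // Simulated streaming observations.
        city.get("main-and-1st").observeRedCarWait(42.0);
        city.get("main-and-2nd").observeRedCarWait(18.0);
    }
}
```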
[00:22:03] Unknown:
So the way that sounds to me is that Swim essentially acts as the infrastructure layer you deploy to ingest the data feeds from these sets of sensors, and then it will automatically create these digital twin objects to be able to have some digital manifestation of the real world so that you have a continuous stream of data and how that's interrelated. And then it sort of flips the order of operations in terms of how the data engineer and the data scientist might work together. In the way that most people are used to, you will ingest the data from these different sensors, bundle it up, and then hand it off to a data scientist to be able to do their analysis. They generate a model and then hand it back to the data engineer to say, okay, go ahead and deploy this and then see what the outputs are. Where instead, the Swim platform essentially acts as the delivery mechanism and the interactive environment for the data scientists to be able to experiment with the data, build the model, and then get it deployed on top of the continuously updating live stream of data, and then be able to have some real world interaction with those sensors in real time as they're doing that to be able to feed that back to say, okay.
Red cars are waiting 15% longer than other cars at these 2 intersections, so now I want to be able to optimize our overall grid. And that will then feed back into the rest of the network to have some physical manifestation of the analysis that they're trying to perform to try and maybe optimize the overall traffic flow? So there are some consequences
[00:23:42] Unknown:
for that. First of all, every algorithm has to compute stuff on the fly. So if you look at, you know, the kind of store-and-then-analyze approach to big data type learning, or any training or anything else, there you have all the bits, and here you don't. And so every algorithm that is part of Swim is coded in such a way as to continually process data, and that's fundamentally different to most frameworks. Okay? So for example, the learn-and-predict cycle is one. You know, you mentioned training and so on. That's very interesting.
But, you know, training implies that I collect and store some training data and that it's complete and useful enough to parameterize a model and then hand it back. You know, what if it isn't? And so in Swim, we don't do that. I mean, we can if you want. If you have a prebuilt model, that's no problem for us to feed data to. But instead, in Swim, the input vector, say, to a prediction algorithm, say a DNN, is precisely the current state of the digital twins for some bunch of things. Right? Maybe the set of sensors in the neighborhood of an intersection. And so this is a continually varying, real-world triggered scenario in which real data is fed through the algorithm but is not stored anywhere. So everything is fundamentally streaming.
So we assume that data streams continually, and, indeed, the output of every algorithm streams continually. So when you're computing the average, what you see is the current average. Okay? When you're looking for heavy hitters, what you see is the current heavy hitters. Alright? And so every algorithm has its streaming twin, I guess. And part of the art in the Swim context is reformulating the notion of analysis into a streaming context so that you never expect a complete answer, because there isn't one. It's just what I've seen until now. Okay? And what I've seen until now has been fed through the algorithm, and this is the current answer.
And so every algorithm computes and streams. And so the notion of linking, which I described earlier for Swim, between digital twins, say, applies also to these operators, which effectively link to the things they want to compute from, and then they stream their results. Okay? So if you link to them, you see a continued update. And, for example, that stream could be used to feed a Kafka implementation, which would serve a bunch of applications. You know, with Kafka, the notion of streaming is pretty well understood. So we can feed other bits of the infrastructure very well. But, fundamentally, everything is designed to stream.
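Simon's point that every algorithm needs a "streaming twin" comes down to reformulating batch computations as incremental updates over a tiny running summary. A minimal, generic Java sketch of that idea (not Swim code) for the running average he mentions:

```java
// Illustrative "streaming twin" of a batch average: the state is a running
// summary, never the raw history, and every update yields a current answer.
public class StreamingMean {
    private long count;
    private double mean;

    // Incremental update: constant state, no stored history.
    public synchronized double update(double x) {
        count++;
        mean += (x - mean) / count;
        return mean;              // the "current answer", streamed downstream
    }

    public static void main(String[] args) {
        StreamingMean waitTime = new StreamingMean();
        for (double sample : new double[] {12.0, 30.0, 18.0}) {
            System.out.printf("current average wait: %.1fs%n", waitTime.update(sample));
        }
    }
}
```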
[00:27:20] Unknown:
Yeah. It's definitely an interesting approach to the overall workflow of how these analyses work. And one thing that I'm curious about is how data scientists and analysts have found working with this platform, compared to the workflows that they might be used to.
[00:27:39] Unknown:
You know, you're interested in what data scientists think of this, and to be honest, in general, they react with surprise. Our experience to date has been largely with people who don't know what the heck they're doing in terms of data science. So they're trying to run an oil rig more efficiently. They have, whatever, 10,000 sensors, and they wanna make sure this thing isn't gonna blow up. Okay. So they tend to be heavily operationally focused folks. They're not data scientists. They never could afford one, and they don't understand the language of data science or have the ability to build cloud based pipelines that you and I might be familiar with. So these are folks who effectively just want to do a better job given this enormous stream of data that they have.
They believe they have something in their data. They don't know what that might be, but they're keen to go and see. Okay? And so those are the folks who we spend most of our time with. I'll give you a funny example if you'd like.
[00:28:56] Unknown:
Sure. That would be illustrative.
[00:29:02] Unknown:
We work with a manufacturer of aircraft, and they have a very large number of RFID-tagged parts and equipment. And if you know anything about RFID, you know it's pretty useless stuff; it's technology from about 10 or 20 years ago. And so what they were doing is, from about 2,000 readers, you're getting about 10,000 reads a second, and each one of these reads is simply being written into an Oracle database. At the end of the day, they try and reconcile it all with whatever parts they have and where everything is and so on. And the Swim solution to this is entirely different. It gives you a good idea of why we care about modeling data, or thinking about data, differently.
We simply built a digital twin for every tag. The first time a tag is seen, we create one, and if it hasn't been seen for a long time, it just expires. And whenever a reader sees a tag, it simply says, hey, I saw you, and this was the signal strength. Now because tags get seen by multiple readers, each digital twin of a tag does the obvious thing: it triangulates from the readers. Okay? So it learns the attenuation in different parts of the plant, which is very simple; the word learn there is rather stretched.
It's a pretty straightforward calculation. And then suddenly, it can work out where it is in 3-space. So instead of an Oracle database full of tag reads and lots and lots of post-processing, you have a couple of Raspberry Pis. And these Raspberry Pis have millions of these little tag twins running in them, and you can ask any one of them where it is. Okay? And then you can do even more. You can say, hey, show me all the things within 3 meters of this tag. Okay? And that allows you to see components being put together into real physical objects. Right? So as a fuselage gets built with the engine or whatever it is. And so a problem which was tons of infrastructure and tons of tag reads got turned into two Raspberry Pis' worth of stuff, which kind of self-organized into a form which could feed real-time visualization and control around which bits of infrastructure were where.
Okay? Now that was transformative for this outfit, which quite literally had never thought of tackling the problem in this way. Does that make sense?
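The per-tag twin Simon describes could look roughly like the following Java sketch. A weighted centroid over reader positions stands in for the real triangulation, and the reader coordinates, the RSSI weighting, and all of the names are assumptions made up for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a per-tag digital twin that locates itself from
// reader sightings. A weighted centroid stands in for real triangulation;
// names and numbers are hypothetical, not the plant's actual system.
public class TagTwin {

    record Point(double x, double y, double z) {}

    // Fixed reader positions in the plant (assumed, for illustration).
    private static final Map<String, Point> READERS = Map.of(
            "reader-1", new Point(0, 0, 3),
            "reader-2", new Point(10, 0, 3),
            "reader-3", new Point(0, 10, 3));

    // Latest signal strength seen from each reader: the twin's only state.
    private final Map<String, Double> lastRssi = new ConcurrentHashMap<>();

    // Called whenever any reader says "I saw you, at this signal strength".
    public Point sighted(String readerId, double rssi) {
        lastRssi.put(readerId, rssi);
        return estimatePosition();
    }

    // Weighted centroid: a stronger signal pulls the estimate toward that reader.
    private Point estimatePosition() {
        double wSum = 0, x = 0, y = 0, z = 0;
        for (Map.Entry<String, Double> e : lastRssi.entrySet()) {
            Point p = READERS.get(e.getKey());
            if (p == null) continue;
            double w = Math.pow(10, e.getValue() / 20.0);  // crude RSSI-to-weight conversion
            wSum += w;
            x += w * p.x(); y += w * p.y(); z += w * p.z();
        }
        return wSum == 0 ? null : new Point(x / wSum, y / wSum, z / wSum);
    }

    public static void main(String[] args) {
        TagTwin tag = new TagTwin();
        tag.sighted("reader-1", -40);
        System.out.println(tag.sighted("reader-2", -55));  // the estimate updates per sighting
    }
}
```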
[00:32:07] Unknown:
Yeah. That's definitely a very useful example of how this technology can flip the overall order of operations and the overall capabilities of an organization: going from "this tag got read at this point in this location" to actually getting something meaningful out of it, as far as "this part is in this location in the overall reference space of the warehouse." That is definitely transformative, and probably gave them weeks or months worth of additional lead time for being able to predict problems or identify areas for potential optimization.
[00:32:48] Unknown:
Yeah. And I think we saved them $2,000,000 a year. Let me tell you, from this tale come two interesting things. First of all, if you show up at a customer with something running on a Raspberry Pi, you can't charge them $1,000,000. Okay, that's lesson one. Lesson two is that the volume of the data is not related to the value of the insight. Okay? I mentioned traffic earlier. In the city of Las Vegas, we get between about 50 and 60 terabytes per day from the traffic infrastructure. And every digital twin of every intersection in the city predicts 2 minutes into its future. Okay? And those insights are sold in an API in Azure to customers like Audi and Uber and Lyft and whatever else.
Okay? Now that's a ton of data. Okay? You couldn't even think of where to put it in your cloud. But the value of the insight is relatively low. That is, the total amount of money I can extract from Uber per month per intersection is low. Alright? By the way, all this stuff is open source. You can go and grab it and play and hopefully make your city better. So from that, you can gather that it's not of high enough value for me to do anything other than say, go grab it and run. So: vast amounts of data, and insight which is relatively important, but not of commercially relevant value.
[00:34:35] Unknown:
And another aspect of that case in particular is that despite this volume of data, it might be interesting for being able to do historical analysis. But in terms of the actual real-world utility, it has a distinct expiration period where you have no real interest in the sensor data as it existed an hour ago, because that has no particular bearing on your current state of the world and what you're trying to do with it at this point in time.
[00:35:03] Unknown:
Yeah. You have historical interest in the sense of wanting to know if your predictions were right or wanting to know about it for traffic engineering purposes, which runs on a slower time scale. So some form of bucketing or whatever. Some more terse form of recording is useful. And, sure, that's easy,
[00:35:27] Unknown:
but you certainly do not want to record the original data rate. And then going back to the other question I had earlier, when we were talking about the workflow of an analyst or a data scientist pushing out their analyses live to these digital twins and potentially having some real-world impact, I'm curious if the Swim platform has some concept of a dry run mode where you can deploy this analysis and see what the output of it is, and see what impact it would have, without it actually manifesting in the real world, for cases where you want to ensure that you're not accidentally introducing error or potentially having a dangerous
[00:36:09] Unknown:
outcome, particularly in the case that you were mentioning of an oil and gas rig? Yeah. So I'm with you 100%. Actually, everything we've done thus far has been open loop, in the sense that we're informing a human or another application, but we're not directly controlling the infrastructure. And the value of a dry run would be enormous, you can imagine, in those scenarios. But thus far, we don't have any use cases that we can report of using Swim for direct control. We do have use cases where, on a second by second basis, we are predicting whether machines are gonna make an error as they build PCBs for servers and so on.
But, again, what you're doing is you're calling Freddy to come over and fix the machine. You're not, you know, trying to change the way the machine behaves. And now digging a bit deeper into the actual implementation
[00:37:10] Unknown:
of Swim, I'm wondering if you can talk through how the actual system itself is architected and some of the ways that it has evolved as you have worked with different partners to deploy it into real-world environments and get feedback from them, and how that has affected the overall direction of the product roadmap?
[00:37:29] Unknown:
So Swim is a couple of megabytes of Java extensions. Okay? So it's extremely lean. We tend to deploy in containers using GraalVM. So it's very small. We can run in, you know, probably a hundred megabytes or so. And so when people tend to think of edge, they tend to think of running in edge gateways or things. We don't really think of edge in that way. And so an important part of defining edge, as far as we're concerned, is simply gaining access to streaming data. We don't really care where it is. But Swim is small enough to get on limited amounts of compute towards a physical edge.
And the product has evolved in the sense that, originally, there wasn't a way of building applications from the data; you'd sit down, write them in Java, and so on. Latterly, this ability to simply let the data build the app, or most of the app, came about in response to customer needs. But Swim is deployed typically in containers. And for that, we have, you know, currently relied very heavily on the Azure IoT Edge framework, and that is magical, to be quite honest, because we can rely on Microsoft's machinery to deal with all of the painful bits of deployment and life cycle management for the code base and the application as it runs.
These are not things that we are really focused on. What we're trying to do is build a capability which will respond to data and do the right thing for the application developer. And so we are fully published in the Azure IoT Hub, and you can download us and get going and manage us through the life cycle that way. And so in several use cases now, what we are doing is we are used to feed fast-timescale insights at the physical edge. We are labeling data and then dropping it into Azure ADLS Gen 2 and feeding insights into applications built in Power BI.
Okay. So just for the sake of machinery, you know, we're using the Azure framework for management of the IoT edge. By the way, I think IoT Edge is about the worst possible name you could ever pick, because all you want is a thing to manage the life cycle of a capability which is gonna deal with fast data. Whether it's at the physical edge or not is immaterial. But that's basically what we've been doing: relying on Microsoft's fabulous life cycle management framework for all that. We're plugged into the IoT Hub and all of the smaller Azure services generally for back-end things, which enterprises love.
[00:41:00] Unknown:
And then another element of what we were discussing in the use case examples that you were describing, particularly, for instance, with the traffic intersections, is the idea of discoverability and routing between these digital twins, as far as how they determine which twins are useful to communicate with and establish those links. And also, at the networking layer, how they handle network failures in terms of communication, and ensuring that if there is some sort of fault they're able to recover from it.
[00:41:38] Unknown:
Let's talk about two layers. One is the app layer, and the other one is the infrastructure, which is gonna run this effectively as a distributed graph. And so Swim is gonna build this graph for us from the data. What that means is the digital twins, by the way, we technically call these web agents, these little web agents are gonna be distributed somewhere in a fabric of physical instances, and they may be widely geographically distributed. And so there is a need, nonetheless, at the application layer, for things which are related in some way, linked physically or, you know, in some other way, to be able to link to each other. That's Swim's equivalent of a sub. And so links require that objects, which are these digital twins, have the ability to inspect each other's data, right, their members. And, of course, if something is running on the other side of the planet and you're linked to it, how on earth is that gonna work? So we're all familiar with object oriented languages and objects in one address space. That's pretty easy. We know what an object handle or an object reference or a pointer or whatever is. We get it. But when these things distribute, that's hard. And so in Swim, if you're an application programmer, you would simply use object references, but these resolve to URIs.
So in practice, at runtime, the linking, that is, when I link to you, I link to a URI. And that link, once resolved by Swim, enables a continued stream of updates to flow from you to me. And if we happen to be on different instances, that is, running in different address spaces, then that will flow over a mesh, over a direct WebSockets connection between your instance and mine. And so in any Swim deployment, all instances are interlinked, each linked to the others using a single WebSockets connection. And then these links permit the flow of information between linked digital twins.
And what happens is, whenever a change in the in-memory state of a linked, you know, digital twin happens, its instance then streams to every other linked object an update to the state for that thing. Right? So what's required is, in effect, a streaming update to JSON. Right? Because if we're gonna record our model in some form of, like, JSON state or whatever, we would now need to be able to update little bits of it as things change. And so we use a protocol called WARP for that, and that's a Swim capability which we've open sourced.
And what that really does is bring streaming to JSON. Right? Streaming updates to parts of a JSON model. And then every instance in Swim maintains its own view of the whole model. So as things stream in, the local view of the model is changed. But the view of the world is very much one of a consistency model based on whatever happens to be executing locally and whatever needs to view state. So it's an eventually consistent model, in which every node eventually learns the entire thing. And, generally, eventually here means, you know, a link away from real time. Right? So a link's delay away from real time.
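The mechanism described here, each instance holding a local view of the model that gets patched by small streamed deltas rather than re-fetched, can be sketched in Java as below. This shows only the shape of the idea; it is not the WARP wire format, and the node URIs and types are invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of an eventually consistent local view of the model:
// remote twins stream small deltas and each instance patches its own copy.
// This is only the shape of the idea, not the actual WARP protocol.
public class LocalModelView {

    // One streamed delta: "this lane on this node now has this value".
    record Update(String nodeUri, String lane, Object value) {}

    // nodeUri -> (lane -> latest value): the whole local view of the graph.
    private final Map<String, Map<String, Object>> view = new ConcurrentHashMap<>();

    // Applied as updates arrive over the link (e.g. a WebSocket per peer
    // instance); the view is always "a link's delay away from real time".
    public void apply(Update u) {
        view.computeIfAbsent(u.nodeUri(), k -> new ConcurrentHashMap<>())
            .put(u.lane(), u.value());
    }

    public Object read(String nodeUri, String lane) {
        return view.getOrDefault(nodeUri, Map.of()).get(lane);
    }

    public static void main(String[] args) {
        LocalModelView local = new LocalModelView();
        // Deltas streamed from a linked twin running on another instance.
        local.apply(new Update("/intersection/main-1st", "phase", "RED"));
        local.apply(new Update("/intersection/main-1st", "phase", "GREEN"));
        System.out.println(local.read("/intersection/main-1st", "phase"));  // GREEN
    }
}
```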
[00:45:38] Unknown:
And then the other aspect of the platform is the statefulness of the computation. And as you're saying, that state is eventually consistent, depending on the communication delay between the different nodes within the context graph. And then in terms of data durability, one thing I'm curious about is the length of state, or sort of the overall buffer that is available; I'm guessing that is largely dependent on where it happens to be deployed and what the physical capabilities of the particular node are. And then also, as far as persisting that data for maybe historical analysis, my guess is that that relies on distributing the data to some other system for long-term storage. And I'm just wondering what the overall sort of pattern or paradigm is for people who want to be able to have that capability.
[00:46:24] Unknown:
Oh, those are great questions. So in general, we're going from some horrific raw data form on the wire from the original physical thing to, you know, something much more efficient and meaningful in memory, and generally much more concise. So we get a whole ton of data reduction that way. And even though we're focused on streaming, we don't stop you storing your original data. If you want to, you just have to have the disk or whatever. The key thing in Swim is we don't do that on the hot path. Okay? So things change their states in memory and maybe compute on that, and that's what they do first and foremost. And then we lazily throw things out to disk, because disks happen slowly relative to compute.
And so, typically, what we end up storing is the semantic state of the context graph, as you put it, not the original data. You know? That is, for example, in the traffic world, we store things like this light turned red at this particular time, not the voltages on all the registers in the light. And so you get massive data reduction. And that form of data is very amenable to storage in the cloud, say, or somewhere else. And it's even affordable at, you know, reasonable rates. So the key thing for Swim and storage is you can remember as much as you want, as much as you have space for locally, and then storage in general is not on the hot path. It's not on the compute-and-stream path. And, generally, we're getting huge data reductions for every step up the graph we make.
So, for example, if I go from, you know, all the states of all the traffic sensors to predictions, then I've made a very substantial reduction in the data amount anyway. Right? So as you move up this computational graph, you reduce the amount of data that you're gonna have to store. It's really up to you to pick what you want to store.
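The "store the semantic state, not the raw readings" point can be sketched as a simple change detector on a twin's state: raw samples stream through, and only transitions like "the light turned red at time T" are emitted toward storage. The thresholds and names below are made up for illustration, not taken from Swim.

```java
import java.time.Instant;
import java.util.Optional;

// Illustrative sketch: reduce a raw sensor stream to semantic state changes.
// Only transitions like "light turned RED at time T" survive to storage.
public class SemanticReducer {

    enum Phase { RED, AMBER, GREEN }
    record Transition(Phase to, Instant at) {}

    private Phase current;

    // Classify a raw reading (e.g. a voltage) and emit only if the phase changed.
    public Optional<Transition> ingest(double voltage, Instant at) {
        Phase observed = voltage > 4.0 ? Phase.GREEN : voltage > 2.0 ? Phase.AMBER : Phase.RED;
        if (observed == current) {
            return Optional.empty();          // nothing worth storing
        }
        current = observed;
        return Optional.of(new Transition(observed, at));
    }

    public static void main(String[] args) {
        SemanticReducer light = new SemanticReducer();
        Instant now = Instant.now();
        // Thousands of raw samples collapse to a handful of transitions.
        System.out.println(light.ingest(0.3, now));                 // emits a RED transition
        System.out.println(light.ingest(0.4, now.plusSeconds(1)));  // Optional.empty, no change
        System.out.println(light.ingest(4.8, now.plusSeconds(2)));  // emits a GREEN transition
    }
}
```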
[00:48:38] Unknown:
In terms of your overall experience working as the CTO of this organization and shepherding the product direction and the capabilities of this system, I'm wondering what you have found to be some of the most challenging aspects, both from the technical and business sides, and some of the most useful or interesting or unexpected lessons that you've learned in the process? So what's hard is that
[00:49:06] Unknown:
the real world is not the cloud native world. So we've all seen fabulous examples of Netflix and Amazon and everybody else doing cool things with their data. But, you know, if you're an oil company and you have a rig at sea, you just don't know how to do this. So, you know, we can come at this with whatever skill sets we have. What we find is that the real world, the large enterprises of today, are still ages behind the cloud native folks, and that's a challenge. Okay? So getting to understand what they need, because they still have lots of assets which are generating tons of data, is very hard.
Second, this notion of edge is continually confusing, and I mentioned previously that I would never have chosen IoT Edge as, for example, the Azure name, because it's not about IoT, or maybe it is. But let me give you two examples. One is traffic lights, say, physical things. It's pretty obvious there what the notion of edge is. It's the physical edge. But the other one is this. We build a real-time model for millions, tens of millions of handsets for a large mobile carrier, in memory, and it evolves all the time, right, in response to continually received signals from these devices.
There is no edge there; it's data that arrives over the Internet, and we have to figure out where the digital twin for that thing is and evolve it in real time. Okay? And there, you know, there is no concept of a network node or physical edge that the data is traveling over. We just have to make decisions on the fly and learn and update this model. So for me, edge is the following thing. Edge is stateful, and cloud is all about REST. Okay? So what I would say is the fundamental difference between the notion of edge and the notion of cloud that I would like to see broadly understood is that whereas REST and databases made the cloud very successful, in order to be successful with, you know, this boundless streaming data, statefulness is fundamental, which means REST goes out the door. And we have to move to a model which is streaming based, with stateful computation.
[00:51:51] Unknown:
And then in terms of the future direction, both from the technical and business perspective, I'm wondering what you have planned for both the enterprise product for Swim.ai as well as the open source kernel in the form of SwimOS.
[00:52:06] Unknown:
From an open source perspective, we, you know, we don't have the advantage of having come out of a LinkedIn or something where we built it and used it at scale. Instead, we're coming out of a startup. What we think we've built is something which is of phenomenal value, and we're seeing that grow. And our intention is to continually feed that community as much as it can take, and we're just getting more and more stuff ready for open sourcing. So we want to see our community go and explore new use cases for using this stuff, and we aim to be dedicated to empowering our community.
From a commercial perspective, we are focused on a world which is edge. And the moment you say that to people, they tend to get an idea of physical edge or something in their heads. And then, you know, very quickly, you can get put in a bucket of IoT. I gave an example of, say, building a model in real time in AWS for, you know, a mobile customer. Our intention is to continue to push the bounds of what edge means, and to enable people to build stream pipelines for massive amounts of data easily, without complexity and without the skill set required to invest in these traditionally fairly heavyweight pipeline components, such as Beam and Flink and so on, to enable people to get insights cheaply and to make the problem of dealing with new insights from data very easy to solve.
[00:53:56] Unknown:
And are there any other aspects of your work on swim AI and the space of streaming data and
[00:54:05] Unknown:
digital twins that we didn't discuss yet that you'd like to cover before we close out the show? I think we've done a pretty good job. You know, I think there are a bunch of parallel efforts, and that's all goodness. That is, one of the harder things has been to get this notion of statefulness more broadly accepted. And I see the folks from Lightbend out there pushing their idea of stateful functions as a service, and, really, these are stateful lambdas. And there are others out there too. So for me, step number one is to get people to realize that if we're going to tame boundless data, REST and databases are gonna kill us.
Okay? That is, there is so much data and the rates are so high that you simply cannot afford to use a stateless paradigm for processing. You have to do things statefully, because, you know, forgetting the context every time and having to look it up is just too expensive.
[00:55:09] Unknown:
For anybody who wants to follow along with you and get in touch and keep track of what you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Well, I think
[00:55:29] Unknown:
I mean, there isn't much tooling, really, to be perfectly honest. There are a bunch of really fabulous open source code bases and experts in their use, but that's far from tooling. And then there is, I guess, an extension of the Power BI downwards world, which is something like the monster Excel spreadsheet world. Right? So you find all these folks who are pushing that kind of, you know, end user model of data, doing great things, but leaving a huge gap between the consumer of the insight and the data itself. That is, assuming data is already there in some good form and can be put into a spreadsheet, or whatever it happens to be.
So there's this huge gap in the middle, which is: how do we build the model? What does the model tell us just off the bat? How do we do this reconstructively in a large number of situations? And then how do we dynamically insert operators which are gonna compute useful things for us on the fly in running models?
[00:56:44] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing on the Swim platform. It's definitely a very interesting approach to data management and analytics, and I look forward to seeing the direction that you take it in the future. So I appreciate your time on that, and I hope you enjoy the rest of your day. Thanks very much. It's been great to be here. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Simon Crosby Begins
Overview of swim.ai Platform
Comparison with Other Streaming Systems
Differentiating Data Fabric and Open Source Kernel
Building Models from Data
Developer Workflow with swim.ai
Real-time Data Processing and Analysis
User Experiences and Use Cases
Technical Implementation of swim.ai
Discoverability and Networking
Data Durability and Storage
Challenges and Lessons Learned
Future Directions for swim.ai
Statefulness in Data Processing
Biggest Gaps in Data Management Tooling
Closing Remarks