Summary
Data-oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project. He explains the motivation for building Wallaroo, how it is implemented, and how you can start using it today.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
- Your host is Tobias Macey and today I’m interviewing Sean T. Allen about Wallaroo, a framework for building and operating stateful data applications at scale
Interview
- Introduction
- How did you get involved in the area of data engineering?
- What is Wallaroo and how did the project get started?
- What is the Pony language, and what features does it have that make it well suited for the problem area that you are focusing on?
- Why did you choose to focus first on Python as the language for interacting with Wallaroo and how is that integration implemented?
- How is Wallaroo architected internally to allow for distributed state management?
- Is the state persistent, or is it only maintained long enough to complete the desired computation?
- If so, what format do you use for long term storage of the data?
- What have been the most challenging aspects of building the Wallaroo platform?
- Which axes of the CAP theorem have you optimized for?
- For someone who wants to build an application on top of Wallaroo, what is involved in getting started?
- Once you have a working application, what resources are necessary for deploying to production and what are the scaling factors?
- What are the failure modes that users of Wallaroo need to account for in their application or infrastructure?
- What are some situations or problem types for which Wallaroo would be the wrong choice?
- What are some of the most interesting or unexpected uses of Wallaroo that you have seen?
- What do you have planned for the future of Wallaroo?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Wallaroo Labs
- Storm Applied
- Apache Storm
- Risk Analysis
- Pony Language
- Erlang
- Akka
- Tail Latency
- High Performance Computing
- Python
- Apache Software Foundation
- Beyond Distributed Transactions: An Apostate’s View
- Consistent Hashing
- Jepsen
- Lineage Driven Fault Injection
- Chaos Engineering
- QCon 2016 Talk
- Codemesh in London: How did I get here?
- CAP Theorem
- CRDT
- Sync Free Project
- Basho
- Wallaroo on GitHub
- Docker
- Puppet
- Chef
- Ansible
- SaltStack
- Kafka
- TCP
- Dask
- Data Engineering Episode About Dask
- Beowulf Cluster
- Redis
- Flink
- Haskell
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today.
Enterprise add-ons and professional support are available for added peace of mind. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host is Tobias Macey, and today I'm interviewing Sean T. Allen about Wallaroo, a framework for building and operating stateful data applications at scale. So, Sean, could you start by introducing yourself?
[00:01:25] Sean T. Allen:
Yeah, hi. My name's Sean Allen. I always go by the T; if you Google "Sean Allen" by itself, you can see the reason for that. I'm the VP of engineering at Wallaroo Labs. We make Wallaroo; I think you gave a good intro to what it is. I also spend time now as a member of the Pony core team, which is a language that I believe we're gonna talk a little bit about later on. And I was also an author of a book from Manning called Storm Applied, about the Apache Storm streaming framework.
[00:02:04] Tobias Macey:
And could you tell us a bit about how you got involved in the area of data engineering?
[00:02:10] Sean T. Allen:
Well, I don't remember exactly. I've worked on back end systems for most of my career, although I took a detour through doing back end stuff for websites and everything for a long time. And somehow, when I came back out of it, I was working on back end stuff where the data that was being processed became more and more important. And then suddenly, apparently, I was in the area of data engineering. What the exact steps were, I don't really know at this point.
[00:02:44] Tobias Macey:
And could you start by giving a description of what Wallaroo is and how the project got started?
[00:02:51] Sean T. Allen:
So I think in a lot of ways, Wallaroo is us building towards the tool that I always wanted to have when doing a lot of data processing jobs, where we would take various tools and try to build these ad hoc solutions. And it would usually work most of the time, when things were working correctly, but once there were failures, we'd start to have all sorts of weird data corruption stuff. So Wallaroo is a framework for building streaming data applications, although we're really looking towards eventually being able to support batch use cases as well. So you could think of it in the same kind of area as stuff like Apache Storm, but where we wanted to solve some of the problems that we thought were really hard about building applications like that: you need to be able to keep track of state, and you wanna be able to scale these things up and down and whatnot.
And so we provide managed in-memory state, we allow you to take a cluster of machines and grow it up and down, and we provide you an API which is scale independent. That's 1 of the things that we are quite interested in, where the API that you're writing to doesn't talk in any way about the number of machines or servers that you're running on. You're just talking about the flow of data through it, what your data structures are, and the operations that you do on it, so that you can take the same code which runs on your laptop in 1 process, run it on your laptop in 3 processes, or take that same code and put it out into production or testing on 10 processes or whatever that might be, without having to make any changes.
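To make the scale-independent API concrete, here is a minimal sketch of what a Wallaroo application module looks like, modeled on the word count example discussed later in the episode. The decorator and builder names are recalled from the Wallaroo documentation of this era and may not match every version exactly, so treat this as the shape of the API rather than its letter; note that nothing in the module mentions processes or machines.

```python
# A sketch of a scale-independent Wallaroo application module, modeled on
# the word count example in the Wallaroo docs. Names and signatures are
# assumptions and may differ between Wallaroo versions.
import string
import wallaroo


@wallaroo.decoder(header_length=4, length_fmt=">I")
def decode(bs):
    # framed bytes arriving over TCP -> one line of text
    return bs.decode("utf-8")


@wallaroo.computation_multi(name="split into words")
def split(line):
    return line.lower().split()


class WordTotals(object):
    # state object: opaque to Wallaroo, one instance per partition
    def __init__(self):
        self.counts = {}


@wallaroo.state_computation(name="count word")
def count_word(word, state):
    state.counts[word] = state.counts.get(word, 0) + 1
    return ((word, state.counts[word]), True)  # (output, state was modified)


def partition(word):
    # 26 partitions by first letter, plus a catch-all bucket
    return word[0] if word[0] in string.ascii_lowercase else "!"


@wallaroo.encoder
def encode(pair):
    return ("%s => %d\n" % pair).encode("utf-8")


def application_setup(args):
    # nothing below says how many processes or machines will run this
    partitions = list(string.ascii_lowercase) + ["!"]
    ab = wallaroo.ApplicationBuilder("word count")
    ab.new_pipeline("lines", wallaroo.TCPSourceConfig("127.0.0.1", "7010", decode))
    ab.to_parallel(split)
    ab.to_state_partition(count_word, WordTotals, "word totals",
                          partition, partitions)
    ab.to_sink(wallaroo.TCPSinkConfig("127.0.0.1", "7002", encode))
    return ab.build()
```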
[00:04:37] Tobias Macey:
And based on the fact that it will transparently shard your data across multiple nodes, it sounds like there's a particular set of application types that it works well with, as opposed to being able to just run, for instance, your run-of-the-mill web application using Django or Rails. So are there any particular use cases that you are most directly targeting with Wallaroo?
[00:05:03] Sean T. Allen:
So really, at this point, back office data processing jobs, probably ones where latency might be a concern for you. That's certainly the area that we started in. When we originally started building it, we were looking at doing finance type applications for banks, like risk analysis and whatnot, that needed to be done in real time. But we've moved off of that to, more generally, any type of data application where you're handling events as they happen in real time and want to be able to respond to them. So this could be ad bidding for ad network type stuff. This could be in the finance space; it could be analyzing what's going on in terms of market data and changing algorithms on the fly.
We're working with a large infrastructure monitoring company where they're taking data in from their customers, and they want to be able to trigger alerts based on that. That's very much a "hey, we have real time data flowing in, and we wanna keep some state around for it in order to be able to analyze it" use case. You know, like, what's the state of this server over the last 10 minutes, to trigger alerts? Those are a couple of the types of use cases, but in general it's: hey, I have a stream of events, and I want to be able to respond and take action on them in some kind of fashion.
[00:06:35] Tobias Macey:
And at the beginning, you mentioned that you're heavily involved with the Pony language. And when looking through the documentation, it's pretty readily obvious that that's the language that you used to implement Wallaroo. Before I started reading through the documentation, I don't think I had ever really seen mention of that language. So I'm wondering if you can just take some time to describe a bit about what it is and the features that it has that make it well suited for the problem area that you're focusing on.
[00:07:03] Sean T. Allen:
Yeah. So Pony is a fairly new, high performance, actor based language where the programming model and concurrency model is similar to what you would see with Erlang or with Akka on the JVM; those are both actor based. But 1 of the things that it has which is novel and different is a type system designed to allow you to do message passing in the same way that Erlang does, in a safe fashion. Erlang has some performance constraints where it copies most data which gets sent from 1 actor to another, which can be a bottleneck on performance. Pony uses its type system to let you do, say, unsafe things in a safe way: the compiler can guarantee that memory is not being shared across multiple threads at 1 time, so you don't have to do copies in order to make sure that things are safe. And when we were looking for something to write Wallaroo in, we had a series of constraints that we were looking at at the time. We very much wanted to be able to process high volumes of data, and we wanted to be able to do it quickly.
So we were already looking at performance type things as something we needed. And 1 of the things that we really wanted to have was this: if you're doing stuff like we were looking at in the beginning, like risk and financial applications and whatnot, then tail latencies can be a real concern. A lot of the folks we were talking to had built applications on the JVM, and they were struggling with JVM garbage collection, where for a non trivial number of their requests, rather than finishing in the couple of milliseconds that they needed them to finish in, there would be a GC pause, and it might take several hundred milliseconds or more for those requests to be processed. So we were also looking for a solution where any garbage collection or memory handling that might go on would give us nice flat tail latencies.
We looked at an awful lot of different things, and in the end decided to go with Pony, despite the fact that it was relatively young, in large part because we thought that for the problem we were solving, the actor model would be a really good way to go about modeling our concurrency problems. Because we were really worried about: hey, we're gonna build this high speed thing, and we wanna make sure that we're not accidentally corrupting data that we're handling for the user. So the actor model was 1 that we were comfortable with. And then it started to be: okay, if we decide the actor model is the route that we wanna go, what are our options? We looked at Erlang, and we talked with some folks that we know who are using Erlang, and they said: Erlang is great, but we don't think you're gonna be able to push it the way that you're looking to. We looked at other stuff, like, well, we could do this in C, but we had worries about memory handling. We seriously considered Rust, which was further along than Pony was at the time, but at that point it would have been: hey, we're writing our own actor model. And Pony, while being a relatively new language, had a runtime that had been around for much longer; it had been used at 1 of the major banks in the UK. And we figured: well, if we did this in C, we'd probably end up writing a runtime that looks a lot like this. So let's start with this runtime, let's build from there, and let's become a part of the community for this. We'll try it out, we'll see how it goes. And over the course of several months, a lot of the folks on the team at Wallaroo Labs became members of the Pony committers and whatnot. Everything worked out really well for us, and we've been really happy with the choice of language.
[00:11:02] Tobias Macey:
Have there been any instances where the lack of ecosystem has been a problem, or have the primitives within the language been largely sufficient for the type of work that you're using it for?
[00:11:19] Sean T. Allen:
They've been largely sufficient up to this point. A lot of what we're doing is really much more low level type stuff, where because of the performance concerns that we had, taking off the shelf libraries and sticking them together probably wasn't gonna work. Even if we'd used a JVM based language, for example, we probably would have either ended up using a few of the high performance computing libraries that provide data structures more suited to high performance computing than what's in the Java standard library, or we would have ended up writing our own anyway. So while Pony is young and has a small standard library, that didn't really bother us, because we were gonna end up writing most everything for our early work anyway.
[00:12:00] Tobias Macey:
And would it be fair to describe Pony as a special purpose hybrid between Erlang and Rust, in terms of the fact that it's oriented around an actor model but also has the memory safety guarantees?
[00:12:14] Sean T. Allen:
If you wanted to do a 15 second elevator pitch to folks who know what Erlang and Rust are, then, yeah, I think that's a reasonable way to initially approach it.
[00:12:28] Tobias Macey:
And the user API for people who are using Wallaroo, at least at the current state and time, is via the Python language runtime. So I'm curious why you chose to focus first on Python as the language for interacting with Wallaroo, and how you've built that integration for being able to embed the Python API within the Pony runtime.
[00:12:49] Sean T. Allen:
So our first API was actually C++, and we still have a C++ API, but it's not 1 that we're actively supporting at this point in time. Although if somebody came along and said, hey, we really wanna do a thing with C++, we'd happily pick up work on that and work with them. But what really happened was that the folks we were going out and talking to who were showing interest in Wallaroo were using Python. And there are more and more people in the Python community who are looking to do real time data processing, and there's just not a lot of tools there. If you go look, the Apache Software Foundation has an ungodly number of real time processing tools if you're running on the JVM, which is great; you have a wide variety of choices. But if you're doing stuff with Python, you have a definitely smaller set of them, and most of the ones you do have are probably going to require you to use the JVM. And there are a lot of shops that are primarily Python shops, where they recognize that while the JVM is an awesome piece of technology, it can also be operationally difficult to run in terms of tuning GC and whatnot in order to get optimal performance. So the folks who we seem to be appealing to were ones who were doing Python stuff, but didn't want to take on managing the JVM and whatnot. That's really how the Python 1 came about as our primary focus right now. Although we've also had a lot of people in the same vein who are using Go who we've been speaking to, so we're currently working on a Go API as well. Because, again, it's the same thing: there's just not a lot of tools for real time stream data processing within the Go ecosystem.
But the question was how we go about doing this with Python. So what we have is this: Wallaroo is, in some ways, a Pony library, and we have a main entry point which is in Pony. And we embed a Python interpreter, at this point in time a Python 2.7 interpreter, inside of a Pony application. You supply our runner application, which includes the embedded Python interpreter, a module that it will load, which is standard Python code. And it can use any libraries, any modules that you have installed within your normal Python environment.
And so you could use virtualenv to set up an environment for it, or you could just use the stock Python which is installed on your machine. We use the Python C API to talk back and forth between the Pony code and the Python code, with a small bridge of C code which interfaces between Pony, which has an excellent C FFI, and the Python C API. We increment and decrement reference counts on objects as they come back and forth, in and out of the user supplied code, and we call known Python methods that we're expecting to find in certain places. All of this uses the API that you would use to write Python modules in C, or to call from C back into Python if you were embedding it for some type of game or whatnot. So we use that fairly standard way of doing it. And we're gonna be doing that for Python 3 soon, which, because that API is pretty much basically the same, should be really straightforward for us, because there's actually very little Python code that we have that's involved in it. We're doing pretty much all the interaction through C.
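Conceptually, the runner's Python-side job is small: import the module you named and call a well-known entry point. Here is a minimal plain-Python sketch of that loading step, assuming the application_setup convention from the Wallaroo docs; the real work happens from Pony through the CPython C API, so this is an illustration of the mechanism, not Wallaroo's actual loader.

```python
# A plain-Python sketch of what an embedding host does conceptually:
# load a user-supplied module by name and call a well-known entry point.
# Wallaroo does this from Pony via the CPython C API; the entry point
# name matches the docs' convention, the rest is illustrative.
import importlib
import sys


def load_application(module_name, args):
    module = importlib.import_module(module_name)  # like PyImport_ImportModule
    setup = getattr(module, "application_setup")   # like PyObject_GetAttrString
    return setup(args)                             # like PyObject_CallObject


if __name__ == "__main__":
    # e.g. python runner_sketch.py word_count --in 127.0.0.1:7010
    app = load_application(sys.argv[1], sys.argv[2:])
    print("loaded application:", app)
```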
[00:16:57] Tobias Macey:
And 1 of the main promises of Wallaroo is the fact that it allows for easily distributing state across multiple machines. So I'm wondering how that works internally, and how the internal architecture of Wallaroo has evolved from when you first started working on it.
[00:17:15] Sean T. Allen:
So I think the internal architecture, for the most part, when it comes to handling of state, has been pretty much the same from the beginning. We did a lot of internal architecture changes which were related to getting better performance. Like, the second version of Wallaroo that we ever did ran at maybe 1/100th of the speed that we currently have, but, hey, it worked. And amusingly enough, the very first version of Wallaroo that we did had not even 1 tenth of the features we have now, but we wrote it in Python in order to get our testing harness set up around it. But so, there's a paper by Pat Helland called Beyond Distributed Transactions: An Apostate's View. In it, he talks about, from a database standpoint, how you could build a database that in theory could scale infinitely. And 1 of the ideas he talks about is that you would have these things he calls entities.
And these entities would be transactional boundaries, where the entities are based on things in your domain. So if you have some sort of a system which is dealing with customers, then each of these entities might be a customer, and you could have other entities that are for products, rather than operating on tables with relations in them. And then you could spread these entities around across an infinite number of machines; as long as you have enough entities to put on each machine, you can keep doing that. We hadn't actually read that paper before we started building Wallaroo, but that basic idea of entities matches up very well with the concept that we have that we call state objects, where the programmer writing their program can define these objects which represent whatever the state is for their particular domain, which is opaque to Wallaroo. You provide those objects in your language, and you also provide how that data would be partitioned.
So, like, in a lot of classic word count examples that are done for streaming data, it's usually: hey, I'm gonna break this up into different buckets in order to parallelize, based on, say, the first letter of a word. That's a fairly simplistic thing, but if you're writing word count in Wallaroo, and we have an example word count application where people can see this being done, we do exactly that. We create a partition for each of the letters of the alphabet in the application, and then Wallaroo will manage those. And so if you're running Wallaroo in 1 process, it's gonna have all of those data partitions in that 1 process. But if you start up, say, a second process and connect it to the first 1, so you're now running with 2 processes, then the state objects, of which you had 26, will now be 13 on each machine. And in general, as you add more processes, your state objects will be distributed out across them. Underneath, what's actually happening is that each of these state objects is managed by an actor in the system. So because of how we built it, with the actor model underneath,
more than 1 thread can never be concurrently accessing your particular bit of data, so there's no way that you're ever gonna have race conditions or whatnot where more than 1 thing is trying to update that data at a time. It means that on each 1 of your state objects, you'll end up serializing things. And there are also, potentially, depending on how things work, certain ordering guarantees that we can provide based on that.
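A plain-Python sketch of the idea: with the worker list as the only shared knowledge, every node can compute the same key-to-worker table, so each of the 26 partitions has exactly one owner and no locking is needed. This illustrates the concept, not Wallaroo's internal routing code.

```python
# 26 state partitions keyed by first letter, deterministically assigned
# across however many worker processes are in the cluster.
import string


def build_routing_table(workers):
    """Assign each letter partition to a worker using only local knowledge:
    every node computes the same table from the agreed worker list."""
    keys = list(string.ascii_lowercase)
    return {key: workers[i % len(workers)] for i, key in enumerate(keys)}


def route(word, table):
    # every message for the same key lands on the same worker, so a
    # single actor owns each state object and no locks are needed
    return table[word[0].lower()]


if __name__ == "__main__":
    one = build_routing_table(["worker-1"])
    two = build_routing_table(["worker-1", "worker-2"])
    print(route("hello", one))  # all 26 partitions live on the 1 process
    print(route("hello", two))  # with 2 processes, 13 partitions each
    print(route("zebra", two))
```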
[00:21:05] Tobias Macey:
And when reading through the documentation, 1 of the things that you call out is the capability for idempotent processing of the computational state. So I'm wondering if you can dig into that a little bit, particularly in terms of how you ensure exactly once processing.
[00:21:31] Sean T. Allen:
Yeah. So "exactly once" is a very overloaded term, so we like to talk about exactly once processing. We definitely might end up delivering messages more than once inside of Wallaroo, but we have a work acknowledgment protocol, which we can combine with a write ahead log that we use for state changes, so that if there's a failure, when we bring stuff back we can make sure that we're not gonna process the same message twice within Wallaroo. And I think the "within Wallaroo" is a very important part of that, because within any system, if you're doing some type of exactly once processing like what we're talking about, what you're really saying is that you have at least once message delivery: if you don't get any acknowledgment that a message was sent and processed, you can resend it. But if you just have at least once message delivery, that means you can have duplicates and you can process stuff more than 1 time, which means that whoever's on the receiving end has to have some type of deduplication, or the operation has to be idempotent. Otherwise, you're gonna end up with incorrect results.
So once you're at the boundaries of Wallaroo, we can't make that guarantee unless you're interfacing with another system which can also make that guarantee. So when you're sending your results out of a Wallaroo system, if you are sending them using Kafka, for example, where Kafka has an idempotent producer, then you can continue to get exactly once semantics for your actual system as a whole. But if you were sending this out over TCP or something like that, to a receiver which was just taking stuff and writing it out to a file, for example, which does no deduplication, then you could end up with the same line appearing in your file more than once, because you're no longer talking about what's internal to Wallaroo but about a larger system as well. So we definitely like to draw that distinction between what's inside Wallaroo as an application and where it interfaces with external systems as part of a larger system as a whole.
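Here is a sketch of what deduplication at that boundary looks like for a sink that only has at-least-once delivery to work with. The message-id framing is an assumption for illustration, not a Wallaroo wire format.

```python
# If delivery is at-least-once, a sink can get effectively-exactly-once
# results by remembering which message ids it has already applied.
class DedupingFileSink:
    def __init__(self, path):
        self.path = path
        self.seen = set()  # in production this set itself must be durable

    def write(self, message_id, line):
        if message_id in self.seen:
            return  # duplicate redelivery after a retry; drop it
        self.seen.add(message_id)
        with open(self.path, "a") as f:
            f.write(line + "\n")


if __name__ == "__main__":
    sink = DedupingFileSink("results.txt")
    sink.write(1, "hello => 1")
    sink.write(1, "hello => 1")  # redelivered; written only once
    sink.write(2, "world => 1")
```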
[00:23:55] Tobias Macey:
And going back to the topic of state management, is the state persisted in any way for longer periods of time? Or do you only maintain it in memory, and within that write ahead log, for long enough that the transaction can be completed, with a cleanup phase to then remove that particular piece of state?
[00:24:17] Sean T. Allen:
So we don't currently have a way to expire state which is in memory. We've been talking to some folks who have use cases where it's more like they wanna keep stuff in memory, but maybe only for 10 or 15 minutes, and then they wanna have it expire. Currently, when you create a state object, it's going to live forever in memory; that idea of expiring state is something that we're looking at the right way to do. And then you can run in 1 of 2 modes: we have a resilience mode and a non resilience mode. If you're trying to get the absolute fastest performance that you can, you'd wanna run in non resilience mode, which means that all of your state is only in memory, and if you were to have a node crash, there's nothing you can do to get it back. For some scenarios that might be acceptable, where performance is more important to you than anything else. We also have a resilience mode, which right now involves taking any of those state changes and writing them out to a local write ahead log as they happen, and then also writing out, as part of our acknowledgment protocol, how far along in the write ahead log we've actually gotten for work that's acknowledged. So if you have a crash, you can come back to whatever the last acknowledged point was, and we can kick off our recovery protocol to replay messages which weren't acknowledged. That's what you can do in terms of resilience. We're actively looking at replicating data across different nodes when you have a cluster of machines, but right now the resilience strategy is to just use that local write ahead log.
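A minimal sketch of the write-ahead log plus acknowledgment watermark idea described here, assuming a simple JSON-lines layout; the file format and naming are illustrative, not Wallaroo's on-disk format.

```python
# State changes are appended to a local write-ahead log, an acknowledgment
# watermark records how far processing is known-good, and recovery replays
# only the unacknowledged tail.
import json
import os


class WriteAheadLog:
    def __init__(self, path):
        self.path = path
        self.ack_path = path + ".ack"

    def append(self, seq, state_change):
        with open(self.path, "a") as f:
            f.write(json.dumps({"seq": seq, "change": state_change}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # durable before we report success

    def acknowledge(self, seq):
        with open(self.ack_path, "w") as f:
            f.write(str(seq))  # watermark: everything <= seq is acknowledged

    def recover(self):
        acked = -1
        if os.path.exists(self.ack_path):
            acked = int(open(self.ack_path).read())
        with open(self.path) as f:
            entries = [json.loads(line) for line in f]
        # replay only the entries past the last acknowledged point
        return [e for e in entries if e["seq"] > acked]
```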
[00:26:02] Tobias Macey:
And in order to maintain performance when you are taking advantage of that distributed state capability, do you use some form of distributed hash table to ensure that messages are consistently delivered to the same set of nodes, based on the keys that you care about for splitting them out, so that you don't have to worry about any sort of coordination between nodes, or synchronization of data between those nodes to ensure you have redundancy across them?
[00:26:30] Sean T. Allen:
So everything that we strive to do, with that background, is that we want to do everything based on local knowledge, not global knowledge, because global knowledge is a point of coordination which is going to end up making things slower. So pretty much, your application starts up, and say you're running in just 1 process, to make it simple, and you have, like in that word count example we talked about earlier, 26 state objects which you're managing. We have knowledge, built at the time the application starts up, of how to do routing for any of those 26 state objects.
If you were to start up with 2 processes, where the processes are talking to each other over TCP, then when it starts up, we know there are 13 state objects on each of those particular machines, and we know how to route to them based on local knowledge at that point in time. If you were to change the size of the cluster, there's a period of time where the state objects have to be shuffled around to take advantage of the new cluster size, or to shrink to the new cluster size. During that period of time, the layout and the routing get reconfigured, where you effectively have 1 of the nodes become a leader for deciding where data will live, and then coordinating that it's all moved to the right place.
But, yeah, there's no global understanding, nothing looking down on where everything is. It happens when the application starts up and when there's a change to the cluster size, and otherwise, at any of the points where you might route a message, we know where you're supposed to be routing it. Actually, in 1 of our really early versions of Wallaroo, we had a per process object which was responsible for knowing where to route stuff, and that was our number 1 performance bottleneck. We had an application which was designed to see what overhead we were adding in, so it was trying to process large numbers of small messages, mostly just shipping those small messages around while doing minimal processing on them. And we couldn't process more than probably 50,000 messages a second with that first version, which had this per process global routing registry.
And when we removed that, we got a more than 10x improvement in speed just from removing that 1 bottleneck. A lot of our work as we went along was: how do we take knowledge which is currently global and move it into local knowledge? Really, like: okay, this is gonna be there, so we have to put this here, and it'll always be there. It leads to better performance and whatnot. It can certainly be a little trickier to follow along than if you just had a global object of, like, hey, where do I route this? Oh, okay, you say to route it there, I'll route it there.
Tobias Macey:
Beyond the considerations of trying to reduce the amount of global knowledge that's necessary, what have been some of the other more challenging aspects of building the Wallaroo platform?
Sean T. Allen:
We're talking about building a system that allows you to run arbitrary user code, with arbitrary user data objects, in applications that are gonna have topologies of an unknown variety. There's basically an infinite variety of ways that things can be strung together.
And there's an awful lot of surface area there, of stuff that you need to test to know that things work correctly. More than anything else, it was figuring out how we go about testing things: how do we know that this is working correctly? It was performance, and it was testing and correctness, how do we know these things are correct. We invested more time in those 2 areas than anything else, and that's certainly been a really big area. And if you look at it, there's an awful lot of stuff out there. There's Kyle Kingsbury's Jepsen work.
There's the work that Peter Alvaro did on lineage driven fault injection. There's all the stuff that folks are doing with chaos engineering, a lot of which has come out of Netflix: how do you take a distributed system and prove that it's working, or, in the case of chaos engineering, more like, we think it'll be able to survive X type of failure. We have some tools that we built for ourselves in order to do this, but the amount of time it takes to build those tools, or to set up tests for those types of things, and then to be able to do it in a repeatable fashion...
There's an awful lot of engineering effort that goes into that.
[00:31:43] Tobias Macey:
Yeah. It's sometimes shocking how little consideration organizations or engineering teams will put into the amount of effort and knowledge and skill that's necessary for proper testing of software, in order to be able to ensure that it's as reliable as it needs to be for production use cases. All too often, it's left as a last consideration that is given short shrift in the overall engineering budget.
[00:32:12] Sean T. Allen:
Yeah, it's very hard. So, like I touched on briefly at the beginning, the first thing we did, and we did this as the way we were going to evaluate using Pony, was to create a really simple version of what we wanted Wallaroo to do. We did that in Python, where you had Python processes that were communicating with each other over UDP. And we started to build testing framework tools in Pony around it, where we said: okay, we'll try Pony out for this. We started building a framework of tools for: how do we know that we didn't drop messages?
How do we know that we didn't duplicate messages? How do we know that, if we sent 1, 2, 3, 4, and 5 into an application which is supposed to double a number, we got the correct results out the back side? So we built these tools, and that's where we tried out Pony, but we also got at least a general framework for how we wanted to go about testing Wallaroo. And 1 of the decisions we made early on came from realizing that we were gonna be spending an awful lot of time rewriting the guts of Wallaroo over and over again to facilitate better performance.
The strategy we really ended up with is that the vast majority of our testing is end to end integration tests, where we run the happy path, and we also inject various errors into our runs, disconnecting nodes from 1 another and whatnot, in an attempt to make sure that we get correct results. This is really effective for finding obvious bugs in the interaction between stuff. But it also comes at a cost: you can have a bug which is the result of a timing issue between some number of nodes, which you might only see 1 out of, I don't know, 100 times or something like that. We actually fixed 1 today which only happened at the very beginning of when data was being sent in, with just this specific timing at the start of an application starting up. Our tools found it eventually, but that bug had probably been there for a long time. And we don't know how else we would have found it; there's no unit test that would have caught it, because everything at the unit level was technically doing the right thing. It was just the interaction of stuff as it came together. So, yeah, we've put a lot of effort into that. I have a couple of talks: 1 that I gave at QCon in 2016, and then the same basic talk, just updated, a few months later at Code Mesh in London, where we really talked about that. It was called "How Did I Get Here? Building Confidence in a Distributed Stream Processor." The talk is more or less a big love letter from me to Peter Alvaro and the inspiration that his lineage driven fault injection paper gave us.
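The doubling example above translates into a simple end-to-end check: send a known sequence through, then scan the output for drops, duplicates, and wrong values. Here is a sketch with the cluster plumbing stubbed out; in a real harness the send/receive functions would talk to a running cluster over the network while a separate process injects failures.

```python
# Feed a known sequence into an application that doubles numbers, then
# verify the output for drops, duplicates, and wrong values.
def run_app(inputs):
    # stand-in for "send inputs into the cluster, collect what comes out";
    # at-least-once delivery means duplicates are possible here
    return [2 * n for n in inputs]


def check_results(inputs, outputs):
    expected = {2 * n for n in inputs}
    seen = set()
    problems = []
    for value in outputs:
        if value not in expected:
            problems.append("unexpected value: %r" % value)
        elif value in seen:
            problems.append("duplicate: %r" % value)
        seen.add(value)
    for missing in expected - seen:
        problems.append("dropped: %r" % missing)
    return problems


if __name__ == "__main__":
    inputs = list(range(1, 6))  # 1..5, as in the example above
    assert check_results(inputs, run_app(inputs)) == []
    print("no drops, no duplicates, correct values")
```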
[00:35:19] Tobias Macey:
And which, if any, of the axes of the CAP theorem have you optimized for? Or is that something that's tunable for the person who deploys the infrastructure that underlies it?
[00:35:32] Sean T. Allen:
We very much want it to be tunable. Right now, we are very much on the consistency side. If you lose a node right now, you have to bring the node back before we'll continue processing. When we add in replication of data across nodes, you'd be able to lose a node and continue processing, because we can reroute to 1 of the replicas and whatnot. That can be a little tricky, but it's something we'd be able to do. And from there, we could also start to look at tunable consistency stuff. But we pretty much believe that if we're gonna do some type of tunable consistency, you'd no longer be able to use opaque data objects.
Early on, our idea was that we'd provide data objects which you could build yours on top of. That was very much: hey, we'll provide CRDTs and whatnot, and you can do this, and we'll have tunable consistency. But for most of the people we talked to in the beginning, it was: no, I wanna be able to use my own code to define my own data types. When it comes to tunable consistency in the future, we think we'd build that on top of replication of data across nodes, using CRDTs, and have people say: hey, if you wanna take advantage of this, you need to be using the CRDTs which we'd build.
[00:36:55] Tobias Macey:
And just to clarify for anyone who's not familiar with that acronym, CRDT is a conflict-free replicated data type, which is a type of data structure that can be used in a distributed fashion without having to worry about locking or contention when merging and processing them.
[00:37:14] Sean T. Allen:
Yeah. The idea behind CRDTs, for folks who are listening, is that I can have multiple copies of my data structure which can be updated concurrently, without having to coordinate with 1 another. And then there are rules built into the data structure for how you manage any conflicts that might exist. You can pick different types of data structures where there are different rules: for a set, for example, if something was added and removed and you see both of those, should the thing still be in the set or should it not be in the set? That's a rule that's baked into the data structure itself. There's a lot of interesting work out there on this. The SyncFree project, which is a European Union project, has a lot of great stuff on it, as does some of the work that was done at Basho while they still existed. There's a lot of really interesting stuff in that space, and it's just starting to make its way into industry over the last couple of years.
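As a concrete instance of the set example, here is a sketch of a two-phase set, one of the simplest CRDTs, where the rule baked into the structure is that a removal wins and an element cannot be re-added. Two replicas can be updated independently and merged in any order without coordination.

```python
# A two-phase set CRDT: merge is a union on both internal sets, which is
# commutative, associative, and idempotent, so replicas converge without
# locks; the "remove wins, no re-add" rule lives in the structure itself.
class TwoPhaseSet:
    def __init__(self):
        self.added = set()
        self.removed = set()  # tombstones; removal is permanent by rule

    def add(self, item):
        self.added.add(item)

    def remove(self, item):
        self.removed.add(item)

    def contains(self, item):
        return item in self.added and item not in self.removed

    def merge(self, other):
        merged = TwoPhaseSet()
        merged.added = self.added | other.added
        merged.removed = self.removed | other.removed
        return merged


if __name__ == "__main__":
    a, b = TwoPhaseSet(), TwoPhaseSet()
    a.add("x")                  # replica A adds x
    b.add("x"); b.remove("x")   # replica B adds then removes x, concurrently
    print(a.merge(b).contains("x"))  # False: the remove rule wins
```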
[00:38:14] Tobias Macey:
And for somebody who is convinced and wants to use Wallaroo as the foundation for their next project, what is involved in getting started and building an application on top of it?
[00:38:26] Sean T. Allen:
So we're getting ready to release a new version next week, where people will be able to pull a Docker based environment which they can use, where everything will already be built for them for the Python stuff, and they'll be able to work with that. You can go to our website, wallaroolabs.com, or you can go directly to the GitHub repo, which is github.com/WallarooLabs/wallaroo. There are links there to the documentation, which includes installation instructions on how to get started, setting up on either Ubuntu or OS X. For production type usage, you can use any Linux system, but we wanted to provide step by step instructions for Ubuntu, because that's what most of the people we were talking to are using. The setup is not as simple as we would like. We have it laid out where, if you follow these directions, it should build just fine on Ubuntu, but we have the Docker stuff coming out where it can just be: I'm gonna docker pull, and all of this is set up for you. So there's a little bit of a hump there compared to just curling a thing and having it installed, but we're working towards that. Beyond that, what we really have is a number of example Wallaroo applications that we walk you through, which demonstrate different parts of the API, that you can go through and play around with. After that, we have a mailing list that you can sign up for where you can ask questions, we have an IRC channel, and we're very happy to help people get started with whatever it is that they have.
Time zone wise, most of us are on the East Coast of the US, but we have folks as far west as Vancouver and as far east as The Hague. So if people stop by our IRC during business hours for 1 of those time zones, it's quite possible that there'll be somebody there who can answer your question. If not, we have it all logged, and we answer any questions after the fact as well.
[00:40:50] Tobias Macey:
And in terms of deployment, what are the infrastructure considerations and resources that are necessary for being able to deploy the application to production? And then what are the scaling factors for being able to scale the application infrastructure out horizontally or vertically as necessary?
[00:41:13] Sean T. Allen:
For the Python stuff, you have the runner application, and then you also have your Python module, where you've written the code for your application. So those need to be distributed together, plus you need to have a working Python environment there for any of the other libraries that you might need. So I would suspect most folks would want to use some kind of containerization for that, although you could also do it where you're using something like Puppet or Chef and you have a standard way to install Python dependencies onto machines that you might already have. But certainly, this is a case where you need to bundle up some Python modules and everything else together with an application.
Docker would be a good way to go about doing that, or some other container solution. Then, in terms of getting Wallaroo nodes to talk to 1 another: basically, right now, when you start up a new Wallaroo worker, you can tell it the address of another 1 that exists somewhere, and it will form a cluster with it. So you could do this where you say: hey, I'm starting up this application and I want there to be 5 nodes, and it'll wait until all 5 join. Or, and this isn't actually a released feature yet, we have this thing that we call auto scale, where you're able to add and remove nodes from a running cluster.
And then, once that's in place, you can go ahead and change the number of nodes, and it'll redistribute your state and whatnot. We have 2 means of getting data into Wallaroo right now: 1 of those is Kafka, the other is TCP. If you're using TCP, you have a TCP connection which is open, you're pumping all your data in, and it then needs to be distributed out across the cluster. So network traffic into that, if you're using TCP, is going to end up being a consideration for how much you can scale.
For the Kafka stuff that we have, currently it has the same bottleneck, but we're in the process of making it so that as you change the size of a cluster, it will distribute the partitions from a particular topic that it's reading from across however many workers you have, in order to spread that out. So, really, it's: hey, you can add new workers as it's going along; there's a fairly simple and straightforward command line option to do that. And beyond that, it's how you scale the data which is coming into your system and the data that's leaving your system, which in part depends on the type of source that you have.
[00:44:19] Tobias Macey:
And we've talked a bit about some of the failure modes that Wallaroo can find itself in, but are there any others that people should be guarding against in their application logic, or in the infrastructure architecture, that they should be considering?
[00:44:35] Sean T. Allen:
So I would say, in terms of the application, no. If there's a problem there, then it's something that we need to address at the Wallaroo level. That's 1 of the big ideas of Wallaroo: all of that nasty infrastructure plumbing can take up so much time while you're building an application. Right? Where, hey, I could write word count that runs on my laptop in an hour or so, but getting it to run at, you know, Twitter scale, where my word count becomes trending topics, the amount of infrastructure plumbing that goes into that is really huge. We wanna put all of that infrastructure into Wallaroo itself, so that you can write to this scale independent API.
I think the big 1 is that people have asked us: hey, can I do this across data centers? And our statement right now is: you probably could, but really we've designed this to work as a cluster within a given data center. If you wanted to do it across data centers, rather than setting it up as 1 cluster, you'd probably be much better off setting it up as clusters that are treated as independent applications that work together with each other in some fashion. I think that's really the big 1. Beyond that, Wallaroo is still definitely young.
Off of the GitHub repo, we have a gotchas page, which includes a number of failure scenarios that we're planning on addressing over the next 2 to 3 months, but which we don't handle yet. If somebody were to take it right now and put it into production, and we really don't think that anybody is going to just yet, they should very much be aware of these gotchas. These are things that we're completely aware of and are working on addressing. Like: okay, we can handle 1 failure of this sort, but if we have 3 nodes that fail all at once, our recovery protocols currently don't handle that correctly. Things like that, that we're working on.
[00:46:53] Tobias Macey:
And what are some of the situations or problem types that Wallaroo would be poorly suited for, where you would advise someone against choosing it as the implementation target?
[00:47:03] Sean T. Allen:
Yeah. So right now, we're very much about event by event streaming data stuff. And a lot of problems that people don't think of as streaming problems actually really are streaming data problems, but there are some which just aren't. Like, there are problems where you wanna do really large matrix math or something like that; that's not a streaming problem. That's suited to tools that are more like the HPC tools a lot of people were using in the nineties, when they were putting together Beowulf clusters, or, like, Dask, for example, for Python, which has some stuff where it's good at handling things like that. That's not in the wheelhouse of what Wallaroo does at this point in time.
There's a whole segment of things that people consider batch type jobs which really are streaming jobs; they just don't look like it. Those are ones that we're gonna be adding support for in the future, and we already have a blog post on how you can simulate with Wallaroo what looks like a batch job for some types of stuff. But really, right now it's: hey, I need to do some type of event by event processing. That's Wallaroo's sweet spot right now. For example, we just started working with somebody where they're taking Twitter follower data that they have in files, and it's gonna be shipped into Wallaroo, and they're building a graph of the relationships, modeling that in it. And then when it's all finished and done building that graph, when all the files have been processed, we're gonna send another signal to it, which will cause it to output the resulting graph. That might sound like a batch job until you actually look at it and go: oh, hey, that is a streaming job. But, yeah, certainly ones where you're trying to parallelize matrix type operations or something like that, where you need access to large segments of the dataset at a time, are not really a Wallaroo job.
[00:49:18] Tobias Macey:
And what are some of the most interesting or unexpected uses of Wallaroo that you've seen?
[00:49:24] Sean T. Allen:
I can't really say that there's been an unexpected 1 so far. I think there have been a lot of things we're talking to people about which have been interesting, or that didn't necessarily occur to us when we first started building. Like, there are a lot of ones where we talk to folks and it's really about: I wanna have this stuff kept around in memory for a while. In a lot of ways, what they're looking for, and this is something that Wallaroo could provide eventually but doesn't right now, is something like Redis, where they could distribute data structures around across a number of machines and run computations on those machines, rather than having to pull the data structures back and forth, and then also, at the same time, be able to expire those things at some point in the future.
We've talked to more than a couple of people about that, and it's not actually very far from what Wallaroo does; the 1 big missing part would be the expiry. But I never really thought, oh, hey, that's how people would be thinking of it. After I thought about it for a little bit, I was like: oh, well, that makes a lot of sense, because 1 of the reasons why we put this in-memory state in Wallaroo in the first place is that having to run external data stores can, for some applications, be a severe bottleneck, or can add operational overhead and whatnot.
But the "hey, we want our data to be short lived rather than long lived for this application", that 1 definitely took me by surprise, as it wasn't really the type of use case that we were thinking about in the beginning.
[00:51:17] Tobias Macey:
Yeah. I can see situations where being able to set a TTL on a particular state object would be useful, because if you can't process it within that time horizon, then it's not worth bothering with after that point; it's no longer useful information.
[00:51:35] Sean T. Allen:
Yeah. I think the first time we came across this was with the infrastructure monitoring company. And the use case there really is: they're monitoring for alerts from customers, but not all of their customers are always sending in data. Right? And the alerts are based on a time window, like 5 minutes or 10 minutes or whatever. So if you haven't heard from the customer in, say, 10 minutes, and it's supposed to be a 5 minute window, there's no point keeping their information around about what any of their alerts are. You can go ahead and drop that, and then when they send in some data every now and then, you can go ahead and look up their information again. That's as compared to a lot of the stuff we were first looking at, where it's a continuous thing for 1 customer.
It was when we started talking to folks who would use this to support many customers' data at one time that the "hey, we want to be able to expire this stuff" requirement really started coming up a lot more.
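Here is a minimal Python sketch of that expiry idea, with invented names and window lengths: each event refreshes a per-customer timestamp, and customers who have been silent past their alert window are dropped from in-memory state rather than kept around. This is illustrative only, not how Wallaroo or the monitoring company actually implement it.

    # Hypothetical sketch: TTL-style expiry of per-customer alert state.
    import time

    WINDOW_SECONDS = 300                   # e.g. a 5-minute alert window
    state = {}                             # customer -> (last_seen, alert_data)

    def on_metric(customer, payload, now=None):
        now = now if now is not None else time.time()
        state[customer] = (now, payload)   # refresh the TTL on every event

    def expire(now=None):
        """Drop any customer whose window has elapsed with no data."""
        now = now if now is not None else time.time()
        stale = [c for c, (seen, _) in state.items()
                 if now - seen > WINDOW_SECONDS]
        for customer in stale:
            del state[customer]

    on_metric("acme", {"cpu": 0.9}, now=0)
    expire(now=301)                        # acme was silent past the window
    print(state)                           # {} -- state dropped, not retained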
[00:52:33] Tobias Macey:
And what are some of the things that you have planned for the future of Wallaroo that we haven't touched on yet?
[00:52:39] Sean T. Allen:
We're working on a more Pythonic API, which we're planning to have out by the end of December. We have a Go API that we're working on. We have the autoscale functionality, which is dynamic cluster sizing, that we're working on. We also have a project where we want to be able to handle a wider variety of failures; right now, if you have multiple failures that occur all at once, our recovery protocols won't necessarily handle them. We have a project in GitHub for that.
We also have replication across nodes that we're really looking at, and we're looking at providing API support to make it easier to take those batch-type jobs, which can be streaming jobs, and do them in Wallaroo. In a lot of ways, that would give you what Flink does, where underneath it's a streaming engine but you get a batch-like API to work with. Those are some of the higher-level ones. We also have some that are a bit more inside baseball, like the ability to dynamically add new partitions while the system is running, which would be really great. If you have a system where you're keeping track of customers, or even if you're just counting words, you wouldn't have to define ahead of time how many partitions you're going to have for your data. You'd simply say, I'm going to partition this by customer or by word, and if I haven't seen this word before, create state for it, put it somewhere in the cluster, and start routing data to it. That's going to be really powerful; it will allow things to scale out more and make the API easier. We also have work around additional performance improvements for the Python support, and support for more sources and sinks.
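A toy Python sketch of that dynamic-partitioning idea, with invented names: no partition list is declared up front; the first time a key (a customer, a word) is seen, state is created and placed on a worker, and subsequent events route to it. Placement here is simple round-robin for brevity; a real system would more likely use something like consistent hashing.

    # Hypothetical sketch: create partitions on first sight of a key
    # instead of declaring them ahead of time.
    class Cluster:
        def __init__(self, n_workers=4):
            self.workers = [dict() for _ in range(n_workers)]  # worker-local state
            self.placement = {}                                # key -> worker index

        def route(self, key, value):
            if key not in self.placement:
                # First sighting: pick a worker and create state for the key.
                self.placement[key] = len(self.placement) % len(self.workers)
            idx = self.placement[key]
            self.workers[idx][key] = self.workers[idx].get(key, 0) + value
            return idx

    cluster = Cluster()
    for word in ["to", "be", "or", "not", "to", "be"]:
        cluster.route(word, 1)             # word count, no predeclared partitions
    print(cluster.placement, cluster.workers)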
We're also looking quite seriously at whether we should spend a lot of time doing a lot of integrations with all of AWS's offerings. An awful lot of the people we're talking to or working with are running in AWS, so should we have out-of-the-box support for Kinesis and a lot of those other services? That's some of what we're looking at. But right now, the new Python API, the Go API, and the autoscaling are the big three, along with a lot of work around the various failure modes that we know we don't fully support right now, and replication of data so that you have a fully replicated cluster.
[00:55:29] Tobias Macey:
And are there any other topics that we didn't talk about yet that you think we should bring up before we close out the show?
[00:55:35] Sean T. Allen:
I'm sure once we're done I'll think of some. But no, I feel like we covered a decent amount of stuff. I'm sure that by the time people have reached this point, their brains are already going, oh, that was a lot of information. So no, I think we're probably pretty good.
Tobias Macey:
And for anybody who wants to follow up with you and keep in touch with the work that you're doing, or ask any additional questions that we didn't cover here, I'll have you add your preferred contact information to the show notes.
[00:56:07] Tobias Macey:
And with that, I'll ask you one parting question to give people something to think about. From your perspective, what is the biggest gap in the tooling or technology that's available for data management today?
[00:56:20] Sean T. Allen:
I think the biggest gap at this point is that it's still too hard to do all of this stuff. There are a lot of tools that allow you to throw stuff together, but there are not a lot of tools that make it easy to do the hard things. We have a lot of things that make it easier to do easy things, but we need to spend more time investing in tools that might be a little harder to get started with, but that make the hard problems much easier, which are issues around scale and data integrity. We have a lot of ideas with Wallaroo; we've implemented some of them, and we have a lot more where we want to be part of that. I think there are a lot of other tools that are taking very different, interesting approaches to solving different hard problems. In the end, that's what's interesting to me about various tools: what is the hard problem that I have that this tool is helping me solve? I always have this idea that no matter what you're going to build, the same amount of pain is going to go into it; it's just a matter of where the pain is going to come from.
And I always like to look for what the hard problem is that I really have, and what tool is going to make that hard problem much easier for me, and then I'm willing to accept some level of additional complexity that might come along with it. For example, Haskell as a programming language has a type system which, if you're using it, can provide an awful lot of confidence when you're doing refactorings on a large code base that you wouldn't necessarily get with a language that doesn't have a type system like that.
And right there, it's helping you solve what might be a hard problem for you, where you expect to have a large code base that you're going to need to be able to fearlessly refactor. But in return, you're also accepting that there are fewer people who know the language, and that it will probably take you longer to get your programs to compile. So I'm hoping that within the data engineering space, we continue building more tools which identify a difficult, hard problem that people have, and provide solutions to it for folks who are willing to make the trade-offs that come with it.
[00:58:40] Tobias Macey:
Alright. Well, thank you very much for taking the time out of your day to join me and tell me about the work that you're doing with Wallaroo. It's definitely a very interesting platform, and one that I may even find use for in my own work in the not-too-distant future. So thank you again for that, and I hope you enjoy the rest of your evening.
Sean T. Allen:
Thank you, Tobias. It was great fun.
Introduction to Sean T. Allen and Wallaroo
Sean's Journey into Data Engineering
What is Wallaroo?
Use Cases for Wallaroo
Introduction to Pony Language
Wallaroo's Python API
Distributed State Management in Wallaroo
Idempotent Processing and Exactly Once Semantics
State Persistence and Resilience
Challenges in Building Wallaroo
CAP Theorem and Tunable Consistency
Deploying Wallaroo in Production
Poorly Suited Use Cases for Wallaroo
Unexpected Uses of Wallaroo
Future Plans for Wallaroo
Biggest Gap in Data Management Tools