Summary
One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics, the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure, then it’s worth listening to how Josh and his team are approaching the problem.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customers’ cloud accounts.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the components in your system architecture and the nature of your managed service?
- What are some of the challenges that are inherent to the private SaaS nature of your managed service?
- What elements of your system require the most attention and maintenance to keep them running properly?
- Which components in the pipeline are most subject to variability in traffic or resource pressure and what do you do to ensure proper capacity?
- How do you manage deployment of the full Snowplow pipeline for your customers?
- How has your strategy for deployment evolved since you first began offering the managed service?
- How has the architecture of the pipeline evolved to simplify operations?
- How much customization do you allow for in the event that the customer has their own system that they want to use in place of one of your supported components?
- What are some of the common difficulties that you encounter when working with customers who need customized components, topologies, or event flows?
- How does that reflect in the tooling that you use to manage their deployments?
- What types of metrics do you track and what do you use for monitoring and alerting to ensure that your customers’ pipelines are running smoothly?
- What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with and on Snowplow?
- What are some lessons that you can generalize for management of data infrastructure more broadly?
- If you could start over with all of Snowplow and the infrastructure automation for it today, what would you do differently?
- What do you have planned for the future of the Snowplow product and infrastructure management?
Contact Info
- jbeemster on GitHub
- @jbeemster1 on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Snowplow Analytics
- Terraform
- Consul
- Nomad
- Meltdown Vulnerability
- Spectre Vulnerability
- AWS Kinesis
- Elasticsearch
- SnowflakeDB
- Indicative
- S3
- Segment
- AWS Cloudwatch
- Stackdriver
- Apache Kafka
- Apache Pulsar
- Google Cloud PubSub
- AWS SQS
- AWS SNS
- AWS Redshift
- Ansible
- AWS Cloudformation
- Kubernetes
- AWS EMR
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances.
Go to dataengineeringpodcast.com/linode, that's L I N O D E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.
Upcoming events include the Software Architecture Conference in New York, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey. And today, I'm interviewing Joshua Beemster about how Snowplow manages deployment and maintenance of their managed service in their customers' cloud accounts. So, Josh, can you start by introducing yourself? Sure. Pleasure to be here, Tobias. My name's Josh, and I'm the technical operations lead at Snowplow. I've been heading up that role for the last
[00:01:53] Unknown:
3 years, and I've been working with Snowplow for the last 5. Yeah. I'm responsible for all the cloud infrastructure and maintenance across, currently, 150 plus clients. And do you remember how you first got involved in the area of data management? It was a little bit by accident, honestly. So I started at Snowplow as a data engineer and kind of expressed quite an interest in automation and in getting rid of repetitive tasks, so I kind of naturally moved into infrastructure management. And, yeah, it just kind of went from there. And so can you start by giving a bit of an overview of the overall components
[00:02:29] Unknown:
in the system architecture of Snowplow and a bit about the nature of how you deploy and maintain the managed service that you offer?
[00:02:37] Unknown:
Sure. So I think this one's probably best to start with what the nature of the managed service is first before jumping into the system architecture. So what we offer is essentially what we've coined as private SaaS. What we mean by that is that it is a fully managed service, but it's isolated in a client's own subaccount. So, essentially, what that means is that each client comes to us and gives us, you know, their own subaccount or their own Google Cloud project, and we set up and maintain a full data pipeline stack within that subaccount. So every client has their own isolated infrastructure entirely segmented from every other client.
So it's SaaS in that we manage everything end to end, and we're responsible for all of the running of it. But it's very much not SaaS in that there's no shared tenancy across anything there. That's obviously quite a difficult thing to manage. In terms of what that means from a system architecture, on our side we have a lot of tooling that we've built to manage that, which leverages the HashiCorp stack heavily. So we're using a lot of Terraform to define all of our infrastructure as code. We've got HashiCorp Consul to manage all of our metadata and Vault to manage our secrets, and then Nomad to do that widespread distribution of deploying all of that infrastructure.
So I guess it's different in that sense. There are some parallels, I guess, with companies that would offer to sell you a license to their software, and then you would go away and deploy it yourself. And we, I guess, take that a step further where you're not only buying the license, you're buying the whole stack experience.
[00:04:16] Unknown:
And for people who wanna dig deeper into what Snowplow itself is, I'll add a link to the interview that I did with your cofounder, Alex. Yep. But at a high level, it's an event data management platform for being able to replace things like Google Analytics or Segment. And so in terms of
[00:04:33] Unknown:
the private SaaS nature of your product, what are some of the challenges that are inherent to that deployment model that you're trying to overcome with some of the tooling and automation that you've built out? So the obvious one is that rather than having one big data pipeline to manage, we have 150 of these. And rather than being in one or two or three regions across the world, we're in kind of every region of the world. So there are difficulties, first off, of just the sheer number of servers that we are responsible for, numbering in the tens of thousands. And with quite a small SRE team, that is a big, big challenge in and of itself. With, you know, threats like Meltdown and Spectre at the start of last year and all of these kind of scary security concerns, we're suddenly going, well, we're responsible for all these servers, we have to go and manage that. So that's a big challenge for us in terms of staying on top of all of these systems and making sure that they're always up to date. The other side of it, though, is just how much automation we need to build into it. There can't be anything that really requires any manual interaction or any manual intervention. We have to go top to bottom. Everything must be self healing. Everything must be able to automatically recover, because otherwise we just can't scale our operation at all. Specifically, though, on kind of managing clients in this context, there's obviously a sharp change in dynamic where in normal SaaS, you wouldn't get to see the underlying infrastructure. You wouldn't get to see how things have been set up. You wouldn't be able to go and poke around at any of these things. And what we've had with quite a few clients is that they dig a little bit too deep, or they go and change things that maybe they shouldn't have, or they go and break things that maybe they shouldn't have. And those are quite difficult for us to manage because, obviously, we've gone and deployed something, and we've deployed it in a way that we expect it to work. And then someone else has come in and turned something off or broken something or changed our access, whatever the case might be. So trying to balance that is quite difficult. So we've got a lot of, you know, drift detection style systems constantly checking to make sure nothing has changed and everything is staying exactly as it is. So there are a lot of those challenges with, I guess, what you'd call a shared responsibility model between us and the client in terms of managing that subaccount, which we do struggle with sometimes.
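For a concrete flavor of the kind of drift detection Josh describes, here is a minimal sketch, assuming Python with boto3 and a hard-coded expected-state map; the instance ids and checked attributes are hypothetical, and this is illustrative rather than Snowplow's actual tooling:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical expected state; in practice this would come from Terraform state
# or a central metadata store rather than a hard-coded dictionary.
EXPECTED = {
    "i-0abc123def4567890": {"InstanceType": "m5.large", "State": "running"},
}

def detect_drift():
    """Return ids of instances whose live configuration no longer matches what was deployed."""
    drifted = []
    response = ec2.describe_instances(InstanceIds=list(EXPECTED))
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            expected = EXPECTED[instance["InstanceId"]]
            if (instance["InstanceType"] != expected["InstanceType"]
                    or instance["State"]["Name"] != expected["State"]):
                drifted.append(instance["InstanceId"])
    return drifted
```

A real check would also cover security groups, IAM policies, and tags, and would alert rather than just return a list.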
The other interesting thing with private SaaS is just how auditable and how exposed it is. So as we work with more security conscious clients, we do end up getting audited a lot. And we have a lot of very long conversations about, you know, how things have been set up and how things need to change to suit their particular business needs, where normally you'd go and buy a service and you're not too worried about exactly how they've instrumented it. Suddenly, when we're deploying inside the client's ecosystem, we have to fit their checklist. We have to fit all of their security concerns. We have to pass with their security teams and their SRE teams to make sure that, you know, everything is exactly how it needs to be for them. So we've got a lot of challenges there, not only in, you know, managing and orchestrating and running the whole thing, but also just getting sign off from a lot of these teams. And, you know, is this up to spec? So we have a lot of extras added into the platform as we have these conversations that we have to adjust and manage in such a way that it is still scalable. It is still gonna work for everyone, but we have to add lots of extra things on the fly. And because of the fact that you are running in the customer's account, I'm sure that there's also some measure of cost consciousness in terms of the bill for running all these different resources and handling scaling and trying to minimize the amount of resources that are necessary to keep this running because in a SaaS, the provider
[00:08:25] Unknown:
eats all of those costs and just passes that on in terms of the cost that they charge to the end user. But because the end user, in this case, is running all of this in their own infrastructure, they're much more cognizant of the actual overall cost of running all of these pieces of infrastructure. And so I'm wondering how you handle
[00:08:46] Unknown:
minimization of the resources necessary while still allowing for robustness and scalability in the platform that you're deploying. So that's a really pertinent question. It's a great one to raise. It's a bit of a balancing act. We do have kind of hard rules that we don't wanna breach when it comes to deploying a production environment, which we do get sign off from the client on. So, for example, you've got a production environment. It has to be highly available. So you have to have a minimum of, you know, two availability zones. You need to be setting up enough servers that if a catastrophic data center failure happens, you can survive it. So there are carve outs to say there's a minimum spec for what this looks like. But beyond that, the architecture is flexible enough that we can save costs in a lot of ways. So, you know, we work quite closely with clients on how we do instance reservations, how we right size pipelines, how we right size for their particular traffic patterns. So there is a level of customization and work that we have to go through to make that happen. On the whole, though, we've come to a pretty good level of, I guess, you know, what are the minimum things that we need? And that's where we sort of start. And as clients ramp up traffic, we then tend to have the harder discussions around, okay, to make sure that this is stable at these volumes, we're gonna have to massively upscale Kinesis, for example, to make sure that there are no back pressure issues, or you've got a one second latency requirement, which is gonna require these sorts of changes. So every client is different in that sense. Some of them won't mind there being a bit of latency buildup in their pipeline, and then we can size it down for cost. Some are very, very latency conscious, but far less cost conscious. So it also, you know, in that sense, comes down to what the value of the data is for the company. If it's just a report that runs once a week, we can definitely optimize for cost. If it's something that needs to be run every second, then we need to optimize for performance.
It's a very conversation heavy topic with clients, we find. There are obviously blanket strategies. Instance reservations are a classic one. You know, turning off certain parts of the service as well. It being quite a modular data pipeline, you kind of plug in different exports. For example, in, you know, the real time pipeline that we deploy on AWS, you can send data into Elasticsearch. You can send data into S3. You can send data into Snowflake DB. You can send data into Indicative. You can send data into Redshift. But you don't have to pick all of these targets. So we also work with clients to figure out, well, what is the best way for you to consume this information, and then set up their pipeline accordingly. So they're only paying for what they really need. But you're exactly right. With the fact that it is running in their subaccount, they do eat that cost. On the flip side to that particular point, though, is an interesting one, which is that none of our clients really have to worry about volume based pricing in the same way as you would with a competitor like Segment, for example, or any of the other SaaS analytics providers, which are volume based pricing. So with Snowplow, as you scale up, as you track more, your costs are not going to drastically increase. They will increase linearly with infrastructure costs,
[00:12:16] Unknown:
but there isn't kind of an exponential cost growth that comes generally with volume based pricing. And then another thing about your model is that shared responsibility that you mentioned, because the servers are running in the client's account and they have their own way of managing infrastructure. I'm sure that there are some instances where you have conflicts as far as how they would prefer to handle deployments where they have their own infrastructure automation and configuration management. And then on the monitoring side, I know that you keep track of the health and well-being of the overall system, but I'm sure that the customer is also interested in being able to consume those metrics into their own systems to get visibility. So I'm wondering how you handle that aspect of the responsibility being on your end to keep everything running, but at the same time, the customer wanting to have greater visibility and tight integration with the systems that they already have deployed? Yes. That is a really common theme. On the monitoring side, we do tend to leverage
[00:13:18] Unknown:
on Amazon CloudWatch, and on GCP, Stackdriver quite heavily to alleviate that to some extent. So rather than figuring out how we, you know, pull all these metrics in and turn them back to the client in a nice reportable way, we leverage the cloud's own tooling there and then provide kind of easier ways for them to hook into, you know, SNS topics for getting all of the same alerts that our own ops teams get so that they can look at that. But also, by virtue of the fact that it's in their subaccount, they can explore any metric that's exported to those systems. And that's where all of our metrics live, in those systems. There's very little bespoke monitoring, so to speak, that's coming only to us, that only we have visibility over. So on the monitoring side, that's mostly handled now, which does make our life easier.
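As a rough illustration of that approach, a CloudWatch alarm can point at an SNS topic that both the ops team and the client subscribe to. The sketch below assumes Python with boto3 and uses hypothetical stream, alarm, and topic names rather than Snowplow's actual configuration:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on consumer lag for a (hypothetical) enriched event stream, and notify a
# shared SNS topic so the managed-service ops team and the client both see it.
cloudwatch.put_metric_alarm(
    AlarmName="pipeline-enriched-iterator-age",
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "enriched-good"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=600000,  # alert once records have sat unprocessed for ten minutes
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:pipeline-alerts"],
)
```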
On the customization side, or kind of where we're not fitting the perfect mold, that is often fairly difficult. There are times that we do need to develop more custom solutions, which we'll do kind of on an ad hoc basis. The difficulty for us in offering any sort of bespoke solution, though, is that we are running it across, you know, 150 different stacks. So the name of the game for us has to be consistency across the client base. What ends up happening is that as clients have more security requirements, it becomes part of the kind of standards that we make available for everyone. So any feature that we develop for one is developed for all. And in that way, as we go, we end up being able to tick more of those security boxes without necessarily having to do something bespoke every single time. And, in a lot of cases, we do manage to convince them that making too many changes is not always necessary.
And that we can kind of have that shared responsibility where we run things how we need to run them without needing to change everything. We're yet to come up against someone that really won't let us work how we wanna work. And another element of customization
[00:15:27] Unknown:
is the fact that the overall Snowplow pipeline is very composable and different components within the stack can be swapped for some equivalent system. For instance, the Kinesis that you would run in an AWS account might be replaced with Google Cloud PubSub in a Google account, or if somebody's already running Kafka or Pulsar. And so I'm wondering how you approach that aspect of customization and allowing the customers to be able to specify how they want different elements of the system to be manifested based on their preferences or what they already have running to allow for better integration with the data systems that they might want to integrate with. With the managed service that we offer, at the moment on GCP, we only support PubSub. And on Amazon, we only support Kinesis.
[00:16:13] Unknown:
So for that core part of the pipeline where we're kind of collecting and enriching and storing that data, there's not a whole lot of flexibility just yet, in terms of the fact that we have to own and have a standard part of the pipeline that is the same for everyone. Where we allow a lot of flexibility, though, is what you can plug into the pipeline on top of that. So, for example, if you've got a big internal Kafka cluster, what we see a lot of clients end up doing is streaming all the data that we place into Kinesis into their own Kafka cluster to do larger fan out operations, which Kinesis doesn't support as well as Kafka might do. The key issue there, though, that's worth touching upon is not so much that we don't wanna support, you know, loading into someone else's data stream; it's that we support SLAs on the latency of data into those data streams. And if we have a third party dependency that we can't control, that's very difficult to meet.
For example, if a client did want us to load directly into their Kafka cluster, but we had no authority over said Kafka cluster and they had a massive spike in traffic, there's no way for us to really account for that. There's no way for us to go and say, hey, you know, we need to increase the size of your Kafka cluster because the pipeline can't keep up. So there is a certain need for a separation of concerns there as well so that we can ensure the health of the pipeline, and having too many external dependencies, or any external dependencies, makes that exponentially more difficult. Not only in terms of, you know, making sure that the pipeline is working, but even just debugging why something's happening becomes harder because you're not in control of the entire fabric, the entire system. So the core part is very locked down to what we support, and we very much want to be in control of that. But yeah, as I said, forking off the pipeline into other systems is definitely something we see a lot of.
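A minimal sketch of that fan-out pattern, assuming a Lambda function triggered by the enriched Kinesis stream and forwarding records into a client-owned Kafka topic; the broker and topic environment variables and the kafka-python dependency are assumptions for illustration, not part of the Snowplow stack:

```python
import base64
import os

from kafka import KafkaProducer  # kafka-python, assumed to be bundled with the function

producer = KafkaProducer(bootstrap_servers=os.environ["KAFKA_BROKERS"].split(","))

def handler(event, context):
    """Forward each record from the Kinesis trigger into the client's Kafka topic."""
    topic = os.environ.get("KAFKA_TOPIC", "snowplow-enriched")
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded inside the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        producer.send(topic, payload)
    producer.flush()
```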
So writing custom Lambda functions or Google Cloud Functions or, you know, streaming that data with Kinesis to Kafka connectors, are quite common patterns to kind of push that data into new and different places. And we obviously also have our own analytics SDKs that can plug in on top of that stream to then do custom mutations before sending it somewhere else as well. And so in the overall system,
[00:18:45] Unknown:
which components are the ones that are most subject to variability in traffic or resource pressure? And what are some of the strategies that you use to ensure proper capacity as there might be burstiness in the events that are being ingested or,
[00:19:00] Unknown:
being able to meet some of those latency SLAs that you mentioned? So, obviously, we leverage a lot of auto scaling to account for that burstiness, but not everything is auto scaling. So the biggest issue that we come up with in terms of dealing with that burstiness is generally to do with how fast we can get new EC2 nodes online. But, generally, it's with the few non autoscaling components within the pipeline. If we take GCP for an example, we're using PubSub there. Now PubSub is this beautifully elastic auto scaling system where you can throw as much as you want at it, and it will scale to meet demand without any issues.
Where we run into issues is on the flip side, in how we run AWS, which is using Kinesis, which has a kind of fixed size. And Kafka or Azure Event Hubs would have kind of the same sorts of issues, where you've got much more fixed, sharding based ingress limitations. There are kind of two ways we tend to tackle that. One is that we've built our own sort of proprietary auto scaling tech that kind of goes and does reshards of Kinesis as and when needed. But that tends to fall over quite quickly at higher sharding rates. Where I'm talking, you know, if you're getting into the 100 or 200 plus shards, doing a resize can take anywhere up to 30 to 35 minutes, which is often far too slow for a very big burst in traffic. So in these cases, we tend to work with clients and look at their traffic patterns and look at, you know, where they're going to be evolving up to. We do quite a bit of trend analysis there, and then we can say, well, if you wanna keep the pipeline healthy, we're gonna have to get this much of a buffer in place for this non auto scaling component. Otherwise, we're gonna run into issues that are not gonna be recoverable very quickly. This is obviously not the best strategy. Instead of having a nice auto scaling elastic architecture, you've suddenly got hard coded capacity, which means that we have to have around the clock ops availability to, you know, check for alerts, check when we're starting to reach those thresholds, and go and scale that up. We're kind of actively looking at alternatives at the moment for how we can swap out those systems for something a bit more PubSub esque, especially on Amazon. So, you know, how could we maybe swap out Kinesis for SQS and SNS for that similar sort of elastic auto scaling queuing with fan out, rather than leveraging something like Kinesis. So on the streaming side, that's probably the biggest bottleneck.
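A simplified sketch of that kind of reshard logic, assuming Python with boto3 and roughly one megabyte per second of ingress per shard; the metric window, headroom factor, and stream names are hypothetical, and Snowplow's actual autoscaler is more involved than this:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")
kinesis = boto3.client("kinesis")

def desired_shards(stream, headroom=1.5, window=300):
    """Estimate a shard count from recent ingress plus a safety buffer."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="IncomingBytes",
        Dimensions=[{"Name": "StreamName", "Value": stream}],
        StartTime=datetime.utcnow() - timedelta(seconds=window),
        EndTime=datetime.utcnow(),
        Period=window,
        Statistics=["Sum"],
    )
    bytes_per_second = sum(point["Sum"] for point in stats["Datapoints"]) / window
    # Each shard accepts roughly 1 MB/s of writes, so size for current traffic plus headroom.
    return max(1, int(bytes_per_second * headroom / 1_000_000) + 1)

def maybe_reshard(stream):
    """Scale the stream up if recent traffic is outgrowing the open shard count."""
    summary = kinesis.describe_stream_summary(StreamName=stream)
    current = summary["StreamDescriptionSummary"]["OpenShardCount"]
    target = desired_shards(stream)
    if target > current:
        kinesis.update_shard_count(
            StreamName=stream,
            TargetShardCount=target,
            ScalingType="UNIFORM_SCALING",
        )
```

As Josh notes, resharding gets slow at high shard counts, which is why pre-provisioned buffers and trend analysis still matter.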
The rest of it is all quite easy to auto scale, and it's generally quite fast. The other area we have issues is then with downstream data stores that are, by nature, a lot more static in size. So Snowflake DB, in a lot of ways, has solved that, and BigQuery obviously has solved that as well, where it's kind of unlimited storage capacity. You can just throw whatever data you want into there, and it's backed by blob storage. So you have your data lake in that sense. Where we then run into some issues, which, you know, Redshift is starting to address with their new instance types that they've released, is that Redshift and Elasticsearch still serve as a weak point in the architecture because there is capped capacity. And especially when you're looking at a streaming pipeline and you wanna stream data in as quickly as it's arriving, big spikes in traffic can overwhelm CPU resources. They can suddenly overwhelm the amount of provisioned, you know, capacity that you have for these systems, which can cause service interruption and downtime. So we have, again, around the clock teams that are waiting for these alerts to go and upscale systems as and when we breach those thresholds.
Yeah. I guess the strategies there are to look at patterns and look at, you know, how has my tracking been evolving over the last months? How has the pipeline handled spikes in the past? And then sizing it up with a healthy buffer to make sure that when these things happen again, you're covered. But there's a limit to what we can do, especially running so many of these systems, to then, you know, try to catch all edge cases, which is why we still need that around the clock ops team to deal with that. And then another interesting point for me is the fact that you ended up going with Nomad
[00:23:35] Unknown:
as your substrate for being able to handle bin packing and managing the processes for all the different components that you're running, where a lot of the mind share right now is with Kubernetes. And so I'm wondering if you can talk through the overall decision making process that led you to that conclusion and maybe talk a bit about some of the ways that your infrastructure
[00:23:56] Unknown:
management has evolved since you first began tackling this problem? So there's just a quick point of clarification there. For the client side pipelines, we are actually leveraging Kubernetes in the GCP pipeline, and we're looking to leverage ECS in the Amazon pipeline. So we only use Nomad internally as the internal orchestration and scheduling fabric. The main reason we've chosen Nomad for that task is really just its deep level of integration with the rest of the Hashi stack. So it seemed natural to say, well, we're using Terraform. We're using Consul. We're using Vault. We're using Packer.
We should use Nomad as well, because of all the kind of nice, neat native integrations. But on the latter point, and I can touch on why we've used ECS as well, possibly after. But on the latter point of where our evolution has happened, that's a very long history. Early on we could kind of get by with a lot of manual work. So the decisions we made at that point in time were very much, you know, we could do a human driven approach to deploying infrastructure. We could take some measures, like, yeah, we'll put some of it in CloudFormation, and we'll have some checklists, and we'll go through them. And we'll just kind of get things running as quickly as possible. So it all started with just Ansible playbooks that would then run some CloudFormation, and that would spin up the pipelines. When we first started writing that automation, we made a few big mistakes. Well, we made several big mistakes. Mistake number one was not making the infrastructure granular. So where we'd, you know, be deploying a VPC and an Elastic Beanstalk stack and maybe some Amazon S3 buckets, maybe a Redshift cluster as well, we'd put all of that in one giant CloudFormation template. And at the time that made perfect sense. Right? We had one version of what we're deploying, and we go and deploy it. And then we'd run into all these interesting issues where you'd have clients say, well, I don't want this part of it, or I need something different in this part of it. You go, okay, so now I've got a whole new version of my stack. So you fork that stack. You get this version, you get that version. But then you've got all those permutations. And what we quickly realized is that what you need in infrastructure automation is not that big bang stack. You need very, very high granularity in all the components.
And in the same way as you mentioned that Snowplow is very composable from lots of microservices, we needed to approach infrastructure in much the same way, or even more composable. So where we are now in that journey is that we went from big bang Ansible playbooks with very large CloudFormation templates. We then moved that to a bespoke tooling system, I guess, which was still based on Ansible and CloudFormation, but had a lot of that granularity starting to appear, which worked for quite a long time, but still wasn't very flexible. Part of that was probably our use of CloudFormation, in that we found that to be a little bit awkward to work with and a little bit awkward to make very flexible. But the key journey there was really about going from kind of low granularity to high granularity. And then we ran into a further issue, which was about state. So up until we started using Terraform, all of our deployment tools had been completely stateless.
We'd leveraged essentially the fact that, you know, we were using CloudFormation, so we could just query the outputs of CloudFormation templates, or, you know, we were going and writing API calls to go and check if certain components had been deployed and what configuration they had at the time. So it was all kind of just in time resolution. And that was really flexible and kind of reduced any need for us to cache or worry about state anywhere. But it also made us very lazy, in that we weren't caring about that, so we weren't necessarily making the right decisions. Then as well, every time we wanted to expand the system, we had to go and fetch all of this information all the time. And it also meant it was very hard to build a view of what had been deployed. It made it very difficult to go and write a tool that could just build a report and say, hey, this is everything that's deployed, this is the current state of the entire system, because it was all stateless. So to do that was all very expensive, you know, long API calls and checks that were just not very useful.
And that bespoke system, being what it was, was very hard to then turn into a nice API. It was also impossible to train the rest of the team on it, which I quickly discovered as I started hiring more SREs and trying to train them on using this tool, that no one could actually use it easily. So that's when the kind of next part of our journey began, which was saying, well, hey, let's throw it all out and start again, which we started actually at the beginning of last year. And we settled on Terraform because it was kind of flexible enough to support multiple clouds, which we now do. So we needed something that could have a common instruction language for GCP or AWS and possibly, in the future, Azure or any other cloud that might appear. We also wanted something that a new engineer joining could kind of deal with. They wouldn't have to learn a bespoke configuration language. They'd just have to learn what we built on top of it, which was a massive, massive difference in terms of how well we could support this. And it also had all of the heavy lifting done. It had state. It had integrations with Consul and Vault, which we're leveraging quite heavily for, you know, centralized metadata management and centralized secret management that we could then feed into all of this configuration that we've done with Terraform.
And as well with that, and, you know, as I mentioned previously, with our adoption of Nomad, we've now been able to slap an API on the front of it all, which we call our deployment service, which is very aptly named. That can then go and use this whole ecosystem to go and manage all of this infrastructure. So we've come from, you know, a human driven, choose your own adventure style infrastructure management tool based on CloudFormation and Ansible to kind of this world of API driven, ChatOps driven infrastructure management, which is kind of where we've gone to now.
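To make that pattern a little more concrete, here is a small sketch of pulling client metadata from Consul and secrets from Vault before handing them to Terraform as variables; the key paths, the python-consul and hvac libraries, and the file layout are assumptions for illustration, not Snowplow's deployment service:

```python
import json

import consul  # python-consul client, assumed
import hvac    # Vault client library, assumed

def render_tfvars(client_id):
    """Build a tfvars file for one client pipeline from centralized metadata and secrets."""
    # Non-secret pipeline metadata (region, sizing, enabled targets) lives in Consul's KV store.
    kv = consul.Consul()
    _, entry = kv.kv.get(f"clients/{client_id}/pipeline")
    settings = json.loads(entry["Value"])

    # Credentials live in Vault and are only resolved at deploy time.
    vault = hvac.Client(url="https://vault.service.consul:8200")
    secret = vault.secrets.kv.v2.read_secret_version(path=f"clients/{client_id}/warehouse")
    settings["warehouse_password"] = secret["data"]["data"]["password"]

    with open(f"{client_id}.tfvars.json", "w") as handle:
        json.dump(settings, handle, indent=2)
```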
I guess the other quick thing I'd love to touch on there as well, and that you mentioned, was that Kubernetes is kind of the flavor of the month almost for how everyone's managing containers. I guess on Google, we did roll with Kubernetes. We did that because Google has a fully managed Kubernetes offering, which is very attractive to us because then we didn't have to home roll our own Kubernetes. And that's also a big thing we look for in our implementation: we need minimum overhead in every way, shape, and form. We can't have too much overhead. We can't do too many custom things. When we can use a cloud tool, we use the cloud tool, because that is very important for us for scaling out our operation. And for GCP, Kubernetes seemed like a very good fit. For AWS, though, we're looking in a slightly different direction. There are a few reasons for that. So one is that Amazon's managed Kubernetes isn't quite the same breed as the Google managed Kubernetes. You're still responsible for kind of setting up the underlying worker nodes. So you still need to do that. It was, I guess, a bit like when Elastic Container Service first arrived and you still had to manage all of those EC2 servers yourself.
It's not a fully managed service in that sense. You're still responsible for setting up your auto scaling group, setting up those servers, and kind of hooking them up to the cluster. So there's that reason. But as well, what I found personally, and, you know, take this with a grain of salt, is that there is a lot going on in Kubernetes. And for what we're trying to do with it, it's not that it's overkill, but it does more than what we need it to do. And by virtue of that fact, it costs more than maybe we want it to cost, in terms of, you know, we don't need all of this advanced extra scheduling and management necessarily to just run a couple of pods.
All they need to do is run a simple Docker container that scales up and down. There's no extra service discovery. There's no, you know, internal load balancing that needs to happen. That's all kind of done for us already. So we're looking at ECS as kind of just a very simple container management fabric as opposed to Kubernetes, which, to its credit, is much more powerful. But it's just much more than what we tend to need for the Snowplow stack.
[00:33:16] Unknown:
And in terms of your experience of building out this automation and managing this platform, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:33:28] Unknown:
That's a tough one. I think the most interesting has been, and this is kind of in the time working with clients, just how deeply interested a lot of clients are in kind of how we're deploying it, and how hands on all of these teams want to be. So that's been an interesting and a challenging lesson in terms of having to kind of defend your work constantly, I guess, is the point I'm trying to get to. There's a lot of, you know, if you were developing a system that was for kind of internal eyes only, you don't expect so much attention on it. You don't expect people to rip it apart in quite so many ways. That's been a challenge in developing the stack. It's definitely a good thing to have.
I think if you onboard a client, then every time they kind of audit you, you can only get better. And we've seen that as we've evolved: you know, it's been challenging to try to meet all of these requirements and meet all of these expectations, but that's actually worked out for the better because we've got a much better system at the end of it than we otherwise would have. And I think the other difficulty is really in just figuring out how to manage so many servers concurrently and how to monitor all of them and how to ensure the uptime of all of them. You know, we run into a lot of really challenging scenarios with what we've deployed.
One big issue is actually in kind of anything that has a sort of ephemeral nature, and I think there's another question coming about what things we wanna change in the stack. But batch processes or ephemeral processes are much more fragile than you possibly imagine. And in a situation where you're just one company running one ETL process per day, you probably won't run into these issues, but we often see giant cluster failures across Amazon, which, you know, is massively challenging. As we've scaled out, we've started to be impacted by those a lot more, obviously. So, for example, we might have the EMR API fail across us-east-1 for, you know, a couple of hours. Now, again, if you're running one ETL job, you've got, okay, I've got one failure, one thing that I gotta go and recover. On our side, we've got our support team, which might have 40 or 50 failures that they have to suddenly clear and communicate out to clients, make sure that they understand what's happening. Why is their data late? Why is it not arriving in their data warehouse?
So that's definitely a big challenge. I'm not sure if it's an unexpected challenge, but it's definitely an interesting one to figure out how we deal with. And, you know, just how we deal with any of the scale problems really, as kind of a small SRE team trying to figure out how we manage the infrastructure of so many clients and keep it secure and make sure that, you know, none of them are costing too much. So there are a lot of interesting challenges there that, you know, we're looking to solve in some parts with actual Snowplow monitoring.
So we build a lot of monitoring on top of all of these systems to kind of try to do some trend analysis. And we're hoping to get a lot more time to look at, you know, solving these challenges with machine learning to try to figure out how we can detect trends in data, how we can scale up systems intelligently ahead of anything happening so that we can provide the best possible service. A lot of scale problems.
[00:37:10] Unknown:
And are there any elements of your experience of managing the Snowplow platform that you think are more broadly applicable to data infrastructure as a whole and are worth calling out?
[00:37:24] Unknown:
I think the biggest one is probably just, you know, a general rule: your infrastructure has to be super, super composable. It has to be as granular as you can make it. If you want to be able to evolve it quickly, if you want to be able to change things quickly, if you wanna be able to, you know, attach lots of different pieces together, start it very composable. Don't big bang an infrastructure stack. That will catch you out very quickly. So I think that's probably the biggest learning I've had, and also kind of how you scope your resources. So making sure that you're understanding, is it regional contracts, global contracts?
How do you group things? How do you manage your different infrastructure resources? I think from kind of just managing data pipelines, getting a clear understanding of what you're tracking, why you're tracking it, and how much data you're expecting to collect or want to collect is incredibly important. Making sure that you kind of design it for purpose. I think going too generic with what you're building doesn't really help; really think about, what's the end goal here?
So for example, within our own kind of internal pipelines, we've got, you know, an ops or monitoring pipeline that's really streamlined for dealing with, you know, lots and lots of small metrics. And we've got our kind of business pipeline, which is more geared towards, you know, our longer term analytics of our website and those kinds of things. So figuring out what structure you need to serve the business is also really important.
[00:39:16] Unknown:
And if you were to start over today with all of Snowplow and the infrastructure automation that you're using for it, what are some of the things that you would do differently or ways that you would change some of the evolution of either the Snowplow pipeline itself or the way that you've approached the infrastructure management? So I won't speak to the kind of overall Snowplow components.
[00:39:38] Unknown:
But for the infrastructure side of it, I think the biggest thing is we'd probably not go for using something like Kinesis, or we would have approached it very differently. I think we would have leveraged something that is truly auto scaling. That's probably one of the biggest infrastructure problems we have at the moment: Kinesis and its lack of elasticity. It causes a lot of heartache. And I think beyond that, in terms of managing the infrastructure, where we are now is pretty close to where I want it to be. And, you know, as I mentioned, we did rebuild the entire system last year. So that was with kind of a lot of years of learning, going, how can we do things better? How can we do things differently? I think if I could, though, go back to the start of last year and take a slightly different tack, I would go even more granular in terms of the stack separation and the topology of infrastructure that we've built. We've still gone too big bang, which is making it quite hard to evolve certain parts of the infrastructure stack that we're trying to manage.
So we've coupled components together that should have been split apart. We've made it quite difficult for ourselves, and we're gonna be paying for that in the near future. You know, how do we unpick some of these associations and how can we make it more flexible again? And that's really a problem that we suffer from just because we have so much variance in what our clients want. So there's not kind of one pipeline. There are pipelines of lots of different pluggable components, and needs for, you know, maybe multi region support, multi cloud support for kind of sinking data into a single pipeline, really custom, you know, fabric, custom ETL systems that are required. Which, yeah, just the more granular we are, the easier that would have all been. And not that we've coded ourselves into a hole, but we do have some work to kind of make it that much more flexible again.
But that's really, you know, I guess, just a learning as we go that we need to be super, super flexible, because every client's gonna have different wants and different needs. And we wanna be able to serve them all, but in a manageable way. And that is
[00:42:04] Unknown:
that's difficult if you've only got one version of what you can set up. And what's in store for the future of the Snowplow product and the way that you're approaching management of the service that you're providing for it? So
[00:42:18] Unknown:
hopefully coming soon, and I hope the product team won't get upset at me for saying any of these things. What I'm hoping for quite soon is that from our managed services UI, we'll be able to start managing infrastructure directly from that UI. So at the moment, the way customers interact with that, they have a lot of transparency when they log into the subaccount or into the GCP project to go and look at things, but they don't have, I guess, a lot of visibility over, you know, what is configurable, what all the different options for my pipeline are; that's still somewhat hidden away. So hopefully coming soon, we'll have a lot of that surfaced into the UI, almost like an ETL builder style thing. So if you were a company looking to get a robust, highly available data pipeline, you'd be able to kind of click and drag components, set up a Snowplow collector in GCP that streams data into a Kafka cluster in AWS, and then set up a second collector in AWS so that you kind of have that high availability, not just across regions or across availability zones, but across clouds.
And that's really where I see the infrastructure going, and we'll need to go there for really true, robust, highly available structures. Banking on one cloud is not without risk, and there's definitely a requirement, and there will be more requirements, to have things much more split and much more spread so that you can really have that, you know, 100% uptime, taking all kinds of worries away that, you know, if Amazon has a glitch, or GCP has a glitch, that's not gonna suddenly stop your pipeline from working. It'll still keep carrying on. It's just all that extra failover. The other big change, and I sort of touched on this when we were talking about batch processes or ephemeral EMR processes, is that a lot of that, if not all of that, will be moving to streaming architectures quite soon. So currently there are still some batch processes left. Snowplow will be moving to kind of 100% streaming, which we're very hopeful will result in not only better cost efficiency, especially at scale, but just much higher robustness and stability of the platform in general. There are a lot of difficulties with batch that we struggle to overcome.
So, you know, dealing with what my cluster specification should be between spikes is quite difficult. You can't auto scale a batch process in the same way that you can auto scale a streaming process. You can't, you know, deal with the overhead of, okay, I've had a spike in traffic, I just need to grab a few extra servers, or I've had a spike in traffic for a really long time, and I now need to get an enormous cluster to get through this backlog in a reasonable amount of time. Those kinds of problems we're hoping will just go away with the migration to a full streaming architecture. It's definitely, though, on the flip side, more expensive to run at the lower scale.
But I guess we're not really building for kind of a low scale mode. It is a big data architecture. These are big data pipelines, and, you know, streaming definitely has a lot of cost efficiencies and performance efficiencies at scale. And that's, I think, where
[00:45:51] Unknown:
where we wanna take Snowplow next. Are there any aspects of the Snowplow product or your management of the infrastructure and the service that you provide for it that we didn't discuss yet that you'd like to cover before we close out the show? I think that's probably
[00:46:05] Unknown:
that's probably everything.
[00:46:07] Unknown:
Alright. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I can't see a big gap actually in the tooling. You know, there's so much amazing tooling now for managing this stuff. It's just maybe
[00:46:32] Unknown:
maybe the biggest gap is probably some way to harmonize a lot of the tools that are available now, which a lot of cloud providers are starting to do. But there are still almost too many options in managing data, and it makes it hard to know which way you should go. Should you put everything on S3 in kind of Parquet? Should you put everything in Redshift? Should you put everything in Snowflake DB? Should you put everything kind of everywhere for different business use cases? I think there's probably just no silver bullet. There's nothing that solves all of these problems. And there might never be. But, yeah, there's just such a wealth of options that it's quite hard to pick any one option. I don't know if that's a good gap or if that's just, you know, a factor of too much choice. Yeah. It's definitely something that continues to be a problem, the paradox of choice, particularly as we add new platforms and new capabilities
[00:47:29] Unknown:
to the overall landscape of data management. But it's a lot better than it used to be. That's for sure. So it is, you know. Certainly. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing at Snowplow and how you're managing the infrastructure and automation around that for all of your different customers and the private SaaS nature of your business. It's definitely an interesting area that doesn't really get a lot of attention as far as how to manage the underlying infrastructure for these data products. So thank you for all of your time and effort on that front, and I hope you enjoy the rest of your day. Thanks, Tobias. Cheers. Bye.
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Interview with Joshua Beemster: Introduction and Role at Snowplow
Snowplow's System Architecture and Managed Service Model
Challenges of Managing Private SaaS Deployments
Cost Management and Resource Optimization
Client Customization and Monitoring
Component Customization and Integration
Handling Traffic Variability and Resource Pressure
Infrastructure Management Evolution: Nomad vs. Kubernetes
Lessons Learned in Managing Snowplow's Platform
Broad Applicability of Snowplow's Infrastructure Management
Future Plans for Snowplow's Product and Service Management
Closing Remarks and Final Thoughts