Summary
Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics
Interview
- Introductions
- How did you get involved in the area of data engineering and data management?
- What is Snowplow Analytics and what problem were you trying to solve when you started the company?
- What is unique about customer event data from an ingestion and processing perspective?
- Challenges with properly matching up data between sources
- Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?
- Cleanliness/accuracy
- What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly?
- Can you describe the overall architecture of the ingest pipeline that Snowplow provides?
- How has that architecture evolved from when you first started?
- What would you do differently if you were to start over today?
- Ensuring appropriate use of enrichment sources
- What have been some of the biggest challenges encountered while building and evolving Snowplow?
- What are some of the most interesting uses of your platform that you are aware of?
Keep In Touch
- Alex
- @alexcrdean on Twitter
- Snowplow
- @snowplowdata on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Snowplow
- Deloitte Consulting
- OpenX
- Hadoop
- AWS
- EMR (Elastic MapReduce)
- Business Intelligence
- Data Warehousing
- Google Analytics
- CRM (Customer Relationship Management)
- S3
- GDPR (General Data Protection Regulation)
- Kinesis
- Kafka
- Google Cloud Pub-Sub
- JSON-Schema
- Iglu
- IAB Bots And Spiders List
- Heap Analytics
- Redshift
- SnowflakeDB
- Snowplow Insights
- Google Cloud Platform
- Azure
- GitLab
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning life cycle.
Skafos maximizes interoperability with your existing tools and platforms and offers real-time insights and the ability to be up and running with cloud-based, production-scale infrastructure instantaneously. Request a demo today at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat. This is your host, Tobias Macey, and today I'm interviewing Alexander Dean about Snowplow Analytics. So, Alexander, could you start by introducing yourself? Yeah, sure, Tobias. My name's Alex Dean. I'm the cofounder
[00:01:34] Unknown:
and CEO of Snowplow Analytics Limited,
[00:01:38] Unknown:
based in London, UK. And do you remember how you first got involved in the area of data engineering and data management?
[00:01:44] Unknown:
Yeah, I've been involved in it on and off throughout my career. After university, I joined Deloitte Consulting in their business intelligence division, so I saw quite a lot of the more classical data warehousing techniques there. This was back in the early 2000s, so that was my introduction to this space, before data engineering was even a term or a career. A little bit later, I was working at OpenX, which is an open source ad technology company. That's where I met my Snowplow Analytics cofounder, Yali Sassoon.
At OpenX, we had quite early exposure to the whole wave around Hadoop. Again, this was around 2007, 2008. We were doing quite a lot of work on the West Coast and got exposure there to the Hadoop ecosystem, so that was really interesting; that was a step forward. And then before Yali and I set up Snowplow, we were doing quite a lot of consulting work for clients here in London around ecommerce and digital product businesses and things like that. There we got quite a lot of exposure to the web analytics scene, Google Analytics, tools like that, and also the growing, burgeoning AWS ecosystem and the data tools that were coming along there, like Elastic MapReduce. So, yes, throughout my career I've had exposure first to data warehousing and classic BI, and then moved into the data engineering and data management spaces with Hadoop and AWS and things like that.
[00:03:28] Unknown:
And so can you talk a bit about what it is that you've built with Snowplow Analytics and the problem that you were trying to solve when you first started the company and the project?
[00:03:39] Unknown:
Yes, sure. Snowplow Analytics is an open source project and a company. It's been an open source project from the start, and we built a company around it from the beginning. We started Snowplow about six years ago. As I mentioned, Yali and I had met at OpenX, and we'd got a lot of exposure to data engineering approaches there. We were then doing consulting here in London for digital product companies and retailers, and pretty much all of them were using Google Analytics. We were quite frustrated by the fact that we couldn't get at the underlying data.
We looked into Google Analytics; we could see dashboards, we could see reports, but we couldn't get at the actual underlying event stream, the clickstream data. And we wanted to get at that data because we wanted to do ad hoc analysis, bespoke investigations on the data, all the kind of stuff that could actually change those companies and give them a competitive advantage over other businesses. We just couldn't get at that data, and it was frustrating because those companies were able to give us really rich offline data. They were able to give us really rich data from their transactional systems, from their CRMs, things like that. And it was a paradox, because this new wave of awesome digital data, we couldn't actually get at. From our time at OpenX, we understood this idea of a clickstream data pipeline. We understood that companies on the West Coast were building these and hiring data engineers to put them into practice.
We knew that the way Amazon Web Services was blowing up at the time, the way it was getting loads of traction and introducing all these fundamentally data-engineering-as-a-service tools like Elastic MapReduce, and storage with S3, and things like that, was really powerful and a great way of building these pipelines without having to roll it all yourself using Hadoop and HDFS and all that sort of thing. And so what we thought was: what if we put out a pretty simple prototype on GitHub and just say, look, this isn't that difficult. You can take this, run it on AWS, and have your own event data pipeline. And it was quite unusual. I mean, it was cloud native, if you like, before that was even a trendy term.
We designed this all around AWS initially, and we put it out there, and we started to get some really interesting traction. It wasn't hobbyists picking it up; it was quite big enterprises in Europe, the US, and Australia who wanted to have a clickstream data pipeline like this. And that was really interesting to us. We thought, we're onto something here. There's clearly a need for this that isn't being met. There are some pretty big companies here who don't know how to build this stuff themselves and don't know how to build a data engineering team to do this. So that's how we got started. We kept building out functionality. Our initial focus, coming from that consulting work around Google Analytics, was web analytics, so we put in a lot of work to get functionality into the Snowplow prototype to get towards parity with the commercial solutions in the marketplace.
And it just really steamrolled from there. That's how we got started.
[00:07:16] Unknown:
Yeah. For anybody who has used Google Analytics, that's definitely one of the biggest pain points: you ship all of this information to them, and then they expose this weird query API that doesn't really fit any of the semantics that you're used to from things like SQL or any of the other data platforms, and you don't have any way to get at the raw data that you ship to them. So it's definitely a very asymmetric relationship that you form when you're using something like that. And I'm wondering what is unique about the customer data and web analytics space in terms of the ingestion and processing requirements, versus some of the other more traditional, transactional data sources that companies would already have been working with.
[00:08:08] Unknown:
That's a really good question. I think there are lots of interesting differences, and there are three broad ones. They relate to the volume of data, the variety, and also immutability, which is really important. To start with volume: what you're looking at when you start collecting customer event data is just an order of magnitude, sometimes two orders of magnitude, more data than you're used to seeing from your transactional systems. And that changes a lot of things. It really does force your architectures to be much more scalable and differently composed; you can't use the same kinds of technologies. It is a big difference, and it forces some really tough questions around how you're going to store the data and how you're going to manage that going forward. Are you going to keep aggregates? Are you going to keep a full archive? All that kind of stuff. So the volume aspect of it really shouldn't be underappreciated.
It's really important. I think the variety is really interesting as well. When you're collecting customer event stream data, you're potentially collecting it from all the different touch points you have with your customers. You're collecting it from your mobile apps, from your website, from your customer support centers. There are so many different places that you can be collecting it from. And the number of different interactions and behaviors that your customers have with you across their life cycle is so diverse that you're really looking at a lot of different kinds of data points: a lot of different types of entity that you track through this, and the things that your customers are experiencing with you. So variety, I think, is really big as well. You've got a real diversity challenge to meet there.
Immutability is interesting. When we're talking about customer events, we're talking about specific actions that your users or customers are taking. We think of those as almost like facts: things that happened in the past at a specific point in time, and those things are unarguable. They're immutable. They are the facts that you want to track through the system, and that's really important as well. And that's caused some really interesting challenges recently with the advent of GDPR and other data privacy initiatives, where suddenly we're in a world where immutability isn't really acceptable anymore when you're looking at things like the right to be forgotten. So, yeah, customer event data is really quite distinctive from those kinds of transactional forms.
[00:11:06] Unknown:
And given where that data is coming from, there's the requirement for being able to create a canonical record for a given user, even though you may not actually know ahead of time who a given user is, because there isn't necessarily any sort of login information like you would have with a transactional system. So I'm wondering what types of approaches you use for being able to dereference these users or these entities, to create a unified view in the analytics back end and build these reports of how the traffic is correlated.
[00:11:43] Unknown:
It's a really good question. We call that whole area identity stitching, and it's tough. The starting point that we always encourage is to collect as many discrete potential identifiers from your customers, or your anonymous users or pre-login users, as possible. And then identity stitching is both a science and an art. You want to bring those distinct event streams together. You want to use the different identifiers that you have and try to map them together, essentially build a join table, and then use that to reconcile. It's not perfect.
A lot of the identifiers will get reset or deleted, in the case of things like cookies. Many of the identifiers are super crude, and you need to be careful using them. So it's not perfect, but we find it's really important. As you say, most companies now have many different channels, and they have different concepts of the user or customer in those different channels. If you can do that join process, if you can do that stitching and bring those different event streams together, you end up with a much more interesting view of those users. So it's well worth doing, but it's an art as well as a science.
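To make the join-table idea concrete, here is a minimal sketch in Python of identity stitching across two channels. The identifier names, the mapping table, and the events are all hypothetical; this is not Snowplow's implementation, just an illustration of reconciling separate identifiers into one canonical user.

```python
from collections import defaultdict

# Hypothetical "join table" built from moments where two identifiers were
# observed together (e.g. a cookie ID seen on the session where a user logged in).
identity_map = [
    {"cookie_id": "ck-111", "account_id": "user-9"},
    {"cookie_id": "ck-222", "account_id": "user-9"},   # same person, new device
    {"cookie_id": "ck-333", "account_id": "user-12"},
]

# Events from different channels, each carrying whichever identifier it has.
events = [
    {"event": "page_view",   "cookie_id": "ck-111"},
    {"event": "add_to_cart", "cookie_id": "ck-222"},
    {"event": "purchase",    "account_id": "user-9"},
    {"event": "page_view",   "cookie_id": "ck-444"},   # no known mapping yet
]

cookie_to_account = {row["cookie_id"]: row["account_id"] for row in identity_map}

def canonical_user(event: dict) -> str:
    """Resolve an event to a canonical user where possible."""
    if "account_id" in event:
        return event["account_id"]
    # Fall back to the join table; unknown cookies stay as themselves.
    return cookie_to_account.get(event.get("cookie_id"), event.get("cookie_id", "unknown"))

timeline = defaultdict(list)
for e in events:
    timeline[canonical_user(e)].append(e["event"])

print(dict(timeline))
# {'user-9': ['page_view', 'add_to_cart', 'purchase'], 'ck-444': ['page_view']}
```

As Alex notes, real identifiers are far messier than this: cookies get deleted, devices are shared, and the mapping itself has to be rebuilt as new login events arrive.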
[00:13:05] Unknown:
And as you mentioned, one of the other challenges with event stream data and web analytics is the question of volume, and with that also comes the potential for inconsistency in the data, particularly since it's generally coming from a website, and there might be arbitrary JavaScript loaded on the page that interferes with your tracking snippet, and the possibility of people willfully submitting incorrect information or just spamming your pipeline to try to create some sort of a DDoS attack. So I'm curious what the collection portion of the Snowplow stack looks like and how you've designed it to be able to validate the correctness of the data and try to ensure the cleanliness of the resulting analytics when it reaches the back end.
[00:13:56] Unknown:
That's a really important area. The way that Snowplow is architected, we start with a set of trackers, a set of tracking SDKs. Those are the things that you're embedding in your website or in your mobile app or in your server-side systems, and those are emitting events. Those events are then landing on what we call a collector. The collector is a pretty dumb piece of technology: it just receives the events over HTTP and then writes them on into a queue of some form, some sort of message queue like Kinesis or Kafka or Google Cloud Pub/Sub. From that queue, we then have another process, which we call the enrichment process, and that process takes charge of validating the data and running the enrichments on each event. Coming out of that, it writes back out to another queue, again another Kinesis stream or Kafka topic or Google Cloud Pub/Sub queue, and now it's writing out the enriched event stream. Then we do various processing downstream of that, which is really interesting and important and which I'd love to drill into a bit later. But, yeah, fundamentally those are the most important steps from a data quality perspective. The collector is deliberately not enforcing anything around data quality; it's a pretty simple pass-through. It's there to be a kind of guardian or gatekeeper between the public Internet and the trackers on one side and, on the inside, the processing pipeline and the queues and streams. The enrichment process is the piece that's doing that validation and enrichment, and the validation is, like you say, really important. Quite early on in the Snowplow journey, we realized that it was really important for the event streams to be well structured.
And the way we got the event streams to be well structured was by using schema technology. We use JSON Schema, which is a way of modeling and representing JSON data. It's a way of saying this JSON data will have this structure, and if it doesn't have that structure, we can reject that data. So we elected to use JSON Schema to model the event data flowing through the system. The way we structure that is we ask our Snowplow users to associate the data that they're sending in with JSON Schemas: we get them to define the JSON Schema that models the event data that they're sending in. What we then do in the enrichment process is validate that the data they're sending in conforms to the JSON Schema that they say it will.
And that's a really powerful technique, because what it means is that we're essentially receiving self-describing customer events. We're receiving events that say, I'm going to fulfill this schema, and then we're checking them against the actual schema, which we store in a schema registry we call Iglu, and we check that they meet that schema. And if they don't, then what we're able to do is pass them off to a separate queue, a bad events queue, and we're able to store why they don't meet the schema, why they don't match the structure that we expect. We hit on that architecture for Snowplow quite early on. It was a reaction to the SaaS analytics platforms that we'd worked with, where what we found is that if the data we were sending in didn't conform to the structures that were expected, it would just be silently swallowed. It would just disappear into the black box. And we wanted to do something different with Snowplow. We wanted to say: you're sending in event stream data from unreliable clients like JavaScript trackers and mobile apps, and if it doesn't match the schema you expect it to, and there are lots of different reasons why it might not match, like you mentioned earlier, Tobias, then rather than just silently swallowing it, we're going to put it in a separate queue, and you're going to be able to look at that data, understand why it's failing validation, and then go back to the source and fix whatever that underlying problem is.
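To make the self-describing-event idea concrete, here is a minimal sketch in Python of the kind of check the enrichment step performs. The schema, event payloads, and helper function are illustrative assumptions rather than Snowplow's actual implementation, and it leans on the jsonschema library for the validation itself.

```python
import json
from jsonschema import Draft7Validator  # pip install jsonschema

# Hypothetical schema for an "add_to_basket" event, of the kind a Snowplow
# user would register in a schema registry such as Iglu. The "self" block
# mirrors the vendor/name/format/version idea of self-describing schemas.
ADD_TO_BASKET_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "self": {"vendor": "com.example", "name": "add_to_basket",
             "format": "jsonschema", "version": "1-0-0"},
    "type": "object",
    "properties": {
        "sku": {"type": "string"},
        "quantity": {"type": "integer", "minimum": 1},
        "unit_price": {"type": "number", "minimum": 0},
    },
    "required": ["sku", "quantity"],
    "additionalProperties": False,
}

validator = Draft7Validator(ADD_TO_BASKET_SCHEMA)

def validate_event(event: dict) -> list:
    """Return the validation errors for a self-describing event's data payload."""
    return [err.message for err in validator.iter_errors(event.get("data", {}))]

# Each event names the schema it claims to satisfy, then carries its data.
good = {"schema": "iglu:com.example/add_to_basket/jsonschema/1-0-0",
        "data": {"sku": "ABC-123", "quantity": 2, "unit_price": 9.99}}
bad = {"schema": "iglu:com.example/add_to_basket/jsonschema/1-0-0",
       "data": {"sku": "ABC-123", "quantity": "two"}}  # wrong type

for event in (good, bad):
    errors = validate_event(event)
    # In the pipeline Alex describes, failing events are routed to a bad
    # events queue together with the reasons, rather than silently dropped.
    print(json.dumps({"data": event["data"], "errors": errors}))
```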
[00:18:13] Unknown:
And that would also help reduce a large amount of the traffic that somebody might be submitting where they're just bulk-generating events, because they're not necessarily going to try to match a given schema; they just want to throw packets at you without really caring what kind of information is in them. So you could pretty easily filter those out as invalid because they don't conform to the specified schema and have all the attributes that you need. It raises the bar for a potential attacker right off the bat.
[00:18:43] Unknown:
Yeah, that's right. We regularly see, for example, scripts targeting collectors and trying to send in nonsense data, more to probe for vulnerabilities than actually to devalue or corrupt people's event streams. But we see that kind of thing, and you're right: this kind of validation stops that stuff from having an impact. We do a couple of other things as well to mitigate against bad sources of traffic. It's a really hard problem, because of course you're putting your tracking SDKs out there on the web and embedding them in mobile apps and so on. But we do quite a lot. We have a couple of different user agent parsers that can help you identify bot traffic, and we have an enrichment that uses the IAB bots and spiders list, so again, it helps you understand if there are different sources of artificial traffic hitting your site. So there's some stuff we can do to analyze the event streams and figure out if there's traffic in there that you don't want. But it is a tough problem. There's something interesting around potentially authenticating the events that we're thinking about, and we might well do an RFC, a kind of public request for comments from our community, around that. But, yeah, it's a really tough problem.
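As a rough illustration of user-agent-based bot flagging (not the IAB list itself, which is a licensed dataset, nor Snowplow's actual enrichment), a hypothetical enrichment step might look something like this:

```python
import re

# Hypothetical list of user-agent fragments; the real IAB Bots & Spiders list
# is far larger and licensed, and dedicated user agent parsers do a much more
# thorough job than this sketch.
BOT_PATTERN = re.compile(r"(bot|crawler|spider|curl|python-requests|headless)", re.IGNORECASE)

def flag_bot_traffic(event: dict) -> dict:
    """Attach a derived flag marking whether the event looks automated."""
    useragent = event.get("useragent", "")
    event["suspected_bot"] = bool(BOT_PATTERN.search(useragent))
    return event

# Downstream models can then filter on the flag instead of discarding the raw
# events outright, so the classification stays reversible.
print(flag_bot_traffic({"event_id": "e1", "useragent": "Mozilla/5.0 (compatible; Googlebot/2.1)"}))
print(flag_bot_traffic({"event_id": "e2", "useragent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)"}))
```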
[00:20:08] Unknown:
And on the tracker side as well, I know that one of the issues that comes up for people implementing web analytics is that there might be some sort of event they want to be able to track, and they don't realize it when they initially set things up. I know that's one of the problems that Heap has set out to solve, where they just collect every event by default. So I'm curious if there are any default events that you automatically track when somebody sets up the snippet, or what your approach is for leading people in the right direction of tracking the types of events that would be meaningful for their business.
[00:20:45] Unknown:
That's a really important area. We've always done quite a lot of that with the JavaScript tracker. We've made it really easy to capture page views, and really easy to capture what we call page pings, which is someone staying on the web page for a period of time. We've got automated link tracking. We pull quite a lot of context out of the page, so we can grab things like your Optimizely data or your Google Analytics cookies. So we've always done quite a lot on JavaScript.
On mobile, on iOS and Android, tracking has historically been a bit more manual, but we're doing quite a lot of work there at the moment to capture a lot more around the overall application life cycle. Because, as you say, there are quite a lot of standard behaviors, and it's really nice to be able to just put a simple tracking snippet into your mobile app and get quite a lot of that. So auto-tracking is an important area. I think that quite quickly, especially in a web app or mobile app world, you end up in a place where your application probably has some quite distinctive features, maybe some quite distinctive entities that you're representing on screen, or some quite complex data models.
And quite quickly the auto-tracking will break down around that, because actually you just want those entities well tracked and well modeled in your event pipeline. But, yeah, auto-tracking gets you a long way before you start running into those issues.
[00:22:36] Unknown:
And for somebody who is running the Snowplow stack on their own, I'm curious what types of metrics and monitoring they should be keeping an eye on to ensure that the data is able to be collected reliably and that they aren't getting a bunch of poorly formatted data, so that they can maintain the health of their overall analytics process.
[00:22:54] Unknown:
Yeah. I think there are two areas here. There's a kind of DevOps layer, more of a systems layer, which is around the pipeline health, the health of the actual services that you're running. And then there's a layer above that, which is around data quality. You need both, because if you have the pipeline running smoothly but the data quality is not there, then you have problems, and vice versa. On the systems layer, the pipeline infrastructure observation, there are quite a few different things, but a lot of it comes down to latency. You're really interested in the collectors behaving well.
When clients are online and sending data, you want your collectors to be up and available and pulling that in. You want to make sure that the end-to-end latency is pretty good as well. Increasingly, Snowplow pipelines are being used for quite operational use cases, maybe fraud detection or in-app recommendations, things like that, so you want to make sure that you're feeding those algorithms in a really timely fashion. So you're interested in the end-to-end latency from the events landing at the collector to those events being enriched, and then even after enrichment into the real-time apps that you're writing that are doing the processing.
So latency is really important. You need to make sure your queues or streams are really healthy. With a Snowplow real-time pipeline, you've essentially got a bunch of microservices, and they're reading and writing from queues or streams, so it's really important that those queues or streams are available and scaled to the right size, which can be a problem sometimes; they're key to your plumbing. And then you're always looking for systemic failures. Is there some kind of problem with the writing to Redshift, for example, when we're syncing that event stream out to the Redshift database? Is there a problem with a specific data source for your enrichment, some sort of systemic problem in your enrichment process? You always need to make sure that all those data sources you're joining are available as well. That's the systemic side. On the data quality side, again, it's really important. You could have a really low latency pipeline that's working really well and not having any failures, but perhaps you botched the release of your latest iOS app and the tracking isn't quite right in there, and the data being sent in isn't matching the schema, and then suddenly a significant proportion of your data is failing validation. So that's something you need to monitor. You want to be monitoring the volume of events being successfully enriched against the volume of events that are failing validation, and you want to look into those failures and see if there are specific patterns, like maybe a certain event is failing more because that event is the one that got broken by the app rollout.
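A minimal sketch of the two layers of checks described above, assuming hypothetical counters and timestamps sampled from the pipeline's good and bad streams; the metric names and thresholds are illustrative, not Snowplow's.

```python
from datetime import datetime, timedelta, timezone

def pipeline_alerts(collector_ts: datetime, enriched_ts: datetime,
                    good_count: int, bad_count: int,
                    max_latency: timedelta = timedelta(minutes=5),
                    max_bad_ratio: float = 0.02) -> list:
    """Return alerts covering both the systems layer (end-to-end latency)
    and the data quality layer (share of events failing validation)."""
    alerts = []

    # Systems layer: how long did a recent event take to travel from the
    # collector to the enriched stream?
    latency = enriched_ts - collector_ts
    if latency > max_latency:
        alerts.append(f"end-to-end latency {latency} exceeds {max_latency}")

    # Data quality layer: what share of events is landing in the bad queue?
    total = good_count + bad_count
    bad_ratio = bad_count / total if total else 0.0
    if bad_ratio > max_bad_ratio:
        alerts.append(f"{bad_ratio:.1%} of events failing validation "
                      f"(threshold {max_bad_ratio:.1%})")
    return alerts

# Example: a latency spike plus a burst of bad events, as might follow a
# botched app release that breaks the tracking schema.
now = datetime.now(timezone.utc)
print(pipeline_alerts(now - timedelta(minutes=12), now, good_count=9_200, bad_count=800))
```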
[00:26:08] Unknown:
And we've discussed a number of the different components of the overall Snowplow stack piecemeal as we've gone through the different layers. But can you give an overall view of the life cycle of a given event as it flows through the architecture, from when it first gets registered all the way through to when somebody is running an analysis against it?
[00:26:24] Unknown:
Yeah, sure. The events will start in the trackers. The trackers will try to send the events to a collector; if there are any issues, they'll be held in the tracker, and it'll try again. Once they land on the collector, they will be written out to the stream or message queue, and then they'll be read from that and consumed by the enrichment process. It's that enrichment process that validates the events against their schemas and then adds the extra data from the enrichments, making each event much more meaningful. Once that microservice has done its work, it will write out again to another queue. And then, after that, a few different things can happen. One of the nice things about these stream-oriented architectures is that you can have lots of different consumers on a stream. So when the data is written out to that Kinesis stream, when the enriched events are sitting in that Kinesis stream, we can put multiple different consumers on it to do different things. One of the most common ones we put on there is a sink that writes the events to Elasticsearch. A lot of Snowplow users are keeping up to seven days of their event stream in Elasticsearch, and that makes it quite easy for them to query that data, run counts, look at patterns in the data, or even build dashboards on top of it. But equally, we are also storing that data out to S3, so we have another consumer that's writing the data to S3. Downstream from that, we have a process that can prepare that data for Redshift and write it out to Redshift, or indeed to Snowflake DB. And then another really common thing to do with that event stream is to build, as I mentioned earlier, real-time apps that sit on the event stream and process it in some way to do something important in real time, maybe some real-time decisioning or fraud detection or content or product recommendations.
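As a rough sketch of what putting another consumer on the enriched stream can look like on AWS, here is a minimal boto3 loop that tails one shard of a Kinesis stream. The stream name is hypothetical, and a production consumer would use something like the Kinesis Client Library with checkpointing across all shards rather than this single-shard loop.

```python
import time
import boto3  # assumes AWS credentials are already configured in the environment

kinesis = boto3.client("kinesis", region_name="eu-west-1")
STREAM = "enriched-good"  # hypothetical name for the enriched event stream

# Read from the first shard, starting at the oldest available record.
shards = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shards[0]["ShardId"], ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        # Enriched Snowplow events are tab-separated lines; here we only peek
        # at the first few fields before handing off to dashboards, fraud
        # checks, recommenders, or whatever real-time app sits on the stream.
        fields = record["Data"].decode("utf-8").split("\t")
        print(fields[:3])
    iterator = batch.get("NextShardIterator")
    time.sleep(1)  # back off when the shard is idle
```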
So there's quite a lot of different stuff that can be done. In terms of what happens next, if we take the example of the data being written to Redshift, that's not the end of the journey. A lot of Snowplow users and customers will do further modeling, event data modeling, on that data. The data sitting in Redshift is extremely rich and extremely granular, and it's very common and useful to take that data and roll it up into more meaningful aggregates for the business. A really good example of this is around video consumption online. If you think about it from the tracking perspective, someone's watching a video and we're sending ping information every ten seconds saying this person is still watching this video. That flows through into Redshift as very, very granular information, whereas the analyst is much more interested in the video view itself: forget the ping information, what was the actual length of time that this person spent watching this video? There's a lot of that work, which we call data modeling, that we do on the granular data.
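To make that video example concrete, here is a minimal rollup sketch assuming a hypothetical set of ten-second ping events already loaded into the warehouse; the field names are illustrative. In practice this modeling step would usually be expressed in SQL over the atomic events table, but the shape of the aggregation is the same.

```python
from collections import defaultdict

# Hypothetical granular ping events, one per ten seconds of watching, of the
# kind that would land in Redshift before any data modeling.
pings = [
    {"user_id": "u1", "video_id": "v42", "position_s": 0},
    {"user_id": "u1", "video_id": "v42", "position_s": 10},
    {"user_id": "u1", "video_id": "v42", "position_s": 20},
    {"user_id": "u2", "video_id": "v42", "position_s": 0},
]

# Roll the pings up into one row per (user, video) with total watch time,
# which is the aggregate an analyst actually wants to query.
watch_seconds = defaultdict(int)
for ping in pings:
    watch_seconds[(ping["user_id"], ping["video_id"])] += 10  # each ping covers ~10s

for (user, video), seconds in sorted(watch_seconds.items()):
    print(f"user={user} video={video} watched_seconds={seconds}")
# user=u1 video=v42 watched_seconds=30
# user=u2 video=v42 watched_seconds=10
```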
[00:29:17] Unknown:
And you can really think of the real-time processing that I mentioned earlier as just data modeling in stream: the enriched event stream is coming in, and we're doing processing on it to make more meaningful outputs or decisions. As you were discussing all the different points that the data flows through, I'm thinking about the different ways that you can hook into them and, as you said, add your own processing, whether between Kinesis and the enrichment process, or as part of the enrichment stream, or once it hits Redshift, which isn't accessible to you when you're using a fully managed analytics platform such as Google Analytics, because you don't get exposure to any of those pieces of the system.
And for somebody who's running the full Snowplow architecture on their own, they could very easily run their own processes to hook into those. But for somebody who's using your managed instance, do they also have access to those various integration points?
[00:30:11] Unknown:
Yes, they do. When we're running Snowplow Insights for a customer, we're actually deploying Snowplow into their own Amazon account. The whole of the Snowplow pipeline is running in their own Amazon account, and soon in their own Google Cloud account. And so that means that they have a lot of that flexibility, and we can work with them to come up with the kinds of custom applications that they want to deploy into AWS and wire into the enriched events. We don't find much use for interacting with the raw events, because fundamentally the raw events are just a format that's taken and processed further by our validation and enrichment process. But once those events have been enriched, yes, exactly: all our customers can use that stream and build really interesting apps on top of it.
[00:30:56] Unknown:
So how has that overall architecture evolved from when you first started? And do you think that if you were to begin again today you would make some of those same choices, or are there aspects that you would do differently?
[00:31:02] Unknown:
That's a great question. The biggest change since we started six years ago has really been the shift from batch-based architectures to real-time architectures. When we started Snowplow, really the only real-time component was the event collection: the collectors were running all the time, bringing the events in and then storing them in S3. Downstream of that was quite a classic ETL-style architecture. We would spin up a Hadoop job on Elastic MapReduce, take the raw events, validate them, enrich them, and write them back to S3, and we had another batch-based process that would do the load into Redshift. So it was quite a classic, almost data-warehousing-style approach. And the really big change in the last few years has been the move from those kinds of approaches to real-time approaches. A lot of that was driven by the people working on Kafka and that ecosystem.
Of course, Amazon launched Kinesis, and that was a really powerful kind of Kafka equivalent running on AWS, so that was something we integrated with really early on. But the real-time approaches have changed quite a lot. You think about your event pipeline today, and it's a living, breathing system that's always available, and you're trying to get data through it as fast as possible. It's very different from that slightly old-school batch approach. There are still some interesting merits to the batch approach around bulk processing, reprocessing, things like that, but fundamentally we've moved to a real-time world. On the second part of your question, around what we would do differently: that question actually comes up quite a lot, because when we started we were on AWS exclusively.
We did a prototype around Kafka support a couple of years ago, and we've been working pretty heavily this year on Google Cloud Platform support. Thinking about what Snowplow should look like on Google Cloud Platform has been a really good opportunity to revisit the assumptions of Snowplow and how we designed it originally. I think one of the most interesting divergences we've made with GCP is that we've gone real time from the start. We haven't focused on, or even really scoped out, a batch version of the Snowplow pipeline on GCP.
We've gone straight to using Google Cloud Pub/Sub and Cloud Dataflow, building Snowplow on GCP as a set of asynchronous microservices that run in the cloud.
[00:33:49] Unknown:
And that brings me to the business aspects of Snowplow: how are you supporting the continued development and growth of the analytics pipeline, and in what ways are you helping customers achieve a better view of their customers through these event streams and web and mobile analytics?
[00:34:17] Unknown:
So, as I mentioned a little bit earlier, Snowplow started as just an open source project, really a prototype, but we always had aspirations to develop it into a business and a company. And we realized fairly early on that to build Snowplow out and meet all the requirements that different constituents had for Snowplow, we would need to build a team around it. I think the interesting thing for Snowplow was that we knew a lot of the value of Snowplow was in the open access, the transparency, the fact that you can see the data flowing through the system. And so we weren't particularly interested in building a kind of hosted Snowplow. To us, that was more of the same; that was more like Google Analytics, another SaaS analytics platform. So we came up with the idea of a managed service. Essentially, with the managed service, which we launched about four years ago, Snowplow users or prospective users would come to us, and we would spin up and set up Snowplow running in their own AWS sub-account. They'd give us access to that sub-account, we would deploy Snowplow into it, and we would manage it, maintain it, monitor it, and upgrade it; Snowplow goes through a lot of new releases. And that was really the core of the managed service. Another way to think of the managed service: to use Snowplow open source, you really needed two things. You needed a sysadmin, and you needed a data scientist, or perhaps a data engineer, someone who could really work with the enriched event data. With the managed service, we were saying you don't need to have your own sysadmin; your internal systems team can go away and work on other stuff. You can pay us and we'll run Snowplow for you in your own AWS sub-account. So you would still pay your own AWS bill and then pay us for the managed service. And our pricing model was quite interesting: we were charging a flat fee to deliver the managed service. We weren't saying that if your event volumes went up a bit, we were going to charge you a bit more money. We'd say, no, you'll pay more money to Amazon, but our fees will be flat within certain parameters.
That managed service was really the engine of our growth. It allowed us to hire a team, bring in a support team, build out towards 24/7 support for our customers, hire data engineers to work on the platform and keep adding features, and all the other necessary facets of a business. That was a big step in our evolution from open source project to company. Something interesting happened a couple of years ago, which is that we realized we were doing much more than just providing a managed service. We realized that we'd actually built a lot of software to manage and monitor Snowplow, and we had aspirations to build an actual management and monitoring UI for our customers, to give them a kind of guided self-service experience around the Snowplow pipeline. At that point, we started moving beyond the managed service into what we call Snowplow Insights. Snowplow Insights is really our commercial product that sits on top of Snowplow and helps you through the whole Snowplow experience, whether you're a data engineer, a head of data, a data scientist, or whatever. Under the hood, we're still deploying Snowplow into your own AWS sub-account and running it for you, and that's really important, but you're getting more than just a managed process. Yeah, I imagine that fixed-cost overhead for running this full data pipeline was very attractive and helped encourage a lot of people to come on board, versus the tiered pricing models that a number of other providers have, where the more data you have, the more you pay, and it's not necessarily a linear scale because of the inherent complexities of larger volumes of data. Yes, I think that's exactly right. We found those analytics SaaS models are quite challenging for a lot of companies, because what happens is that as you add more and more tracking into your various applications and sites, your event volumes go up, and potentially you're going to go from a lower tier to a much higher pricing tier. And the challenge is that the value you're deriving from those analytics isn't actually increasing. It has a certain value to your business, but the fact that you're suddenly doing five times the event volume doesn't mean you're getting five times the value. So those SaaS pricing models have often been quite challenging. And, of course, there's always a risk with those SaaS pricing models where you've got that tracking embedded in microsites, mobile apps, all that kind of stuff; there's quite a lot of lock-in there. So it can be really expensive finding out a couple of years down the line that the pricing for your event volumes is going to be pretty high. And a lot of those companies subsidize at the lower levels and charge more at the enterprise levels.
And, of course, there's a there's a there's always a risk with those SAS pricing models where, you know, you've got you've got that tracking embedded in, you know, microsites, mobile apps, all that kind of stuff. It's there's quite a lot of lock in there. So it can be really expensive finding out a couple of years down the line that actually your the pricing for for your event volumes is gonna be pretty high. And, you know, a lot of those companies, they do subsidize at the lower levels and and charge more at the enterprise levels. Yeah. The lock in piece is interesting too because
[00:39:04] Unknown:
Yeah, the lock-in piece is interesting too, because given the fact that Snowplow has so many different components, there's a lot of potential for individual users to swap out some of those layers for their own preferred platforms that they might already be running for other systems. Even just the tracking snippet: if you move away from Google Analytics, you have to reimplement all of those snippets, whereas given that yours is open source, somebody can just modify the destination point and still leverage all of the same information they're collecting if they decide to build their own in-house platform or go with another provider. Yeah, there's just a lot more flexibility in the stack.
[00:39:49] Unknown:
And what have you found to be some of the biggest challenges or unexpected lessons learned in the process of building and growing both the platform and the business? That's a really good question. I'd say that one of the big challenges we've found is around the sheer size of the technical estate that we've had to build. What we found over the years is that a lot of the building blocks we've needed to build a pipeline like Snowplow don't exist in an open source way. So we've had to build a lot of those ourselves, from API clients for enrichments, to a lot of our schema technology, to good, robust ways of writing data into data warehouses, and things like that. It's been really interesting and surprising to us just how much we've had to build, and thus how much our team of data engineers has to maintain. I think that's been an interesting challenge. I also think building on top of public cloud, like we have over the years with AWS and are doing now with GCP, has been really interesting. The pace of change in those clouds is frantic.
New services are coming out, other services are being sunsetted, people are moving on, paradigms are shifting. That's been really interesting to observe, and we've had to make some tough calls over the years as to how quickly we'll adopt certain new technologies and whether we'll sunset other things that we've relied on. So I guess the point I'm making is that the fabric we're building on, the foundations we're building on, are ever shifting, and that's been a really interesting ride. Those have been really interesting challenges. We've also had some interesting debates over the years that have come and gone. We had a really interesting challenge a few years ago around why you should warehouse your event data at all: I'm happy with out-of-the-box SaaS tools, why do I need to build an archive of my customer behavioral data myself? That was a really interesting debate. I think we won it and encouraged an ecosystem of companies that help you store your event data in places like Redshift. We've had another interesting debate at Snowplow over the years, which is around companies thinking of building their own homebrew, in-house solutions versus Snowplow. Truth be told, those discussions and debates happen less and less often. I think a couple of things have changed. Snowplow does a lot more than it used to; we've had six years of building and adding enrichments and adding tracking SDKs, and, as I mentioned earlier, the tech estate is pretty huge. I think it's daunting for a lot of data engineering teams that are starting from scratch to think about reproducing a lot of that in house. I think the other thing that's changed a lot is the data privacy and data protection debate. Where before, collecting and warehousing a ton of event data was something that could be directed out of an internal IT team or a data engineering team, now the sensitivities around that run across the whole business. It's concerning the C-suite and bringing in data protection officers, all that kind of stuff. And so having a company out there like us who are thinking about that stuff, thinking about how this should be done properly, what's right for the data subjects, and how we provide the right tools for data protection officers, I think that's really valuable, and, again, that's quite a daunting thing for an internal data engineering team to bootstrap.
[00:43:25] Unknown:
And what are some of the directions for future improvements, future additions, or business growth that you're focusing on?
[00:43:28] Unknown:
That's a really good question. We're working on quite a few different initiatives at Snowplow, and I think that's inherent in the nature of the platform; it's so broad now that there are a lot of different things that different constituencies are looking for. One of the most important things we're doing is the expansion to other public clouds. We're working on GCP at the moment, and we're going to work on Azure after GCP. That's really important because we see a lot of adoption of GCP and Azure. We see customers and users that want to have their pipeline on Azure, not AWS, or they want to have it on GCP instead, or whatever it is. And if we can give Snowplow users the ability to lift and shift their pipeline from one cloud to another, that's really empowering for them. So the public cloud support is really important to us. Auto-tracking is really important: we're really interested in doing more around reducing the integration headache on mobile, so we're doing a lot there. And we're really interested in doing more around the data modeling piece. How do we get more meaningful aggregations, more meaningful bundles of roll-ups, out of the atomic event data? How do we roll the atomic data up into higher-level, more valuable data? So that's another area of focus.
[00:44:53] Unknown:
And are there any other areas of discussion that we didn't cover yet that you think we should go into before we start to close out the show? Maybe just the most interesting uses of the platform; I thought of a few fun ones. And what are some of the most interesting or unexpected uses of the Snowplow platform or some of its individual components that you have seen?
[00:45:14] Unknown:
There are quite a few. Because Snowplow is really a horizontal technology, you can use it to power any kind of event pipeline, so we see really diverse use cases, lots of different industries, and different departments inside companies using Snowplow. We have a customer that's building a triple-A game and embedding Snowplow in that, which is really exciting. GitLab, which is an open source competitor to GitHub, is looking to add Snowplow tracking into the GitLab product, so we're following that with interest. And we also use Snowplow quite a lot internally; we do quite a lot of dogfooding. We have various systems that emit streams of Snowplow events, and then we build our own analytics on top of that, so we're using Snowplow for our own operational use cases. So, yeah, a really diverse set of companies worldwide are using Snowplow in some pretty cool ways. Alright. For anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Oh, that is a really good question. I think, as an industry, our comfort levels, our awareness, and our tooling for working with structured data, with schemas, are still really immature. We'd love to do more on that at Snowplow. I think the pendulum is shifting back towards strong schemas for data and thinking about those schemas upfront, but that imposes a lot of burden on various different people in the company. So I think we need to do more around that, around how we communicate, work on, and collaborate on schemas together. On the other side of things, I think the analytics toolkits and the interactive studios and so on are awesome. As data engineers, we probably need more help on the schema side of things. Alright. Well, thank you very much for taking the time today to discuss the work you've been up to with Snowplow. It's definitely a very interesting platform
[00:47:28] Unknown:
and one that I've been considering using for some of my own use cases, so I'm looking forward to digging deeper into it. Thank you for that, and I hope you enjoy the rest of your day. Oh, fantastic. Thank you so much. I really enjoyed it.
Introduction and Guest Introduction
Building Snowplow Analytics
Challenges with Google Analytics and Customer Data
Snowplow's Data Collection and Validation Process
Monitoring and Ensuring Data Quality
Lifecycle of an Event in Snowplow
Evolution of Snowplow's Architecture
Business Model and Managed Services
Challenges and Lessons Learned
Future Directions and Improvements
Interesting Uses of Snowplow