Summary
Event-based data is a rich source of information for analytics, but only if the event structures are consistent. The team at Iteratively is building a platform to manage the end-to-end flow of collaboration around what events are needed, how to structure the attributes, and how they are captured. In this episode founders Patrick Thompson and Ondrej Hrebicek discuss the problems that they have experienced as a result of inconsistent event schemas, how the Iteratively platform integrates the definition, development, and delivery of event data, and the benefits of elevating the visibility of event data for improving the effectiveness of the resulting analytics. If you are struggling with inconsistent implementations of event data collection, lack of clarity on what attributes are needed, and uncertainty about how the data is being used, then this is definitely a conversation worth following.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Patrick Thompson and Ondrej Hrebicek about Iteratively, a platform for enforcing consistent schemas for your event data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Iteratively and your motivation for creating it?
- What are some of the ways that you have seen inconsistent message structures cause problems?
- What are some of the common anti-patterns that you have seen for managing the structure of event messages?
- What are the benefits that Iteratively provides for the different roles in an organization?
- Can you describe the workflow for a team using Iteratively?
- How is the Iteratively platform architected?
- How has the design changed or evolved since you first began working on it?
- What are the difficulties that you have faced in building integrations for the Iteratively workflow?
- How is schema evolution handled throughout the lifecycle of an event?
- What are the challenges that engineers face in building effective integration tests for their event schemas?
- What has been your biggest challenge in messaging for your platform and educating potential users of its benefits?
- What are some of the most interesting or unexpected ways that you have seen Iteratively used?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Iteratively?
- When is Iteratively the wrong choice?
- What do you have planned for the future of Iteratively?
Contact Info
- Patrick
- @Patrickt010 on Twitter
- Website
- Ondrej
- @ondrej421 on Twitter
- ondrej on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Iteratively
- Syncplicity
- Locally Optimistic
- DBT
- Snowplow Analytics
- JSON Schema
- Master Data Management
- SDLC == Software Development Life Cycle
- Amplitude
- Mixpanel
- Mode Analytics
- CRUD == Create, Read, Update, Delete
- Segment
- SchemaVer (JSON Schema versioning strategy)
- Great Expectations
- Confluence
- Notion
- Confluent Schema Registry
- Snowplow Iglu Schema Registry
- Pulsar Schema Registry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard earned expertise. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. If you've been exploring scalable, cost effective, and secure ways to collect and route data across your organization, RudderStack is the only solution that helps turn your own warehouse into a state of the art customer data platform.
Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open source foundation, fixed pricing, and unlimited volume, they are enterprise ready but accessible to everyone. Go to dataengineeringpodcast.com/rudder today to request a demo and get 1 free month of access to the hosted platform along with a free t-shirt. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.
For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today.
[00:02:14] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Patrick Thompson and Ondrej Hrebicek about Iteratively, a platform for enforcing consistent schemas for your event data. So, Patrick, can you start by introducing yourself? Yeah. Thanks so much for having us, Tobias. Definitely appreciate it. My name is Patrick. I'm a designer and product manager by trade. Ondrej and I have been working on Iteratively for about a year and a half. But previous to that, we worked together at a company Ondrej co-founded down in the Bay called Syncplicity. And this is our round 2 together. So really excited to work on another company and excited to be on the show. And Ondrej, how about you?
[00:02:45] Unknown:
Absolutely, Tobias. Pleased to be here. I'm Ondrej. I'm the engineer in the group. Like Patrick said, I previously co-founded a company called Syncplicity.
[00:02:52] Unknown:
And, before that, I was at Microsoft up here in Seattle, and it's good to be here. And, Patrick, going back to you, do you remember how you first got involved in the area of data management?
[00:03:01] Unknown:
Definitely. So when Ondrej and I actually came back together in early 2019, we decided that we really wanted to focus on solving problems for software teams. Those are the people that we loved. We were really interested in seeing how we could help them be more impactful in their roles. So we spent about 6 months actually just doing customer discovery, interviewing as many folks as we could, product managers, designers, engineers, analysts, growth people, really trying to understand the pain points that they had. And the number 1 thing that we actually learned throughout this process was that it was just very painful to manage data. And, quite often, it was messy, inconsistent, and unreliable, which hindered the team's ability to get their job done and ultimately hurt the growth of the business. So that was what we decided to actually focus on for Iteratively. And, Ondrej, do you remember how you first got involved in data management?
[00:03:48] Unknown:
Yeah. It's really funny. I actually was not in data management at all prior to co-founding Iteratively with Patrick. I'd done a lot of data work back at Syncplicity, but it was really just a hobby more than anything. But about 2 years ago when Patrick and I decided to venture out and co-found another company, data management turned out to be the biggest problem that we saw. So we quickly became experts in the space, and I think we're still working on that today, to be completely honest, trying to learn as much as we can and see how we can help the people working in this space as much as possible. And as newcomers to the area of data management specifically, what are some of the useful resources that you've been leaning on or the types of questions that you're asking customers to help you gain more context
[00:04:33] Unknown:
and deeper knowledge of the problem space that you're trying to solve for?
[00:04:38] Unknown:
I think there's a lot of good resources, actually, in this community. It's a really tight knit group. There's the Locally Optimistic Slack. There's the dbt group. What we have found is a lot of folks are super helpful for us. Obviously, we try to come into this with as much of a beginner's mindset as we can. And talking to folks who are actually practitioners and trying to learn as much as we can from them has been really helpful. I'd say, like, 1 of the things that has been super insightful for us is just understanding the different data maturity models within organizations and how that sort of matches to the pain points that folks experience. There's been a lot of good material produced in this space on that side. And then understanding kind of how each key stakeholder and persona within an organization deals with data, from your product manager to your analyst to your data scientist to your data engineer, and the problems and pain points that they actually have. It's been very insightful to at least have those conversations. I don't think we would have been anywhere we are today without, you know, at this point, chatting to hundreds of people.
Anything you'd add, Andre?
[00:05:40] Unknown:
No. I think that covers it, Patrick. Maybe 1 thing I would add is, as part of customer development, we've been following a pretty structured process for figuring out what the biggest pain points are that teams are struggling with, what really keeps them up at night. And that open ended script that we followed, and we followed it I think over 200, maybe 250 times, maybe even more, has been just a great source for figuring out what these people are thinking about, what's on their mind, and it's helped us kinda dig into the areas that actually matter the most. And so you've been working on Iteratively
[00:06:15] Unknown:
as a company and as a product. So can you give a bit more description of what it is that you're building there and some of the backstory of how you settled on this as a particular problem space that you wanted to solve for and build a business around?
[00:06:29] Unknown:
Definitely. To give you a little bit of backstory, we actually had all these interviews, and 1 of the things that we saw quite often is that teams were trying to really solve this problem internally, and they did it in a variety of ways. When we talked to data consumers, folks like your analysts and your data scientists, your PMs who are consuming the data, quite often they would try to add some structure or governance around the data that they're actually capturing. It was typically in the form of a spreadsheet or a Notion page or a Confluence page. They usually all had some sort of homegrown solution. Some companies took it 1 step further and had some sort of schema management solution in place, either something open source or off the shelf. And then the engineers would usually be left to sort of fend for themselves and try to convert this over into code. And that was really where a lot of the human error was introduced into this equation, where data integrity issues came up, lack of testability, and really analytics as an afterthought for most organizations. The way we saw that we could potentially help with that was to automate and reduce the human error by really trying to add some strong governance and data management in the form of schematizing analytics. This isn't anything new for probably most of the people listening to your podcast. But for us, primarily as software people, this is something that we thought was pretty insightful. And we had seen other companies like Snowplow Analytics doing this, and a few other tools. And we really wanted to make it easy for software teams to collaborate in a single source of truth.
You can kind of think of it as GitHub for your analytics: define what are the metrics that you care about, what are the events that you need to instrument, what are the properties or payloads for those events, the data types, all in 1 centralized place where it's easy to manage. It consolidates all of this into 1 place.
You have the review process and commenting all built in. And then on the second side, it's really making it easy for developers to actually instrument and collect data about your customers and the behaviors that they take within your products and software. And so we built a developer toolkit that interfaces with this web application. It makes it easy to do the instrumentation and automates a lot of the manual QA. And the way that we do that is through generating strongly typed SDKs to really help remove as much of the human error as possible and make it really quick and easy to instrument your products.
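To make the generated-SDK idea concrete, here is a rough sketch of what a strongly typed tracking function compiled from a tracking-plan event might look like. All names here (the event, its properties, and the function) are hypothetical illustrations, not Iteratively's actual API:

```python
from dataclasses import dataclass

# Hypothetical tracking-plan event: "User Signed In", with a required
# "method" property constrained to an enumerated set of values.
ALLOWED_METHODS = {"password", "sso", "oauth"}

@dataclass
class UserSignedIn:
    user_id: str
    method: str

    def __post_init__(self) -> None:
        # Runtime checks back up the static types for dynamic callers.
        if not isinstance(self.user_id, str) or not self.user_id:
            raise ValueError("user_id must be a non-empty string")
        if self.method not in ALLOWED_METHODS:
            raise ValueError(f"method must be one of {sorted(ALLOWED_METHODS)}")

def track_user_signed_in(user_id: str, method: str) -> dict:
    """Typed wrapper a codegen step might emit for this one event."""
    event = UserSignedIn(user_id=user_id, method=method)
    return {
        "event": "User Signed In",
        "properties": {"user_id": event.user_id, "method": event.method},
    }
```

The point of generating one such function per event is that a misspelled event name or a wrong property type becomes a compile-time or immediate runtime error for the developer, rather than dirty data discovered weeks later in the warehouse.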
[00:08:44] Unknown:
And you mentioned strong typing for the SDKs and being able to enforce the structure of these schemas. And I know that a number of the language runtimes that you're targeting are dynamic by nature. And Python, in particular, is 1 that I'm familiar with, which has recently added support for gradual typing and type hints. I'm curious what some of the challenges are that you faced in terms of being able to provide these type enforcements
[00:09:11] Unknown:
in the SDKs, in these dynamic runtimes. Yeah. It's really unfortunate to see languages that don't have great support for strong typing. It's a bummer to not be able to provide that value to our customers. We started with languages that do, though. I suppose JavaScript was 1 of the first languages as well, and there was really not much we could do there. It is what it is. And there are teams that deal with this lack of support in their own ways. Unfortunately, we couldn't provide this capability to them. But what we do do, and we always fall back on, is the JSON Schema validation that we perform under the covers at runtime. So even when we can't provide great typing support for the engineers using the SDK, you know, we can always at least make sure that when they're running their test cases or they're using the code in development, by hand, testing it manually, that we tell them as quickly as possible that the data that they're passing in doesn't match the expected schema, that there's something wrong. And do you also do runtime type validation so that if you have events in production that match the
[00:10:18] Unknown:
schema shape, but not necessarily the data types, do you put those into sort of a dead letter queue for manual resolution?
[00:10:25] Unknown:
We do validate everything, and we have the power of the JSON Schema validator under our feet that helps us not just validate the shape, but also the data types themselves. The handling of the data that doesn't pass muster is really up to the customer. We can throw an exception to let the developer know right then and there that something is wrong. That's the default behavior in the development and QA environments, for example. In production, we log an error or let you kinda define your own error handling, if you'd like to log that error somewhere else or,
[00:10:57] Unknown:
notify somebody. That's all configurable and possible in the SDK. And as far as the challenges that you've seen for people who are capturing event data or customer events, what are some of the ways that you've seen inconsistency
[00:11:17] Unknown:
in the message structure or the data types or the overall schema? What are some of the problems that you've seen caused as a result of that? Yeah. I think the clearest example is that the more software teams you have actually shipping analytics, the more this issue compounds. So for example, if you have an iOS team capturing data, they might be capturing an event called 'user signed in'. If you have an Android team capturing it, it might be capturing an event called 'user logged in'. And you as the data consumer would love to, you know, try to manage this data, get it into a format that's actually consumable by the business. And, typically, you know, there's no real good QA process for this. So depending on what tools you're using, you're really just left to deal with the mess.
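The 'user signed in' versus 'user logged in' situation described here is exactly the kind of cleanup an analyst ends up scripting after the fact. A minimal hypothetical sketch of that reconciliation (the alias table and canonical name are made up for illustration):

```python
# Each platform team picked its own name for the same user action, so
# downstream queries must reconcile them onto one canonical name.
CANONICAL_NAMES = {
    "user signed in": "user_signed_in",   # iOS team's name
    "user logged in": "user_signed_in",   # Android team's name
    "UserLogin": "user_signed_in",        # web team's name
}

def normalize(raw_events: list) -> list:
    """Rewrite inconsistent event names; unknown names pass through."""
    return [
        {**event, "event": CANONICAL_NAMES.get(event["event"], event["event"])}
        for event in raw_events
    ]
```

Agreeing on a single schema before instrumentation removes the need for this mapping entirely, which is the shift-left argument made later in the conversation.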
[00:11:57] Unknown:
Yeah. I think maybe the only thing I would add here is it's kind of funny. The most common cause of problems that we see is this type of inconsistent schema, inconsistent structure. And, oftentimes, it's just dumb, silly human error: people fat fingering event names, property names, misspelling things, following different naming conventions, some developers using snake_case and other developers using kebab-case. It's really all over the place. But I think that this is what surprised us the most: the simplest problems are the ones that cause the greatest issues down the road. And for these types of problems where you have inconsistencies
[00:12:38] Unknown:
in the formatting of the names or the structure of the schema, a lot of that has led to the overall practice of master data management, where you clean it up after the fact of collection, where you have all this information and then you need to reconcile it into a common format for then being able to do your downstream analytics. I'm curious what you have seen in terms of your customers who are adopting your SDKs and your platform for being able to enforce these structures at collection time, if they have still been leaning on master data management as a necessary component of their data infrastructure?
[00:13:12] Unknown:
Yeah. Good question. I think in regards to the customers who are using Iteratively today, what we're trying to do is minimize the amount of work that they have to do downstream to get value from their data. So in the context of some of our customers using tools like dbt, there are still very valid use cases for wanting to transform much of the data into a consistent format. What we want to do is try to make it as easy as possible to shift the burden left into the SDLC so that it's thought of ahead of time. As far as data management goes, I think there's a couple of different schools of philosophy. 1 is, you know, capture as much as you can and make use of it downstream. But what we see from the teams who are actually benefiting the most from collecting customer data is actually having a well defined process to define it, think about the structure, think about the data strategy behind what they're actually collecting, ahead of time before it makes its way into your data warehouse and you have to,
[00:14:09] Unknown:
try to derive some value from it. And 1 of the more exciting things here is eliminating that burden as much as possible. I think a lot of the work that data engineers and data engineering teams do after the fact to clean up the data is completely avoidable if you just spend a little bit more time upfront to define the structure and the schema of the data thoughtfully, and then implement it using tooling that makes sure that the problems and the mistakes aren't made in the first place. It really reduces the cleanup and the munging and the transformation, the kind of non-productive transformations that teams often have to lean on today to get the data into a clean state, by not producing and generating dirty data to begin with. And in terms of the
[00:14:54] Unknown:
customer usage of Iteratively and some of the existing practices that you've seen for people who are leaning on Iteratively and starting to adopt it, what are some of the anti-patterns that you've seen as far as managing the structure of event messages that led to the initial problems that put them into a place where they could benefit significantly from the work that you're doing?
[00:15:18] Unknown:
Some of the anti-patterns we see generally when it comes to data collection is not making use of properties or attributes in context. So a lot of teams will just capture the event data, but they don't capture enough information to be able to slice and dice and segment their customer base effectively to really answer harder business questions and drive more value from their data. The other thing that we see quite often is inconsistent capture. So you're capturing from just 1 property, so you're capturing just from Android or just from iOS, but not across your customer base. And the way that those products are designed potentially impacts the way that those users are actually using those products. 1 of the other things that we see quite often as an anti-pattern is trying to capture too many things at once.
So a lot of folks think of analytics as, like, we must capture all of the data to understand or make use of it. What we really try to recommend to people is take an iterative approach to it and think about it as an evolution. So try to capture your core business metrics first, make sure those are rock solid, and then slowly over time add more instrumentation to your product as needed and as use cases come up. But don't try to jump whole hog into instrumenting every single interaction in your product or trying to auto capture all of the data that your customers are generating. You're just increasing the size of the haystack, but the needle sort of remains the same size.
[00:16:48] Unknown:
Those are pretty much the ones that I see. Ondrej, any ones that you can think of on the engineering side? The over collection is definitely the biggest 1. I think folks that haven't thought through the data strategy for their team just fall back on capturing everything they can possibly think of. And it turns out that when that happens, and it happens quite often, you know, 80, 90% of the data that folks collect just never ends up being looked at. And it causes headaches for teams to manage and think about all this extra data that they're not actually using. And it shifts the focus from the data that actually should matter for the company that helps them,
[00:17:23] Unknown:
understand the metrics that drive the business. Yeah. The 1 thing I'd add that we see quite often is the garbage in, garbage out fallacy within teams. And what tends to happen is, you know, 1 tiny paper cut from a data quality or data integrity issue will arise. And slowly over time, teams just lose trust in their data, which causes them to use this data less, which causes them to invest less in fixing the problem. And, ultimately, it's kind of a death spiral that occurs within organizations, which really hinders them from becoming data driven and really unlocking the value from this data.
[00:18:00] Unknown:
And with all of that, it's not just something that needs to be handled by engineering teams. What are all of the roles that fit into the scope of collecting and defining the types of events that you're trying to collect and the formats and the attributes that you want to use? And how does the usage of Iteratively help with turning that into a full cycle and a full feedback loop that involves more than just an engineering team or the data engineers in a company?
[00:18:39] Unknown:
Yeah. Good question. I think when we think about, you know, data collection and instrumentation, it's definitely very much a collaborative approach. So it's not just something that the engineering team typically does within an organization. It typically involves both your product management team, your analysts, your data scientists, and your engineers. And often QA will actually validate that the data is captured, if you are actually QAing your analytics. So 1 of the nice things about Iteratively is we bring all those people into the same tool to really make sure that folks are reviewing changes to the schema, approving those changes. They understand why the data is being captured.
I think the reality of the situation is that the more descriptive folks are with their analytics, the fewer issues they have. So having the ability for folks to describe the events that they're capturing in the form of rich text descriptions, being able to embed that inside of the IDE as code docs for the developers,
[00:19:38] Unknown:
Making sure everyone is on the same page about what is actually being captured has been super valuable for our early access customers. Ondrej, anything I'm missing? That was good. I think 1 of the unexpected stakeholders that has come up in a lot of the conversations, especially with larger companies that we talk to, is the security and the legal and the compliance group within those companies. They didn't actually come up in our customer development interviews in the beginning. But as we dove a little bit deeper with those teams and got broader adoption within the organization, that team raised their hand pretty quickly as 1 of the reviewers and approvers of schemas for the team's analytics. And it's really those teams wanting to make sure that the organization as a whole is capturing data that complies with the company's best practices for privacy and security, and that the individual PMs and engineers and analysts on the team are in compliance with those best practices and the regulations that those companies have set for themselves. So they've been a core stakeholder, actually, in the adoption of Iteratively at especially larger companies.
[00:20:45] Unknown:
And for people who are adopting Iteratively and using it for handling their event structures and event data collection, can you talk through the overall workflow of actually doing analytics, the different roles and touch points for the different people involved, and how that helps in terms of the overall collaboration?
[00:21:08] Unknown:
Yeah. Definitely. So, typically, we're brought into the organization by the head of data or a product analyst that's working with some sort of event data today. And typically, it's not good quality. There's a lot of issues with it. That's how they usually discover Iteratively and bring us into the fold. From that point, we typically import some of their existing event data. So what is the data they're actually capturing today? And then we work with their product and engineering teams to try to add new instrumentation to their existing product. So moving forward, there's a couple different ways that teams do it. They either try to govern their existing instrumentation through Iteratively, they migrate their existing instrumentation to Iteratively, or they start from scratch, depending on how bad of a state they're in to begin with. And at the point where they've made those decisions, there's a lot of work required actually to go in and define the schema. I think the majority of the work isn't necessarily on engineering. It's typically on the analyst and product management team to actually go and define what needs to be captured and what the structure of that data looks like. But once they've done that, on the engineering side, typically, they slot it into a sprint.
They pull down those SDKs. They do the instrumentation. They integrate Iteratively into CI/CD, and they're pretty much set up for success moving forward as their analytics schema evolves. So there's usually this 1 time cost for the onboarding, which is relatively minimal on the engineering side. And then moving forward, as you add new features, as you rethink your existing instrumentation, you have everything in place to iterate. And then I'd say the last thing is on the consumption side. Because we don't actually process the data today in production, the data goes directly into whatever destination you want to send that data to, be it third party tools like Amplitude or Mixpanel or Snowplow Analytics. Or if you actually have your own data pipeline that you manage, we'll send that data directly there once it's been validated.
And in your consumption of that data, whether it's in Amplitude or Mode, you have high quality analytics that you can trust.
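The validate-then-forward flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the toy schema table, the environment-dependent error handling, and the function names are all hypothetical, not the SDK's actual behavior:

```python
# Minimal validate-then-route sketch: events that pass a schema check are
# forwarded to the configured destinations; failures raise an exception in
# development but are only logged in production.
SCHEMA = {"user_id": str, "method": str}  # toy stand-in for a JSON Schema

def validate(event: dict) -> list:
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    props = event.get("properties", {})
    for field, expected in SCHEMA.items():
        if field not in props:
            errors.append(f"missing property: {field}")
        elif not isinstance(props[field], expected):
            errors.append(f"{field} must be {expected.__name__}")
    return errors

def route(event: dict, destinations, env: str = "development") -> bool:
    """Validate an event, then fan it out to each destination callable."""
    errors = validate(event)
    if errors:
        if env == "development":
            raise ValueError("; ".join(errors))
        print(f"analytics validation error: {errors}")  # log-only in prod
        return False
    for send in destinations:
        send(event)  # e.g. Amplitude, Mixpanel, or your own pipeline
    return True
```

The design choice mirrored here is the one Ondrej described earlier: fail loudly where a developer will see it immediately, and degrade to logging where an exception would hurt end users.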
[00:23:10] Unknown:
And in terms of the actual Iteratively platform, can you describe the overall architecture and how it's implemented, and some of the ways that the design has changed or evolved since you first began working on it and started bringing more teams on board?
[00:23:24] Unknown:
Yeah, definitely. As a startup, Tobias, I'm sure you've seen this with other companies: we try to keep the architecture as simple as possible in order to move quickly and provide as much value as we can as quickly as possible. At the end of the day, this is a tool that serves primarily the data consumers and the data producers. For the data producers, we offer a command-line interface tool that engineers install on their computers to interact with the tracking plan that's managed in Iteratively. That CLI is responsible for creating and fetching the SDKs that have been personalized for their teams, and for linting the source code to make sure the implementation of analytics is actually correct and complies with the tracking plan that's been defined. So that's one part of it, the CLI. Another part is the web application that serves the data consumers, where events, properties, and templates are managed and the team comes together, as Patrick described earlier, to collaborate on the tracking plan and finalize each and every version of it. And the last part is our back end, where the schema is actually stored.
We'll talk a little later about how our schemas get persisted and how they get exposed in different formats depending on the use case a customer has for Iteratively. As far as how the architecture has changed over time: we started as a pretty simple CRUD app for tracking plans, and we quickly realized that one of the killer features of a tracking plan is the code generation of strongly typed SDKs. So we built a codegen service that currently supports about twelve language-and-platform combinations.
It's been an interesting challenge to create strongly typed SDKs for all of those languages; that's been one major change since the founding of the company. One other architectural change that's coming up is being able to process the data that customers generate with Iteratively server side as well. That lets us do a lot of validation, PII detection, and anomaly detection on the server and provide much more value to customers around the correctness of their analytics implementation.
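To make the server-side PII-detection idea concrete, here is a deliberately narrow sketch: a single heuristic (email-shaped strings) flagging suspect properties in an event payload. Real PII detection covers many more identifiers and is not described in detail in this conversation, so treat this as an assumption-laden toy:

```python
import re

# One narrow heuristic only: values that look like email addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def flag_pii(event_props):
    """Return the property names whose string values look like emails."""
    return [key for key, value in event_props.items()
            if isinstance(value, str) and EMAIL_RE.search(value)]

flagged = flag_pii({"user": "ada@example.com", "plan": "pro"})
print(flagged)  # ["user"]
```

A server-side pipeline could run checks like this on every incoming event and alert when a property that was never meant to carry personal data starts doing so.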
[00:25:50] Unknown:
And so as far as the overall data flow for people who are capturing the event data, does it get sent to your servers first for the validation step before then getting passed along to their destination point? Or is it something that they call out to you for the actual validation, but all the data flows directly through their own systems?
[00:26:11] Unknown:
Yeah, it's definitely the latter today. The SDKs that we generate do all of the validation client side and handle exceptions client side as well. If the validation passes and everything looks good, then we simply pass the data on to the underlying analytics provider SDK directly. So if you're using, let's say, Amplitude and Snowplow together, the Iteratively SDK will validate the data and then call into the Snowplow SDK or the Amplitude SDK directly, and those SDKs will send the data to those destinations.
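The validate-then-forward pattern described here can be sketched as a thin wrapper sitting in front of the destination SDKs. This is a hypothetical sketch with invented names; a simple type check stands in for the full JSON Schema validation the real generated SDKs perform:

```python
# Toy "schemas": required property names mapped to expected Python types.
SCHEMAS = {
    "Song Played": {"song_id": str, "duration_sec": float},
}

class ValidatingTracker:
    def __init__(self, schemas, destinations):
        self.schemas = schemas
        # destinations: track functions of the underlying provider SDKs,
        # e.g. Amplitude's and Snowplow's.
        self.destinations = destinations

    def track(self, event_name, props):
        schema = self.schemas.get(event_name)
        if schema is None:
            raise ValueError(f"unknown event: {event_name}")
        for field, expected_type in schema.items():
            if not isinstance(props.get(field), expected_type):
                raise ValueError(f"{event_name}.{field} must be {expected_type.__name__}")
        # Validation passed: forward to every configured destination SDK.
        for send in self.destinations:
            send(event_name, props)

amplitude_calls, snowplow_calls = [], []
tracker = ValidatingTracker(SCHEMAS, [
    lambda name, props: amplitude_calls.append((name, props)),
    lambda name, props: snowplow_calls.append((name, props)),
])
tracker.track("Song Played", {"song_id": "s1", "duration_sec": 213.4})
```

The key property of the design is that invalid events are rejected on the client before any destination ever sees them, so every configured tool receives the same validated payload.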
[00:26:45] Unknown:
And as far as building out the different integration points for the underlying analytics services that your customers are using, what are some of the challenges or difficulties that you've faced in being able to build and maintain all of those different integration points?
[00:27:02] Unknown:
Yeah. We support four analytics providers today: Segment, Amplitude, Mixpanel, and Snowplow Analytics. On top of that there's a custom analytics provider, which is used if you have your own data pipeline or your own data warehouse and you want to receive clean, validated data from Iteratively and send it on to your own back end. In terms of challenges, I think we've been lucky that a lot of these destinations have been around for quite some time, and their SDKs are well thought through and high quality.
They're well documented too, so we've been able to build adapters and integrations for all of them relatively easily. There's always the challenge of each SDK working a little bit differently that we've had to overcome, but I think that's just part of the process of supporting these destinations. There has been one real challenge we've come across: not all of these analytics providers offer SDKs across all the platforms that we support. Some do, some don't, which meant that for the ones that don't, we've had to build our own adapters that speak to their HTTP APIs in order to properly multiplex the events to all the places our customers want that data to go, even when those providers haven't built SDKs for the platforms our customers use. One of the other complexities
[00:28:31] Unknown:
that often comes up when you're dealing with any sort of structured data is the idea of schema evolution. I'm wondering what your experience has been in terms of supporting that, both on the collection side, where you want to ensure you're collecting all the necessary events even when multiple versions of the schema are running in production or in different environments, and downstream, where you can help ensure that the schemas being collected are compatible with where they're being stored. What are some of the overall challenges in that life cycle of an event? Mhmm. Great questions. Schema versioning is
[00:29:17] Unknown:
a big part of the Iteratively design. It came up pretty quickly in our conversations with customers: knowing which versions of which events are being collected, when they started being collected, and when instrumentation has been upgraded to the latest versions is really important, so that on the receiving end, when they're doing BI, a company knows what data they're actually looking at. What does it mean? What did it mean when it was implemented in the product? So we added versioning support pretty quickly, and we've been inspired in large part by the work the Snowplow Analytics team has done. They came up with a versioning scheme for JSON schemas called SchemaVer, which defines very clearly how events should be versioned, when they should be versioned, and, to draw a parallel to software package versioning, when a change should be considered a patch version, a minor version, or a major version.
We've adopted that same mechanism in Iteratively and implemented the logic inside our tracking plan application automatically. As teams iterate on their tracking plans and add, change, or delete events, we automatically version all of them following the SchemaVer model, and we embed all of the version numbers for all of the events inside the SDKs we codegen. So when events are sent to any destination the customer has configured, we include the version numbers there as well, letting the data consumers know what version each event is and deal with it accordingly on the receiving end.
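SchemaVer versions have the form MODEL-REVISION-ADDITION (e.g. 1-0-0): a MODEL bump is a breaking change, a REVISION bump may affect some consumers, and an ADDITION is fully backwards compatible. The bump logic itself is small enough to sketch:

```python
def bump_schemaver(version, change):
    """Bump a SchemaVer version string of the form "MODEL-REVISION-ADDITION".

    change is one of:
      "model"    -- breaking change (e.g. removing or retyping a required field)
      "revision" -- change that may affect some consumers
      "addition" -- fully backwards-compatible change (e.g. new optional field)
    """
    model, revision, addition = (int(part) for part in version.split("-"))
    if change == "model":
        return f"{model + 1}-0-0"
    if change == "revision":
        return f"{model}-{revision + 1}-0"
    if change == "addition":
        return f"{model}-{revision}-{addition + 1}"
    raise ValueError(f"unknown change kind: {change}")

# Adding an optional property is an ADDITION; removing a required one is a MODEL bump.
print(bump_schemaver("1-0-0", "addition"))  # 1-0-1
print(bump_schemaver("1-0-1", "model"))     # 2-0-0
```

Embedding the resulting version string in every emitted event is what lets a downstream consumer tell a 1-0-x payload apart from a 2-0-0 one.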
[00:31:00] Unknown:
And then another challenge in the evolution of schemas and in the management of event data is being able to build useful tests to determine whether the code that you're writing is able to properly handle the different formats of event data or being able to coerce events of certain types into the proper structure? And what have you seen as some of the useful strategies for building these effective integration tests and handling the overall continuous integration and continuous deployment
[00:31:33] Unknown:
strategy for the event data? Yeah, there are two thoughts that come to mind, Tobias. One is related to backwards compatibility for the schemas that developers implement. Backwards compatibility has been a big deal, especially for companies that have their own data pipelines and their own data warehousing. They want to make sure that when events are iterated on, it's done in a backwards-compatible way, so that breaking changes in the event schema don't create more problems for the data consumers and the data engineers handling those events. So one of the best practices, and one of the features we're actually working on, is helping teams avoid backwards-incompatible changes, to simplify processing and storage of the data being collected, especially for on-prem, in-house data management teams. The other thought that comes to mind, as far as testing in general is concerned, is that analytics historically hasn't been an area that a lot of teams have paid attention to in terms of testing, especially automated testing. It's typically tested by hand, manually, by the data analysts or the data engineers themselves, who kind of swoop in at the last minute to make sure that the engineers have actually implemented the tracking plan correctly. So we do two things here, and we've seen companies follow this practice well.
One: we try to eliminate as many of the problems that need to be tested to begin with, through things like JSON schema validation and strong typing. Two: we try to make it really easy for companies to unit test and integration test their existing analytics implementation with the help of the Iteratively SDK, by making it very easy to build or augment test cases that, in addition to validating the core functionality of the product, also validate that the right events have been sent at the right time and in the right order.
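One way to make that testing style concrete is to swap a recording test double in for the real tracker and assert on the sequence of events a code path emits. This is a hypothetical sketch, with invented event and function names:

```python
class RecordingTracker:
    """Test double that records events instead of sending them anywhere."""
    def __init__(self):
        self.events = []

    def track(self, event_name, **props):
        self.events.append((event_name, props))

def checkout(tracker, cart):
    """Example product code path with instrumentation alongside core logic."""
    tracker.track("Checkout Started", item_count=len(cart))
    total = sum(price for _, price in cart)
    tracker.track("Order Completed", total=total)
    return total

tracker = RecordingTracker()
total = checkout(tracker, [("song", 0.99), ("album", 9.99)])

# The assertion style: right events, right order, right payloads.
assert [name for name, _ in tracker.events] == ["Checkout Started", "Order Completed"]
assert tracker.events[0][1] == {"item_count": 2}
```

Because the tracker is injected, the same assertions run in unit tests and in broader integration tests without any network traffic.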
[00:33:36] Unknown:
And another element of managing schemas and being able to capture useful context is the idea of documenting the different fields and what their purposes are. I'm curious what level of support you have for being able to capture that metadata and provide that context to the engineers who are capturing the schemas to ensure that they're pulling attributes from the right sources, but also for downstream users to understand
[00:34:03] Unknown:
what the intent is of those different attributes and fields within the schema? Yeah, communicating this information to engineers is super important. There's one thing to be said about defining a great schema for the data, but what's really important is that the engineering team actually knows where to get the data and what the data means: what did the PM or the growth manager have in mind when they defined a property named X? What exactly do they want to collect? There are two things we do here. One is that every data element in the tracking plan has a description and a set of example values, entered when the tracking plan is being created, to make sure the documentation for the data is very clear and explicit.
The second thing we do is take that information, the descriptions, the example values, and the possible values in the case of enums, and include it as code docs inside the SDK that we generate. So if you're using any modern IDE for development, all of that context and metadata shows up right there inside your IDE as IntelliSense or as code docs, just by hovering over an event name or a property name in your environment.
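Concretely, a generated tracking call might look something like this, with the tracking-plan description and example values carried into the docstring and constrained values expressed as an enum. This is a hypothetical illustration of the pattern, not Iteratively's actual generated output:

```python
import enum

class PlanTier(enum.Enum):
    """Allowed values for 'plan tier', as defined in the tracking plan."""
    FREE = "free"
    PRO = "pro"
    ENTERPRISE = "enterprise"

def plan_upgraded(tracker, previous_tier, new_tier, seats):
    """Track 'Plan Upgraded'.

    previous_tier -- tier before the upgrade (example: PlanTier.FREE)
    new_tier      -- tier after the upgrade (example: PlanTier.PRO)
    seats         -- number of seats purchased (example: 5)

    An IDE surfaces this docstring and the enum members as IntelliSense,
    so the engineer sees the analyst's intent right at the call site.
    """
    tracker.track("Plan Upgraded", {
        "previous_tier": previous_tier.value,
        "new_tier": new_tier.value,
        "seats": seats,
    })

calls = []
class _FakeTracker:
    def track(self, name, props):
        calls.append((name, props))

plan_upgraded(_FakeTracker(), PlanTier.FREE, PlanTier.PRO, seats=5)
```

The enum also closes off a whole class of errors: a typo like "por" simply fails to compile-check, instead of landing as a bad value in the warehouse.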
[00:35:17] Unknown:
To add to that on the data consumption side: one of the things that's very important, and we see a lot of parallels in tools like Great Expectations and dbt, is always keeping up-to-date documentation for folks outside of the IDE as well, primarily data consumers. So one of the things we've done is generate a shareable unique link for anybody, carrying an up-to-date schema with the latest and greatest events and properties, that they can share within their organization. We've seen a few teams embed this in tools like Confluence or Notion so that they have a similar source of truth outside of the logged-in state of the tracking plan. Another thing we see data consumers wanting quite often is to use the schema in third-party tools, for example with Mixpanel and Amplitude: taking the schema that exists inside of Iteratively and syncing it into these third-party tools to enrich the analysis experience inside the tools themselves. Imagine having the description inside of Iteratively and syncing that description into Amplitude, so when a PM is writing a query, they know exactly what that event means, where it's being captured from, and why it's being captured.
[00:36:27] Unknown:
And as a platform that sits between these different layers of the engineering stack, what has been some of your biggest challenges in terms of providing useful messaging and working with customers
[00:36:41] Unknown:
to identify ways that you can help them and helping them to understand the overall value that you provide? Yeah, good question. I think one of the things that comes up quite often is the notion of open source software. For us, the SDKs that we codegen are open source; they're very lightweight, very thin wrappers of the existing clients that teams use today. That has helped mitigate a lot of the concerns teams have around adoption. The other thing obviously comes down to switching costs: minimizing the amount of effort it takes to actually put good data management practices in place. We've built a ton of features recently to decrease the effort of adopting them.
And specifically with most of our early adopters, one of the things they have valued quite a bit is that they get to choose whether or not they send us the data. By default, we only see the data in development, which helps with QA: we have a live stream of all the event payloads that are coming in, so you can QA very easily across both your back-end events and your client events. They have the choice to disable that, and that has made it very easy to say, hey, we're not a data subprocessor, you don't have to worry about GDPR compliance here, we don't see your data in production. That has been very beneficial for us in unlocking larger deals in the mid-market and enterprise.
[00:38:03] Unknown:
One of the difficulties, Tobias, that we've had with messaging the value of the product is that a lot of companies don't realize a product like this even exists. They'll naturally fall back to using a document or a spreadsheet, pretty much the same mechanism they've used for documenting other things about their software. This idea that an event registry service exists that lets them define their tracking plans in a very structured way, helps guide that process throughout the team, and then actually codegens strongly typed SDKs that make it easy for developers to implement that schema, that analytics: that's not something folks are necessarily searching for or looking out for. So one of the biggest challenges, which Patrick primarily has been dealing with, is educating the community and the folks in the space that this is possible, and that they should reconsider old-school Google Sheets or Excel sheets in favor of something like Iteratively.
[00:39:10] Unknown:
And I know that there are other approaches to handling schema definitions. There's the schema registry for Kafka, and Snowplow has the Iglu schema registry, and there are other ways of enforcing schemas on event collection. What have you seen as the challenges or shortcomings of those types of systems in terms of managing the overall life cycle of event data and ensuring its accurate capture?
[00:39:40] Unknown:
I think one of the biggest ones tends to be usability. You often have a wide range of folks on the data consumer side with very different skill sets. For instance, product managers using the Iglu registry to define analytics events isn't something we've seen successfully adopted within the companies we've interviewed, and the same goes for a PM or an analyst using the Kafka schema registry. In the context of usability, the reason spreadsheets persist today is that they're generally easy to use, but they don't do a great job with this specific type of data. So while it might be easy for an engineer to use the Iglu registry or other registries, we haven't seen that ease of use extend to data consumers.
That's where I think we at Iteratively have spent a lot of time: improving the usability of defining, sharing, and collaborating on that schema without having to jump through a ton of hurdles or take a purely engineering approach to solving the problem.
[00:40:48] Unknown:
Our biggest learning, Tobias, has really been this gap, this chasm, that exists between the data consumers and the data producers. You've got spreadsheets, which are the best the industry has had to offer to data consumers. The data producers, the engineers, have had a little more power available to them through tools like the Kafka schema registry or the Iglu schema registry, but at the end of the day those are engineering-focused tools, and they're typically pretty complex to set up and manage.
When we did the customer development interviews around this and spoke to people, we realized this really is a collaborative effort and a multifaceted problem for companies, and a single solution that works for all of the stakeholders, with both the usability and the power in one package,
[00:41:42] Unknown:
is really the only way to go. As far as users of your platform, what have you seen as some of the most interesting or unexpected or innovative ways that it's being used?
[00:41:53] Unknown:
Okay, I'll throw one out there. One of the more unexpected things we've seen is the way developers consume the SDK that we generate. We thought everybody would love an API that has strongly typed methods for all of the events defined in your tracking plan. It turns out there are a lot of teams out there that have built abstraction layers around analytics in their code bases, and being forced to call methods to track events on a library like Iteratively just doesn't fit well into their systems. So what we've done is build strongly typed classes, instead of strongly typed methods, for all of the events defined in the tracking plan. That allows developers to instantiate those classes anywhere they like, without any dependency on the Iteratively SDK itself, pass them throughout their application and whatever messaging workflow they have in place, and then call the method to actually track the event only once, at the very end of the process, to send it to, let's say, Amplitude, Mixpanel, or Snowplow.
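The class-based shape can be sketched like this: a strongly typed event object is constructed where the data is known, passed around freely with no SDK dependency, and only handed to the tracker once at the end. All names here are hypothetical:

```python
from dataclasses import dataclass, asdict
from typing import ClassVar

@dataclass(frozen=True)
class SongPlayed:
    """Strongly typed event from the tracking plan; no SDK dependency here."""
    EVENT_NAME: ClassVar[str] = "Song Played"
    song_id: str
    duration_sec: float

def handle_playback(event_queue):
    # Deep in application code: construct the event, no tracker in sight.
    event_queue.append(SongPlayed(song_id="s1", duration_sec=213.4))

def flush(event_queue, destinations):
    # At the edge of the system: track each event exactly once.
    for event in event_queue:
        for send in destinations:
            send(event.EVENT_NAME, asdict(event))

sent = []
queue = []
handle_playback(queue)
flush(queue, [lambda name, props: sent.append((name, props))])
```

The frozen dataclass keeps the payload immutable while it travels through the application, and `asdict` produces the wire payload only at the single point where tracking actually happens.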
So the different ways engineering teams implement analytics and collect it inside their source code has been one of the more surprising and exciting things we've seen, and we're trying to build the SDK to support all of these use cases.
[00:43:48] Unknown:
That's great. I was gonna say, I think one of the things that's interesting on my side, specifically because I sit more with the marketing and sales team, is how often we see referring traffic come in from third-party tools: folks copying and pasting the URLs for the Iteratively definitions into tools like Asana and Jira to stitch together that entire workflow and collaborative experience. We see quite a lot of referrers coming in from Jira instances. So the ability for us to streamline that whole process for the engineers actually instrumenting it, when you push a definition and create a new version, is something that's really top of mind for me at the moment. We're trying to iron out that entire workflow because, yeah, we quite often see folks copying and pasting event definitions into third-party tools, wanting to provide that context in the tools their teams are already using. I find that pretty interesting.
[00:44:45] Unknown:
And as far as your own experiences and lessons learned while building the product and the business of Iteratively, what are some of the most interesting or unexpected or challenging elements that you've experienced?
[00:45:01] Unknown:
I think most businesses are built primarily with one stakeholder in mind, one persona, whereas analytics is such a team function; it's very much a team sport within organizations. That's one of the ways we see the existing tools in the market falling down: for analytics to be successfully adopted, and for the organization to get value, it really needs to be thought of as a team sport. So we've spent a ton of time focusing on providing value not just for the data consumers but also for the data producers. When we think about data consumption, it's really about defining your analytics schema in a way that minimizes error, helps you know whether you have similar properties already in use, and lets you create templates and reuse properties across events.
That has been super helpful on the data consumer side. Then on the data producer side, it's about simplifying the instrumentation process, bringing IntelliSense and code docs into your IDE, and making it ten times easier to do analytics instrumentation with a high-quality result. For us, it always comes down to where the focus should be, and what we've realized is that in the context of analytics, we really do have these two different groups of folks that we're catering to. The only way for us to deliver a successful outcome at the end of the day is to make the experience better for both the data consumer and the data producer.
[00:46:24] Unknown:
Yeah, I'd say one of the expected but nevertheless challenging lessons has been just how high a bar developers have for the software they use and include in their products. That didn't surprise us; we've been building products for a while ourselves. But when you're on the other end, actually building products for developers and engineers, it definitely comes into play. This idea of building MVPs, cutting corners to ship quickly, iterating on the value, and running a lot of experiments to see what works: that all works pretty well on the web for our PMs, growth managers, and analysts, our consumers and customers there. But on the engineering side, we got pushback very quickly whenever the SDK or the CLI wasn't just right and didn't pass the usually high bar developers have for this type of software. There's also a lot of natural instinct from engineering teams to just build things themselves.
As engineers, we have to be convinced that something shouldn't be built in-house and should instead be bought or taken off the shelf. The Iteratively engineering team has definitely spent much more time on the developer side than on the PM side for this exact reason. And what are the cases where Iteratively is the wrong choice, and somebody would need to look to another tool or another set of practices for managing their event data, their analytics schema, and their metric and event definitions? I think the biggest issue that we see is when teams
[00:48:02] Unknown:
don't necessarily care about data. With Iteratively, we force you, for good reason, to do the work upfront to actually define all of the event structures, which means you have to commit to that work. There's a dichotomy in the market right now: you have explicit tracking, where folks are willing to have their engineers write analytics code into their product, and you have implicit tracking, where you auto-capture as much data as you can and then hopefully make use of it downstream. We're best suited for folks who want to do explicit tracking and have developer resources. If you don't have developer resources, or you're not using this data for any sort of revenue-driving purpose, the value is just not quite there in our minds. We're primarily well suited for companies who understand this method
[00:48:56] Unknown:
of tracking. It's a good train of thought. There are definitely two schools of thought here, Tobias, around data collection and event collection: one is explicit, the other is implicit, implicit meaning auto-capture. Every team has to think long and hard about which of the two camps they fall in and figure out the pros and cons of each for their own particular use case. Like Patrick said, we're definitely best suited for teams that have opted for the explicit school of thought. They appreciate the forethought and planning that goes into defining schemas that actually matter for the business and the metrics they care about, and they appreciate the reliability, quality, and dependability of the data that gets collected as a result. But there are definitely teams and use cases where this is not top of mind, and they're happy to use an auto-capture tool, collect anything and everything that happens in the product, and sort out the details at the end, figuring out from an exploratory point of view what's happening in the product, what people are doing, and what events they should be paying attention to, rather than relying on properly structured, high-quality, highly reliable data for BI and for marketing and sales automation.
[00:50:34] Unknown:
I think the last thing I'd add is that because this really sits inside the software development life cycle, we're focused on teams who do explicit capture as part of the way they deploy code. So not using tag manager solutions, whether Adobe, Tealium, or Google Tag Manager, but actually integrating analytics into your code base the way you integrate other types of products. That helps with things like testability, ensuring that as your underlying product changes, your analytics don't break. That's been very informative for us; we don't work well with those tag manager solutions for a variety of reasons, and we tend to sit in the camp where it's better to have analytics written by engineers.
It's the only way to maintain good consistency and rigor in data collection.
[00:51:27] Unknown:
And in terms of your plans for the future of the product and the business, what are some of the things that you're looking forward to working on, or some of the requests that you've gotten from your customers,
[00:51:41] Unknown:
to improve the overall value that you can provide to them? Yeah, great question. To second some of the things that Ondrej pointed out earlier, the software development rigor that we add with Iteratively today has been super helpful in minimizing a lot of human error in the definition and instrumentation phases and in integrating QA into that process. What we've seen quite often from a lot of our early-access customers is the ask for more monitoring of how their analytics are performing in production: understanding not just the frequency of events but how often they're failing, adding rigor there as well, giving them a good sense of whether things are working or breaking, and adding alerting and monitoring.
Additionally, as part of the software development life cycle, we want to improve how analytics fit into your existing test suite: not just unit-test integration but end-to-end test integration, making sure that as your product changes, your events don't break, and understanding how the data is consumed downstream. For instance, suppose you have a funnel report and your event capture changes: how does that impact the reporting downstream? That's one of the things that comes up quite often, really trying to stitch these two worlds together, how the event is collected and how it's being used.
And there's not really a good way to do that today.
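The funnel-breakage scenario just described can be sketched as a compatibility check between a downstream report's expected events and the current tracking plan. This is a hypothetical illustration of the idea, with invented event names:

```python
def funnel_breakages(funnel_steps, plan_events):
    """Report funnel steps that no longer exist in the tracking plan."""
    return [step for step in funnel_steps if step not in plan_events]

# A downstream funnel report depends on three events from the plan.
funnel = ["Signup Started", "Signup Completed", "Order Completed"]

plan_v1 = {"Signup Started", "Signup Completed", "Order Completed"}
assert funnel_breakages(funnel, plan_v1) == []

# After an event is renamed in the tracking plan, the funnel silently breaks
# unless something checks the dependency.
plan_v2 = {"Signup Started", "Registration Completed", "Order Completed"}
broken = funnel_breakages(funnel, plan_v2)
print(broken)  # ["Signup Completed"]
```

Running a check like this whenever the tracking plan changes is one way to surface the collection-to-consumption dependency the speakers say is missing today.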
[00:53:10] Unknown:
One of the things that I'm most excited about, Tobias, is weaving analytics and metrics definition more into the process of product management itself, helping teams think about analytics from the earliest stages of coming up with a new feature or a new design for a feature. Today, the tracking plan app that we've built doesn't really help that much there. As we talked about before, it's a way for multiple folks on the team to come together and define what the tracking plan looks like, but we still see it used after the feature has already been thought through, after the spec has been written, and oftentimes after the code itself has been written. Usually, that's too late. One of the things we're trying to do with Iteratively is hook into the life cycle a little earlier, integrating with tools like Jira, for example, to put analytics front and center when features are being prioritized, costed, designed, and spec'd out, and help teams think it through sooner rather than later.
[00:54:20] Unknown:
Are there any other aspects of the work that you're doing at Iteratively or the overall space of event data capture and the analytics life cycle that we didn't discuss that you'd like to cover before we close out the show? I think that was a good,
[00:54:33] Unknown:
good overview, Tobias. I'm trying to think of things we didn't discuss. For us, the major talking points are typically testability, alerting, monitoring, and collaboration, and we've touched on most of those today.
[00:54:47] Unknown:
That's right, Patrick.
[00:54:48] Unknown:
You know, for me, one of the things we see quite often is that we're really trying to change analytics from being an afterthought for most organizations to folks thinking of analytics as a P1 feature. There's a lot of lip service paid to data within most organizations, and we really want to not just simplify that process but also help evangelize the work that data engineers, analysts, and data scientists do and the value they deliver to their organizations, and make it more integrated into the software development process, where right now the tools and the workflow are very disjointed across these users within organizations.
[00:55:34] Unknown:
Alright. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. If you wanna go first, Patrick.
[00:55:53] Unknown:
Yeah. So today, the biggest gap in the market, in my mind, has to do with collaboration. There are systems of record in place, but the reality is that they're not easy for folks across the organization to use. You have folks in marketing and customer success and sales and product who all consume this data, but the systems that exist have been designed for engineers, which means that adoption by other folks within your organization is typically hamstrung by the tooling that's available in the market. What we're very excited about is seeing new ways for folks who don't typically fall within the engineering bucket to collaborate on data, be that with data visualization tools, with data automation tools, or with tools that help you define and instrument high quality analytics.
[00:56:42] Unknown:
And, Ondrej, how about you? I think, Tobias, the biggest gap, and we're obviously biased after having spent the last year and a half, two years working on Iteratively, is the lack of productized event registries. We see some open source projects coming up to try to address this problem from companies like Lyft and Uber, but there really isn't an end to end commercial solution that tries to serve the needs of all of the stakeholders at a company. That's what we're trying to do with Iteratively: build a single platform for event schemas that companies can rely on, and sync that schema into all of the places where it's needed. At the end of the day, data is highly distributed, and companies send event data to dozens and dozens of systems today: marketing automation tools, sales automation tools, product and business intelligence tools. Knowing what that data means across all of those tools, so that the stakeholders and consumers of those tools can leverage the data to the greatest degree, is really important, and it really doesn't exist today.
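The event-registry idea described here, a single source of truth for event schemas that tracking code checks against before events fan out to downstream tools, can be sketched in a few lines. This is a minimal illustration, not Iteratively's actual API; all names and schemas below are hypothetical.

```python
# Hypothetical event registry: maps each event name to its required
# properties and their expected types. In a real system this would be
# generated from a shared tracking plan rather than hand-written.
EVENT_REGISTRY = {
    "song_played": {"song_id": str, "duration_seconds": int},
    "user_signed_up": {"plan": str, "referrer": str},
}

def validate_event(name, properties):
    """Return a list of problems; an empty list means the event is valid."""
    schema = EVENT_REGISTRY.get(name)
    if schema is None:
        return [f"unknown event: {name}"]
    problems = []
    for prop, expected_type in schema.items():
        if prop not in properties:
            problems.append(f"missing property: {prop}")
        elif not isinstance(properties[prop], expected_type):
            problems.append(f"{prop} should be {expected_type.__name__}")
    for prop in properties:
        if prop not in schema:
            problems.append(f"unexpected property: {prop}")
    return problems

def track(name, properties, destinations):
    """Validate against the registry, then fan out to every destination."""
    problems = validate_event(name, properties)
    if problems:
        raise ValueError(f"invalid event {name!r}: {problems}")
    for send in destinations:
        send(name, properties)
```

The point of the registry is that every destination (analytics, marketing automation, BI) receives events that have already passed the same schema check, so the data means the same thing everywhere it lands.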
[00:57:55] Unknown:
And it was one of the more surprising things that we've seen on our journey so far. Well, thank you both very much for taking the time today to join me and discuss the work that you're doing with Iteratively. It's definitely a very interesting and important problem domain, and something that I'm happy to see being addressed in a fashion that helps to enhance the collaboration opportunities for all of the stakeholders of data and how it's being used throughout the company. So thank you both for the time and effort you've put into that, and I hope you both enjoy the rest of your day. Thanks, Tobias. Definitely appreciate it. You too. Thank you for listening. Don't forget to check out our other show, podcast.init at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Career Advice
Interview with Patrick Thompson and Ondrej Hrebicek
Iteratively: Background and Purpose
Challenges in Data Management
Anti-Patterns in Data Collection
Customer Adoption and Usage
Schema Evolution and Testing
Usability and Collaboration
Lessons Learned and Future Plans
Closing Thoughts and Contact Information