Summary
Building data products is an undertaking that has historically required substantial investments of time and talent. With the rise of cloud platforms and self-serve data technologies, the barrier to entry is dropping. Shane Gibson co-founded AgileData to make analytics accessible to companies of all sizes. In this episode he explains the design of the platform and how it builds on agile development principles to help you focus on delivering value.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24x7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24x7 support.
- Your host is Tobias Macey and today I’m interviewing Shane Gibson about AgileData, a platform that lets you build data products without all of the overhead of managing a data team
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what AgileData is and the story behind it?
- Who is the target audience for this product?
- For organizations that have an existing data team, how does the platform augment/simplify their work?
- Can you describe how the AgileData platform is implemented?
- What are some of the notable evolutions that it has gone through since you first started working on it?
- Given your strong focus on Agile methods in your work, how has that influenced your priorities in developing the platform?
- What are the most interesting, innovative, or unexpected ways that you have seen AgileData used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AgileData?
- When is AgileData the wrong choice?
- What do you have planned for the future of AgileData?
Contact Info
- @shagility on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's A-T-L-A-N, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Shane Gibson about AgileData, a platform that lets you build data products without all of the overhead of managing a data team. So, Shane, can you start by introducing yourself?
[00:01:39] Unknown:
Yeah. Hi. I'm Shane Gibson. I'm the chief product officer and cofounder of agiledata.io,
[00:01:45] Unknown:
and I'm a longtime listener and second-time caller. Yeah. And for folks who didn't listen to the previous episode you were on, I'll add a link to that where we discussed some of the ways that agile methodologies can be applied to data engineering and data management. And so today, we're focused more on the product aspect of what you're building. Before we get into that, for folks who didn't listen to that past episode, if you could just give a refresher on how you first got started working in data.
[00:02:11] Unknown:
Yeah. So I started my career out working on financial systems. So in the world before enterprise resource planning or ERP, when we had separate accounts payable and accounts receivable type modules. So I was kind of in the systems accounting land looking after those financial systems. And from there, I jumped across into vendor land. So I started working for some of the big US software companies, but based out of New Zealand. And as part of that, realized that my passion really wasn't in those ERPs, but I really liked the idea of data and analytics. And back then, you know, it was called business intelligence. That's how old I am. So did that for probably about a decade. And then from there, jumped into founding my own consulting company. So typical data and analytics consulting company, 10 to 20 people on the team, and we'd go out and help customers implement platforms and strategies and do that cool data work. And from there, I morphed into agile coaching. So I found that I had a passion for working with data and analytics teams that were starting their agile journey and coaching them on ways of working. So applying, or helping them apply, patterns that I'd seen from other organizations or teams that I'd worked with in a particular context, helping them apply it in their way of working, and seeing the success of that team really just starting to rock it, enjoy their work, deliver stuff, get value, get feedback.
So as part of that, about 3 and a half years ago, my cofounder, Nigel, and I started agiledata.io, taking a product focus to that kind of capability.
[00:03:39] Unknown:
Can you give a bit of a high level on what it is that you're building and some of the ways that that product is manifesting and some of the target audience that is driving the focus of how you're building it?
[00:03:50] Unknown:
So we're combining a software as a service product, is the way we think about it, that manages the whole data process from collection through to ready for consumption, with those agile ways of working. So we believe that any product should be bound by ways of working you can teach people that make sense. Right? So not a methodology as such, but a set of good practices that make sense when you're doing a certain part of that data supply chain. So that's our focus, building out both this way of working as repeatable and teachable with the product that supports it and makes it easy. In terms of the audience, we kind of break it down now to 2 audiences. The first audience is the buyer, who would actually buy a product capability. And that really is anybody that's got a data problem. You know, there's a bunch of people that are in organizations that know they have data. They can't get access to it. They just want their data to be turned into information so they can make a business decision and take action. Right? And then from that action, get some outcomes, and they struggle to get that done. And there's different reasons for different organizations why that happens. So that's the buyer. The user, we are hyper focused on a data savvy analyst.
Yeah. It might be a data analyst, might be a business analyst, or somebody who's data savvy, but actually gets frustrated with the fact that they have to hand off the data work to an engineering team and wait. And it's not the engineering team's fault. Right? They've just got too much work to do. But the analyst actually wants to get that work done. And so if we can enable them to do that work themselves, giving them that self-service in that data management space with the right guardrails, so it's not ad hoc, then we think that we can help those people get that information to their stakeholders quicker,
[00:05:30] Unknown:
better, and ideally with a bit more fun. In terms of the types of organizations that are likely to use the AgileData platform, I'm curious if you can give some characteristics of maybe the size or scale or some of the ways that their technical capabilities or engineering teams might be structured or the skill sets that they might have or maybe even more notably, the skill sets that they might lack on the team?
[00:05:55] Unknown:
So currently what we find is customers that don't have a data team are the ones that we can help the most. So they are typically somewhere between 2 to 50 or a 100 people. They have started either building out their own platform, software as a service kind of internal capabilities, so they're a startup, or they're an organization that is using a bunch of different software as a service products to run their business. And they're starting that journey of, okay, we've got all this data and now we need to put it together. When we look out in the market to do that, we now have to hire a bunch of data people. You know, we have to hire a bunch of data engineers. We have to buy a bunch of software or software as a service. We have to cobble together this modern data stack, and that's expensive and it takes time. So they are currently the customers that we serve.
Our focus in the future is really enabling organizations that have those analysts in place and then have a constraint around the engineering practice. And therefore, that whole idea of bringing self-service into that organization. I kind of liken it to the wave we had previously around self-service BI and visualization. You know, the Tableau kind of wave, I call it, where there was a constraint on the ability to create reports. Right? We used to use, you know, Reporting Services or BusinessObjects. It was a technical product. There were a bunch of gates to use it, which were there in place because, you know, you had to have confidence around a whole lot of things to make that stuff work. And we then saw this wave of self-service capability come in in that visualization space, and we enabled more people in the organization to do good work.
We see that same wave starting to happen for data management, and we believe that the same self-service capability can be brought to that audience and enable them to do that work, freeing up the engineers to do the work that they are really, really good at and should be focused on.
[00:07:43] Unknown:
For teams that maybe have an existing, you know, data engineering group or a set of data professionals, what are some of the ways that they might interact with the Agile Data platform and some of the maybe support burdens that are alleviated because other stakeholders and analysts in the business can just use agile data to be able to perform the workflows that might otherwise require the intervention or support of those data engineers or data professionals?
[00:08:14] Unknown:
Typically, what happens is data engineers like to code. They like to build things out themselves. That's what they're trying to do. That's what they love doing. So we often find if there's a big data engineering team in an organization, we probably aren't good for them because culturally, they like to build rather than buy or lease. If we take that away and we look at it, we look at it as: there's a bunch of plumbing that everybody has to do for data. Yeah. If you think of it as plumbing, you think of it as moving water. There's a bunch of pipes that you always have to build. And our data engineering teams in organizations are stuck with building that plumbing and spending that time day in and day out.
So what we look at is, if we automate that plumbing for them, right, if we take just that pure movement of water from the collection through to ready for consumption, then the engineers can focus on the more fun stuff. Now, 1 of the areas that they still need to focus on is around data collection. Because when you talk about a big organization, especially 1 that has a bunch of data engineers, they typically have a myriad of source systems, or systems that capture data, where data goes into. And some of them are, you know, easy to get to software as a service products like Shopify or Salesforce where it's fairly open to get to the data. But 9 times out of 10, there's also a bunch of proprietary, firewalled, on-prem or private cloud capabilities that we can't connect to. So there's still that data collection work, right? To be able to pick up the data and make it available to be ready to be transformed and consumed.
And then if we think about, we take care of that plumbing bit, then once that water turns up at the end of that pipe, that's where a data engineer can add the real value. Right? Or an engineer in general, because engineers are good at solving problems. And so if we think about that, water comes out of the tap where it's nice and clean, you know, as much as it can ever be. Now we have to make that data useful. Right? It's consumable, but it's not useful. It's still plain water. So how do the engineers then get involved in the business problems that the organization has around how to use that data? Is that helping the data scientists make better machine learning models, which give better recommendations to their customers? Right? Is that focusing on some things like master data management and the complexity of what are the actual rules we need to combine that data because the keys aren't the same? Right? There's still 101 problems to be solved, but the problems aren't moving data from left to right. That's the bit we take care of. So we still see the engineering skill set as having massive value in every organization, but we also see the engineering skill set as being 1 of the primary constraints for organizations right now to get data work done, and that's the bit that we wanna automate for them.
[00:10:45] Unknown:
So as far as the design aspects of how you think about building the Agile Data Platform, given the focus on these analytical workflows and on organizations that don't have a large footprint of data engineers, I'm curious how you think about the design and the interfaces that you want to build out to make these potentially very complex workflows understandable and manageable by people who don't necessarily want to make that their entire job. They just wanna be able to get something done.
[00:11:22] Unknown:
So we think about it in 2 ways. So, yeah, Nigel, my cofounder, is the techie. Right? He's the plumber. He's the guy with the experience building these things out over generations. So we think about complete automation in the back end every time. Right? We think about, if people are using this product and we're not there to watch it happen, how do we make sure it's bulletproof? And as part of that, we become highly opinionated. So for example, we have baked in a pattern in our product where everything is historized. Every time a new record comes in, we check to see whether it's a change. And if it is, we store that change. We don't have a conversation about, do you want SCD 1 or 2 or 5 or 6 type behavior? We go, actually, we're gonna automate it. So every change comes in, it's stored as a change, and that history is immutable.
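The historisation pattern Shane describes here, where every incoming record is compared against the latest stored version and appended as a new immutable row only when it differs, is roughly the shape of the sketch below. This is a minimal illustration in Python under assumed names (the row-hash comparison, key columns, and in-memory "history table" are all hypothetical), not AgileData's actual implementation.

```python
import hashlib
import json
from datetime import datetime, timezone

def row_hash(record: dict, key_columns: set) -> str:
    """Hash the non-key attributes so we can detect whether anything changed."""
    payload = {k: v for k, v in sorted(record.items()) if k not in key_columns}
    return hashlib.sha256(json.dumps(payload, default=str).encode()).hexdigest()

def historise(incoming: dict, history: list, key_columns: set = {"customer_id"}) -> None:
    """Append-only history: store a new version only when the record has changed.

    `history` stands in for the immutable history table; nothing is ever
    updated or deleted, we only ever add new versions.
    """
    key = tuple(incoming[k] for k in key_columns)
    new_hash = row_hash(incoming, key_columns)

    # Find the latest stored version for this business key, if any.
    latest = next((row for row in reversed(history) if row["_key"] == key), None)
    if latest is not None and latest["_hash"] == new_hash:
        return  # No change detected, nothing to store.

    history.append({
        "_key": key,
        "_hash": new_hash,
        "_loaded_at": datetime.now(timezone.utc),
        **incoming,
    })

# Usage: the second call is a no-op, the third stores a new immutable version.
history_table: list = []
historise({"customer_id": 1, "last_name": "Gibson"}, history_table)
historise({"customer_id": 1, "last_name": "Gibson"}, history_table)
historise({"customer_id": 1, "last_name": "Smith"}, history_table)
```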
And by being opinionated with that, we take away some of that complexity from the user. Now there's a hell of a lot of complexity on our side. Right? Because if we're getting fed snapshot data, we've got to do the delta differences. If we're getting event streaming, we've got to figure out whether it was an event change or a new event record. There's a lot of complexity for us that we need to plumb under the covers, but that's our job. So that's the first thing we do: whenever we see a pattern that had complexity when we were consultants and delivering that for a customer, then how do we automate that in the back end to make it as bulletproof as we can? The second lens we then take, which is the lens I take, is around the product. Right? So how do we get this interface, this app, and how do we make it available to an analyst where they're building out data stuff that is typically complex, and make it easy for them? So an example around there is data design.
So we are really strong proponents, and highly opinionated, that data should always be designed. It should be lightly designed. So how do we take this data design process, this data modeling process, and help an analyst do it in 5 minutes rather than take, you know, 6 months? And how do we give them tools that allow them to do it where they're not having to draw ERDs, entity relationship diagrams, and we have to teach them about crow's feet and many-to-many joins and all those kind of things? And so what we did was we prototyped and tested this idea of a design canvas, where they can go in and say, I have some core concepts: customer, product, order.
I can drag those tiles together to say, here's an event, customer orders product, and that actually builds a conceptual design in our product. And then, okay, now I've got a conceptual design. How do I populate it with data? Well, we've built a rules engine that uses natural language. Right? So given the data in this history tile, say, Shopify customer, and we've got these fields, you know, name, date of birth, those kind of things, then populate this detail tile about the customer. And then when your data comes in, automatically trigger a load, right, and trigger the change detection. And, you know, somebody's changed their last name, bring in a new record that shows that change and the date that the name changed. So effectively, by working with analysts, we look at some of the things that we used to do unconsciously as engineers and data specialists.
And we try and figure out how we build interfaces that make it really simple for them to do that work without knowing the complexity that lives under the covers. And to be fair, that's hard. Every time we do 1, you have to sit back and go, okay, as humans, what do we do? Right? What are we looking for? What are the things we look for that are triggers to make a decision that we need to do a piece of data work? And therefore, can we automate that so they never have to do it? Or do we need to prompt them with some decisions? And by making a choice of yes or no, we do that work for them. Yeah. So that's our focus, right? How do we bring that interface to an analyst where they do the work without knowing they're doing it? To your point of
[00:15:08] Unknown:
not forcing the analyst to understand or have to, you know, go through the process of educating themselves on these various different types of joins and how to find the appropriate data sources to be able to answer their question. I'm wondering what are some of the, I guess, thorny engineering problems that you've had to work around to be able to make the end user experience simple?
[00:15:34] Unknown:
Every 1 of them is thorny, every time we go and do something that seems simple. And we're hyper focused on removing the effort or removing the clicks. So, yeah, I'll give you an example. We had a customer that we were doing a data migration for; that was the use case. And we built the ability to bring in some data from something like Shopify, land it into what we call our history layer. So that's the immutable layer that holds all data over all time, all changes. Right. And we really care about that layer. And then creating the rules that we use to populate a concept. You know, so that's a customer, that's a product. We build all that out, and I got really good at it. You know? I could go and, you know, do a rule for that. Normally, in about 2 to 3 minutes, it was in production. And, yeah, the time was spent looking at the history data, understanding where the key was, creating the rule, running the rule, going back and having a quick look at the data, making sure that, you know, the test had passed and that the data looked right. So I got it down to 3 to 5 minutes, and I was like, yeah. That's great.
At the time, we had to write a rule from the landed data to the history data. So I had to manually go and do those rules. And, you know, you bring in Shopify data, it's, you know, at least 25 tables, 3 minutes each. I'm like, I'm a machine. For the data migration system, though, we had an on premise SQL Server database we had to bring in. And because we were migrating the data, we had to bring it all in. So there were just over 700 tables. Now, therefore, I had to go and create 700 history rules. Yeah. 2 to 3 minutes a rule. That was a couple of days of my life that I did not enjoy. So we went back and we said, okay, how do we automate that? You know, how do we drop in 700 tables and get those history tables, those tiles, created automatically without us? And so we went and automated that. And so now that's what happens. Right? So, yeah, every time we do something and it takes us time or we have to make some really complex decisions, we just go back and think about it. How do we refactor that? How do we make that simply magical?
But that is hard. Yeah. Every time we do something as engineers in data, there is a hard problem to be solved, and we can't underestimate that.
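The refactoring Shane describes, replacing 700 hand-written landing-to-history rules with automatic generation, could look something like the loop below. This is only a sketch under assumed names and a naive key-inference heuristic; the real product generates these rules in its config database rather than in user code.

```python
from dataclasses import dataclass

@dataclass
class HistoryRule:
    """A generated landing-to-history load rule (hypothetical structure)."""
    source_table: str
    target_table: str
    key_columns: list

def infer_key_columns(table_name: str, schema: dict) -> list:
    """Guess the business key; a real implementation would read constraints
    or profiling results rather than relying on naming conventions."""
    columns = schema[table_name]
    return [col for col in columns if col.endswith("_id")] or [columns[0]]

def generate_history_rules(source_schema: dict) -> list:
    """Create one history rule per landed table instead of writing them by hand."""
    return [
        HistoryRule(
            source_table=table,
            target_table=f"history_{table}",
            key_columns=infer_key_columns(table, source_schema),
        )
        for table in source_schema
    ]

# Usage: a 700-table source system becomes 700 generated rules in one pass.
schema = {
    "orders": ["order_id", "customer_id", "total"],
    "customers": ["customer_id", "name", "email"],
}
for rule in generate_history_rules(schema):
    print(rule)
```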
[00:17:38] Unknown:
In terms of the actual implementation of agile data, I'm curious if you can dig into some of the architectural elements and some of the technical aspects of how you've designed the platform to be able to slot into an organization's existing stack without having to reengineer everything around Agile Data?
[00:17:59] Unknown:
So the first thing we did was we had to pick a cloud platform. Right. And we had to make a decision if we're going to be multi cloud or single cloud. And we decided that as a software as a service product, which was our goal, our vision, we wanted to pick a single cloud platform. Right? And in theory, the customer shouldn't care. Actually, they do. It's amazing how often the customer cares. But in theory, we're just a software as a service platform. As consultants, we've worked with Microsoft and Azure for a long time. You know, we've done what I would call now the legacy data warehouse cloud platforms, you know, the Redshifts, the Microsoft PDWs.
And we knew we didn't wanna use those. Right? We knew that they were great at the time, but they had been cloud washed, and there was a whole lot of, you know, problems that we'd encountered. You know, vacuuming out your Redshift tables was always a nightmare. And at the time, this is 3 and a half years ago, you know, Snowflake really was the up and comer. Right? So we had planned to build our entire platform, our entire product, on Snowflake. At the time, I was coaching a data and analytics team for an organization, and they were starting to build out their platform. And they did a shootout between BigQuery and 1 of the other vendors.
And I'd never seen BigQuery before. And so I was lucky enough to sit through that whole process of evaluation. And at the end of it, I said to Nigel, look, you know, I know we're going to go with Snowflake, but we probably need to give this BigQuery thing a bit of a bash because it looks like Snowflake, sounds like Snowflake, but there's something weird about it. It just seems a bit different. So we did a proof of capability around it, and we found that actually there were some things in there that we really liked. And so we started building out on BigQuery. And we got a whole lot of unintended consequences out of that decision that saved us a lot of engineering effort. And what it was, was that by going into BigQuery, we went into the Google Cloud infrastructure, into their ecosystem.
And there are a bunch of services that are available in there that we keep using. We keep adding new things we need to our platform in the back end that we leverage Google Cloud services for. And the majority of their services are serverless, which means the way we pay for it is amazing, and it's pre integrated. Yeah. So when we pick up a new service, it just tends to work with all the other Google services. And the last thing is they're highly engineered. You know, the amount of engineering they put into their products is amazing. I mean, their marketing sucks, right? As a partner of theirs, you know, in my experience running a consulting company, the other cloud vendors and the other product companies are so much better to work with. But in terms of Google Cloud, the engineering is amazing.
The downside is you have to be an engineer to use it. Right? Yeah. You have to be a Nigel, not a Shane, to configure a platform on Google. But, yeah, that's what we do is we build the easy bit on top. And so from my point of view, that decision around using Google Cloud was an unintended consequence where we got a massive amount of benefit out of it. If we then think about our platform, we made some really big choices upfront. So serverless only. We would only ever use serverless capabilities unless we had no choice. So no containers, no Kubernetes. And that's had a massive amount of benefit for us. API everything.
So we have this idea of a config database that holds all the core logic for the data. That's fronted by a set of APIs, and our app actually sits on top of the API. Those APIs are secure but open. So, you know, we're really intrigued about whether customers will actually start using the API and not the app at some stage. And that's had some massive benefits for us as well, by decoupling the data from the config, the config from the API, and the API from the app. Right? That's kind of standard software engineering for everybody else. But, you know, Nigel and I aren't software engineers. It's not our background. Right? We came from the data space. Then the next thing for us was volume. That trade off between over engineering everything to take a volume of users or a volume of data versus getting something to work and then dealing with the volume problem when it hit us. And it comes around that idea of knowing where your data is, knowing that, you know, you're gonna have a problem. So, you know, for example, when we were bringing in Shopify data for 1 of our first customers, you know, the numbers were tiny.
When we started doing event data from a customer in a clean room use case, you know, we were getting somewhere between 40,000,000 and 3,500,000,000 row changes or row transformations a day. You know, we had to refactor some of our stuff to, a, keep the cost down, and, b, just make that thing bulletproof. But we knew we had to do it. We just waited until it was a problem. And then we went into hyper engineering mode to solve it. Right? We'd already designed the patterns in our heads. Right? We'd done what we call agile-tecture. So we'd done some pictures and some Miro boards about how we would solve it, as a guess. And then once we had that problem, we ripped into saying, okay, will that solve it? Do we need to iterate it? So that idea of incrementally building it out just in time, but still doing agile-tecture. Right? Still figuring out, we've got this problem coming up in 6 months. If we're successful, what would we do? And not backing ourselves into the corner where refactoring it was horrendous.
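The layering described in this answer, a config database holding the data logic, fronted by APIs, with the app as just another API client, is roughly the shape sketched below. The class names and methods are invented for illustration; AgileData's real APIs are not documented here.

```python
from dataclasses import dataclass, field

@dataclass
class ConfigStore:
    """Stands in for the config database that holds all transformation logic."""
    rules: dict = field(default_factory=dict)

class ConfigAPI:
    """Thin API layer over the config store; the app never touches the store directly."""
    def __init__(self, store: ConfigStore):
        self._store = store

    def put_rule(self, rule_id: str, definition: dict) -> None:
        self._store.rules[rule_id] = definition

    def get_rule(self, rule_id: str) -> dict:
        return self._store.rules[rule_id]

# The web app, or a customer's own script, is just another client of the API,
# which is what makes "secure but open" APIs possible in this kind of design.
api = ConfigAPI(ConfigStore())
api.put_rule("active_customer", {"given": "history_customer", "then": "flag active"})
print(api.get_rule("active_customer"))
```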
[00:23:08] Unknown:
As far as the overall integration path for using AgileData, what are some of the technology stack that an organization needs to have in place for them to be able to take best advantage of what AgileData provides? And what are the options for being able to extend the set of integrations that AgileData will be able to work with?
[00:23:34] Unknown:
Yeah. So if we think about it as left and right, right, left being data collection and right being data consumption. So the first thing is we collect the data and we store it. So we're not a virtualized warehouse. So that's the first thing for our customers. If they're uncomfortable with us holding the data securely for them, then we're not the right solution. And then we looked on the left around data collection, and, you know, I thought that was a solved problem when I started this. It's a solved problem. Right? We don't need to deal with it. And boy, was I wrong. So we have built out collection patterns, we call them, as we get new customers that have new problems. So, you know, there are a set of patterns around software as a service apps. If you've got Shopify, Xero, QuickBooks, Salesforce, something that has an API, there's a bunch of patterns for going and grabbing that API data and bringing it back and collecting it into our history layer. The next 1 we had was a customer that wanted a file drop.
Right? They had a bunch of ad hoc type data, but it was repeatable, if that makes sense. So there was no system behind it, but we needed to get that data in. So we built a file drop capability, right, where they can go and upload the data, CSV or Excel or JSON, you know, manually, just a file upload, and then we take care of it from there. The next 1 was we had a customer that wouldn't let us go in and actually touch their system. So they wanted to push the data to us, and it was event data. Right? So they said, okay. We want, effectively, you know, a demilitarized zone, and we're gonna connect to you, and we're gonna push the data to you when we feel like it. So we had to build that pattern out. How do we actually have automated file drops? How do we do it based on event data? How do we trigger the fact that they turned up? How do we deal with the fact that they're always gonna give us data they've already given us? Because no matter how much it's automated, eventually you get that file again, or you get a file with overlapping data where, you know, half the data we've seen before and half we haven't. How do we engineer for that?
Then we had 1, because there's always 1, where it was on premise. Yeah. So how do we actually go into an on premise database and pick that data up and bring it back? And then we had 1 which was a clean room. How do we actually have, in this case, 38 different companies, each 1 of them using a different collection mechanism to give us the data, and then make sure all of that comes in on a consistent basis every day so we can mesh it up and provide that single view? So again, I thought data collection was a solved problem and I was wrong. So that's the left. Right? And then the right hand side is data consumption. So we don't do that last mile. We looked at it and we said, if you look at that last mile space, you know, visualization, dashboards, natural language queries, analytics, you know, machine learning, table stakes to play in that would be 3 to 5 years of our lives.
And we're bootstrapping, not venture funded, right? So we're very focused on where we spend our time. So for us, 1 of the benefits BigQuery gave us was 9 out of 10 data consumption tools, those last mile tools, talk to BigQuery. So as long as you can talk to BigQuery, you can use the data that we make consumable for you. We have an API layer. Right? So in theory, the last mile tool can consume the data via API. But what intrigues me at the moment is I struggle to find a last mile tool that queries data based on APIs that isn't a data warehouse pattern, that isn't taking that data and then storing it again to make it consumable. And so I'm really intrigued to see whether that's gonna change, you know, whether the market's gonna move to have last mile tools that are truly API focused.
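One of the collection patterns described above is the automated file drop, where the same rows can legitimately arrive more than once. A minimal way to make that ingestion idempotent is to fingerprint each row and skip anything already seen, as in this sketch; the file layout, column handling, and fingerprinting scheme are assumptions for illustration, not the product's actual mechanism.

```python
import csv
import hashlib
from pathlib import Path

def row_fingerprint(row: dict) -> str:
    """Stable fingerprint of a row so re-delivered data can be recognised."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode()).hexdigest()

def ingest_file(path: Path, seen: set) -> list:
    """Load only rows we have not ingested before; overlapping files become harmless."""
    new_rows = []
    with path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            fp = row_fingerprint(row)
            if fp in seen:
                continue  # Already landed from an earlier (overlapping) file.
            seen.add(fp)
            new_rows.append(row)
    return new_rows

# Usage: dropping the same CSV twice loads its rows exactly once.
seen_fingerprints: set = set()
# rows = ingest_file(Path("dropzone/orders_2022-10-01.csv"), seen_fingerprints)
```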
[00:27:01] Unknown:
Yeah. It's definitely an interesting space, and there actually has been some growth there with things like the embedded analytics use case. Some of the companies that are operating there that come to mind are things like Cube.js, or I think they've been renamed to just Cube so that they're not tied to the kind of JavaScript aspect. I think it's maybe GoodData and Sisense, I know, are also focused in that embedded analytics case. And then there are projects like Tinybird that build on top of ClickHouse. You can use their data as an API, so you can embed it into some products. So definitely an interesting aspect of it as well. And to your point of building on top of BigQuery, I'm curious if you've had to construct any sort of custom caching layers for being able to maybe precompute or pre aggregate requests that users are making frequently so that you don't have to go back to BigQuery every time and pay either the latency costs or pay the extra query costs all over again because somebody happened to query it, you know, 2 times, 5 minutes apart.
[00:27:59] Unknown:
So that's a really interesting space right now. So we're waiting for the metric or semantic service, the LookML core out of Google, because we need that at some stage soon. But till now it hasn't been a problem. And so what we do, right? So effectively the data comes into our product, into the history layer. Right? The history layer looks like the source system, but it's time series, right? All changes over all time. And that's beautiful, right? We never lose that data. And then we have an event layer, which is modeled, right? It has the idea of concepts, details, and events, and that's in the middle. So it's a typical 3 tier architecture. And then our last layer is our consume layer. And we denormalize it. We actually have big wide tables. Now we can do star schemas, right? We can actually dimensionalize the data. We just never had to.
Now the reason we do those big wide tables is when you actually talk to an analyst, they just wanna query a table. Right? Yes. They can do joins, but they don't want to. Why should they have to? So effectively, we give them big wide tables. And the good thing about BigQuery is it just eats it. In terms of volume, you know, we've got some consumable tables here that are hitting, you know, 100,000,000 to 200,000,000 rows now. And BigQuery just eats it. You know, the latency of responses is seconds. You know, it's not sub second. There's a brilliant analyst out there in Twitter world called Mim who's doing lots of really cool research on his own time around latency and things like DuckDB and those cool things. And if anybody wants to see somebody doing some cool stuff and sharing, go follow them. But, you know, we're not dealing with that sub second response use case. So most people are willing to wait, you know, a second or 2 for that data to come back. And most last mile tools actually have a whole problem of getting the data back and visualizing it. There's a delay there anyway.
What we found is we actually introduced BI Engine into the architecture purely as a cost saving. So with BigQuery, there is this idea of cache hits. So if you query a table, the result of that query gets persisted for 24 hours and you don't pay for it again. But if you watch what a user does when they use a dashboard or report, they're filtering. Right? They're constantly hitting filters, and therefore those cache hits don't happen. So we've introduced BI Engine in the middle, and that now keeps our costs down because that's doing, effectively, an in-memory OLAP cube. It's rudimentary at the moment in terms of the way it deals with it, but it's good enough for us right now. Now as soon as they bring in the LookML core, that semantic layer, that metrics layer, we're all over that. We've done all the agile-tecture for it. We've done the UI design work for it. We just need that service right now because we want that extra layer of caching, right, to give us some benefits and bring the metric out of the last mile tool back into us. Right? So that's a core feature that we need to save some of the complexity.
But, yeah, right now, BigQuery just eats it. I mean, it really is kind of amazing how lazy you can be with some of the things that we used to worry about 10 years ago where our databases couldn't handle it or they were too expensive.
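Because the consume layer is exposed as plain BigQuery tables, any tool or script that can talk to BigQuery can read them. A hedged example using the official Python client is below; the project, dataset, and table names are made up for illustration.

```python
# Requires: pip install google-cloud-bigquery, and Google Cloud credentials configured.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project id

# The denormalised "big wide table" means the analyst queries one table, no joins.
query = """
    SELECT customer_name, order_date, order_total
    FROM `example-project.consume.customer_orders_wide`
    WHERE order_date >= '2022-01-01'
    LIMIT 100
"""

for row in client.query(query).result():
    print(row["customer_name"], row["order_total"])
```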
[00:31:01] Unknown:
That's 1 of my learnings for the last couple of years. Yeah. I brought up the question about kind of the caching and pre aggregates because, number of years ago, I think it was maybe 6 or 7 years ago at this point, I was involved in a project where I was trying to generate user facing analytics off of data that we were loading into BigQuery. And the multiple second response times were a nonstarter for that purpose. So I actually had to pre aggregate the data out of BigQuery into Postgres so that that could act as sort of like the OLAP layer. At the time, I wasn't yet experienced enough to really be able to think quite in those terms and design it effectively, but I got it to work.
[00:31:38] Unknown:
Yeah. And because we've been around in the market for a long time, we remember the MOLAP, ROLAP, HOLAP days. And then the horrible disconnect we had between our relational data warehouse and our OLAP cube, and making sure they were synced and refreshed. And the other thing is, as I said, we always do an agile-tecture about the things we worry about. So we knew we had to worry about performance. We knew we had to worry about cost. And so we have a whole lot of what we call levers. Right? We have a whole lot of patterns that we haven't implemented yet that we know we will need to at some stage. So things like materialized views. Yeah. At some stage, we know we're gonna need to bring those in and use them. Early in the Google kind of product life cycle, they're fairly basic compared to the materialized views that we used to have in the Oracle days or, you know, the Teradata equivalent.
So we particularly won't invest in that layer until we really, really need it, because we get the benefit from Google of all the engineering without us having to build it. The number of times we've found a feature from Google turn up just in time so that we didn't have to build it ourselves has been amazing. But there's been a couple of examples where there's some early stuff that came out of Google that we really wanted that didn't make it into their production. And we're like, we really want that feature, right? It would save us money or save us time or be awesome for our users. And then it just didn't make it. Right? And so we've got to be really careful what we invest in. Another area that we're keeping a really close eye on, and we're going to have to make a decision on at some stage, is DuckDB.
Right? This idea of taking a subset of that data out as a WASM into the browser and using that to provide immediate response to our users in our app, we think that is an area we'll probably invest in, right? Because that feedback of being able to see here's a piece of data. Here's my rule for transforming it. Here's the impact of that change without clicking a button and waiting for that query to run. We think that immediate feedback to an analyst in terms of designing the rules that transform data, we think that'll be magical as well. Right? But we just got to decide when we're gonna build that out or whether Google's gonna give us a service that does it for us.
[00:33:49] Unknown:
Prefect is the data flow automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100,000,000 business critical tasks a month. For more information on Prefect, go to dataengineeringpodcast.com/prefect today.
That's prefect. Another interesting aspect of what you were discussing is the question of data modeling for that end user analyst and also some of the ways that you think about the semantics and documentation of the attributes and the columns that are available in the modeled tables. And wondering if you can talk to how you think about that when you're building this as a service, and you just wanna be able to say to the analyst, pull in the data that you want, write the queries that you want, we'll worry about the rest.
[00:35:05] Unknown:
So the first thing is we think about documentation as something that should happen, but it should happen at the right time, and then it should be actionable. So 1 of the first sets of features we built out was a data catalog, because we believe a data catalog should be embedded in every product. Right? This idea of being able to see what the data looks like, what the profile is, have context in terms of notes against it, see the lineage and the rules. Right? Every data product in the world should have that bundled, and it should be table stakes. You still need an enterprise 1 if you've got multiple platforms and multiple products, right, and you wanna combine those together, but every product should just do this as table stakes.
Looking back, I probably over engineered it. It was a personal 1 of mine, it had massive value, but I probably added a whole lot of features that I don't think will ever get used. And so, again, with my product head on, yeah, it's gonna be really interesting to see whether I'm brave enough to remove them, and time will tell. Right? But what we know is documentation should be done at the time we're doing the work, because we never go back and redo it. So if I'm creating a rule and I need some context about what that rule is doing, then we should create documentation. And what we did was, as we built out our rules engine and the interface for it, we did a whole lot of prototypes. Right? I had an idea in my head of what we were gonna build, and it wasn't what we built.
What we actually ended up building was almost notebooks, almost Python or Jupyter notebooks, as in, you know, there's a line at the top that has the data coming in. We call that given. Right? So given this data, and then there's a bunch of 'and' statements, right? And I do this, and I do this. So this data is coming in. So let's say I've got a table that's got a party entity in it, right? A history table that's got employees, suppliers, and customers in the same table from the source system, because the software engineers found that's the most effective way to build out their system. So, in the rules engine screen, you effectively go in and say, okay, given that table, and the type equals customer, there's this filter, and maybe the customer has had a transaction in the last 30 days, and the customer is not deceased.
Then create a flag called active customer. Now that's a piece of detail. And so that idea of actually just adding rows like a notebook, we found, has been particularly successful. And then what we see is, okay, we're doing that, we need to put some notes against it. And why do we? Well, because what happens is I do a piece of work for a customer. I create a rule, and then I go do some work for another customer. And 6 months later, the first customer asks me to make a change. Right? So I go back into that rule and I look at it, and it's very simple to understand, right? Given this, and, and, and, then. So I know what I've done, but I don't know why. Why am I filtering out that record?
There's gotta be a reason that I filter that record out. Like, I wouldn't do it if I didn't need to because I'm lazy. Right? But what happens if I remove it? Yeah? And so what we found is putting notes at each line means I can do a really simple statement, Filter this 1 out because there was a problem with the source system 6 months ago. Therefore, I've got to remove those records or change this 1 because for some reason the type coding was out of sync for a little while and the easiest way to clean that up is to do it in line. And so by putting those little notes in, I get the documentation I need at the right time to take the action that I want.
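Shane's notebook-style rule, a given clause, a series of 'and' conditions, then an outcome, with each line carrying its own note, could be represented roughly as below. The structure and field names here are a guess at the shape for illustration, not AgileData's actual rule format.

```python
# A hypothetical representation of a "given / and / then" rule with inline notes.
active_customer_rule = {
    "given": {"table": "history_party", "note": "Party table mixes customers, suppliers, employees."},
    "and": [
        {"condition": "type = 'customer'", "note": "Only customer records from the shared party table."},
        {"condition": "last_transaction_date >= CURRENT_DATE - 30", "note": "Active means a transaction in the last 30 days."},
        {"condition": "deceased = FALSE", "note": "Exclude deceased customers per business rule."},
    ],
    "then": {"create": "active_customer_flag", "note": "Populates the customer detail tile."},
}

def rule_as_text(rule: dict) -> str:
    """Render the rule the way an analyst would read it, notes alongside each line."""
    lines = [f"GIVEN {rule['given']['table']}  -- {rule['given']['note']}"]
    lines += [f"  AND {c['condition']}  -- {c['note']}" for c in rule["and"]]
    lines.append(f"  THEN {rule['then']['create']}  -- {rule['then']['note']}")
    return "\n".join(lines)

print(rule_as_text(active_customer_rule))
```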
So what we found is those in place notes are really valuable. The data catalog, I thought, had high value. Time will tell. But, yeah, every time we find an action where documentation would have been useful, we then go and figure out how we add that documentation. So 1 that we've got to add back in, which is really interesting for me, was we version our rules by default. So it's effectively git behavior. So if you go in and change a rule, you have no choice. The previous version of that rule's stored for you. There's no git pushy, pulley kind of stuff. Right? It's just in the background. Every time you modify something, the previous version is stored, because it's good practice. What I don't have is a box to say why I made that change.
So we're gonna add that piece of documentation in, because I'll go back in 3 months and go, oh, I've got 2 versions of the rule. Why did I version that rule? Right? And so, again, it's not until you use it in anger for a while that you go, oh, that piece of documentation would be highly valuable. I'll type it in as, you know, effectively my git note. And so we have to go back and add that. And so for me, that's what documentation's about. How do we add it at the right time, so that it's actionable and is actually gonna be used? Otherwise,
[00:39:32] Unknown:
we're not gonna fill those boxes out, and they'll be blank. Yeah. And you talked a little bit about the kind of discovery aspect with the data catalog question, and I'm wondering what are some of the, I guess, benefits that you've seen of biasing towards the wide table in terms of the data discovery question and some of the pieces of feedback or challenges that your customers have run into as far as being able to understand what pieces of information are available to be able to factor into their analyses and then digging more into that semantic aspect of understanding, this column looks like it does what I want it to do. How do I understand that it really does?
[00:40:12] Unknown:
Yeah. So let's talk about the cool stuff, and then let's talk about the hard stuff. So 1 of the cool things that, again, was a surprise to us how often we use it now, is Google introduced in BigQuery the ability to search for a set of data. So you can go into a table, you can write a query and say, you know, here's a word. So let's say, for example, the 1 I use in our demo. Right? I've got a table of every car in New Zealand from our transport agency. So every time a car is registered, they publish the details of that car. And so, you know, there's 4 or 5,000,000 rows in there, 1 for every car. I can go in, and I'm a big fan of Minis, and in the Mini world, there's a race version called John Cooper Works, or JCW.
So with BigQuery, I can go in and say, search this table, this history table of cars, for the word JCW, and it will search every column and every row and bring back a subset where it has that set of letters. So we put that into our app, right, because it was a freebie. Okay. Cool. Go and search. What we didn't realize is how often we'd use that. So, you know, an example would be when we first work with a customer and we're building out a business rule, and that rule is reliant on a piece of data. Right? It's a filter or query that says, actually, you know, subset by this, or calculate based on this field.
And they use a term, which is the word in the app, but that's not the word that's in the data. Right? Under the covers, the engineers have given a completely different name to that field. And so we've got to find that piece of data to be able to search. And so, yeah, what we used to do was say, give us a screenshot, and then we'd go and look for it, and then we'd talk to the engineers and say, okay, you see that field there, what is it in the database? What we can do now is say, give us an example of that data. And then we go into the history table, we search for it, and it comes back with those rows, and we go, okay, we think it's that column. That's saved us so much time. Another example is validation. I've got an example where a customer goes, look, I'm looking for this order number, blah, blah, blah, and I can't find it. So we just go into the catalog, and we either start at the left or start at the right. Right? If you go into the consume table and search for it, it's not there. Damn. Okay. Let's go into history, search for it. Yes. Okay. It's a rule problem.
You know, the data's turned up, but my rules are excluding it. And then you go into the rule and find the rule and see why that one's been excluded. So that ability to search actual data, we've used it time and time again in ways that we didn't think about. So that's some cool stuff, right, that has value. 1 of the problems we still have is the last mile tools. It's still incredibly difficult to be in a last mile tool, a dashboard, a report, an analytical tool, and see the context of the data. Right? And we haven't solved that problem. Like, at the moment, you have to go back into the catalog. Right? Then you have to search for it. You know which consume table you're using. Right? But then you have to go search for it and say, okay, what's that field? And then look up the config. And then say, okay, what's the, like, logical thing behind it? And so that bundling of that process back and forward would be great. Now we're seeing a lot of the last mile tools start to do that with add ins to the browser. It's kind of interesting.
It's almost like the browser's taking care of the linkage between your metadata app, your catalog, and your BI tool, your last mile tool. So we're keeping an eye on that space to see if that's where we should play. Should we open ourselves up to that kind of pattern? If you talk to an analyst, right, it's what they want. I'm in this report. I can see a number. Just tell me how you gave it to me. Yeah. Don't make me go away somewhere else and find it with 600 other clicks.
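The "search every column and every row" capability Shane describes is BigQuery's full-text search support. A hedged sketch of how such a query might be issued from Python is below; the project and table names are invented, and the exact behaviour of the SEARCH function should be checked against the current BigQuery documentation before relying on it.

```python
# Illustration of searching every column of a table for a term, as in the JCW example.
# Assumes BigQuery's SEARCH function is available; verify against current docs.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project id

query = """
    SELECT *
    FROM `example-project.history.vehicle_registrations` AS t  -- hypothetical table
    WHERE SEARCH(t, 'JCW')  -- matches the term anywhere in the row
    LIMIT 50
"""

for row in client.query(query).result():
    print(dict(row))
```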
[00:43:38] Unknown:
But I don't think that's a solved problem yet. For people who are interested in being able to adopt the Agile Data Platform, I'm curious what the onboarding experience looks like and some of the ways that you have iterated on that process to make it easier for people to be able to adopt and adapt to Agile Data as a way of being able to build their analytical use cases?
[00:44:02] Unknown:
So we're going on a journey, right? We gave ourselves 7 years to be an overnight success. And so our first 3 and a half years have been focused around that buyer, solving a business person's problem to get access to their data. And being a technologist, right, and especially having worked for big software companies before where I was in what was called presales, which was system engineering or customer experience, I think now, where my whole job was to work with the salesperson to present the product to the customer. So they said, yes, that looks like it's gonna solve our problem. Can we buy it, please? I naturally wanna go into product demos.
And so that was 1 of the mistakes I made really early on: I was talking to somebody who had a business problem, and I wanted to show them the cool features we'd built. And they turned around to me and go, why should I care? Yeah. That's not what I'm buying. Right? What I'm buying is you solving my data problem. I'll give you some money. You'll get my data. You'll give me some information. I can make some business decisions. I don't care how you do it. Just make it go away. So that whole messaging around that has been really key to us, is that actually we bundle both our platform and our services as a fixed monthly fee to just make that problem go away. The second half of our 7 years, though, is focused on the analysts, right, around the software as a service. And that's what we're working on at the moment: how do we actually onboard them and get them to be able to come on and use the software as a service platform or product to do the work they need to do without having to be data experts? Now they have to be data savvy.
Yeah. They have to understand, you know, how to troubleshoot and what data means and what a concept for data is, you know, what a customer versus a product is. They have to understand that because all data is complex, but how do we onboard them where they can just do it without us? And that's what we're working on right now. That's a hard problem. Absolutely. You know, I look at some of the products I use every day, and I'm like, I don't have to go on a course. You know, I don't have to watch YouTube videos 9 times out of 10. I just go in and use it. So how do we do that for data management?
And, you know, we haven't solved that problem yet, but we're spending a hell of a lot of time trying.
[00:46:04] Unknown:
Absolutely. And so in your experience of building this platform and working with your customers and helping to make that challenge of analytical use cases tractable without having to have an entire data team to be able to support those end users, what are some of the most interesting or innovative or unexpected ways that you've seen the AgileData platform used?
[00:46:26] Unknown:
So for us right now, it's been the use cases that we've delivered. So I used to have a joke in my previous world that the 2 projects you never wanted to do were data migration and payroll. Because what would happen is, if you were successful, it just worked. There was no hurrah. It was like, yeah, the data all migrated and it all turned up. Oh, yeah, the payroll got turned on, everybody got paid. And so success was like, yeah. And anything else was worse: data didn't turn up, we lost it, you know, payroll didn't pay you right, and people really cared. 1 of our first projects was a data migration project, right? The customer was moving from their legacy on-prem system to a new cloudy thing. They had a vendor partner that was implementing their new cloudy software as a service.
Their vendor knew how hard it was to do data migration, not because they didn't understand the problem, but because they didn't understand the core bespoke system that every customer had. Right? And they had to do all that hard work. They knew that, you know, it was a money sink. And then also it distracted the implementation people, who just wanted to focus on new ways of working, right, and get the benefit out of the new system. So what we did was we became the middle person. So if you think about core business processes, who does what? Customer orders product, customer pays for order. Within an organization, that never changes unless they fundamentally change and pivot themselves as a business; their systems change, but those core processes don't.
So what we did was we took the bespoke data, we brought it in, and we mapped out their core business processes, you know, customer orders product, customer makes payment. We then said to the new vendor, okay, are those processes fundamentally changing? No. Lots of admin change, status change, workflow change, but those core processes exist. Cool. So here they are. How do you want the data? We'd like to consume it from an API. Cool. Would it be good if we gave you APIs that match your import, you know, your schema for importing migrated data? Yeah, because then we don't have to map it. Cool. We exposed that data via a set of APIs that matched the schema they had.
Now, this is really interesting: we had a contract in place with them which said whenever you find a problem, you'll tell us about it, and we'll iterate on either our rules for the core business processes or the APIs. Right? So you tell us that something's not right, we'll go away and fix it really quickly, you go and hit it again and tell us we got it right. Naturally, what people do is they want to do it themselves. So they found some problems, so they built some cleanup scripts on their side and didn't tell us. And then as we were reconciling, things weren't reconciling. We're like, nah, nah, we've reconciled from source to that consume API. Right? We know the data matches.
But when we go into the app, the last-mile software as a service, the data's skewed, right? We've lost records. And then we found out they were cleaning it up, and they were cleaning it up wrong. So we just said, right, we're going to go and solve that problem of why you needed to clean it. Push it back to us. You just hit the API. And by doing that, we saved a whole lot of problems around reconciliation and conversations, because we knew where it was breaking and we could go fix it. So, never gonna do a data migration, and yet 1 of the first things we did was exactly that. The other 1 is this idea of a data clean room. Right? This idea of being data Switzerland. So a customer had 38 different people that needed to give them data.
Some of those organizations had data teams, some didn't. So we needed to combine it and make it safe so that our customer could only see the subset of the information they were allowed to see. So some of the larger organizations would filter and clean the data before they sent it to us, so we only saw the data our customer was allowed to see. For some of the other providers, though, it was easier for them just to give us all the data and for us to apply the rules. So that's what we did. Right? They give us every event, then we cut it down, and then our customer can only see the stuff they're allowed to see. And then you go, okay, now what happens? Well, this is kind of an event-viewing use case. So, you know, some of the providers are using Google Tag Manager. Some are using Google Analytics. Some are using Adobe.
Then we had to go into the Insta, Facebook, Twitter events. Then we had to go into podcast events. Then we had to go and connect the TV events. And if you wanna see complexity, go and talk to people that actually have video on demand on TVs and see the problems they have. Because what I didn't realize was every TV provider, Samsung, Sony, uses a different SDK for tracking the app usage on their TVs. And they often use different SDKs for different models of TVs. If you think about the complexity of just capturing who viewed what when you're reliant on the device provider, not your own app, and you're not in control, there's a whole lot of complexity there. So, yeah, that data clean room, right, was gonna be simple, but, yeah, there was a massive amount of complexity every time we touched a new set of data, which comes back to my problem that data collection is not a solved problem. I thought it was. Right? I thought Fivetran and the like had gone and solved it for us. Yeah. In my experience, every time you get a new data source, there's a new problem that's gonna turn up and bite you in the bum. Absolutely.
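The clean-room pattern described here amounts to landing everything a provider sends and only ever exposing the rows a given customer is entitled to see. As a rough, hypothetical sketch only, with provider names, columns, and visibility rules invented for illustration rather than taken from AgileData's implementation:

```python
# Hypothetical sketch of clean-room style row filtering: providers deliver
# everything, and the customer only ever queries a view that has the agreed
# visibility rules applied. All names and rules here are invented.
import pandas as pd

# Raw events as delivered by providers that send us all of their data.
raw_events = pd.DataFrame({
    "provider": ["provider_a", "provider_a", "provider_b", "provider_b"],
    "campaign_id": ["c1", "c2", "c1", "c3"],
    "event_type": ["view", "view", "click", "view"],
    "user_region": ["NZ", "AU", "NZ", "US"],
})

# Visibility rules agreed per customer; in practice this lives in config.
customer_rules = {
    "allowed_campaigns": {"c1", "c2"},
    "allowed_regions": {"NZ", "AU"},
}

def apply_clean_room_rules(events: pd.DataFrame, rules: dict) -> pd.DataFrame:
    """Return only the rows the customer is permitted to see."""
    mask = (
        events["campaign_id"].isin(rules["allowed_campaigns"])
        & events["user_region"].isin(rules["allowed_regions"])
    )
    return events[mask].reset_index(drop=True)

customer_view = apply_clean_room_rules(raw_events, customer_rules)
print(customer_view)  # the US event on campaign c3 is filtered out
```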
[00:51:24] Unknown:
And in your experience of bootstrapping this platform and building it out and exploring this overall space of building a data analytics back end as a service, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:39] Unknown:
That often we do things as humans without thinking about it, and it looks simple, and when we try and codify it into a platform, it becomes hard. Yeah. So in the previous podcast, I talked about that idea of key identification. Yeah. Understanding, when a table or a piece of data comes in, where the unique key is, to say that's a concept. There's 1 we're working on right now which is really interesting. So it's around this idea again of removing clicks. So as I said, when I do something for a customer in our product and I have to click a few times or it takes me a while, we want to automate that. Where I have to think and make a conscious decision, we want the system to make a recommendation on how to deal with it. And so, as I said, we're a 3-tier architecture, right? Data comes into the historized layer. We then model it lightly, concepts, details, and events, and then we make it consumable.
And so when you create a concept and detail, you've got a concept of customer and some detail about them, we automatically create that consume table, that big denormalized table, on demand. Right? You don't have to do anything. It just brings back the detail and concept and denormalizes it out and makes it available for query. When you create an event, you know, customer orders product, we automatically go and pick up the customer, the product, the orders, all their details, and denormalize that out to a consumable event table. Right? We can see every order, every customer, and every product that was ordered at that time. And that's all automatic. Right? I don't have to do anything for that. But I still have to do the rules in the middle. Right? I still have to say, for this historical data, that's a customer, that's their name. And that feedback loop was a little bit time consuming, because I get some data coming in and I can profile it in the history layer. So within our catalog, you can profile it and see the shape of the data and the nulls in there. But I naturally wanted to go and just query it. Right? I wanted to go and do some draggy, droppy stuff sometimes to get a little bit more context around it, and we didn't wanna build that in the product.
So what we're just finishing off now is when you land a piece of data, you have a choice for auto rules. And what it will do is it'll go and grab that data. We'll have a guess at what the key is and ask you whether it is or not. And if it is, you go, yes. And then it will build out all the rules in the design stage and create the consumable table for you. Now, we're not a great fan of source specific data models. We think you should always model your business processes, not the way your source system works. And so the idea of conforming concepts, you know, single view of customer, single view of product across multiple systems we think is highly valuable.
But doing that big design upfront actually sometimes doesn't have value. Right? You want to do it later. So this idea of auto rules almost gives us a prototyping capability. We think you land the data, it still gets lightly modeled for you, right, it's lightly designed, you then make it consumable, then you're going to use it for a while, and then you're going to go back and refactor. And we think that prototype process, drop it in, use it, it's still lightly designed, there's still rules being created, right, so it's not ad hoc, and then you're gonna go and refactor your data design in the future. We think that life cycle may be 1 that has value. But as with all things agile, right, build it quickly, get it out, use it in anger, see what happens. If we stop using it, it didn't have any value, and then we should burn it down, which is hard.
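One simple way to picture the key-guessing step behind those auto rules is to profile a landed table, propose any column whose values are unique and non-null as a candidate key, and let the user confirm before downstream rules are generated. This is only a sketch under that assumption; the column names are invented and it is not AgileData's actual logic:

```python
# Hypothetical sketch of "auto rules" key detection: guess which columns
# could uniquely identify a row in freshly landed data, then ask the user
# to confirm before building rules on top of that choice.
import pandas as pd

def guess_candidate_keys(df: pd.DataFrame) -> list[str]:
    """Return columns whose values are unique and non-null across all rows."""
    candidates = []
    for col in df.columns:
        series = df[col]
        if series.notna().all() and series.nunique() == len(df):
            candidates.append(col)
    return candidates

landed = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "customer_name": ["Ana", "Ben", "Ana"],      # duplicates, so not a key
    "signup_date": ["2022-01-01", "2022-01-05", "2022-02-10"],
})

# Both customer_id and signup_date happen to be unique here, which is
# exactly why the system asks the user to confirm the real business key.
print(guess_candidate_keys(landed))
```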
[00:54:53] Unknown:
Keying off of that "doing agile" statement, 1 of the hallmarks of agile is that you do things quickly and you iterate rapidly so that you can identify the mistakes early, rather than waiting until they become more expensive. And I'm curious, what are some of the most interesting or informative mistakes that you've made in the process of building Agile Data, and some of the useful lessons that you've learned in the process?
[00:55:20] Unknown:
So there have been some that we planned, where we knew we had some technical debt. So last time I talked about the idea of, you know, our config being in BigQuery. We knew that we had to iterate that. We did the NoSQL move to Datastore because we thought we wanted to be cool. And we ended up throwing all that work away and refactoring it into Spanner, which is now, you know, a foundational piece for us, and we won't change that unless we really have to. So that was kind of an investment mistake where we knew we had to make a change and we made a wrong bet. We lost that work, and that was okay. Then there are mistakes in the product, things we've built that I don't think will ever get used. So, you know, when we built out the data catalog, I really wanted a review thumbs-up, thumbs-down.
Alright? I thought it would be valuable. We profiled the data where those statistics were available. We have context that you can add against the data about the history of, you know, where the data came from, why I should care, any gotchas. But that review process of, you know, the Yelp voting, yes, it's good and no, it's not. We built that out. Didn't take us long. I'm not sure it's ever gonna get used. I mean, we monitor it. Right? So we'll see; if it never gets used, we'll probably take it off the screen. The next 1 is we built out a whole lot of what we call trust rules.
So they are data quality rules or data test rules. So when you're building out your change rules, you know, make this into a concept or detail, a screen pops up and says, here's some rules you can apply. You know, is it not null? Is it unique? Is it a phone number? Those kinds of simple data quality tests. And that information is all displayed in the catalog. Yeah. So you can go into a tile and you can see all the rules that ran, the trust rules, and whether they passed or whether they failed. So that's been successful from that point of view, but it's not actionable.
Right? We haven't built the bit we actually needed, which is what happens when it fails. So that's what we've gotta do now. Right? Whether we now have a refactor of the trust rules or not, we don't know, but we have to solve that next problem. Right? Okay, something failed. Do I really wanna have to go and check it? Do I wanna be notified of it? If I'm notified of it and I get 600 failures, right, for different things, is there a classification of these ones are important and these ones aren't? Should I be able to mute them? Yeah. What's the actual action I wanna take? And so that's 1 of the lessons we've learned, and it's a product lesson as much as an agile or a data lesson: when this feature turns up, what action do we expect an analyst to take off the back of that feature? What's the value in it? And it took me a while to kind of flip the model to think about that first.
Yeah. But having said that, there's a bunch of vanity features that I really want, and it's really hard to beat yourself up and say, no, we're not going to build them next because we probably don't have value for them. Or what's the least we can spend on building it and how do we prove it had value or it didn't? And if it doesn't have value, you know, that is a hard human process I've found. Because I don't come from a product background traditionally. Right? So it's all been a good learning for me.
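The trust rules described above, plus the missing "what happens when it fails" piece, could be pictured roughly as rules that each carry a severity, where a failure drives a different action depending on that severity. A hypothetical sketch only; the rule names, severities, and data are invented rather than taken from the product:

```python
# Hypothetical sketch of "trust rules": simple data tests attached to a tile,
# each with a severity so that a failure drives an action (alert vs. log)
# instead of only being displayed in a catalog.
import pandas as pd

tile = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "phone": ["021-555-001", None, "not-a-number", "021-555-004"],
})

trust_rules = [
    {"name": "customer_id is unique", "severity": "critical",
     "check": lambda df: df["customer_id"].is_unique},
    {"name": "phone is not null", "severity": "warning",
     "check": lambda df: df["phone"].notna().all()},
    {"name": "phone matches pattern", "severity": "warning",
     "check": lambda df: df["phone"].dropna().str.match(r"^\d{3}-\d{3}-\d{3}$").all()},
]

for rule in trust_rules:
    passed = bool(rule["check"](tile))
    if passed:
        print(f"PASS  {rule['name']}")
    elif rule["severity"] == "critical":
        print(f"ALERT {rule['name']} failed - notify the analyst")
    else:
        print(f"LOG   {rule['name']} failed - visible in the catalog, no page-out")
```

The point of the severity field is exactly the classification and muting question raised above: not every failure deserves a notification, but some do.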
[00:58:09] Unknown:
For people who are interested in being able to reduce the overhead involved in being able to just ask questions about the data for their business, without having to invest in building out a whole team to support it, what are the cases where Agile Data is the wrong choice?
[00:59:39] Unknown:
So for us, it's real time. We're definitely not a platform for real-time reporting. Yeah, we can collect data in real time because, effectively, we're running on Google Cloud. You know, we can use the Pub/Sub queuing topic stuff, where people can actually stream us records in real time, and we can just land and store them in the history layer. We could, in theory, take our rules, which run sequentially at the moment. So if you think about it, I think about it as the London Underground. Data comes into the history layer. By default, it turns on what we call auto sync. So as soon as a new record comes into a history tile, we automatically trigger every dependent rule, and it uses a manifest process. It kind of dynamically builds the DAG every time rather than having the DAG stored. So we query the config and go, right, what's dependent on the history tile? Run those rules. Okay, what's dependent on the output of those rules? Run the next rules. And we daisy-chain through that manifest.
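That manifest process can be pictured as deriving the dependency graph from config at run time and walking it level by level, rather than storing a static DAG. A toy sketch under that reading; the config shape, tile names, and rule names are invented, not AgileData's real schema:

```python
# Hypothetical sketch of manifest-style execution: the dependency graph is
# built from config every time a tile changes and walked breadth-first,
# instead of being persisted as a static DAG.
from collections import defaultdict

# Each config entry maps a rule to the tile it reads and the tile it writes.
config = [
    {"rule": "model_customer_concept", "reads": "history.customers", "writes": "design.customer"},
    {"rule": "model_order_event",      "reads": "history.orders",    "writes": "design.customer_orders_product"},
    {"rule": "build_consume_customer", "reads": "design.customer",   "writes": "consume.customer"},
]

def run_from(changed_tile: str) -> None:
    """Daisy-chain every rule downstream of the tile that just received data."""
    downstream = defaultdict(list)
    for entry in config:
        downstream[entry["reads"]].append(entry)

    frontier = [changed_tile]
    while frontier:
        tile = frontier.pop(0)
        for entry in downstream[tile]:
            print(f"running {entry['rule']} -> {entry['writes']}")
            frontier.append(entry["writes"])  # whatever it wrote may trigger more rules

run_from("history.customers")
# running model_customer_concept -> design.customer
# running build_consume_customer -> consume.customer
```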
So, yeah, that runs pretty quick because, you know, Google just scales like snot, but it's still sequential. Right? It still waits. Now, in theory, we could pick up the entire config, encapsulate that end-to-end piece of code as a single piece of code that runs in memory, and every time a new record comes in in real time, effectively re-instantiate that code and give a real-time score or answer at the end. But we're not gonna invest in that. It's not a big use case for us. It's a high level of complexity for engineering to get that right. And we find very few use cases where a customer actually wants the number to change in real time at the end of that last mile.
Yeah. I look at a number and it's just changing. And so that's not a use case for us. Right? We could do it, but we're not going to. Yeah. That's probably the main 1. Everything else is we are really good at it, or we're gonna get good at it as long as it's a data problem. I suppose the other 1 is if you don't want the platform to store the data, if that's something that you can't let happen, then we're not a product for you because we do collect the data and store it in history. It's a foundational piece of our pattern. Yeah. We've got Google Omni, which is really interesting, which in theory says you can have your data in S3 or Azure Blob Storage, and we can go and hit that.
But we'd have to change a bunch of patterns to say that you hold that history layer. Yeah, which we can do. But what happens if you change it? What happens if you delete it? Right? We lose that immutable record, so we can't replay all our rules and give you the same answer anymore. And for us, that's an important feature. Right? Is that at any time, you can see what the number you reported was, and we think that's a core part of data for what we do.
[01:02:14] Unknown:
As you continue to build out and iterate on the product and the platform, what are some of the things you have planned for the near to medium term or any particular projects or problem areas that you're excited to dig into?
[01:02:27] Unknown:
So we kind of look at it 2 ways. Every time I touch something and it takes too long or it's complex, we're gonna go and make that puppy simple. And then there's a bunch of core capabilities or features that we know we haven't invested in yet. So, you know, if I look at the whole idea of a semantic metrics layer, we've done the designs for that, and we're waiting for the LookML-style service to come out so we can bind our API and our app with that service layer. Otherwise we'd have to build out the execution of those metrics ourselves. But we've gotta build that out. Alright? And so at the moment, when I build a metric for a customer, 9 times out of 10 I'm building it in that last-mile tool, and we wanna move that metric definition back into our product, almost, you know, that headless BI pattern, because we see that as high value.
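A headless-BI style metrics layer, as described here, roughly means the metric definition lives in config and every consumer asks the platform to evaluate it, rather than redefining it in each last-mile tool. The following is a hypothetical sketch only; the metric name, fields, and config shape are invented:

```python
# Hypothetical sketch of a config-defined metric: any last-mile tool asks the
# platform to evaluate "total_sales" so every consumer gets the same answer.
import pandas as pd

metric_definitions = {
    "total_sales": {
        "source": "consume.customer_orders_product",
        "measure": "order_value",
        "aggregation": "sum",
        "dimensions": ["region"],
    }
}

def evaluate_metric(name: str, data: pd.DataFrame) -> pd.DataFrame:
    """Evaluate a metric exactly as its config defines it."""
    spec = metric_definitions[name]
    return (
        data.groupby(spec["dimensions"])[spec["measure"]]
        .agg(spec["aggregation"])
        .reset_index(name=name)
    )

orders = pd.DataFrame({
    "region": ["North", "North", "South"],
    "order_value": [120.0, 80.0, 200.0],
})
print(evaluate_metric("total_sales", orders))  # one row per region
```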
We want the machine to do more work, so we wanna get more into writing algorithms under the covers that provide recommendations. So we have some simple ones in there at the moment. Things like data drift, you know, where we run algorithms to tell us that the data looks like it's skewed, and it's gonna alert us: hey, that looks weird, go look at it. Rather than waiting for the customer to say, hey, the data's kind of not what we thought it was. We want to build more of that. Right? So more of those models, but we see them as recommendations. Right? They're not doing the work. They're just the machine giving us a hint, to reduce the work that we do as humans. And the last 1 that I'd really like to do, but it was 1 of the ones that Google removed, was the idea of natural language Q and A. So we do research spikes every now and again. We call them mix spikies.
And I was really keen on this idea a couple of years ago of, well, we're not gonna build a last-mile tool in our product; 9 times out of 10 people just wanna ask a data question. You know, how many customers have I got? What products did a customer order? What was total sales in this region? And so that whole natural language thing: ask a question and get an answer. So we did a mixed spike here to see what it would take for us to build that out. And we estimated if we found 5 really, really smart engineers in that space and we gave them a year, we'd have a really basic but crap capability to ask a question and get an answer. And we're like, okay, we're not gonna invest in that. And then Google announced their Q and A service as a private beta. So we jumped on that 1. And what that was was the ability to ask a natural language question against BigQuery, instead of SQL, and get an answer. And we tested it out, and it was a thing of beauty. It was like, wow, that's actually better than a lot of the commercial products we've seen. And then we're like, why is it good? And it was like, well, I didn't realize that in Google Analytics, you can go and ask it a question, and it's a natural language question. In Google Sheets, you can go and ask a natural language question. In Google Slides, when it's recommending some style changes, that's a natural language question.
So Google had taken the whole dictionary capability out of Google Search and had been testing these dictionaries and these language algorithms in these other products for years, and then they embedded this in the Q and A service. So we're like, wow, that's cool. And in our product, we have a thing called Adi, a little chatbot that comes out from the side. And when you want to do something, it says, okay, you've done this, the next step we'd normally see is to do this next. Or, okay, you're gonna do this, you've got a choice. Right? For example, if you're gonna go delete a rule and there's a draft rule in play, do you just wanna delete the draft, or do you wanna delete the draft and all the previous versions, so effectively remove that rule completely so it's no longer in production? So Adi pops up and says, which of these 2 would you like to do?
So we're like, cool, really simple. We embed natural language into Adi. We have ask Adi: ask her a question, she'll give you an answer. Right? Because the config holds all the metadata. We understand what each field means. We understand how it's used. We understand the rules. So that was all cool. So we backlogged that until we had some time and made it a priority. Come back 6 months, 12 months later, I'm like, cool, let's get into building a cool feature, because we just need a bit of a break from the plumbing stuff. We go back to go and touch it, and it's gone.
And I was like, okay, reach out to the product team: where's that gone? Oh, you know, we've turned the beta off. Okay, so when's it gonna come back? Oh, it's not. You might see it in the BI tool, the Google Data Studio stuff, is where we think it'll turn up, but it's not gonna be a cool service we can use. And that was like, oh, it would have saved us years of engineering. I mean, we're not gonna engineer it ourselves. Right? But, ah, I really wanted that. I think it had high value. So ups and downs, you know, that's what happens when you are relying on other people's technology, often to do stuff so you don't have to. And that's okay. Yeah. Especially when that someone is Google, who has a long track record of releasing really cool things that people like, and then it's like,
[01:07:04] Unknown:
no. We changed our mind.
[01:07:07] Unknown:
That's the definition of agile product development, right? You go and invest in something, and then you build it out, and if people don't use it, then you kill it. That's hard. Right? I think for somebody who's using it, not enough people are, but, yeah, somebody's relying on it, and it's gotta go. So, yeah. I mean, they are the epitome of agile product development. But as consumers, we love to hate their behavior sometimes. Absolutely.
[01:07:31] Unknown:
Are there any other aspects of the work that you're doing on the Agile Data Platform or the overall space of building a back-end service for analytics that we didn't discuss yet that you'd like to cover before we close out the show? I've focused a lot on our product,
[01:07:46] Unknown:
yeah, and how we have back-end services and front-end apps that are just beautiful, to help with that way of working. But I've probably underplayed a lot around that way of working: that, actually, you still have to be data savvy. And I use the words data savvy now on purpose, because I read an article or listened to a podcast where somebody had this really good comment, which is when you tell somebody they're not literate, actually, you're being quite derogatory. Yeah. Everybody's literate; it's about the level of literacy. And so, you know, as engineers and as data modelers, you know, there's a big thing going around LinkedIn at the moment around data modelers and the fact that we've lost the art of modeling. And for those of us that believe modeling has value, we see the loss of it as a big loss, and we're quite negative about it. We need to use words that are polite, you know, and true. Right? So you don't have to be an expert in data. You just have to be savvy. Right? And so you need practices or ways of working that help you take the savviness that you've got and apply it in the data domain. So that idea of data modeling, right, the idea of saying, well, there's only 3 things you need to worry about. You have a concept of a thing: customer, order, product. You have some detail about it: you know, customer has a name, product has a SKU or a name or a type, order has a quantity and a value and a date. Then you glue them together with an event, which is where we see a relationship: we see a customer order a product, and we wanna record the fact that that happened because it's important to us. If you take that way of working and you combine it with a product, so they work together as an ensemble, that's where we think the value is. That's where we think we can enable the analyst to do the work, and engineer away
the plumbing work, right? And I'll go back to that: the boring old pipes that just move the water. Yeah. We still need smart people that are really good at problem solving to do that hardcore engineering analysis, problem solving, at the end of that pipe. Right? That's where the value is, and that's where those constrained resources, those people that don't have enough time, should be focused. We should just automate the plumbing. Right? It's what we do in our houses. Why don't we do that with our data?
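As a rough illustration of that concept / detail / event way of working, the sketch below builds a tiny "customer orders product" ensemble and derives the denormalized consume table from it. The table and column names are invented for illustration; this is not AgileData's implementation:

```python
# Hypothetical sketch of the concept / detail / event model: concepts are the
# core things (identified by a business key), details describe them, events
# record that a relationship happened, and a consume table is derived from them.
import pandas as pd

customer = pd.DataFrame({          # concept: customer, plus its detail
    "customer_id": [1, 2],
    "customer_name": ["Ana", "Ben"],
})
product = pd.DataFrame({           # concept: product, plus its detail
    "product_id": [10, 11],
    "product_name": ["Kettle", "Toaster"],
})
order_event = pd.DataFrame({       # event: customer orders product
    "customer_id": [1, 2],
    "product_id": [11, 10],
    "order_date": ["2022-06-01", "2022-06-03"],
    "order_value": [49.0, 35.0],
})

# The consumable event table: every order with its customer and product
# detail denormalized onto it, ready for an analyst to query directly.
consume_customer_orders_product = (
    order_event
    .merge(customer, on="customer_id")
    .merge(product, on="product_id")
)
print(consume_customer_orders_product)
```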
[01:09:47] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[01:10:03] Unknown:
Same answer as the last podcast. Right? I think there's too much complexity in our platforms and the way we work, and that's what we're hyper focused on fixing. I still think there is a lack of machines doing the work for us where they can. The example I used last time was finding the unique key on the table. Actually, somebody on LinkedIn replied back and said, you know, the data quality tools of 20 years ago did it. They found the foreign key relationships. And, yeah, that's true. But I don't see it in the modern world again, so we kinda lost that art. So for me, it's when the machines get smart enough to do the work for us and we don't notice. We just take the recommendation and go, yeah. That made sense. And that's where the data world needs to get to. So, hopefully, that's the next generation of data platforms, ones that, you know, make our toast for us, pour out coffee for us. Right?
[01:10:55] Unknown:
That kind of thing. Yeah. And do it at the right time so it's not cold by the time you get to it.
[01:11:00] Unknown:
Yeah. You know, just think about the Nespresso, right? We still have to put the water in. We still have to put the pod in, but that's it. Yeah? Yep. We still pick the size of coffee we want, the strength. We still have a lot of choice. But the core plumbing, that's taken care of. Oh, it's not really, is it? Because I still have to fill up the water thing. But, anyway, you know what I mean.
[01:11:17] Unknown:
Yeah. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on the Agile Data platform and product. It's definitely a very interesting approach that you've taken and interesting service that you're providing. So I appreciate all the time and energy that you and your cofounder are putting into making analytics a set and forget operation as much as possible. So thank you again for that, and I hope you enjoy the rest of your day. Yeah. Thanks for having me on. It's been great as always.
[01:11:49] Unknown:
Thank you for listening. Don't forget to check out our other shows, podcast.init, which covers the Python language, its community, and the innovative ways it is being used, and the machine learning podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts at data engineering podcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to AgileData with Shane Gibson
Building AgileData: Concept and Audience
Target Organizations and Use Cases
Interaction with Existing Data Teams
Design and Interface Considerations
Architectural Elements and Technical Aspects
Integration and Data Collection Challenges
Caching and Pre-Aggregation Strategies
Data Modeling and Documentation
Data Discovery and Semantic Understanding
Onboarding and Adoption of AgileData
Lessons Learned and Challenges
When AgileData is Not the Right Choice
Future Plans and Exciting Projects
Closing Thoughts and Contact Information