Summary
The predominant pattern for data integration in the cloud has become extract, load, and then transform, or ELT. Matillion was an early innovator of that approach, and in this episode CTO Ed Thompson explains how they have evolved the platform to keep pace with the rapidly changing ecosystem. He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own workflows to reduce the maintenance burden of data integration.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
- Your host is Tobias Macey and today I’m interviewing Ed Thompson about Matillion, a cloud-native data integration platform for accelerating your time to analytics
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Matillion is and the story behind it?
- What are the use cases and user personas that you are focused on supporting?
- How does that influence the focus and pace of your feature development and priorities?
- How is Matillion architected?
- How have the design and goals of the system changed since you started working on it?
- The ecosystems of both cloud technologies and data processing have been rapidly growing and evolving, with new patterns and paradigms being introduced. What are the elements of your product focus and messaging that you have had to update and what are the core principles that have stayed the same?
- What have been the most challenging integrations to build and support?
- What is a typical workflow for integrating Matillion into an organization and building a set of pipelines?
- What are some of the patterns that have been useful for managing incidental complexity as usage scales?
- What are the most interesting, innovative, or unexpected ways that you have seen Matillion used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Matillion?
- When is Matillion the wrong choice?
- What do you have planned for the future of Matillion?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Matillion
- IBM DB2
- Cognos
- Talend
- Redshift
- AWS Marketplace
- AWS Re:Invent
- Azure
- GCP == Google Cloud Platform
- Informatica
- SSIS == SQL Server Integration Services
- PCRE == Perl Compatible Regular Expressions
- Teradata
- Tomcat
- Collibra
- Alation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode
[00:01:46] Unknown:
today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Ed Thompson about Matillion, a cloud native data integration platform for accelerating your time to analytics. So, Ed, can you start by introducing yourself? Yeah. Sure. My name's Ed Thompson. I'm the CTO and cofounder of Matillion. And do you remember how you first got started working in the data space?
[00:02:15] Unknown:
Yeah. Sure. So we founded Matillion back in 2011. And my background prior to that was not really pure data. It was tangentially data. We'd been doing some work with some fairly heavyweight IBM technology at the time, IBM DB2 and Cognos on top of that, building data warehouses, but I was still very much at the start of the learning curve when we decided to found Matillion. The original idea behind Matillion was to do fairly simple BI projects, essentially. So our kind of business pitch was: we'll do business intelligence, we're gonna use cloud technology to deliver it, and we're gonna make it cost competitive, or affordable, to sort of small and medium enterprises in the UK.
And I think we spent probably a year building it and then probably 3 years selling it, and building up what was, in retrospect, a pretty small customer base of companies. But the business itself was one of those kind of things where it wasn't hugely successful, but it wasn't a complete failure either. So we kinda stoically carried on trying to make it work. And the key to making it work became the speed at which we could build data pipelines, essentially. Talend was our chosen ETL tool, probably because it had a fairly comprehensive open source edition, and we made that work for us. But the issue that we gradually crystallized our realization of was that Talend wasn't purely built for cloud environments at that time. So we set about trying to find a cloud data integration tool that was more suited to the data warehouse we were using, which at the time was Redshift. We were pretty early adopters of Redshift and AWS. We built the whole business on AWS, so it made sense.
And around 2014, we kind of thought, hey, there's not really an ETL tool out there which does the job, so we decided to build one. And we built it purely for internal use, and it was purely to drive down the time it took from onboarding a customer to having that first data warehouse stood up. What we didn't realize at the time — we were a pretty small business then, probably about 20 people — was what we had: a very small engineering team building the product and a small team of data engineers who were just building data warehouses.
So we ended up with this really tight feedback loop where, you know, the engineers would walk across and say, hey, here's some software, does it do what you need? And the data engineers would say, that works, that doesn't, fix that, change that. It never even occurred to us at the time, but we had this really tight product-market feedback loop going on, which served us really well later on. And what I didn't realize at the time is it's something that so many businesses strive to create after the fact, but we kinda accidentally, I guess, happened upon it.
Then in 2015, once we'd built this ETL tool, it was quite easy to realize, hey, it's useful for us, and it's working for us, so maybe some other people would find it useful. It happened to coincide with the time that AWS launched their marketplace. They were looking for vendors to go on the marketplace, and they'd approached a load of big traditional vendors with all sorts of software, who'd done a fairly mediocre job of just throwing their software up on this marketplace. And we decided that the marketplace would be the only way that we'd sell our software. And that meant that they helped us out a lot, because we were, like, yeah, we'll be exclusive, and we'll be like the poster child for it, and so on and so forth. They helped us out a lot with that. And part of the helping us out was they gave us a booth at re:Invent that we otherwise wouldn't have been able to afford.
So we went out there on a bit of a shoestring budget compared with what traditional conferences cost these days. And that's where we launched the product, and it was fairly clear that it was gonna be fairly successful. So we launched Matillion ETL for Redshift at the time, and then we started, you know, just gradually building it up and picking up customers from there. That's really how we got into — or how I got into — data integration. I was kinda figuring out how to build an ETL tool. Really, it's an ELT tool. That's the fundamental difference in Matillion's architecture. At the time, it was pretty much the only pure play ELT architecture tool. It's definitely interesting
[00:06:50] Unknown:
that you were such an early adopter of the cloud because, you know, in the 2011 time frame, the cloud itself was still pretty early, and there weren't really a lot of use cases for it beyond just here's some compute and maybe here's some storage. And now the cloud has grown up into this massive market with so many players. And, you know, if you go to the AWS console page, it's hard to even enumerate all of the different products that they've got on there now. It is a massive number of products. Yeah. I don't know anybody that's kinda on top of the whole
[00:07:20] Unknown:
feature set. I'm sure there is somebody, but I kinda consign myself just to the data ones. One of the big headaches we've had at Matillion is we started with AWS, and probably still have a little bit of a bias towards AWS because it's kind of in the DNA of the company. But we very quickly wanted to have products on Azure and GCP particularly, which meant, for me personally, I needed to be able to talk to customers competently on AWS, Azure, and GCP. Which means I'm carrying around this enormous matrix in my head of all of the equivalent service names, trying not to trip up when talking to an Azure customer. You start talking about VPCs rather than VNets, and then they're looking at you like, what are you on about? So you can very quickly get a bit confused with yourself if you're working across those 3. But, yeah, we do work across all 3 of the cloud platforms now. It doesn't help too when you have things like AKS, which is Azure Kubernetes Service, versus EKS for Elastic Kubernetes Service, which runs on Amazon. So No. No. Exactly. Exactly.
And then all the subtleties of the kind of differences between platforms. It gets pretty deep. I always think, particularly with Azure and AWS — I don't know whether it's because they both came out of Seattle — you start to see some really curious parallels between the 2 products. Like, oh, there's definitely been some inspiration going both ways in this software.
[00:08:46] Unknown:
Yeah. You've probably just got employees bouncing back and forth between the different companies. I can believe that. I can believe that. Definitely. Another interesting element of what you're describing of the system that you built is that you were very early to what has now become sort of the ELT pattern versus ETL, which it seems like maybe in the past 3 to 5 years has become kind of the dominant paradigm. Whereas prior to that, it was, you know, very much heavy on the transformation before you loaded into a heavily structured data warehouse. And I'm wondering sort of what the insight was early in the company that allowed you to hit that sweet spot and recognize that this was actually a more effective and useful pattern?
[00:09:30] Unknown:
The realization, I think, came from the original idea of Matillion: we needed to be able to build data warehouses that were tailored. So, essentially, we were selling packages of facts and dimensions. Right? So we'd basically go to someone and say, hey, for this much money, we'll set you up this many facts and this many dimensions — and, obviously, that led to quite a long explanation of what all that meant. We'll sell some facts, we'll sell some dimensions, we'll wire them all up. And, essentially, what we then had would be like a catalog of kind of semi prebuilt stuff, and then it would be a matter of tailoring, you know, that turnkey data warehouse for that particular customer.
And sometimes that worked really well. So you'd have, like, you know, a fairly simple business that you'd go to, and they'd want a fairly simple data warehouse that more or less matched the template. And then sometimes you'd go to a customer and, you know, you'd take a complete bath on it, because it would be a really complicated data warehouse with lots of complex interactions and really tricky source data from really tricky systems — or lots of systems, which was a very common one. We'd approach companies that'd say, hey, yeah, we've got 9 different ERPs, and we'd like to have a data warehouse that shows all these different ERPs from different vendors in one view. We took on some pretty mad projects in hindsight, but what we realized was that a lot of the ETL we were building was complex because there was a lot of it, not because any of the particular operations that we needed to do on the data were complex.
And the modern data warehouses, not only were they more than capable of doing the data transformations in pure SQL, it was much, much faster doing it that way if you take advantage of the parallelization that Snowflake or Redshift or just about any decent data warehouse has nowadays. So you just got much, much better performance if you kept the data where it was and you transformed it in situ. So that's what we did. And for the more complicated data operations, they could generally deal with it with a window function or with a stored procedure, and so on and so forth. But the data engineers really just needed a set of tools where, you know, they're coming from a Talend background, so they need something that lets them build data pipelines visually — hardly any code, low code or no code. And, you know, maybe they wanna drop down to a bit of Python or write a stored procedure. But for the most part, they're just doing lots and lots of simple SQL operations by building a low-code DAG or whatever. They then have something that's really easy for any other person to pick up at any time and maintain.
And then when you do ELT like that, the other really nice feature, that everyone who saw it just immediately loved in the product, was you can just query the data at any stage in the data pipeline, because it's just a stage in the SQL that's being built up. So you can say, you know, what does my source data look like? Okay, I'm gonna do these 3 simple operations. Now what does it look like? Okay, have I got as much data as I expected? Show me the basic distribution of the data. Is it what I expected? And you're going through that kind of cognitive, help-me-understand-what's-going-on process. Whereas all the other ETL tools at the time — Talend, Informatica, DataStage, SSIS, the Microsoft one — they all had the same sort of paradigm where you draw everything out as you intended it, and you only find out that you'd done it wrong when you hit go, compile, run, or you'd wait for a load of data to move, and then something would fail.
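To make that staged-ELT idea concrete, here is a minimal sketch — SQLite standing in for a cloud warehouse, every table name invented for the example — where each transformation stage is just a view over the previous one, so the data can be queried and sanity-checked at any point in the pipeline:

```python
import sqlite3

# SQLite stands in for a cloud warehouse purely for illustration. The point
# is that in ELT, every stage is a table or view built from the previous
# one, so each intermediate result can be inspected directly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "Load": land the source data unchanged.
    CREATE TABLE raw_orders (order_id INTEGER, amount REAL, country TEXT);
    INSERT INTO raw_orders VALUES (1, 120.0, 'US'), (2, -5.0, 'US'), (3, 80.0, 'DE');

    -- "Transform", stage 1: filter out bad rows, in SQL, in the warehouse.
    CREATE VIEW stg_valid_orders AS
        SELECT * FROM raw_orders WHERE amount > 0;

    -- "Transform", stage 2: aggregate.
    CREATE VIEW rpt_revenue_by_country AS
        SELECT country, SUM(amount) AS revenue
        FROM stg_valid_orders
        GROUP BY country;
""")

# Did I get as much data as I expected? Query any stage directly.
print(conn.execute("SELECT COUNT(*) FROM stg_valid_orders").fetchone())  # (2,)
print(conn.execute("SELECT * FROM rpt_revenue_by_country").fetchall())   # e.g. [('DE', 80.0), ('US', 120.0)]
```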
So the ELT worked really well. There's an interesting debate kinda going on internally at Matillion now, because we called the tool Matillion ETL, even though it was always an ELT tool. We called it that because at the time, no one was googling ELT. They were googling ETL. We wanted to be found, so we needed to call it that. And, actually, as you kind of alluded to in your question, that's tilting now. Everybody's talking about ELT. Everybody wants an ELT tool. Should we be called what we actually are, which is Matillion ELT, or something like that? So there's some debate going on in our product team. We'll see where that comes out. Yeah. I mean, really, it should just be, you know, square brackets,
[00:13:48] Unknown:
e t l, and then a plus sign at the end because, you know, you're gonna have each of those some arbitrary number of times in any pipeline. So there's always gonna be an extract stage at the beginning, but you're probably also gonna have multiple other extracts further down the road, and you're gonna do transforming and loading multiple times. So
[00:14:05] Unknown:
Yeah. Now the technical guys would love it if we expressed it like that, but I'm not sure the marketing team would love it quite so much. So we'll see who wins on that 1.
[00:14:15] Unknown:
Yeah. You have to explain PCRE to the marketers. Exactly. And so in terms of where you are now, I'm wondering if you can talk through the sort of main use cases or industry verticals or user personas that you're focused on supporting, and how that has influenced the recent direction and focus of your feature development and prioritization.
[00:14:38] Unknown:
Yeah. Sure. So I think we are quite closely aligned with our partners — AWS with Redshift, and in recent years there's been quite a lot of focus around Snowflake, and now Databricks. You know, we try and treat those partners equally. The other decision that we took fairly early on was that we were gonna build editions of the product that were specific to those platforms. So, yeah, Matillion ETL for Redshift has features in it that are Redshift specific. Matillion ETL for Snowflake has features that are Snowflake specific. So wherever a data warehouse has a USP, we can expose that as first class features in the product. Like, with Snowflake, that USP was their ability to separate storage from compute and scale them separately. So we added features in to allow you to control and manage that inside of Matillion ETL, giving you, like, a single pane of glass over those features. Same with Redshift, same with Databricks.
So what all that means is that, you know, very often, Matillion customers are also the kind of customers you'd see with Snowflake and Redshift and BigQuery and so forth. And as a result of that, I think we tend to have quite a strong bias to the enterprise, and we have a lot of large enterprise logos. I get asked not to name too many of them, but, you know, the likes of Nike and Siemens, PepsiCo — big corporations who traditionally would run fairly expensive data warehouse stacks, maybe on-prem data warehouse stacks with Teradata boxes and that sort of thing, Teradata and Informatica.
And then they want to go through a modernization process, and they go to the cloud. They very often go with vendors like Snowflake and tools like Matillion. So we find ourselves more biased into the enterprise space, although we also have a long list of what we've classed as commercial customers, who are sub enterprise but still fairly significant businesses. And very often, the data engineering teams in those companies come from, you know, low code tooling, like Informatica, like Talend. They want a relatively easy transition from that tooling to something more cloud native like Matillion. Up to now, that's been very much our sort of key target persona: data engineers in large corporations that are used to building data pipelines with that kind of toolset. What we're finding more and more is organizations that set themselves up so they have, like, a small data engineering team that tries to do everything, all the way from source data right through to delivery of reports, feeding ML models, whatever it might be.
Those organizations have very, very quickly got themselves in trouble, because they find that they're really constrained around those people. You know, there's a million requests coming from a million different people, and they're not able to operate and innovate with data effectively. And because Matillion has this sort of everything's-based-on-SQL approach, and everything is quite guided in how you configure it and set it up, we find it's quite accessible to the kind of modern, data savvy line of business user who just wants to get something done with data. So organizations that are a bit more savvy are going for a lakehouse architecture, where really what you have is your best data engineers focused on ingestion and preparation and cataloging of datasets, but they're not getting bogged down in answering business problems or figuring out a particular dataset or a particular data transformation that needs to be done to support one team or one particular manager.
But instead, you have, you know, this clear, cataloged, prepared lake, and then a whole bunch of different teams that are able to build their own data transformations, do their own analysis, build their own pipelines off that single source of truth. And that's what we're seeing, I think, more and more: different types of users, or multiple different types of users within an organization, building the data pipelines.
[00:18:58] Unknown:
In terms of the architecture of Matillion, you mentioned that it started very early on with the cloud. And I'm wondering if you could talk through what it looks like now and some of the ways that the advancement of cloud technologies and some of the surrounding tooling have influenced the way that you think about the overall design and implementation of your product and just some of the evolution that it's gone through from the early days to where you are now?
[00:19:22] Unknown:
Yeah. I think it's probably true to say that we're in a period of pretty rapid evolution right now at Matillion. Where we are actually is a little bit old fashioned in some ways. So we deliver Matillion ETL primarily as an AMI, an Amazon Machine Image, which customers run in their own AWS accounts. So we're not like a pure SaaS company for the ETL product. We do have SaaS products, like our MDL product, which is Matillion Data Loader — a simplified sort of data loading pipeline. But the core ETL product is delivered as an Amazon Machine Image, or as a VM image into Azure, or whatever the GCP equivalent is.
But, actually, what we found is a lot of enterprises really like that delivery model, because it takes a lot of the security questions that they like to ask off the table if you can say, you're gonna run it in your own cloud infrastructure. The evolutionary debate really focuses around kind of 2 axes. So on the one hand, almost all of our customers are constantly coming to us and saying, hey, we like the fact that you deliver your software as an AMI, but we hate the fact that we have to manage it, and we hate managing infrastructure, and we'd just like you to manage all the infrastructure for us. So it would be better for us if you were a software as a service.
Great. But then they have this competing other axis, which is, oh, and by the way, we don't want you guys to have any control over our data. We'd like our data to never leave the premises, or never leave our cloud, please. And so it's like, okay, how are we gonna square that circle? So I think the future architecture — and you see this architecture in lots of places in the industry now, particularly Databricks — is to separate the control plane and the data plane. And where we're headed, I think, is we'll see a SaaS control plane, which will do all of the coordination of the data pipeline.
And then an on premise data plane, which will be as lightweight as we can possibly make it, will do all the heavy lifting around moving data, controlling the data warehouse, and so on and so forth. So we're not there quite yet, but we're very close, and we're working to that end. And the nice thing is, because we're a relatively young company compared to our competitors, with a relatively young code base, we can take our customers on that journey into the SaaS without asking them to do a big complex migration, or rewrite all their data pipelines, or complicate stuff in ways that would really turn them off from a new product like that. That's kind of the big scoop, if you like, of what's going on. The product itself is pretty simple. It's written in Java. It's an Apache Tomcat app.
And because we've always had this ELT architecture, most of the heavy lifting on data is done by the underlying data warehouse. So very often, Matillion is just waiting for the data warehouse to do some data transformation. Where we do get a bit more involved is on the ingestion side, as most vendors do. We have a stack of data connectors for pulling data into the data warehouse. Some of those pull the data directly into the data warehouse, but where that's not feasible, we're kind of streaming data through the system. But we always take the philosophy: put the data in the data warehouse, put in the data that you need, don't take more than you need, and keep the data as close to its original shape and format as you can. And then once the data is in the data warehouse or in the data lake, that's where you wanna start doing the transformation. So we try not to do even fairly light transformation of the data while it's in flight, because that's where we start to worry about scale and things like that. Yeah. The whole
[00:23:10] Unknown:
in-cloud software as a service — I'm trying to think of what the terminology is that I've come across before — but that's definitely a sort of growing trend, particularly with the advent of Kubernetes and being able to say, we'll deploy a managed Kubernetes service into your system for you, and then we'll deploy this Helm chart or what have you, and then we'll wire that into our control plane to manage the software. It's definitely interesting how this underlying tooling has been allowing that to be more of an accepted and supported practice.
[00:23:38] Unknown:
I feel like, from most organizations' point of view, it's probably still quite early for that architecture, but I think they like it. I think they get it. I think if we can do a good job of making it clear what the communication is between the control plane and the data plane, people will get pretty comfortable with that. It'll avoid a lot of the complexity that you get once you start handling customers' data. Things start to get very serious very quickly and very complicated very quickly. So best avoid it if we can. What you can't avoid in that sort of architecture, or what you have to manage very carefully, are secrets, authorization, you know, what can connect to what, how things are controlled. But it's all coming out of the wash pretty nicely.
[00:24:27] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. DataFold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold.
[00:25:21] Unknown:
Given that you are working to support so many different cloud environments and so many different data warehouses, and starting to adopt the data lakehouse architectures, what are some of the most challenging aspects of being able to manage that sort of feature matrix and support all of those different engines and SQL dialects, and being able to do that in a way that doesn't drive you or your customers insane?
[00:25:45] Unknown:
I think it's a case of deciding that you're going to do it like that and then being smart about how you approach it. Because it is very advantageous to have the wide ranging support — particularly with Snowflake customers; you know, Snowflake is across all 3 clouds as well, so you kinda gotta follow them. And once you get into, like, the retail space particularly, you will hear people saying things like, oh, yeah, you can run it on anything, but it can't touch Amazon, and that kind of stuff. So we have to traverse all of that. Architecturally, I think, is where most of the care has to be taken.
So to give you a couple of examples: the way that Matillion is architected internally, we have an internal system that we've built which abstracts all of the SQL operations. So, essentially, you know, take a Matillion component — a simple example, something like a filter — and that design intent is passed to the generator engine, which generates how the filter gets implemented in SQL on Redshift; it looks slightly different on Snowflake, and slightly different again elsewhere, and so on and so forth. We have all of that very much architected and baked into the product. And we've always been quite quick when we wanted to go to another platform — it's a relatively easy lift to take the product and turn it on for a new data platform.
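As an illustration of that kind of dialect abstraction — the class and method names below are invented, not Matillion's actual internals — a component can capture design intent once, and a per-warehouse generator renders it into platform-specific SQL:

```python
from dataclasses import dataclass

# Illustrative only: the component holds design intent, and each generator
# knows how to express that intent in one warehouse's SQL dialect.
@dataclass
class FilterComponent:
    source_table: str
    condition: str          # e.g. "amount > 0"
    target_table: str

class RedshiftGenerator:
    def filter_sql(self, c: FilterComponent) -> str:
        # Redshift: CTAS is the idiomatic way to materialize a step.
        return (f"CREATE TABLE {c.target_table} AS "
                f"SELECT * FROM {c.source_table} WHERE {c.condition};")

class SnowflakeGenerator:
    def filter_sql(self, c: FilterComponent) -> str:
        # Snowflake: CREATE OR REPLACE makes re-runs idempotent.
        return (f"CREATE OR REPLACE TABLE {c.target_table} AS "
                f"SELECT * FROM {c.source_table} WHERE {c.condition};")

step = FilterComponent("raw_orders", "amount > 0", "stg_valid_orders")
for gen in (RedshiftGenerator(), SnowflakeGenerator()):
    print(gen.filter_sql(step))
```

The payoff of a pattern like this is that supporting a new platform mostly means adding a generator, not rewriting every component.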
And then a lot of the features and functions are particularly on the orchestration side of the product. So there's a whole orchestration mechanism in Matillion, which doesn't look a million miles from what you might do in Airflow. It's a little bit higher level than Airflow — everything's a bit more prebaked, because we are able to understand the more common operations that customers are trying to do, particularly in the cloud. And a lot of those common operations — the ones that make it possible to build a real world data pipeline, with all of the gnarly, awkward situations that you find yourself in moving data around in the real world — they tend to be common across platforms. And then the final thing is, where we have features that we wanna build for specific platforms, we always try and build those as individual components. And because they're quite self contained, you can either have them in the product or not have them in the product, and then you end up with a big matrix of which features go into which version of which product.
But it's not without its downsides. Right? You do have to manage all that, and we need to educate our salespeople, for example, on the differences in the product — this feature is in this version of the product, but not in this particular edition — and manage all of that. It makes for an interesting challenge. But when a customer comes along and they've got their shiny new Snowflake data warehouse, and they want a tool which is firmly designed to work exactly with that data warehouse and has all the shiny new features that they've been sold by their Snowflake rep as first class citizens in the tool, then we can deliver that. That's really important.
[00:28:43] Unknown:
And for people who want to adopt Matillion and integrate it into their systems, what is the process for actually getting it set up and integrated and starting to build out your first workflow?
[00:28:54] Unknown:
Yeah. All pretty simple, hopefully. So, essentially, there's like an onboarding hub, which will get you to the point where you can download an AMI into your cloud environment. And once you've stood that up, the next stage is to connect to your data warehouse, and then you're good to go. You can start building pipelines. A typical customer would probably start with, I've got some data in a SQL Server, so I'm gonna connect to that; I've got some data in Salesforce, I'm gonna connect to that. Pull some SQL Server data, pull some Salesforce data. Suddenly, you've got 2 tables, and you can start doing some basic transformations on there. A lot of the advantage that we have is just that sort of quite easy ramp-up period. It's a fairly all inclusive platform as well. So, you know, you don't need a separate scheduler. You don't need a separate orchestration engine. It's all built in. You don't need anything separate to integrate it with the cloud. If you've got the right permissions, you can see all your S3 buckets, so you're at your data lake. If you're on AWS, you know, you see your data lake in Athena or whatever. It's all just ready to go, and you can start building with the existing components.
So, yeah, the key thing is getting people to some kind of data pipeline that is of value to them. Show some value as quickly as you can. Once you've got there, then people kind of build all sorts of crazy stuff that's
[00:30:15] Unknown:
great to see. One of the other growing trends these days is having a centralized metadata catalog, and I'm wondering how you've approached that integration, or if you sort of defer that because of the fact that you're running primarily on the data warehouse and letting that do the work — just using those existing integrations with those data warehouses and those metadata stores for being able to manage the sort of lineage graph and the catalog elements?
[00:30:41] Unknown:
We've kinda had a couple of goes at that. So we started building some basic lineage features into the product. What we realized fairly early on — and it's always a good test in any business — is to make sure you're focused on the thing that you're best at, and let other people focus on the things that they're best at. So we partnered with a few data catalog companies. Probably the most prominent would be Collibra and Alation. Because we are doing everything in the data warehouse, that means essentially that any decent cataloging tool can see everything that we're doing anyway. So they can see the inbound data being landed, they can see the SQL that's being run against it to transform it, and then they can see whatever the output schema might be.
So what we were able to do to work with those guys better was quite simple: just provide an API which allowed them to embellish with a whole load of extra contextual information that Matillion knows, because the pipeline author knows it. So if you look at a database table — you look at that database table in, say, Collibra — you can then say, okay, that was a table that was populated by Matillion. And then you can say, okay, what was the intent? What calculations have been done on the data? Follow the lineage back to the various sources of data, and where did those come from? So Matillion knows where data came from, how it was loaded, which kind of transformations have been performed on it, how they're expressed in Matillion, and how all that was orchestrated.
So really, for us, it was about providing a whole load of additional context to extend what a traditional catalog tool — like Collibra, like Alation, and there's a whole load of others in the market — is able to understand about the data. But we don't necessarily need to provide them with a whole load of complex information about the intent of the transformation, because, essentially, their users can now go straight from the data to the intended transformation and see how that data was built.
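To give a flavor of that kind of contextual embellishment — the field names below are purely illustrative, not Matillion's or any catalog's actual API — the payload a catalog pulls over such an API could be as simple as:

```python
import json

# Made-up shape of the authoring context a pipeline tool could expose to a
# catalog, layered on top of what the catalog already sees in the warehouse.
lineage_context = {
    "table": "rpt_revenue_by_country",
    "populated_by": "matillion",
    "job": "orders_daily",
    "sources": ["salesforce.opportunity", "sqlserver.dbo.orders"],
    "transformations": [
        {"component": "filter", "intent": "drop refunds (amount <= 0)"},
        {"component": "aggregate", "intent": "sum order amounts by country"},
    ],
}
print(json.dumps(lineage_context, indent=2))
```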
[00:32:56] Unknown:
And another aspect of working with any tool, but particularly for something like Matillion where you're going to have a bunch of people interacting with it, lots of different pipelines and data flows, is how you manage the complexity of the tool and the transformations as usage scales?
[00:33:19] Unknown:
Absolutely. This is something that, again, because we were building the tool but also using the tool, you could kind of see a head of steam growing, even just from a very small pool of users: hey, we really need to be able to make things reusable. And I guess part of it comes from having an engineering team who are writing Java code and thinking about how to make everything reusable, how you can minimize the amount of code and make lots of reusable, useful functions. So why would you not do that in the ETL tool, kind of thing? So we very much built it with that in mind. Probably, you know, 2 or 3 years ago, we were really in the weeds of thinking about this stuff.
It started with a simple variable system that became a slightly more comprehensive variable system. Once you've got a really good variable system in your ETL processes, you can start to make parts of the ETL reusable. But the most important thing is actually making it possible to take a piece of ETL logic, which is usually a bit of orchestration and quite a lot of transformation, and turn that into a self contained thing in its own right. And that became a feature which — it's not the greatest name, we should probably think about it — at the time, we called shared jobs. So we have this shared jobs feature inside of Matillion. Once we got that in, the next obvious leap is, okay, customers can share jobs with each other in their organization, but how do we make it so that they can share them outside of the organization? So we have a thing that's been running for a couple of years now called Matillion Exchange, which is designed to do exactly that. So if you've got a really good piece of ETL logic that builds, like, the world's greatest date dimension, you can put that onto the Exchange, and lots of other customers can come along and take advantage of that. I don't quite wanna call it an open source community, but it's certainly a community of Matillioners who wanna share their ideas and share their stuff. Works pretty well.
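As a sketch of the shared-job idea — invented names, SQLite standing in for the warehouse — that "world's greatest date dimension" becomes a parameterized, self-contained function any team can drop into their own pipeline:

```python
import sqlite3
from datetime import date, timedelta

# A "shared job" in spirit: a parameterized, self-contained piece of ELT
# logic (here, a toy date dimension) that any team can reuse. All names
# are made up for the example.
def build_date_dimension(conn, start: date, end: date, table: str = "dim_date"):
    conn.execute(f"CREATE TABLE {table} (date_key TEXT, year INT, month INT, day_of_week INT)")
    rows, d = [], start
    while d <= end:
        rows.append((d.isoformat(), d.year, d.month, d.isoweekday()))
        d += timedelta(days=1)
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
build_date_dimension(conn, date(2021, 1, 1), date(2021, 12, 31))
print(conn.execute("SELECT COUNT(*) FROM dim_date").fetchone())  # (365,)
```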
[00:35:19] Unknown:
As you have been building Matillion and working with your customers and helping them with their onboarding and adoption, what are the most interesting or innovative or unexpected ways that you've seen it used?
[00:35:29] Unknown:
Matillion's quite permissive. We wanted to build, like, a low code tool, but we knew you gotta satisfy the entire data engineering team. Right? So you've got data engineers in a typical company who have come from, you know, a tool background, like a Talend. And then you have data engineers that wanna use dbt, and you've got data engineers who wanna write Python. So we try to satisfy all of those by allowing the tool to orchestrate all of those things relatively easily. So you've got dbt functionality in there. And if you look at, like, our internal telemetry, the Python component is incredibly popular. And if you look at what customers actually do with it, very often they're, like, 3 line Python scripts where they just manipulate a variable, and it's like, oh, do you know what? It's just easier to do in Python. Okay. So because the tool's quite permissive, you do see quite a lot of fairly exotic use cases out there.
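For flavor, the kind of three-line Python-component script Ed describes might look something like the snippet below; the JobContext object here is a made-up stand-in for whatever variable context a low-code tool hands to an embedded script, not Matillion's documented API:

```python
from datetime import date, timedelta

# Stand-in for the variable context an embedded Python step might receive;
# purely illustrative, not a real API.
class JobContext:
    def __init__(self, variables: dict):
        self.variables = dict(variables)

    def get(self, name: str) -> str:
        return self.variables[name]

    def update(self, name: str, value: str) -> None:
        self.variables[name] = value

ctx = JobContext({"load_date": "2021-11-01"})

# The "three-line script": read a job variable, tweak it, write it back.
d = date.fromisoformat(ctx.get("load_date"))
ctx.update("load_date", (d + timedelta(days=1)).isoformat())
print(ctx.variables)  # {'load_date': '2021-11-02'}
```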
Where it's been used in places that I never expected — I think probably the first one we saw was someone had taken it and used it in, like, a biotech life sciences company, and they were using it to crunch data that was the output of a process that was sequencing DNA. And we'd just never envisaged it. We were like, this is a data tool. It's for businesses that have got, you know, sales orders and invoices and things like that. And it was a completely different use case. And then another one that I came across a few years ago is a big engineering company that built massive gas turbines, and they had all these gas turbines out in the field, and they were collecting all of this detailed performance data — I think it was coming out at, like, 5 Hertz. So it's quite a lot of data, lots of data points, and they were just crunching this into a big data warehouse using Matillion. I was like, oh, yeah, never envisaged that. And then the final one, probably, is it's very easy to underestimate the sheer volume of data pipelines that a company will build over time.
So, a very large — I need to come up with the right choice of words to not make it too obvious — sandwich chain, should we say, in the US. I think they were like, oh, well, you know, we've got some performance issues with a couple of our pipelines, can you take a look? And then we got on the call with them: oh, show us all of the pipelines that you've got scheduled. And it was just this list. And we're like, oh, okay, we'll have to figure this one out. But the scale that an enterprise can work at, the sheer amount and throughput of stuff that they're creating, was quite a surprise. And I asked internally at Matillion when I saw this question, because I was like, oh, I bet there's some other good examples.
There are quite a few, but those are the ones that I can remember. A few of those I've kind of witnessed firsthand, but I'm sure there are quite a lot more, because people definitely like to bend the tool, bend the product, to their will, and everyone comes up with sort of exotic use cases from time to time.
[00:38:27] Unknown:
In terms of the sort of support factor of working across all these different cloud providers and warehouse engines, I'm curious if there are any that stand out as being either a breeze to work with, and they've been sort of, you know, easy to get started with and low maintenance. And if there are any that are particularly challenging or troublesome to be able to actually support over the long run.
[00:38:48] Unknown:
I'll try and be kinda nice about it. So working like we do definitely gives you quite a lot of insight, and sometimes we end up going relatively deep. I mean, it's clear that the move into the cloud has, I think, freed up a lot of innovation, particularly in companies that you wouldn't necessarily think of as the most forward thinking or the most innovative companies. It's given them a little bit of breathing space to innovate. I mentioned the bias before — I have a slight bias to AWS because that's where we started. That's where I have the most knowledge. I think it's probably fair to say, and I imagine that they would admit to this, that there was certainly a chunk of time on Azure where they were kind of playing catch up.
You could tell they were trying to reuse a lot of technology that they already had on the shelf, and they were sort of bolting it in, but it didn't necessarily end up being the most coherent platform or system as a result. But, you know, they've managed to iterate and steadily improve things over time. It's a lot better to work with than it was. And then with Google, the impression you always get, whenever I've worked with people at Google, is that they very much go their own way, and they like to reinvent from scratch. So you end up with something that is somewhat different to the other platforms and doesn't follow the same patterns very often. And that gives you some headaches, understanding it and getting around it. But ultimately, the use cases are the same. There is a kind of danger that data warehouses themselves start to become somewhat commoditized.
The actual core data warehouses, as I see them — at one time, there wasn't really feature parity. Now there more or less is feature parity, and so they're competing on performance instead. But that will quickly flush itself out a little bit, and there won't be massive performance differences, and then maybe they'll have to compete on price instead. I'm very interested to see how that market evolves. But I think what you see in the market, particularly with the likes of Snowflake and Databricks — because they're not part of a big platform, they're built on existing platforms — is that they're trying to expand their footprint now into being more all singing, all dancing data providers, doing much more than just lakes and warehouses, getting into ML in a big way, and really becoming almost data centric cloud providers in their own right. It'll be really interesting to see where that goes.
I don't have a massively strong opinion on how that's gonna end up, but I think that we'll be seeing data platforms in the future — really, you know, imagine AWS, but where data is at the center rather than compute being at the center. That'll be pretty interesting to see how all that washes out, but I'm not gonna make any predictions
[00:41:49] Unknown:
that are liable to be very wrong. Yeah. No. It's definitely an interesting view on that. And in terms of your experience of building Matillion and growing it to where it is today and working with your customers and the ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:42:05] Unknown:
Probably touched on a few of them. I think: don't underestimate the complexity or sophistication of what people can and will build if you give them a tool that's flexible enough to allow it. That would probably be one lesson. And then, you know, the biggest one — I think it probably applies to just about any software — is that successful software is always walking a tightrope between the dream of the software architect and what they would like to create, and the pragmatic reality of what the business needs tomorrow, next week, next month.
So I found myself always having to be really pragmatic and commercially minded, which isn't always a natural trait in technical people. When we first released Matillion ETL, in my heart of hearts, I was like, I don't think this product is good enough yet. It needs a lot more work. But we also needed to get some software out the door to allow us to continue to be a business. And as is so often the case, I was proved wrong, because the first customer to ever use our software was PricewaterhouseCoopers. The weird aspect was, at the time, Amazon Marketplace told you someone was using your software and which country they were in, but you didn't know any other information about them apart from that. So we knew someone was paying for our software, and we knew that they were in Australia, but we didn't know who it was. And it was only a month later, when they called us up or we called them or something, that we actually got to find out who this first person was that actually bought our software.
It turned out to be PricewaterhouseCoopers, and they were like, yeah, we just wanted to build a data warehouse, and it seemed like a simple tool to do it with, and it worked great. I was like, oh, wonderful. You gotta start somewhere. But that kind of pragmatism is, I think, always essential when you're building any software. But particularly when you're building tools, you kind of set out with a vision, and it's like, oh, it's only gonna be good enough when it can do all of these things. And then you have to check that against reality at some point and actually test it with the market, and that's the scary bit.
[00:44:22] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end to end data observability platform. Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Visit dataengineeringpodcast.com/montecarlo to learn more.
[00:45:15] Unknown:
And for people who are interested in being able to manage their data integration and transformation workflows and be able to scale that out, what are the cases where Matillion is the wrong choice?
[00:45:27] Unknown:
Yeah. Good question. So I guess it's fairly clear from everything I've said so far that we're a very cloud centric company. We work best with, and to an extent exclusively with, cloud data warehouses. So most of our customers want to build a data warehouse or a data lakehouse. They generally want to build, like, a traditional kind of Kimball star schema or Inmon style model. And very often, they have some downstream systems that they wanna feed with that data. So they wanna feed an ML system or, you know, do some reverse ETL back into a transactional system, update some fields in Salesforce, whatever it might be.
That's kind of the sweet spot, and where we major is in helping and making it easier for users to navigate that complex transformation piece that exists at the center of that. And then everything else is kind of the tooling that we provide around it. Where I think the pitfalls are, really, is when customers try and use our tool more as a sort of business process automation tool, or use it like it's a traditional ETL tool. You very often see customers do strange things where they kinda fight against the tool, because they'll move the data out of the data warehouse, do something to it, and then move it back in. You throw away some of the benefits of the ELT model when you do that. So you wanna try and keep everything in the data warehouse.
That's where it'll run fastest and scale best. I think that's essentially where I've seen customers not use the tool in the optimal way: it's when they've been using it to orchestrate business processes as opposed to building data pipelines. There's never been, like, a dominant industry vertical at Matillion. It's always been across industries, lots and lots of different types of companies. But, yeah, if you have data, you can make it useful in a data warehouse.
[00:47:29] Unknown:
As you continue to build and grow and support the platform, what are some of the things you have planned for the near to medium term, or any particular projects or areas of interest that you're excited to dig into?
[00:47:39] Unknown:
I don't wanna preannounce anything by accident here. So, I talked a little bit earlier about moving to a software as a service, control plane, data plane model. Without announcing anything there, that gives you a little pointer or indication as to our direction. I think another big area that we've been working on is — there's always demand in the market for turnkey data pipelines. You know, very simple: I've got data here, and I want it in my data warehouse, and I just wanna do that with a wizard in 3 clicks. I don't wanna transform it yet. Just very, very simple. And that's what our Matillion Data Loader product does — it's designed to do that. And you can never have enough connectors there. You know, you can always be making that data movement more efficient, and you could be using better techniques, like better change data capture and things like that. So we put a lot of time and effort into making that get-data-into-the-data-warehouse story as slick and as simple as possible.
I think, looking more broadly into the future and how the market is going to evolve, one of Matillion's big advantages is being a one stop shop for the whole data landscape and data transformation landscape within an organization. And when we think about that, it becomes more about providing a catalog of services that allow you to load, to transform, to do reverse ETL, to catalog your data, to manage downstream data pipelines into ML engines, to do some kind of simple, light data mining or ML or AI style operations. If you take a look at that whole thing, it almost becomes like a data operating system, where you have a whole load of low level features and functions that you can call upon to build data centric and data driven apps.
There's a lot of kinda deep thinking, I guess, going on about that sort of stuff at the moment and how we kind of evolve into that story. We'll be talking quite a lot more about that in the future.
[00:49:53] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Well, the biggest problem
[00:50:08] Unknown:
that hasn't really been totally cracked, I think, is DataOps. There are quite a few people out there in the marketplace that want to apply DevOps techniques to DataOps, and most of them apply really well. So that's a great starting point. However, there are some really important differences with DataOps, which I haven't seen the perfect answer to yet. So, fundamentally, with DataOps, you are dealing with data — probably live data, with state. Unlike when you build a piece of software with a DevOps pipeline, where, you know, you expect the output piece of software to have passed all its tests and to work, and that's the thing that you ship or run or deploy or whatever.
With DataOps, you need to do all of that, but it starts to matter how you deploy and, critically, when you deploy, if you're deploying a new iteration of a data pipeline against a fast moving stream of live data. And I've not seen, like, a good solution yet that manages that final step really effectively. I'm really interested to explore that area a little bit further and use some of the more advanced features of the existing data warehouse platforms to allow you to, essentially, you know, build, test, and deploy into production a new version of a data warehouse, but critically be able to get out of that if something goes wrong without affecting, damaging, or generally corrupting that stream of data going along.
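One common pattern aimed at exactly this rollback problem — an illustration of the general idea, not Matillion's solution — is a blue/green cutover where consumers read through a view that is repointed only after the new version passes its tests; here's a minimal sketch with SQLite standing in for the warehouse and all names invented:

```python
import sqlite3

# Consumers only ever read through the `revenue` view, so "deploying" a new
# pipeline version means building and testing revenue_v2 alongside v1, then
# repointing the view. Rollback is just repointing it back at revenue_v1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revenue_v1 (country TEXT, revenue REAL);
    INSERT INTO revenue_v1 VALUES ('US', 120.0);
    CREATE VIEW revenue AS SELECT * FROM revenue_v1;
""")

# Build the candidate version without touching what consumers see.
conn.executescript("""
    CREATE TABLE revenue_v2 (country TEXT, revenue REAL);
    INSERT INTO revenue_v2 VALUES ('US', 120.0), ('DE', 80.0);
""")

# Stand-in for real data tests; only cut over if the new version passes.
if conn.execute("SELECT COUNT(*) FROM revenue_v2").fetchone()[0] > 0:
    # In a real warehouse you'd make this cutover transactional (Snowflake,
    # for example, has ALTER TABLE ... SWAP WITH for the table-level case).
    conn.executescript("DROP VIEW revenue; CREATE VIEW revenue AS SELECT * FROM revenue_v2;")

print(conn.execute("SELECT * FROM revenue").fetchall())  # [('US', 120.0), ('DE', 80.0)]
```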
[00:51:53] Unknown:
It's quite a gnarly problem. I think there's a lot of people kinda working on solutions, and I'd be really interested to see where we end up on that. Yeah, it's something we're thinking quite a lot about. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Matillion. It's definitely a very interesting platform, and it's great to see that you've been able to stick with it this long and continue to provide value to the organizations that you're supporting. So I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Wonderful. Thank you very much, Tobias.
[00:52:33] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Ed Thompson: Introduction and Background
Founding Matillion and Early Challenges
Adopting Cloud Technologies Early
Transition to ELT and Product Evolution
Current Use Cases and Industry Focus
Matillion's Architecture and Future Directions
Challenges of Supporting Multiple Cloud Platforms
Integration with Metadata Catalogs
Managing Complexity and Reusability
Innovative and Unexpected Use Cases
Challenges and Ease of Supporting Different Platforms
Lessons Learned and Pragmatism in Software Development
When Matillion is Not the Right Choice
Future Plans and Exciting Projects
Biggest Gaps in Data Management Tools
Closing Remarks