Summary
The predominant pattern for data integration in the cloud has become extract, load, and then transform, or ELT. Matillion was an early innovator of that approach, and in this episode CTO Ed Thompson explains how they have evolved the platform to keep pace with the rapidly changing ecosystem. He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own workflows to reduce the maintenance burden of data integration.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
- Your host is Tobias Macey and today I’m interviewing Ed Thompson about Matillion, a cloud-native data integration platform for accelerating your time to analytics
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Matillion is and the story behind it?
- What are the use cases and user personas that you are focused on supporting?
- How does that influence the focus and pace of your feature development and priorities?
- How is Matillion architected?
- How have the design and goals of the system changed since you started working on it?
- The ecosystems of both cloud technologies and data processing have been rapidly growing and evolving, with new patterns and paradigms being introduced. What are the elements of your product focus and messaging that you have had to update and what are the core principles that have stayed the same?
- What have been the most challenging integrations to build and support?
- What is a typical workflow for integrating Matillion into an organization and building a set of pipelines?
- What are some of the patterns that have been useful for managing incidental complexity as usage scales?
- What are the most interesting, innovative, or unexpected ways that you have seen Matillion used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Matillion?
- When is Matillion the wrong choice?
- What do you have planned for the future of Matillion?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Matillion
- IBM DB2
- Cognos
- Talend
- Redshift
- AWS Marketplace
- AWS Re:Invent
- Azure
- GCP == Google Cloud Platform
- Informatica
- SSIS == SQL Server Integration Services
- PCRE == Perl Compatible Regular Expressions
- Teradata
- Tomcat
- Collibra
- Alation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode
[00:01:46] Unknown:
today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Ed Thompson about Matillion, a cloud native data integration platform for accelerating your time to analytics. So, Ed, can you start by introducing yourself? Yeah. Sure. My name's Ed Thompson. I'm the CTO and cofounder of Matillion. And do you remember how you first got started working in the data space?
[00:02:15] Unknown:
Yeah. Sure. So we founded Matillion back in 2011. And my background prior to that was not really pure data. It was tangentially data. We'd been doing some work with some fairly heavyweight IBM technology at the time, IBM DB2 and Cognos on top of that, building data warehouses, but I was still very much at the start of the learning curve when we decided to found Matillion. The original idea behind Matillion was to do fairly simple BI projects, essentially. So our kind of business pitch was: we'll do business intelligence, we're gonna use cloud technology to deliver it, and we're gonna make it cost competitive, or affordable, to sort of small and medium enterprises in the UK.
And I think we spent probably a year building it and then probably 3 years selling it, and building up what was, in retrospect, a pretty small customer base of companies. But the business itself was one of those kind of things where it wasn't hugely successful, but it wasn't a complete failure either. So we kinda stoically carried on trying to make it work. And the key to making it work became the speed at which we could build data pipelines, essentially. Talend was our chosen ETL tool, probably because it had a fairly comprehensive open source edition, and we made that work for us. But the issue that we gradually crystallized our realization of was that Talend wasn't purely built for cloud environments at that time. So we set about trying to find a cloud data integration tool that was more suited to the data warehouse we were using, which at the time was Redshift. We were pretty early adopters of Redshift and AWS. We built the whole business on AWS, so it made sense.
And around 2014, we kind of thought, hey, there's not really an ETL tool out there which does the job, so we decided to build one. And we built it purely for internal use, and it was purely to drive down the time it took from onboarding a customer to having that first data warehouse stood up. What we didn't realize at the time — we were a pretty small business then, probably about 20 people — was what we had: a very small engineering team building the product and a small team of data engineers who were just building data warehouses.
So we ended up with this really tight feedback loop where, you know, the engineers would walk across and say, hey, here's some software, does it do what you need? And the data engineers would say, that works, that doesn't, fix that, change that. It never even occurred to us at the time, but we had this really tight product-market feedback loop going on, which served us really well later on. And what I didn't realize at the time is it's something that so many businesses strive to create after the fact, but we kinda accidentally, I guess, happened upon it.
Then in 2015, once we'd built this ETL tool, it was quite easy to realize, hey, it's useful for us, and it's working for us, so maybe some other people would find it useful. It happened to coincide with the time that AWS launched their marketplace. They were looking for vendors to go on the marketplace, and they'd approached a load of big traditional vendors with all sorts of software, who'd done a fairly mediocre job of just throwing their software up on this marketplace. And we decided that the marketplace would be the only way that we'd sell our software. And that meant that they helped us out a lot, because we were, like, yeah, we'll be exclusive, and we'll be like the poster child for it, and so on and so forth. They helped us out a lot with that. And part of the helping us out was they gave us a booth at re:Invent that we otherwise wouldn't have been able to afford.
So we went out there on a bit of a shoestring budget compared with what traditional conferences cost these days. And that's where we launched the product, and it was fairly clear that it was gonna be fairly successful. So we launched Matillion ETL for Redshift at the time, and then we started, you know, just gradually building it up and picking up customers from there. That's really how we got into — or how I got into — data integration. I was kinda figuring out how to build an ETL tool. Really, it's an ELT tool. That's the fundamental difference in Matillion's architecture. At the time, it was pretty much the only pure play ELT architecture tool. It's definitely interesting
[00:06:50] Unknown:
that you were such an early adopter of the cloud because, you know, in the 2011 time frame, the cloud itself was still pretty early, and there weren't really a lot of use cases for it beyond just here's some compute and maybe here's some storage. And now the cloud has grown up into this massive market with so many players. And, you know, if you go to the AWS console page, it's hard to even enumerate all of the different products that they've got on there now. It is a massive number of products. Yeah. I don't know anybody that's kinda on top of the whole
[00:07:20] Unknown:
feature set. I'm sure there is somebody, but I kinda consign myself just to the data ones. One of the big headaches we've had at Matillion is we started with AWS, and probably still have a little bit of a bias towards AWS because it's kind of in the DNA of the company. But we very quickly wanted to have products on Azure and GCP particularly, which meant, for me personally, I needed to be able to talk to customers competently on AWS, Azure, and GCP. Which means I'm carrying around this enormous matrix in my head of all of the equivalent service names, trying not to trip up when talking to an Azure customer. You start talking about VPCs rather than VNets, and then they're looking at you like, what are you on about? So you can very quickly get a bit confused with yourself if you're working across those 3. But, yeah, we do work across all 3 of the cloud platforms now. It doesn't help too when you have things like AKS, which is Azure Kubernetes Service, versus EKS for Elastic Kubernetes Service, which runs on Amazon. So No. No. Exactly. Exactly.
And then all the subtleties of the kind of differences between platforms. It gets pretty deep. I always think, particularly with Azure and AWS — I don't know whether it's because they both came out of Seattle — you start to see some really curious parallels between the 2 products. Like, oh, there's definitely been some inspiration going both ways in this software.
[00:08:46] Unknown:
Yeah. You've probably just got employees bouncing back and forth between the different companies. I can believe that. I can believe that. Definitely. Another interesting element of what you're describing of the system that you built is that you were very early to what has now become sort of the ELT pattern versus ETL, which it seems like maybe in the past 3 to 5 years has become kind of the dominant paradigm. Whereas prior to that, it was, you know, very much heavy on the transformation before you loaded into a heavily structured data warehouse. And I'm wondering sort of what the insight was early in the company that allowed you to hit that sweet spot and recognize that this was actually a more effective and useful pattern?
[00:09:30] Unknown:
The realization, I think, came from the original idea of Matillion: we needed to be able to build data warehouses that were tailored. So, essentially, we were selling packages of facts and dimensions. Right? So we'd basically go to someone and say, hey, for this much money, we'll set you up this many facts and this many dimensions — and, obviously, that led to quite a long explanation of what all that meant. We'll sell some facts, we'll sell some dimensions, we'll wire them all up. And, essentially, what we then had would be like a catalog of kind of semi prebuilt stuff, and then it would be a matter of tailoring, you know, that turnkey data warehouse for that particular customer.
And sometimes that worked really well. So you'd have, like, you know, a fairly simple business that you'd go to, and they'd want a fairly simple data warehouse that more or less matched the template. And then sometimes you'd go to a customer and, you know, you'd take a complete bath on it, because it would be a really complicated data warehouse with lots of complex interactions and really tricky source data from really tricky systems — or lots of systems, which was a very common one. We'd approach companies that'd say, hey, yeah, we've got 9 different ERPs, and we'd like to have a data warehouse that shows all these different ERPs from different vendors in one view. We took on some pretty mad projects in hindsight, but what we realized was that a lot of the ETL we were building was complex because there was a lot of it, not because any of the particular operations that we needed to do on the data were complex.
And the modern data warehouses, not only were they more than capable of doing the data transformations in pure SQL, it was much, much faster doing it that way if you take advantage of the parallelization that Snowflake or Redshift or just about any decent data warehouse has nowadays. So you just got much, much better performance if you kept the data where it was and you transformed it in situ. So that's what we did. And for the more complicated data operations, they could generally deal with it with a window function or with a stored procedure, and so on and so forth. But the data engineers really just needed a set of tools where, you know, they're coming from a Talend background, so they need something that lets them build data pipelines visually — hardly any code, low code or no code. And, you know, maybe they wanna drop down to a bit of Python or write a stored procedure. But for the most part, they're just doing lots and lots of simple SQL operations by building a low-code DAG or whatever. They then have something that's really easy for any other person to pick up at any time and maintain.
And then when you do ELT like that, the other really nice feature, that everyone who saw it just immediately loved in the product, was you can just query the data at any stage in the data pipeline, because it's just a stage in the SQL that's being built up. So you can say, you know, what does my source data look like? Okay, I'm gonna do these 3 simple operations. Now what does it look like? Okay, have I got as much data as I expected? Show me the basic distribution of the data. Is it what I expected? And you're going through that kind of cognitive, help-me-understand-what's-going-on process. Whereas all the other ETL tools at the time — Talend, Informatica, DataStage, SSIS, the Microsoft one — they all had the same sort of paradigm where you draw everything out as you intended it, and you only find out that you'd done it wrong when you hit go, compile, run, or you'd wait for a load of data to move, and then something would fail.
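To make that staged-ELT idea concrete, here is a minimal sketch — SQLite standing in for a cloud warehouse, every table name invented for the example — where each transformation stage is just a view over the previous one, so the data can be queried and sanity-checked at any point in the pipeline:

```python
import sqlite3

# SQLite stands in for a cloud warehouse purely for illustration. The point
# is that in ELT, every stage is a table or view built from the previous
# one, so each intermediate result can be inspected directly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "Load": land the source data unchanged.
    CREATE TABLE raw_orders (order_id INTEGER, amount REAL, country TEXT);
    INSERT INTO raw_orders VALUES (1, 120.0, 'US'), (2, -5.0, 'US'), (3, 80.0, 'DE');

    -- "Transform", stage 1: filter out bad rows, in SQL, in the warehouse.
    CREATE VIEW stg_valid_orders AS
        SELECT * FROM raw_orders WHERE amount > 0;

    -- "Transform", stage 2: aggregate.
    CREATE VIEW rpt_revenue_by_country AS
        SELECT country, SUM(amount) AS revenue
        FROM stg_valid_orders
        GROUP BY country;
""")

# Did I get as much data as I expected? Query any stage directly.
print(conn.execute("SELECT COUNT(*) FROM stg_valid_orders").fetchone())  # (2,)
print(conn.execute("SELECT * FROM rpt_revenue_by_country").fetchall())   # e.g. [('DE', 80.0), ('US', 120.0)]
```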
So the ELT worked really well. There's an interesting debate kinda going on internally at Matillion now, because we called the tool Matillion ETL, even though it was always an ELT tool. We called it that because at the time, no one was googling ELT. They were googling ETL. We wanted to be found, so we needed to call it that. And, actually, as you kind of alluded to in your question, that's tilting now. Everybody's talking about ELT. Everybody wants an ELT tool. Should we be called what we actually are, which is Matillion ELT, or something like that? So there's some debate going on in our product team. We'll see where that comes out. Yeah. I mean, really, it should just be, you know, square brackets,
[00:13:48] Unknown:
e t l, and then a plus sign at the end because, you know, you're gonna have each of those some arbitrary number of times in any pipeline. So there's always gonna be an extract stage at the beginning, but you're probably also gonna have multiple other extracts further down the road, and you're gonna do transforming and loading multiple times. So
[00:14:05] Unknown:
Yeah. Now the technical guys would love it if we expressed it like that, but I'm not sure the marketing team would love it quite so much. So we'll see who wins on that 1.
[00:14:15] Unknown:
Yeah. You have to explain PCRE to the marketers. Exactly. And so in terms of where you are now, I'm wondering if you can talk through the sort of main use cases or industry verticals or user personas that you're focused on supporting, and how that has influenced the recent direction and focus of your feature development and prioritization.
[00:14:38] Unknown:
Yeah. Sure. So I think we are quite closely aligned with our partners — AWS with Redshift, and in recent years there's been quite a lot of focus around Snowflake, and now Databricks. You know, we try and treat those partners equally. The other decision that we took fairly early on was that we were gonna build editions of the product that were specific to those platforms. So, yeah, Matillion ETL for Redshift has features in it that are Redshift specific. Matillion ETL for Snowflake has features that are Snowflake specific. So wherever a data warehouse has a USP, we can expose that as first class features in the product. Like, with Snowflake, that USP was their ability to separate storage from compute and scale them separately. So we added features in to allow you to control and manage that inside of Matillion ETL, giving you, like, a single pane of glass over those features. Same with Redshift, same with Databricks.
So what all that means is that, you know, very often, Matillion customers are also the kind of customers you'd see with Snowflake and Redshift and BigQuery and so forth. And as a result of that, I think we tend to have quite a strong bias to the enterprise, and we have a lot of large enterprise logos. I get asked not to name too many of them, but, you know, the likes of Nike and Siemens, PepsiCo — big corporations who traditionally would run fairly expensive data warehouse stacks, maybe on-prem data warehouse stacks with Teradata boxes and that sort of thing, Teradata and Informatica.
And then they want to go through a modernization process, and they go to the cloud. They very often go with vendors like Snowflake and tools like Matillion. So we find ourselves more biased into the enterprise space, although we also have a long list of what we've classed as commercial customers, who are sub enterprise but still fairly significant businesses. And very often, the data engineering teams in those companies come from, you know, low code tooling, like Informatica, like Talend. They want a relatively easy transition from that tooling to something more cloud native like Matillion. Up to now, that's been very much our sort of key target persona: data engineers in large corporations that are used to building data pipelines with that kind of toolset. What we're finding more and more is organizations that set themselves up so they have, like, a small data engineering team that tries to do everything, all the way from source data right through to delivery of reports, feeding ML models, whatever it might be.
Those organizations have very, very quickly got themselves in trouble, because they find that they're really constrained around those people. You know, there's a million requests coming from a million different people, and they're not able to operate and innovate with data effectively. And because Matillion has this sort of everything's-based-on-SQL approach, and everything is quite guided in how you configure it and set it up, we find it's quite accessible to the kind of modern, data savvy line of business user who just wants to get something done with data. So organizations that are a bit more savvy are going for a lakehouse architecture, where really what you have is your best data engineers focused on ingestion and preparation and cataloging of datasets, but they're not getting bogged down in answering business problems or figuring out a particular dataset or a particular data transformation that needs to be done to support one team or one particular manager.
But instead, you have, you know, this clear, cataloged, prepared lake, and then a whole bunch of different teams that are able to build their own data transformations, do their own analysis, build their own pipelines off that single source of truth. And that's what we're seeing, I think, more and more: different types of users, or multiple different types of users within an organization, building the data pipelines.
[00:18:58] Unknown:
In terms of the architecture of Matillion, you mentioned that it started very early on with the cloud. And I'm wondering if you could talk through what it looks like now and some of the ways that the advancement of cloud technologies and some of the surrounding tooling have influenced the way that you think about the overall design and implementation of your product and just some of the evolution that it's gone through from the early days to where you are now?
[00:19:22] Unknown:
Yeah. I think it's probably true to say that we're in a period of pretty rapid evolution right now at Matillion. Where we are actually is a little bit old fashioned in some ways. So we deliver Matillion ETL primarily as an AMI, an Amazon Machine Image, which customers run in their own AWS accounts. So we're not like a pure SaaS company for the ETL product. We do have SaaS products, like our MDL product, which is Matillion Data Loader — a simplified sort of data loading pipeline. But the core ETL product is delivered as an Amazon Machine Image, or as a VM image into Azure, or whatever the GCP equivalent is.
But, actually, what we found is a lot of enterprises really like that delivery model, because it takes a lot of the security questions that they like to ask off the table if you can say, you're gonna run it in your own cloud infrastructure. The evolutionary debate really focuses around kind of 2 axes. So on the one hand, almost all of our customers are constantly coming to us and saying, hey, we like the fact that you deliver your software as an AMI, but we hate the fact that we have to manage it, and we hate managing infrastructure, and we'd just like you to manage all the infrastructure for us. So it would be better for us if you were a software as a service.
Great. But then they have this competing other axis, which is, oh, and by the way, we don't want you guys to have any control over our data. We'd like our data to never leave the premises, or never leave our cloud, please. And so it's like, okay, how are we gonna square that circle? So I think the future architecture — and you see this architecture in lots of places in the industry now, particularly Databricks — is to separate the control plane and the data plane. And where we're headed, I think, is we'll see a SaaS control plane, which will do all of the coordination of the data pipeline.
And then an on premise data plane, which will be as lightweight as we can possibly make it, will do all the heavy lifting around moving data, controlling the data warehouse, and so on and so forth. So we're not there quite yet, but we're very close, and we're working to that end. And the nice thing is, because we're a relatively young company compared to our competitors, with a relatively young code base, we can take our customers on that journey into the SaaS without asking them to do a big complex migration, or rewrite all their data pipelines, or complicate stuff in ways that would really turn them off from a new product like that. That's kind of the big scoop, if you like, of what's going on. The product itself is pretty simple. It's written in Java. It's an Apache Tomcat app.
And because we've always had this ELT architecture, most of the heavy lifting on data is done by the underlying data warehouse. So very often, Matillion is just waiting for the data warehouse to do some data transformation. Where we do get a bit more involved is on the ingestion side, as most vendors do. We have a stack of data connectors for pulling data into the data warehouse. Some of those pull the data directly into the data warehouse, but where that's not feasible, we're kind of streaming data through the system. But we always take the philosophy: put the data in the data warehouse, put in the data that you need, don't take more than you need, and keep the data as close to its original shape and format as you can. And then once the data is in the data warehouse or in the data lake, that's where you wanna start doing the transformation. So we try not to do even fairly light transformation of the data while it's in flight, because that's where we start to worry about scale and things like that. Yeah. The whole
[00:23:10] Unknown:
in-cloud software as a service — I'm trying to think of what the terminology is that I've come across before — but that's definitely a sort of growing trend, particularly with the advent of Kubernetes and being able to say, we'll deploy a managed Kubernetes service into your system for you, and then we'll deploy this Helm chart or what have you, and then we'll wire that into our control plane to manage the software. It's definitely interesting how this underlying tooling has been allowing that to be more of an accepted and supported practice.
[00:23:38] Unknown:
I feel like, from most organizations' point of view, it's probably still quite early for that architecture, but I think they like it. I think they get it. I think if we can do a good job of making it clear what the communication is between the control plane and the data plane, people will get pretty comfortable with that. It'll avoid a lot of the complexity that you get once you start handling customers' data. Things start to get very serious very quickly and very complicated very quickly. So best avoid it if we can. What you can't avoid in that sort of architecture, or what you have to manage very carefully, are secrets, authorization, you know, what can connect to what, how things are controlled. But it's all coming out of the wash pretty nicely.
[00:24:27] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. DataFold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold.
[00:25:21] Unknown:
Given that you are working to support so many different cloud environments and so many different data warehouses, and starting to adopt the data lakehouse architectures, what are some of the most challenging aspects of being able to manage that sort of feature matrix and support all of those different engines and SQL dialects, and being able to do that in a way that doesn't drive you or your customers insane?
[00:25:45] Unknown:
I think it's a case of deciding that you're going to do it like that and then being smart about how you approach it. Because it is very advantageous to have the wide ranging support — particularly with Snowflake customers; you know, Snowflake is across all 3 clouds as well, so you kinda gotta follow them. And once you get into, like, the retail space particularly, you will hear people saying things like, oh, yeah, you can run it on anything, but it can't touch Amazon, and that kind of stuff. So we have to traverse all of that. Architecturally, I think, is where most of the care has to be taken.
So to give you a couple of examples: the way that Matillion is architected internally, we have an internal system that we've built which abstracts all of the SQL operations. So, essentially, you know, take a Matillion component — a simple example, something like a filter — and that design intent is passed to the generator engine, which generates how the filter gets implemented in SQL on Redshift; it looks slightly different on Snowflake, and slightly different again elsewhere, and so on and so forth. We have all of that very much architected and baked into the product. And we've always been quite quick when we wanted to go to another platform — it's a relatively easy lift to take the product and turn it on for a new data platform.
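As an illustration of that kind of dialect abstraction — the class and method names below are invented, not Matillion's actual internals — a component can capture design intent once, and a per-warehouse generator renders it into platform-specific SQL:

```python
from dataclasses import dataclass

# Illustrative only: the component holds design intent, and each generator
# knows how to express that intent in one warehouse's SQL dialect.
@dataclass
class FilterComponent:
    source_table: str
    condition: str          # e.g. "amount > 0"
    target_table: str

class RedshiftGenerator:
    def filter_sql(self, c: FilterComponent) -> str:
        # Redshift: CTAS is the idiomatic way to materialize a step.
        return (f"CREATE TABLE {c.target_table} AS "
                f"SELECT * FROM {c.source_table} WHERE {c.condition};")

class SnowflakeGenerator:
    def filter_sql(self, c: FilterComponent) -> str:
        # Snowflake: CREATE OR REPLACE makes re-runs idempotent.
        return (f"CREATE OR REPLACE TABLE {c.target_table} AS "
                f"SELECT * FROM {c.source_table} WHERE {c.condition};")

step = FilterComponent("raw_orders", "amount > 0", "stg_valid_orders")
for gen in (RedshiftGenerator(), SnowflakeGenerator()):
    print(gen.filter_sql(step))
```

The payoff of a pattern like this is that supporting a new platform mostly means adding a generator, not rewriting every component.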
And then a lot of the features and functions are particularly on the orchestration side of the product. So there's a whole orchestration mechanism in Matillion, which doesn't look a million miles from what you might do in Airflow. It's a little bit higher level than Airflow — everything's a bit more prebaked, because we are able to understand the more common operations that customers are trying to do, particularly in the cloud. And a lot of those common operations — the ones that make it possible to build a real world data pipeline, with all of the gnarly, awkward situations that you find yourself in moving data around in the real world — they tend to be common across platforms. And then the final thing is, where we have features that we wanna build for specific platforms, we always try and build those as individual components. And because they're quite self contained, you can either have them in the product or not have them in the product, and then you end up with a big matrix of which features go into which version of which product.
But it's not without its downsides. Right? You do have to manage all that, and we need to educate our salespeople, for example, on the differences in the product — this feature is in this version of the product, but not in this particular edition — and manage all of that. It makes for an interesting challenge. But when a customer comes along and they've got their shiny new Snowflake data warehouse, and they want a tool which is firmly designed to work exactly with that data warehouse and has all the shiny new features that they've been sold by their Snowflake rep as first class citizens in the tool, then we can deliver that. That's really important.
[00:28:43] Unknown:
And for people who want to adopt Matillion and integrate it into their systems, what is the process for actually getting it set up and integrated and starting to build out your first workflow?
[00:28:54] Unknown:
Yeah. All pretty simple, hopefully. So, essentially, there's like an onboarding hub, which will get you to the point where you can download an AMI into your cloud environment. And once you've stood that up, the next stage is to connect to your data warehouse, and then you're good to go. You can start building pipelines. A typical customer would probably start with, I've got some data in a SQL Server, so I'm gonna connect to that; I've got some data in Salesforce, I'm gonna connect to that. Pull some SQL Server data, pull some Salesforce data. Suddenly, you've got 2 tables, and you can start doing some basic transformations on there. A lot of the advantage that we have is just that sort of quite easy ramp-up period. It's a fairly all inclusive platform as well. So, you know, you don't need a separate scheduler. You don't need a separate orchestration engine. It's all built in. You don't need anything separate to integrate it with the cloud. If you've got the right permissions, you can see all your S3 buckets, so you're at your data lake. If you're on AWS, you know, you see your data lake in Athena or whatever. It's all just ready to go, and you can start building with the existing components.
So, yeah, the key thing is getting people to some kind of data pipeline that is of value to them. Show some value as quickly as you can. Once you've got there, then people kind of build all sorts of crazy stuff that's
[00:30:15] Unknown:
great to see. One of the other growing trends these days is having a centralized metadata catalog, and I'm wondering how you've approached that integration, or if you sort of defer that because of the fact that you're running primarily on the data warehouse and letting that do the work — just using those existing integrations with those data warehouses and those metadata stores for being able to manage the sort of lineage graph and the catalog elements?
[00:30:41] Unknown:
We've kinda had a couple of goes at that. So we started building some basic lineage features into the product. What we realized fairly early on — and it's always a good test in any business — is to make sure you're focused on the thing that you're best at, and let other people focus on the things that they're best at. So we partnered with a few data catalog companies. Probably the most prominent would be Collibra and Alation. Because we are doing everything in the data warehouse, that means essentially that any decent cataloging tool can see everything that we're doing anyway. So they can see the inbound data being landed, they can see the SQL that's being run against it to transform it, and then they can see whatever the output schema might be.
So what we were able to do to work with those guys better was quite simple: just provide an API which allowed them to embellish with a whole load of extra contextual information that Matillion knows, because the pipeline author knows it. So if you look at a database table — you look at that database table in, say, Collibra — you can then say, okay, that was a table that was populated by Matillion. And then you can say, okay, what was the intent? What calculations have been done on the data? Follow the lineage back to the various sources of data, and where did those come from? So Matillion knows where data came from, how it was loaded, which kind of transformations have been performed on it, how they're expressed in Matillion, and how all that was orchestrated.
So really, for us, it was about providing a whole load of additional context to extend what a traditional catalog tool — like Collibra, like Alation, and there's a whole load of others in the market — is able to understand about the data. But we don't necessarily need to provide them with a whole load of complex information about the intent of the transformation, because, essentially, their users can now go straight from the data to the intended transformation and see how that data was built.
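To give a flavor of that kind of contextual embellishment — the field names below are purely illustrative, not Matillion's or any catalog's actual API — the payload a catalog pulls over such an API could be as simple as:

```python
import json

# Made-up shape of the authoring context a pipeline tool could expose to a
# catalog, layered on top of what the catalog already sees in the warehouse.
lineage_context = {
    "table": "rpt_revenue_by_country",
    "populated_by": "matillion",
    "job": "orders_daily",
    "sources": ["salesforce.opportunity", "sqlserver.dbo.orders"],
    "transformations": [
        {"component": "filter", "intent": "drop refunds (amount <= 0)"},
        {"component": "aggregate", "intent": "sum order amounts by country"},
    ],
}
print(json.dumps(lineage_context, indent=2))
```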
[00:32:56] Unknown:
And another aspect of working with any tool, but particularly for something like Matillion where you're going to have a bunch of people interacting with it, lots of different pipelines and data flows, is how you manage the complexity of the tool and the transformations as usage scales?
[00:33:19] Unknown:
Absolutely. This is something that, again, because we were building the tool but also using the tool, you could kind of see a head of steam growing, even just from a very small pool of users: hey, we really need to be able to make things reusable. And I guess part of it comes from having an engineering team who are writing Java code and thinking about how to make everything reusable, how you can minimize the amount of code and make lots of reusable, useful functions. So why would you not do that in the ETL tool, kind of thing? So we very much built it with that in mind. Probably, you know, 2 or 3 years ago, we were really in the weeds of thinking about this stuff.
It started with a simple variable system that became a slightly more comprehensive variable system. Once you've got a really good variable system in your ETL processes, you can start to make parts of the ETL reusable. But the most important thing is actually making it possible to take a piece of ETL logic, which is usually a bit of orchestration and quite a lot of transformation, and turn that into a self contained thing in its own right. And that became a feature which — it's not the greatest name, we should probably think about it — at the time, we called shared jobs. So we have this shared jobs feature inside of Matillion. Once we got that in, the next obvious leap is, okay, customers can share jobs with each other in their organization, but how do we make it so that they can share them outside of the organization? So we have a thing that's been running for a couple of years now called Matillion Exchange, which is designed to do exactly that. So if you've got a really good piece of ETL logic that builds, like, the world's greatest date dimension, you can put that onto the Exchange, and lots of other customers can come along and take advantage of that. I don't quite wanna call it an open source community, but it's certainly a community of Matillioners who wanna share their ideas and share their stuff. Works pretty well.
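As a sketch of the shared-job idea — invented names, SQLite standing in for the warehouse — that "world's greatest date dimension" becomes a parameterized, self-contained function any team can drop into their own pipeline:

```python
import sqlite3
from datetime import date, timedelta

# A "shared job" in spirit: a parameterized, self-contained piece of ELT
# logic (here, a toy date dimension) that any team can reuse. All names
# are made up for the example.
def build_date_dimension(conn, start: date, end: date, table: str = "dim_date"):
    conn.execute(f"CREATE TABLE {table} (date_key TEXT, year INT, month INT, day_of_week INT)")
    rows, d = [], start
    while d <= end:
        rows.append((d.isoformat(), d.year, d.month, d.isoweekday()))
        d += timedelta(days=1)
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
build_date_dimension(conn, date(2021, 1, 1), date(2021, 12, 31))
print(conn.execute("SELECT COUNT(*) FROM dim_date").fetchone())  # (365,)
```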
[00:35:19] Unknown:
As you have been building Matillion and working with your customers and helping them with their onboarding and adoption, what are the most interesting or innovative or unexpected ways that you've seen it used?
[00:35:29] Unknown:
Matillion's quite permissive. We wanted to build, like, a low code tool, but we knew you gotta satisfy the entire data engineering team. Right? So you've got data engineers in a typical company who have come from, you know, a tool background, like a Talend. And then you have data engineers that wanna use dbt, and you've got data engineers who wanna write Python. So we try to satisfy all of those by allowing the tool to orchestrate all of those things relatively easily. So you've got dbt functionality in there. And if you look at, like, our internal telemetry, the Python component is incredibly popular. And if you look at what customers actually do with it, very often they're, like, 3 line Python scripts where they just manipulate a variable, and it's like, oh, do you know what? It's just easier to do in Python. Okay. So because the tool's quite permissive, you do see quite a lot of fairly exotic use cases out there.
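For flavor, the kind of three-line Python-component script Ed describes might look something like the snippet below; the JobContext object here is a made-up stand-in for whatever variable context a low-code tool hands to an embedded script, not Matillion's documented API:

```python
from datetime import date, timedelta

# Stand-in for the variable context an embedded Python step might receive;
# purely illustrative, not a real API.
class JobContext:
    def __init__(self, variables: dict):
        self.variables = dict(variables)

    def get(self, name: str) -> str:
        return self.variables[name]

    def update(self, name: str, value: str) -> None:
        self.variables[name] = value

ctx = JobContext({"load_date": "2021-11-01"})

# The "three-line script": read a job variable, tweak it, write it back.
d = date.fromisoformat(ctx.get("load_date"))
ctx.update("load_date", (d + timedelta(days=1)).isoformat())
print(ctx.variables)  # {'load_date': '2021-11-02'}
```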
Where it's been used in places that I never expected — I think probably the first one we saw was someone had taken it and used it in, like, a biotech life sciences company, and they were using it to crunch data that was the output of a process that was sequencing DNA. And we'd just never envisaged it. We were like, this is a data tool. It's for businesses that have got, you know, sales orders and invoices and things like that. And it was a completely different use case. And then another one that I came across a few years ago is a big engineering company that built massive gas turbines, and they had all these gas turbines out in the field, and they were collecting all of this detailed performance data — I think it was coming out at, like, 5 Hertz. So it's quite a lot of data, lots of data points, and they were just crunching this into a big data warehouse using Matillion. I was like, oh, yeah, never envisaged that. And then the final one, probably, is it's very easy to underestimate the sheer volume of data pipelines that a company will build over time.
So, a very large — I need to come up with the right choice of words to not make it too obvious — sandwich chain, should we say, in the US. I think they were like, oh, well, you know, we've got some performance issues with a couple of our pipelines, can you take a look? And then we got on the call with them: oh, show us all of the pipelines that you've got scheduled. And it was just this list. And we're like, oh, okay, we'll have to figure this one out. But the scale that an enterprise can work at, the sheer amount and throughput of stuff that they're creating, was quite a surprise. And I asked internally at Matillion when I saw this question, because I was like, oh, I bet there's some other good examples.
There are quite a few, but those are the ones that I can remember. A few of those I've kind of witnessed firsthand, but I'm sure there are quite a lot more, because people definitely like to bend the tool, bend the product, to their will, and everyone comes up with sort of exotic use cases from time to time.
[00:38:27] Unknown:
In terms of the sort of support factor of working across all these different cloud providers and warehouse engines, I'm curious if there are any that stand out as being either a breeze to work with, and they've been sort of, you know, easy to get started with and low maintenance. And if there are any that are particularly challenging or troublesome to be able to actually support over the long run.
[00:38:48] Unknown:
I'll try and be kinda nice about it. So working like we do definitely gives you quite a lot of insight, and sometimes we end up going relatively deep. I mean, it's clear that the move into the cloud has, I think, freed up a lot of innovation, particularly in companies that you wouldn't necessarily think of as the most forward thinking or the most innovative companies. It's given them a little bit of breathing space to innovate. I mentioned the bias before — I have a slight bias to AWS because that's where we started. That's where I have the most knowledge. I think it's probably fair to say, and I imagine that they would admit to this, that there was certainly a chunk of time on Azure where they were kind of playing catch up.
You could tell they were trying to reuse a lot of technology that they already had on the shelf, and they were sort of bolting it in, but it didn't necessarily end up being the most coherent platform or system as a result. But, you know, they've managed to iterate and steadily improve things over time. It's a lot better to work with than it was. And then with Google, the impression you always get, whenever I've worked with people at Google, is that they very much go their own way, and they like to reinvent from scratch. So you end up with something that is somewhat different to the other platforms and doesn't follow the same patterns very often. And that gives you some headaches, understanding it and getting around it. But ultimately, the use cases are the same. There is a kind of danger that data warehouses themselves start to become somewhat commoditized.
The actual core data warehouses, as I see them — at one time, there wasn't really feature parity. Now there more or less is feature parity, and so they're competing on performance instead. But that will quickly flush itself out a little bit, and there won't be massive performance differences, and then maybe they'll have to compete on price instead. I'm very interested to see how that market evolves. But I think what you see in the market, particularly with the likes of Snowflake and Databricks — because they're not part of a big platform, they're built on existing platforms — is that they're trying to expand their footprint now into being more all singing, all dancing data providers, doing much more than just lakes and warehouses, getting into ML in a big way, and really becoming almost data centric cloud providers in their own right. It'll be really interesting to see where that goes.
I don't have a massively strong opinion on how that's gonna end up, but I think that we'll be seeing data platforms in the future — really, you know, imagine AWS, but where data is at the center rather than compute being at the center. That'll be pretty interesting to see how all that washes out, but I'm not gonna make any predictions
[00:41:49] Unknown:
that are liable to be very wrong. Yeah. No. It's definitely an interesting view on that. And in terms of your experience of building Matillion and growing it to where it is today and working with your customers and the ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:42:05] Unknown:
Probably touched on a few of them. I think: don't underestimate the complexity or sophistication of what people can and will build if you give them a tool that's flexible enough to allow it. That would probably be one lesson. And then, you know, the biggest one — I think it probably applies to just about any software — is that successful software is always walking a tightrope between the dream of the software architect and what they would like to create, and the pragmatic reality of what the business needs tomorrow, next week, next month.
So I found myself always having to be really pragmatic and commercially minded, which isn't always a natural trait in technical people. When we first released Matillion ETL, in my heart of hearts, I was like, I don't think this product is good enough yet. It needs a lot more work. But we also needed to get some software out the door to allow us to continue to be a business. And as is so often the case, I was proved wrong, because the first customer to ever use our software was PricewaterhouseCoopers. The weird aspect was, at the time, Amazon Marketplace told you someone was using your software and which country they were in, but you didn't know any other information about them apart from that. So we knew someone was paying for our software, and we knew that they were in Australia, but we didn't know who it was. And it was only a month later, when they called us up or we called them or something, that we actually got to find out who this first person was that actually bought our software.
It turned out to be PricewaterhouseCoopers, and they were like, yeah, we just wanted to build a data warehouse, and it seemed like a simple tool to do it with, and it worked great. I was like, oh, wonderful. You gotta start somewhere. But that kind of pragmatism is, I think, always essential when you're building any software. But particularly when you're building tools, you kind of set out with a vision, and it's like, oh, it's only gonna be good enough when it can do all of these things. And then you have to check that against reality at some point and actually test it with the market, and that's the scary bit.
[00:44:22] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end to end data observability platform. Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Visit dataengineeringpodcast.com/montecarlo to learn more.
[00:45:15] Unknown:
And for people who are interested in being able to manage their data integration and transformation workflows and be able to scale that out, what are the cases where Matillion is the wrong choice?
[00:45:27] Unknown:
Yeah. Good question. So I guess it's fairly clear from everything I've said so far that we're a very cloud centric company. We work best with, and to an extent exclusively with, cloud data warehouses. So most of our customers want to build a data warehouse or a data lakehouse. They generally want to build, like, a traditional kind of Kimball star schema or Inmon style model. And very often, they have some downstream systems that they wanna feed with that data. So they wanna feed an ML system or, you know, do some reverse ETL back into a transactional system, update some fields in Salesforce, whatever it might be.
That's kind of the sweet spot, and where we major is in helping and making it easier for users to navigate that complex transformation piece that exists at the center of that. And then everything else is kind of the tooling that we provide around it. Where I think the pitfalls are, really, is when customers try and use our tool more as a sort of business process automation tool, or use it like it's a traditional ETL tool. You very often see customers do strange things where they kinda fight against the tool, because they'll move the data out of the data warehouse, do something to it, and then move it back in. You throw away some of the benefits of the ELT model when you do that. So you wanna try and keep everything in the data warehouse.
That's where it'll run fastest and scale best. I think that's essentially where I've seen customers not use the tool in the optimal way: it's when they've been using it to orchestrate business processes as opposed to building data pipelines. There's never been, like, a dominant industry vertical at Matillion. It's always been across industries, lots and lots of different types of companies. But, yeah, if you have data, you can make it useful in a data warehouse.
[00:47:29] Unknown:
As you continue to build and grow and support the platform, what are some of the things you have planned for the near to medium term, or any particular projects or areas of interest that you're excited to dig into?
[00:47:39] Unknown:
I don't wanna preannounce anything by accident here. So, I talked a little bit earlier about moving to a software as a service, control plane, data plane model. Without announcing anything there, that gives you a little pointer or indication as to our direction. I think another big area that we've been working on is — there's always demand in the market for turnkey data pipelines. You know, very simple: I've got data here, and I want it in my data warehouse, and I just wanna do that with a wizard in 3 clicks. I don't wanna transform it yet. Just very, very simple. And that's what our Matillion Data Loader product does — it's designed to do that. And you can never have enough connectors there. You know, you can always be making that data movement more efficient, and you could be using better techniques, like better change data capture and things like that. So we put a lot of time and effort into making that get-data-into-the-data-warehouse story as slick and as simple as possible.
I think, looking more broadly into the future and how the market is going to evolve, one of Matillion's big advantages is being a one stop shop for the whole data landscape and data transformation landscape within an organization. And when we think about that, it becomes more about providing a catalog of services that allow you to load, to transform, to do reverse ETL, to catalog your data, to manage downstream data pipelines into ML engines, to do some kind of simple, light data mining or ML or AI style operations. If you take a look at that whole thing, it almost becomes like a data operating system, where you have a whole load of low level features and functions that you can call upon to build data centric and data driven apps.
There's a lot of kinda deep thinking, I guess, going on about that sort of stuff at the moment and how we kind of evolve into that story. We'll be talking quite a lot more about that in the future.
[00:49:53] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Well, the biggest problem
[00:50:08] Unknown:
that hasn't really been totally cracked, I think, is DataOps. There are quite a few people out there in the marketplace that want to apply DevOps techniques to DataOps, and most of them apply really well. So that's a great starting point. However, there are some really important differences with DataOps, which I haven't seen the perfect answer to yet. So, fundamentally, with DataOps, you are dealing with data — probably live data, with state. Unlike when you build a piece of software with a DevOps pipeline, where, you know, you expect the output piece of software to have passed all its tests and to work, and that's the thing that you ship or run or deploy or whatever.
With DataOps, you need to do all of that, but it starts to matter how you deploy and, critically, when you deploy, if you're deploying a new iteration of a data pipeline against a fast moving stream of live data. And I've not seen, like, a good solution yet that manages that final step really effectively. I'm really interested to explore that area a little bit further and use some of the more advanced features of the existing data warehouse platforms to allow you to, essentially, you know, build, test, and deploy into production a new version of a data warehouse, but critically be able to get out of that if something goes wrong without affecting, damaging, or generally corrupting that stream of data going along.
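One common pattern aimed at exactly this rollback problem — an illustration of the general idea, not Matillion's solution — is a blue/green cutover where consumers read through a view that is repointed only after the new version passes its tests; here's a minimal sketch with SQLite standing in for the warehouse and all names invented:

```python
import sqlite3

# Consumers only ever read through the `revenue` view, so "deploying" a new
# pipeline version means building and testing revenue_v2 alongside v1, then
# repointing the view. Rollback is just repointing it back at revenue_v1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revenue_v1 (country TEXT, revenue REAL);
    INSERT INTO revenue_v1 VALUES ('US', 120.0);
    CREATE VIEW revenue AS SELECT * FROM revenue_v1;
""")

# Build the candidate version without touching what consumers see.
conn.executescript("""
    CREATE TABLE revenue_v2 (country TEXT, revenue REAL);
    INSERT INTO revenue_v2 VALUES ('US', 120.0), ('DE', 80.0);
""")

# Stand-in for real data tests; only cut over if the new version passes.
if conn.execute("SELECT COUNT(*) FROM revenue_v2").fetchone()[0] > 0:
    # In a real warehouse you'd make this cutover transactional (Snowflake,
    # for example, has ALTER TABLE ... SWAP WITH for the table-level case).
    conn.executescript("DROP VIEW revenue; CREATE VIEW revenue AS SELECT * FROM revenue_v2;")

print(conn.execute("SELECT * FROM revenue").fetchall())  # [('US', 120.0), ('DE', 80.0)]
```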
[00:51:53] Unknown:
It's quite a gnarly problem. I think there's a lot of people kinda working on solutions, and I'd be really interested to see where we end up on that. Yeah, it's something we're thinking quite a lot about. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Matillion. It's definitely a very interesting platform, and it's great to see that you've been able to stick with it this long and continue to provide value to the organizations that you're supporting. So I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Wonderful. Thank you very much, Tobias.
[00:52:33] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Ed Thompson: Introduction and Background
Founding Matillion and Early Challenges
Adopting Cloud Technologies Early
Transition to ELT and Product Evolution
Current Use Cases and Industry Focus
Matillion's Architecture and Future Directions
Challenges of Supporting Multiple Cloud Platforms
Integration with Metadata Catalogs
Managing Complexity and Reusability
Innovative and Unexpected Use Cases
Challenges and Ease of Supporting Different Platforms
Lessons Learned and Pragmatism in Software Development
When Matillion is Not the Right Choice
Future Plans and Exciting Projects
Biggest Gaps in Data Management Tools
Closing Remarks